Knownsec IA-Lab · Yue Yongpeng
Currently, among open-source Python packages for NLP, NLTK, spaCy, StanfordCoreNLP, GATE, and OpenNLP support English, while Jieba, ICTCLAS, THULAC, and HIT LTP support Chinese. Most of these tools, however, only support a specific set of languages. This article introduces polyglot, a powerful multilingual NLP toolkit for Python that supports pipeline-style processing. The project was open-sourced on GitHub by AboSamoor on March 16, 2015, and has collected 1,021 stars.
- Free software: GPLv3 license
- Documentation: http://polyglot.readthedocs.org/
- GitHub: https://github.com/aboSamoor/polyglot
Features
Language Detection (supports 196 languages)
Tokenization: sentence and word segmentation (supports 165 languages)
Named Entity Recognition (supports 40 languages)
Part of Speech Tagging (supports 16 languages)
Sentiment Analysis (supports 136 languages)
Word Embeddings (supports 137 languages)
Transliteration (supports 69 languages)
Pipelines
Installation
Install or upgrade from PyPI:
$ pip install polyglot
polyglot depends on numpy and libicu-dev. On Ubuntu/Debian Linux distributions you can install these with:
$ sudo apt-get install python-numpy libicu-dev
After a successful installation, check the version:
- >>> import polyglot
- >>> polyglot.__version__
- '16.07.04'
Data
In the examples that follow, Chinese, English, and mixed Chinese-English sentences are used as test data.
- text_en = u"Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."
- text_cn = u" 日本最后一家寻呼机服务营业商宣布, 将于 2019 年 9 月结束服务, 标志着日本寻呼业长达 50 年的历史正式落幕. 目前大约还有 1500 名用户使用东京电信通信公司提供的寻呼服务, 该公司在 20 年前就已停止生产寻呼机."
- text_mixed = text_cn + text_en
Language Detection
polyglot's language detection relies on pycld2 (https://pypi.org/project/pycld2/) and cld2 (https://code.google.com/p/cld2/); cld2 is a multilingual detection library developed by Google.
Example
Import the dependency:
from polyglot.detect import Detector
Detect the language:
- >>> Detector(text_cn).language
- name: Chinese code: zh confidence: 99.0 read bytes: 1996
- >>> Detector(text_en).language
- name: English code: en confidence: 99.0 read bytes: 1144
- >>> Detector(text_mixed).language
- name: Chinese code: zh confidence: 50.0 read bytes: 1996
For the mixed text text_mixed, the detected language is Chinese, but with a confidence of only 50. To inspect every language contained in the text:
- >>> for language in Detector(text_mixed):
- ...     print(language)
- name: Chinese code: zh confidence: 50.0 read bytes: 1996
- name: English code: en confidence: 49.0 read bytes: 1144
- name: un code: un confidence: 0.0 read bytes: 0
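When a text mixes languages, it is often useful to keep only the detections above a confidence threshold. A minimal sketch using the values from the session above (the confident_languages helper and its threshold are this article's own illustration, not part of polyglot's API):

```python
def confident_languages(detections, min_confidence=20.0):
    """Keep (name, code, confidence) tuples above a threshold,
    dropping cld2's 'un' (unknown) filler entry."""
    return [d for d in detections
            if d[1] != "un" and d[2] >= min_confidence]

# With polyglot installed, detections could be built as:
#   from polyglot.detect import Detector
#   detections = [(l.name, l.code, l.confidence) for l in Detector(text_mixed)]
# Here we reuse the values printed in the session above:
detections = [("Chinese", "zh", 50.0), ("English", "en", 49.0), ("un", "un", 0.0)]
print(confident_languages(detections))
# → [('Chinese', 'zh', 50.0), ('English', 'en', 49.0)]
```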
The language types currently supported by cld2 (https://code.google.com/p/cld2/) are:
- >>> Detector.supported_languages()
- 1. Abkhazian 2. Afar 3. Afrikaans
- 4. Akan 5. Albanian 6. Amharic
- 7. Arabic 8. Armenian 9. Assamese
- 10. Aymara 11. Azerbaijani 12. Bashkir
- 13. Basque 14. Belarusian 15. Bengali
- 16. Bihari 17. Bislama 18. Bosnian
- 19. Breton 20. Bulgarian 21. Burmese
- 22. Catalan 23. Cebuano 24. Cherokee
- 25. Nyanja 26. Corsican 27. Croatian
- 28. Croatian 29. Czech 30. Chinese
- 31. Chinese 32. Chinese 33. Chinese
- 34. Chineset 35. Chineset 36. Chineset
- 37. Chineset 38. Chineset 39. Chineset
- 40. Danish 41. Dhivehi 42. Dutch
- 43. Dzongkha 44. English 45. Esperanto
- 46. Estonian 47. Ewe 48. Faroese
- 49. Fijian 50. Finnish 51. French
- 52. Frisian 53. Ga 54. Galician
- 55. Ganda 56. Georgian 57. German
- 58. Greek 59. Greenlandic 60. Guarani
- 61. Gujarati 62. Haitian_creole 63. Hausa
- 64. Hawaiian 65. Hebrew 66. Hebrew
- 67. Hindi 68. Hmong 69. Hungarian
- 70. Icelandic 71. Igbo 72. Indonesian
- 73. Interlingua 74. Interlingue 75. Inuktitut
- 76. Inupiak 77. Irish 78. Italian
- 79. Ignore 80. Javanese 81. Javanese
- 82. Japanese 83. Kannada 84. Kashmiri
- 85. Kazakh 86. Khasi 87. Khmer
- 88. Kinyarwanda 89. Krio 90. Kurdish
- 91. Kyrgyz 92. Korean 93. Laothian
- 94. Latin 95. Latvian 96. Limbu
- 97. Limbu 98. Limbu 99. Lingala
- 100. Lithuanian 101. Lozi 102. Luba_lulua
- 103. Luo_kenya_and_tanzania 104. Luxembourgish 105. Macedonian
- 106. Malagasy 107. Malay 108. Malayalam
- 109. Maltese 110. Manx 111. Maori
- 112. Marathi 113. Mauritian_creole 114. Romanian
- 115. Mongolian 116. Montenegrin 117. Montenegrin
- 118. Montenegrin 119. Montenegrin 120. Nauru
- 121. Ndebele 122. Nepali 123. Newari
- 124. Norwegian 125. Norwegian 126. Norwegian_n
- 127. Nyanja 128. Occitan 129. Oriya
- 130. Oromo 131. Ossetian 132. Pampanga
- 133. Pashto 134. Pedi 135. Persian
- 136. Polish 137. Portuguese 138. Punjabi
- 139. Quechua 140. Rajasthani 141. Rhaeto_romance
- 142. Romanian 143. Rundi 144. Russian
- 145. Samoan 146. Sango 147. Sanskrit
- 148. Scots 149. Scots_gaelic 150. Serbian
- 151. Serbian 152. Seselwa 153. Seselwa
- 154. Sesotho 155. Shona 156. Sindhi
- 157. Sinhalese 158. Siswant 159. Slovak
- 160. Slovenian 161. Somali 162. Spanish
- 163. Sundanese 164. Swahili 165. Swedish
- 166. Syriac 167. Tagalog 168. Tajik
- 169. Tamil 170. Tatar 171. Telugu
- 172. Thai 173. Tibetan 174. Tigrinya
- 175. Tonga 176. Tsonga 177. Tswana
- 178. Tumbuka 179. Turkish 180. Turkmen
- 181. Twi 182. Uighur 183. Ukrainian
- 184. Urdu 185. Uzbek 186. Venda
- 187. Vietnamese 188. Volapuk 189. Waray_philippines
- 190. Welsh 191. Wolof 192. Xhosa
- 193. Yiddish 194. Yoruba 195. Zhuang
- 196. Zulu
Tokenization
In NLP, tasks can operate at the character, word, sentence, paragraph, or document level; tokenization is the job of finding the boundaries of characters, words, sentences, and paragraphs. Paragraphs can be split on \n or \r\n, and character segmentation is easy to implement, but sentence and word segmentation are comparatively complex.
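The paragraph-level split mentioned above needs no library at all; a minimal sketch (the split_paragraphs helper is our own illustration):

```python
def split_paragraphs(text):
    """Split text into paragraphs on newline boundaries,
    normalizing CRLF to LF and dropping empty segments."""
    return [p.strip() for p in text.replace("\r\n", "\n").split("\n") if p.strip()]

doc = "First paragraph.\r\nSecond paragraph.\n\nThird paragraph."
print(split_paragraphs(doc))
# → ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```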
Example
Import the dependency:
from polyglot.text import Text
Sentence segmentation:
- >>> Text(text_cn).sentences
- [Sentence("日本最后一家寻呼机服务营业商宣布, 将于 2019 年 9 月结束服务, 标志着日本寻呼业长达 50 年的历史正式落幕."), Sentence("目前大约还有 1500 名用户使用东京电信通信公司提供的寻呼服务, 该公司在 20 年前就已停止生产寻呼机.")]
- >>> Text(text_en).sentences
- [Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.")]
- >>> Text(text_mixed).sentences
- [Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."), Sentence(" 日本最后一家寻呼机服务营业商宣布, 将于 2019 年 9 月结束服务, 标志着日本寻呼业长达 50 年的历史正式落幕."), Sentence(" 目前大约还有 1500 名用户使用东京电信通信公司提供的寻呼服务, 该公司在 20 年前就已停止生产寻呼机.")]
Word segmentation:
- >>> Text(text_cn).words
- 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 2019 年 9 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 . 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 .
- >>> Text(text_en).words
- Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years .
- >>> Text(text_mixed).words
- Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years . 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 2019 年 9 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 . 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 .
Named Entity Recognition
Named entity recognition (NER) identifies entities with specific meaning in text, usually in three categories:
Entity names: person, place, organization, product, and brand names, etc.
Time expressions: dates and times
Numeric expressions: birthdays, phone numbers, QQ numbers, etc.
NER methods likewise fall into three groups:
Rule-based: linguistic grammar-based techniques
In engineering practice the rule-based approach mostly amounts to writing many regular expressions (RegEx); it can handle part of the time and numeric entity classes.
Statistical models
Statistical approaches mainly use HMM and CRF models and are currently the most mature option.
Deep learning models
Deep learning is currently the most popular approach, especially RNN-family models, which absorb more of the text's semantic information and deliver the best results to date.
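The rule-based approach can be illustrated with a few standard-library regular expressions; the patterns below are toy examples for the time and numeric classes, not a production rule set:

```python
import re

# Toy patterns: an ISO-style date and a mobile-style phone number.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")

def rule_based_entities(text):
    """Extract time- and number-class entities by pattern matching."""
    return {"DATE": DATE_RE.findall(text), "PHONE": PHONE_RE.findall(text)}

print(rule_based_entities("Meeting on 2019-09-30, call 138-1234-5678."))
# → {'DATE': ['2019-09-30'], 'PHONE': ['138-1234-5678']}
```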
polyglot's NER models are trained on Wikipedia (WIKI) text; the trained models are not bundled with the initial installation and must be downloaded separately. polyglot supports recognition of entity names (person, place, organization) in 40 languages:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("ner2", 3))
- 1. Polish 2. Turkish 3. Russian
- 4. Indonesian 5. Czech 6. Arabic
- 7. Korean 8. Catalan; Valencian 9. Italian
- 10. Thai 11. Romanian, Moldavian, ... 12. Tagalog
- 13. Danish 14. Finnish 15. German
- 16. Persian 17. Dutch 18. Chinese
- 19. French 20. Portuguese 21. Slovak
- 22. Hebrew (modern) 23. Malay 24. Slovene
- 25. Bulgarian 26. Hindi 27. Japanese
- 28. Hungarian 29. Croatian 30. Ukrainian
- 31. Serbian 32. Lithuanian 33. Norwegian
- 34. Latvian 35. Swedish 36. English
- 37. Greek, Modern 38. Spanish; Castilian 39. Vietnamese
- 40. Estonian
Model download
Download the English and Chinese NER models:
- $ python
- >>> import polyglot
- >>> !polyglot download ner2.en ner2.zh embeddings2.zh embeddings2.en
- [polyglot_data] Downloading package ner2.en to
- [polyglot_data] Downloading package ner2.zh to
- [polyglot_data] Downloading package embeddings2.zh to
- [polyglot_data] Downloading package embeddings2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
>>> from polyglot.text import Text
Entity recognition:
- >>> Text(text_cn).entities
- [I-ORG([u'东京'])]
- >>> Text(text_en).entities
- [I-LOC([u'Tokyo'])]
- >>> Text(text_mixed).entities
- [I-ORG([u'东京'])]
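The entity objects printed above can be flattened into readable strings. A small sketch (format_entities is our own helper; the commented polyglot call assumes each entity is a word list carrying a .tag attribute, as the output above suggests):

```python
def format_entities(entities):
    """Render (tag, words) pairs like 'ORG: 东京' from
    polyglot-style tags such as 'I-ORG'."""
    return ["%s: %s" % (tag.split("-")[-1], " ".join(words))
            for tag, words in entities]

# With polyglot: entities = [(e.tag, list(e)) for e in Text(text_cn).entities]
print(format_entities([("I-ORG", ["东京"]), ("I-LOC", ["Tokyo"])]))
# → ['ORG: 东京', 'LOC: Tokyo']
```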
Part of Speech Tagging
Part-of-speech tagging assigns a POS tag to each token. Commonly used tags include:
ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other
polyglot's part-of-speech models are trained on the CoNLL datasets; 16 languages are supported, and Chinese is not among them:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("pos2"))
- 1. German 2. Italian 3. Danish
- 4. Czech 5. Slovene 6. French
- 7. English 8. Swedish 9. Bulgarian
- 10. Spanish; Castilian 11. Indonesian 12. Portuguese
- 13. Finnish 14. Irish 15. Hungarian
- 16. Dutch
Model download
Download the English POS model:
- $ python
- >>> import polyglot
- >>> !polyglot download pos2.en
- [polyglot_data] Downloading package pos2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
from polyglot.text import Text
Part-of-speech tagging:
- >>> Text(text_en).pos_tags
- [(u"Japan's", u'NUM'), (u'last', u'ADJ'), (u'pager', u'NOUN'), (u'provider', u'NOUN'), (u'has', u'AUX'), (u'announced', u'VERB'), (u'it', u'PRON'), (u'will', u'AUX'), (u'end', u'VERB'), (u'its', u'PRON'), (u'service', u'NOUN'), (u'in', u'ADP'), (u'September', u'PROPN'), (u'2019', u'NUM'), (u'-', u'PUNCT'), (u'bringing', u'VERB'), (u'a', u'DET'), (u'national', u'ADJ'), (u'end', u'NOUN'), (u'to', u'ADP'), (u'telecommunication', u'VERB'), (u'beepers', u'NUM'), (u',', u'PUNCT'), (u'50', u'NUM'), (u'years', u'NOUN'), (u'after', u'ADP'), (u'their', u'PRON'), (u'introduction.Around', u'NUM'), (u'1,500', u'NUM'), (u'users', u'NOUN'), (u'remain', u'VERB'), (u'subscribed', u'VERB'), (u'to', u'ADP'), (u'Tokyo', u'PROPN'), (u'Telemessage', u'PROPN'), (u',', u'PUNCT'), (u'which', u'DET'), (u'has', u'AUX'), (u'not', u'PART'), (u'made', u'VERB'), (u'the', u'DET'), (u'devices', u'NOUN'), (u'in', u'ADP'), (u'20', u'NUM'), (u'years', u'NOUN'), (u'.', u'PUNCT')]
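The (word, tag) pairs above are easy to post-process, for instance to pull out the noun-like tokens (the nouns helper is our own illustration):

```python
def nouns(tagged):
    """Keep tokens tagged as common or proper nouns."""
    return [w for w, t in tagged if t in ("NOUN", "PROPN")]

# With polyglot: tagged = Text(text_en).pos_tags
tagged = [("Japan's", "NUM"), ("last", "ADJ"), ("pager", "NOUN"),
          ("provider", "NOUN"), ("September", "PROPN")]
print(nouns(tagged))
# → ['pager', 'provider', 'September']
```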
Sentiment Analysis
polyglot's sentiment analysis works at the word level: each token is labeled 1 (positive), 0 (neutral), or -1 (negative). 136 languages are currently supported:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("sentiment2"))
- 1. Turkmen 2. Thai 3. Latvian
- 4. Zazaki 5. Tagalog 6. Tamil
- 7. Tajik 8. Telugu 9. Luxembourgish, Letzeb...
- 10. Alemannic 11. Latin 12. Turkish
- 13. Limburgish, Limburgan... 14. Egyptian Arabic 15. Tatar
- 16. Lithuanian 17. Spanish; Castilian 18. Basque
- 19. Estonian 20. Asturian 21. Greek, Modern
- 22. Esperanto 23. English 24. Ukrainian
- 25. Marathi (Marāhī) 26. Maltese 27. Burmese
- 28. Kapampangan 29. Uighur, Uyghur 30. Uzbek
- 31. Malagasy 32. Yiddish 33. Macedonian
- 34. Urdu 35. Malayalam 36. Mongolian
- 37. Breton 38. Bosnian 39. Bengali
- 40. Tibetan Standard, Tib... 41. Belarusian 42. Bulgarian
- 43. Bashkir 44. Vietnamese 45. Volapük
- 46. Gan Chinese 47. Manx 48. Gujarati
- 49. Yoruba 50. Occitan 51. Scottish Gaelic; Gaelic
- 52. Irish 53. Galician 54. Ossetian, Ossetic
- 55. Oriya 56. Walloon 57. Swedish
- 58. Silesian 59. Lombard language 60. Divehi; Dhivehi; Mald...
- 61. Danish 62. German 63. Armenian
- 64. Haitian; Haitian Creole 65. Hungarian 66. Croatian
- 67. Bishnupriya Manipuri 68. Hindi 69. Hebrew (modern)
- 70. Portuguese 71. Afrikaans 72. Pashto, Pushto
- 73. Amharic 74. Aragonese 75. Bavarian
- 76. Assamese 77. Panjabi, Punjabi 78. Polish
- 79. Azerbaijani 80. Italian 81. Arabic
- 82. Icelandic 83. Ido 84. Scots
- 85. Sicilian 86. Indonesian 87. Chinese Word
- 88. Interlingua 89. Waray-Waray 90. Piedmontese language
- 91. Quechua 92. French 93. Dutch
- 94. Norwegian Nynorsk 95. Norwegian 96. Western Frisian
- 97. Upper Sorbian 98. Nepali 99. Persian
- 100. Ilokano 101. Finnish 102. Faroese
- 103. Romansh 104. Javanese 105. Romanian, Moldavian, ...
- 106. Malay 107. Japanese 108. Russian
- 109. Catalan; Valencian 110. Fiji Hindi 111. Chinese
- 112. Cebuano 113. Czech 114. Chuvash
- 115. Welsh 116. West Flemish 117. Kirghiz, Kyrgyz
- 118. Kurdish 119. Kazakh 120. Korean
- 121. Kannada 122. Khmer 123. Georgian
- 124. Sakha 125. Serbian 126. Albanian
- 127. Swahili 128. Chechen 129. Sundanese
- 130. Sanskrit (Saskta) 131. Venetian 132. Northern Sami
- 133. Slovak 134. Sinhala, Sinhalese 135. Bosnian-Croatian-Serbian
- 136. Slovene
Model download
Download the English and Chinese sentiment models:
- $ python
- >>> import polyglot
- >>> !polyglot download sentiment2.en sentiment2.zh
- [polyglot_data] Downloading package sentiment2.en to
- [polyglot_data] Downloading package sentiment2.zh to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
from polyglot.text import Text
Sentiment analysis:
- >>> text = Text("The movie is very good and the actors are prefect, but the cinema environment is very poor.")
- >>> print(text.words,text.polarity)
- (WordList([u'The', u'movie', u'is', u'very', u'good', u'and', u'the', u'actors', u'are', u'prefect', u',', u'but', u'the', u'cinema', u'environment', u'is', u'very', u'poor', u'.']), 0.0)
- >>> print([(w,w.polarity) for w in text.words])
- [(u'The', 0), (u'movie', 0), (u'is', 0), (u'very', 0), (u'good', 1), (u'and', 0), (u'the', 0), (u'actors', 0), (u'are', 0), (u'prefect', 0), (u',', 0), (u'but', 0), (u'the', 0), (u'cinema', 0), (u'environment', 0), (u'is', 0), (u'very', 0), (u'poor', -1), (u'.', 0)]
- >>> text = Text("这部电影故事非常好, 演员也非常棒, 但是电影院环境非常差.")
- >>> print(text.words,text.polarity)
- (WordList([这 部 电影 故事 非常 好 , 演员 也 非常 棒 , 但是 电影 院 环境 非常 差 .]), 0.0)
- >>> print([(w,w.polarity) for w in text.words])
- [(u'\u8fd9', 0), (u'\u90e8', 0), (u'\u7535\u5f71', 0), (u'\u6545\u4e8b', 0), (u'\u975e\u5e38', 0), (u'\u597d', 1), (u'\uff0c', 0), (u'\u6f14\u5458', 0), (u'\u4e5f', 0), (u'\u975e\u5e38', 0), (u'\u68d2', 0), (u'\uff0c', 0), (u'\u4f46\u662f', 0), (u'\u7535\u5f71', 0), (u'\u9662', 0), (u'\u73af\u5883', 0), (u'\u975e\u5e38', 0), (u'\u5dee', -1), (u'\u3002', 0)]
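A crude sentence-level score can be derived by averaging the word-level labels above; a sketch (mean_polarity is our own helper, not a polyglot API, and polyglot's text.polarity may be computed differently):

```python
def mean_polarity(word_polarities):
    """Average per-word polarity: 1 positive, 0 neutral, -1 negative."""
    if not word_polarities:
        return 0.0
    return sum(word_polarities) / float(len(word_polarities))

# With polyglot: scores = [w.polarity for w in Text(...).words]
# Scores from the English movie-review sentence above:
scores = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0]
print(round(mean_polarity(scores), 3))
# → 0.0
```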
Word Embeddings
In NLP, word embedding is the collective name for a family of language models and feature-learning techniques that map words or phrases in a vocabulary to vectors of real numbers. There are two common approaches: discrete and distributed representations. Discrete methods include one-hot and N-gram; their drawbacks are that they capture word-to-word similarity poorly and suffer from the curse of dimensionality. The idea behind distributed representations is to represent a word by the other words that appear around it, which is the familiar word2vec. word2vec comprises the Skip-Gram model, which predicts the n surrounding words from the current word, and the CBOW model, which predicts a word from its n context words. Pretrained English vectors are available from GloVe (https://nlp.stanford.edu/projects/glove/) in 50, 100, 200, and 300 dimensions, and Tencent AI Lab recently open-sourced 200-dimensional Chinese word vectors (https://ai.tencent.com/ailab/nlp/embedding.html). polyglot can read word vectors from several sources:
- Gensim word2vec objects (from_gensim method)
- Word2vec binary/text models (from_word2vec method)
- GloVe models (from_glove method)
- polyglot pickle files (load method)
Among these, the polyglot pickle files cover word vectors for 136 languages:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("embeddings2"))
- 1. Scots 2. Sicilian 3. Welsh
- 4. Chuvash 5. Czech 6. Egyptian Arabic
- 7. Kapampangan 8. Chechen 9. Catalan; Valencian
- 10. Slovene 11. Sinhala, Sinhalese 12. Bosnian-Croatian-Serbian
- 13. Slovak 14. Japanese 15. Northern Sami
- 16. Sanskrit (Saskta) 17. Croatian 18. Javanese
- 19. Sundanese 20. Swahili 21. Swedish
- 22. Albanian 23. Serbian 24. Marathi (Marāhī)
- 25. Breton 26. Bosnian 27. Bengali
- 28. Tibetan Standard, Tib... 29. Bulgarian 30. Belarusian
- 31. West Flemish 32. Bashkir 33. Malay
- 34. Romanian, Moldavian, ... 35. Romansh 36. Esperanto
- 37. Asturian 38. Greek, Modern 39. Burmese
- 40. Maltese 41. Malagasy 42. Spanish; Castilian
- 43. Russian 44. Mongolian 45. Chinese
- 46. Estonian 47. Yoruba 48. Sakha
- 49. Alemannic 50. Assamese 51. Lombard language
- 52. Yiddish 53. Silesian 54. Venetian
- 55. Azerbaijani 56. Afrikaans 57. Aragonese
- 58. Amharic 59. Hebrew (modern) 60. Hindi
- 61. Quechua 62. Haitian; Haitian Creole 63. Hungarian
- 64. Bishnupriya Manipuri 65. Armenian 66. Gan Chinese
- 67. Macedonian 68. Georgian 69. Khmer
- 70. Panjabi, Punjabi 71. Korean 72. Kannada
- 73. Kazakh 74. Kurdish 75. Basque
- 76. Pashto, Pushto 77. Portuguese 78. Gujarati
- 79. Manx 80. Irish 81. Scottish Gaelic; Gaelic
- 82. Upper Sorbian 83. Galician 84. Arabic
- 85. Walloon 86. Urdu 87. Norwegian Nynorsk
- 88. Norwegian 89. Dutch 90. Chinese Character
- 91. Nepali 92. French 93. Western Frisian
- 94. Bavarian 95. English 96. Persian
- 97. Polish 98. Finnish 99. Faroese
- 100. Italian 101. Icelandic 102. Volapük
- 103. Ido 104. Waray-Waray 105. Indonesian
- 106. Interlingua 107. Lithuanian 108. Uzbek
- 109. Latvian 110. German 111. Danish
- 112. Cebuano 113. Ukrainian 114. Latin
- 115. Luxembourgish, Letzeb... 116. Divehi; Dhivehi; Mald... 117. Vietnamese
- 118. Uighur, Uyghur 119. Limburgish, Limburgan... 120. Zazaki
- 121. Ilokano 122. Fiji Hindi 123. Malayalam
- 124. Tatar 125. Kirghiz, Kyrgyz 126. Ossetian, Ossetic
- 127. Oriya 128. Turkish 129. Tamil
- 130. Tagalog 131. Thai 132. Turkmen
- 133. Telugu 134. Occitan 135. Tajik
- 136. Piedmontese language
Model download
Download the English and Chinese word vectors:
- $ python
- >>> import polyglot
- >>> !polyglot download embeddings2.zh embeddings2.en
- [polyglot_data] Downloading package embeddings2.zh to
- [polyglot_data] Downloading package embeddings2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency and load the word vectors:
- >>> from polyglot.mapping import Embedding
- >>> embeddings = Embedding.load('/home/user/polyglot_data/embeddings2/zh/embeddings_pkl.tar.bz2')
Query a word vector:
- >>> print(embeddings.get("中国"))
- [ 0.60831094  0.37644583 -0.67009342  0.43529209  0.12993187 -0.07703398 -0.04931475 -0.42763838
-  -0.42447501 -0.0219319  -0.52271312 -0.57149178 -0.48139745 -0.31942225  0.12747335  0.34054375
-   0.27137381  0.1362032  -0.54999739 -0.39569679  1.01767457  0.12317979 -0.12878017 -0.65476489
-   0.18644606  0.2178454   0.18150428  0.18464987  0.29027358  0.21979097 -0.21173042  0.08130789
-  -0.77350897  0.66575652 -0.14730017  0.11383133  0.83101833  0.01702038 -0.71277034  0.29339811
-   0.3320756   0.25922608 -0.51986367  0.16533957  0.04327472  0.36460632  0.42984027  0.04811303
-  -0.16718218 -0.18613082 -0.52108622 -0.47057685 -0.14663117 -0.30221295  0.72923231 -0.54835045
-  -0.48428732  0.65475166 -0.34853089  0.03206051  0.2574054   0.07614037  0.32844698 -0.0087136 ]
- >>> print(len(embeddings.get("中国")))
- 64
Nearest-neighbor query:
- >>> neighbors = embeddings.nearest_neighbors("中国")
- >>> print(" ".join(neighbors))
- 上海 美国 韩国 北京 欧洲 台湾 法国 德国 天津 广州
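Beyond nearest_neighbors, raw vectors can be compared directly, for example with cosine similarity (the helper below is our own; the commented line shows how it would apply to the embeddings loaded above):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# With polyglot: cosine_similarity(embeddings.get("中国"), embeddings.get("美国"))
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))
# → 0.7071
```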
Transliteration
polyglot's transliteration uses an unsupervised method (see the paper "False-Friend Detection and Entity Matching via Unsupervised Transliteration", https://arxiv.org/abs/1611.06722) and supports 69 languages:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("transliteration2"))
- 1. Haitian; Haitian Creole 2. Tamil 3. Vietnamese
- 4. Telugu 5. Croatian 6. Hungarian
- 7. Thai 8. Kannada 9. Tagalog
- 10. Armenian 11. Hebrew (modern) 12. Turkish
- 13. Portuguese 14. Belarusian 15. Norwegian Nynorsk
- 16. Norwegian 17. Dutch 18. Japanese
- 19. Albanian 20. Bulgarian 21. Serbian
- 22. Swahili 23. Swedish 24. French
- 25. Latin 26. Czech 27. Yiddish
- 28. Hindi 29. Danish 30. Finnish
- 31. German 32. Bosnian-Croatian-Serbian 33. Slovak
- 34. Persian 35. Lithuanian 36. Slovene
- 37. Latvian 38. Bosnian 39. Gujarati
- 40. Italian 41. Icelandic 42. Spanish; Castilian
- 43. Ukrainian 44. Urdu 45. Indonesian
- 46. Khmer 47. Galician 48. Korean
- 49. Afrikaans 50. Georgian 51. Catalan; Valencian
- 52. Romanian, Moldavian, ... 53. Basque 54. Macedonian
- 55. Russian 56. Azerbaijani 57. Chinese
- 58. Estonian 59. Welsh 60. Arabic
- 61. Bengali 62. Amharic 63. Irish
- 64. Malay 65. Marathi (Marāhī) 66. Polish
- 67. Greek, Modern 68. Esperanto 69. Maltese
Model download
Download the English and Chinese transliteration models:
- $ python
- >>> import polyglot
- >>> !polyglot download transliteration2.zh transliteration2.en
- [polyglot_data] Downloading package transliteration2.zh to
- [polyglot_data] Downloading package transliteration2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
>>> from polyglot.text import Text
Transliterate English into Chinese:
- >>> text = Text(text_en)
- >>> print(text_en)
- Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.
- >>> print("".join([t for t in text.transliterate("zh")]))
拉斯特帕格普罗维德尔哈斯安诺乌恩斯德伊特维尔恩德伊特斯塞尔维斯因塞普特艾伯布林吉恩格阿恩阿特伊奥纳尔恩德托特埃莱科姆穆尼卡特伊昂布熙佩尔斯年年耶阿尔斯阿夫特特海尔乌斯尔斯雷马因苏布斯克里贝德托托基奥特埃莱梅斯斯阿格埃惠克赫哈斯诺特马德特赫德耶夫伊斯斯因耶阿尔斯
The output shows that the transliteration quality is rather poor, so we will not cover it further.
Pipelines
A pipeline runs several NLP tasks in sequence, feeding each task's output into the next task's input. In entity and relation extraction, for example, the pipeline approach first recognizes the entities and then identifies the relations between them; the alternative is a joint model that handles entity and relation recognition together.
Example
Tokenize the input first, then count the words that occur at least twice:
- >>> !polyglot --lang en tokenize --input testdata/example.txt | polyglot count --min-count 2
- in 10
- the 6
- . 6
- - 5
- , 4
- of 3
- and 3
- by 3
- South 2
- 5 2
- 2007 2
- Bermuda 2
- which 2
- score 2
- against 2
- Mitchell 2
- as 2
- West 2
- India 2
- beat 2
- Afghanistan 2
- Indies 2
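The count step of this pipeline can be reproduced in plain Python with collections.Counter; a sketch (the sample tokens are our own illustration; with polyglot, they would come from Text(...).words):

```python
from collections import Counter

def count_frequent(tokens, min_count=2):
    """Mirror `polyglot count --min-count 2`: keep tokens seen at least
    min_count times, most frequent first."""
    return [(w, c) for w, c in Counter(tokens).most_common() if c >= min_count]

tokens = "the cat sat on the mat , the mat sat".split()
print(count_frequent(tokens))
# → [('the', 3), ('sat', 2), ('mat', 2)]
```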
Source: http://www.tuicool.com/articles/FR7Vzue