Knownsec IA-Lab · Yue Yongpeng
Currently, among open-source Python packages for NLP, NLTK, spaCy, StanfordCoreNLP, GATE, and OpenNLP support English, while Jieba, ICTCLAS, THULAC, and HIT LTP support Chinese. Most of these tools, however, only support a specific set of languages. This article introduces polyglot, a powerful multilingual NLP toolkit for Python that supports pipeline-style processing. The project was open-sourced on GitHub by AboSamoor on March 16, 2015, and has collected 1,021 stars.
- Free software: GPLv3 license
- Documentation: http://polyglot.readthedocs.org/
- GitHub: https://github.com/aboSamoor/polyglot
Features
Language Detection (supports 196 languages)
Tokenization: sentence and word segmentation (supports 165 languages)
Named Entity Recognition (supports 40 languages)
Part of Speech Tagging (supports 16 languages)
Sentiment Analysis (supports 136 languages)
Word Embeddings (supports 137 languages)
Transliteration (supports 69 languages)
Pipelines
Installation
Install or upgrade from PyPI:
$ pip install polyglot
polyglot depends on numpy and libicu-dev. On Ubuntu/Debian Linux distributions you can install these with:
$ sudo apt-get install python-numpy libicu-dev
After a successful installation, check the version:
- >>> import polyglot
- >>> polyglot.__version__
- '16.07.04'
Data
In the examples that follow, Chinese, English, and mixed Chinese-English sentences are used as test data.
- text_en = u"Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."
- text_cn = u" 日本最后一家寻呼机服务营业商宣布, 将于 2019 年 9 月结束服务, 标志着日本寻呼业长达 50 年的历史正式落幕. 目前大约还有 1500 名用户使用东京电信通信公司提供的寻呼服务, 该公司在 20 年前就已停止生产寻呼机."
- text_mixed = text_cn + text_en
Language Detection
polyglot's language detection relies on pycld2 (https://pypi.org/project/pycld2/) and cld2 (https://code.google.com/p/cld2/); cld2 is a multilingual detection library developed by Google.
Example
Import the dependency:
from polyglot.detect import Detector
Detect the language:
- >>> Detector(text_cn).language
- name: Chinese code: zh confidence: 99.0 read bytes: 1996
- >>> Detector(text_en).language
- name: English code: en confidence: 99.0 read bytes: 1144
- >>> Detector(text_mixed).language
- name: Chinese code: zh confidence: 50.0 read bytes: 1996
For the mixed text text_mixed, the detected language is Chinese, but with a confidence of only 50. To inspect every language contained in the text:
- >>> for language in Detector(text_mixed):
- ...     print(language)
- name: Chinese code: zh confidence: 50.0 read bytes: 1996
- name: English code: en confidence: 49.0 read bytes: 1144
- name: un code: un confidence: 0.0 read bytes: 0
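When a text mixes languages, it is often useful to keep only the detections above a confidence threshold. A minimal sketch using the values from the session above (the confident_languages helper and its threshold are this article's own illustration, not part of polyglot's API):

```python
def confident_languages(detections, min_confidence=20.0):
    """Keep (name, code, confidence) tuples above a threshold,
    dropping cld2's 'un' (unknown) filler entry."""
    return [d for d in detections
            if d[1] != "un" and d[2] >= min_confidence]

# With polyglot installed, detections could be built as:
#   from polyglot.detect import Detector
#   detections = [(l.name, l.code, l.confidence) for l in Detector(text_mixed)]
# Here we reuse the values printed in the session above:
detections = [("Chinese", "zh", 50.0), ("English", "en", 49.0), ("un", "un", 0.0)]
print(confident_languages(detections))
# → [('Chinese', 'zh', 50.0), ('English', 'en', 49.0)]
```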
The language types currently supported by cld2 (https://code.google.com/p/cld2/) are:
- >>> Detector.supported_languages()
- 1. Abkhazian 2. Afar 3. Afrikaans
- 4. Akan 5. Albanian 6. Amharic
- 7. Arabic 8. Armenian 9. Assamese
- 10. Aymara 11. Azerbaijani 12. Bashkir
- 13. Basque 14. Belarusian 15. Bengali
- 16. Bihari 17. Bislama 18. Bosnian
- 19. Breton 20. Bulgarian 21. Burmese
- 22. Catalan 23. Cebuano 24. Cherokee
- 25. Nyanja 26. Corsican 27. Croatian
- 28. Croatian 29. Czech 30. Chinese
- 31. Chinese 32. Chinese 33. Chinese
- 34. Chineset 35. Chineset 36. Chineset
- 37. Chineset 38. Chineset 39. Chineset
- 40. Danish 41. Dhivehi 42. Dutch
- 43. Dzongkha 44. English 45. Esperanto
- 46. Estonian 47. Ewe 48. Faroese
- 49. Fijian 50. Finnish 51. French
- 52. Frisian 53. Ga 54. Galician
- 55. Ganda 56. Georgian 57. German
- 58. Greek 59. Greenlandic 60. Guarani
- 61. Gujarati 62. Haitian_creole 63. Hausa
- 64. Hawaiian 65. Hebrew 66. Hebrew
- 67. Hindi 68. Hmong 69. Hungarian
- 70. Icelandic 71. Igbo 72. Indonesian
- 73. Interlingua 74. Interlingue 75. Inuktitut
- 76. Inupiak 77. Irish 78. Italian
- 79. Ignore 80. Javanese 81. Javanese
- 82. Japanese 83. Kannada 84. Kashmiri
- 85. Kazakh 86. Khasi 87. Khmer
- 88. Kinyarwanda 89. Krio 90. Kurdish
- 91. Kyrgyz 92. Korean 93. Laothian
- 94. Latin 95. Latvian 96. Limbu
- 97. Limbu 98. Limbu 99. Lingala
- 100. Lithuanian 101. Lozi 102. Luba_lulua
- 103. Luo_kenya_and_tanzania 104. Luxembourgish 105. Macedonian
- 106. Malagasy 107. Malay 108. Malayalam
- 109. Maltese 110. Manx 111. Maori
- 112. Marathi 113. Mauritian_creole 114. Romanian
- 115. Mongolian 116. Montenegrin 117. Montenegrin
- 118. Montenegrin 119. Montenegrin 120. Nauru
- 121. Ndebele 122. Nepali 123. Newari
- 124. Norwegian 125. Norwegian 126. Norwegian_n
- 127. Nyanja 128. Occitan 129. Oriya
- 130. Oromo 131. Ossetian 132. Pampanga
- 133. Pashto 134. Pedi 135. Persian
- 136. Polish 137. Portuguese 138. Punjabi
- 139. Quechua 140. Rajasthani 141. Rhaeto_romance
- 142. Romanian 143. Rundi 144. Russian
- 145. Samoan 146. Sango 147. Sanskrit
- 148. Scots 149. Scots_gaelic 150. Serbian
- 151. Serbian 152. Seselwa 153. Seselwa
- 154. Sesotho 155. Shona 156. Sindhi
- 157. Sinhalese 158. Siswant 159. Slovak
- 160. Slovenian 161. Somali 162. Spanish
- 163. Sundanese 164. Swahili 165. Swedish
- 166. Syriac 167. Tagalog 168. Tajik
- 169. Tamil 170. Tatar 171. Telugu
- 172. Thai 173. Tibetan 174. Tigrinya
- 175. Tonga 176. Tsonga 177. Tswana
- 178. Tumbuka 179. Turkish 180. Turkmen
- 181. Twi 182. Uighur 183. Ukrainian
- 184. Urdu 185. Uzbek 186. Venda
- 187. Vietnamese 188. Volapuk 189. Waray_philippines
- 190. Welsh 191. Wolof 192. Xhosa
- 193. Yiddish 194. Yoruba 195. Zhuang
- 196. Zulu
Tokenization
In NLP, tasks can operate at the character, word, sentence, paragraph, or document level; tokenization is the job of finding the boundaries of characters, words, sentences, and paragraphs. Paragraphs can be split on \n or \r\n, and character segmentation is easy to implement, but sentence and word segmentation are comparatively complex.
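The paragraph-level split mentioned above needs no library at all; a minimal sketch (the split_paragraphs helper is our own illustration):

```python
def split_paragraphs(text):
    """Split text into paragraphs on newline boundaries,
    normalizing CRLF to LF and dropping empty segments."""
    return [p.strip() for p in text.replace("\r\n", "\n").split("\n") if p.strip()]

doc = "First paragraph.\r\nSecond paragraph.\n\nThird paragraph."
print(split_paragraphs(doc))
# → ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```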
Example
Import the dependency:
from polyglot.text import Text
Sentence segmentation:
- >>> Text(text_cn).sentences
- [Sentence("日本最后一家寻呼机服务营业商宣布, 将于 2019 年 9 月结束服务, 标志着日本寻呼业长达 50 年的历史正式落幕."), Sentence("目前大约还有 1500 名用户使用东京电信通信公司提供的寻呼服务, 该公司在 20 年前就已停止生产寻呼机.")]
- >>> Text(text_en).sentences
- [Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.")]
- >>> Text(text_mixed).sentences
- [Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."), Sentence(" 日本最后一家寻呼机服务营业商宣布, 将于 2019 年 9 月结束服务, 标志着日本寻呼业长达 50 年的历史正式落幕."), Sentence(" 目前大约还有 1500 名用户使用东京电信通信公司提供的寻呼服务, 该公司在 20 年前就已停止生产寻呼机.")]
Word segmentation:
- >>> Text(text_cn).words
- 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 2019 年 9 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 . 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 .
- >>> Text(text_en).words
- Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years .
- >>> Text(text_mixed).words
- Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years . 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 2019 年 9 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 . 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 .
Named Entity Recognition
Named entity recognition (NER) identifies entities with specific meaning in text, usually in three categories:
Entity names: person, place, organization, product, and brand names, etc.
Time expressions: dates and times
Numeric expressions: birthdays, phone numbers, QQ numbers, etc.
NER methods likewise fall into three groups:
Rule-based: linguistic grammar-based techniques
In engineering practice the rule-based approach mostly amounts to writing many regular expressions (RegEx); it can handle part of the time and numeric entity classes.
Statistical models
Statistical approaches mainly use HMM and CRF models and are currently the most mature option.
Deep learning models
Deep learning is currently the most popular approach, especially RNN-family models, which absorb more of the text's semantic information and deliver the best results to date.
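The rule-based approach can be illustrated with a few standard-library regular expressions; the patterns below are toy examples for the time and numeric classes, not a production rule set:

```python
import re

# Toy patterns: an ISO-style date and a mobile-style phone number.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")

def rule_based_entities(text):
    """Extract time- and number-class entities by pattern matching."""
    return {"DATE": DATE_RE.findall(text), "PHONE": PHONE_RE.findall(text)}

print(rule_based_entities("Meeting on 2019-09-30, call 138-1234-5678."))
# → {'DATE': ['2019-09-30'], 'PHONE': ['138-1234-5678']}
```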
polyglot's NER models are trained on Wikipedia (WIKI) text; the trained models are not bundled with the initial installation and must be downloaded separately. polyglot supports recognition of entity names (person, place, organization) in 40 languages:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("ner2", 3))
- 1. Polish 2. Turkish 3. Russian
- 4. Indonesian 5. Czech 6. Arabic
- 7. Korean 8. Catalan; Valencian 9. Italian
- 10. Thai 11. Romanian, Moldavian, ... 12. Tagalog
- 13. Danish 14. Finnish 15. German
- 16. Persian 17. Dutch 18. Chinese
- 19. French 20. Portuguese 21. Slovak
- 22. Hebrew (modern) 23. Malay 24. Slovene
- 25. Bulgarian 26. Hindi 27. Japanese
- 28. Hungarian 29. Croatian 30. Ukrainian
- 31. Serbian 32. Lithuanian 33. Norwegian
- 34. Latvian 35. Swedish 36. English
- 37. Greek, Modern 38. Spanish; Castilian 39. Vietnamese
- 40. Estonian
Model download
Download the English and Chinese NER models:
- $ python
- >>> import polyglot
- >>> !polyglot download ner2.en ner2.zh embeddings2.zh embeddings2.en
- [polyglot_data] Downloading package ner2.en to
- [polyglot_data] Downloading package ner2.zh to
- [polyglot_data] Downloading package embeddings2.zh to
- [polyglot_data] Downloading package embeddings2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
>>> from polyglot.text import Text
Entity recognition:
- >>> Text(text_cn).entities
- [I-ORG([u'东京'])]
- >>> Text(text_en).entities
- [I-LOC([u'Tokyo'])]
- >>> Text(text_mixed).entities
- [I-ORG([u'东京'])]
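The entity objects printed above can be flattened into readable strings. A small sketch (format_entities is our own helper; the commented polyglot call assumes each entity is a word list carrying a .tag attribute, as the output above suggests):

```python
def format_entities(entities):
    """Render (tag, words) pairs like 'ORG: 东京' from
    polyglot-style tags such as 'I-ORG'."""
    return ["%s: %s" % (tag.split("-")[-1], " ".join(words))
            for tag, words in entities]

# With polyglot: entities = [(e.tag, list(e)) for e in Text(text_cn).entities]
print(format_entities([("I-ORG", ["东京"]), ("I-LOC", ["Tokyo"])]))
# → ['ORG: 东京', 'LOC: Tokyo']
```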
Part of Speech Tagging
Part-of-speech tagging assigns a POS tag to each token. Commonly used tags include:
ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other
polyglot's part-of-speech models are trained on the CoNLL datasets; 16 languages are supported, and Chinese is not among them:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("pos2"))
- 1. German 2. Italian 3. Danish
- 4. Czech 5. Slovene 6. French
- 7. English 8. Swedish 9. Bulgarian
- 10. Spanish; Castilian 11. Indonesian 12. Portuguese
- 13. Finnish 14. Irish 15. Hungarian
- 16. Dutch
Model download
Download the English POS model:
- $ python
- >>> import polyglot
- >>> !polyglot download pos2.en
- [polyglot_data] Downloading package pos2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
from polyglot.text import Text
Part-of-speech tagging:
- >>> Text(text_en).pos_tags
- [(u"Japan's", u'NUM'), (u'last', u'ADJ'), (u'pager', u'NOUN'), (u'provider', u'NOUN'), (u'has', u'AUX'), (u'announced', u'VERB'), (u'it', u'PRON'), (u'will', u'AUX'), (u'end', u'VERB'), (u'its', u'PRON'), (u'service', u'NOUN'), (u'in', u'ADP'), (u'September', u'PROPN'), (u'2019', u'NUM'), (u'-', u'PUNCT'), (u'bringing', u'VERB'), (u'a', u'DET'), (u'national', u'ADJ'), (u'end', u'NOUN'), (u'to', u'ADP'), (u'telecommunication', u'VERB'), (u'beepers', u'NUM'), (u',', u'PUNCT'), (u'50', u'NUM'), (u'years', u'NOUN'), (u'after', u'ADP'), (u'their', u'PRON'), (u'introduction.Around', u'NUM'), (u'1,500', u'NUM'), (u'users', u'NOUN'), (u'remain', u'VERB'), (u'subscribed', u'VERB'), (u'to', u'ADP'), (u'Tokyo', u'PROPN'), (u'Telemessage', u'PROPN'), (u',', u'PUNCT'), (u'which', u'DET'), (u'has', u'AUX'), (u'not', u'PART'), (u'made', u'VERB'), (u'the', u'DET'), (u'devices', u'NOUN'), (u'in', u'ADP'), (u'20', u'NUM'), (u'years', u'NOUN'), (u'.', u'PUNCT')]
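The (word, tag) pairs above are easy to post-process, for instance to pull out the noun-like tokens (the nouns helper is our own illustration):

```python
def nouns(tagged):
    """Keep tokens tagged as common or proper nouns."""
    return [w for w, t in tagged if t in ("NOUN", "PROPN")]

# With polyglot: tagged = Text(text_en).pos_tags
tagged = [("Japan's", "NUM"), ("last", "ADJ"), ("pager", "NOUN"),
          ("provider", "NOUN"), ("September", "PROPN")]
print(nouns(tagged))
# → ['pager', 'provider', 'September']
```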
Sentiment Analysis
polyglot's sentiment analysis works at the word level: each token is labeled 1 (positive), 0 (neutral), or -1 (negative). 136 languages are currently supported:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("sentiment2"))
- 1. Turkmen 2. Thai 3. Latvian
- 4. Zazaki 5. Tagalog 6. Tamil
- 7. Tajik 8. Telugu 9. Luxembourgish, Letzeb...
- 10. Alemannic 11. Latin 12. Turkish
- 13. Limburgish, Limburgan... 14. Egyptian Arabic 15. Tatar
- 16. Lithuanian 17. Spanish; Castilian 18. Basque
- 19. Estonian 20. Asturian 21. Greek, Modern
- 22. Esperanto 23. English 24. Ukrainian
- 25. Marathi (Marāhī) 26. Maltese 27. Burmese
- 28. Kapampangan 29. Uighur, Uyghur 30. Uzbek
- 31. Malagasy 32. Yiddish 33. Macedonian
- 34. Urdu 35. Malayalam 36. Mongolian
- 37. Breton 38. Bosnian 39. Bengali
- 40. Tibetan Standard, Tib... 41. Belarusian 42. Bulgarian
- 43. Bashkir 44. Vietnamese 45. Volapük
- 46. Gan Chinese 47. Manx 48. Gujarati
- 49. Yoruba 50. Occitan 51. Scottish Gaelic; Gaelic
- 52. Irish 53. Galician 54. Ossetian, Ossetic
- 55. Oriya 56. Walloon 57. Swedish
- 58. Silesian 59. Lombard language 60. Divehi; Dhivehi; Mald...
- 61. Danish 62. German 63. Armenian
- 64. Haitian; Haitian Creole 65. Hungarian 66. Croatian
- 67. Bishnupriya Manipuri 68. Hindi 69. Hebrew (modern)
- 70. Portuguese 71. Afrikaans 72. Pashto, Pushto
- 73. Amharic 74. Aragonese 75. Bavarian
- 76. Assamese 77. Panjabi, Punjabi 78. Polish
- 79. Azerbaijani 80. Italian 81. Arabic
- 82. Icelandic 83. Ido 84. Scots
- 85. Sicilian 86. Indonesian 87. Chinese Word
- 88. Interlingua 89. Waray-Waray 90. Piedmontese language
- 91. Quechua 92. French 93. Dutch
- 94. Norwegian Nynorsk 95. Norwegian 96. Western Frisian
- 97. Upper Sorbian 98. Nepali 99. Persian
- 100. Ilokano 101. Finnish 102. Faroese
- 103. Romansh 104. Javanese 105. Romanian, Moldavian, ...
- 106. Malay 107. Japanese 108. Russian
- 109. Catalan; Valencian 110. Fiji Hindi 111. Chinese
- 112. Cebuano 113. Czech 114. Chuvash
- 115. Welsh 116. West Flemish 117. Kirghiz, Kyrgyz
- 118. Kurdish 119. Kazakh 120. Korean
- 121. Kannada 122. Khmer 123. Georgian
- 124. Sakha 125. Serbian 126. Albanian
- 127. Swahili 128. Chechen 129. Sundanese
- 130. Sanskrit (Saskta) 131. Venetian 132. Northern Sami
- 133. Slovak 134. Sinhala, Sinhalese 135. Bosnian-Croatian-Serbian
- 136. Slovene
Model download
Download the English and Chinese sentiment models:
- $ python
- >>> import polyglot
- >>> !polyglot download sentiment2.en sentiment2.zh
- [polyglot_data] Downloading package sentiment2.en to
- [polyglot_data] Downloading package sentiment2.zh to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
from polyglot.text import Text
Sentiment analysis:
- >>> text = Text("The movie is very good and the actors are prefect, but the cinema environment is very poor.")
- >>> print(text.words,text.polarity)
- (WordList([u'The', u'movie', u'is', u'very', u'good', u'and', u'the', u'actors', u'are', u'prefect', u',', u'but', u'the', u'cinema', u'environment', u'is', u'very', u'poor', u'.']), 0.0)
- >>> print([(w,w.polarity) for w in text.words])
- [(u'The', 0), (u'movie', 0), (u'is', 0), (u'very', 0), (u'good', 1), (u'and', 0), (u'the', 0), (u'actors', 0), (u'are', 0), (u'prefect', 0), (u',', 0), (u'but', 0), (u'the', 0), (u'cinema', 0), (u'environment', 0), (u'is', 0), (u'very', 0), (u'poor', -1), (u'.', 0)]
- >>> text = Text("这部电影故事非常好, 演员也非常棒, 但是电影院环境非常差.")
- >>> print(text.words,text.polarity)
- (WordList([这 部 电影 故事 非常 好 , 演员 也 非常 棒 , 但是 电影 院 环境 非常 差 .]), 0.0)
- >>> print([(w,w.polarity) for w in text.words])
- [(u'\u8fd9', 0), (u'\u90e8', 0), (u'\u7535\u5f71', 0), (u'\u6545\u4e8b', 0), (u'\u975e\u5e38', 0), (u'\u597d', 1), (u'\uff0c', 0), (u'\u6f14\u5458', 0), (u'\u4e5f', 0), (u'\u975e\u5e38', 0), (u'\u68d2', 0), (u'\uff0c', 0), (u'\u4f46\u662f', 0), (u'\u7535\u5f71', 0), (u'\u9662', 0), (u'\u73af\u5883', 0), (u'\u975e\u5e38', 0), (u'\u5dee', -1), (u'\u3002', 0)]
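A crude sentence-level score can be derived by averaging the word-level labels above; a sketch (mean_polarity is our own helper, not a polyglot API, and polyglot's text.polarity may be computed differently):

```python
def mean_polarity(word_polarities):
    """Average per-word polarity: 1 positive, 0 neutral, -1 negative."""
    if not word_polarities:
        return 0.0
    return sum(word_polarities) / float(len(word_polarities))

# With polyglot: scores = [w.polarity for w in Text(...).words]
# Scores from the English movie-review sentence above:
scores = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0]
print(round(mean_polarity(scores), 3))
# → 0.0
```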
Word Embeddings
In NLP, word embedding is the collective name for a family of language models and feature-learning techniques that map words or phrases in a vocabulary to vectors of real numbers. There are two common approaches: discrete and distributed representations. Discrete methods include one-hot and N-gram; their drawbacks are that they capture word-to-word similarity poorly and suffer from the curse of dimensionality. The idea behind distributed representations is to represent a word by the other words that appear around it, which is the familiar word2vec. word2vec comprises the Skip-Gram model, which predicts the n surrounding words from the current word, and the CBOW model, which predicts a word from its n context words. Pretrained English vectors are available from GloVe (https://nlp.stanford.edu/projects/glove/) in 50, 100, 200, and 300 dimensions, and Tencent AI Lab recently open-sourced 200-dimensional Chinese word vectors (https://ai.tencent.com/ailab/nlp/embedding.html). polyglot can read word vectors from several sources:
- Gensim word2vec objects (from_gensim method)
- Word2vec binary/text models (from_word2vec method)
- GloVe models (from_glove method)
- polyglot pickle files (load method)
Among these, the polyglot pickle files cover word vectors for 136 languages:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("embeddings2"))
- 1. Scots 2. Sicilian 3. Welsh
- 4. Chuvash 5. Czech 6. Egyptian Arabic
- 7. Kapampangan 8. Chechen 9. Catalan; Valencian
- 10. Slovene 11. Sinhala, Sinhalese 12. Bosnian-Croatian-Serbian
- 13. Slovak 14. Japanese 15. Northern Sami
- 16. Sanskrit (Saskta) 17. Croatian 18. Javanese
- 19. Sundanese 20. Swahili 21. Swedish
- 22. Albanian 23. Serbian 24. Marathi (Marāhī)
- 25. Breton 26. Bosnian 27. Bengali
- 28. Tibetan Standard, Tib... 29. Bulgarian 30. Belarusian
- 31. West Flemish 32. Bashkir 33. Malay
- 34. Romanian, Moldavian, ... 35. Romansh 36. Esperanto
- 37. Asturian 38. Greek, Modern 39. Burmese
- 40. Maltese 41. Malagasy 42. Spanish; Castilian
- 43. Russian 44. Mongolian 45. Chinese
- 46. Estonian 47. Yoruba 48. Sakha
- 49. Alemannic 50. Assamese 51. Lombard language
- 52. Yiddish 53. Silesian 54. Venetian
- 55. Azerbaijani 56. Afrikaans 57. Aragonese
- 58. Amharic 59. Hebrew (modern) 60. Hindi
- 61. Quechua 62. Haitian; Haitian Creole 63. Hungarian
- 64. Bishnupriya Manipuri 65. Armenian 66. Gan Chinese
- 67. Macedonian 68. Georgian 69. Khmer
- 70. Panjabi, Punjabi 71. Korean 72. Kannada
- 73. Kazakh 74. Kurdish 75. Basque
- 76. Pashto, Pushto 77. Portuguese 78. Gujarati
- 79. Manx 80. Irish 81. Scottish Gaelic; Gaelic
- 82. Upper Sorbian 83. Galician 84. Arabic
- 85. Walloon 86. Urdu 87. Norwegian Nynorsk
- 88. Norwegian 89. Dutch 90. Chinese Character
- 91. Nepali 92. French 93. Western Frisian
- 94. Bavarian 95. English 96. Persian
- 97. Polish 98. Finnish 99. Faroese
- 100. Italian 101. Icelandic 102. Volapük
- 103. Ido 104. Waray-Waray 105. Indonesian
- 106. Interlingua 107. Lithuanian 108. Uzbek
- 109. Latvian 110. German 111. Danish
- 112. Cebuano 113. Ukrainian 114. Latin
- 115. Luxembourgish, Letzeb... 116. Divehi; Dhivehi; Mald... 117. Vietnamese
- 118. Uighur, Uyghur 119. Limburgish, Limburgan... 120. Zazaki
- 121. Ilokano 122. Fiji Hindi 123. Malayalam
- 124. Tatar 125. Kirghiz, Kyrgyz 126. Ossetian, Ossetic
- 127. Oriya 128. Turkish 129. Tamil
- 130. Tagalog 131. Thai 132. Turkmen
- 133. Telugu 134. Occitan 135. Tajik
- 136. Piedmontese language
Model download
Download the English and Chinese word vectors:
- $ python
- >>> import polyglot
- >>> !polyglot download embeddings2.zh embeddings2.en
- [polyglot_data] Downloading package embeddings2.zh to
- [polyglot_data] Downloading package embeddings2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency and load the word vectors:
- >>> from polyglot.mapping import Embedding
- >>> embeddings = Embedding.load('/home/user/polyglot_data/embeddings2/zh/embeddings_pkl.tar.bz2')
Query a word vector:
- >>> print(embeddings.get("中国"))
- [ 0.60831094  0.37644583 -0.67009342  0.43529209  0.12993187 -0.07703398 -0.04931475 -0.42763838
-  -0.42447501 -0.0219319  -0.52271312 -0.57149178 -0.48139745 -0.31942225  0.12747335  0.34054375
-   0.27137381  0.1362032  -0.54999739 -0.39569679  1.01767457  0.12317979 -0.12878017 -0.65476489
-   0.18644606  0.2178454   0.18150428  0.18464987  0.29027358  0.21979097 -0.21173042  0.08130789
-  -0.77350897  0.66575652 -0.14730017  0.11383133  0.83101833  0.01702038 -0.71277034  0.29339811
-   0.3320756   0.25922608 -0.51986367  0.16533957  0.04327472  0.36460632  0.42984027  0.04811303
-  -0.16718218 -0.18613082 -0.52108622 -0.47057685 -0.14663117 -0.30221295  0.72923231 -0.54835045
-  -0.48428732  0.65475166 -0.34853089  0.03206051  0.2574054   0.07614037  0.32844698 -0.0087136 ]
- >>> print(len(embeddings.get("中国")))
- 64
Nearest-neighbor query:
- >>> neighbors = embeddings.nearest_neighbors("中国")
- >>> print(" ".join(neighbors))
- 上海 美国 韩国 北京 欧洲 台湾 法国 德国 天津 广州
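Beyond nearest_neighbors, raw vectors can be compared directly, for example with cosine similarity (the helper below is our own; the commented line shows how it would apply to the embeddings loaded above):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# With polyglot: cosine_similarity(embeddings.get("中国"), embeddings.get("美国"))
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))
# → 0.7071
```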
Transliteration
polyglot's transliteration uses an unsupervised method (see the paper "False-Friend Detection and Entity Matching via Unsupervised Transliteration", https://arxiv.org/abs/1611.06722) and supports 69 languages:
- >>> from polyglot.downloader import downloader
- >>> print(downloader.supported_languages_table("transliteration2"))
- 1. Haitian; Haitian Creole 2. Tamil 3. Vietnamese
- 4. Telugu 5. Croatian 6. Hungarian
- 7. Thai 8. Kannada 9. Tagalog
- 10. Armenian 11. Hebrew (modern) 12. Turkish
- 13. Portuguese 14. Belarusian 15. Norwegian Nynorsk
- 16. Norwegian 17. Dutch 18. Japanese
- 19. Albanian 20. Bulgarian 21. Serbian
- 22. Swahili 23. Swedish 24. French
- 25. Latin 26. Czech 27. Yiddish
- 28. Hindi 29. Danish 30. Finnish
- 31. German 32. Bosnian-Croatian-Serbian 33. Slovak
- 34. Persian 35. Lithuanian 36. Slovene
- 37. Latvian 38. Bosnian 39. Gujarati
- 40. Italian 41. Icelandic 42. Spanish; Castilian
- 43. Ukrainian 44. Urdu 45. Indonesian
- 46. Khmer 47. Galician 48. Korean
- 49. Afrikaans 50. Georgian 51. Catalan; Valencian
- 52. Romanian, Moldavian, ... 53. Basque 54. Macedonian
- 55. Russian 56. Azerbaijani 57. Chinese
- 58. Estonian 59. Welsh 60. Arabic
- 61. Bengali 62. Amharic 63. Irish
- 64. Malay 65. Marathi (Marāhī) 66. Polish
- 67. Greek, Modern 68. Esperanto 69. Maltese
Model download
Download the English and Chinese transliteration models:
- $ python
- >>> import polyglot
- >>> !polyglot download transliteration2.zh transliteration2.en
- [polyglot_data] Downloading package transliteration2.zh to
- [polyglot_data] Downloading package transliteration2.en to
- [polyglot_data] /home/user/polyglot_data...

Example
Import the dependency:
>>> from polyglot.text import Text
Transliterate English into Chinese:
- >>> text = Text(text_en)
- >>> print(text_en)
- Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.
- >>> print("".join([t for t in text.transliterate("zh")]))
拉斯特帕格普罗维德尔哈斯安诺乌恩斯德伊特维尔恩德伊特斯塞尔维斯因塞普特艾伯布林吉恩格阿恩阿特伊奥纳尔恩德托特埃莱科姆穆尼卡特伊昂布熙佩尔斯年年耶阿尔斯阿夫特特海尔乌斯尔斯雷马因苏布斯克里贝德托托基奥特埃莱梅斯斯阿格埃惠克赫哈斯诺特马德特赫德耶夫伊斯斯因耶阿尔斯
The output shows that the transliteration quality is rather poor, so we will not cover it further.
Pipelines
A pipeline runs several NLP tasks in sequence, feeding each task's output into the next task's input. In entity and relation extraction, for example, the pipeline approach first recognizes the entities and then identifies the relations between them; the alternative is a joint model that handles entity and relation recognition together.
Example
Tokenize the input first, then count the words that occur at least twice:
- >>> !polyglot --lang en tokenize --input testdata/example.txt | polyglot count --min-count 2
- in 10
- the 6
- . 6
- - 5
- , 4
- of 3
- and 3
- by 3
- South 2
- 5 2
- 2007 2
- Bermuda 2
- which 2
- score 2
- against 2
- Mitchell 2
- as 2
- West 2
- India 2
- beat 2
- Afghanistan 2
- Indies 2
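The count step of this pipeline can be reproduced in plain Python with collections.Counter; a sketch (the sample tokens are our own illustration; with polyglot, they would come from Text(...).words):

```python
from collections import Counter

def count_frequent(tokens, min_count=2):
    """Mirror `polyglot count --min-count 2`: keep tokens seen at least
    min_count times, most frequent first."""
    return [(w, c) for w, c in Counter(tokens).most_common() if c >= min_count]

tokens = "the cat sat on the mat , the mat sat".split()
print(count_frequent(tokens))
# → [('the', 3), ('sat', 2), ('mat', 2)]
```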
Source: http://www.tuicool.com/articles/FR7Vzue