Permalink: https://github.com/xitu/gold-miner/blob/master/TODO1/natural-language-processing-is-fun.md
Translator: https://github.com/lihanxiang
Proofreaders: https://github.com/FesonX , https://github.com/leviding , https://github.com/sakila1012
How Computers Understand Human Language
Computers are great at working with structured data like spreadsheets and database tables. But we humans usually communicate in words, not in tables, and that's a real headache for computers.

Unfortunately, we don't live in a world where every piece of data comes neatly structured.

A lot of the information in the world is unstructured — raw text in English or another human language. How can we get a computer to understand that unstructured text and extract information from it?

Natural Language Processing, or NLP for short, is the subfield of AI focused on getting computers to understand and process human language. Let's take a look at how NLP works and learn how to write Python programs that can extract information out of raw text.

Note: if you don't care how NLP works and just want to cut and paste some code, skip ahead to the section "Building the NLP Pipeline, Step-by-Step".
Can Computers Understand Language?
Ever since computers were invented, programmers have been trying to write programs that understand languages like English. The reason is obvious — humans have been writing things down for thousands of years, and it would be hugely beneficial if a computer could read and understand all that data.

Computers can't yet truly understand English the way a human does — but they can already do quite a lot! In certain narrow domains, what you can do with NLP already seems like magic, and applying NLP techniques to your projects can save you a great deal of time.

Even better, the latest advances in NLP are easily accessible through open-source Python libraries like spaCy ( https://spacy.io/ ), textacy ( http://textacy.readthedocs.io/en/latest/ ) and neuralcoref ( https://github.com/huggingface/neuralcoref ). All you need to do is write a few lines of code.
Extracting Meaning from Text Is Hard
The process of reading and understanding English is very complex — and that's without even considering how often English breaks its own rules of logic and consistency. For example, what does this news headline mean?

"Environmental regulators grill business owner over illegal coal fires."

Did the regulators question a business owner about illegally burning coal? Or did they literally cook the owner on a grill? As you can see, parsing English with a computer is going to be complicated.

Doing anything complicated in machine learning usually means building a pipeline. The idea is to break your problem up into small pieces and use machine learning to solve each piece separately. By chaining together several machine learning models that feed into each other, you can do very complicated things.

And that's exactly the strategy we'll use for NLP. We'll break the process of understanding English down into small chunks and see how each one works.
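As a minimal sketch of the pipeline idea (pure Python, not spaCy's actual pipeline — the function names here are invented for illustration), here is how small single-purpose steps can be chained so that each one feeds the next:

```python
import re

# A toy text pipeline, just to illustrate chaining small steps.
# Each step does one simple job and feeds its output to the next step.

def split_sentences(text):
    # Step 1: break raw text into sentences (naive split after . ! or ?)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Step 2: break a sentence into word tokens
    return re.findall(r"[A-Za-z']+", sentence)

def pipeline(text):
    # Chain the steps: raw text -> sentences -> lists of tokens
    return [tokenize(sentence) for sentence in split_sentences(text)]

# Each sentence comes out as its own token list
print(pipeline("London is big. It was founded by the Romans."))
```

Real NLP pipelines work the same way conceptually, but each step is a trained model rather than a regex.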
Building the NLP Pipeline, Step-by-Step
Let's look at a piece of text from Wikipedia:

London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.
First, install spaCy and the other libraries we'll use:

```
# Install spaCy
pip3 install -U spacy

# Download the large English model for spaCy
python3 -m spacy download en_core_web_lg

# Install textacy, which will also be useful
pip3 install -U textacy
```
```python
import spacy

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east
of the island of Great Britain, London has been a major settlement
for two millennia. It was founded by the Romans, who named it Londinium.
"""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of the text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
```
If you run that, you'll get a list of the named entities detected in our document along with their entity types:
```
London (GPE)
England (GPE)
the United Kingdom (GPE)
the River Thames (FAC)
Great Britain (GPE)
London (GPE)
two millennia (DATE)
Romans (NORP)
Londinium (PERSON)
```
You can look up what each of those entity codes means here: https://spacy.io/usage/linguistic-features#entity-types .
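For quick reference, here is a small excerpt of those label meanings, taken from the spaCy entity-types page linked above (the `ENTITY_MEANINGS` dict and `explain_label` helper here are just for illustration; with spaCy installed you can get the same strings directly from `spacy.explain()`):

```python
# A small excerpt of spaCy's entity labels and their documented meanings
ENTITY_MEANINGS = {
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "GPE": "Countries, cities, states",
    "DATE": "Absolute or relative dates or periods",
}

def explain_label(label):
    # Return a human-readable meaning for an entity label, if we know it
    return ENTITY_MEANINGS.get(label, "unknown label")

print(explain_label("GPE"))   # Countries, cities, states
print(explain_label("NORP"))  # Nationalities or religious or political groups
```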
Note that it mistakenly classified "Londinium" as a person rather than a place. That's probably because there was nothing similar in the training data, so it made its best guess. Named entity detection often requires a bit of fine-tuning ( https://spacy.io/usage/training#section-ner ) if you are parsing text with specialized terminology.

Let's take the idea of entity detection and twist it around to build a data scrubber. Suppose you are trying to comply with the new GDPR privacy regulations ( https://medium.com/@ageitgey/understand-the-gdpr-in-10-minutes-407f4b54111f ) and you've discovered that thousands of documents you hold contain personally identifiable information such as people's names. Your task now is to remove all names from those documents.

Removing names from thousands of documents by hand could take years. With NLP, it's much simpler. Here's a data scrubber that removes all the names it detects:
```python
import spacy

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

# Replace a token with "[REDACTED]" if it is a name
def replace_name_with_placeholder(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    else:
        return token.string

# Loop through all the entities in a document and check whether they are names
def scrub(text):
    doc = nlp(text)
    for ent in doc.ents:
        ent.merge()
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)

s = """In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky's
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
"""

print(scrub(s))
```
If you run that, you'll see it works as expected:
```
In 1950, [REDACTED] published his famous article "Computing Machinery and Intelligence". In 1957, [REDACTED]
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
```
Extracting Facts
What you can do with spaCy right out of the box is pretty amazing. But you can also use the output of spaCy's parsing as the input to more complex data extraction algorithms. There's a Python library called textacy ( http://textacy.readthedocs.io/en/stable/ ) that implements several common data extraction algorithms on top of spaCy. It's a great starting point.

One of the algorithms it implements is called semi-structured statement extraction ( https://textacy.readthedocs.io/en/stable/api_reference.html#textacy.extract.semistructured_statements ). We can use it to search the parse tree for simple statements where the subject is "London" and the verb is a form of "be". That will help us find facts about London.

Here's what that looks like in code:
```python
import spacy
import textacy.extract

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the Romans,
who named it Londinium.
"""

# Parse the document with spaCy
doc = nlp(text)

# Extract semi-structured statements
statements = textacy.extract.semistructured_statements(doc, "London")

# Print the results
print("Here are the things I know about London:")

for statement in statements:
    subject, verb, fact = statement
    print(f"- {fact}")
```
And here is what it prints:
```
Here are the things I know about London:
- the capital and most populous city of England and the United Kingdom.
- a major settlement for two millennia.
```
Maybe that doesn't seem too impressive yet. But if you run the same code on the full text of the Wikipedia article about London instead of just these three sentences, you'll get much more impressive results:
```
Here are the things I know about London:
- the capital and most populous city of England and the United Kingdom
- a major settlement for two millennia
- the world's most populous city from around 1831 to 1925
- beyond all comparison the largest town in England
- still very compact
- the world's largest city from about 1831 to 1925
- the seat of the Government of the United Kingdom
- vulnerable to flooding
- "one of the World's Greenest Cities" with more than 40 percent green space or open water
- the most populous city and metropolitan area of the European Union and the second most populous in Europe
- the 19th largest city and the 18th largest metropolitan region in the world
- Christian, and has a large number of churches, particularly in the City of London
- also home to sizeable Muslim, Hindu, Sikh, and Jewish communities
- also home to 42 Hindu temples
- the world's most expensive office market for the last three years according to world property journal (2015) report
- one of the pre-eminent financial centres of the world as the most important location for international finance
- the world top city destination as ranked by TripAdvisor users
- a major international air transport hub with the busiest city airspace in the world
- the centre of the National Rail network, with 70 percent of rail journeys starting or ending in London
- a major global centre of higher education teaching and research and has the largest concentration of higher education institutes in Europe
- home to designers Vivienne Westwood, Galliano, Stella McCartney, Manolo Blahnik, and Jimmy Choo, among others
- the setting for many works of literature
- a major centre for television production, with studios including BBC Television Centre, The Fountain Studios and The London Studios
- also a centre for urban music
- the "greenest city" in Europe with 35,000 acres of public parks, woodlands and gardens
- not the capital of England, as England does not have its own government
```
Now things are getting interesting! That's a sizeable amount of information we've collected automatically.

For extra credit, try installing the neuralcoref library ( https://github.com/huggingface/neuralcoref ) and adding coreference resolution to your pipeline. That will get you a few more facts, since it will catch sentences that talk about "it" instead of mentioning "London" directly.
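To see why coreference resolution buys you extra facts, here is a toy illustration (plain string substitution, not the neuralcoref API): once the pronoun "It" in our third sentence is resolved back to "London", a search for statements about "London" suddenly matches that sentence too.

```python
# The third sentence of our London text never says "London", only "It",
# so a subject search for "London" misses it entirely.
sentence = "It was founded by the Romans, who named it Londinium."

# Pretend a coreference resolver told us that this "It" refers to "London".
# (neuralcoref computes such links for real; here we hard-wire one.)
resolved = sentence.replace("It ", "London ", 1)

# Now a subject search for "London" would match this sentence as well
print(resolved)
```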
What Else Can We Do?
Take a look through the spaCy documentation ( https://spacy.io/api/doc ) and the textacy documentation ( http://textacy.readthedocs.io/en/latest/ ) and you'll find lots of examples of ways to parse text. What we've seen so far is just a small sample.

Here's another practical example: imagine you are building a website that lets users view information about every city in the world, using the information we extracted in the previous example.

If the site had a search feature, it would be nice to autocomplete common search queries the way Google does:

Google's autocomplete suggestions for "London"

To do that, we need a list of possible completions to suggest to users, and we can use NLP to generate that data quickly.

Here's one way to extract frequently-mentioned noun chunks from a document:
```python
import spacy
import textacy.extract

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the Romans,
who named it Londinium.
"""

# Parse the document with spaCy
doc = nlp(text)

# Extract noun chunks that appear frequently
noun_chunks = textacy.extract.noun_chunks(doc, min_freq=3)

# Convert the noun chunks to lowercase strings
noun_chunks = map(str, noun_chunks)
noun_chunks = map(str.lower, noun_chunks)

# Print out any noun chunks that are at least 2 words long
for noun_chunk in set(noun_chunks):
    if len(noun_chunk.split(" ")) > 1:
        print(noun_chunk)
```
Run that on the full London Wikipedia article and you'll get output like this:

```
westminster abbey
natural history museum
west end
east end
st paul's cathedral
royal albert hall
london underground
great fire
british museum
london eye
.... etc ....
```
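With a list like that in hand, a first cut at the autocomplete feature can be as simple as prefix matching. This is just a sketch — the `autocomplete` function and the hard-coded suggestion list are for illustration; a real site would index the full extracted list:

```python
# A few of the noun chunks extracted above, used as our suggestion list
SUGGESTIONS = [
    "westminster abbey",
    "natural history museum",
    "west end",
    "east end",
    "st paul's cathedral",
    "royal albert hall",
    "london underground",
    "london eye",
]

def autocomplete(prefix, suggestions=SUGGESTIONS, limit=5):
    # Return up to `limit` suggestions that start with the user's input
    prefix = prefix.lower().strip()
    return [s for s in suggestions if s.startswith(prefix)][:limit]

print(autocomplete("london"))  # ['london underground', 'london eye']
print(autocomplete("west"))    # ['westminster abbey', 'west end']
```

For a production search box you'd want fuzzy matching and popularity ranking, but prefix matching over NLP-extracted phrases is enough to get a working demo.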
Source: https://juejin.im/post/5b6d08e2f265da0f9c67cf0b