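I. The Problem Tencent Cloud NLP Solves
Natural language processing (NLP) capability is an increasingly urgent need for enterprises. For example, an e-commerce site needs to infer product preferences from user comments, and a financial company needs to analyze public opinion about its products. An enterprise that builds NLP capabilities in-house has to hire specialized engineers, collect or purchase a large corpus, and go through a long development cycle, and the final results often still fall short of expectations.
Tencent Cloud NLP integrates Tencent's top in-house NLP technology and, backed by a Chinese corpus at the hundred-billion scale, offers 16 intelligent text processing capabilities, including lexical analysis. These capabilities work out of the box, with no servers to buy or operate, which saves enterprises a large amount of manpower and resources. This article combines the service with Tencent Cloud's cloud function service (SCF) and uses a simplified example to show how to quickly build a lexical analysis service on the Tencent Cloud ecosystem.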
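II. The Tencent Cloud NLP Lexical Analysis Interfaces
Tencent Cloud NLP offers two interfaces related to lexical analysis: similar words and intelligent lexical analysis. Using the lexical analysis interface, this article shows how an e-commerce site can run word segmentation, part-of-speech tagging, and named entity recognition on the user comments it collects, and thereby build a lexical analysis system.
The lexical analysis interface mainly provides the following functions (see the official interface documentation for the full specification):
- Word segmentation: splits a continuous sentence into a reasonable sequence of words
- Part-of-speech tagging: tags each word with its part of speech, which resolves lexical ambiguity and eases deeper semantic processing later
- Named entity recognition: identifies entities in a sentence, such as places, person names, and times, in preparation for identifying relations between entities
The business scenario for this lexical analysis system is as follows:
1. The site's business system continuously collects user comments, periodically writes them to a text file, and uploads the file to a COS bucket;
2. COS automatically triggers a Tencent Cloud cloud function. The lexical analysis cloud function calls the NLP lexical analysis interface and obtains the word segmentation, part-of-speech tagging, and named entity recognition results;
3. The lexical analysis cloud function sends the results to Kafka, where downstream services consume them and write them to MySQL, ES, or similar services for further processing.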
Figure: business scenario of the word segmentation system
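Steps 2 and 3 hinge on the message that travels through Kafka. As a rough illustration, the sketch below wraps one comment and its analysis result into a JSON message and shows how a downstream consumer might flatten it into rows before writing to MySQL or ES. The `Text` and `Results` keys mirror the message that the cloud function in section III sends. The function names `build_message` and `rows_from_message` and the row layout are assumptions made for this example and are not part of any Tencent Cloud SDK.

```python
import json


def build_message(text, analysis_json):
    """Wrap one comment and its analysis result into a JSON message (bytes)."""
    payload = {"Text": text.strip(), "Results": json.loads(analysis_json)}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def rows_from_message(message):
    """Flatten one message into (text, word, pos, begin_offset) rows.

    The row layout is hypothetical; a real consumer would map it to its
    own MySQL table or ES index.
    """
    payload = json.loads(message.decode("utf-8"))
    tokens = payload["Results"].get("PosTokens") or []  # tolerate null
    return [
        (payload["Text"], t["Word"], t["Pos"], t["BeginOffset"])
        for t in tokens
    ]


if __name__ == "__main__":
    # A one-token analysis result, shaped like the sample in section IV.
    analysis = (
        '{"NerTokens": null, "PosTokens": '
        '[{"Length": 2, "Word": "店家", "BeginOffset": 0, "Pos": "n"}]}'
    )
    for row in rows_from_message(build_message("店家发货\n", analysis)):
        print(row)
```

Serializing with `json.dumps` on the producer side means the consumer can always parse the message, even when a comment contains quotes or line breaks.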
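III. Implementation Steps
The core of this system is the lexical analysis cloud function. Kafka and the downstream ES and MySQL services are assumed to already exist.
1. Create the lexical analysis cloud function
The function does three things:
- Receives the COS trigger event and downloads the user comment text file it refers to
- Calls the NLP lexical analysis interface to process the text
- Sends the analysis results to Kafka
The code of the lexical analysis cloud function is as follows: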

    # -*- coding: utf8 -*-
    from tencentcloud.common import credential
    from tencentcloud.common.profile.client_profile import ClientProfile
    from tencentcloud.common.profile.http_profile import HttpProfile
    from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
    from tencentcloud.nlp.v20190408 import nlp_client, models
    from qcloud_cos import CosConfig
    from qcloud_cos import CosS3Client
    from qcloud_cos import CosClientError
    from qcloud_cos import CosServiceError
    from pykafka.exceptions import ConsumerStoppedException
    from pykafka.client import KafkaClient
    from pykafka.common import OffsetType
    import json
    import sys
    import logging

    SECRET_ID = "xxxxxxxxxxxxxxxxxxx"
    SECRET_KEY = "xxxxxxxxxxxx"
    APP_ID = "xxxxxxxxxxxx"
    REGION = "ap-guangzhou"
    ENDPOINT = "nlp.tencentcloudapi.com"

    # Kafka settings; fill in according to your environment
    TOPIC = "comment_seg"
    CONSUMER_GROUP = "comment_seg_1"
    KAFKA_ADDRESS = "192.168.0.4:9092,192.168.0.5:9092"
    ZK_ADDRESS = "192.168.0.4:2181,192.168.0.5:2181"

    kafka_client = KafkaClient(hosts=KAFKA_ADDRESS)
    topic = kafka_client.topics[TOPIC]


    def main_handler(event, context):
        try:
            # Create the NLP client
            cred = credential.Credential(SECRET_ID, SECRET_KEY)
            httpProfile = HttpProfile()
            httpProfile.endpoint = ENDPOINT
            clientProfile = ClientProfile()
            clientProfile.httpProfile = httpProfile
            client = nlp_client.NlpClient(cred, REGION, clientProfile)

            # Create the COS client
            config = CosConfig(Region=REGION, SecretId=SECRET_ID, SecretKey=SECRET_KEY, Token=None, Scheme="https")
            cos_client = CosS3Client(config)
            logger = logging.getLogger()

            # Download each uploaded file from COS into the /tmp directory
            for record in event['Records']:
                try:
                    bucket = record['cos']['cosBucket']['name'] + '-' + str(APP_ID)
                    key = record['cos']['cosObject']['key']
                    key = key.replace('/' + str(APP_ID) + '/' + record['cos']['cosBucket']['name'] + '/', '', 1)
                    logger.info("Key is " + key)
                    logger.info("Get from [%s] to download file [%s]" % (bucket, key))
                    download_path = '/tmp/{}'.format(key)
                    try:
                        response = cos_client.get_object(Bucket=bucket, Key=key)
                        response['Body'].get_stream_to_file(download_path)
                    except CosServiceError as e:
                        print(e.get_error_code())
                        print(e.get_error_msg())
                        print(e.get_resource_location())
                        return "Fail"
                    logger.info("Download file [%s] Success" % key)
                except Exception as e:
                    print(e)
                    print('Error getting object {} from bucket {}.'.format(key, bucket))
                    raise

                # Read the file, one comment per line
                with open(download_path) as f:
                    for line in f:
                        line = line.strip()  # drop the trailing newline
                        if not line:
                            continue
                        logger.info("Line:[%s]" % line)
                        # Call the lexical analysis interface. json.dumps escapes
                        # quotes and other special characters in the comment.
                        req = models.LexicalAnalysisRequest()
                        req.from_json_string(json.dumps({"Flag": 1, "Text": line}))
                        resp = client.LexicalAnalysis(req)
                        print(resp.to_json_string())
                        # Send the original text and the analysis result to Kafka
                        to_kafka(json.dumps({"Text": line, "Results": json.loads(resp.to_json_string())}))
        except Exception as ex:
            print("======================")
            print(ex)


    def to_kafka(msg):
        with topic.get_sync_producer() as producer:
            # pykafka expects the message as bytes
            producer.produce(msg.encode('utf-8'))

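One step of the handler that is easy to get wrong is turning the COS trigger event into a bucket name and an object key. The following sketch isolates that step as pure functions so it can be checked locally without any cloud SDK. It assumes, as the handler above does, that the event key has the form `/<APP_ID>/<bucket>/<object path>`. The APP_ID value `1250000000`, the object path, and the helper names `extract_bucket_and_key` and `local_path` are made up for illustration.

```python
import posixpath


def extract_bucket_and_key(record, app_id):
    """Return (full bucket name, object key) for one COS event record."""
    short_name = record["cos"]["cosBucket"]["name"]
    bucket = "{}-{}".format(short_name, app_id)
    raw_key = record["cos"]["cosObject"]["key"]
    # Strip the leading "/<app_id>/<bucket>/" part if it is present.
    prefix = "/{}/{}/".format(app_id, short_name)
    key = raw_key[len(prefix):] if raw_key.startswith(prefix) else raw_key
    return bucket, key


def local_path(key, tmp_dir="/tmp"):
    """Map an object key to a flat file path inside tmp_dir."""
    # Using only the base name avoids writing into a subdirectory of
    # tmp_dir that may not exist when the key contains "/".
    return posixpath.join(tmp_dir, posixpath.basename(key))


if __name__ == "__main__":
    # Hypothetical record, shaped after the fields the handler reads.
    record = {
        "cos": {
            "cosBucket": {"name": "user-comment"},
            "cosObject": {"key": "/1250000000/user-comment/comments/2020-05-01.txt"},
        }
    }
    bucket, key = extract_bucket_and_key(record, "1250000000")
    print(bucket, key, local_path(key))
```

Unlike the `replace` call in the handler, this version only strips the prefix when the key actually starts with it, which is a slightly stricter design choice.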
The deployment configuration file for the cloud function is as follows:

    Resources:
      default:
        Type: TencentCloud::Serverless::Namespace
        lexical-demo:
          Type: TencentCloud::Serverless::Function
          Properties:
            CodeUri: ./
            Type: Event
            Description: This is a template function
            Role: SCF_QcsRole
            Environment:
              Variables:
                ENV_FIRST: env1
                ENV_SECOND: env2
            Handler: index.main_handler
            MemorySize: 128
            Runtime: Python2.7
            Timeout: 3
            #VpcConfig:
            #    VpcId: 'vpc-qdqc5k2p'
            #    SubnetId: 'subnet-pad6l61i'
            #Events:
            #    timer:
            #        Type: Timer
            #        Properties:
            #            CronExpression: '*/5 * * * *'
            #            Enable: True
            #    cli-appid.cos.ap-beijing.myqcloud.com: # full bucket name
            #        Type: COS
            #        Properties:
            #            Bucket: cli-appid.cos.ap-beijing.myqcloud.com
            #            Filter:
            #                Prefix: filterdir/
            #                Suffix: .jpg
            #            Events: cos:ObjectCreated:*
            #            Enable: True
            #    topic: # topic name
            #        Type: CMQ
            #        Properties:
            #            Name: qname
            #    hello_world_apigw: # ${FunctionName} + '_apigw'
            #        Type: APIGW
            #        Properties:
            #            StageName: release
            #            ServiceId:
            #            HttpMethod: ANY

    Globals:
      Function:
        Timeout: 10

Deploy from the local machine with the SCF CLI:
scf deploy -f --cos-bucket temp-code-1300312696
The function is deployed successfully:
Figure: deployment of the lexical analysis function
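2. Configure the trigger for the lexical analysis cloud function
On the function's "Trigger Management" page, configure the bucket that stores the user comment text files and the event type, then click Submit.
IV. Results
Each file uploaded to the COS bucket holds one Chinese-language comment per line, for example: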
店家发货送了双白色袜子, 穿起来好舒服
鞋已收到试穿了下, 还挺合适, 明天去球场上验证下战靴, 看下实战怎么样
有点硬邦邦的, 第一次买球鞋, 感觉还不错
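When a file is uploaded to the user-comment bucket, the lexical analysis cloud function is triggered automatically, and the invocation records can be viewed through the cloud function's log query feature. A sample lexical analysis result looks like this: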

    {
        "NerTokens": null,
        "PosTokens": [
            {"Length": 2, "Word": "店家", "BeginOffset": 0, "Pos": "n"},
            {"Length": 2, "Word": "发货", "BeginOffset": 2, "Pos": "v"},
            {"Length": 1, "Word": "送", "BeginOffset": 4, "Pos": "v"},
            {"Length": 1, "Word": "了", "BeginOffset": 5, "Pos": "u"},
            {"Length": 1, "Word": "双", "BeginOffset": 6, "Pos": "m"},
            {"Length": 2, "Word": "白色", "BeginOffset": 7, "Pos": "n"},
            {"Length": 2, "Word": "袜子", "BeginOffset": 9, "Pos": "n"},
            {"Length": 1, "Word": ",", "BeginOffset": 11, "Pos": "w"},
            {"Length": 1, "Word": "穿", "BeginOffset": 12, "Pos": "v"},
            {"Length": 2, "Word": "起来", "BeginOffset": 13, "Pos": "v"},
            {"Length": 1, "Word": "好", "BeginOffset": 15, "Pos": "a"},
            {"Length": 3, "Word": "好舒服", "BeginOffset": 15, "Pos": "a"},
            {"Length": 2, "Word": "舒服", "BeginOffset": 16, "Pos": "a"}
        ],
        "RequestId": "5597cfb6-64f5-42d0-8727-866c400d9778"
    }

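Before a downstream service can use this result, the token list usually needs a little post-processing. Note that the sample contains overlapping tokens: 好, 好舒服, and 舒服 all fall within offsets 15 to 18. The sketch below shows one possible approach: it keeps the longest token of each overlapping group and then filters by part-of-speech tag. The field names come from the sample result above, while the helper names and the keep-the-longest rule are assumptions chosen for this example.

```python
import json

# A shortened copy of the sample result (only its last five tokens).
SAMPLE = json.loads("""
{"NerTokens": null,
 "PosTokens": [
  {"Length": 1, "Word": "穿", "BeginOffset": 12, "Pos": "v"},
  {"Length": 2, "Word": "起来", "BeginOffset": 13, "Pos": "v"},
  {"Length": 1, "Word": "好", "BeginOffset": 15, "Pos": "a"},
  {"Length": 3, "Word": "好舒服", "BeginOffset": 15, "Pos": "a"},
  {"Length": 2, "Word": "舒服", "BeginOffset": 16, "Pos": "a"}
 ]}
""")


def span(token):
    """Return the half-open character range [begin, end) a token covers."""
    begin = token["BeginOffset"]
    return begin, begin + token["Length"]


def drop_nested(tokens):
    """Keep only tokens that are not contained in a strictly longer token."""
    spans = [span(t) for t in tokens]
    kept = []
    for tok, (b, e) in zip(tokens, spans):
        nested = any(
            ob <= b and e <= oe and (oe - ob) > (e - b)
            for ob, oe in spans
        )
        if not nested:
            kept.append(tok)
    return kept


def words_with_pos(result, wanted=None):
    """Return (word, pos) pairs, optionally limited to the tags in `wanted`."""
    tokens = drop_nested(result.get("PosTokens") or [])
    return [
        (t["Word"], t["Pos"])
        for t in tokens
        if wanted is None or t["Pos"] in wanted
    ]


if __name__ == "__main__":
    print(words_with_pos(SAMPLE))                # all non-nested tokens
    print(words_with_pos(SAMPLE, wanted={"a"}))  # adjectives only
```

For an e-commerce use case, restricting `wanted` to adjective tags is one simple way to surface the words customers use to describe a product.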
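V. Summary
This article showed how to quickly build a lexical analysis system on the Tencent Cloud ecosystem. An enterprise can stand up such a system in a short time without hiring NLP specialists. Combined with other capabilities of the NLP service, such as text classification and sentiment analysis, it can also build richer semantic analysis capabilities and help the enterprise make the leap from data to business insight.
Source: https://www.qcloud.com/developer/article/1622663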