A Detailed Guide to Full-Text Search Engines in Python

This article introduces full-text search in Python and may serve as a reference for readers who need it.

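I have recently been exploring how to implement Baidu-style keyword search in Python. Keyword search naturally brings regular expressions to mind. Regular expressions underpin a great deal of text matching, and Python's re module is dedicated to them. On their own, however, regular expressions do not make a good search feature: every query has to scan all of the text, and the matches come back unranked.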
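To make the limitation concrete, here is a minimal sketch of regex-only search over a few in-memory documents. The document names and contents are made up for illustration, and regex_search is a hypothetical helper, not part of any library.

```python
import re

# Hypothetical in-memory documents, made up for this illustration.
documents = {
    "a.txt": "Python is a programming language.",
    "b.txt": "Whoosh is a search library written in Python.",
    "c.txt": "Regular expressions match patterns in text.",
}


def regex_search(keyword, docs):
    """Return the sorted names of documents containing keyword (case-insensitive)."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    # Every document is scanned in full on every query: no index, no scoring.
    return sorted(name for name, text in docs.items() if pattern.search(text))


matches = regex_search("python", documents)
print(matches)
```

The result is only a yes/no answer per document. Nothing says which document is the better match, and the cost grows with the total amount of text.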

Python has a package called whoosh that is built specifically for full-text search.
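Full-text engines generally answer queries from an inverted index rather than by scanning raw text. The following is a toy pure-Python sketch of that idea. It is not whoosh's actual implementation, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "Whoosh is a pure Python library.",
    2: "Sphinx is a search server.",
    3: "Python has a search library.",
}
index = build_inverted_index(docs)
print(sorted(lookup(index, "python library")))
```

Splitting on whitespace works for this English sample but not for Chinese text, which has no spaces between words. That is why the Chinese tokenizer later in this article is needed.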

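whoosh is not widely used in China, and it is not yet as mature in performance as sphinx/coreseek. Unlike those, however, it is a pure Python library, which makes it more convenient for Python users. The code is walked through below.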

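Installation

Run the following on the command line: pip install whoosh

The Chinese tokenizer below also uses jieba, which can be installed the same way: pip install jieba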

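The following imports are needed:

import os
import jieba
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import RegexAnalyzer
from whoosh.analysis import Tokenizer, Token
from whoosh.compat import text_type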

Chinese word-segmentation tokenizer

class ChineseTokenizer(Tokenizer):
    """Chinese word-segmentation tokenizer backed by jieba."""

    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=True, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        assert isinstance(value, text_type), "%r is not unicode" % value
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        # cut_for_search yields search-engine-style segments, which may overlap
        list_seg = jieba.cut_for_search(value)
        for w in list_seg:
            t.original = t.text = w
            t.boost = 0.5
            # value.find(w) returns the offset of the first occurrence of w
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t


def chinese_analyzer():
    return ChineseTokenizer()
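One caveat about the tokenizer above: value.find(w) always returns the offset of the first occurrence, so a word that appears twice gets the same position both times. The sketch below shows the difference using a hand-written, non-overlapping segmentation (a stand-in for a segmenter's output; jieba is not called) and a running cursor. Both helper names are hypothetical. With overlapping search-mode segments a cursor alone is not enough; jieba's tokenize function, which reports start and end offsets for each word, may be a better fit there.

```python
def offsets_with_find(text, words):
    """Offsets as the tokenizer above computes them: first occurrence only."""
    return [(w, text.find(w)) for w in words]


def offsets_with_cursor(text, words):
    """Offsets for consecutive, non-overlapping words, tracked with a cursor."""
    result = []
    cursor = 0
    for w in words:
        start = text.find(w, cursor)  # search only from the cursor onward
        if start == -1:
            raise ValueError("word %r not found after offset %d" % (w, cursor))
        result.append((w, start))
        cursor = start + len(w)
    return result


text = "搜索引擎和搜索"
words = ["搜索", "引擎", "和", "搜索"]  # hand-written segmentation, for illustration
print(offsets_with_find(text, words))
print(offsets_with_cursor(text, words))
```

The second occurrence of 搜索 is reported at offset 0 by the find-based version and at offset 5 by the cursor-based version. Wrong offsets mainly affect phrase queries and highlighting, both of which rely on token positions.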

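Index-building function

# Defined as a static method of a class in the original (the class is not shown).
@staticmethod
def create_index(document_dir):
    analyzer = chinese_analyzer()
    # The field name "titel" is kept as spelled in the original;
    # the search code must read the same name.
    schema = Schema(titel=TEXT(stored=True, analyzer=analyzer),
                    path=ID(stored=True),
                    content=TEXT(stored=True, analyzer=analyzer))
    ix = create_in("./", schema)  # the index files go in the current directory
    writer = ix.writer()
    for parents, dirnames, filenames in os.walk(document_dir):
        for filename in filenames:
            # Python 3: file names and contents are already str, so no .decode()
            title = filename.replace(".txt", "")
            print(title)
            # join with the directory reported by os.walk so subdirectories work
            with open(os.path.join(parents, filename), 'r', encoding='utf-8') as f:
                content = f.read()
            path = u"/b"  # fixed placeholder path, as in the original
            writer.add_document(titel=title, path=path, content=content)
    writer.commit()  # commit once, after all documents have been added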
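The file-collection step can be separated from whoosh and checked on its own. This is a Python 3 sketch under stated assumptions: the helper name iter_text_documents is hypothetical, files are assumed to be UTF-8 encoded .txt files, and each path is built from the directory that os.walk reports, so files in subdirectories are found too.

```python
import os
import tempfile


def iter_text_documents(document_dir):
    """Yield (title, path, content) for every .txt file under document_dir."""
    for parent, _dirnames, filenames in os.walk(document_dir):
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue
            path = os.path.join(parent, filename)
            with open(path, "r", encoding="utf-8") as handle:
                content = handle.read()
            title = filename[:-len(".txt")]
            yield title, path, content


# Usage on a throwaway directory containing one file in a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "search.txt"), "w", encoding="utf-8") as f:
        f.write("whoosh 是一个纯 Python 库")
    docs = list(iter_text_documents(tmp))
    print([(title, content) for title, _path, content in docs])
```

Each yielded tuple could then be passed to writer.add_document, which keeps file handling and indexing separately testable.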

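Search function

# Defined as a static method of a class in the original (the class is not shown).
@staticmethod
def search(search_str):
    title_list = []
    print('here')
    ix = open_dir("./")  # open_dir is imported from whoosh.index
    # the with block closes the searcher when the search is done
    with ix.searcher() as searcher:
        print(search_str, type(search_str))
        # parse search_str against the "content" field and run the query
        results = searcher.find("content", search_str)
        for hit in results:
            print(hit['titel'])  # field name as spelled in the schema
            print(hit.score)
            print(hit.highlights("content", top=10))
            title_list.append(hit['titel'])
    print('tt', title_list)
    return title_list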
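The highlights call in the search function returns excerpts of the stored content with the matched terms marked up. The toy function below illustrates that idea in plain Python. It is not whoosh's algorithm or output format: toy_highlight is a hypothetical name, it handles only the first match of a single term, and the tag it uses is an arbitrary choice.

```python
import re


def toy_highlight(text, term, context=10):
    """Return a short excerpt around the first match, with the term in <b> tags."""
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return ""
    start = max(match.start() - context, 0)
    end = min(match.end() + context, len(text))
    return (text[start:match.start()]
            + "<b>" + match.group(0) + "</b>"
            + text[match.end():end])


excerpt = toy_highlight("Whoosh is a pure Python search library.", "python", context=5)
print(excerpt)
```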

Source: http://www.phperz.com/article/17/0617/334875.html
