Developing a Plugin via Nutch Extension Points (Adding Custom Index Fields to Solr)

Preparation

Crawler environment: nutch2.3.1 + solr4.10.3 + hbase0.98
Development environment: Eclipse Mars.2 Release (4.5.2)
Required jars: apache-nutch-2.3.jar, hadoop-common-2.6.0.jar, slf4j-api-1.7.9.jar

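What Is a Nutch Extension Point

A good crawler should be both highly extensible and highly scalable. Nutch loads its plugins dynamically (extensibility) and can run distributed crawls on a Hadoop cluster (scalability). Users can therefore build a crawler tailored to their own needs without worrying much that a sharp rise in workload will hurt crawl performance.
Nutch exposes extension interfaces (Parser, ParseFilter, and others), and plugins are developed by implementing these interfaces.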
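
To make the extension-point idea concrete, here is a minimal, self-contained sketch of the pattern: a host defines an interface, plugins implement it, and the host calls every registered implementation in turn. The types below (DocFilter, FilterChain) are hypothetical stand-ins invented for illustration; they are not Nutch classes and do not reflect how Nutch discovers plugins.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types for illustration only; these are NOT Nutch classes.
final class ExtensionPointSketch {

    /** Hypothetical extension point: the host calls every registered implementation. */
    interface DocFilter {
        Map<String, String> filter(Map<String, String> doc, String url);
    }

    /** Hypothetical host: holds the plugins and runs them in registration order. */
    static final class FilterChain {
        private final List<DocFilter> filters = new ArrayList<>();

        void register(DocFilter f) {
            filters.add(f);
        }

        Map<String, String> run(String url) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("url", url);
            for (DocFilter f : filters) {
                doc = f.filter(doc, url);
                if (doc == null) {
                    return null; // in this sketch, a filter may drop the document
                }
            }
            return doc;
        }
    }

    /** Builds a chain with two example "plugins", each adding one field. */
    static FilterChain demoChain() {
        FilterChain chain = new FilterChain();
        chain.register((doc, url) -> {
            doc.put("host", URI.create(url).getHost());
            return doc;
        });
        chain.register((doc, url) -> {
            doc.put("crawlId", "demo-crawl"); // hypothetical value
            return doc;
        });
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(demoChain().run("http://example.com/a"));
    }
}
```

The host code never changes when a new filter is registered, which is the property that makes a plugin-based design extensible.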

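Why This Plugin

When the stock Nutch plugins crawl web data and push it to Solr for indexing, the Solr index contains only the few fields configured by default. To add extra fields, we need to develop an indexing plugin against Nutch's extension interfaces.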

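How to Develop a Plugin with Nutch Extension Points

1. Which extension point should be implemented?
      This article adds new index fields to Solr, so the relevant extension points are those of the indexing stage: IndexWriter and IndexingFilter. Nutch already ships the indexer-solr plugin for writing the Solr index. We could reimplement indexer-solr and replace the original, or use an indexing filter to add the required fields to the index document object (NutchDocument). This article uses IndexingFilter.
2. How is the plugin developed?
      2.1 Create a new Java project in Eclipse;
      2.2 Implement the IndexingFilter interface;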

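package org.apache.nutch.indexer.extra;

import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.storage.WebPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Indexing filter that adds the crawl ID and the fetch time to each document. */
public class ExtraIndexer implements IndexingFilter {
    private static final Logger LOGGER = LoggerFactory.getLogger(ExtraIndexer.class);
    private Configuration conf;
    private String crawlId;

    /**
     * NutchDocument is the document object handed to the index;
     * WebPage is the crawler's persistent storage record.
     */
    @Override
    public NutchDocument filter(NutchDocument doc, String url, WebPage page) throws IndexingException {
        // just in case
        if (doc == null) return doc;
        addCrawlId(doc, url, page);
        addFetchTime(doc, url, page);
        return doc;
    }

    private NutchDocument addFetchTime(NutchDocument doc, String url, WebPage page) {
        // String.valueOf avoids a NullPointerException if the fetch time is unset
        String fetchTime = String.valueOf(page.getFetchTime());
        LOGGER.info(">>>>>>>>>>add fetchtime: " + fetchTime);
        doc.add("fetchTime", fetchTime);
        return doc;
    }

    private NutchDocument addCrawlId(NutchDocument doc, String url, WebPage page) {
        doc.add("crawlId", this.crawlId);
        LOGGER.info(">>>>>>>>>>add crawlId: " + this.crawlId);
        return doc;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        this.crawlId = conf.get(Nutch.CRAWL_ID_KEY);
        LOGGER.info(">>>>>>>>>>crawlID for indexing set to: " + this.crawlId);
    }

    @Override
    public Configuration getConf() {
        return this.conf;
    }

    /** Declares the WebPage fields this filter reads, so they are loaded from storage. */
    @Override
    public Collection<WebPage.Field> getFields() {
        return Collections.singleton(WebPage.Field.FETCH_TIME);
    }
}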
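
The filter above stores the fetch time as a plain string. Assuming the value is epoch milliseconds (an assumption about the stored value, not something this article verifies), the sketch below shows one way to turn it into an ISO-8601 UTC string, which would be the natural form if you later chose a date field type in Solr instead of string. The class and method names are hypothetical.

```java
import java.time.Instant;

// Hypothetical helper; not part of the plugin shown in this article.
final class FetchTimeFormat {

    /** Converts epoch milliseconds to an ISO-8601 UTC timestamp string. */
    static String toIso(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        System.out.println(toIso(0L));
        System.out.println(toIso(1513209600000L));
    }
}
```

With type="string" in schema.xml, as used later in this article, no conversion is needed; this only matters if you want date sorting or range queries on the field.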

2.3 Write the configuration files;

plugin.xml: describes the plugin so that Nutch can recognize it.

<plugin
   id="index-extra"
   name="Extra Indexing Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="index-extra.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.indexer.extra"
              name="Nutch Extra Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="ExtraIndexer"
                      class="org.apache.nutch.indexer.extra.ExtraIndexer"/>
   </extension>
</plugin>

build.xml: gives ant the information it needs to compile the plugin.

<project name="index-extra" default="jar-core">
   <import file="../build-plugin.xml"/>
</project>

ivy.xml: declares the plugin's dependencies so that ivy can manage them.

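<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../../ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <artifact conf="master"/>
  </publications>
</ivy-module>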

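2.4 Build the plugin inside the Nutch installation directory;

a. Copy the index-extra source into {NUTCH-HOME}/src/plugin
b. Edit {NUTCH-HOME}/src/plugin/build.xml (typically by adding an <ant dir="index-extra" target="deploy"/> entry to the deploy target, mirroring the entries for the bundled plugins)
c. Edit {NUTCH-HOME}/build.xml
d. Edit the relevant configuration files in {NUTCH-HOME}/conf (described in step 3)
e. In the {NUTCH-HOME} directory, run: ant runtime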

3. How is the finished plugin used?

3.1 Edit the nutch-site configuration file;

<property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|index-(extra|basic|anchor|more|metadata)|indexer-solr|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|text|metatags)</value>
</property>
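
The plugin.includes value is a regular expression over plugin ids, so a typo in it silently leaves the plugin out. The sketch below is a quick offline check that a given id is covered by the expression. It assumes the id must match the whole expression rather than a substring; the class and method names are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical offline check; not part of Nutch.
final class PluginIncludesCheck {

    // The plugin.includes value used in this article.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|index-(extra|basic|anchor|more|metadata)"
        + "|indexer-solr|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
        + "|parse-(html|tika|text|metatags)";

    /** Returns true when the whole plugin id matches the includes expression. */
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"index-extra", "indexer-solr", "index-extras", "parse-js"}) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Note that the id checked here, index-extra, is the id attribute declared in plugin.xml, not the Java class name.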

3.2 Edit the schema.xml configuration file;

<field name="crawlId" type="string" stored="true" indexed="false" />
<field name="fetchTime" type="string" stored="true" indexed="false" />

3.3 Edit the solrindex-mapping.xml configuration file (the field mapping between Nutch and Solr);

<field dest="fetchTime" source="fetchTime" />
<field dest="crawlId" source="crawlId" />

3.4 Update the Solr configuration;

Copy schema.xml from {NUTCH-HOME}/runtime/local/conf into the Solr instance's conf directory, {SOLR-HOME}/collection1/conf/, then restart the Solr server.

4. Run the crawl commands to verify the result

All-in-one command: nohup bin/crawl urls/ craw-name http://××××:8080/solr/ 3
Index-only command: nohup bin/nutch solrindex http://××××:8080/solr/ -all -crawlId craw-name &
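
Once indexing has finished, one way to confirm that the new fields arrived is to ask Solr's select handler to return them. The sketch below only builds such a query URL; it sends nothing. The host localhost, the core name collection1, and the url field are assumptions for illustration, so substitute your own values. Because both new fields are declared indexed="false", they can be returned via fl but not searched on.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a Solr select URL; it performs no network I/O.
final class SolrCheckUrl {

    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds a query that returns the first 5 documents with the custom fields. */
    static String selectUrl(String solrBase, String core) {
        String base = solrBase.endsWith("/") ? solrBase : solrBase + "/";
        return base + core + "/select"
            + "?q=" + enc("*:*")
            + "&fl=" + enc("url,crawlId,fetchTime")
            + "&rows=5&wt=json";
    }

    public static void main(String[] args) {
        // Assumed host and core; replace with your own Solr address.
        System.out.println(selectUrl("http://localhost:8080/solr/", "collection1"));
    }
}
```

Opening the printed URL in a browser should show documents carrying crawlId and fetchTime if the plugin, the mapping, and the schema are all in place.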

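Summary

This article is only a starting point. It shows how a plugin at Nutch's indexing stage adds custom index fields to the NutchDocument index object, so that the subsequent Solr write step (indexer-solr) submits those fields to Solr.
Note: my experience is limited, so corrections and criticism are welcome.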

Source: http://www.cnblogs.com/chanfee/p/8033702.html
