Adding a custom field to the search criteria when searching in Nutch

    Technology  2022-05-19

    1. Root cause analysis

    Cause of the error org.apache.nutch.searcher.QueryException: Not a known field name:publishUrl

    Analysis:

    In the main() method of NutchBean:

    final NutchBean bean = new NutchBean(conf);

    This creates a NutchBean instance named bean. Its constructor uses LuceneSearchBean as the implementation of searchBean:

    searchBean = new LuceneSearchBean(conf, indexDir, indexesDir);

    The LuceneSearchBean constructor calls its own init() method to initialize itself. Inside init(), searcher is instantiated as an IndexSearcher:

    this.searcher = new IndexSearcher(indexDir, this.conf);

    The IndexSearcher constructor in turn calls one of its own methods, which instantiates queryFilters:

    this.queryFilters = new QueryFilters(conf);

    QueryFilters has three fields:

      private QueryFilter[] queryFilters; // the loaded filters

      private HashSet<String> FIELD_NAMES;     // all field names

      private HashSet<String> RAW_FIELD_NAMES; // raw field names

    By default, four plugins are loaded: LanguageQueryFilter, DefaultQueryFilter, UrlQueryFilter and SiteQueryFilter.

    FIELD_NAMES contains [site, , lang, DEFAULT, url]

    RAW_FIELD_NAMES contains [site, , lang]

    The QueryFilters constructor stores FIELD_NAMES and RAW_FIELD_NAMES in the ObjectCache:

    FIELD_NAMES.addAll(fieldNames);

    FIELD_NAMES.addAll(rawFieldNames);

    objectCache.setObject("FIELD_NAMES", FIELD_NAMES);

    RAW_FIELD_NAMES.addAll(rawFieldNames);

    objectCache.setObject("RAW_FIELD_NAMES", RAW_FIELD_NAMES);
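The bookkeeping above can be modelled without Nutch. The following is a minimal sketch, not the Nutch implementation: the class name FieldNamesSketch and the helper fieldNames are made up for illustration, and the split into ordinary names (DEFAULT, url) and raw names (site, lang) is inferred from the two value lists shown earlier, leaving out the blank entry. It shows why raw field names end up in both sets and why publishUrl is in neither.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Standalone model of the bookkeeping only; it does not use any Nutch classes.
class FieldNamesSketch {

    // FIELD_NAMES is the union of the ordinary and the raw field names.
    static Set<String> fieldNames(List<String> fields, List<String> rawFields) {
        Set<String> all = new LinkedHashSet<>(fields);
        all.addAll(rawFields);
        return all;
    }

    public static void main(String[] args) {
        // Assumed split of the listed values into ordinary and raw names.
        List<String> fields = Arrays.asList("DEFAULT", "url");
        List<String> rawFields = Arrays.asList("site", "lang");

        Set<String> fieldNames = fieldNames(fields, rawFields);
        Set<String> rawFieldNames = new LinkedHashSet<>(rawFields);

        System.out.println("FIELD_NAMES     = " + fieldNames);
        System.out.println("RAW_FIELD_NAMES = " + rawFieldNames);
        // No loaded plugin declares publishUrl, so it is missing from both sets.
        System.out.println("publishUrl known? " + fieldNames.contains("publishUrl"));
    }
}
```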

     

    When execution reaches final Hits hits = bean.search(query); the call chain is: searchBean.search(query), a LuceneSearchBean method -> Hits search(Query query) in IndexSearcher, which calls this.queryFilters.filter(query) -> BooleanQuery filter(Query input) in QueryFilters.

    The value of input is: [处理, output, -url:www, -publishUrl:qqq]

      public BooleanQuery filter(Query input) throws QueryException {

        // first check that all field names are claimed by some plugin

        Clause[] clauses = input.getClauses();

        for (int i = 0; i < clauses.length; i++) {

          Clause c = clauses[i];

          if (!isField(c.getField())) // the custom field publishUrl is not in FIELD_NAMES, so the exception is thrown here

            throw new QueryException("Not a known field name:"+c.getField());

        }

     

        // then run each plugin

        BooleanQuery output = new BooleanQuery();

        for (int i = 0; i < this.queryFilters.length; i++) {

          output = this.queryFilters[i].filter(input, output);

        }

        return output;

      }

     

      public boolean isField(String name) {

        return FIELD_NAMES.contains(name); // check whether the field is in FIELD_NAMES

      }
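To reproduce the failure in isolation, here is a small self-contained sketch of the first loop of filter(). It is only a model: Clause and QueryException are simplified stand-ins written for this example, not the Nutch classes, and it assumes that plain terms carry the field name DEFAULT. The input mirrors the example query above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone model of the field-name check; Clause and QueryException are
// simplified stand-ins, not the Nutch classes of the same name.
class UnknownFieldSketch {

    static class QueryException extends Exception {
        QueryException(String message) { super(message); }
    }

    // A clause pairs a field with a term (assumption: plain terms use DEFAULT).
    static class Clause {
        final String field;
        final String term;
        Clause(String field, String term) { this.field = field; this.term = term; }
    }

    // Mirrors the first loop of filter(): reject any field no plugin has claimed.
    static void checkFields(List<Clause> clauses, Set<String> fieldNames) throws QueryException {
        for (Clause c : clauses) {
            if (!fieldNames.contains(c.field)) {
                throw new QueryException("Not a known field name:" + c.field);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> fieldNames = new HashSet<>(Arrays.asList("site", "lang", "DEFAULT", "url"));
        List<Clause> input = Arrays.asList(
            new Clause("DEFAULT", "\u5904\u7406"), // first term of the example query, as Unicode escapes
            new Clause("DEFAULT", "output"),
            new Clause("url", "www"),
            new Clause("publishUrl", "qqq"));
        try {
            checkFields(input, fieldNames);
            System.out.println("all fields are known");
        } catch (QueryException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The first three clauses pass because DEFAULT and url are in the set; the check stops at the publishUrl clause.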

    2. Solution

    Nutch already ships a plugin, CustomFieldQueryFilter, that adds custom field names to QueryFilters.

    First, edit the customs-field.xml file as follows:

      <entry key="publishUrl.name">publishUrl</entry>

      <entry key="publishUrl.indexed">yes</entry>

      <entry key="publishUrl.stored">yes</entry>

      <entry key="publishUrl.tokenized">yes</entry>

      <entry key="publishUrl.boost">1.0</entry>

      <entry key="publishUrl.multi">false</entry>

     

      <entry key="publishTitle.name">publishTitle</entry>

      <entry key="publishTitle.indexed">yes</entry>

      <entry key="publishTitle.stored">yes</entry>

      <entry key="publishTitle.tokenized">yes</entry>

      <entry key="publishTitle.boost">1.0</entry>

      <entry key="publishTitle.multi">false</entry>

    Next, edit plugin/query-custom/plugin.xml.

    In that file, find <parameter name="fields" value="publishUrl,publishTitle" /> and change value to your own field names.

    Finally, add query-custom to plugin.includes in nutch-default.xml.
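The effect of these steps on the field-name check can be illustrated with a short sketch. This is a model under stated assumptions rather than what the plugin actually does internally: parseFields is a hypothetical helper that splits the comma-separated fields value, and the starting set is the FIELD_NAMES content listed in the analysis, without the blank entry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standalone model; parseFields is a hypothetical helper, not a Nutch API.
class CustomFieldsSketch {

    // Split a comma-separated "fields" value into trimmed, non-empty names.
    static Set<String> parseFields(String value) {
        Set<String> names = new HashSet<>();
        for (String name : value.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Set<String> fieldNames = new HashSet<>(Arrays.asList("site", "lang", "DEFAULT", "url"));
        System.out.println("known before: " + fieldNames.contains("publishUrl"));

        // Value taken from the plugin.xml parameter in the steps above.
        fieldNames.addAll(parseFields("publishUrl,publishTitle"));
        System.out.println("known after:  " + fieldNames.contains("publishUrl"));
    }
}
```

Once the custom names are part of the set, a clause such as -publishUrl:qqq no longer trips the "Not a known field name" check.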

