Learning Weka: Filter (2) - StringToWordVector

Tech 2022-05-20

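To show more concretely how a Filter is used and how it works, we analyze a Filter named StringToWordVector. It is one of the classes we use most often in text mining. It converts string attributes into a set of word attributes, and the kind of value each word attribute holds is chosen through parameters: a 0/1 variable (whether the word occurs in the instance), the term frequency, log(1 + term frequency), or the TF-IDF value.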

Below is the source of StringToWordVector's input method:

    /**
     * Input an instance for filtering. Filter requires all
     * training instances be read before producing output.
     *
     * @param instance the input instance.
     * @return true if the filtered instance may now be
     * collected with output().
     * @throws IllegalStateException if no input structure has been defined.
     */
    public boolean input(Instance instance) throws Exception {
      if (getInputFormat() == null) {
        throw new IllegalStateException("No input instance format defined");
      }
      if (m_NewBatch) {
        resetQueue();
        m_NewBatch = false;
      }
      if (isFirstBatchDone()) {
        FastVector fv = new FastVector();
        int firstCopy = convertInstancewoDocNorm(instance, fv);
        Instance inst = (Instance)fv.elementAt(0);
        if (m_filterType != FILTER_NONE) {
          normalizeInstance(inst, firstCopy);
        }
        push(inst);
        return true;
      } else {
        bufferInput(instance);
        return false;
      }
    }

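This method supports feeding instances incrementally. For the first batch it only calls bufferInput, which adds the instance to the dataset held by inputFormat, and then returns false, so nothing can be collected yet. Once every instance of the first batch has been added, processing continues in batchFinished: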
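The two phases of input can be imitated without Weka. The following is only a minimal sketch: BatchFilterSketch and its members are hypothetical names, instances are plain strings, and convert is a placeholder for the real word-vector conversion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical stand-in for a batch filter; not a Weka class. */
final class BatchFilterSketch {
    private final List<String> buffer = new ArrayList<>();   // role of the dataset filled by bufferInput()
    private final Queue<String> output = new ArrayDeque<>(); // role of the output queue
    private boolean newBatch = true;
    private boolean firstBatchDone = false;

    /** Returns true only when the converted instance can be collected right away. */
    boolean input(String instance) {
        if (newBatch) {           // a new batch starts: drop stale output
            output.clear();
            newBatch = false;
        }
        if (firstBatchDone) {     // first batch already processed: convert immediately
            output.add(convert(instance));
            return true;
        }
        buffer.add(instance);     // first batch: only remember the instance
        return false;
    }

    /** Ends the current batch; converts the buffered first batch if needed. */
    boolean batchFinished() {
        if (!firstBatchDone) {
            for (String s : buffer) {
                output.add(convert(s));
            }
        }
        buffer.clear();
        newBatch = true;
        firstBatchDone = true;
        return !output.isEmpty();
    }

    /** Returns the next converted instance, or null when none is pending. */
    String output() {
        return output.poll();
    }

    // Placeholder conversion; the real filter builds a word vector here.
    private static String convert(String s) {
        return s.toLowerCase();
    }

    public static void main(String[] args) {
        BatchFilterSketch f = new BatchFilterSketch();
        System.out.println(f.input("Hello World"));  // false: buffered
        System.out.println(f.batchFinished());       // true: output pending
        System.out.println(f.output());              // hello world
        System.out.println(f.input("Second Batch")); // true: converted at once
    }
}
```

The point of the design is that everything learned from the first batch (in the real filter, the dictionary) is fixed before any later instance is converted, so training and test data end up with the same attributes.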

    /**
     * Signify that this batch of input to the filter is finished.
     * If the filter requires all instances prior to filtering,
     * output() may now be called to retrieve the filtered instances.
     *
     * @return true if there are instances pending output.
     * @throws IllegalStateException if no input structure has been defined.
     */
    public boolean batchFinished() throws Exception {
      if (getInputFormat() == null) {
        throw new IllegalStateException("No input instance format defined");
      }

      // We only need to do something in this method
      // if the first batch hasn't been processed. Otherwise
      // input() has already done all the work.
      if (!isFirstBatchDone()) {
        // Determine the dictionary from the first batch (training data)
        determineDictionary();

        // Convert all instances w/o normalization
        FastVector fv = new FastVector();
        int firstCopy=0;
        for(int i=0; i < m_NumInstances; i++) {
          firstCopy = convertInstancewoDocNorm(getInputFormat().instance(i), fv);
        }

        // Need to compute average document length if necessary
        if (m_filterType != FILTER_NONE) {
          m_AvgDocLength = 0;
          for(int i=0; i < fv.size(); i++) {
            Instance inst = (Instance) fv.elementAt(i);
            double docLength = 0;
            for(int j=0; j < inst.numValues(); j++) {
              if(inst.index(j)>=firstCopy) {
                docLength += inst.valueSparse(j) * inst.valueSparse(j);
              }
            }
            m_AvgDocLength += Math.sqrt(docLength);
          }
          m_AvgDocLength /= m_NumInstances;
        }

        // Perform normalization if necessary.
        if (m_filterType == FILTER_NORMALIZE_ALL) {
          for(int i=0; i < fv.size(); i++) {
            normalizeInstance((Instance) fv.elementAt(i), firstCopy);
          }
        }

        // Push all instances into the output queue
        for(int i=0; i < fv.size(); i++) {
          push((Instance) fv.elementAt(i));
        }
      }

      // Flush the input
      flushInput();

      m_NewBatch = true;
      m_FirstBatchDone = true;
      return (numPendingOutput() != 0);
    }

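Note the call to determineDictionary(). Its main tasks are:

1. Determine the stopword list;

2. For the string attributes selected for conversion, split each value with the given tokenizer and record, per class, each word's term frequency and document count (the number of documents of that class that contain the word);

3. Filter the words by the minimum term frequency (m_minTermFreq) and by the maximum number of words to keep per class (m_WordsToKeep);

4. Collect the attributes that are not converted as new attributes;

5. Collect the words from step 2 that passed the filtering as new attributes;

6. Count the number of documents each word appears in and store the result in the m_DocsCounts array;

7. Record <word, new attribute index> pairs in the TreeMap member variable m_Dictionary;

8. Set outputFormat to the structure made up of the new attributes.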
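The counting, pruning and indexing steps can be sketched in plain Java. This is a simplified illustration, not the Weka implementation: DictionarySketch and its parameters are hypothetical names, tokens are split on whitespace, there is no stopword list, and the per-class bookkeeping is collapsed into a single class. The pruning rule (keep every word whose count reaches a threshold derived from minTermFreq and wordsToKeep) is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Simplified dictionary builder; illustrative only. */
final class DictionarySketch {
    final TreeMap<String, Integer> dictionary = new TreeMap<>(); // word -> new attribute index
    final Map<String, Integer> docCounts = new HashMap<>();      // kept word -> documents containing it

    DictionarySketch(List<String> docs, int firstCopy, int minTermFreq, int wordsToKeep) {
        // Count term frequency over all documents and document frequency per word.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (token.isEmpty()) continue;
                termFreq.merge(token, 1, Integer::sum);
                if (seen.add(token)) docCounts.merge(token, 1, Integer::sum);
            }
        }

        // Threshold: at least minTermFreq, raised to the count of the
        // wordsToKeep-th most frequent word when there are more words than that.
        List<Integer> counts = new ArrayList<>(termFreq.values());
        Collections.sort(counts); // ascending
        int keep = Math.max(1, wordsToKeep);
        int threshold = minTermFreq;
        if (counts.size() > keep) {
            threshold = Math.max(minTermFreq, counts.get(counts.size() - keep));
        }

        // Word attributes follow the unconverted attributes, so indices start at firstCopy.
        int index = firstCopy;
        for (Map.Entry<String, Integer> e : new TreeMap<>(termFreq).entrySet()) {
            if (e.getValue() >= threshold) {
                dictionary.put(e.getKey(), index++);
            }
        }
        docCounts.keySet().retainAll(dictionary.keySet()); // drop counts of pruned words
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the cat ran", "a dog ran");
        DictionarySketch d = new DictionarySketch(docs, 1, 2, 10);
        System.out.println(d.dictionary); // {cat=1, ran=2, the=3}
    }
}
```

With one unconverted attribute (firstCopy = 1) and a minimum term frequency of 2, only "cat", "ran" and "the" survive, and they receive the indices 1 to 3 in alphabetical order because the dictionary is a TreeMap.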

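The conversion loop in batchFinished (lines 24-26 of the listing) calls convertInstancewoDocNorm(Instance instance, FastVector v) on every instance. The method does the following:

1. Record every non-zero value of the attributes that are not converted as <new attribute index, value> in contained (a TreeMap). The variable firstCopy ends up equal to the number of unconverted attributes, which is the zero-based index of the first word attribute;

2. For every attribute value that is converted:

1) tokenize;

2) convert to lower case and apply stemming;

3) put <new attribute index, term frequency> into contained. If m_OutputCounts == false, the value is a 0/1 variable that marks whether the word occurs instead. The term frequency is accumulated during this pass over the tokens.

3. If m_TFTransform is true, update every value in contained whose key is greater than or equal to firstCopy to val = Math.log(val + 1), which turns the recorded term frequency fij into log(fij + 1). Note that this result is obtained only when m_TFTransform and m_OutputCounts are both set to true.

4. If m_IDFTransform is true, update every value in contained whose key is greater than or equal to firstCopy to val = val * Math.log(m_NumInstances / (double) m_DocsCounts[index.intValue()]), which turns the recorded term frequency fij into fij * log(number of documents / number of documents containing the word). This is the TF-IDF we normally use. Note that this result is obtained only when m_IDFTransform and m_OutputCounts are both set to true and m_TFTransform stays false; otherwise the value becomes the product of two logarithms.

5. Convert contained, i.e. the collected new attribute indices and their values, into the arrays values and indices, build a SparseInstance from them, add it to the vector, and return firstCopy.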
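The five steps can be followed on a small example. The sketch below is plain Java and does not use Weka: ConvertSketch, its parameters and the whitespace tokenizer are hypothetical, stemming is left out, and the result is returned as a sorted map instead of a SparseInstance.

```java
import java.util.Map;
import java.util.TreeMap;

/** Simplified imitation of the conversion steps; all names are illustrative. */
final class ConvertSketch {
    /**
     * @param otherValues values of the attributes that are not converted
     * @param dictionary  word -> attribute index (every index >= otherValues.length)
     * @param docCounts   docCounts[index] = number of training documents containing that word
     * @return sparse representation: attribute index -> value
     */
    static TreeMap<Integer, Double> convert(double[] otherValues, String text,
            Map<String, Integer> dictionary, int[] docCounts, int numInstances,
            boolean outputCounts, boolean tfTransform, boolean idfTransform) {
        TreeMap<Integer, Double> contained = new TreeMap<>();

        // Step 1: copy the non-zero values of the unconverted attributes.
        int firstCopy = 0;
        for (double v : otherValues) {
            if (v != 0.0) contained.put(firstCopy, v);
            firstCopy++;
        }

        // Step 2: tokenize, lower-case, and count the words found in the dictionary.
        for (String token : text.toLowerCase().split("\\s+")) {
            Integer index = dictionary.get(token);
            if (index == null) continue; // word is not in the dictionary
            if (outputCounts) contained.merge(index, 1.0, Double::sum);
            else contained.put(index, 1.0); // presence flag only
        }

        // Steps 3 and 4: optional TF and IDF transforms, word attributes only.
        for (Map.Entry<Integer, Double> e : contained.entrySet()) {
            int index = e.getKey();
            if (index < firstCopy) continue;
            double val = e.getValue();
            if (tfTransform) val = Math.log(val + 1);
            if (idfTransform) val = val * Math.log(numInstances / (double) docCounts[index]);
            e.setValue(val);
        }

        // Step 5 would copy the keys and values into the indices and values arrays.
        return contained;
    }
}
```

For instance, with one unconverted attribute of value 5.0, the dictionary {cat=1, ran=2, the=3}, four training documents, and "the" occurring in two of them, the text "The cat the dog" with counts and IDF enabled gives index 3 the value 2 * log(4 / 2), while "dog" is ignored because it is not in the dictionary.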

This completes the convertInstancewoDocNorm loop. firstCopy holds the index of the first converted (word) attribute, and fv contains all the converted SparseInstance objects.

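Starting at line 29, the method decides whether to normalize by document length: if m_filterType is not FILTER_NONE it computes the average document length, and if it is FILTER_NORMALIZE_ALL it normalizes every converted instance. Finally, it pushes all converted SparseInstance objects into the output queue, sets m_NewBatch and m_FirstBatchDone to true, and returns whether that queue is non-empty (numPendingOutput() != 0), not the queue length.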
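The length computation can be tried out in isolation. The sketch below is plain Java with hypothetical names; each array holds only the word-attribute values of one document. The docLength and averageDocLength methods follow the listing (Euclidean length, averaged over the documents). The scaling rule in normalize, multiplying each value by the average length divided by the document's own length, is our assumption about what normalizeInstance does, since its source is not shown in this article.

```java
/** Illustrative document-length normalization; not Weka code. */
final class DocLengthNormSketch {
    /** Euclidean length of one document vector (word attributes only). */
    static double docLength(double[] values) {
        double sumSq = 0;
        for (double v : values) {
            sumSq += v * v;
        }
        return Math.sqrt(sumSq);
    }

    /** Mean Euclidean length over all documents, like m_AvgDocLength in the listing. */
    static double averageDocLength(double[][] docs) {
        double total = 0;
        for (double[] d : docs) {
            total += docLength(d);
        }
        return total / docs.length;
    }

    /** Assumed rule: rescale the vector so that its length equals avgDocLength. */
    static double[] normalize(double[] values, double avgDocLength) {
        double len = docLength(values);
        double[] out = values.clone();
        if (len == 0) {
            return out; // empty document: nothing to scale
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] * avgDocLength / len;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] docs = { {3, 4}, {6, 8} };   // lengths 5 and 10
        double avg = averageDocLength(docs);    // 7.5
        double[] n = normalize(docs[0], avg);
        System.out.println(avg + " " + n[0] + " " + n[1]); // 7.5 4.5 6.0
    }
}
```

Under this assumption a short document and a long document with the same word proportions end up with the same vector, which is the usual motivation for length normalization.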

This is the end of batchFinished.

We now return to the place where batchFinished was first called, the static function useFilter. It simply adds the SparseInstance objects to the dataset of outputFormat and then returns that dataset.
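The caller's side of this protocol is short. The following is a rough sketch of such a driver in plain Java, under the assumption that it feeds every instance, closes the batch, and then drains the output queue. MiniFilter and TokenCountFilter are hypothetical types written for this example and are not part of Weka.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

final class UseFilterSketch {
    /** Hypothetical minimal filter contract; not the Weka Filter class. */
    interface MiniFilter<I, O> {
        boolean input(I instance);
        boolean batchFinished();
        O output(); // null when nothing is pending
    }

    /** Feeds every instance, closes the batch, then drains the output queue. */
    static <I, O> List<O> useFilter(List<I> data, MiniFilter<I, O> filter) {
        for (I instance : data) {
            filter.input(instance);
        }
        filter.batchFinished();
        List<O> result = new ArrayList<>();
        O processed;
        while ((processed = filter.output()) != null) {
            result.add(processed);
        }
        return result;
    }

    /** Toy filter for the demo: maps each text to its number of tokens. */
    static final class TokenCountFilter implements MiniFilter<String, Integer> {
        private final List<String> buffer = new ArrayList<>();
        private final Queue<Integer> queue = new ArrayDeque<>();

        public boolean input(String s) {
            buffer.add(s); // nothing is produced until the batch ends
            return false;
        }

        public boolean batchFinished() {
            for (String s : buffer) {
                queue.add(s.trim().split("\\s+").length);
            }
            buffer.clear();
            return !queue.isEmpty();
        }

        public Integer output() {
            return queue.poll();
        }
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a b c", "d e");
        System.out.println(useFilter(data, new TokenCountFilter())); // [3, 2]
    }
}
```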
