Information Extration: Quality Matter. A paper review

技术2024-08-01 71

Paper Reivew Xu Chu January 30, 2011 1 1.1 A Quality-Aware Optimizer for Information Basics An IE system takes input a document, and output a set of tuples. IE has to define a extration pattern first. IE system has a few decision threshold, or we call ”knob”. For instance, the number of rules, minimum confidence level of the extracted tuples, etc.. We generally use θ to denote the parameters. Document Retrieval strategies also affect the output quality of the IE system since the documents retrieved by the strategy is processed later by the IE system. There are three kinds of document retrievel strategies. The first is scan, the second is filtered scan, the third is Automatic Query Evaluation(QXtract). We use output quality, which means the number of good tuples and bad tuples, as well as execution time to model the quality of an IE system coupled with a document retrievel strategy. 1.2 Characterize Output Quality One traditional method is to use precison-recall curves. Precision characterize perfectness, while recall characterize completeness. However, precision depends heavily on the distribution of good and bad documents in the test set. So, we use ROC curve instead, which graphically summarize the trade-off be- tween different types of errors. True positive rate is the fraction of positives correctly classified as positives, and false positive rate is fraction of negatives incorrectly classfied as positives. The next step is to develop an algorithm to generate ROC curve for a stand- alone IE system. There are three problems we need to deal with. The first is that the computation of fp rate needs to know the total number of bad tuples, which is hard to count since in principle it is infinite. The solution is to use a pooling-based approach. The second problem is that after computing each point in the curve, i.e. < tp(θ), f p(θ) >, we don’t have a confidence bound for the point. The solution is to use ten-fold cross valiation. The third problem is that while we use θ to characterize the IE settings, the fact is that in practice there are mutiple ”knobs”. The solution is that we only choose pareto-optimal settings. 1 Finally, we need to incorporate document retrieval strategies into our ROC curve analysis. We now define the quality curve(E,R,P). In which E is the IE system, R is document retrieval strategy, and P is the time point. 1.3 1.3.1 Estimating Output Quality Statistics Backgrounds hypergeometric distribution. http://en.wikipedia.org/wiki/Hypergeometric_ distribution power-law distribution. http://en.wikipedia.org/wiki/Power_law binomial distribution. http://en.wikipedia.org/wiki/Binomial_distribution 1.4 Estimating Model Parameters Retrieval-strategy-specific parameters can be typically estimated using a simple testing phase. Database-specific parameters relies on the general principle of Maximum- Likelihoond Estimation(MLE). 1.4.1 Assuming that we know a observed tuple is good or not First, we estimate |Dg |, |Db |, |De |, using |Dr |, |Dgp |, |Dbp |, |Dep |. Second, we estimate tuple related parameters. Estimaing power-law distribution parameters for gd(t), bd(t). Finally, we estimate |Tgood |, |Tbad | 1.4.2 Don’t know a tuple is good or not Technique1:Rejection-Sampling-Based MLE-Partitioning Approach, similar to Monte-Carlo. Technique2:Uncertainty-Preserving Approach. 2 Join Optimization of IE output: Quality Mat- ters 3 An Algebraic Approach to Rule-Based Infor- mation Extraction Rule-based information extraction systems suffer from scalability problems. This paper proposes a algebraic approach to address such problems. 2 3.1 Data Model and Algebraic Operators span is the basic data struct, a finite sequence of spans constitute a tuple, and a set of tuples of the same width constitute a relation. In addition to the traditional relational operators, this paper introduces another two kinds of operators, i.e. Span extraction operators and Span aggragation op- erators. • εre :Standard Regular Expression Matcher. • εd :Dictionary Matcher • Ωo :Overlap Consolidation is used to produce new spans by merging over- lapping spans • Ωc :Containment Consolidation is used to discard annotation spans that are wholly contained within other annotation spans. • β:Block identifies a large span of text enclosing a set of input spans that no two successive spans are more than a certain distance apart. 3.2 Optimization In addition to regular relational operator optimization techniques such as select pushing down, this paper proposes three optimzation techniques. Shared Dictionary Matching(SDM) The goal is to avoid multiple scan of the same dictionary. Furthermore, a single scan over a document can also match multiple dictionaries simultaneously. Conditional Evalution(CE) The goal is avoid some subqueries if we already know that this subquery wouldn’t produce any result because of some restriction imposed earlier. Restricted Span Extraction(RSE) The idea is to avoid scan the entire text, when taking advantage of some previous known facts. There are some minor complications we need to deal with. Please refer to the original paper. 3

最新回复(0)