Optimizing Statistical Information Extraction Programs Over Evolving Text
Fei Chen, Xixuan (Aaron) Feng, Christopher Ré, Min Wang


1 Optimizing Statistical Information Extraction Programs Over Evolving Text
Fei Chen, Xixuan (Aaron) Feng, Christopher Ré, Min Wang

2 One-Slide Summary
Statistical Information Extraction (IE) is increasingly used.
–For example, MSR Academic Search, Ali Baba (HU Berlin), MPI YAGO
–isWiki at HP Labs
Text corpora evolve!
–An issue: it is difficult to keep IE results up to date
–Current approach: rerun from scratch, which can be too slow
Our goal: improve statistical IE runtime on evolving corpora by recycling previous IE results.
–We focus on a popular statistical model for IE, conditional random fields (CRFs), and build CRFlex
–We show that a 10x speedup is possible for repeated extractions

3 Background

4 Background 1: CRF-based IE Programs
A CRF-based IE program maps a document to a token sequence, builds a trellis graph over it, infers a label sequence, and emits a table.
Running example: the sentence "David DeWitt is working at Microsoft."
–Token sequence x = (David, DeWitt, is, working, at, Microsoft), positions 1 to 6
–Label sequence y = (P, P, O, O, O, A), where P: Person, A: Affiliation, O: Other
–Extracted table: Person "David DeWitt", Affiliation "Microsoft"
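The slide's running example can be sketched as data plus a small post-processing step that turns a label sequence into table rows. This is a minimal illustration, not CRFlex's code; the `extract` helper and its grouping rule are assumptions.

```python
# The slide's running example: a token sequence x and its CRF label
# sequence y (P: Person, O: Other, A: Affiliation).
tokens = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
labels = ["P", "P", "O", "O", "O", "A"]

def extract(tokens, labels):
    """Group consecutive identically-labeled tokens into table entries."""
    table = {"P": [], "A": []}
    current, tag = [], None
    for tok, lab in zip(tokens + [None], labels + ["O"]):
        if lab == tag:
            current.append(tok)
        else:
            if tag in table:
                table[tag].append(" ".join(current))
            current, tag = [tok], lab
    return table

print(extract(tokens, labels))
# → {'P': ['David DeWitt'], 'A': ['Microsoft']}
```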

5 Background 2: CRF Inference Steps
Token sequence → label sequence (CRF labeling):
(I) Computing feature functions (applying rules)
–For example, at position 6 (Microsoft): f(O, A, x, 6) = 0 and g(O, A, x, 6) = 1, giving the feature vector v = (0, 1)
(II) Constructing the trellis graph (dot product)
–With model λ = (0.5, 0.2), the edge weight is w = v ∙ λ = 0.2
(III) Viterbi inference (dynamic programming)
–A version of the standard shortest path algorithm
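The three steps can be sketched end to end. The feature functions, dictionary, and weight vector below are toy assumptions chosen so the running example decodes to the slide's label sequence; they are not the slide's actual f, g, or λ.

```python
import numpy as np

LABELS = ["P", "A", "O"]          # Person, Affiliation, Other
ORG_DICT = {"Microsoft"}          # toy affiliation dictionary (assumption)

def features(prev, cur, x, i):
    # Step I: apply rules; returns the feature vector v for one trellis edge.
    tok = x[i]
    f = 1.0 if tok[0].isupper() and tok not in ORG_DICT and cur == "P" else 0.0
    g = 1.0 if tok in ORG_DICT and cur == "A" else 0.0
    h = 1.0 if tok.islower() and cur == "O" else 0.0
    return np.array([f, g, h])

def viterbi(x, lam):
    # Step II: trellis edge weights w = v . lambda.
    # Step III: dynamic programming, a variant of the shortest path algorithm.
    n, L = len(x), len(LABELS)
    score = np.full((n, L), -np.inf)
    back = np.zeros((n, L), dtype=int)
    for j in range(L):
        score[0, j] = features(None, LABELS[j], x, 0) @ lam
    for i in range(1, n):
        for j in range(L):
            cand = [score[i - 1, k] + features(LABELS[k], LABELS[j], x, i) @ lam
                    for k in range(L)]
            back[i, j] = int(np.argmax(cand))
            score[i, j] = cand[back[i, j]]
    # Follow back-pointers to recover the best label sequence.
    j = int(np.argmax(score[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i, j]
        path.append(j)
    return [LABELS[j] for j in reversed(path)]

x = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
print(viterbi(x, np.array([0.5, 0.2, 0.3])))
# → ['P', 'P', 'O', 'O', 'O', 'A']
```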

6 Challenges
How to do CRF inference incrementally with exactly the same results as a rerun
–There are no straightforward solutions for each step
How to trade off savings and overhead
–Intermediate results (feature values and the trellis graph) are much larger than the input (tokens) and output (labels)
Pipeline: token sequences → (I) computing feature functions f1, f2, ..., fK → feature values → (II) computing the trellis graph → trellis graph → (III) performing inference → label sequences

7 Technical Contributions

8 Recycling Each Inference Step
(I) Computing feature functions (applying rules)
–(Cyclex) Efficient Information Extraction over Evolving Text Data, F. Chen et al., ICDE-08
(II) Constructing the trellis graph (dot product)
–At any position, unchanged features imply an unchanged trellis
(III) Viterbi inference (dynamic programming)
–Auxiliary information is needed to localize dependencies
–A modified version of Viterbi for recycling

Step | Input          | Output
I    | Token sequence | Feature values
II   | Feature values | Trellis graph
III  | Trellis graph  | Label sequence
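The step-II observation, unchanged features imply an unchanged trellis, can be sketched as a copy-or-recompute loop over positions. This assumes positions have already been aligned (e.g. by step I's diff); the function names are illustrative, not CRFlex's.

```python
# At any position, if the feature values did not change, the trellis column
# (the edge weights at that position) did not change either, so it can be
# copied from the previous run instead of recomputed via the dot product.
def rebuild_trellis(new_feats, old_feats, old_trellis, dot_product_column):
    trellis = []
    for i, v in enumerate(new_feats):
        if i < len(old_feats) and old_feats[i] == v:
            trellis.append(old_trellis[i])         # copy: features unchanged
        else:
            trellis.append(dot_product_column(v))  # recompute: dot product
    return trellis

old_feats = [(1, 0), (0, 1)]
new_feats = [(1, 0), (1, 1)]       # only position 1 changed
calls = []
def col(v):
    calls.append(v)
    return sum(v)                  # stand-in for the real edge-weight math
print(rebuild_trellis(new_feats, old_feats, [99, 98], col))
# → [99, 2]  (position 0 copied, position 1 recomputed)
```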

9 Performance Trade-off
Materialization decision in each inference step
–A new trade-off, due to the large amount of intermediate representation in statistical methods
–CPU computation varies from task to task

Keep output? | Pros                                  | Cons
Yes          | More recycling chances (low CPU time) | High I/O time
No           | Low I/O time                          | Fewer recycling chances (high CPU time)

10 Optimization
Binary choices for 2 intermediate outputs → 2^2 = 4 plans
More plans are possible
–For example, with partial materialization in a step
No plan is always fastest → a cost-based optimizer
–CPU time per token and I/O time per token are task-dependent
–Changes between consecutive snapshots are dataset-dependent
–Measure these by running on a subset at the first few snapshots
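The cost-based plan choice can be sketched as enumerating the 2^2 plans and scoring each from profiled statistics. The cost model and all numbers below are hypothetical stand-ins for the per-token CPU and I/O measurements the slide describes, not CRFlex's actual optimizer.

```python
from itertools import product

def plan_cost(keep_features, keep_trellis, stats):
    """Estimate per-token cost of one plan from profiled statistics."""
    cost = 0.0
    # Step I: if feature values are kept, only changed tokens pay CPU;
    # otherwise everything is recomputed, but no I/O is spent.
    cost += stats["cpu_features"] * (stats["change_rate"] if keep_features else 1.0)
    cost += stats["io_per_token"] if keep_features else 0.0
    # Step II: the same keep-vs-drop trade-off for the trellis graph.
    cost += stats["cpu_trellis"] * (stats["change_rate"] if keep_trellis else 1.0)
    cost += stats["io_per_token"] if keep_trellis else 0.0
    return cost

def best_plan(stats):
    return min(product([True, False], repeat=2),
               key=lambda p: plan_cost(p[0], p[1], stats))

# Hypothetical profile: expensive features, cheap I/O, 10% of tokens change.
stats = {"cpu_features": 50.0, "cpu_trellis": 5.0,
         "io_per_token": 2.0, "change_rate": 0.1}
print(best_plan(stats))
# → (True, True): materializing both outputs pays off for this profile
```

With a different profile (say, cheap features and expensive I/O), the minimum shifts to a drop-everything plan, which is why no single plan is always fastest.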

11 Experiments

12 Repeated Extraction Evaluation
Dataset
–English Wikipedia pages with the Entertainment tag; 16 snapshots (one every three weeks), 3000+ pages per snapshot on average
IE task: Named Entity Recognition
Features
–Cheap: token-based regular expressions
–Expensive: approximate matching over dictionaries
Result: ~10x speed-up (after an initial statistics-collection phase)

13 Conclusion
Concerning real-world deployment of statistical IE programs, we:
–Devised a recycling framework with no loss of correctness
–Explored a performance trade-off between CPU and I/O
–Demonstrated that up to ~10x speed-up is possible on a real-world dataset
Future directions
–More graphical models and inference algorithms
–Parallel settings

14 Importance of Optimizer
Only the fastest 3 (out of 8) plans are plotted
–No plan is always within the top 3

15 Per Snapshot Comparisons

16 Runtime Decomposition
Only the fastest 3 plans and Rerun are plotted
–I/O can be higher in the slow plans

17 Scoping Details
Per-document IE
–No assumptions that a document can be broken into smaller pieces
–Repeated crawls using a fixed set of URLs
Focus on the most popular model in IE
–Linear-chain CRF
–Viterbi inference
Optimize the inference process with a pre-trained model
Exactly the same results as a rerun; no approximation
Recycling granularity is a token (or position)

18 Recycle Each Step
Each step follows the same recycling pattern: a Diff finds match regions between the previous and new inputs, a Recycler splits the new input into copy regions and recompute regions, and a Copier reuses the previous outputs in the copy regions.
–(a) Step I: Unix Diff matches the previous and new token sequences (token match regions); Feature Recyclers and a Feature Copier produce the new feature values from the previous feature values
–(b) Step II: Vector Diff matches the previous and new feature values (vector match regions); a Factor Recycler and a Factor Copier produce the new factors from the previous factors
–(c) Step III: Factor Diff matches the previous and new factors (factor match regions); an Inference Recycler, a Label Copier, and the previous Viterbi context produce the new labels from the previous labels
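The step-I pattern above can be sketched with Python's `difflib` standing in for Unix Diff. `feature_fn` is an illustrative stand-in for the (possibly expensive) feature functions; a real recycler would also shrink copy regions by each feature's context window, which this sketch ignores.

```python
import difflib

def recycle_features(old_tokens, new_tokens, old_values, feature_fn):
    """Copy cached feature values inside match regions; recompute elsewhere."""
    new_values = [None] * len(new_tokens)
    recomputed = []
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # copy region: reuse cached values for matched tokens
            new_values[j1:j2] = old_values[i1:i2]
        else:
            # recompute region: only changed/inserted tokens pay CPU cost
            for j in range(j1, j2):
                new_values[j] = feature_fn(new_tokens[j])
                recomputed.append(j)
    return new_values, recomputed

old = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
new = ["David", "DeWitt", "is", "now", "working", "at", "Microsoft"]
vals, redone = recycle_features(old, new,
                                [f"v({t})" for t in old], lambda t: f"v({t})")
print(redone)
# → [3]: only the inserted token "now" is recomputed
```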

