Optimizing Complex Extraction Programs over Evolving Text Data Fei Chen 1, Byron Gao 2, AnHai Doan 1, Jun Yang 3, Raghu Ramakrishnan 4 1 University of.

Slides:



Advertisements
Similar presentations
String Similarity Measures and Joins with Synonyms
Advertisements

Hybrid BDD and All-SAT Method for Model Checking Orna Grumberg Joint work with Assaf Schuster and Avi Yadgar Technion – Israel Institute of Technology.
Automatic Memory Management Noam Rinetzky Schreiber 123A /seminar/seminar1415a.html.
Jianxin Li, Chengfei Liu, Rui Zhou Swinburne University of Technology, Australia Wei Wang University of New South Wales, Australia Top-k Keyword Search.
Optimizing Join Enumeration in Transformation-based Query Optimizers ANIL SHANBHAG, S. SUDARSHAN IIT BOMBAY VLDB 2014
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Efficient IR-Style Keyword Search over Relational Databases Vagelis Hristidis University of California, San Diego Luis Gravano Columbia University Yannis.
1 Autocompletion for Mashups Ohad Greenshpan, Tova Milo, Neoklis Polyzotis Tel-Aviv University UCSC.
SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.
Progressive Approach to Relational Entity Resolution Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra Progressive Approach to Relational Entity Resolution.
1 Accessing nearby copies of replicated objects Greg Plaxton, Rajmohan Rajaraman, Andrea Richa SPAA 1997.
Optimizing Statistical Information Extraction Programs Over Evolving Text Fei Chen Xixuan (Aaron) Feng Christopher Ré Min Wang.
SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.
1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.
Document and Query Forms Chapter 2. 2 Document & Query Forms Q 1. What is a document? A document is a stored data record in any form A document is a stored.
Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University.
LYU0103 Speech Recognition Techniques for Digital Video Library Supervisor : Prof Michael R. Lyu Students: Gao Zheng Hong Lei Mo.
Evaluating Top-k Queries over Web-Accessible Databases Nicolas Bruno Luis Gravano Amélie Marian Columbia University.
1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.
Combining Keyword Search and Forms for Ad Hoc Querying of Databases Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, Jeffrey Naughton University of.
Managing Information Extraction: A Database Perspective Adapted from SIGMOD 2006 Tutorial.
More on Regular Expressions Regular Expressions More character classes \s matches any whitespace character (space, tab, newline etc) \w matches.
Impact Analysis of Database Schema Changes Andy Maule, Wolfgang Emmerich and David S. Rosenblum London Software Systems Dept. of Computer Science, University.
Webpage Understanding: an Integrated Approach
AnHai Doan University of Wisconsin-Madison The Cimple Project on Community Information Management.
Tutorial 14 Working with Forms and Regular Expressions.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign Light-weight Domain-based Form Assistant: Querying Web Databases On the.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.
GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)
Optimizing Plurality for Human Intelligence Tasks Luyi Mo University of Hong Kong Joint work with Reynold Cheng, Ben Kao, Xuan Yang, Chenghui Ren, Siyu.
1 On Provenance of Non-Answers for Queries over Extracted Data Jiansheng Huang Ting Chen AnHai Doan Jeffrey F. Naughton.
Progressive Approach to Relational Entity Resolution Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra Progressive Approach to Relational Entity Resolution.
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
Hipikat: A Project Memory for Software Development The CISC 864 Analysis By Lionel Marks.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Cimple: Building Community Portal Sites through Crawling & Extraction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management.
Integration of Search and Learning Algorithms Eugene Fink.
Progressive Approach to Relational Entity Resolution Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra Progressive Approach to Relational Entity Resolution.
AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron.
1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.
BNCOD07Indexing & Searching XML Documents based on Content and Structure Synopses1 Indexing and Searching XML Documents based on Content and Structure.
Working with Forms and Regular Expressions Validating a Web Form with JavaScript.
Graph Query Reformulation with Diversity – Davide Mottin, Francesco Bonchi, Francesco Gullo 1 Graph Query Reformulation with Diversity Davide Mottin, University.
Querying Web Data – The WebQA Approach Author: Sunny K.S.Lam and M.Tamer Özsu CSI5311 Presentation Dongmei Jiang and Zhiping Duan.
Learning to Navigate Through Crowded Environments Peter Henry 1, Christian Vollmer 2, Brian Ferris 1, Dieter Fox 1 Tuesday, May 4, University of.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
1 CS 1430: Programming in C++ Turn in your Quiz1-1.
Exploiting Relevance Feedback in Knowledge Graph Search
AnHai Doan & Alon Halevy Department of Computer Science & Engineering University of Washington Efficiently Ordering Query Plans for Data Integration.
Presented by: AKHIL GADA CSCI 572 University of Southern California Full Text Indexing Based On Lexical Relations An Application :Software Library by YS.
Trajectory Simplification: On Minimizing the Direction-based Error
The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign Light-weight Domain-based Form Assistant: Querying Web Databases On the.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Software Architecture Architecture represents different things from use cases –Use cases deal primarily with functional properties –Architecture deals.
Xiaoyong Chai, Ba-Quy Vuong, AnHai Doan, Jeffrey F. Naughton University of Wisconsin-Madison Efficiently Incorporating User Feedback into Information Extraction.
Pedro DeRose University of Wisconsin-Madison The DBLife Prototype System in The Cimple Project on Community Information Management.
Dense-Region Based Compact Data Cube
Review: Tree search Initialize the frontier using the starting state
CS 330 Class 7 Comments on Exam Programming plan for today:
Hybrid BDD and All-SAT Method for Model Checking
Resource Elasticity for Large-Scale Machine Learning
How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho
Apollo Weize Sun Feb.17th, 2017.
A Framework for Testing Query Transformation Rules
Precise Condition Synthesis for Program Repair
Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research)
Early Midterm Some Study Reminders.
Presentation transcript:

Optimizing Complex Extraction Programs over Evolving Text Data Fei Chen 1, Byron Gao 2, AnHai Doan 1, Jun Yang 3, Raghu Ramakrishnan 4 1 University of Wisconsin-Madison 2 Texas State University-San Marcos 3 Duke University 4 Yahoo! Research

2 Information Extraction (IE) Many solutions in database/Web/AI communities with significant progress But most solutions have considered only static text corpora Meetings room time CS 3104pm CS 1052pm …… Group Meeting Schedule Jun 21: We’ll discuss CIM and IR in room CS 310 at 4pm. Jun 14: Meet in CS 105 at 2pm. IE program

3 Evolving Text Corpora Are Pervasive IBM –find the latest information from enterprise intranets of Washington and –keep extracted knowledge consistent with the Wikipedia pages of Wisconsin –monitor community information url 1 url 2 url 3 … snapshot1 snapshot2 snapshot3

4 IE over Evolving Text Data room time CS 3104pm CS 1052pm …… Group Meeting Schedule Jun 14: Meet in CS 105 at 2pm. Group Meeting Schedule Jun 21: We’ll discuss CIM and IR in room CS 310 at 4pm. Jun 14: Meet in CS 105 at 2pm. Group Meeting Schedule Jun 28: Seminar in CS 354 at 4pm. Jun 21: We’ll discuss CIM and IR in room CS 310 at 4pm. Jun 14: Meet in CS 105 at 2pm. room time CS 3544pm CS 3104pm CS 1052pm …… room time CS 1052pm …… E E E

5 Cyclex[ICDE08]: Match, Reuse and Extract room time CS 1052pm CS 3104pm room time CS 1052pm Group Meeting Schedule Jun 14: Meet in CS 105 at 2pm. Reuser E Group Meeting Schedule Jun 14: Meet in CS 105 at 2pm. Jun 21: We’ll discuss CIM and IR in room CS 310 at 4pm. Matcher E room time CS 1052pm

6 Cyclex[ICDE08]: Properties of Extractors for Correct Reuse Cyclex[ICDE08]: Properties of Extractors for Correct Reuse Scope: max length of any mention extracted by E Context: length of “text windows” surrounding a mention – E only exams the text windows to extract the mention Example : E extracts telephone numbers using regular expression “be reached at \d{7}” scope = 7 chars context = 14 chars

7 Limitations of Cyclex[ICDE08] extractRoom Meetings(room, time) extractMeeting extractTime Meetings(room, time) # Only look for conference names in the top 20 lines of the file my $maxLines=20; my $topOfFile=getTopOfFile($file,$maxLines); # Look for the match in the top 20 lines - case insenstive, allow matches spanning multiple lines if($topOfFile=~/(.*?)$pattern/is) { my ($prefix,$name)=($1,$2); # If it matches, do a sanity check and clean up the match # Get the first letter # Verify that the first letter is a capital letter or number if(!($name=~/^\W*?[A-Z0-9]/)) { return (); } # If there is an abbreviation, cut off whatever comes after that if($name=~/^(.*?$abbreviations)/s) { $name=$1; } # If the name is too long, it probably isn't a conference if(scalar($name=~/[^\s]/g) > 100) { return (); } # Get the first letter of the last word (need to this after chopping off parts of it due to abbreviation my ($letter,$nonLetter)=("[A-Za-z]","[^A-Za-z]"); " $name"=~/$nonLetter($letter) $letter*$nonLetter*$/; # Need a space before $name to handle the first $nonLetter in the pattern if there is only one word in name my $lastLetter=$1; if(!($lastLetter=~/[A-Z]/)) { return (); } # Verify that the first letter of the last word is a capital letter Cyclex treats an IE program as a blackbox Real-world IE programs are complex –Avatar: 25+ blackboxes –DBLife: 45+ blackboxes data page

8 Delex: Delex: Decompose and Recycle extractRoom Meetings(room, time) extractMeeting extractTime data page Exploit the composition nature of IE programs Delex cuts the runtime of Cyclex by 50-71%

9 Delex on Consecutive Snapshots Matchers Cost Model Select a good reuse plan Execute the plan Capture IE results Matchers Cost Model Select a good reuse plan Execute the plan Capture IE results Captured results of snapshot n Snapshot n Snapshot n+1 Captured results of snapshot n+1

10 A Baseline Solution of Capturing Results for Multi-Blackbox IE Programs Capturing results in Cyclex A baseline solution: applying the solution of Cyclex to each blackbox E {(mention extracted, docID, positions within the doc)} F E {(mention extracted, docID, positions within the doc)}

11 The Baseline Solution May Miss Mentions rmTitle exLocation Locations(location) Midwest DB Courses Year 2009 CS101 (Wisc) CS301 (Illinois) Wisc rmTitle exLocation Locations(location) Midwest reuser Midwest DB Courses: CS101 (Wisc) CS301 (Illinois) Illinois no mention is copied missed !

12 Capture and Store IE Results in Delex rmTitle exLocation Locations(location) Wisc Midwest DB Courses: CS101 (Wisc) CS301 (Illinois) Illinoins “CS101 … Illinois)” I exLocation O exLocation I exLocation q r I exLocation (q) I exLocation (r) O exLocation O exLocation (q) O exLocation (r) q Wisc Illinoins … … …

13 Delex on Consecutive Snapshots Matchers Cost Model Select a good reuse plan Execute the plan Capture IE results Matchers Cost Model Select a good reuse plan Execute the plan Capture IE results Captured Results Captured Results Snapshot n Snapshot n+1

14 Challenge in Efficiently Reuse Captured Results rmTitle 1  n S )( 1 qI U )( 2 qI U )( 1 qI V )( 1 qO U )( 2 qO U )( 2 qI V )( 1 qO V )( 2 qO V          1 p 2 p 1 q 2 q n S V O V I U O U I exLocation U V

15 Overall Processing Algorithm U V )( 2 qI U )( 2 qO U )( 1 qI V )( 2 qI V   )( 2 qO V n S n+1 S  2 p 2 q )( 1 qO V )( 1 qO U )( 1 qI U U I U O V I V O

16 Delex on Consecutive Snapshots Matchers Cost Model Select a good reuse plan Execute the plan Capture IE results Matchers Cost Model Select a good reuse plan Execute the plan Capture IE results Captured Results Captured Results Snapshot n Snapshot n+1

17 Find a Good Reuse Plan Plan space –assign a matcher to each IE blackbox –# of plans is exponential in # of IE blackboxes Use a text-specific cost model to estimate the completion time of each plan Searching for good plans –optimization is not “decomposable” –a greedy solution that efficiently finds a good plan

18 Experiment Setup Datasets IE Programs : Rule-based and Learning-based IE Programs Data SetsDBLifeWikipedia # Snapshots15 Time between snapshots2 days21 days Avg # Page per Snapshot Avg Size per Snapshot180M35M DBLife (Rule-based) Wikipedia (Rule-based) Wikipedia (Learning -based) talkchairadviseblockbusterplayawardactor # of IE “Blackboxes”

19 Runtime Comparison Delex drastically cuts runtime of cyclex by 50-71% (See paper for more experiments)

20 Related Work IE over evolving text data –[Doan et al, ICDE-08] –only considers a single IE blackbox Optimizing IE programs –[Gravano et al, SIGMOD-06] [Gravano et al, ICDE-07] [Doan et al, VLDB- 07] [Reiss et al, ICDE-08] –only consider static text corpora Incremental View Maintenance –[Gupta&Mumick][&Widom et al, SIGMOD-95][Garcia-Molina&Widom et al, VLDB-91]… –only consider relational operators with well defined semantics –assume that changes to the inputs are readily available

21 Conclusion and Future Work First in-depth solution to optimizing complex IE over evolving text Defined challenges and provided initial solutions –capture intermediate IE results for correct reuse –efficiently coordinate matching, extraction, and copying for multiple IE blackboxes –cost-based decisions in choosing a good reuse plan Future work –reuse across URLs –handle extractors that extract mentions across multiple pages

22 Capture and Store IE Results in Delex rmTitle exLocation Locations(location) Wisc Midwest DB Courses: CS764 (Wisc) CS511 (Illinois) Illinoins “CS764 … Illinois)” I exLocation O exLocation I exLocation q r I exLocation (q) I exLocation (r) O exLocation O exLocation (q) O exLocation (r) q “CS764 … Illinois)” Wisc Illinoins … … …