
1 Discovering and Utilizing Structure in Large Unstructured Text Datasets
Eugene Agichtein, Math and Computer Science Department

2 Information Extraction Example
Information extraction systems represent text in structured form.

"May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

Information Extraction System output: Disease Outbreaks in The New York Times

Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

3 How can information extraction help?
Converting text into a structured relation can:
- allow precise and efficient querying
- allow returning answers instead of documents
- support powerful query constructs
- allow data integration with (structured) RDBMS
- provide input to data mining and statistical analysis

4 Goal: Detect, Monitor, Predict Outbreaks
Input sources, each feeding its own information extraction system (IE Sys 1-4):
- Current patient records: diagnosis, physician's notes, lab results/analysis, …
- 911 calls, traffic accidents, …
- Historical news, breaking news stories, wire, alerts, …
The extracted data flows into Data Integration, Data Mining, and Trend Analysis, supporting Detection, Monitoring, and Prediction.

5 Challenges in Information Extraction
Portability:
- Reduce effort to tune for new domains and tasks
- MUC systems: experts would take 8-12 weeks to tune
Scalability, efficiency, access:
- Enable information extraction over large collections
- 1 sec/document * 5 billion docs = 158 CPU years
Approach: learn from data ("bootstrapping")
- Snowball: partially supervised information extraction
- Querying large text databases for efficient information extraction

6 Outline
Information extraction overview
Partially supervised information extraction
- Adaptivity
- Confidence estimation
Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
Current and future work
- Inferring and analyzing social networks
- Utility-based extraction tuning
- Multi-modal information extraction and data mining
- Authority/trust/confidence estimation

7 What is "Information Extraction"
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Target slots: NAME, TITLE, ORGANIZATION (empty before extraction)

8 What is "Information Extraction"
As a task: filling slots in a database from sub-segments of text. (Same news excerpt as slide 7, now run through IE.)

NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..

9 What is "Information Extraction"
As a family of techniques: Information Extraction = segmentation + classification + association + clustering. (Same news excerpt as slide 7; the segmentation step yields these candidate segments:)
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

10 What is "Information Extraction"
(Same content as slide 9; this build highlights the classification step, labeling each segment as a name, title, or organization.)

11 What is "Information Extraction"
(Same content as slide 9; this build highlights the association step, linking names to their titles and organizations.)

12 What is "Information Extraction"
As a family of techniques: Information Extraction = segmentation + classification + association + clustering. (Same news excerpt as slide 9; the final clustering step merges coreferent mentions into the table below, with asterisks in the original figure marking clustered entries.)

NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..

13 IE in Context
Pipeline: Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search, Data mine.
Supporting steps: create ontology; label training data; train extraction models.

14 Information Extraction Tasks
Extracting entities and relations:
- Entities: named (e.g., Person) or generic (e.g., disease name)
- Relations: entities related in a predefined way (e.g., location of a disease outbreak), or discovered automatically
Common information extraction steps:
- Preprocessing: sentence chunking, parsing, morphological analysis
- Rules/extraction patterns: manual, machine learning, and hybrid
- Applying extraction patterns to extract new information
Postprocessing and complex extraction (not covered):
- Co-reference resolution
- Combining relations into events, rules, …

15 Two Kinds of IE Approaches
Knowledge Engineering:
- rule based, developed by experienced language engineers
- makes use of human intuition
- requires only a small amount of training data
- development can be very time consuming
- some changes may be hard to accommodate
Machine Learning:
- uses statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus
- annotators are cheap (but you get what you pay for!)

16 Extracting Entities from Text
Any of these models can be used to capture words, formatting, or both. (Running example: "Abraham Lincoln was born in Kentucky.")
- Lexicons: member? (Alabama, Alaska, …, Wisconsin, Wyoming)
- Sliding window: classify each window of words ("which class?"), trying alternate window sizes; a sketch follows below
- Classify pre-segmented candidates ("which class?")
- Boundary models: classify BEGIN/END token boundaries
- Finite state machines: most likely state sequence?
- Context free grammars: most likely parse? (e.g., POS tags NNP VP NP V NNP under NP, PP, VP, S nodes)
- …and beyond
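To make the sliding-window model concrete, here is a minimal sketch (not from the talk; all function and variable names are hypothetical): windows of several widths slide over the token stream, and a classifier decides whether each window is an entity.

```python
# Minimal sliding-window extraction sketch; the classifier is a stub.
from typing import Callable, List, Tuple

def sliding_window_extract(
    tokens: List[str],
    classify: Callable[[List[str]], str],   # returns "PER", "LOC", ... or "O"
    max_width: int = 3,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans the classifier accepts."""
    spans = []
    for width in range(1, max_width + 1):
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            label = classify(window)
            if label != "O":                 # "O" = not an entity
                spans.append((start, start + width, label))
    return spans

# Toy classifier: a lexicon lookup, standing in for a trained model.
LEXICON = {("Abraham", "Lincoln"): "PER", ("Kentucky",): "LOC"}
tokens = "Abraham Lincoln was born in Kentucky .".split()
print(sliding_window_extract(tokens, lambda w: LEXICON.get(tuple(w), "O")))
# -> [(5, 6, 'LOC'), (0, 2, 'PER')]
```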

17 Hidden Markov Models
Finite state model / graphical model: transitions between states S_{t-1} → S_t → S_{t+1}, each emitting an observation O_{t-1}, O_t, O_{t+1}.
Parameters, for all states S = {s_1, s_2, …}:
- Start state probabilities: P(s_1)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (with prior).
Generates: a state sequence and an observation sequence o_1 o_2 … o_8.

18 IE with Hidden Markov Models
Given a sequence of observations ("Yesterday Lawrence Saul spoke this example sentence.") and a trained HMM, find the most likely state sequence (Viterbi). Any words said to be generated by the designated "person name" state are extracted as a person name: Lawrence Saul.
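A hedged sketch of Viterbi decoding for this setup. The two-state model and all probabilities below are toy values invented for illustration; the slide's HMM would be trained on labeled data.

```python
# Viterbi decoding in log-space, then emit words aligned with the NAME state.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-6))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s].get(obs[t], 1e-6)))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["OTHER", "NAME"]
start_p = {"OTHER": 0.9, "NAME": 0.1}
trans_p = {"OTHER": {"OTHER": 0.8, "NAME": 0.2},
           "NAME":  {"OTHER": 0.5, "NAME": 0.5}}
emit_p  = {"OTHER": {"yesterday": 0.2, "spoke": 0.2},
           "NAME":  {"lawrence": 0.4, "saul": 0.4}}
obs = "yesterday lawrence saul spoke".split()
path = viterbi(obs, states, start_p, trans_p, emit_p)
print([w for w, s in zip(obs, path) if s == "NAME"])  # -> ['lawrence', 'saul']
```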

19 HMM Example: "Nymble" [Bikel et al. 1998], [BBN "IdentiFinder"]
Task: named entity extraction. States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}). Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), with back-off to P(s_t | s_{t-1}), P(s_t) and to P(o_t | s_t), P(o_t).
Train on 450k words of news wire text. Results:
Case  | Language | F1
Mixed | English  | 93%
Upper | English  | 91%
Mixed | Spanish  | 90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]

20 Relation Extraction
Extract structured relations from text.
"May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
Information Extraction System output: Disease Outbreaks in The New York Times
Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

21 Relation Extraction
Typically requires entity tagging as preprocessing.
Knowledge engineering:
- Rules defined over lexical items: "<entity> located in <entity>"
- Rules defined over parsed text: "((Obj <entity>) (Verb located) (*) (Subj <entity>))"
- Proteus, GATE, …
Machine learning-based:
- Learn rules/patterns from examples: Dan Roth 2005, Cardie 2006, Mooney 2005, …
- Partially supervised: bootstrap from "seed" examples: Agichtein & Gravano 2000, Etzioni et al. 2004, …
Recently, hybrid models [Feldman 2004, 2006]

22 Comparison of Approaches (by human effort required)
Significant effort: use "language-engineering" environments to help experts create extraction patterns
- GATE [2002], Proteus [1998]
Substantial effort: train system over manually labeled data
- Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996]
Minimal effort: exploit large amounts of unlabeled data
- DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000]
- Etzioni et al. ('04): KnowItAll, extracting unary relations
- Yangarber et al. ('00, '02): pattern refinement, generalized names detection

23 The Snowball System: Overview
Snowball bootstraps from a handful of seed tuples to produce a ranked relation. Example output, with confidence scores:
Organization        | Location      | Conf
Microsoft           | Redmond       | 1
IBM                 | Armonk        | 1
Intel               | Santa Clara   | 1
AG Edwards          | St Louis      | 0.9
Air Canada          | Montreal      | 0.8
7th Level           | Richardson    | 0.8
3Com Corp           | Santa Clara   | 0.8
3DO                 | Redwood City  | 0.7
3M                  | Minneapolis   | 0.7
MacWorld            | San Francisco | 0.7
…                   | …             | …
157th Street        | Manhattan     | 0.52
15th Party Congress | China         | 0.3
15th Century Europe | Dark Ages     | 0.1

24 Snowball: Getting User Input [ACM DL 2000]
User input:
- a handful of example instances
- integrity constraints on the relation, e.g., Organization is a "key", Age > 0, etc.
Example seed tuples:
Organization | Headquarters
Microsoft    | Redmond
IBM          | Armonk
Intel        | Santa Clara
Pipeline (repeated on the following slides): Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples → (loop).

25 Snowball: Finding Example Occurrences
Can use any full-text search engine to find documents mentioning both attributes of a seed tuple:
- "Computer servers at Microsoft's headquarters in Redmond…"
- "In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp …"
- "The Armonk-based IBM introduced a new line…"
- "Change of guard at IBM Corporation's headquarters near Armonk, NY..."

26 Snowball: Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations, … (MITRE's Alembic, IBM's Talent, LingPipe, …). The occurrences above are tagged with ORGANIZATION and LOCATION markup.

27 Snowball: Extraction Patterns
General extraction pattern model: acceptor_0, Entity, acceptor_1, Entity, acceptor_2
Example context: "Computer servers at Microsoft's headquarters in Redmond…"
Acceptor instantiations (a scoring sketch follows below):
- String match: accepts the string "'s headquarters in"
- Vector-space: ~ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)]
- Sequence classifier: Prob(T = valid | 's, headquarters, in); HMMs, sparse sequences, Conditional Random Fields, …
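A small sketch of how the vector-space acceptor might score a candidate context; the threshold and the second context vector below are illustrative, not from the talk.

```python
# Cosine similarity between a pattern's context vector and a candidate's.
import math

def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

pattern = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
context = {"'s": 0.6, "headquarters": 0.6, "near": 0.5}
match = cosine(pattern, context)
print(round(match, 2))       # ~0.70
accept = match >= 0.6        # the acceptance threshold is a tunable parameter
```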

28 Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms: each occurrence becomes <left context> ORGANIZATION <middle context> LOCATION <right context>, with term weights as in the vector-space acceptor. [Figure: four such occurrence vectors.]
2. Cluster similar occurrences.

29 Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids. [Figure: two centroid patterns of the form ORGANIZATION {…} LOCATION.]

30 Vector Space Clustering [figure only]

31 Snowball: Extracting New Tuples
Match tagged text fragments against the learned patterns. Example: "Google's new headquarters in Mountain View are …" is matched against patterns P1, P2, P3, scoring Match=0.8, Match=0.4, and Match=0 respectively; a sufficiently strong match extracts the candidate tuple <Google, Mountain View>.

32 Snowball: Evaluating Patterns
Automatically estimate pattern confidence against the current seed tuples (<Microsoft, Redmond>, <IBM, Armonk>, <Intel, Santa Clara>). For pattern P4 = ORGANIZATION {…} LOCATION:
- "IBM, Armonk, reported…" → positive
- "Intel, Santa Clara, introduced..." → positive
- "'Bet on Microsoft', New York-based analyst Jane Smith said..." → negative (contradicts the seed <Microsoft, Redmond>)
Conf(P4) = Positive / Total = 2/3 = 0.66

33 Snowball: Evaluating Tuples
Automatically evaluate tuple confidence: a tuple has high confidence if it was generated by multiple high-confidence patterns with strong context matches. In the Snowball formulation:
Conf(T) = 1 - Π_i (1 - Conf(P_i) * Match(C_i, P_i))
Example: the candidate tuple <3COM, Santa Clara>, extracted by P3 (Conf 0.95, Match 0.8) and P4 (Conf 0.66, Match 0.4), gets Conf(T) ≈ 0.83. (A worked sketch follows below.)
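The two confidence computations can be written down compactly; this sketch follows the formulas above, with the slide's numbers plugged in.

```python
# Pattern confidence = precision on known seeds; tuple confidence combines
# every pattern occurrence that extracted the tuple.

def pattern_conf(pos: int, neg: int) -> float:
    """Fraction of a pattern's matches that agree with the seed set."""
    return pos / (pos + neg) if pos + neg else 0.0

def tuple_conf(evidence) -> float:
    """evidence: list of (pattern_confidence, context_match) pairs.
    Conf(T) = 1 - prod(1 - Conf(P_i) * Match(C_i, P_i))."""
    prod = 1.0
    for p_conf, match in evidence:
        prod *= 1.0 - p_conf * match
    return 1.0 - prod

print(pattern_conf(2, 1))                                 # P4 above: 0.66...
print(round(tuple_conf([(0.95, 0.8), (0.66, 0.4)]), 2))   # 0.82 (slide: 0.83)
```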

34 Snowball: Evaluating Tuples
Keep only high-confidence tuples for the next iteration; low-confidence tuples below the cutoff are dropped.
Organization        | Headquarters  | Conf
Microsoft           | Redmond       | 1
IBM                 | Armonk        | 1
Intel               | Santa Clara   | 1
AG Edwards          | St Louis      | 0.9
Air Canada          | Montreal      | 0.8
7th Level           | Richardson    | 0.8
3Com Corp           | Santa Clara   | 0.8
3DO                 | Redwood City  | 0.7
3M                  | Minneapolis   | 0.7
MacWorld            | San Francisco | 0.7
…                   | …             | …
157th Street        | Manhattan     | 0.52
15th Party Congress | China         | 0.3
15th Century Europe | Dark Ages     | 0.1

35 Snowball: Evaluating Tuples
Start the new iteration with the expanded example set (the ten high-confidence tuples from slide 34). Iterate until no new tuples are extracted.

36 Pattern-Tuple Duality
A "good" tuple:
- is extracted by "good" patterns
- tuple weight ~ goodness
A "good" pattern:
- is generated by "good" tuples
- extracts "good" new tuples
- pattern weight ~ goodness
Edge weight: match/similarity of tuple context to pattern.

37 How to Set Node Weights
Constraint violation (from before):
- Conf(P) = Log(Pos) * Pos/(Pos+Neg)
- Conf(T) = 1 - Π_i (1 - Conf(P_i) * Match(C_i, P_i)), as on slide 33
HITS [Hassan et al., EMNLP 2006] (sketched below):
- Conf(P) = Σ Conf(T) over tuples the pattern extracts
- Conf(T) = Σ Conf(P) over patterns extracting the tuple
URNS [Downey et al., IJCAI 2005]
EM-Spy [Agichtein, SDM 2006]:
- Unknown tuples = Neg
- Compute Conf(P), Conf(T)
- Iterate
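A sketch of the HITS-style variant on the bipartite pattern-tuple graph; the graph representation and the iteration/normalization details are assumptions, not taken from the cited paper.

```python
# Pattern and tuple scores reinforce each other through extraction edges.
def hits_confidence(edges, n_iter=50):
    """edges: list of (pattern_id, tuple_id) extraction events."""
    patterns = {p for p, _ in edges}
    tuples_ = {t for _, t in edges}
    p_conf = {p: 1.0 for p in patterns}
    t_conf = {t: 1.0 for t in tuples_}
    for _ in range(n_iter):
        # Conf(P) = sum of Conf(T) over tuples it extracts, and vice versa.
        p_conf = {p: sum(t_conf[t] for q, t in edges if q == p) for p in patterns}
        t_conf = {t: sum(p_conf[p] for p, u in edges if u == t) for t in tuples_}
        for d in (p_conf, t_conf):           # normalize to avoid blow-up
            norm = max(d.values()) or 1.0
            for k in d:
                d[k] /= norm
    return p_conf, t_conf
```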

38 Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy algorithm:
- "Hide" labels for some seed tuples
- Iterate the EM algorithm to convergence on tuple/pattern confidence values
- Set threshold t such that t retains > 90% of the "spy" tuples
- Re-initialize Snowball using the new seed tuples
Organization        | Headquarters  | Initial | Final
Microsoft           | Redmond       | 1       | 1
IBM                 | Armonk        | 1       | 0.8
Intel               | Santa Clara   | 1       | 0.9
AG Edwards          | St Louis      | 0       | 0.9
Air Canada          | Montreal      | 0       | 0.8
7th Level           | Richardson    | 0       | 0.8
3Com Corp           | Santa Clara   | 0       | 0.8
3DO                 | Redwood City  | 0       | 0.7
3M                  | Minneapolis   | 0       | 0.7
MacWorld            | San Francisco | 0       | 0.7
157th Street        | Manhattan     | 0       | 0.52
15th Party Congress | China         | 0       | 0.3
15th Century Europe | Dark Ages     | 0       | 0.1
…
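A minimal sketch of the spy-based threshold-selection step; the data and the exact coverage rule below are illustrative.

```python
# Pick the confidence threshold that keeps at least 90% of the hidden seeds.
def spy_threshold(conf, spies, coverage=0.9):
    """conf: tuple -> confidence after EM; spies: hidden seed tuples."""
    spy_scores = sorted((conf[s] for s in spies), reverse=True)
    k = int(coverage * len(spy_scores))      # index covering 90% of spies
    return spy_scores[min(k, len(spy_scores) - 1)]

conf = {"ibm": 0.8, "intel": 0.9, "msft": 0.8, "157th st": 0.52}
print(spy_threshold(conf, ["ibm", "intel", "msft"]))  # -> 0.8
```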

39 Adapting Snowball for New Relations
Large parameter space:
- Initial seed tuples (randomly chosen, multiple runs)
- Acceptor features: words, stems, n-grams, phrases, punctuation, POS
- Feature selection techniques: OR, NB, Freq, "support", combinations
- Feature weights: TF*IDF, TF, TF*NB, NB
- Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
Automatically estimate parameter values:
- Estimate operating parameters based on occurrences of seed tuples
- Run cross-validation on hold-out sets of seed tuples for optimal performance
- Seed occurrences that do not have close "neighbors" are discarded

40 Example Task: DiseaseOutbreaks [SDM 2006]
Proteus: 0.409; Snowball: 0.415.

41 Snowball Used in Various Domains
- News: NYT, WSJ, AP [DL'00, SDM'06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDR, Micromedex, … [Thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus [ISMB'03]: Gene and Protein Synonyms

42 Outline
Information extraction overview
Partially supervised information extraction
- Adaptivity
- Confidence estimation
Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
Current and future work
- Inferring and analyzing social networks
- Utility-based extraction tuning
- Multi-modal information extraction and data mining
- Authority/trust/confidence estimation

43 Extracting a Relation from a Large Text Database
Brute-force approach: feed all docs to the information extraction system → expensive for large collections.
- Only a tiny fraction of documents are often useful
- Many databases are not crawlable
- Often a search interface is available, with an existing keyword index
How to identify "useful" documents?

44 An Abstract View of Text-Centric Tasks
1. Retrieve documents from the text database; 2. process documents with the extraction system; 3. extract output tuples.
Task                   | Tuple
Information Extraction | Relation Tuple
Database Selection     | Word (+Frequency)
Focused Crawling       | Web Page about a Topic
[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

45 Executing a Text-Centric Task
Steps: retrieve documents from the database → process documents → extract output tuples.
Similar to the relational world, there are two major execution paradigms:
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., [case fatality rate]), retrieve and process the documents in the results
Unlike the relational world:
- Indexes are only "approximate": the index is on keywords, not on tuples of interest
- Choice of execution plan affects output completeness (not only speed)
→ the underlying data distribution dictates what is best

46 Scan
Scan retrieves and processes documents sequentially (until reaching target recall).
Execution time = |Retrieved Docs| * (R + P), where R is the time to retrieve a document and P the time to process it.
Question: how many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
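A back-of-the-envelope sketch of this cost model; the per-document times are invented, and the document counts are loosely based on the 135,000-document experiment reported on slide 54.

```python
# Execution time = |retrieved docs| * (R + P).
def scan_time(n_docs_retrieved: int, r_secs: float, p_secs: float) -> float:
    """Total time for Scan to retrieve and process n documents."""
    return n_docs_retrieved * (r_secs + p_secs)

# E.g., if ~10% of a 135,000-document archive must be read to hit the target:
print(scan_time(13_500, r_secs=0.1, p_secs=1.0) / 3600, "hours")  # ~4.1 hours
```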

47 Iterative Query Expansion (Iterative Set Expansion)
1. Query the database with seed tuples (e.g., [Ebola AND Zaire]); 2. process the retrieved documents; 3. extract tuples from the docs; 4. augment the seed tuples with new tuples and iterate.
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q, where R is the time to retrieve a document, P the time to process it, and Q the time to answer a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach target recall?

48 QXtract: Querying Text Databases for Robust Scalable Information EXtraction
Problem: learn keyword queries to retrieve "promising" documents.
Flow: user-provided seed tuples → Query Generation → queries → promising documents → Information Extraction System → extracted relation.
Seed tuples:
DiseaseName | Location | Date
Malaria     | Ethiopia | Jan. 1995
Ebola       | Zaire    | May 1995
Extracted relation:
DiseaseName     | Location | Date
Malaria         | Ethiopia | Jan. 1995
Ebola           | Zaire    | May 1995
Mad Cow Disease | The U.K. | July 1995
Pneumonia       | The U.S. | Feb. 1995

49 Learning Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples (seed sampling).
2. Label the sample documents using the information extraction system as "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.
Flow: user-provided seed tuples → Seed Sampling → Information Extraction System → Classifier Training → Query Generation → queries.

50 Training Classifiers to Recognize "Useful" Documents
Document features: words.
D1 (+): disease, reported, epidemic, expected, area
D2 (+): virus, reported, expected, infected, patients
D3 (-): products, made, used, exported, far
D4 (-): past, old, home runs, sponsored, event
Resulting models:
- Ripper rule: disease AND reported => USEFUL
- SVM feature weights: virus 3, infected 2, sponsored (negative)
- Okapi (IR) term ranking: disease, infected, reported, virus, epidemic, products, used, far, exported

51 Generating Queries from Classifiers
- Ripper: [disease AND reported]
- SVM: [virus], [infected], [epidemic]
- Okapi (IR): top-ranked terms (disease, infected, reported, virus, epidemic, …)
- QCombined: [disease and reported], [epidemic], [virus]

52 SIGMOD 2003 Demonstration

53 An Even Simpler Querying Strategy: "Tuples"
1. Convert the given tuples into queries: e.g., <Ebola, Zaire, May 1995> becomes ["Ebola" and "Zaire"]
2. Retrieve the matching documents via a search engine
3. Extract new tuples from the documents (e.g., <Malaria, Ethiopia, Jan. 1995>, <hemorrhagic fever, Africa, May 1995>) and iterate
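A sketch of this iterative loop; `search` and `extract` are hypothetical stand-ins for a search engine and an IE system.

```python
# Turn each known tuple into a keyword query, fetch matches, extract, repeat.
def tuples_strategy(seed_tuples, search, extract, max_iters=10):
    known, frontier = set(seed_tuples), list(seed_tuples)
    for _ in range(max_iters):
        if not frontier:
            break                    # no new tuples: reachability exhausted
        next_frontier = []
        for disease, location in frontier:   # dates omitted from the query
            query = f'"{disease}" AND "{location}"'   # e.g. "Ebola" AND "Zaire"
            for doc in search(query):
                for t in extract(doc):
                    if t not in known:
                        known.add(t)
                        next_frontier.append(t)
        frontier = next_frontier
    return known
```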

54 Comparison of Document Access Methods
QXtract: 60% of the relation extracted from 10% of the documents of a 135,000 newspaper article database.
Tuples strategy: recall at most 46%.

55 Predicting Recall of the Tuples Strategy
Starting from a seed tuple can end in SUCCESS (most of the relation is reached) or FAILURE (the iteration dies out). Can we predict whether Tuples will succeed? [WebDB 2003]

56 Using the Querying Graph for Analysis
[Figure: bipartite graph between tuples t1…t5 and documents d1…d5.]
We need to compute:
- the number of documents retrieved after sending Q tuples as queries (estimates time)
- the number of tuples that appear in the retrieved documents (estimates recall)
To estimate these we need to compute:
- the degree distribution of the tuples discovered by retrieving documents
- the degree distribution of the documents retrieved by the tuples
(Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees.)

57 Information Reachability Graph
From the bipartite querying graph (tuples t1…t5, documents d1…d5) we derive a reachability graph over tuples: t1 retrieves document d1, which contains t2, so there is an edge t1 → t2. Here t2, t3, and t4 are "reachable" from t1.

58 Connected Components
- Core: tuples that retrieve other tuples and themselves
- Out: reachable tuples that do not retrieve tuples in the Core
- In: tuples that retrieve other tuples but are not themselves reachable

59 Sizes of Connected Components
[Figure: In → Core → Out structure, with the Core strongly connected; starting tuple t0.]
How many tuples are in the largest Core + Out?
Conjecture:
- Degree distribution in reachability graphs follows a "power law."
- Then the reachability graph has at most one giant component.
Define Reachability as the fraction of tuples in the largest Core + Out.

60 NYT Reachability Graph: Outdegree Distribution
[Figure: outdegree distributions for MaxResults=10 and MaxResults=50; both match the power-law distribution.]

61 NYT: Component Size Distribution
[Figure: component size distributions for MaxResults=10 (C_G / |T| = 0.297, not "reachable") and MaxResults=50 (C_G / |T| = 0.620, "reachable").]

62 Connected Components Visualization
[Figure: DiseaseOutbreaks, New York Times 1995.]

63 Estimating Reachability
In a power-law random graph G, a giant component C_G emerges when d (the average outdegree) is greater than 1, provided the power-law exponent β < 3.457 (Chung and Lu, Annals of Combinatorics, 2002).
Estimate: Reachability ~ C_G / |T|, which depends only on d (the average outdegree).

64 Estimating Reachability: Algorithm
1. Pick some random tuples.
2. Use the tuples to query the database.
3. Extract tuples from the matching documents to compute the reachability graph edges.
4. Estimate the average outdegree (the example graph over t1…t4 and d1…d4 has d = 1.5).
5. Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002.
(A sketch of steps 1-4 appears below.)
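A sketch of steps 1-4; `search` and `extract` are again hypothetical stand-ins, and the sample size is an assumption.

```python
# Estimate the average outdegree d of the tuple reachability graph by
# sampling: by the Chung-Lu result cited above, a giant component (and hence
# non-trivial reachability) is expected roughly when d > 1.
import random

def estimate_outdegree(all_tuples, search, extract, sample_size=50):
    sample = random.sample(all_tuples, min(sample_size, len(all_tuples)))
    degrees = []
    for t in sample:
        reached = set()
        for doc in search(t):          # documents matching the tuple-query
            reached.update(extract(doc))
        reached.discard(t)             # count only edges to *other* tuples
        degrees.append(len(reached))
    return sum(degrees) / len(degrees)  # d; the slide's example has d = 1.5
```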

65 Estimating Reachability of NYT
Estimated reachability ≈ 0.46; the approximate reachability is reached after ~50 queries. This can be used to predict the success (or failure) of a Tuples querying strategy.

66 Outline
Information extraction overview
Partially supervised information extraction
- Adaptivity
- Confidence estimation
Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
Current and future work
- Adaptive information extraction and tuning
- Authority/trust/confidence estimation
- Inferring and analyzing social networks
- Multi-modal information extraction and data mining

67 Goal: Detect, Monitor, Predict Outbreaks
(Reprise of slide 4: patient records, 911 calls, and news feeds flow through separate IE systems into data integration, data mining, and trend analysis for detection, monitoring, and prediction.)

68 Adaptive, Utility-Driven Extraction
Extract relevant symptoms and modifiers from text:
- physician notes, patient narratives, call transcripts
Call transcripts are a difficult extraction problem:
- not grammatical; dialogue; speech→text is unreliable, …
- use partially supervised techniques to learn extraction patterns
One approach:
- Link together (when possible) the call transcript and the patient record (e.g., by time, address, and patient name)
- Correlate patterns in transcripts with diagnoses/symptoms
- Fine-grained learning: can automatically train for each symptom, group of patients, etc.

69 Authority, Trust, Confidence
How reliable are the signals emitted by information extraction? Dimensions of trust/confidence:
- Source reliability: diagnosis vs. notes vs. 911 calls
- Tuple extraction confidence
- Source extraction difficulty

70 Source Confidence Estimation [CIKM 2005]
An extraction task is "easy" when context term distributions diverge from the background (example context: "President George W Bush's three-day visit to India").
- Quantify this as relative entropy (Kullback-Leibler divergence); a sketch follows below
- After calibration, the metric predicts whether a task is "easy" or "hard"
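A sketch of the divergence computation; the smoothing constant and the toy counts are invented, and the calibration step from the paper is not shown.

```python
# D_KL(context || background) = sum_t p(t) * log(p(t) / q(t)).
import math

def kl_divergence(context: dict, background: dict, eps: float = 1e-9) -> float:
    """Relative entropy of context term counts vs. background term counts."""
    terms = set(context) | set(background)
    z_p = sum(context.values()) or 1.0
    z_q = sum(background.values()) or 1.0
    kl = 0.0
    for t in terms:
        p = (context.get(t, 0.0) + eps) / z_p   # smoothed probabilities
        q = (background.get(t, 0.0) + eps) / z_q
        kl += p * math.log(p / q)
    return kl

context = {"president": 5, "visit": 3, "minister": 2}
background = {"president": 1, "visit": 1, "the": 50, "of": 40}
print(round(kl_divergence(context, background), 2))  # large value -> "easy"
```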

71 Inferring Social Networks
Explicit networks:
- Patient records: family and geographical entities in structured and unstructured portions
Implicit connections:
- Extract events (e.g., "went to restaurant X yesterday")
- Extract relationships (e.g., "I work in Kroeger's in Toco Hills")

72 Modeling Social Networks for Epidemiology, Security, …
[Figure: email exchange mapped onto cubicle locations.]

73 Improve Prediction Accuracy
Suppose we managed to:
- automatically identify people currently sick or about to get sick
- automatically infer (part of) their social network
Can we improve prediction of the dynamics of an outbreak?

74 Multimodal Information Extraction and Data Mining
Develop joint models over structured data and text:
- e.g., lab results and symptoms extracted from text
One approach: mutual reinforcement (sketched below)
- Co-training: train classifiers on redundant views of the data (e.g., structured & unstructured)
- Bootstrap on examples proposed by both views
More generally: graphical models
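A sketch of the co-training loop described above; the scikit-learn-style classifier interface, the agreement rule, and the thresholds are assumptions, not from the talk.

```python
# Two classifiers trained on redundant views label unlabeled cases for each
# other; only confident, agreeing predictions are added to the training set.
def co_train(clf_a, clf_b, labeled_a, labeled_b, labels, unlab_a, unlab_b,
             rounds=5, conf_thresh=0.9):
    import numpy as np
    X_a, X_b, y = list(labeled_a), list(labeled_b), list(labels)
    pool = list(range(len(unlab_a)))
    for _ in range(rounds):
        clf_a.fit(np.array(X_a), np.array(y))     # view A: e.g., lab results
        clf_b.fit(np.array(X_b), np.array(y))     # view B: e.g., text symptoms
        added = []
        for i in pool:
            pa = clf_a.predict_proba([unlab_a[i]])[0]
            pb = clf_b.predict_proba([unlab_b[i]])[0]
            if (max(pa) > conf_thresh and max(pb) > conf_thresh
                    and pa.argmax() == pb.argmax()):
                X_a.append(unlab_a[i])
                X_b.append(unlab_b[i])
                y.append(int(pa.argmax()))
                added.append(i)
        pool = [i for i in pool if i not in added]
        if not added:                              # no agreement: stop early
            break
    return clf_a, clf_b
```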

75 Summary
Information extraction overview
Partially supervised information extraction
- Adaptivity
- Confidence estimation
Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
Current and future work
- Adaptive information extraction and tuning
- Authority/trust/confidence estimation
- Inferring and analyzing social networks
- Multi-modal information extraction and data mining

76 Thank You
Details, papers, and other talk slides: http://www.mathcs.emory.edu/~eugene/

