1/44 DB & IR: Both Sides Now
Gerhard Weikum
weikum@mpi-inf.mpg.de
http://www.mpi-inf.mpg.de/~weikum/
in collaboration with Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald
2/44 DB and IR: Two Parallel Universes

                        Database Systems                       Information Retrieval
canonical application:  accounting                             libraries
data type:              numbers, short strings                 text
foundation:             algebraic / logic based                probabilistic / statistics based
search paradigm:        Boolean retrieval                      ranked retrieval
                        (exact queries, result sets/bags)      (vague queries, result lists)
market leaders:         Oracle, IBM DB2, MS SQL Server, etc.   Google, Yahoo!, MSN, Verity, Fast, etc.

Parallel universes forever?
3/44 Why DB&IR Now? - Application Needs

Simplify life for application areas like:
- Global health-care management for monitoring epidemics
- News archives for journalists, press agencies, etc.
- Product catalogs for houses, cars, vacation places, etc.
- Customer support & CRM in insurance, telecom, retail, software, etc.
- Bulletin boards for social communities
- Enterprise search for projects, skills, know-how, etc.
- Personalized & collaborative search in digital libraries, Web, etc.
- Comprehensive archive of blogs with time-travel search

Typical data:
Disease (DId, Name, Category, Pathogen, ...)    UMLS-Categories (...)
Patient (..., Age, HId, Date, Report, TreatedDId)    Hospital (HId, Address, ...)

Typical query: symptoms of tropical virus diseases and reported anomalies with young patients in central Europe in the last two weeks
4/44 Why DB&IR Now? - Platform Desiderata

Platform desiderata (from the app developer's viewpoint):
- Flexible ranking on text, categorical, and numerical attributes, to cope with "too many answers" and "no answers"
- High update rate concurrently with high query load
- Ontologies (dimensions, facets) for products, locations, organizations, etc., for query rewriting (relaxation, strengthening)
- Complex queries combining text & structured attributes: XPath/XQuery Full-Text with ranking

The design space, by data and search paradigm:
- Structured data (records) + structured search (SQL, XQuery): DB systems
- Unstructured data (documents) + unstructured search (keywords): IR systems, search engines
- Structured data + unstructured search: keyword search on relational graphs (IIT Bombay, UCSD, MSR, Hebrew U, CU Hong Kong, Duke U, ...)
- Unstructured data + structured search: querying entities & relations from IE (MSR Beijing, UW Seattle, IBM Almaden, UIUC, MPI, ...)
- Spanning all four quadrants: an integrated DB&IR platform
5/44 Why DB&IR Forever?

Turn the Web, Web 2.0, and Web 3.0 into the world's most comprehensive knowledge base ("semantic DB")!
- Data enrichment at very large scale
- Text and speech are key sources of knowledge production (publications, patents, conferences, meetings, ...)

Growth figures:                   2000          2007
indexed Web pages                 2 billion     20 billion
Flickr photos                     ---           100 million
digital photos                    ?             150 billion
Wikipedia articles                8,000         1.8 million
OECD researchers                  7.4 million   8.4 million
patents world-wide                ?             60 million
US Library of Congress (items)    115 million   134 million
Google Scholar (documents)        ---           500 million
6/44 Outline
- Past: Matter, Antimatter, and Wormholes
- Present: XML and Graph IR
- Future: From Data to Knowledge
7/44 Quiz Time
Gerard Salton: in which country was he born (and did he grow up)?
A: USA  B: England  C: Netherlands  D: Germany  E: Singapore  F: Indonesia
Answer: D: Germany
Gerard Salton, 1927 - 1995; Prof., Cornell Univ., 1965 - 1995
8/44 Parallel Universes: A Closer Look

                 Matter (DB)                             Antimatter (IR)
user:            programmer                              your kids
query:           precise spec of info request            approximation of user's real info needs
interaction:     via API                                 process via GUI
strength:        indexing, query processing              ranking model
weakness:        user model                              interoperability
eval. measure:   efficiency (throughput, response        effectiveness (precision, recall, F1, MAP, NDCG,
                 time, TPC-H, XMark, ...)                TREC & INEX benchmarks, ...)
9/44 DB & IR: Both Sides Now, a Timeline 1990 - 2005

[timeline figure spanning the DB (top) and IR (bottom) research communities]
Milestones, roughly in chronological order: VAGUE (Motro); Prob. DB (Cavallo & Pittarelli); Prob. Tuples (Barbara et al.); Proximal Nodes (Baeza-Yates et al.); multimedia IR; structured docs; Prob. Datalog (Fuhr et al.); Web query languages (W3QS, WebOQL, Araneus, ...); semistructured data (Lore, Xyleme, ...); WHIRL (Cohen); digital libraries; 1st-gen XML IR (XXL, XIRQL, Elixir, JuruXML); deep-Web search; faceted search (Flamenco, ...); INEX; XPath Full-Text; 2nd-gen XML IR (XRank, Timber, TIJAH, XSearch, FleXPath, CoXML, TopX, MarkLogic, Fast, ...); uncertain & probabilistic relations (Mystiq, Trio, ...); Web entity search (Libra, Avatar, ExDB, ...); graph IR
10/44 WHIRL: IR over Relations [W.W. Cohen: SIGMOD'98]

Add text-similarity selection and join to relational algebra.

Example, over Movies (Title, Plot, ..., Year) and Reviews (Title, Comment, ..., Rating):
Select * From Movies M, Reviews R
Where M.Plot ~ "fight" And M.Year > 1990 And R.Rating > 3
And M.Title ~ R.Title And M.Plot ~ R.Comment

The example instance pairs near-duplicate titles (Matrix / Matrix 1, Hero / Ying xiong aka. Hero, Matrix Reloaded, Matrix Eigenvalues, Shrek 2) whose plots and comments share terms like "fight", "sword fight", "fight training"; only a similarity join, not an exact join, brings them together.

Scoring and ranking:
s(<x,y>, q: A ~ B) = cosine(x.A, y.B) over tf*idf term vectors, where the weight of word j in x is proportional to tf(word j in x) * idf(word j), with dampening & normalization; the scores of the conditions q1, ..., qm are aggregated into s(<x,y>, q1 AND ... AND qm).

DB&IR for query-time data integration. More recent work: MinorThird, Spider, DBLife, etc. But the scoring models are fairly ad hoc.
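To make the similarity join concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the toy tables and the title-only join are illustrative, not Cohen's actual implementation.

```python
# A WHIRL-style text-similarity join: rank pairs by cosine similarity of
# tf-idf vectors instead of requiring exact key equality.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies = [("Matrix", "computer hacker Neo gets fight training", 1999),
          ("Hero", "in ancient China, sword fights", 2002)]
reviews = [("Matrix 1", "cool fights and new techniques", 5),
           ("Ying xiong aka. Hero", "sword fight, dramatic colors", 4)]

# One shared vocabulary for both columns so the vectors are comparable.
vec = TfidfVectorizer().fit([t for t, _, _ in movies] + [t for t, _, _ in reviews])
m_vecs = vec.transform([t for t, _, _ in movies])
r_vecs = vec.transform([t for t, _, _ in reviews])

sims = cosine_similarity(m_vecs, r_vecs)  # sims[i][j] = join score of movie i with review j
pairs = sorted(((sims[i][j], m[0], r[0])
                for i, m in enumerate(movies)
                for j, r in enumerate(reviews)),
               reverse=True)
for score, m_title, r_title in pairs:
    print(f"{score:.2f}  {m_title}  ~  {r_title}")
```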
11/44 XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB'00]

Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?

Union of heterogeneous sources without a global schema, e.g. two documents:
- Professor: Name: Gerhard Weikum; Address: ... City: SB, Country: Germany; Teaching: Course (Title: IR, Description: Information retrieval ..., Syllabus ...), Book, Article ...; Research: Project (Title: Intelligent Search of Heterogeneous XML Data, Funding: EU) ...
- Lecturer: Name: Ralf Schenkel; Address: Max-Planck Institute for Informatics, Germany; Activities: Seminar (Contents: Ranked retrieval ..., Literature: ...), Scientific (Name: INEX task coordinator (Initiative for the Evaluation of XML ...)), Other (Sponsor: EU) ...

Similarity-aware XPath:
//~Professor [//* = "~SB"]
             [//~Course [//* = "~IR"]]
             [//~Research [//* = "~XML"]]
12/44 XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB'00]

Motivation: the union of heterogeneous sources has no schema (same example documents as above).

Similarity-aware XPath:
//~Professor [//* = "~Saarbruecken"]
             [//~Course [//* = "~IR"]]
             [//~Research [//* = "~XML"]]

Scoring and ranking:
- tf*idf for content conditions
- ontological similarity for relaxed tag conditions:
  Wu & Palmer: based on the length of the path through lca(x, y) in the concept hierarchy
  Dice coefficient on Web co-occurrence: 2 #(x,y) / (#x + #y)
- score aggregation assuming probabilistic independence

Query expansion model: disjunction of tags over the concept neighborhood of "professor": teacher, scholar; academic, academician, faculty member; scientist, researcher (HYPONYM, 0.749); investigator; mentor; lecturer (RELATED, 0.48); with drift candidates farther out: intellectual, artist, magician, wizard, alchemist, director, primadonna.
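A small sketch of the scoring side, with made-up co-occurrence counts feeding the Dice coefficient and the slide's multiplicative aggregation under independence; all numbers are purely illustrative.

```python
# Dice coefficient over (hypothetical) Web co-occurrence counts:
# sim(x, y) = 2 * #(x,y) / (#x + #y)
def dice(count_x, count_y, count_xy):
    return 2.0 * count_xy / (count_x + count_y)

# Made-up statistics for tag similarity under relaxation.
tag_sim = {
    ("professor", "professor"): 1.0,
    ("professor", "lecturer"):  dice(5_000_000, 2_000_000, 480_000),
    ("professor", "researcher"): dice(5_000_000, 8_000_000, 900_000),
}
# tf-idf scores of the matched content conditions (also made up).
content_sim = {"~SB": 0.9, "~IR": 0.7, "~XML": 0.8}

# XXL-style aggregation: multiply per-condition scores, assuming independence.
def query_score(tag_matches, content_matches):
    score = 1.0
    for pair in tag_matches:
        score *= tag_sim[pair]
    for cond in content_matches:
        score *= content_sim[cond]
    return score

# The Lecturer document matches the relaxed ~Professor query:
print(round(query_score([("professor", "lecturer")], ["~SB", "~IR", "~XML"]), 3))
```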
13/44 The Past: Lessons Learned

DB & IR: flexible ranking added to (semi-)structured querying copes with schema and instance diversity, but:
- ranking seems "ad hoc" and is not consistently good in benchmarks; winning a benchmark needs tuning, and tuning is easier if the ranking is principled!
- ontologies are a mixed blessing: quality is diverse, concept similarity is subtle, and there is a danger of topic drift [figure: precision/recall trade-off; WordNet paths from apple via edible fruit, produce, food, solid, substance up to entity, and aside to Golden Delicious, pome, gold, element]
- ontology-based query expansion (into large disjunctions) poses an efficiency challenge:
  //~Professor[...] becomes //{Professor, Researcher, Lecturer, Scientist, Scholar, Academic, ...}[...]
14/44 Outline
- Past: Matter, Antimatter, and Wormholes
- Present: XML and Graph IR
- Future: From Data to Knowledge
15/44 Quiz Time
Which is the largest XML data collection in the universe?
A: Yahoo! Answers  B: INEX benchmark  C: Derwent WPI  D: Elsevier Scopus  E: 51.com  F: Traffic violations in EU
Answer: C: Derwent WPI
16/44 TopX: 2nd Generation XML IR [Martin Theobald, Ralf Schenkel, GW: VLDB'05, VLDB Journal]

"Semantic" XPath Full-Text query:
/Article [ftcontains(//Person, "Max Planck")]
         [ftcontains(//Work, "quantum physics")]
//Children[@Gender = "female"]//Birthdates

- Exploits tags & structure for better precision
- Can relax tag names & structure for better recall
- Principled ranking by probabilistic IR (Okapi BM25 for XML)
- Efficient top-k query processing (using an improved threshold algorithm, TA)
- Robust ontology integration (self-throttling to avoid topic drift)
- Efficient query expansion (on demand, by an extended TA)
- Relevance feedback for automatic query rewriting

Supported by the TopX engine: http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/ and http://topx.sourceforge.net
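For the top-k machinery, here is a minimal sketch of the classic Fagin-style threshold algorithm (TA) that TopX's engine improves on: scan sorted index lists round-robin, resolve each newly seen document by random access, and stop once the k-th best score reaches the threshold. The index contents are toy data; real TopX adds probabilistic score predictions and candidate queues.

```python
import heapq

# Per-term index lists, sorted by descending score: (score, doc_id).
index = {
    "quantum": [(0.9, "d1"), (0.8, "d3"), (0.3, "d2")],
    "physics": [(0.7, "d3"), (0.6, "d2"), (0.2, "d1")],
}

def ta_top_k(lists, k):
    exact = {}  # doc_id -> fully aggregated score (after random accesses)
    for depth in range(max(len(l) for l in lists)):
        for l in lists:                       # sorted access: one step in each list
            if depth < len(l):
                _, doc = l[depth]
                if doc not in exact:          # random access into all other lists
                    exact[doc] = sum(s for other in lists for s, d in other if d == doc)
        # Best possible score of any still-unseen document:
        threshold = sum(l[min(depth, len(l) - 1)][0] for l in lists)
        top = heapq.nlargest(k, exact.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                        # early termination: top-k is final
    return heapq.nlargest(k, exact.items(), key=lambda kv: kv[1])

print(ta_top_k(list(index.values()), 2))      # -> [('d3', 1.5), ('d1', 1.1)]
```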
17/44 Commercial Break [Martin Theobald, Ralf Schenkel, GW: VLDB'05]
TopX demo today, 3:30 - 5:30
18/44 Principled Ranking by Probabilistic IR

"God does not play dice." (Einstein) But IR does.

Odds for item d with terms d_i being relevant for query q = {q_1, ..., q_m}, assuming binary features and conditional independence of features [Robertson & Sparck-Jones 1976]:

O(R | d, q)  is proportional to  the product over terms i occurring in both d and q of  p_i (1 - q_i) / ( q_i (1 - p_i) )

with p_i = P[d_i = 1 | relevant] and q_i = P[d_i = 1 | not relevant].

Now estimate the p_i and q_i values from relevance feedback, pseudo-relevance feedback, and corpus statistics by MLE (with statistical smoothing), and store the precomputed p_i, q_i in the index.

- closely related to tf*idf
- related to, but different from, statistical language models
- led to Okapi BM25 (wins TREC tasks)
- adapted and extended to XML in TopX, ...
19/44 Probabilistic Ranking for SQL [S. Chaudhuri, G. Das, V. Hristidis, GW: TODS'06]

SQL queries that return many answers need ranking. Examples:

Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, ...)
Select * From Houses Where View = "Lake" And City In ("Redmond", "Bellevue")

Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, ...)
Select * From Movies Where Genre = "Romance" And Era = "90s"

Rank by the odds that a tuple d, with specified attributes X and unspecified attributes Y, is relevant for the query q: X_1 = x_1 AND ... AND X_m = x_m. Estimate the probabilities by exploiting a workload W of past queries.

Example: frequent workload queries
... Where Genre = "Romance" And Actor1 = "Hugh Grant"
... Where Actor1 = "Hugh Grant" And Actor2 = "Julia Roberts"
boost Hugh Grant and Julia Roberts movies in the ranking for Genre = "Romance" And Era = "90s".
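A naive-Bayes-flavored sketch of the workload idea: score a result tuple's unspecified attribute values by how often they co-occur with the query's conditions in past queries. The toy workload, the smoothing, and the scoring form are assumptions for illustration; the TODS'06 model is more refined.

```python
# Hypothetical workload: each past query as a set of (attribute, value) conditions.
workload = [
    {("Genre", "Romance"), ("Actor1", "Hugh Grant")},
    {("Actor1", "Hugh Grant"), ("Actor2", "Julia Roberts")},
    {("Genre", "Romance"), ("Actor1", "Hugh Grant")},
    {("Genre", "Comedy"), ("Actor1", "Jim Carrey")},
]

def workload_score(query_conds, tuple_attrs):
    """Score a result tuple by workload affinity of its unspecified attributes."""
    score = 1.0
    for attr_val in tuple_attrs - query_conds:   # unspecified (attribute, value) pairs
        co = sum(1 for w in workload if attr_val in w and query_conds & w)
        base = sum(1 for w in workload if query_conds & w)
        score *= (co + 0.5) / (base + 1.0)       # smoothed conditional frequency
    return score

q = {("Genre", "Romance")}
t1 = {("Genre", "Romance"), ("Actor1", "Hugh Grant")}
t2 = {("Genre", "Romance"), ("Actor1", "Someone Else")}
print(workload_score(q, t1), ">", workload_score(q, t2))  # Hugh Grant movie ranks higher
```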
20/44 From Tables and Trees to Graphs

Schema-agnostic keyword search over multiple tables: a graph of tuples with foreign-key relationships as edges [BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS]

Example:
Conferences (CId, Title, Location, Year)    Journals (JId, Title)
CPublications (PId, Title, CId)             JPublications (PId, Title, Vol, No, Year)
Authors (PId, Person)                       Editors (CId, Person)

Select * From * Where * Contains "Gray, DeWitt, XML, Performance" And Year > 95

A result is a connected tree whose nodes contain as many query keywords as possible.
Ranking: nodeScore based on tf*idf or probabilistic IR; edgeScore reflecting the importance of relationships (or confidence, authority, etc.).
Top-k querying: compute the best trees, e.g. Steiner trees (NP-hard).

Related use cases: XML beyond trees, RDF graphs, ER graphs (e.g. from IE), social networks.
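A minimal sketch of the graph-search skeleton, in the spirit of BANKS-style backward expansion but heavily simplified: unweighted BFS from each keyword's hit nodes, with distance sums as a stand-in for node/edge scores. The tuple graph and hit lists are toy data.

```python
from collections import deque

# Tuple graph: node -> neighbors (foreign-key links, treated as undirected).
graph = {
    "paperXML":  ["authorGray", "conf1"],
    "paperPerf": ["authorDeWitt", "conf1"],
    "authorGray": ["paperXML"],
    "authorDeWitt": ["paperPerf"],
    "conf1": ["paperXML", "paperPerf"],
}
# keyword -> tuples whose text contains it
hits = {"Gray": ["authorGray"], "DeWitt": ["authorDeWitt"], "XML": ["paperXML"]}

def best_root(graph, hits):
    """BFS backward from each keyword's hits; pick the node reachable from all
    keywords with the smallest total distance (a proxy for answer-tree size)."""
    dist = {}
    for kw, nodes in hits.items():
        d, queue = {n: 0 for n in nodes}, deque(nodes)
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[kw] = d
    candidates = [n for n in graph if all(n in d for d in dist.values())]
    return min(candidates, key=lambda n: sum(d[n] for d in dist.values()), default=None)

print(best_root(graph, hits))  # -> 'paperXML' roots the smallest connecting tree here
```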
21/44 The Present: Observations & Opportunities

- Probabilistic IR and statistical language models yield principled ranking and high effectiveness (related to probabilistic relational models (Suciu, Getoor, ...), but different)
- Structural similarity and ranking based on tree edit distance (FleXPath, Timber, ...), e.g. relaxing an actor-movie-director-plot tree pattern
- Aim for a comprehensive XML ranking model capturing content, structure, and ontologies
- Aim to generate the structure skeleton of an XPath query from user feedback, e.g. from the keyword query "life physicist Max Planck" to //article[//person "Max Planck"][//category "physicist"]//biography
- Good progress on performance, but still many open efficiency issues
22/44 Outline
- Past: Matter, Antimatter, and Wormholes
- Present: XML and Graph IR
- Future: From Data to Knowledge
23/44 Quiz Time
Who said: "Information is not knowledge. Knowledge is not wisdom. Wisdom is not truth. Truth is not beauty. Beauty is not love. Love is not music. Music is the best."?
A: Richard Feynman  B: Sigmund Freud  C: Larry Page  D: Frank Zappa  E: Marie Curie  F: Lao-tse
Answer: D: Frank Zappa
24/44 Knowledge Queries

Turn the Web, Web 2.0, and Web 3.0 into the world's most comprehensive knowledge base ("semantic DB")!
Answer "knowledge queries" such as:
- Nobel laureate who survived both world wars, and his children
- drama with three women making a prophecy to a British nobleman that he will become king
- proteins that inhibit both protease and some other enzyme
- connection between Thomas Mann and Goethe
- differences in Rembetiko music from Greece and from Turkey
- neutron stars with X-ray bursts > 10^40 erg s^-1 & black holes in 10''
- market impact of Web 2.0 technology in December 2006
- sympathy or antipathy for Germany from May to August 2006
25/44 Three Roads to Knowledge

1. Handcrafted high-quality knowledge bases (Semantic-Web-style ontologies, encyclopedias, etc.)
2. Large-scale information extraction & harvesting (using pattern matching, NLP, statistical learning, etc., for product search, Web entity/object search, ...)
3. Social wisdom from Web 2.0 communities (social tagging, folksonomies, human computing, e.g. del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)
26/44 High-Quality Knowledge Sources (growing with strong momentum)

- Universal "common-sense" ontologies:
  SUMO (Suggested Upper Merged Ontology): 60,000 OWL axioms
  Cyc: 5 million facts (OpenCyc: 2 million facts)
- Domain-specific ontologies:
  UMLS (Unified Medical Language System): 1 million biomedical concepts, 135 categories, 54 relations (e.g. virus causes disease | symptom)
  GeneOntology, etc.
- Thesauri and concept networks:
  WordNet: 200,000 concepts (word senses) with hypernym/hyponym relations; can be cast into OWL-lite (or a typed graph with statistical weights)
- Lexical sources:
  Wikipedia (1.8 million articles, 40 million links, 100 languages), etc.
- Hand-tagged natural-language corpora:
  TEI (Text Encoding Initiative) markup of a historic encyclopedia
  FrameNet: sentences classified into frames with semantic roles
27/44 High-Quality Knowledge Sources

General-purpose thesauri and concept networks: the WordNet family, e.g.:

enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
  => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...)
    => macromolecule, supermolecule ...
      => organic compound -- (any compound of carbon and another element or a radical) ...
  => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
    => activator -- ((biology) any agency bringing about activation; ...)

Can be cast into OWL-lite or into a graph, with weights for relation strengths (derived from co-occurrence statistics).
28/44 High-Quality Knowledge Sources: Wikipedia and other lexical sources
29/44 Exploit Hand-Crafted Knowledge: Wikipedia, WordNet, and other lexical sources

{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]], [[Humboldt-Universität zu Berlin]], [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]] ...
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918) ...
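A sketch of how such infobox markup can be mined into knowledge triples; the attribute-to-relation mapping and the regexes are illustrative only, and YAGO's real extractors are far more careful.

```python
import re

infobox = """
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| doctoral_advisor = [[Philipp von Jolly]]
"""

# Hypothetical mapping from infobox attributes to relation names.
relation_of = {"birth_date": "bornOn", "birth_place": "bornIn",
               "doctoral_advisor": "hasDoctoralAdvisor"}

entity = "Max_Planck"
triples = []
for line in infobox.strip().splitlines():
    attr, _, value = line.lstrip("| ").partition(" = ")
    if attr in relation_of:
        # Take the first wiki-link target as the object entity.
        links = re.findall(r"\[\[([^\]|]+)", value)
        if links:
            triples.append((relation_of[attr], entity, links[0].replace(" ", "_")))

for t in triples:
    print("%s(%s, %s)" % t)
# -> bornOn(Max_Planck, April_23), bornIn(Max_Planck, Kiel), ...
```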
30/44 YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, GW: WWW 2007]

Turn Wikipedia into an explicit knowledge base (semantic DB):
- Exploit hand-crafted categories and templates
- Represent facts as explicit knowledge triples: relation (entity1, entity2) (in first-order logic, compatible with RDF, OWL-lite, XML, etc.)
- Map (and disambiguate) relations into the WordNet concept DAG

Examples:
relation        entity1      entity2
bornIn          Max_Planck   Kiel
isInstanceOf    Kiel         City
31/44 YAGO Knowledge Representation

Example graph over individuals, concepts, and words:
- Max_Planck --bornOn--> April 23, 1858; --diedOn--> October 4, 1947; --bornIn--> Kiel; --hasWon--> Nobel Prize; --FatherOf--> Erwin_Planck
- Max_Planck --instanceOf--> Physicist --subclass--> Scientist --subclass--> Person --subclass--> Entity; Biologist --subclass--> Scientist; City, Country --subclass--> Location; Kiel --instanceOf--> City
- "Max Planck", "Dr. Planck", "Max Karl Ernst Ludwig Planck" --means--> Max_Planck

Knowledge base sizes (# facts):
KnowItAll   30,000
SUMO        60,000
WordNet     200,000
OpenCyc     300,000
Cyc         5,000,000
YAGO        6,000,000   (accuracy: 97%)

Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
32/44 NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW'07]

Graph-based search on YAGO-style knowledge bases, with built-in ranking based on confidence and informativeness (a statistical language model for result graphs).

Conjunctive queries, e.g.:
$x --isa--> scientist,  $x --bornIn--> Kiel

Queries with regular expressions, e.g.:
$x --isa--> scientist,  $x --(hasFirstName | hasLastName)--> Ling,  $x --worksFor--> $y,  $y --locatedIn*--> Zhejiang
Beng Chin Ooi --(coAuthor | advisor)*--> $x
33/44 Ranking Factors

Confidence: prefer results that are likely to be correct
- certainty of IE; authenticity and authority of sources
- e.g. bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia), vs. livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)

Informativeness: prefer results that are likely important; may also prefer results that are likely new to the user
- estimated from frequency in the answer, frequency in the corpus (e.g. Web), frequency in the query log
- e.g. for q: isa (Einstein, $y), prefer isa (Einstein, scientist) over isa (Einstein, vegetarian); for q: isa ($x, vegetarian), prefer isa (Einstein, vegetarian) over isa (Al Nobody, vegetarian)

Compactness: prefer results that are tightly connected
- size of the answer graph (e.g. a small answer graph over Einstein, Bohr, Nobel Prize, Tom Cruise, 1962 with isa/bornIn/diedIn/won edges)
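One plausible way to fold the three factors into a single score, purely as a sketch: the weights and the log-frequency proxy for informativeness are assumptions, not NAGA's actual model.

```python
import math

def answer_score(confidence, corpus_freq, graph_size,
                 w_conf=0.5, w_info=0.3, w_comp=0.2):
    """Combine confidence, informativeness, and compactness into one score."""
    informativeness = math.log1p(corpus_freq)   # frequent facts count as important
    compactness = 1.0 / graph_size              # smaller answer graphs score higher
    return w_conf * confidence + w_info * informativeness + w_comp * compactness

# isa(Einstein, scientist): well-sourced, frequent claim, compact answer graph
print(answer_score(confidence=0.95, corpus_freq=120_000, graph_size=2))
# livesIn(Elvis, Mars): dubious source, rare claim
print(answer_score(confidence=0.05, corpus_freq=12, graph_size=2))
```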
34/44 Information Extraction (IE): Text to Records

Combine NLP, pattern matching, lexicons, and statistical learning to turn text into records such as:

Person           BirthDate    BirthPlace
Max Planck       4/23, 1858   Kiel
Albert Einstein  3/14, 1879   Ulm
Mahatma Gandhi   10/2, 1869   Porbandar

Person       ScientificResult
Max Planck   Quantum Theory

Person       Collaborator
Max Planck   Albert Einstein
Max Planck   Niels Bohr

Constant            Value            Dimension
Planck's constant   6.626 x 10^-34   Js
35/44 Knowledge Acquisition from the Web

Learn semantic relations from entire corpora at large scale (as exhaustively as possible, but with high accuracy). Examples:
- all cities, all basketball players, all composers
- headquarters of companies, CEOs of companies, synonyms of proteins
- birthdates of people, capitals of countries, rivers in cities
- which musician plays which instruments
- who discovered or invented what
- which enzyme catalyzes which biochemical reaction

Existing approaches and tools (Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], ...) use almost-unsupervised pattern matching and learning:
seeds (known facts) -> patterns (in text) -> (extraction) rules -> (new) facts -> ...
36/44 Methods for Web-Scale Fact Extraction

seeds (known facts) -> occurrences in text -> patterns -> rules -> new facts:

seed                    occurrence in text          pattern
city (Seattle)          in downtown Seattle         in downtown X
city (Seattle)          Seattle and other towns     X and other towns
city (Las Vegas)        Las Vegas and other towns   X and other towns
plays (Zappa, guitar)   playing guitar: ... Zappa   playing Y: ... X
plays (Davis, trumpet)  Davis ... blows trumpet     X ... blows Y

New facts from new text:
in downtown Beijing -> city (Beijing); old center of Beijing (pattern: old center of X) -> city (Beijing)
Coltrane blows sax -> plays (Coltrane, sax); sax player Coltrane (pattern: Y player X) -> plays (Coltrane, sax)

Assessment of facts & generation of rules is based on statistics. Rules can be more sophisticated:
playing NN: (ADJ|ADV)* NP  &  class(NN) = instrument  &  class(head(NP)) = person  =>  plays (head(NP), NN)
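The seeds -> patterns -> facts loop fits in a few lines; here is a toy sketch (real systems such as Snowball and KnowItAll add confidence statistics and pattern pruning to avoid semantic drift):

```python
import re

corpus = [
    "in downtown Seattle", "Las Vegas and other towns",
    "in downtown Beijing", "Beijing and other towns",
    "playing guitar: Zappa", "playing sax: Coltrane",
]
seeds = {("city", "Seattle"), ("city", "Las Vegas"), ("plays", "Zappa")}  # known facts

def learn_patterns(corpus, seeds):
    """Turn each seed occurrence into a pattern by replacing the entity with a wildcard."""
    patterns = set()
    for rel, entity in seeds:
        for sentence in corpus:
            if entity in sentence:
                pat = re.escape(sentence).replace(re.escape(entity), r"([\w ]+)")
                patterns.add((rel, pat))
    return patterns

def extract_facts(corpus, patterns):
    facts = set()
    for rel, pat in patterns:
        for sentence in corpus:
            m = re.fullmatch(pat, sentence)
            if m:
                facts.add((rel, m.group(1)))
    return facts

patterns = learn_patterns(corpus, seeds)
print(extract_facts(corpus, patterns) - seeds)  # newly harvested: {('city', 'Beijing')}
```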
37/44 Performance of Web-IE

State-of-the-art precision/recall results:

relation       precision  recall  corpus     systems
countries      80%        90%     Web        KnowItAll
cities         80%        ???     Web        KnowItAll
scientists     60%        ???     Web        KnowItAll
headquarters   90%        50%     News       Snowball, LEILA
birthdates     80%        70%     Wikipedia  LEILA
instanceOf     40%        20%     Web        Text2Onto, LEILA
Open IE        80%        ???     Web        TextRunner

Precision value chain: entities 80% -> attributes 70% -> facts 60% -> events 50%

Anecdotal evidence (correct vs. erroneous extractions):
invented (A.G. Bell, telephone)            invented (Johannes Kepler, logarithm tables)
married (Hillary Clinton, Bill Clinton)    married (Segolene Royal, Francois Hollande)
isa (yoga, relaxation technique)           isa (yoga, excellent way)
isa (zearalenone, mycotoxin)               isa (your day, good one)
contains (chocolate, theobromine)          contains (chocolate, raisins)
contains (Singapore sling, gin)            plays (the liver, central role)
                                           makes (everybody, mistakes)
38/44 Beyond Surface Learning with LEILA [F. Suchanek, G. Ifrim, GW: KDD'06]

LEILA (Learning to Extract Information by Linguistic Analysis): almost-unsupervised statistical learning with dependency parsing.

Limitation of surface patterns: consider "who discovered or invented what" in "Tesla's work formed the basis of AC electric power" vs. "Al Gore funded more work for a better basis of the Internet"; surface patterns cannot tell these apart, dependency parses can.

Training uses positive examples like (Cologne, Rhine), (Cairo, Nile), ... and negative examples like (Cairo, Rhine), (Rome, 0911), (..., [0..9]*), ...; sentences such as "Paris was founded on an island in the Seine" -> (Paris, Seine), "Cologne lies on the banks of the Rhine", and "People in Cairo like wine from the Rhine valley" are matched via paths in their dependency parses [the slide shows the link-grammar parse trees].

LEILA outperforms other Web-IE methods in precision, recall, and F1, but: the dependency parser is slow, and it handles one relation at a time.
39/44 IE Efficiency and Accuracy Tradeoffs

IE is cool, but what's in it for DB folks?
- precision vs. recall: two-stage processing (filter pipeline): 1) recall-oriented harvesting, 2) precision-oriented scrutinizing
- preprocessing & indexing: NLP trees & graphs, N-grams, PoS-tag patterns? exploit ontologies? exploit usage logs?
- turn crawl & extract into set-oriented query processing
- candidate finding: efficient phrase, pattern, and proximity queries
- optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]

[see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
40/44 The Future: Challenges

- Generalize the YAGO approach (Wikipedia + WordNet): methods for comprehensive, highly accurate mappings across many knowledge sources; cross-lingual, cross-temporal; scalable in size, diversity, and number of sources
- Pursue DB support towards efficient IE (and NLP)
- Achieve Web-scale IE throughput that can sustain the rate of new content production (e.g. blogs), with > 90% accuracy and Wikipedia-like coverage
- Integrate handcrafted knowledge with NLP/ML-based IE
- Incorporate social tagging and human computing
41/44 Outline
- Past: Matter, Antimatter, and Wormholes
- Present: XML and Graph IR
- Future: From Data to Knowledge
42/44 Major Trends in DB and IR

Database Systems              Information Retrieval
malleable schema (later)      deep NLP, adding structure
record linkage                info extraction
graph mining                  entity-relationship graph IR
data uncertainty              statistical language models
dataspaces                    Web objects
programmability               search as Web Service

Shared ground: ontologies, ranking, Web 2.0
43/44 Conclusion

DB&IR integration agenda:
- models: ranking, ontologies, probabilistic SQL? graph IR?
- languages and APIs: XQuery Full-Text++?
- systems: drop SQL, go light-weight? combine with P2P, Deep Web, ...?

Rethink progress measures and experimental methodology.

Address killer app(s) and grand challenge(s):
- from data to knowledge (Web, products, enterprises)
- integrate knowledge bases, info extraction, social wisdom
- cope with uncertainty; ranking as a first-class principle

Bridge cultural differences between DB and IR: co-locate SIGIR and SIGMOD.
44/44 DB&IR: Both Sides Now

Joni Mitchell (1969), "Both Sides, Now":
... I've looked at life from both sides now,
From up and down, and still somehow
It's life's illusions I recall.
I really don't know life at all.

Thank You!