Gerhard Weikum: Harvesting and Organizing Knowledge from the Web


1 weikum@mpi-inf.mpg.de http://www.mpi-inf.mpg.de/~weikum/ Gerhard Weikum Harvesting and Organizing Knowledge from the Web In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald

2 Gerhard Weikum ADBIS Oct 1, 2007 2/53 Vision. Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base (semantic DB). Challenge: seize the opportunity and make it happen! Approach: combine and exploit synergies of hand-crafted, high-quality knowledge sources → Semantic Web; automatic knowledge extraction → Statistical Web; social networks and human computing → Social Web.

3 Proof of Relevance. There is a growing mountain of research. … A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. Vannevar Bush: As We May Think, 1945.

4 Proof of Relevance. Tim Berners-Lee: In the Semantic Web information is given well-defined meaning. Jim Gray: … system can answer questions about the text as precisely and quickly as a human expert. Brewster Kahle: The goal of universal access to our cultural heritage is within our grasp. Jimmy Wales: Our big-picture vision is to share knowledge with all of humanity. Al Gore: The future will be better tomorrow.

5 Proof of Relevance? Confucius, 551-479 BC: A journey of a thousand miles begins with a single step. You cannot open a book without learning something. To know that we know what we know, and that we do not know what we do not know, that is true knowledge. Denis Diderot, 1713-1784: Ignorance is less remote from the truth than prejudice. Sentences are like sharp nails, which force truth upon our memories. When science, art, literature, and philosophy are simply the manifestation of personality, they can make a man's name live for thousands of years.

6 Why Google and Wikipedia Are Not Enough. Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base („semantic DB/graph“)! Answer „knowledge queries“ such as: Nobel laureate who survived both world wars and his children; drama with three women making a prophecy to a British nobleman that he will become king; proteins that inhibit both protease and some other enzyme; connection between Thomas Mann and Goethe; differences in Rembetiko music from Greece and from Turkey; neutron stars with X-ray bursts > 10^40 erg s^-1 & black holes in 10''; market impact of Web2.0 technology in December 2006; sympathy or antipathy for Germany from May to August 2006.

7 Outline. Introduction: Search for Knowledge. Harvesting Knowledge: Leibniz Approach, Planck Approach, Darwin Approach. Conclusion.

8 Three Roads to Knowledge. Leibniz Approach: Handcrafted High-Quality Knowledge Sources (Semantic Web). Planck Approach: Large-scale Information Extraction & Harvesting (Statistical Web). Darwin Approach: Social Wisdom from Web 2.0 Communities (Social Web).

9 Leibniz Approach (Semantic Web). Handcrafted High-Quality Knowledge: Ontologies and other Lexical Sources. Build on Rigorous Knowledge Atoms („Characteristica Universalis“). Gottfried Wilhelm Leibniz (1646 - 1716).

10 High-Quality Knowledge Sources. General-purpose ontologies for Semantic Web: SUMO, Cyc, etc.

11 High-Quality Knowledge Sources. General-purpose thesauri and concept networks: WordNet family.
woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman) ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)

12 High-Quality Knowledge Sources. General-purpose thesauri and concept networks: WordNet family.
enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
  => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...)
    => macromolecule, supermolecule ...
      => organic compound -- (any compound of carbon and another element or a radical) ...
  => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
    => activator -- ((biology) any agency bringing about activation; ...)
200 000 concepts and relations; can be cast into description logics or a graph, with weights for relation strengths (derived from co-occurrence statistics).

13 High-Quality Knowledge Sources. Domain ontologies (UMLS, GeneOntology, etc.): 1 million biomedical concepts, 135 categories, 54 relationships (e.g. virus causes (disease | symptom)).

14 High-Quality Knowledge Sources. Wikipedia and other lexical sources: 2 million articles, 40 million hyperlinks, many 1000s of categories and lists, more than 100 languages, growing very fast.

15 Exploit Hand-Crafted Knowledge: Wikipedia, WordNet, and other lexical sources.
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]] [[Humboldt-Universität zu Berlin]] [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]] …
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918) …

16 YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]. Turn Wikipedia into explicit knowledge base (semantic DB). Exploit hand-crafted categories and templates. Represent facts as explicit knowledge triples: relation (entity1, entity2) (in FOL, compatible with RDF, OWL-lite, XML, etc.). Map (and disambiguate) relations into WordNet concept DAG. Examples: bornIn (Max_Planck, Kiel); isInstanceOf (Kiel, City).
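
The triple representation relation (entity1, entity2) can be illustrated with a minimal Python sketch. This is not YAGO's actual storage format or API; the `Fact` type and the `match` helper are hypothetical names introduced here, and only the two example facts come from the slide.

```python
from typing import List, NamedTuple, Optional


class Fact(NamedTuple):
    """One knowledge triple in the form relation(entity1, entity2)."""
    relation: str
    entity1: str
    entity2: str


# The two example facts from the slide.
FACTS = [
    Fact("bornIn", "Max_Planck", "Kiel"),
    Fact("isInstanceOf", "Kiel", "City"),
]


def match(facts: List[Fact],
          relation: Optional[str] = None,
          entity1: Optional[str] = None,
          entity2: Optional[str] = None) -> List[Fact]:
    """Return all facts that agree with the given fields; None is a wildcard."""
    return [
        f for f in facts
        if (relation is None or f.relation == relation)
        and (entity1 is None or f.entity1 == entity1)
        and (entity2 is None or f.entity2 == entity2)
    ]


if __name__ == "__main__":
    # Who was born in Kiel?
    print(match(FACTS, relation="bornIn", entity2="Kiel"))
```

Leaving a field as `None` plays the role of a query variable, which is roughly what a pattern such as bornIn ($x, Kiel) expresses.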

17 YAGO Knowledge Representation.
Figure: graph with three layers (concepts, individuals, words). Concepts: Entity, Person, Location, City, Country, Scientist, Physicist, Biologist, linked by subclass edges. Individuals: Max_Planck, Kiel, Nobel Prize, Erwin_Planck, April 23, 1858, October 4, 1947, attached via instanceOf, bornOn, diedOn, bornIn, hasWon, FatherOf. Words: “Max Planck”, “Dr. Planck”, “Max Karl Ernst Ludwig Planck”, attached via means.
Knowledge Base   # Facts
KnowItAll           30 000
SUMO                60 000
WordNet            200 000
OpenCyc            300 000
Cyc              5 000 000
YAGO             6 000 000
Accuracy ≈ 97%. Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

18 YAGO Disambiguation & Uncertainty. Additional harvesting of relations from natural-language texts by info-extraction tools; capture confidence value for each fact. Figure: graph fragment with nodes Entity, Person, Location, City, Country, Mythological Figure, Celebrity, Paris(Myth.), Paris(France), France, Paris Hilton and the words “Paris”, “France”, “La Grande Nation”; edges labeled subclass, instanceOf, locatedIn, and means, each annotated with a confidence value between 0.05 and 1.0.
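
A small Python sketch of how per-fact confidence values could be used for disambiguation. The graph layout of the slide was lost in the transcript, so the assignment of values to edges below is a guess for illustration only; `MEANS`, `INSTANCE_OF`, `best_sense`, and `path_confidence` are hypothetical names, and multiplying confidences along a path assumes the facts are independent, which the slide does not state.

```python
# Confidence that a word "means" an entity (edge assignment is assumed).
MEANS = {
    "Paris": {"Paris(France)": 0.7, "Paris(Myth.)": 0.1, "Paris_Hilton": 0.05},
}

# Confidence of instanceOf facts (illustrative value, assumed).
INSTANCE_OF = {("Paris(France)", "City"): 0.9}


def best_sense(word):
    """Return (entity, confidence) of the most likely sense, or None."""
    senses = MEANS.get(word)
    if not senses:
        return None
    return max(senses.items(), key=lambda kv: kv[1])


def path_confidence(word, concept):
    """Confidence that `word` denotes an instance of `concept`.

    Multiplies the 'means' and 'instanceOf' confidences of the best sense,
    treating the two facts as independent (an assumption of this sketch).
    """
    sense = best_sense(word)
    if sense is None:
        return 0.0
    entity, c_means = sense
    return c_means * INSTANCE_OF.get((entity, concept), 0.0)


if __name__ == "__main__":
    print(best_sense("Paris"))
    print(path_confidence("Paris", "City"))
```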

19 NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW'07]. Graph-based search on YAGO-style knowledge bases with built-in ranking based on statistical language model. Query types, shown as query graphs: discovery queries (nodes $x, scientist, Kiel; edges isa, bornIn); connectedness queries (Thomas Mann * Goethe; German novelist, isa); queries with regular expressions (nodes Ling, $x, scientist, $y, Zhejiang, Beng Chin Ooi; edges isa, hasFirstName | hasLastName, worksFor, locatedIn *, (coAuthor | advisor) *). A further query graph uses nodes $x, Nobel prize, $a, $y, $b with edges hasWon, diedOn, hasSon, diedOn and the comparison >.

20 NAGA: Searching Knowledge
q: Fisher isa scientist, Fisher isa $x
mathematician_109635652 —subClassOf—> scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge —subClassOf—> alumnus_109165182
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher —type—> 20th_century_mathematicians
"scientist" —means—> scientist_109871938
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = alumnus_109165182
$@Fisher = Irving_Fisher, $@scientist = scientist_109871938, $X = social_scientist_109927304
$@Fisher = James_Fisher, $@scientist = scientist_109871938, $X = ornithologist_109711173
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = theorist_110008610
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = colleague_109301221
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = organism_100003226
…

21 NAGA: Searching & Ranking Knowledge
q: Fisher isa scientist, Fisher isa $x
Score: 7.184462521168058E-13
mathematician_109635652 —subClassOf—> scientist_109871938
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> 20th_century_mathematicians
"scientist" —means—> scientist_109871938
20th_century_mathematicians —subClassOf—> mathematician_109635652
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = mathematician_109635652
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = statistician_109958989
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = president_109787431
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = geneticist_109475749
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = scientist_109871938
…
Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/

22 Ranking Factors.
Confidence: Prefer results that are likely to be correct (certainty of IE; authenticity and authority of sources). Example: bornIn (Max Planck, Kiel) from „Max Planck was born in Kiel“ (Wikipedia) versus livesIn (Elvis Presley, Mars) from „They believe Elvis hides on Mars“ (Martian Bloggeria).
Informativeness: Prefer results that are likely important; may prefer results that are likely new to user (frequency in answer; frequency in corpus, e.g. Web; frequency in query log). Example: q: isa (Einstein, $y) gives isa (Einstein, scientist) and isa (Einstein, vegetarian); q: isa ($x, vegetarian) gives isa (Einstein, vegetarian) and isa (Al Nobody, vegetarian).
Compactness: Prefer results that are tightly connected (size of answer graph). Figure: answer graph over Einstein, vegetarian, Bohr, Nobel Prize, Tom Cruise, 1962 with edges isa, bornIn, diedIn, won.
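
The three ranking factors can be combined in many ways. The sketch below is only one toy combination and is not NAGA's actual model (which, per the earlier slide, is based on a statistical language model); the `Answer` type, the multiplicative `score`, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Answer:
    label: str
    confidence: float  # certainty of extraction and source authority, in [0, 1]
    frequency: int     # how often the fact is stated in the corpus (assumed known)
    num_edges: int     # size of the answer graph


def score(a: Answer, total_frequency: int) -> float:
    """Toy score: confidence x informativeness x compactness."""
    informativeness = a.frequency / total_frequency  # relative frequency
    compactness = 1.0 / a.num_edges                  # smaller graphs score higher
    return a.confidence * informativeness * compactness


def rank(answers: List[Answer]) -> List[Answer]:
    """Sort answers by descending toy score."""
    total = sum(a.frequency for a in answers)
    return sorted(answers, key=lambda a: score(a, total), reverse=True)


if __name__ == "__main__":
    answers = [
        Answer("isa (Einstein, vegetarian)", 0.9, 20, 1),
        Answer("isa (Einstein, scientist)", 0.9, 980, 1),
    ]
    for a in rank(answers):
        print(a.label)
```

With equal confidence and graph size, the more frequently stated fact ranks first, which mirrors the informativeness example for the query isa (Einstein, $y).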

23 Summary of Leibniz Approach. Hand-crafted knowledge sources are great assets, but expensive, partial, and isolated. Great mileage even from informal & semiformal sources. Connecting & reconciling different sources gives added value (and sometimes is not even that hard). Challenge: Develop methods for comprehensive, highly accurate mappings across many knowledge sources: cross-lingual, cross-temporal, scalable.

24 Planck Approach (Statistical Web). Information Extraction & Harvesting: Gather Entities, Relations, Facts. Live with Uncertainty. Max Planck (1858 - 1947).

25 Information Extraction (IE): Text to Records. Combine NLP, pattern matching, lexicons, statistical learning.
Person            BirthDate    BirthPlace   ...
Max Planck        4/23, 1858   Kiel
Albert Einstein   3/14, 1879   Ulm
Mahatma Gandhi    10/2, 1869   Porbandar

Person       ScientificResult
Max Planck   Quantum Theory

Person       Collaborator
Max Planck   Albert Einstein
Max Planck   Niels Bohr

Constant            Value          Dimension
Planck's constant   6.226 × 10^23  Js

26 IE Technology: Rules, Patterns, Learning. For natural-language text and for heterogeneous sources: NLP techniques (parser, PoS tagging) for tokenization; identify patterns (e.g. regular expressions) as features; train statistical learners for segmentation and labeling; use learned model to automatically tag newly seen input. Input to tag: Ian Foster, father of the Grid, talks at the GES conference in Germany on 05/02/07. (Figure: PoS tags such as NP, VB, NN, IN, DT, PP, ADJ, CD shown above the tokens.) Training data: The WWW conference takes place in Banff in Canada. Today's keynote speaker is Dr. Berners-Lee from W3C. The panel in Edinburgh, chaired by Ron Brachman from Yahoo!, … …

27 Knowledge Acquisition from the Web. Learn Semantic Relations from Entire Corpora at Large Scale (as exhaustively as possible but with high accuracy). Examples: all cities, all basketball players, all composers; headquarters of companies, CEOs of companies, synonyms of proteins; birthdates of people, capitals of countries, rivers in cities; which musician plays which instruments; who discovered or invented what; which enzyme catalyzes which biochemical reaction. Existing approaches and tools use almost-unsupervised pattern matching and learning: seeds (known facts) → patterns (in text) → (extraction) rule → (new) facts.

28 Methods for Web-Scale Fact Extraction. seeds → text → rules → new facts.
Example (seed fact | text occurrence | induced pattern):
city (Seattle) | in downtown Seattle | in downtown X
city (Seattle) | Seattle and other towns | X and other towns
city (Las Vegas) | Las Vegas and other towns | X and other towns
plays (Zappa, guitar) | playing guitar: … Zappa | playing Y: … X
plays (Davis, trumpet) | Davis … blows trumpet | X … blows Y
New facts found by the patterns (text occurrence | new fact):
in downtown Delhi | city (Delhi)
Coltrane blows sax | plays (Coltrane, sax)
New patterns found from the new facts (fact | text occurrence | induced pattern):
city (Delhi) | old center of Delhi | old center of X
plays (Coltrane, sax) | sax player Coltrane | Y player X
Assessment of facts & generation of rules based on statistics. Rules can be more sophisticated: playing NN: (ADJ|ADV)* NP & class(head(NP))=person ⇒ plays (head(NP), NN)
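
The seeds → text → rules → new facts loop can be sketched in a few lines of Python for the unary city relation. This is a deliberately naive illustration, not the method of any of the cited systems: it skips the statistical assessment of facts and patterns that the slide calls for, and the `<X>` slot marker, the `ENTITY` regex for capitalized names, and both function names are assumptions of this sketch.

```python
import re

SLOT = "<X>"
# Assumed shape of an entity mention: one or more capitalized words.
ENTITY = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def induce_patterns(seeds, snippets):
    """Replace each seed entity found in a snippet by a slot marker."""
    return {s.replace(e, SLOT) for e in seeds for s in snippets if e in s}


def apply_patterns(patterns, text):
    """Match every pattern against new text and collect the slot fillers."""
    found = set()
    for p in patterns:
        # Escape the literal parts and put the entity regex where the slot was.
        regex = ENTITY.join(re.escape(part) for part in p.split(SLOT))
        for m in re.finditer(regex, text):
            found.add(m.group(1))
    return found


if __name__ == "__main__":
    seeds = {"Seattle", "Las Vegas"}
    snippets = [
        "in downtown Seattle",
        "Seattle and other towns",
        "Las Vegas and other towns",
    ]
    patterns = induce_patterns(seeds, snippets)
    text = "We stayed in downtown Delhi. Delhi and other towns were busy."
    print(patterns)
    print(apply_patterns(patterns, text))
```

Feeding the newly found entities back in as seeds gives the iterative bootstrapping loop; without a statistical filter, a single bad pattern would quickly pollute the fact set.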

29 Performance of Web-IE. State-of-the-art precision/recall results:
relation     precision   recall   corpus      systems
countries    80%         90%      Web         KnowItAll
cities       80%         ???      Web         KnowItAll
scientists   60%         ???      Web         KnowItAll
CEOs         80%         50%      News        Snowball, LEILA
birthdates   80%         70%      Wikipedia   LEILA
instanceOf   40%         20%      Web         Text2Onto, LEILA
Precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%.
Anecdotal evidence: invented (A.G. Bell, telephone); married (Hillary Clinton, Bill Clinton); isa (yoga, relaxation technique); isa (zearalenone, mycotoxin); contains (chocolate, theobromine); contains (Singapore sling, gin); invented (Johannes Kepler, logarithm tables); married (Segolene Royal, Francois Hollande); isa (yoga, excellent way); isa (your day, good one); contains (chocolate, raisins); plays (the liver, central role); makes (everybody, mistakes).

30 Beyond Surface Learning with LEILA: Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD'06]. Limitation of surface patterns: who discovered or invented what: “Tesla's work formed the basis of AC electric power”; “Al Gore funded more work for a better basis of the Internet”. Almost-unsupervised Statistical Learning with Dependency Parsing. Pairs shown: (Cologne, Rhine), (Cairo, Nile), …; (Cairo, Rhine), (Rome, 0911), (…, [0..9]*), … Sentences shown with their parses (link labels such as Ss, Pv, MVp, Ds, Js, Mp, Os and phrase labels NP, VP, PP): “Paris was founded on an island in the Seine” with (Paris, Seine); “Cologne lies on the banks of the Rhine”; “People in Cairo like wine from the Rhine valley”; “We visited Paris last summer. It has many museums along the banks of the Seine.” LEILA outperforms other Web-IE methods in precision and recall, but the dependency parser is slow.

31 IE Efficiency and Accuracy Tradeoffs. IE is cool, but what's in it for DB folks? Precision vs. recall: two-stage processing (filter pipeline): 1) recall-oriented harvesting, 2) precision-oriented scrutinizing. Preprocessing: indexing of NLP trees & graphs, N-grams, PoS-tag patterns? exploit ontologies? exploit usage logs? Turn crawl & extract into set-oriented query processing: candidate finding; efficient phrase, pattern, and proximity queries; optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]. [see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
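
The two-stage filter pipeline (recall-oriented harvesting followed by precision-oriented scrutinizing) can be sketched as follows. The birth-year task, both regular expressions, and the function names are hypothetical choices for this sketch; a real second stage would typically use a more expensive check such as parsing, not just a stricter regex.

```python
import re

# Stage 1 (recall-oriented): any "First Last" name followed by a 4-digit number.
CANDIDATE = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+)[^.]*?\b(\d{4})\b")

# Stage 2 (precision-oriented): require an explicit "was born in/on" phrase.
STRICT = re.compile(
    r"([A-Z][a-z]+ [A-Z][a-z]+) was born (?:in|on) [^.]*?\b(\d{4})\b"
)


def harvest(sentences):
    """Cheap filter: keep every sentence that might state a birth year."""
    return [s for s in sentences if CANDIDATE.search(s)]


def scrutinize(candidates):
    """Stricter check, run only on the sentences that survived stage 1."""
    facts = []
    for s in candidates:
        m = STRICT.search(s)
        if m:
            facts.append((m.group(1), m.group(2)))
    return facts


if __name__ == "__main__":
    sentences = [
        "Max Planck was born in Kiel in 1858.",
        "Albert Einstein published four papers in 1905.",
        "The weather was fine.",
    ]
    candidates = harvest(sentences)
    print(len(candidates))
    print(scrutinize(candidates))
```

The design point is that the expensive check only sees the small candidate set, which is the same idea as pushing a cheap selection below an expensive operator in a query plan.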

32 Summary of Planck Approach. Human text (and speech) is diverse and produced at higher rate than manual high-quality annotations. Deep NLP and advanced ML are computational bottleneck. IE offers reasonably robust and scalable methods for harvesting named entities and binary relations. Disambiguation (entity matching, record linkage) needed: „Joe Hellerstein (UC Berkeley)“ = „Prof. Joseph M. Hellerstein (California)“; „Max Planck Institute“ = „MPI“ ≠ „MPI“ = „Message Passing Interface“. Challenge: Achieve Web-scale IE throughput that can sustain rate of new content production (e.g. blogs) (may need large-scale P2P/Grid) with > 90% accuracy and Wikipedia-like coverage.

33 Darwin Approach (Social Web). Social Wisdom & Natural Selection: Evolution of (Web 2.0) species. Survival of the fittest. Charles Darwin (1809 - 1882).

34 „Wisdom of Crowds“ at Work on Web 2.0. Information enrichment & knowledge extraction by humans. Collaborative Recommendations & QA: Amazon (product ratings & reviews, recommended products); Netflix: movie DVD rentals ⇒ $1 million Challenge; answers.yahoo, iknow.baidu, etc. Social Tagging and Folksonomies: del.icio.us: Web bookmarks and tags; flickr: photo annotation, categorization, rating; YouTube: same for video. Human Computing in Game Form: ESP and Google Image Labeler: image tagging; Peekaboom: image segmenting and tagging; Verbosity: facts from natural-language sentences. Online Communities: dblife.cs.wisc.edu for database research; www.lt-world.org for language technology; Yahoo! Groups, Myspace, Facebook, etc.

35 Social Tagging: Example Flickr (1)

36 Social Tagging: Example Flickr (2)

37 Social Tagging: Example Flickr (3)

38 Social-Tagging Community. > 1 million users, > 100 million photos, > 1 billion tags, 30% monthly growth. Source: www.flickr.com

39 ESP Game [Luis von Ahn et al. 2004]. Played against random, anonymous partner on Internet. Taboo: pyramid Louvre museum Paris art. Successive screens: my labels: reflection (your partner has suggested: 3 labels); my labels: reflection water (your partner has suggested: 7 labels); my labels: reflection water Mitterand Mona Lisa (your partner has suggested: 11 labels); my labels: reflection water Mitterand Mona Lisa metro lignes 7, 14 (your partner has suggested: 17 labels); my labels: reflection water Mitterand Mona Lisa metro lignes 7, 14 Da Vinci code. Congratulations! You scored 1 point! Game with a purpose: collects annotations (wisdom); can exploit tag statistics (crowds); attracts people, fun to play, some play for hours. ESP game collected > 10 million tags from > 20000 users; 5000 people could tag all photos on the Web in 4 weeks (human computing).
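
The core agreement rule of such a labeling game can be sketched in Python: a label becomes a tag for the image only when both players typed it and it is not on the taboo list. This is a simplification under stated assumptions: the real game is played in real time with labels arriving incrementally, and the function name `esp_round` and the example inputs are hypothetical.

```python
def esp_round(labels_a, labels_b, taboo):
    """Return the first label of player A that player B also typed.

    Taboo words never count as agreement. Comparison is case-insensitive.
    Returns None if the players did not agree on any allowed label.
    """
    taboo_set = {t.lower() for t in taboo}
    allowed_b = {label.lower() for label in labels_b} - taboo_set
    for label in labels_a:
        if label.lower() in allowed_b:
            return label.lower()
    return None


if __name__ == "__main__":
    tag = esp_round(
        ["reflection", "water", "Mona Lisa"],
        ["glass", "Water"],
        ["pyramid", "Louvre"],
    )
    print(tag)
```

Because the partners are paired randomly and cannot communicate, an agreed label is unlikely to be arbitrary, which is what makes the collected tags useful.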

40 More Human Computing. Verbosity [von Ahn 2006]: Collect common-knowledge facts (relation instances). 2 players: Narrator (N) and Guesser (G). N gives stylized clues: is a kind of …, is used for …, is typically near/in/on …, is the opposite of … Random pairing for independence; can build statistics over many games for same concept. Peekaboom, Phetch, etc.: locating & tagging objects in images, finding images, etc. Incentives to play? Game design for moving up the value-chain?

41 Dark Side of Social Wisdom. Spam (Web spam, not just for email anymore): lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.; lawsuits about „appropriate Google rank“. Truthiness: degree to which something is truthy (not necessarily facty); truthy := property of something you know from your guts. Disputes: editorial fights over critical Wikipedia articles; Citizendium: new endeavor with "gentle expert oversight". Dishonesty, Bias, …


44 The Wisdom of Crowds: PageRank. Social Ranking. PageRank (PR): links are endorsements & increase page authority; authority is higher if links come from high-authority pages. [PR formula and its equivalent principal-eigenvector form: formulas not preserved in transcript.] Random walk: uniformly random choice of links + random jumps. Authority (page q) = stationary prob. of visiting q. Add bias to transitions and jumps for personal PR, TrustRank, etc.
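
The random-walk view can be made concrete with a small power-iteration sketch. It uses the common textbook formulation PR(q) = eps/N + (1 - eps) * sum over pages p linking to q of PR(p)/outdegree(p), which is an assumption here and not necessarily the exact notation of the slide; the function name, the jump probability `eps=0.15`, and the handling of pages without out-links are likewise choices of this sketch.

```python
def pagerank(links, eps=0.15, iters=100):
    """Approximate the stationary distribution of the random surfer.

    links: dict mapping each page to the list of pages it links to.
    eps:   probability of a uniformly random jump.
    Pages without out-links are treated as linking to every page.
    """
    nodes = sorted(set(links) | {q for out in links.values() for q in out})
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}
    for _ in range(iters):
        new = {p: eps / n for p in nodes}           # random-jump share
        for p in nodes:
            out = links.get(p, [])
            targets = out if out else nodes         # dangling page: jump anywhere
            share = (1.0 - eps) * pr[p] / len(targets)
            for q in targets:
                new[q] += share                     # follow-a-link share
        pr = new
    return pr


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, value in sorted(pagerank(graph).items()):
        print(page, round(value, 3))
```

Biasing the jump term toward a preferred set of pages instead of using eps/N for every page gives the personalized and trust-based variants mentioned on the slide.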

45 The Wisdom of Crowds: Beyond PR. Typed graphs: data items, users, friends, groups, postings, ratings, queries, clicks, … with weighted edges → spectral analysis of various graphs. Evolving over time → tensor analysis (figure: tensor with dimensions users, tags, docs).

46 Decentralized Graph Analysis. Decentralized computation in peer-to-peer network with arbitrary, a-priori unknown overlaps of graph fragments (figure: global graph covered by overlapping local subgraphs 1, 2, and 3). Graph spectral analysis applied to: pages, sites, tags, users, groups, queries, clicks, opinions, etc. as nodes; assessment and interaction relations as weighted edges. Can compute various notions of authority, reputation, trust, quality.

47 JXP Algorithm [J.X. Parreira, G. Weikum: WebDB'05, VLDB'06]. Decentralized, asynchronous, peer-to-peer algorithm based on theory of Markov-chain aggregation (state lumping) [P.J. Courtois 1977, C.D. Meyer 1988]. Each peer aggregates non-local part of global graph into „world node“; peers meet randomly, exchange data about their local computations, and iterate their local computations. Theorem: authority scores from local computations converge to global scores. Supported by Minerva system: http://www.mpi-inf.mpg.de/departments/d5/software/minerva/index.html

48 Summary of Darwin Approach. Social tagging and social networks (Web 2.0) are potentially valuable knowledge sources. Games (human computing) are an interesting way of enticing „knowledge input“ and collecting statistics. Spectral analysis is a highly versatile tool for rating & ranking that can be extended and scaled by decentralized algorithms. Challenges: Design a game that intrigues serious scientists to „semantically“ annotate their scholarly work. Develop an analysis method that identifies the „best“ facts, resilient to egoistic and malicious behaviors (incl. coalitions).

49 Outline. Introduction: Search for Knowledge. Harvesting Knowledge: Leibniz Approach, Planck Approach, Darwin Approach. Conclusion.

50 Summary. Harvesting knowledge & organizing in semantic DB/graph for scholarly Web, digital libraries, enterprise know-how, online communities, etc. Three roads to knowledge: Leibniz / Semantic Web: ontologies, encyclopedia, etc.; Planck / Statistical Web: large-scale IE from text, speech, etc.; Darwin / Social Web: wisdom of crowds, tagging, folksonomies. Not covered here: search and ranking: graph IR (for ER graphs, RDF, cross-linked XML, etc.); new ranking models (e.g. statistical LM for graphs); efficient and scalable query processing.

51 Major Challenges. Generalize YAGO approach (Wikipedia + WordNet): methods for comprehensive, highly accurate mappings across many knowledge sources; cross-lingual, cross-temporal; scalable in size, diversity, number of sources. Pursue DB support towards efficient IE (and NLP): achieve Web-scale IE throughput that can sustain rate of new content production (e.g. blogs) with > 90% accuracy and Wikipedia-like coverage. Integrate handcrafted knowledge with NLP/ML-based IE. Incorporate social tagging and human computing.

52 Potential Synergies among Leibniz, Planck, and Darwin. Figure: diagram linking Leibniz (Semantic Web), Planck (Statistical Web), and Darwin (Social Web), with the labels bootstrap, validate, emerge communities, statistics & feedback, knowledge core.

53 Thank you!

