Chapter VI: Information Extraction Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2011/12.

Slides:



Advertisements
Similar presentations
ThemeInformation Extraction for World Wide Web PaperUnsupervised Learning of Soft Patterns for Generating Definitions from Online News Author Cui, H.,
Advertisements

Research Internships Advanced Research and Modeling Research Group.
YAGO-NAGA Project Presented By: Mohammad Dwaikat To: Dr. Yuliya Lierler CSCI 8986 – Fall 2012.
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Search Engines and Information Retrieval
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Database and Information- Retrieval Methods for Knowledge Discovery Database and Information- Retrieval Methods for Knowledge Discovery Gerhard Weikum,
July 9, 2003ACL An Improved Pattern Model for Automatic IE Pattern Acquisition Kiyoshi Sudo Satoshi Sekine Ralph Grishman New York University.
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
Information Retrieval in Practice
Introduction to Text Mining
Chapter 5: Information Retrieval and Web Search
Gerhard Herzberg Born: 25 December 1904, Hamburg, Germany Died: 3 March 1999, Ottawa, Canada Affiliation at the time of the award: National Research Council.
Albert Einstein
Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University.
Information Retrieval in Practice
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
Srihari-CSE730-Spring 2003 CSE 730 Information Retrieval of Biomedical Text and Data Inroduction.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
The Biography of Max von Laue By Nicole Christensen.
Search Engines and Information Retrieval Chapter 1.
Citation Recommendation 1 Web Technology Laboratory Ferdowsi University of Mashhad.
1 The BT Digital Library A case study in intelligent content management Paul Warren
Albert Einstein. BIOGRAPHY Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, Six weeks later the family moved to Munich, where.
Estimating Importance Features for Fact Mining (With a Case Study in Biography Mining) Sisay Fissaha Adafre School of Computing Dublin City University.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
A Survey for Interspeech Xavier Anguera Information Retrieval-based Dynamic TimeWarping.
SWETO: Large-Scale Semantic Web Test-bed Ontology In Action Workshop (Banff Alberta, Canada June 21 st 2004) Boanerges Aleman-MezaBoanerges Aleman-Meza,
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Max Planck By Sydney Rechtenbaugh!. Personal Facts Born on April 23, 1828 Born in Kiel, Germany Studied at the Universities of Munich and Berlin Received.
Ontology-Based Information Extraction: Current Approaches.
Natural Language Based Reformulation Resource and Web Exploitation for Question Answering Ulf Hermjakob, Abdessamad Echihabi, Daniel Marcu University of.
Max Planck The Father of Quantum Physics By: Michael Showak.
IL Step 3: Using Bibliographic Databases Information Literacy 1.
Chapter 6: Information Retrieval and Web Search
1 CSI 5180: Topics in AI: Natural Language Processing, A Statistical Approach Instructor: Nathalie Japkowicz Objectives of.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Bootstrapping for Text Learning Tasks Ramya Nagarajan AIML Seminar March 6, 2001.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
Talk Schedule Question Answering from Bryan Klimt July 28, 2005.
Computational Linguistics. The Subject Computational Linguistics is a branch of linguistics that concerns with the statistical and rule-based natural.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Max Planck BY: Rachel L. Torres & Elizabeth Esparza.
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval
Tutorial: Knowledge Bases for Web Content Analytics
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Answer Mining by Combining Extraction Techniques with Abductive Reasoning Sanda Harabagiu, Dan Moldovan, Christine Clark, Mitchell Bowden, Jown Williams.
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois.
MAX PLANCK By: Tyler Brown and Devontae Dixon =R9sMxEnbha0&edufilter=FwNI MIqIDIQYAexPZDE7XA.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
An Ontology-based Automatic Semantic Annotation Approach for Patent Document Retrieval in Product Innovation Design Feng Wang, Lanfen Lin, Zhou Yang College.
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Information Retrieval in Practice
Einat Minkov University of Haifa, Israel CL course, U
YAGO-QA Answering Questions by Structured Knowledge Queries
Web IR: Recent Trends; Future of Web Search
Introduction to Information Extraction
Social Knowledge Mining
Huayi Zhagn and Haiyan Liang
CSE 635 Multimedia Information Retrieval
CS246: Information Retrieval
Presentation transcript:

Chapter VI: Information Extraction Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2011/12

Chapter VI: Information Extraction VI.1 Motivation and Overview IE systems: Wolfram Alpha, Yago-Naga, EntityCube Applications: Knowledge base building, question answering VI.2 IE for Entities and Relations Basic NLP techniques, rule-based IE, learning-based IE VI.3 Named Entity Disambiguation Entity reconciliation & matching functions, Markov Logic Networks VI.4 Large-Scale Knowledge Base Construction and Open IE Bootstrapping pattern mining, TextRunner, NELL December 13, 2011VI.2IR&DM, WS'11/12

VI.1 Motivation and Overview Beyond keywords as queries and documents as retrieval units: Extract entities and annotate text documents or Web pages (e.g., named entity recognition) Find instances of semantic classes (e.g., not yet known in WordNet) Extract facts (relations among entities) from text documents or Web pages (e.g., Wikipedia) to automatically populate and enhance an ontology/knowledge base Answer questions by analyzing natural-language and translation into machine-processable format Technologies: Lexicon lookups (name dictionaries, geo gazetteers, etc.) NLP (PoS tagging, chunking/parsing, semantic role labeling, etc.) Pattern matching & rule learning (regular expressions, FSAs) Statistical learning (HMMs, MRFs, etc.) Text mining in general December 13, 2011VI.3IR&DM, WS'11/12

Example: Wolfram Alpha December 13, 2011VI.4IR&DM, WS'11/12

Example: YAGO-NAGA yago-naga/ December 13, 2011VI.5IR&DM, WS'11/12

Example: YAGO-NAGA yago-naga/ December 13, 2011VI.6IR&DM, WS'11/12

Max Karl Ernst Ludwig Planck was born in Kiel, Germany, on April 23, 1858, the son of Julius Wilhelm and Emma (née Patzig) Planck. Planck studied at the Universities of Munich and Berlin, where his teachers included Kirchhoff and Helmholtz, and received his doctorate of philosophy at Munich in He was Privatdozent in Munich from 1880 to 1885, then Associate Professor of Theoretical Physics at Kiel until 1889, in which year he succeeded Kirchhoff as Professor at Berlin University, where he remained until his retirement in Afterwards he became President of the Kaiser Wilhelm Society for the Promotion of Science, a post he held until He was also a gifted pianist and is said to have at one time considered music as a career. Planck was twice married. Upon his appointment, in 1885, to Associate Professor in his native town Kiel he married a friend of his childhood, Marie Merck, who died in He remarried her cousin Marga von Hösslin. Three of his children died young, leaving him with two sons. Max Planck 4/23, 1858 Kiel Albert Einstein 3/14, 1879 Ulm Mahatma Gandhi 10/2, 1869 Porbandar Person BirthDate BirthPlace... Max Planck Nobel Prize in Physics Marie Curie Nobel Prize in Physics Marie Curie Nobel Prize in Chemistry Person Award type (Max Planck, physicist) bornOn (Max Planck, 23 April 1858) bornIn (Max Planck, Kiel) plays (Max Planck, piano) spouse (Max Planck, Marie Merck) spouse (Max Planck, Marga Hösslin) advisor (Max Planck, Kirchhoff) advisor (Max Planck, Helmholtz) AlmaMater (Max Planck, TU Munich) Information Extraction (IE): Text to Relations December 13, 2011VI.7IR&DM, WS'11/12

IE for Knowledge Base Construction {{Infobox_Scientist | name = Max Planck | birth_date = [[April 23]], [[1858]] | birth_place = [[Kiel]], [[Germany]] | death_date = [[October 4]], [[1947]] | death_place = [[Göttingen]], [[Germany]] | residence = [[Germany]] | nationality = [[Germany|German]] | field = [[Physicist]] | work_institution = [[University of Kiel]] [[Humboldt-Universität zu Berlin]] [[Georg-August-Universität Göttingen]] | alma_mater = [[Ludwig-Maximilians-Universität München]] | doctoral_advisor = [[Philipp von Jolly]] | doctoral_students = [[Gustav Ludwig Hertz]] … | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]] | prizes = [[Nobel Prize in Physics]] (1918) … automatically build large knowledge base from Wikipedia infoboxes & categories, WordNet, and similar high-quality sources December 13, 2011VI.8IR&DM, WS'11/12

NLP-based IE (on the Web) December 13, 2011VI.9IR&DM, WS'11/12 Open-source tool: GATE/ANNIE

IE for Life Sciences December 13, 2011VI.10IR&DM, WS'11/12

NLP-based IE from Scientific Publications (1) December 13, 2011VI.11IR&DM, WS'11/12

NLP-based IE from Scientific Publications (2) December 13, 2011VI.12IR&DM, WS'11/12

Entity-Centric Web Search: Entity Cube December 13, 2011VI.13IR&DM, WS'11/12

Entity-Centric Web Search: Entity Cube December 13, 2011VI.14IR&DM, WS'11/12

Extracting Structured Records from Deep Web Sources (1) December 13, 2011VI.15IR&DM, WS'11/12

Mining the Web: Analysis of Hypertext and Semi Structured Data (The Morgan Kaufmann Series in Data Management Systems) (Hardcover) by Soumen Chakrabarti td.productLabel { font-weight: bold; text-align: right; white-space: nowrap; vertical-align: top; padding-right: 5px; padding-left: 0px; } table.product { border: 0px; padding: 0px; border-collapse: collapse; } List Price: $62.95 Price: $62.95 & this item ships for FREE with Super Saver Shipping.... Extracting Structured Records from Deep Web Sources (2) Extract record: Title: Mining the Web … Author: Soumen Chakrabarti, Hardcover: 344 pages, Publisher: Morgan Kaufmann, Language: English, ISBN: AverageCustomerReview: 4 NumberOfReviews: 8, SalesRank: December 13, 2011VI.16IR&DM, WS'11/12

A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field? Jeopardy! December 13, 2011VI.17IR&DM, WS'11/12

Structured Knowledge Queries A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field? Select Distinct ?c Where { ?c type City. ?c locatedIn USA. ?a1 type Airport. ?a2 type Airport. ?a1 locatedIn ?c. ?a2 locatedIn ?c. ?a1 namedAfter ?p. ?p type WarHero. ?a2 namedAfter ?b. ?b type BattleField. } Use manually created templates for mapping sentence patterns to structured queries. Focus on factoid and list questions. December 13, 2011VI.18IR&DM, WS'11/12

Deep-QA in NL 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain This town is known as "Sin City" & its downtown is "Glitter Gulch" William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel As of 2010, this is the only former Yugoslav republic in the EU YAGO knowledge backends question classification & decomposition D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, December 13, 2011VI.19IR&DM, WS'11/12

More IE Applications Business analytics on customer dossiers, financial reports, etc. e.g.: How was company X (the market Y) performing in the last 5 years? Job brokering (applications/resumes, job offers) e.g.: How well does the candidate match the desired profile? Market/customer, PR impact, and media coverage analyses e.g.: How are our products perceived by teenagers (girls)? How good (and positive?) is the press coverage of X vs. Y? Who are the stakeholders in a public dispute on a planned airport? Knowledge management in consulting companies e.g.: Do we have experience and competence on X, Y, and Z in Brazil? Comparison shopping & recommendation portals e.g. consumer electronics, used cars, real estate, pharmacy, etc. Knowledge extraction from scientific literature e.g.: Which anti-HIV drugs have been found ineffective in recent papers? General-purpose knowledge acquisition Can we learn encyclopedic knowledge from text & Web corpora? Mining archives e.g.: Who knew about the scandal on X before it became public? December 13, 2011VI.20IR&DM, WS'11/12

IE Viewpoints and Approaches IE as learning (restricted) wrappers/regular expressions (wrapping pages with common structure from Deep-Web sources) IE as learning relations (rules for identifying instances of n-ary relations) IE as learning text/sequence segmentation (HMMs, etc.) IE as learning contextual patterns (graph models, etc.) IE as natural-language analysis (NLP methods) IE as large-scale text mining for knowledge acquisition (combination of tools incl. Web queries) IE as learning fact boundaries December 13, 2011VI.21IR&DM, WS'11/12

IE Viewpoints and Approaches Source: W. Cohen, A. McCallum: Information Extraction from the Web, Tutorial, KDD 2003 Lexicons Alabama Alaska … Wisconsin Wyoming Abraham Lincoln was born in Kentucky. member? Classify Pre-segmented Candidates Abraham Lincoln was born in Kentucky. Classifier which class? …and beyond Sliding Window (+Classifier) Abraham Lincoln was born in Kentucky. Classifier which class? Try alternate window sizes: Boundary Models (+Classifier) Abraham Lincoln was born in Kentucky. Classifier which class? BEGINENDBEGINEND BEGIN Context Free Grammars Abraham Lincoln was born in Kentucky. NNPVPNPVNNP NP PP VP S Most likely parse? Finite State Machines Abraham Lincoln was born in Kentucky. Most likely state sequence? December 13, 2011VI.22IR&DM, WS'11/12

IE Quality Assessment Fix IE task (e.g., extracting all book records from a set of bookseller Web pages) Manually extract all correct records Use standard IR measures: → precision, (relative) recall, F1 measure, etc. or if too large to inspect manually: → statistical tests w/confidence intervals for precision, recall, etc. Benchmark settings : MUC (Message Understanding Conference), no longer active ACE (Automatic Content Extraction), TREC Enterprise Track, INEX Entity Ranking Track, Enron mining, CLEF (Multilingual&Multimodal Information Access Evaluation) CoNNL (Conference on Computational Natural Language Learning), December 13, 2011VI.23IR&DM, WS'11/12