Information Extraction Language Technology (A Machine Learning Approach) 24 March 2005 Antal van den Bosch and Walter Daelemans



What is Information Extraction?
•Input: unstructured text
•Output: structured information that fills a pre-existing template (find salient information)
•Most often stored in a database for further processing (e.g. data mining)

What isn’t information extraction?
•Information retrieval (we need to extract info, not only find relevant documents)
•Text understanding (only specific parts of the text are interesting)
  –large corpora can be used
  –possible to score objectively

Applications of IE
•Can make information retrieval more precise
•Summarization of documents in well-defined subject areas
•Automatic generation of databases from text

Overview
•Named entity recognition
  –Recognizing relevant entities in text
•Relation extraction
  –Linking recognized entities having particular relevant relations

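Named Entity Recognition
Named Entity Recognition (NER) is a combination of concept chunking and labeling those chunks: we wish to identify textual information units that represent people, places, organizations, companies, bands, etc.
Example: “De door het Amerikaanse National Hurricane Center als 'zeer gevaarlijk' omschreven orkaan Ivan nadert Cuba. Een overzicht over wat Ivan op de Kaaimaneilanden heeft aangericht, is er nog niet. Gouverneur Bruce Dinwiddy zei maandag dat duizenden mensen dakloos zijn geworden en dat ook belangrijke regeringsgebouwen zijn getroffen.”
(English: “Hurricane Ivan, described as 'very dangerous' by the American National Hurricane Center, is approaching Cuba. There is no overview yet of the damage Ivan has caused on the Cayman Islands. Governor Bruce Dinwiddy said on Monday that thousands of people have been made homeless and that important government buildings have also been hit.”)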

Why NER?
NER has many applications
•prerequisite for information extraction
•improving information retrieval
  –indexing
  –querying: “belvedere”

Intuitively simple? What’s the problem?
NER seems intuitively simple for humans. How do we determine whether or not a (string of) word(s) represents a name?
•does the word start with a capital letter? (orthographic characteristics)
•have we seen it before? (lists of names)
•contextual clues
How do we teach this to a computer?

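Some problems …
Problems:
•not every word that starts with a capital letter is a name
  ex: “Soms is dat niet mogelijk...” (“Sometimes that is not possible...”)
•context can be misleading
  ex: “Er was geen land met Henk te bezeilen.” (idiom: “There was no reasoning with Henk.”; literally, no land could be sailed with Henk)
•no list can ever be complete
  ex: “Antbeard en zijn bemanning voeren...” (“Antbeard and his crew sailed...”)
  ex: “Wil je wat te drinken?” (“Would you like something to drink?”; “Wil” is also a first name)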

Feature extraction
A lot of different features can be extracted for use in (inductively) learning to classify NEs. Every word can be represented with a lot of different features:
“… bedrijf dat Floralux inhuurde. In ’81 …” (roughly: “… company that hired Floralux. In ’81 …”)
Features of the word “Floralux”:
starts w/ capital letter?   YES
first word o/t sentence?    NO
contains punctuation?       NO
string length?              8
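The four features listed on this slide can be computed with a few lines of Python. This is only a minimal sketch: the function name `orthographic_features`, the pre-tokenised input and the rule used to decide whether a word is sentence-initial are assumptions made for illustration, not part of the original lecture.

```python
import string

# Hypothetical helper: computes the four features named on the slide
# for the token at position `index`.
def orthographic_features(tokens, index):
    word = tokens[index]
    # Assumption: a token is sentence-initial if it is the very first token
    # or directly follows sentence-final punctuation.
    sentence_initial = index == 0 or tokens[index - 1] in {".", "!", "?"}
    return {
        "starts_with_capital": word[:1].isupper(),
        "first_word_of_sentence": sentence_initial,
        "contains_punctuation": any(ch in string.punctuation for ch in word),
        "string_length": len(word),
    }

# The slide's example fragment, tokenised by hand.
tokens = ["bedrijf", "dat", "Floralux", "inhuurde", ".", "In", "'81"]
features = orthographic_features(tokens, 2)  # focus word: "Floralux"
print(features)
```

For "Floralux" this reproduces the values on the slide (capitalised, not sentence-initial, no punctuation, length 8). A real system would add many more features, such as membership in name lists.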

Feature extraction (2)
We represent the context by sliding a ‘window’ over the data, anchored on the focus word.
“… het bedrijf dat Floralux inhuurde. In ’81 bestond …”
[Figure labels: left context, focus word, right context]
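The sliding window can be sketched as a small generator that yields one (left context, focus word, right context) triple per token. The window size of 3, the padding symbol "_" and the function name `windows` are assumptions chosen for this illustration; the slide does not specify them.

```python
# Hypothetical sketch of windowing: one instance per token in the text.
def windows(tokens, size=3, pad="_"):
    # Pad both ends so that words near the edges still get a full window.
    padded = [pad] * size + list(tokens) + [pad] * size
    for i, focus in enumerate(tokens):
        left = padded[i:i + size]                      # `size` tokens before the focus
        right = padded[i + size + 1:i + 2 * size + 1]  # `size` tokens after the focus
        yield left, focus, right

tokens = ["het", "bedrijf", "dat", "Floralux", "inhuurde", ".", "In", "'81", "bestond"]
instances = list(windows(tokens))

left, focus, right = instances[3]
print(left, focus, right)
```

Each triple can then be turned into a fixed-length feature vector, for example by combining the orthographic features of the focus word with the context words themselves.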

To split or not to split?
“Wolff, op het moment een journalist in Argentinië, speelde met Del Bosque bij Real Madrid in de jaren 70.” (“Wolff, currently a journalist in Argentina, played with Del Bosque at Real Madrid in the 1970s.”)
•One step: determine boundaries + types
•Two steps: determine boundaries, then determine types

State of the art

Language            F-score
English             ~93%
Dutch               ~77%
German              ~72%
Spanish             ~81%
Human performance   96-98%

Many other languages have been targeted as well: Chinese, French, Japanese, Portuguese, Greek, Hindi, Romanian, Turkish, Norwegian, and so on…

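Information extraction
•Named-entity recognition
•Relation extraction
•Coreference resolution
  –“PvdA-leider Wouter Bos gaat alleen voor minister-president. Vice-premier onder CDA-leider Balkenende is geen gedachte waar hij warm van wordt. "Hou het er maar op dat ik daar nee tegen zeg", aldus Bos woensdag voor RTL Nieuws.”
  (English: “PvdA leader Wouter Bos is only going for prime minister. Being deputy prime minister under CDA leader Balkenende is not an idea that warms him. "Let's just say that I say no to that", Bos said on Wednesday on RTL Nieuws.”)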

Information extraction
•Named-entity recognition has received a lot of attention in IE
•Relation extraction is taking over as focal point of attention

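Relation extraction
Example: “Eric Schmidt is directeur van Google.” (“Eric Schmidt is director of Google.”)

Eric    Schmidt    is    directeur    van    Google    .
N       N          WW    N            VZ     N         .
[PER           ]                             [ORG  ]

(N = noun, WW = verb, VZ = preposition; the word linking the two entities is “directeur”)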

Why relation extraction?
•Named entities can be useful to enhance information retrieval
•But they are not enough to answer certain types of information-seeking questions
•For example
  –Wie is de directeur van Google? (“Who is the director of Google?”)

Why relation extraction?
•Naïve strategy
  –Find documents in which [PER ] and [ORG Google ] are within each other’s vicinity
  –Can produce nice results, but does not always work
  –Also, user still has to find answer
  –It would be better if the system produced the answer 'Eric Schmidt'.
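The naïve proximity strategy can be sketched as follows. This is an illustrative toy, not the lecture's system: it assumes the text has already been tagged, that each entity is a single (text, label) item, and that "vicinity" means at most `max_distance` items apart. The function name `persons_near` and the second example sentence are made up for this sketch.

```python
# Hypothetical sketch: return every PER entity that occurs close to the given ORG.
def persons_near(org_name, tagged_tokens, max_distance=5):
    org_positions = [
        i for i, (text, label) in enumerate(tagged_tokens)
        if label == "ORG" and text == org_name
    ]
    hits = []
    for i, (text, label) in enumerate(tagged_tokens):
        if label == "PER" and any(abs(i - j) <= max_distance for j in org_positions):
            hits.append(text)
    return hits

# Sentence from the slides: the nearby person really is the director.
good = [("Eric Schmidt", "PER"), ("is", "O"), ("directeur", "O"),
        ("van", "O"), ("Google", "ORG"), (".", "O")]

# Made-up sentence ("Jan de Vries searches on Google"): the person is just as
# close to the organisation, but there is no director relation.
misleading = [("Jan de Vries", "PER"), ("zoekt", "O"), ("op", "O"),
              ("Google", "ORG"), (".", "O")]

print(persons_near("Google", good))
print(persons_near("Google", misleading))
```

Both sentences yield a candidate, which shows why proximity alone "does not always work": it cannot tell which relation, if any, holds between the two entities.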

Examples
•Some application areas are
  –News domain
    •Relations among the most typical named entities: Person, Organisation, Location, Misc
    •E.g. located in, parent of, part of
  –Biomedical domain
    •Relations among biomedical entities, such as DNA, proteins, diseases, etc.
    •Protein-protein interaction
    •Gene-disease relation
•Every domain-specific application needs its own set of entities

Relation extraction
•Difficult
  –Automatic systems still perform poorly
  –But there are a few reasonable solutions
•Often only works in restricted domains
  –Techniques developed for the news domain perform poorly in other domains, e.g. biomedical texts

Relations: implicit / explicit
•Explicit relations are spelled out
  –Joe Cummings, Chairman of Sybase, spoke for four hours.
•Implicit relations require understanding the text
  –Sybase was scheduled to testify, and Chairman Joe Cummings spoke for four hours.
•Most current research involves explicit relations

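Difficulties
•A relation can be phrased in many ways
  –Eric Schmidt is de directeur van Google. (“Eric Schmidt is the director of Google.”)
  –Eric Schmidt, de nieuwe directeur van Google, verklaart... (“Eric Schmidt, the new director of Google, states...”)
  –Eric Schmidt zet een volgende stap in zijn carrière. Sinds kort is hij de directeur van Google. (“Eric Schmidt takes the next step in his career. He recently became the director of Google.”)
  –...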

Assumptions
•Delimit the task
  –Relations always connect two named entities
    •More complex relations between >2 entities are harder
  –Both entities are in the same sentence
    •Strong simplification
    •See next week (Veronique Hoste guest lecture)

Relation extraction
•Relation detection
  –Is there a relation between two entities?
•Relation classification
  –Which type does the relation between two entities have?

MUC
•Message Understanding Conference
•Has organized many information extraction competitions
•Relation extraction became a MUC competition in 1998

ACE
•Automatic Content Extraction
•More recent than MUC
•The ACE data is the most popular data set for relation extraction research

ACE
•Types/subtypes of relations
  –ROLE
    •Relates a person to an organisation or geopolitical entity
    •member, owner, affiliate, client, citizen
  –PART
    •Generalised containment
    •subsidiary, physical part-of, set membership
  –AT
    •permanent and transient locations
    •located, based-in, residence
  –SOC
    •social relations among persons
    •parent, sibling, spouse, grandparent, associate

Automatic RE: Pipeline
•Relation extraction finds relations among pairs of named entities
•Assuming that named entities have already been identified
•Simple case of a pipeline, a heavily used architecture in language technology

Pipeline
Tokeniser → Sentence splitter → Part-of-speech tagger → Syntactic parser → Named-entity recogniser → Relation extractor → Information

Pipeline
•Each part of a pipeline depends on the output of the parts before it
•A weak point of the pipeline architecture is that errors tend to snowball as they propagate through later stages
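A back-of-the-envelope calculation makes the snowball effect concrete. The sketch below assumes, purely for illustration, that every stage is correct 95% of the time and that stage errors are independent, so that end-to-end accuracy is the product of the stage accuracies. Both the 95% figure and the independence assumption are hypothetical and not taken from the lecture.

```python
from math import prod

# Stage names from the pipeline slide.
STAGES = [
    "tokeniser",
    "sentence splitter",
    "part-of-speech tagger",
    "syntactic parser",
    "named-entity recogniser",
    "relation extractor",
]

def end_to_end_accuracy(stage_accuracies):
    # Under the independence assumption, an item survives the pipeline only
    # if every stage handles it correctly, so the accuracies multiply.
    return prod(stage_accuracies)

# Hypothetical per-stage accuracy of 0.95 for all six stages.
hypothetical = {stage: 0.95 for stage in STAGES}
overall = end_to_end_accuracy(hypothetical.values())
print(f"{overall:.2f}")
```

Even with six fairly accurate stages, roughly a quarter of the items would contain at least one error by the end of the pipeline under these assumptions.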

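•Eric Schmidt is directeur van Google. (“Eric Schmidt is director of Google.”)
  ●[PER Eric Schmidt ] werkzaam-bij [ORG Google ] (werkzaam-bij = “employed-at”)
•Jan de Vries is vakkenvuller bij Albert Heijn. (“Jan de Vries is a shelf stacker at Albert Heijn.”)
  ●[PER Jan de Vries ] werkzaam-bij [ORG Albert Heijn ]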

•PER is directeur van ORG.
•PER is vakkenvuller bij ORG.

•PER is directeur PREP ORG.
•PER is vakkenvuller PREP ORG.

•PER is directeur PREP ORG.
•PER is vakkenvuller PREP ORG.
•Jan de Vries is fan van PSV. (“Jan de Vries is a fan of PSV.”)
  –PER is fan PREP ORG. !
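The generalisation steps on the last few slides (entities replaced by their types, prepositions replaced by PREP) can be sketched as below. The tiny preposition list, the pre-tagged input format and the helper names `generalise` and `fits_template` are assumptions for illustration only.

```python
PREPOSITIONS = {"van", "bij"}   # tiny illustrative list, not a full lexicon
ENTITY_LABELS = {"PER", "ORG"}

def generalise(tagged_tokens):
    """Replace entities by their type and prepositions by PREP."""
    pattern = []
    for text, label in tagged_tokens:
        if label in ENTITY_LABELS:
            pattern.append(label)
        elif text.lower() in PREPOSITIONS:
            pattern.append("PREP")
        else:
            pattern.append(text)
    return " ".join(pattern)

def fits_template(pattern):
    """True if the pattern has the shape 'PER is <word> PREP ORG'."""
    parts = pattern.split()
    return len(parts) == 5 and parts[:2] == ["PER", "is"] and parts[3:] == ["PREP", "ORG"]

director = [("Eric Schmidt", "PER"), ("is", "O"), ("directeur", "O"),
            ("van", "O"), ("Google", "ORG")]
stacker = [("Jan de Vries", "PER"), ("is", "O"), ("vakkenvuller", "O"),
           ("bij", "O"), ("Albert Heijn", "ORG")]
fan = [("Jan de Vries", "PER"), ("is", "O"), ("fan", "O"),
       ("van", "O"), ("PSV", "ORG")]

patterns = [generalise(sentence) for sentence in (director, stacker, fan)]
for pattern in patterns:
    print(pattern, fits_template(pattern))
```

All three sentences fit the same template, including the "fan" sentence, which does not express an employment relation. That is the problem flagged with "!" on the slide: the pattern has become too general, and the remaining content word has to be taken into account.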

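Semantic lexicon (e.g. WordNet)
[Figure: the words directeur, vakkenvuller and fan looked up in a semantic lexicon, with related entries portier (“doorman”), accountant, liefhebber (“enthusiast”), bewonderaar (“admirer”), ...]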

[PER Eric Schmidt ] werkzaam-bij [ORG Google ]

[PER Jan de Vries ] werkzaam-bij [ORG Albert Heijn ]

Similar?

PER ↑smain-su ↓smain-predc np-mod pp-obj1 ORG

Evaluation
•Comparable to text classification and named entity recognition
  –Precision
    •Number of correctly predicted relations / Total number of predicted relations
  –Recall
    •Number of correctly predicted relations / Total number of relations in the text
  –F-score
    •2 * precision * recall / (precision + recall)
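The three measures can be computed directly from the sets of predicted and correct (gold) relations. The sketch below is a minimal illustration: the triple format (person, relation, organisation), the function name `evaluate` and the example predictions are assumptions, with the entity names borrowed from the earlier slides.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F-score) for two collections of relation triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

gold = {
    ("Eric Schmidt", "werkzaam-bij", "Google"),
    ("Jan de Vries", "werkzaam-bij", "Albert Heijn"),
}
# Hypothetical system output: both gold relations plus one false positive.
predicted = gold | {("Jan de Vries", "werkzaam-bij", "PSV")}

precision, recall, f_score = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
```

In this made-up case two of the three predicted relations are correct (precision 2/3) and both gold relations are found (recall 1), which gives an F-score of 0.8.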