
1 Papers for today Collaboratively built semi-structured content and Artificial Intelligence: The story so far – Hovy, Navigli, Ponzetto YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia – Hoffart, Suchanek, Berberich, Weikum

2 Collaboratively built semi-structured content Main characteristics of collaborative resources that make them attractive for AI and NLP research Semi-structured resources enable a renaissance of knowledge-rich AI techniques

3 Unstructured, structured and semi-structured resources Unstructured – Strengths: easy to harvest at very large scale, many domains, many styles, many languages… – Limitations: knowledge acquisition bottleneck (for complex inference chains), degree and quality of ontologization Structured (e.g. ontologies…) – Strengths: high quality, beneficial for all kinds of intelligent applications – Limitations: creation and maintenance effort, low coverage, difficulty of keeping information up to date, the language barrier Semi-structured – Strengths: high quality and coverage, up-to-date and multilingual

4 Semi-structured resources Wikipedia, Wiktionary, Twitter, Yahoo! Answers Wikipedia – relies on large amounts of manually-input knowledge – provided via massive online collaboration – on the basis of semi-structured (i.e., free-form markup) content Structure given by redirection pages, internal hyperlinks, interlanguage links, category pages, infoboxes – Markup annotations indirectly encode semantic content and, thus, world and linguistic knowledge manually input by human editors

5 Filling the knowledge gap Transforming semi-structured content into machine-readable knowledge Generating semantics by exploiting the shallow structure found in Wikipedia Acquiring related terms: thesaurus extraction Is-a relation: taxonomy induction Relation extraction – sentences processing combined with hyperlink information, use of infoboxes

6 Filling the knowledge gap Ontologization: building and enriching ontologies (YAGO2) – More relations (meronymy, domain-specific…) – Exploiting structure. Some of the methods quantify semantic distances using a relatedness measure computed on the Wikipedia hyperlink graph (see the sketch below) A heuristic renaissance: high-quality, semi-structured content enables the acquisition of machine-readable knowledge on a large scale by means of heuristic methods that essentially leverage regularities found within its shallow structure. – Lightweight and scalable rule-based approaches can be devised to exploit the conventions governing the editorial base of collaboratively generated resources, and capture large amounts of semantic information hidden within them.
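
One widely used link-based relatedness measure is Milne and Witten's normalized link distance over shared in-links; the paper does not prescribe a specific measure, so this is a minimal sketch assuming in-link sets and the article count have already been harvested from a Wikipedia dump:

```python
import math

def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Milne-Witten style relatedness from shared Wikipedia in-links.

    inlinks_a, inlinks_b: sets of article ids linking to the two entities.
    total_articles: number of articles in the whole Wikipedia collection.
    Returns a score in [0, 1]; higher means more related.
    """
    shared = inlinks_a & inlinks_b
    if not shared:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    distance = (math.log(big) - math.log(len(shared))) / (
        math.log(total_articles) - math.log(small))
    return max(0.0, 1.0 - distance)
```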

7 Filling the knowledge gap Named Entity Recognition Named Entity Disambiguation (associate name with appropriate reference) Word Sense Disambiguation Wikification: bringing Entity and Word Sense Disambiguation together – keyword extraction combined with lexical disambiguation: given an input document, a wikification system identifies the most important terms in the document and links (i.e., disambiguates) them to their appropriate entries within an external encyclopedic resource, i.e., typically Wikipedia.
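
A toy wikifier illustrating the two steps the slide names: mention spotting followed by lexical disambiguation. It assumes a pre-built anchor dictionary, harvested from the anchor texts of Wikipedia links, that maps surface strings to candidate titles with prior ("commonness") probabilities; all names and thresholds here are hypothetical:

```python
def wikify(text, anchor_dict, min_prior=0.5):
    """Spot mentions by longest match, then pick the most common sense.

    anchor_dict: lowercased surface string -> {wikipedia_title: prior}.
    Returns a list of (mention, linked_title, prior) triples.
    """
    tokens = text.split()
    links, i = [], 0
    while i < len(tokens):
        # Try the longest n-gram first (up to 4 tokens in this sketch).
        for n in range(min(4, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n]).lower().strip(".,!?")
            candidates = anchor_dict.get(mention)
            if candidates:
                title, prior = max(candidates.items(), key=lambda kv: kv[1])
                if prior >= min_prior:
                    links.append((mention, title, prior))
                i += n  # consume the matched mention either way
                break
        else:
            i += 1
    return links
```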

8 Filling the knowledge gap Computing semantic relatedness: quantifying the strength of association between words. Beyond the sentence level: document clustering and text categorization. Question Answering – YAGO2 includes an extrinsic evaluation of the quality of Wikipedia on the task of answering spatio-temporal questions

9 Filling the knowledge gap Information Retrieval – The repository of disambiguated concepts found in Wikipedia (i.e., its articles) provides a semantic space into which documents and queries can be projected in order to perform semantic retrieval beyond the simple bag-of-words model
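
A sketch of this projection in the spirit of explicit semantic analysis, assuming a pre-built inverted index of TF-IDF weights from terms to Wikipedia articles (the index itself is hypothetical here); cosine similarity between such concept vectors then gives relatedness beyond exact word overlap:

```python
from collections import defaultdict

def project_to_concepts(text, term_concept_weights):
    """Project a bag of words into Wikipedia concept space (ESA-style).

    term_concept_weights: term -> {article_title: tfidf_weight}, i.e. an
    inverted index over Wikipedia articles, assumed to be pre-built.
    Returns a sparse concept vector as a dict.
    """
    vector = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in term_concept_weights.get(term, {}).items():
            vector[concept] += weight
    return dict(vector)
```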

10 Exploiting updated content from revision history Language generation: leveraging Wikipedia's revision history as a source of data in order to automatically acquire sentence rewriting models.

12 Exploiting updated content from revision history Rewriting tasks: sentence compression, text simplification and targeted paraphrasing Summarization – ??

13 The tower of Babel: multilingual resources and applications Wikipedia's multilinguality – namely, the availability of interlinked Wikipedias in different languages – enables the acquisition of very large, wide-coverage repositories of multilingual knowledge. Multilingual taxonomies and ontologies Parallel corpora and thesauri

14 Some Questions Tease out the collaborative vs. semi-structured aspects Collaborative – Over the past decade, a variety of proposals, such as MindPixel and Open Mind, have tried to make manual knowledge acquisition feasible by collecting input from volunteers. See also von Ahn's work, which aims at acquiring knowledge from users by means of online games. However, none of these efforts, to date, has succeeded in producing truly wide-coverage resources able to compete with standard manual resources. – Why? Why do people like to collaborate on Wikipedia and not, as much, on other projects? – What makes Wikipedia so attractive, and how can one try to copy from it to encourage other collaborative efforts?

15 Some Questions Semi-structured Wikipedia, Wiktionary, Twitter, Yahoo! Answers – What aspects of the structures are most important? Other resources that have similar structure – if not the collaborative aspects? – Newspapers? – Forums? Use revision history to discover something about the contributors?

16 Papers for today Collaboratively built semi-structured content and Artificial Intelligence: The story so far – Hovy, Navigli, Ponzetto YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia – Hoffart, Suchanek, Berberich, Weikum

17 YAGO2 A knowledge base in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 447 million facts about 9.8 million entities. The paper describes the extraction methodology, the integration of the spatio-temporal dimension, and the SPOTL knowledge representation, which incorporates time and space

18 Time and space To know not only that a fact is true, but also when and where it was true. – Presidents of countries or CEOs of companies change. Even capitals of countries or spouses are not necessarily forever… – The geographical location is a crucial property not just of physical entities such as countries, mountains, or rivers, but also of organization headquarters, or events such as battles, fairs, or people's births.

19 Contributions Integrate entity-relationship-oriented facts with the spatial and temporal dimensions. Extensible framework for fact extraction (from Wikipedia and other sources) that can tap into infoboxes, lists, tables, categories, and regular patterns in free text, and allows fast and easy specification of new extraction rules Knowledge representation model tailored to capture time and space, as well as rules for propagating time and location information to all relevant facts New representation model, SPOTL tuples (SPO + Time + Location), with expressive and easy-to-use querying – SPO triples: subject-property-object triples

20 YAGO The YAGO knowledge base is automatically constructed from Wikipedia. Each article in Wikipedia becomes an entity in the knowledge base (e.g., since Leonard Cohen has an article in Wikipedia, LeonardCohen becomes an entity in YAGO) 100 manually defined relations (wasBornOnDate, locatedIn, …) 2 million entities and 20 million facts. Facts: triples of an entity, a relation, and another entity (wasBornIn(LeonardCohen, Montreal)) – SPO triples of subject (S), predicate (P), and object (O), compatible with the RDF (Resource Description Framework) data model
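
A minimal way to picture YAGO-style facts as SPO triples in code (a toy in-memory store, not the paper's actual storage format):

```python
# A YAGO-style fact is a plain SPO triple (subject, predicate, object).
facts = {
    ("LeonardCohen", "wasBornIn", "Montreal"),
    ("LeonardCohen", "wasBornOnDate", "1934-09-21"),
}

def objects_of(facts, subject, predicate):
    """All objects o such that (subject, predicate, o) is a known fact."""
    return {o for s, p, o in facts if s == subject and p == predicate}

# e.g. objects_of(facts, "LeonardCohen", "wasBornIn") -> {"Montreal"}
```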

21 YAGO2 Extraction Architecture The YAGO2 architecture is based on declarative rules that are stored in text files. The rules take the form of subject-predicate-object triples, so that they are basically additional YAGO2 facts. – Extraction rules say that if a part of the source text matches a specified regular expression, a sequence of facts shall be generated. The rules apply not only to Wikipedia infoboxes, but also to Wikipedia categories, article titles, headings, links, and references. The extraction rules cover some 200 infobox patterns, some 90 category patterns, and around a dozen patterns for dealing with disambiguation pages.
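
A hedged sketch of the regex-to-fact idea behind such declarative rules; the actual YAGO2 rule syntax differs, and the single rule below is purely illustrative:

```python
import re

# One hypothetical rule: a regex over infobox source text on the left,
# a fact template on the right ($0 = the entity whose page is being
# processed, $1 = the first regex capture group).
RULES = [
    (re.compile(r"\|\s*birth_date\s*=\s*(\d{4}-\d{2}-\d{2})"),
     ("$0", "wasBornOnDate", "$1")),
]

def apply_rules(entity, source_text):
    """Generate SPO facts for every rule match in the page source."""
    facts = []
    for pattern, (s, p, o) in RULES:
        for match in pattern.finditer(source_text):
            fill = lambda x: entity if x == "$0" else match.group(int(x[1:]))
            facts.append((fill(s), p, fill(o)))
    return facts

# apply_rules("BobDylan", "| birth_date = 1941-05-24")
#   -> [("BobDylan", "wasBornOnDate", "1941-05-24")]
```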

22 Time in YAGO2 YAGO2 contains a data type yagoDate that denotes time points, typically with a resolution of days but sometimes with coarser resolution, such as years. YAGO2 assigns begin and/or end of time spans to all entities, to all facts, and to all events, if they have a known start point or a known end point.

23 Entities and Time Entities are assigned a time span to denote their existence in time. Four major entity types: People – relations wasBornOnDate and diedOnDate demarcate their existence times – Elvis Presley is associated with 1935-01-08 as his birthdate and 1977-08-16 as his time of death; Bob Dylan is associated only with his time of birth, 1941-05-24 Groups such as music bands, football clubs, universities, or companies – the relations wasCreatedOnDate and wasDestroyedOnDate demarcate their existence times Artifacts such as buildings, paintings, books, music songs or albums – wasCreatedOnDate and wasDestroyedOnDate (e.g., for buildings or sculptures) Events such as wars, sports competitions like Olympics or world championship tournaments, or named epochs like the German Autumn – startedOnDate and endedOnDate demarcate their existence times

24 Facts and Time The YAGO2 extractors can find occurrence times of facts from the Wikipedia infoboxes. Example: BobDylan wasBornIn Duluth is an event that happened in 1941. Two new relations, occursSince and occursUntil. If the same fact occurs more than once, then YAGO2 will contain it multiple times with different ids. For example, since Bob Dylan has won two Grammy awards, we would have #1: BobDylan hasWonPrize GrammyAward with #1 occursOnDate 1973, and a second fact #2: BobDylan hasWonPrize GrammyAward (with a different id) with the associated fact #2 occursOnDate 1979.
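
A toy illustration of this reification scheme, where each fact gets its own id so the same SPO triple can occur several times with different time annotations:

```python
import itertools

_ids = itertools.count(1)
fact_by_id = {}   # id -> (s, p, o)
meta_facts = []   # facts about facts, e.g. (id, "occursOnDate", date)

def add_fact(s, p, o, occurs_on=None):
    """Store a fact under a fresh id; optionally attach its date."""
    fid = next(_ids)
    fact_by_id[fid] = (s, p, o)
    if occurs_on:
        meta_facts.append((fid, "occursOnDate", occurs_on))
    return fid

add_fact("BobDylan", "hasWonPrize", "GrammyAward", occurs_on="1973")
add_fact("BobDylan", "hasWonPrize", "GrammyAward", occurs_on="1979")
```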

25 Space All physical objects have a location in space. YAGO2 is concerned with entities that have a permanent spatial extent on Earth – for example countries, cities, mountains, and rivers. New class yagoGeoEntity, which groups together all geo-entities – Subclasses of yagoGeoEntity are: location, body of water, geological formation, real property, facility, excavation, structure, track … – The position of a geo-entity can be described by geographical coordinates, latitude and longitude YAGO2 harvests geo-entities from two sources: Wikipedia and GeoNames – (GeoNames has information on location hierarchies (partOf), e.g. Berlin isLocatedIn Germany isLocatedIn Europe, and provides alternate names for each location, as well as neighboring countries)

26 Entities and Location Events – Can take place at a specific location, such as battles or sports competitions, where the relation happenedIn holds the place where it happened. Groups or organizations – Can have a venue, such as the headquarters of a company or the campus of a university. The location for such entities is given by the isLocatedIn relation. Artifacts that are physically located somewhere – e.g. the Mona Lisa in the Louvre, where the relation is again isLocatedIn.

27 SPOTL(X)-View Model SPOTLX 6-tuples – SPO triples augmented by Time and Location and keywords or key phrases from the conteXt of sources where the original SPO fact occurs
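
One way to picture SPOTLX 6-tuples and a trivially simple query over them; this is a toy model of the view, not the paper's actual query language:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SpotlxFact:
    s: str                      # subject
    p: str                      # predicate
    o: str                      # object
    t: Optional[tuple] = None   # time span (begin, end) as ISO dates
    l: Optional[tuple] = None   # location as (latitude, longitude)
    x: str = ""                 # context keywords from the source

def query(facts, p=None, since=None):
    """Tiny SPOTLX filter: by predicate and by earliest begin time."""
    for f in facts:
        if p and f.p != p:
            continue
        if since and (f.t is None or f.t[0] < since):
            continue
        yield f
```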

28 Size of YAGO2: entities

29 Size of YAGO2: facts

30 Evaluation of the extraction of facts from Wikipedia

31 Task-Based Evaluation Answering Spatio-Temporal Questions 15 questions of the GeoCLEF 2008 GiKiP Pilot – The original intent of the GeoCLEF GiKiP Pilot is: find Wikipedia entries / articles that answer a particular information need which requires geographical reasoning of some sort. 4 questions worked perfectly; 3 questions worked when relaxing a geographical condition from structural to keyword conditions – resulting in a less precise but still useful result set; 6 questions could be well formulated as SPOTLX queries but did not return any good result due to the limited coverage of the knowledge base; 2 questions could not be properly formulated at all. Also: a sample of temporal and spatial question blocks from Jeopardy!

32 Evaluation on Jeopardy

33 Improving Named Entity Disambiguation by Spatio-Temporal Knowledge "Dylan performed Hurricane about the black fighter Carter, from his album Desire. That album also contains a duet with Harris in the song Joey." – Here, the tokens song, album, and performed are strong cues for Joey (Bob Dylan song) rather than the TV series

34 Spatial Coherence Two entities that are geographically close to each other form a coherent pair, based on the intuition that texts or text passages (news, blog postings, etc.) usually talk about a single geographic region. Spatial coherence is defined between two entities e1, e2 ∈ E with geo-coordinates, where E is the set of all candidates for mapping mentions in a text to canonical entities.
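
Under this intuition, spatial coherence can be sketched as one minus the geodesic distance normalized by the maximum pairwise distance among the candidate entities, mirroring the temporal definition on the next slide; the normalization choice here is an assumption, not the paper's exact formula:

```python
import math

def haversine_km(p1, p2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def spatial_coherence(e1, e2, all_coords):
    """1 minus the pairwise distance, normalized by the maximum distance
    among all candidate entities in all_coords (assumed normalization)."""
    max_d = max(haversine_km(a, b) for a in all_coords for b in all_coords)
    if max_d == 0:
        return 1.0
    return 1.0 - haversine_km(e1, e2) / max_d
```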

35 Temporal Coherence Defined between two entities e1, e2 ∈ E with existence times: tempCoh(e1, e2) = 1 - |cet(e1) - cet(e2)| / max_{ei, ej ∈ E} |cet(ei) - cet(ej)|, where cet(·) is the center of an entity's existence time interval, and the denominator normalizes the distance by the maximum distance of any two entities ei, ej ∈ E in the current set of entity candidates. The intuition is that a text usually mentions entities that are clustered around a single point or a few points in time

36 Named Entity Disambiguation Calculate spatial and temporal coherence between the mention in the input text and all candidate entities in the knowledge base, and use them in the weighted formula for the entities' relatedness
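
A hedged sketch of such a weighted combination; the weights are illustrative placeholders, not the paper's tuned values:

```python
def combined_score(link_rel, spatial_coh, temporal_coh,
                   w_link=0.6, w_spatial=0.2, w_temporal=0.2):
    """Weighted mix of link-based relatedness and the two coherence
    scores for a pair of candidate entities; higher is better."""
    return (w_link * link_rel
            + w_spatial * spatial_coh
            + w_temporal * temporal_coh)
```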
