INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio LECTURE 10: Knowledge and The Social Web.

‘CYC convinced the AI community that creating a commonsense knowledge base by hand is impossible’ (Massimo, Lecture 1). That may depend on how many people you put on it!

THE SOCIAL WEB Increasingly, the Web is not just a way to facilitate information exchange or commercial transactions, but also a tool for socialization (Facebook, LinkedIn, etc.). It is also a place where information can be created collectively.

SOCIAL CREATION OF KNOWLEDGE

WIKIPEDIA Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Wikipedia's articles have been written collaboratively by volunteers around the world. Almost all of its articles can be edited by anyone who can access the Wikipedia website: ‘The free encyclopedia that anyone can edit’. Source: http://en.wikipedia.org/wiki/Wikipedia

WIKIPEDIA Wikipedia is:
1. domain independent: it has broad coverage
2. up to date: it can be used to process current information
3. multilingual: it can be used to process information in many languages

Parts of a Wikipedia page: title, abstract, infoboxes, geo-coordinates, categories, images, links (to other languages, to other wiki pages, to the web), redirects, disambiguation pages.

WIKIPEDIA Wikipedia is an encyclopedia written collaboratively by many of its readers. Lots of people are constantly improving Wikipedia, making thousands of changes an hour, all of which are recorded on article histories and recent changes. Inappropriate changes are usually removed quickly. Unlike other encyclopedias, the volunteer authors of articles in Wikipedia don't have to be experts or scholars (though some certainly are).

Encyclopedic knowledge in coreference resolution [The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ….. [The agency] said that because MCI's offer had expired AT&T couldn't continue to offer its discount plan.

Why Wikipedia may help address the encyclopedic knowledge problem http://en.wikipedia.org/wiki/FCC: The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

Another interesting scenario A fresh mandate for [Mr Ahmadinejad] would, say his critics, consecrate the “revolution within a revolution” he has been trying to effect since his surprise electoral triumph in 2005. Best known to outsiders for his bellicose grandstanding, [the incumbent] is more familiar to Iranians as a radical and hyperactive populist who has used the tacit backing of his fellow conservative, Mr Khamenei, greatly to expand the powers of the presidency. Source: It could make a big difference, The Economist, Mar 19th 2009

Why Wikipedia may help address the encyclopedic knowledge problem

Wikipedia as Ontology Unlike other standard ontologies, such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus. However, it is more:
– Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia)
– Accurate: a study by Giles (2005) found that Wikipedia can compete with Encyclopædia Britannica in accuracy*
– Up to date: current and emerging concepts are absorbed quickly
* Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438: 900–901.

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept. – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus.

Wikipedia Article that describes the Concept Artificial intelligence

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. – Equivalent concepts are grouped together by redirected links.

AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. – Equivalent concepts are grouped together by redirected links. – It contains a hierarchical categorization system, in which each article belongs to at least one category.

The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. – Equivalent concepts are grouped together by redirected links. – It contains a hierarchical categorization system, in which each article belongs to at least one category. – Polysemous concepts are disambiguated by Disambiguation Pages.

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page.
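The redirect and category structure described above can be pictured as two lookup tables. The following is only a minimal sketch: the table contents are the single example from these slides, and the names `REDIRECTS`, `CATEGORIES`, `resolve` and `categories_of` are made up for illustration. A real system would fill the tables from a Wikipedia dump.

```python
# Toy tables holding only the example from the slides (illustrative names).
REDIRECTS = {"AI": "Artificial intelligence"}
CATEGORIES = {
    "Artificial intelligence": [
        "Artificial intelligence",
        "Cybernetics",
        "Formal sciences",
        "Technology in society",
    ],
}


def resolve(title, redirects=REDIRECTS, max_hops=5):
    """Follow redirect links until a title that is not a redirect is reached."""
    hops = 0
    while title in redirects:
        if hops == max_hops:
            raise ValueError("redirect chain too long or cyclic")
        title = redirects[title]
        hops += 1
    return title


def categories_of(title):
    """Return the categories of the concept a title refers to (after redirects)."""
    return CATEGORIES.get(resolve(title), [])


canonical = resolve("AI")             # "Artificial intelligence"
ai_categories = categories_of("AI")   # the four categories listed in the table
```

Resolving redirects before any other lookup means that equivalent titles share one entry, which is what lets the article title act like a thesaurus term.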

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
– Taxonomic information: category structure
– Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007) Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007) Induce a subsumption hierarchy

INFOBOXES Collaborative content, semi-structured data:
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]...
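Because infoboxes are semi-structured, their fields can be pulled out with a small parser. The sketch below is not how any particular extraction framework does it; `parse_infobox` is a hypothetical helper that splits on top-level `|` separators and leaves nested templates and links untouched. The sample is a shortened, closed version of the Poe infobox.

```python
def parse_infobox(wikitext):
    """Return {field: raw value} for the top-level '| key = value' fields."""
    parts = []      # text between top-level '|' separators
    current = []    # characters of the part being read
    depth = 0       # nesting depth of {{...}} and [[...]]
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            if depth > 1:            # keep nested markup as part of the value
                current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            if depth == 0:           # end of the infobox itself
                break
            current.append(pair)
            i += 2
        elif wikitext[i] == "|" and depth == 1:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    parts.append("".join(current))

    fields = {}
    for part in parts[1:]:           # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = (
    "{{Infobox Writer | name = Edgar Allan Poe "
    "| birth_date = {{birth date|1809|1|19|mf=y}} "
    "| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]] "
    "| magnum_opus = The Raven }}"
)
fields = parse_infobox(SAMPLE)
```

Tracking the nesting depth is what keeps the `|` characters inside `{{birth date|...}}` and `[[United States|U.S.]]` from being mistaken for field separators.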

DBPEDIA DBpedia.org is an effort to:
– extract structured information from Wikipedia
– make this information available on the Web under an open license
– interlink the DBpedia dataset with other datasets on the Web

 1,600,000 concepts  including  58,000 persons  70,000 places  35,000 music albums  12,000 films  described by 91 million triples  using 8,141 different properties.  557,000 links to pictures  1,300,000 links external web pages  207,000 Wikipedia categories  75,000 YAGO categories The DBpedia Dataset

REPRESENTING EXTRACTED INFORMATION The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists toolkits for processing DBpedia data in your preferred programming language.

Extracting Infobox Data (RDF Representation):
Wikipedia page: http://en.wikipedia.org/wiki/Calgary
RDF resource: http://dbpedia.org/resource/Calgary
dbpedia:native_name “Calgary”;
dbpedia:altitude “1048”;
dbpedia:population_city “988193”;
dbpedia:population_metro “1079310”;
mayor_name dbpedia:Dave_Bronconnier ;
governing_body dbpedia:Calgary_City_Council;
...
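An RDF description like the one for Calgary is a set of (subject, predicate, object) triples. As a rough sketch, the literal-valued properties from the slide can be held in a plain Python list and queried by pattern matching. The helpers `infobox_to_triples` and `match` are illustrative names, not part of DBpedia or of any RDF library.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# Literal-valued infobox fields copied from the slide.
calgary_fields = {
    "native_name": "Calgary",
    "altitude": "1048",
    "population_city": "988193",
    "population_metro": "1079310",
}


def infobox_to_triples(subject, fields, prefix="dbpedia:"):
    """Turn each infobox field into one (subject, predicate, object) triple."""
    return [(subject, prefix + key, value) for key, value in fields.items()]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every component that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


triples = infobox_to_triples(CALGARY, calgary_fields)
altitude = match(triples, s=CALGARY, p="dbpedia:altitude")
# altitude == [(CALGARY, "dbpedia:altitude", "1048")]
```

Leaving a component as `None` plays the role of a variable in a query: the pattern `(CALGARY, None, None)` returns everything known about the city.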

SPARQL SPARQL is a query language for RDF, a directed, labeled graph data format for representing information on the Web. The W3C SPARQL specification defines the syntax and semantics of the language. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

The DBpedia SPARQL Endpoint
– http://dbpedia.org/sparql
– hosted on an OpenLink Virtuoso server
– can answer SPARQL queries like:
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
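One of the example questions, the films by Quentin Tarantino, might be written as a SPARQL query along the following lines. This is a sketch under assumptions: the class `dbo:Film` and the property `dbo:director` come from the current DBpedia ontology and may not match the dataset version described in these slides, and the `format` parameter is a convention of the Virtuoso server rather than part of SPARQL itself. The code only builds the request URL and does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director (current DBpedia ontology).
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
""".strip()


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as an HTTP GET URL for the endpoint."""
    params = {
        "query": query,                               # standard SPARQL protocol parameter
        "format": "application/sparql-results+json",  # assumed Virtuoso convention
    }
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)
```

The query is a graph pattern with one variable, `?film`; the endpoint returns every resource that makes both triples true.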

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts.
– Other initiatives: Citizen Science, Cognition and Language Laboratory, …
AI has taken advantage of this:
– Open Mind Commonsense (Singh) (collecting facts)
– Semantic Wikis

WEB COLLABORATION PROJECTS
– Open Mind Common Sense – Singh
– Crater mapping (results) – Kanefsky
– Learner / Learner2 / 1001 Paraphrases – Chklovski
– FACTory – CyCORP
– Hot or Not – 8 Days
– ESP / Phetch / Verbosity / Peekaboom – von Ahn
– Galaxy Zoo – Oxford University

OPEN MIND COMMONSENSE A project started in 2000 by Push Singh to collect commonsense knowledge by taking advantage of people’s willingness to collaborate.

WHAT’S IN OPEN MIND COMMONSENSE: CAR

OPEN MIND COMMONSENSE: ADDING KNOWLEDGE

OMCS ADDING KNOWLEDGE, 2

OPEN MIND COMMONSENSE: CHECKING KNOWLEDGE

FROM OPENMIND COMMONSENSE TO CONCEPT NET ConceptNet (Havasi et al., 2009) is a semantic network extracted from OpenMind Commonsense assertions using simple heuristics.

CONCEPT NET

ConceptNet Example

FROM OPENMIND COMMONSENSE FACTS TO CONCEPTNET ‘A lime is a very sour fruit’ → isa(lime,fruit), property_of(lime,very_sour)
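The lime example suggests the kind of simple heuristic involved. The pattern below is a guess made for illustration, not the actual ConceptNet extraction rules: it handles only sentences of the form ‘A X is a (modifiers) Y’, takes the last word as the head noun, and joins any modifiers into a property.

```python
import re

# Hypothetical pattern for sentences shaped like "A X is a (modifiers) Y".
IS_A_PATTERN = re.compile(
    r"^(?:an?|the)\s+(\w+)\s+is\s+(?:an?|the)\s+(.+?)\.?$",
    re.IGNORECASE,
)


def extract_assertions(sentence):
    """Return (relation, concept, value) tuples found by the is-a heuristic."""
    m = IS_A_PATTERN.match(sentence.strip())
    if not m:
        return []
    subject = m.group(1).lower()
    words = m.group(2).lower().split()
    head, modifiers = words[-1], words[:-1]
    assertions = [("isa", subject, head)]
    if modifiers:
        assertions.append(("property_of", subject, "_".join(modifiers)))
    return assertions


lime = extract_assertions("A lime is a very sour fruit")
# lime == [("isa", "lime", "fruit"), ("property_of", "lime", "very_sour")]
```

A heuristic this shallow works because Open Mind Commonsense sentences were mostly collected through fixed templates, so their surface form is predictable.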

GAMES WITH A PURPOSE Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK GWAP do not rely on altruism (as Open Mind Commonsense does) or on financial incentives (as Mechanical Turk does) to entice people to perform certain actions. The key property of games is that PEOPLE WANT TO PLAY THEM.

EXAMPLES OF GWAP
Games at www.gwap.com:
– ESP
– Verbosity
– TagATune
Other games:
– Peekaboom
– Phetch

ESP The first GWAP, developed by von Ahn and his group (2003 / 2004).
The problem: obtain accurate descriptions of images, to be used
– to train image search engines
– to develop machine learning approaches to vision
The goal: label the majority of the images on the Web

ESP: the game

ESP: THE GAME Two partners are picked at random from the large number of players online. They are not told who their partner is, and can’t communicate with them. They are both shown the same image. The goal: guess how their partner will describe the image, and type that description (hence, the ESP game). If any of the strings typed by one player matches a string typed by the other player, they score points.
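The matching rule at the heart of the game is easy to state in code. This is a simplified sketch: it ignores timing and the order in which the two players type, and the normalization (trimming and lower-casing) is an assumption, since the slides do not say how strings are compared. `first_match` is an illustrative name.

```python
def first_match(guesses_a, guesses_b):
    """Return the first guess of player A that player B also typed, else None."""
    typed_by_b = {g.strip().lower() for g in guesses_b}
    for guess in guesses_a:
        normalized = guess.strip().lower()
        if normalized in typed_by_b:
            return normalized
    return None


agreed = first_match(["dog", "grass", "ball"], ["puppy", "Ball", "park"])
# agreed == "ball"
```

Since the partners cannot communicate, the only way to match is to type what most people would say about the image, which is why agreed strings tend to be usable labels.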

THE TASK

SCORING BY MATCHING

THE CHALLENGE: SCORES One of the motivating factors is to try to score as many points as possible. Hourly, daily, weekly, and monthly scores are shown.

SCORES

THE CHALLENGE: TIMING Partners try to agree on as many images as they can in 2½ minutes. The thermometer on the side indicates how many images they have agreed on. If they agree on 15 images they score bonus points.

TABOO WORDS To ensure the production of a large number of specific labels, some words are declared TABOO and not allowed. Taboo words are obtained from the game itself: any word that has been agreed upon by players who were shown a picture earlier becomes a taboo word for that image.

TABOO WORDS

PASSING

GOOD LABELS, COMPLETING AN IMAGE A label is considered “good” when more than N players produce it (with N a parameter of the game). An image is “done” when its list of taboo words is so extensive that most players pass on it.
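The threshold rule for good labels, and the filtering of taboo words, can be sketched as follows. The data layout (a dictionary from player to the labels that player produced for one image) and the function names are assumptions made for the example, and how the real game counts players is not specified in these slides.

```python
from collections import Counter


def good_labels(labels_by_player, n):
    """Return the labels produced by more than n distinct players."""
    counts = Counter()
    for labels in labels_by_player.values():
        counts.update(set(labels))      # a player counts once per label
    return {label for label, count in counts.items() if count > n}


def allowed_guesses(guesses, taboo_words):
    """Drop the guesses that are taboo for the current image."""
    return [g for g in guesses if g not in taboo_words]


# Hypothetical labels collected for one image.
image_labels = {
    "p1": ["dog", "ball"],
    "p2": ["dog", "grass"],
    "p3": ["dog", "ball", "dog"],
}
good = good_labels(image_labels, n=1)            # {"dog", "ball"}
remaining = allowed_guesses(["dog", "puppy"], taboo_words={"dog"})
```

Raising N trades quantity for reliability: with `n=2` only "dog" survives in this example.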

IMPLEMENTATION
– Pre-recorded game play: especially at the beginning, and at quiet times, there won’t always be players to pair with. In these cases a player is paired against a recorded ‘hand’ of a previous game with the same picture.
– Cheating: players could cheat in a number of ways, including agreeing on labels or playing against themselves. A number of mechanisms are in place against those cases.
– Selecting images

SOME STATISTICS
In the 4 months between August 9th 2003 and December 10th 2003:
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once
By 2008:
– 200,000 players
– 50 million labels

ANALYSIS The numbers indicate that the game is fun to play. Exciting factors:
– Playing with a partner
– Playing against time

QUALITY OF THE LABELS
For IMAGE SEARCH:
– choose 10 labels among those produced and look at which images are returned
Compare labels produced by players with labels produced by participants in an experiment:
– 15 participants, 20 images among the 1000 with more than 5 labels
– 83% of game labels also produced by participants
Manual assessment of labels (‘would you use these labels to describe this image?’):
– 15 participants, 20 images
– 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

VERBOSITY … or, the game approach to collecting commonsense knowledge. Motivation: slow progress both on CYC (5 million facts collected) and on Open Mind Commonsense (around 700,000 facts).

THE GAME Based on an existing game, TABOO:
– Players have to guess a word
– One of the players gives hints concerning the word
In Verbosity, you have two players, the DESCRIBER and the GUESSER, and a SECRET WORD.

THE GAME

TEMPLATES IN VERBOSITY As in Open Mind Commonsense, templates are used to ensure that the relations / properties of interest are collected. The Describer produces hints by filling in a template.

GUESSING ATTRIBUTES

PRODUCING A DESCRIPTION

TEMPLATES
– _ is a kind of _
– _ is used for _
– _ is typically near/in/on _
– _ is the opposite of _ / _ is related to _
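Each template corresponds to one relation, so a filled-in hint can be stored directly as a fact about the secret word. The sketch below assumes that the first blank stands for the secret word and the second is what the Describer types. The relation names and the helpers `make_hint` and `make_fact` are invented for illustration.

```python
# Relation names are hypothetical; the template texts come from the slide.
TEMPLATES = {
    "kind_of": "_ is a kind of _",
    "used_for": "_ is used for _",
    "typically_near": "_ is typically near _",
    "typically_in": "_ is typically in _",
    "typically_on": "_ is typically on _",
    "opposite_of": "_ is the opposite of _",
    "related_to": "_ is related to _",
}


def make_hint(relation, filler):
    """Text shown to the Guesser: the secret word stays hidden as '_'."""
    head = TEMPLATES[relation].rpartition("_")[0]
    return head + filler


def make_fact(secret_word, relation, filler):
    """Fact recorded once the hint has been given for the secret word."""
    return (relation, secret_word, filler)


hint = make_hint("kind_of", "liquid")          # "_ is a kind of liquid"
fact = make_fact("milk", "kind_of", "liquid")  # ("kind_of", "milk", "liquid")
```

Because the relation is fixed by the template the Describer picks, no language processing is needed to turn a hint into a structured fact.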

EMULATION As in the ESP game, pre-recorded games are used when a player cannot be paired with another player. The asymmetry of the game causes a problem not encountered in the ESP game:
– Describer: can just repeat the behavior of a previous describer
– Guesser: not so easy

RESULTS The only published results I’m aware of predate the actual release of the game, so I don’t know about the QUANTITY.
Quality:
– Six raters were asked whether 200 facts collected using Verbosity are ‘true’
– Around 85% success

PHRASE DETECTIVES www.phrasedetectives.org

PHRASE DETECTIVES: THE TASKS Two tasks:
– Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
– Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS
– V. Nastase & M. Strube, Transforming Wikipedia into a large scale multilingual concept network, Artificial Intelligence, 2012
– C. Havasi, J. Pustejovsky, R. Speer & H. Lieberman, Digital Intuition: Applying Common Sense Using Dimensionality Reduction, IEEE Intelligent Systems, 2009
– L. von Ahn & L. Dabbish, Designing games with a purpose, Communications of the ACM, v. 51, n. 8, 58–67, 2008
– M. Poesio, J. Chamberlain, U. Kruschwitz, L. Robaldo & L. Ducceschi, Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation, ACM Transactions on Interactive Intelligent Systems, 2013
