1 Philips Research, Jan Korst, 26 November 2004. Ontology-based Extraction of Information from the Internet. Jan Korst, Philips Research. Joint work with Michael Verschoor, Nick de Jong, and Gijs Geleijnse.

2 Overview
- Context
- Ontologies
- Searching for enumerations / tables in web pages
- Case Study: Searching for famous persons on the web
- Concluding remarks

3 Context
- recommender system: ontologies and metadata, matching and reasoning
- preferences, personal history, and calendar
- electronic program guide, cultural agenda
- recommendations for TV shows, exhibitions in museums, theatre shows, etc.

4 Ontologies
An ontology is a “specification of a conceptualization” [Tom Gruber]. In other words: a formal description of the concepts and their relationships in a certain domain.
Example: music domain
- concepts: composers, songs, albums, performers, …
- relationships: …
Semantic web languages such as RDF(S) and OWL are useful for specifying ontologies for given knowledge domains.

5 Ontologies
An ontology O is defined by a 4-tuple (C, I, P, T), where:
- C is a set of classes c, e.g. composer, song, album, performer, …
- I = { I(c) | c ∈ C }, with I(c) the set of instances of class c
- P is a set of properties p(c, c’) for some c, c’ ∈ C, e.g. is_composer_of (composer, song), is_contained_in (song, album)
- T = { T(p) | p ∈ P }, where for each p ∈ P, T(p) ⊆ { (s, p, o) | s ∈ I(c), o ∈ I(c’) } is the set of true statements (triples).
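A minimal sketch of how the 4-tuple could be held in memory, assuming Python. The class name `Ontology`, its field names, and the helper methods are illustrative and do not come from the slides; the instance "Air" is a made-up example.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Illustrative container for O = (C, I, P, T)."""
    classes: set = field(default_factory=set)        # C
    instances: dict = field(default_factory=dict)    # I: class -> set of instances
    properties: dict = field(default_factory=dict)   # P: property -> (c, c')
    statements: set = field(default_factory=set)     # T: set of (s, p, o) triples

    def add_instance(self, cls, name):
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        self.instances.setdefault(cls, set()).add(name)

    def add_statement(self, s, p, o):
        # A triple (s, p, o) is accepted only if s is an instance of c
        # and o is an instance of c' for the property p(c, c').
        c_subj, c_obj = self.properties[p]
        if s not in self.instances.get(c_subj, set()):
            raise ValueError(f"{s} is not an instance of {c_subj}")
        if o not in self.instances.get(c_obj, set()):
            raise ValueError(f"{o} is not an instance of {c_obj}")
        self.statements.add((s, p, o))


music = Ontology(
    classes={"composer", "song", "album"},
    properties={
        "is_composer_of": ("composer", "song"),
        "is_contained_in": ("song", "album"),
    },
)
music.add_instance("composer", "Bach")
music.add_instance("song", "Air")
music.add_statement("Bach", "is_composer_of", "Air")
```

Storing T as a set of triples keeps the partially given ontology and its extension directly comparable with ordinary set operations.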

6 Problem statement
For a partially given ontology O’ = (C, I’, P, T’) of a given knowledge domain, with I’ ⊆ I and T’ ⊆ T, extend I’ to I’’ and T’ to T’’ to approximate I and T as well as possible. In other words: how can we populate databases?
Research questions:
- Can this be automated?
- Can we do this by extracting information from the web?

7 Quality of Approximation
For each class c, we define precision and recall as follows:
precision(c) = |I(c) ∩ I’’(c)| / |I’’(c)|
recall(c) = |I(c) ∩ I’’(c)| / |I(c)|
For each property p, precision and recall are defined likewise.
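As a small worked example, assuming the usual set-based reading of precision (the fraction of extracted instances that are correct) and recall (the fraction of true instances that were extracted), the two measures can be computed as follows. The function name and the sample sets are illustrative.

```python
def precision_recall(extracted, truth):
    """Return (precision, recall) of an extracted set against the true set."""
    extracted, truth = set(extracted), set(truth)
    correct = extracted & truth
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical data: three of the four extracted terms are real countries,
# and three of the four true countries were found.
truth = {"france", "germany", "italy", "spain"}
extracted = {"france", "germany", "italy", "team"}
p, r = precision_recall(extracted, truth)
```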

8 Searching for enumerations on the web
Basic idea: words in an enumeration tend to be of the same class. Given a small subset of instances of a given class, we want to extend this subset automatically: more-of-the-same.
Algorithm:
- select web pages in which a given sequence or given subset of instances occurs, using Google.
- scan these pages for enumerations in which one or more of the given instances occur.
- extract the other terms in these enumerations.
A similar approach has been applied to a corpus of documents in molecular biology [Nenadić, Spasić & Ananiadou, 2002].
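The scanning and extraction steps can be sketched as below. This is an illustration under simplifying assumptions, not the authors' implementation: page texts are passed in directly (the Google page-selection step is left out), an enumeration is taken to be a comma-separated run of single words with an optional final "and", and the function name is made up.

```python
import re
from collections import Counter

# A run of single words separated by commas, optionally closed by "and <word>".
ENUMERATION = re.compile(r"\w+(?:,\s*(?!and\b)\w+)+(?:,?\s+and\s+\w+)?")


def extract_enumeration_terms(pages, seeds):
    """Count the terms of every enumeration that contains at least one seed."""
    seeds = {s.lower() for s in seeds}
    counts = Counter()
    for text in pages:
        for match in ENUMERATION.finditer(text):
            items = re.split(r",?\s+and\s+|,\s*", match.group(0))
            items = [item.lower() for item in items if item]
            if seeds & set(items):
                counts.update(items)
    return counts


pages = [
    "Composers such as Bach, Vivaldi, Mozart and Haydn wrote concertos.",
    "Buy apples, pears and plums here.",
]
terms = extract_enumeration_terms(pages, {"bach", "mozart"})
```

The per-term counts play the role of the bracketed numbers in the example output on the following slides: terms that co-occur with the seeds in many enumerations rank highest.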

9 General structure of the algorithm
Preselection of relevant web pages --> Extraction of Instances/Statements --> Filter to remove false positives

10 Philips Research, Jan Korst, 26 november 200410 Examples "bach vivaldi mozart" 611 --> [63] bach[154], mozart[46], vivaldi[45], haydn[17], beethoven[14], ensembles[9], handel[9], chopin[7], haendel[5], schubert[5], bizet[4], j[4], albinoni[3], brahms[3], s[3], sanz[3], tartini[3], 2[2], chaconne[2], corelligeminiani[2], gershwin[2], gluck[2], http[2], inteacutegrale[2], minor[2], paganini[2], ravel[2], strauss[2], stravinsky[2], tchaikovsky[2], teleman[2], telemann[2], albeniz[1], bellini[1], benda[1], berlioz[1], bloch[1], boccherini[1], boellman[1], boieldieu[1], bruch[1], caccini[1], caldera[1], corelli[1], diabelli[1], dowland[1], giuliani[1], grieg[1], homekcrrcom[1], jsbach[1], martin[1], milano[1], ortiz[1], pergolesi[1], prokofiev[1], purcell[1], rimskykorsakov[1], schumann[1], smetana[1], title[1], torelli[1], vieuxtemps[1]

11 Philips Research, Jan Korst, 26 november 200411 Examples (2) "france germany england italy" 246 --> [54] france[322], germany[259], brazil[257], italy[239], argentina[223], england[218], spain[215], holland[212], yugoslavia[140], croatia[133], denmark[129], norway[122], chile[91], belgium[88], nigeria[83], romania[83], mexico[66], bulgaria[59], colombia[54], scotland[34], austria[33], cameroon[30], team[25], usa[22], sth[18], states[16], morocco[13], ar[12], netherlands[12], saudi[11], africa[10], bahamas[10], paraguay[10], czech[8], jamaica[8], scandinavia[8], canada[7], japan[7], acquitane[4], australia[4], bali[4], caribbean[4], china[4], czechoslovakia[4], luxembourg[4], poland[4], us[4], flanders[2], acadeacutemiques[1], asn[1], cortona[1], europe[1], korea[1], park[1]

12 Philips Research, Jan Korst, 26 november 200412 Examples (3) poincare hilbert brouwer 1110 --> [90] brouwer[20], hilbert[20], abel[18], deligne[18], gregory[18], mandelbrot[18], taylor[18], turing[18], cavalieri[17], poisson[17], banach[16], kolmogorov[16], wiener[16], goldbach[15], grassmann[15], cohen[13], hausdorff[13], jacobi[13], kronecker[13], torricelli[13], vinogradov[13], riemann[12], dedekind[11], frege[11], artin[10], babbage[10], barrow[10], boole[10], bourgain[10], eukleidõs[10], euler[10], fraenkel[10], heaviside[10], legendre[10], möbius[10], shannon[10], tchebychev[10], borel[9], fibonacci[9], fisher[9], grothendieck[9], aryabhata[8], birkhoff[8], bolyai[8], cayley[8], church[8], descartes[8], hypatie[8], markov[8], minkowski[8], bolzano[7], cramer[7], dee[7], painlevÕ[7], cantor[6], morgan[6], puthagoras[6], gauss[5], haldane[5], hauptman[5], irons[5], lejeune[5], schwartz[5], lie[4], bayes[3], poincareacute[3], poincarÕ[3], biography[2], brahmagupta[2], carnap[2], goumldel[2], gödel[2], …

13 Hypernym-based filtering
Patterns that indicate hypernym relations are distinguished [Hearst, 1992]:
- “h such as i1, i2, …, in”
- “i1, i2, …, in and other h”
In these patterns, h is the plural of the intended class.
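A rough sketch of how the two patterns could be matched in text, assuming single-word instances and regular expressions. The function name, the regular expressions, and the sample sentences are illustrative assumptions, not the filter actually used in the work.

```python
import re

ITEM = r"(?!and\b)[\w-]+"            # one single-word instance, never "and"
ENUM = rf"{ITEM}(?:,\s*{ITEM})*"     # i1, i2, ..., in


def hypernym_instances(text, plural_class):
    """Return the lower-cased instances that occur in either pattern."""
    h = re.escape(plural_class)
    patterns = [
        # "h such as i1, i2, ..., in"
        rf"\b{h}\s+such\s+as\s+(?P<enum>{ENUM}(?:,?\s+and\s+{ITEM})?)",
        # "i1, i2, ..., in and other h"
        rf"(?P<enum>{ENUM}),?\s+and\s+other\s+{h}\b",
    ]
    found = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            parts = re.split(r",?\s+and\s+|,\s*", match.group("enum"))
            found.update(part.lower() for part in parts if part)
    return found


text = ("He played composers such as Bach, Vivaldi and Mozart last night. "
        "Brahms, Ravel and other composers are featured.")
confirmed = hypernym_instances(text, "composers")

# Candidates without hypernym support are dropped as likely false positives.
candidates = ["bach", "http", "ravel"]
filtered = [c for c in candidates if c in confirmed]
```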

14 Geographic Data
Extract all countries:
Input set                 Precision  Recall
France, China, Germany    0.89       0.99
Georgia, Ghana, Latvia    0.84       0.99
Kiribati, Monaco, Togo    0.79       0.99
Find out which countries have a border in common.

15 Case Study: Finding Famous Persons on the Web
Objective: generate a long list of famous persons by searching the web.
- A famous person is a person who gets enough hits when Googled.
- We restrict ourselves to persons who have already died.

16 Definition of number of hits
- Using only the last name is not specific enough, e.g. Bach, Smith.
- Even the full name might not be specific enough, e.g. Theo van Gogh.
- In addition, some persons score better with a middle name, others without, e.g. Johann Sebastian Bach vs. Johann Bach, Antonio Vivaldi vs. Antonio Lucio Vivaldi.
- Others are best known by their initials only, e.g. HG Wells, DH Lawrence.

17 Definition of number of hits
We use the number of hits found with the query “<last name> (<year of birth> - <year of death>)”, e.g. “Bach (1685 – 1750)”.
By not using the full name, we combine different variants, e.g. Johann Sebastian Bach and JS Bach.
For kings, queens, popes, etc., the Latin ordinal number is used as last name. This combines the variants in different languages, e.g. Charles V, Carlos V, Karel V.

18 Basic idea
We use potential time intervals “(y1 - y2)” as starting point to search for persons.
Issue exact queries to Google of the following form: allintitle: “(y1 – y2)”, where y1 ∈ [1000..1999] and y2 - y1 ∈ [20..110], and analyse the summaries Google returns.
Look for the six words that precede “(y1 – y2)” and analyse these words.
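The query generation and the six-word window can be sketched as follows. No search engine is called here; the function names and the sample summary are illustrative assumptions. With the stated ranges, 1000 birth years times 91 possible ages give 91,000 queries.

```python
import re


def interval_queries(first_year=1000, last_year=1999, min_age=20, max_age=110):
    """Yield one allintitle query per candidate (y1, y2) interval."""
    for y1 in range(first_year, last_year + 1):
        for age in range(min_age, max_age + 1):
            yield f'allintitle: "({y1} - {y1 + age})"'


def preceding_words(summary, y1, y2, n=6):
    """Return up to n words that precede "(y1 - y2)" in a result summary."""
    match = re.search(rf"\(\s*{y1}\s*[-–]\s*{y2}\s*\)", summary)
    if match is None:
        return []
    return summary[:match.start()].split()[-n:]


queries = list(interval_queries())
summary = "Biography of Johann Sebastian Bach (1685 - 1750), German composer"
window = preceding_words(summary, 1685, 1750)
```

The window still contains non-name words such as "Biography of", which is why the later clean-up steps are needed.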

19 Google batch processing
To process the Google queries we use a program that allows batch processing (Nick de Jong). The program allows parallel execution of multiple queries.
file with queries --> GoogleQuery --> file with results

20 Main Problem: how to separate person names from other names.
Art Blakey vs. Art Deco
West Mae vs. West Virginia
Raul Delcroix vs. Real Decreto
HP Lovecraft vs. HP Inkjet
Koye Somefun vs. Have SomeFun
Potential approaches:
- filter out non-persons by using a list of stop words.
- filter out non-persons by using an exhaustive list of first names.
- carry out further tests (“X was born in”).
We only used a list of 500 stop words, including: Album, Anniversary, Archive, Articles, Biographie, Biography, Births, Boats, Burials, Catalog, Census,…
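A minimal sketch of the stop-word filter, assuming a case-insensitive match on whole words. Only the eleven stop words quoted on the slide are included; the full list of 500 is not reproduced in the transcript, and the function name is illustrative.

```python
# Only the stop words quoted on the slide; the full list has about 500 entries.
STOP_WORDS = {
    "album", "anniversary", "archive", "articles", "biographie",
    "biography", "births", "boats", "burials", "catalog", "census",
}


def looks_like_person(name, stop_words=STOP_WORDS):
    """Reject a candidate name if any of its words is a stop word."""
    return not any(word.strip(".,").lower() in stop_words
                   for word in name.split())


kept = [name for name in ["Art Blakey", "Census Records", "Art Deco"]
        if looks_like_person(name)]
```

With this short list, "Art Deco" still passes the filter, which illustrates why the quality of the result depends on how complete the stop-word list is.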

21 Additional Problem: a single person can be presented in various ways
- Vasilij Kandinskij; Wassily Kandinsky; Vasily Kandinsky; Vassily Kandinsky; Kandinsky, Wassily; Kandinsky Wassily
- Johann Sebastian Bach; JS Bach; Johann Sebastian; Sebastian Bach; Bach, Johann Sebastian

22 Philips Research, Jan Korst, 26 november 200422 Example of the word sequences that are found: [allintitle: "(1769 - 1852)" -genealogy -genealogie] 111 Rose-Philippine Duchesne ( Wellesley, 1st Duke of Wellington ( Home Study Service Rose Philippine Duchesne Arthur, 1st Duke of Wellington ( The Duke of Wellington ( Wellesley, 1st Duke of Wellington ( Arthur Wellesley, Duke of Wellington. ( Wellesley, first Duke of Wellington ( People > Duke of Wellington ( > Pobl > Dug Wellington ( medal depicting Duke of Wellington ( Arthur Wellesley Wellington ( Wellesley, 1st Duke of Wellington ( John Landseer ( Wellington, Arthur Wellesley,Duke of, Learning Library: WELLINGTON, DUKE OF (

23 Philips Research, Jan Korst, 26 november 200423 Another Example: George Frederick Handel ( GEORGE F. HANDEL ( X. George Frederick Handel. ( Handel, George Frideric ( George Frederic Handel,... George Frederic Handel ( CD:Composers - H: Handel, George Frederic (German/British Classical DVD: Handel, George Frederic (German/British, George Frederic Handel (... George Frideric HANDEL ( Georg Frideric Handel | from Alibris George Frideric Handel ( New Window. George Frideric Handel ( up artist Handel, George F. ( Giulio Cesare. by GF Handel ( piece by HANDEL, Georg Friedrich (

24 1. first reduce capitals:
If a word consists of capitals only, then replace all but the first by lowercase letters, e.g. HANDEL → Handel
- Unless the word contains a hyphen, e.g. SAINT-SAENS → Saint-Saens
- Unless the word represents a Latin ordinal number, e.g. Louis XIV → Louis XIV
- Unless the word starts with ‘MC’, e.g. MCCULLOCH → McCulloch
- Unless the word is an abbreviation (initials), e.g. DE KNUTH → DE Knuth
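The rule and its exceptions could be sketched as below. The slide does not say how initials are recognised, so this sketch assumes, as a hypothetical heuristic, that an all-capital word of at most two letters is an abbreviation; the Roman-numeral check and the function name are likewise illustrative.

```python
import re

# Matches Roman numerals such as V, XIV or MCM.
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def reduce_capitals(word):
    """Lower-case all but the first letter of an all-capital word."""
    if not word.isupper():
        return word                      # mixed case: leave untouched
    if "-" in word:                      # SAINT-SAENS -> Saint-Saens
        return "-".join(reduce_capitals(part) for part in word.split("-"))
    letters = word.strip(".,:;")
    if ROMAN.match(letters):             # XIV stays XIV
        return word
    if len(letters) <= 2:                # assumed to be initials: DE, GF, F.
        return word
    if word.startswith("MC"):            # MCCULLOCH -> McCulloch
        return "Mc" + word[2:].capitalize()
    return word.capitalize()             # HANDEL -> Handel


def reduce_capitals_in(text):
    return " ".join(reduce_capitals(word) for word in text.split())
```

A word such as MIX would be mistaken for a Roman numeral under this check, so a real implementation would probably need extra context.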

25 Philips Research, Jan Korst, 26 november 200425 Example: George Frederick Handel ( GEORGE F. HANDEL ( X. George Frederick Handel. ( Handel, George Frideric ( George Frederic Handel,... George Frederic Handel ( CD:Composers - H: Handel, George Frederic (German/British Classical DVD: Handel, George Frederic (German/British, George Frederic Handel (... George Frideric HANDEL ( Georg Frideric Handel | from Alibris George Frideric Handel ( New Window. George Frideric Handel ( up artist Handel, George F. ( Giulio Cesare. by GF Handel ( piece by HANDEL, Georg Friedrich (

26 Philips Research, Jan Korst, 26 november 200426 Example: George Frederick Handel ( George F. Handel ( X. George Frederick Handel. ( Handel, George Frideric ( George Frederic Handel,... George Frederic Handel ( CD:Composers - H: Handel, George Frederic (German/British, Classical Dvd: Handel, George Frederic (German/British, George Frederic Handel (... George Frideric Handel ( Georg Frideric Handel | from Alibris George Frideric Handel ( New Window. George Frideric Handel ( up artist Handel, George F. ( Giulio Cesare. by GF Handel ( piece by Handel, Georg Friedrich (

27 2. delete pre- and suffixes:
Delete parts that cannot be part of the name. First delete the suffix. Next, scan through the words from back to front until, e.g., a colon or full stop is encountered; the words before that point form the prefix and are deleted.
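One possible reading of this step, written as a sketch. It assumes that the suffix is everything from the opening bracket of the year interval onwards, that trailing commas and dots are dropped, and that a single letter followed by a full stop (such as "F.") is an initial rather than a sentence end. These rules and the function names are assumptions inferred from the example slides, not the authors' code.

```python
def _is_initial(word):
    # Hypothetical rule: a single letter followed by a full stop, e.g. "F."
    return len(word) == 2 and word[0].isalpha() and word[1] == "."


def strip_affixes(snippet):
    """Remove the suffix, then cut the prefix at the last colon or full stop."""
    # Suffix: everything from the opening bracket of the year interval on.
    words = snippet.split("(", 1)[0].split()
    if words and not _is_initial(words[-1]):
        words[-1] = words[-1].rstrip(".,")
        if not words[-1]:
            words.pop()
    kept = []
    # Scan from back to front until a colon or sentence-ending stop is met.
    for word in reversed(words):
        if word.endswith(":") or (word.endswith(".") and not _is_initial(word)):
            break
        kept.append(word)
    return " ".join(reversed(kept))
```

For example, `strip_affixes("New Window. George Frideric Handel (")` yields `"George Frideric Handel"`, while leftovers such as "up artist" survive and must be handled by the later steps.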

28 Philips Research, Jan Korst, 26 november 200428 Example: George Frederick Handel ( George F. Handel ( X. George Frederick Handel. ( Handel, George Frideric ( George Frederic Handel,... George Frederic Handel ( CD:Composers - H: Handel, George Frederic (German/British, Classical Dvd: Handel, George Frederic (German/British, George Frederic Handel (... George Frideric Handel ( Georg Frideric Handel | from Alibris George Frideric Handel ( New Window. George Frideric Handel ( up artist Handel, George F. ( Giulio Cesare. by GF Handel ( piece by Handel, Georg Friedrich (

29 Philips Research, Jan Korst, 26 november 200429 Example: George Frederick Handel George F. Handel X. George Frederick Handel Handel, George Frideric George Frederic Handel Handel, George Frederic George Frederic Handel George Frideric Handel Georg Frideric Handel from Alibris George Frideric Handel George Frideric Handel up artist Handel, George F. by GF Handel piece by Handel, Georg Friedrich

30 3. correct inversions:
- If two words remain, where the first ends with a comma, then reverse, e.g. West, Mae → Mae West
- If three words remain, where the first ends with a comma, then move the first word to the end, e.g. Handel, George Frederick → George Frederick Handel
- If three words remain, where the second ends with a comma, then move the last word to the front, e.g. Van Gogh, Vincent → Vincent van Gogh
Problem: not all inverted names contain commas.
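The three rules translate almost directly into code. The sketch below uses an illustrative function name and keeps the capitalisation of each word unchanged, so "Van Gogh, Vincent" becomes "Vincent Van Gogh" rather than "Vincent van Gogh".

```python
def correct_inversion(name):
    """Undo 'Last, First' style inversions in two- and three-word names."""
    words = name.split()
    if len(words) == 2 and words[0].endswith(","):
        # West, Mae -> Mae West
        return f"{words[1]} {words[0][:-1]}"
    if len(words) == 3 and words[0].endswith(","):
        # Handel, George Frederick -> George Frederick Handel
        return f"{words[1]} {words[2]} {words[0][:-1]}"
    if len(words) == 3 and words[1].endswith(","):
        # Van Gogh, Vincent -> Vincent Van Gogh
        return f"{words[2]} {words[0]} {words[1][:-1]}"
    return name
```

Names without a comma, such as "Kandinsky Wassily", pass through unchanged, which is the problem noted on the slide.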

31 Philips Research, Jan Korst, 26 november 200431 Example: George Frederick Handel George F. Handel X. George Frederick Handel Handel, George Frideric George Frederic Handel Handel, George Frederic George Frederic Handel George Frideric Handel Georg Frideric Handel from Alibris George Frideric Handel George Frideric Handel up artist Handel, George F. by GF Handel piece by Handel, Georg Friedrich

32 Philips Research, Jan Korst, 26 november 200432 Example: George Frederick Handel George F. Handel X. George Frederick Handel George Frideric Handel George Frederic Handel George Frideric Handel Georg Frideric Handel from Alibris George Frideric Handel George Frideric Handel up artist Handel, George F. by GF Handel piece by Handel, Georg Friedrich

33 4. save two- and three-word names
Scan the list of strings and store those consisting of two or three words, provided that they do not contain stop words. In addition, count how often each is found.

34 Example:
George Frederick Handel
George F. Handel
X. George Frederick Handel
George Frideric Handel
George Frederic Handel
George Frederic Handel
George Frederic Handel
George Frideric Handel
Georg Frideric Handel from Alibris
George Frideric Handel
George Frideric Handel
up artist Handel, George F.
by GF Handel
piece by Handel, Georg Friedrich
Counts:
George Frederic Handel 5
George Frideric Handel 2
George F. Handel 1
George Frederick Handel 1
Georg Frideric Handel 1
by GF Handel 1
For each lastname/years combination the form that was found most often is used.
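Counting the stored forms and selecting the most frequent one per lastname/years combination could look like the sketch below. It assumes that the last word of a name is the last name, and the function name, the sample stop word "by", and the sample data are illustrative.

```python
from collections import Counter, defaultdict


def most_frequent_forms(candidates, stop_words=frozenset()):
    """candidates: iterable of (name, year_of_birth, year_of_death)."""
    counts = defaultdict(Counter)
    for name, y1, y2 in candidates:
        words = name.split()
        if len(words) not in (2, 3):
            continue                      # keep two- and three-word names only
        if any(word.lower() in stop_words for word in words):
            continue
        key = (words[-1].lower(), y1, y2)  # assumed: last word is the last name
        counts[key][name] += 1
    # For each key, keep the most frequent form together with its count.
    return {key: forms.most_common(1)[0] for key, forms in counts.items()}


found = (
    [("George Frederic Handel", 1685, 1759)] * 5
    + [("George Frideric Handel", 1685, 1759)] * 2
    + [("George F. Handel", 1685, 1759),
       ("X. George Frederick Handel", 1685, 1759),
       ("by GF Handel", 1685, 1759)]
)
best = most_frequent_forms(found, stop_words={"by"})
```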

35 Unexpected Observations
- Franz-Eugen Schlachter (1859 – 1911) has 64,500 hits, but all from the same server! It concerns an on-line bible, where each bible page is implemented as a separate web page, with Franz-Eugen Schlachter in the title. We can use the similar-pages information that Google gives to filter these out.
- Koop Juliana (1948 - 1980) has 8,200 hits. “Koop Juliana” results in considerably fewer hits than “Juliana (1948 – 1980)”. That can be an indication that the first name is not correct.

36 Number of Persons Found
1000 – 1099: 40
1100 – 1199: 42
1200 – 1299: 79
1300 – 1399: 106
1400 – 1499: 357
1500 – 1599: 1050
1600 – 1699: 2258
1700 – 1799: 7239
1800 – 1899: 28637
1900 – 1999: 12101
Total: 51909

37 Top 16 born between 1500 and 1599
1 William Shakespeare (1564 - 1616) 51300
2 Rene Descartes (1596 - 1650) 33400
3 Galileo Galilei (1564 - 1642) 27300
4 Francis Bacon (1561 - 1626) 25200
5 John Dowland (1563 - 1626) 25000
6 Orlandus Lassus (1532 - 1594) 23200
7 Johannes Kepler (1571 - 1630) 22700
8 Thomas Hobbes (1588 - 1679) 15400
9 Frescobaldi Girolamo (1583 - 1643) 11900
10 Claudio Monteverdi (1567 - 1643) 11600
11 Peter Paul Rubens (1577 - 1640) 11400
12 Tycho Brahe (1546 - 1601) 11000
13 Michel de Montaigne (1533 - 1592) 10700
14 John Calvin (1509 - 1564) 9990
15 Elizabeth I (1558 - 1603) 7520
16 Andrea Palladio (1508 - 1580) 7140
17 Gibbons Orlando (1508 – 1580) 7030
18 Nicolas Poussin (1594 - 1665) 6790

38 Top 16 born between 1600 and 1699
1 Johann Sebastian Bach (1685 - 1750) 86600
2 Antonio Vivaldi (1678 - 1741) 39700
3 Henry Purcell (1659 - 1695) 37600
4 Georg Philipp Telemann (1681 - 1767) 35700
5 Georg Friedrich Haendel (1685 - 1759) 33600
6 Voltaire (1694 - 1778) 32800
7 Isaac Newton (1642 - 1727) 31700
8 Domenico Scarlatti (1685 - 1757) 28300
9 Arcangelo Corelli (1653 - 1713) 27300
10 Francois Couperin (1668 - 1733) 27100
11 Jean-Philippe Rameau (1683 - 1764) 26700
12 Alessandro Scarlatti (1660 - 1725) 25600
13 Tomaso Albinoni (1671 - 1751) 25000
14 Jean-Baptiste Lully (1632 - 1687) 24900
15 Giuseppe Tartini (1692 - 1770) 23800
16 de la Barca (1600 - 1681) 23000
17 John Locke (1632 - 1704) 22800
18 Blaise Pascal (1623 - 1662) 22700

39 Top 16 born between 1700 and 1799
1 Wolfgang Amadeus Mozart (1756 - 1791) 79000
2 Ludwig van Beethoven (1770 - 1827) 69400
3 Franz Schubert (1797 - 1828) 62300
4 Napoleon Bonaparte (1769 - 1821) 61500
5 Joseph Haydn (1732 - 1809) 50300
6 Johann Wolfgang Goethe (1749 - 1832) 45800
7 Immanuel Kant (1724 - 1804) 35800
8 Gioacchino Rossini (1792 - 1868) 34300
9 Benjamin Franklin (1706 - 1790) 28600
10 Washington Irving (1783 - 1859) 26900
11 Luigi Boccherini (1743 - 1805) 25100
12 Luigi Cherubini (1760 - 1842) 24100
13 William Blake (1757 - 1827) 22000
14 Arthur Schopenhauer (1788 - 1860) 21900
15 Thomas Jefferson (1743 - 1826) 20100
16 Jean-Jacques Rousseau (1712 - 1778) 19400
17 Boyce William (1711 - 1779) 17400
18 Heinrich Heine (1797 - 1856) 15900

40 Top 16 born between 1800 and 1899
1 Charles Darwin (1809 - 1882) 73400
2 Albert Einstein (1879 - 1955) 70500
3 Johannes Brahms (1833 - 1897) 60600
4 James Joyce (1882 - 1941) 59300
5 Peter Iljitsch Tschaikowsky (1840 - 1893) 47600
6 Robert Schumann (1810 - 1856) 45300
7 Frederic Chopin (1810 - 1849) 41200
8 Giuseppe Verdi (1813 - 1901) 41100
9 Claude Debussy (1862 - 1918) 39400
10 Winston Churchill (1874 - 1965) 39300
11 Franz Liszt (1811 - 1886) 38500
12 Richard Wagner (1813 - 1883) 38300
13 Richard Strauss (1864 - 1949) 37800
14 Antonin Dvorak (1841 - 1904) 35700
15 Maurice Ravel (1875 - 1937) 35300
16 Gustav Mahler (1860 - 1911) 34300

41 Top 16 born between 1900 and 1999
Rank | 16 nov. 2004 | 29 nov. 2004
1 | Ronald Reagan (1911 - 2004) 44800 | Yasser Arafat (1929 - 2004) 84200
2 | Benjamin Britten (1913 - 1976) 31700 | Ronald Reagan (1911 - 2004) 46600
3 | John Peel (1939 - 2004) 27400 | Benjamin Britten (1913 - 1976) 32000
4 | Samuel Barber (1910 - 1981) 26600 | Samuel Barber (1910 - 1981) 26300
5 | John Fitzgerald Kennedy (1917 - 1963) 24100 | John Peel (1939 - 2004) 21700
6 | Robertson Davies (1913 - 1995) 18900 | Robertson Davies (1913 - 1995) 18800
7 | Yasser Arafat (1929 - 2004) 16600 | John F. Kennedy (1917 - 1963) 17300
8 | Peter Ustinov (1921 - 2004) 16500 | Peter Ustinov (1921 - 2004) 16700
9 | Kurt Cobain (1967 - 1994) 14800 | Kurt Cobain (1967 - 1994) 14400
10 | Salvador Dali (1904 - 1989) 14600 | Salvador Dali (1904 - 1989) 14000
11 | Christopher Reeve (1952 - 2004) 13900 | Jon Lee (1968 - 2002) 13900
12 | Jon Lee (1968 - 2002) 13900 | Marlon Brando (1924 - 2004) 11200
13 | Marlon Brando (1924 - 2004) 11200 | Christopher Reeve (1952 - 2004) 10800
14 | Van Gogh (1957 - 2004) 10900 | Jean-Paul Sartre (1905 - 1980) 9790
15 | Albert Camus (1913 - 1960) 9730 | Chostakovitch Dimitri (1906 - 1975) 9640
16 | Jean-Paul Sartre (1905 - 1980) 9630 | Albert Camus (1913 - 1960) 9180
17 | Ted Hughes (1930 - 1998) 8970 | Van Gogh (1957 - 2004) 9050
18 | Jim Morrison (1943 - 1971) 8930 | Steve Reich (1965 - 1995) 8370

42 Top 16 born between 1000 and 1999
1 Johann Sebastian Bach (1685 - 1750) 86600
2 Wolfgang Amadeus Mozart (1756 - 1791) 79000
3 Charles Darwin (1809 - 1882) 73400
4 Albert Einstein (1879 - 1955) 70500
5 Ludwig van Beethoven (1770 - 1827) 69400
6 Franz Schubert (1797 - 1828) 62300
7 Napoleon Bonaparte (1769 - 1821) 61500
8 Johannes Brahms (1833 - 1897) 60600
9 James Joyce (1882 - 1941) 59300
10 Leonardo da Vinci (1452 - 1519) 53400
11 William Shakespeare (1564 - 1616) 51300
12 Joseph Haydn (1732 - 1809) 50300
13 Peter Iljitsch Tschaikowsky (1840 - 1893) 47600
14 Johann Wolfgang Goethe (1749 - 1832) 45800
15 Robert Schumann (1810 - 1856) 45300
16 Ronald Reagan (1911 - 2004) 44800

43 Testing recall
Herinneringen in Steen: 195 persons, recall: 0.77
150 found: James Baldwin, Olaf Palme, Simone Signoret, Henry Moore, Carel Willink, Joan Miro, Theolonius Monk, Georges Brassens, John Lennon, Jean-Paul Sartre, Simone de Beauvoir, Mae West, Kurt Gödel, Elvis Presley, Maria Callas, Charlie Chaplin, Benjamin Britten, Paul Robeson, Mao Zedong, Agatha Christie, Lotte Lehmann, Robert Stolz, Edward Kennedy, Pablo Picasso, Pablo Casals, Maurits Cornelis Escher, Ezra Pound, Jim Morrison, Louis Armstrong, Igor Stravinsky, Jimi Hendrix, Barnett Newman, Charles de Gaule, Judy Garland, Dwight David Eisenhower, Ho Tsji Minh, Martin Luther King, Robert Kennedy, Erneste Guevara, John William Coltrane,…
45 not found: Louis Paul Boon, Adriaan Roland Holst, Stijn Streuvels, Ernest Claes, Johannes XXIII, Dag Hammarskjöld, William Christopher Handy, Lucien Guitry, Antony Fokker, Pieter Jelles Troelstra, Paul van Ostaijen, Hugo Verriest,…

44 Testing recall
Het Kunst Boek: of the first 200 (dead) persons, recall: 0.84
167 found: Jaques-Laurent Agasse, Josef Albers, Allesandro Algardi, Washington Allston, Jacopo Amigoni, Fra Angelico, Antonello da Messina, Alexander Archipenko, Giuseppe Arcimboldo, Hendrick Avercamp, Francis Bacon, Giacomo Balla, Fra Bartolommeo, Jean-Michel Basquiat, Jacopo Bassano, Pompeo Batoni, Willi Baumeister, Frederic Bazille, Domenico Beccafumi, Max Beckmann, Gentille Bellini, Giovanni Bellini, Hans Bellmer, Gianlorenzo Bernini, Josef Beuys, Albert Bierstadt,…
33 not found: Andrea del Sarto, Sofonisba Anguissola, Jean Arp, John James Audubon, Hans Baldung, Andre Beauneveu, Bernardo Bellotto, George Bellows, …

45 Testing recall
The Science Book: of the 156 (dead) persons, recall: 0.70
109 found: Leon Battista Alberti, Nicolas Copernicus, Andreas Vesalius, Conrad Gesner, Tycho Brahe, William Gilbert, Johannes Kepler, Galileo Galilei, John Napier, William Harvey, Blaise Pascal, Pierre de Fermat, Christiaan Huygens, James Clerk Maxwell, Robert Boyle, Nicolaus Steno, Giovanni Domenico Cassini, Isaac Newton, Edmond Halley, Carolus Linnaeus, Lazzaro Spallanzani, Johan Heinrich Lambert, Joseph Priestley, Antoine Laurent Lavoisier, William Herschel, Henry Cavendish, James Hutton, Edward Jenner, Pierre-Simon Laplace, Georges Cuvier, Thomas Robert Malthus, Alexander von Humboldt, Allesandro Volta, Thomas Young,...
47 not found: Fibonacci, Piero della Francesca, Jeremiah Horrocks, Antoni van Leeuwenhoek, Rudolph Jacob Camerarius, George Hadley, Carl Wilhelm Scheele, James Hall, Joseph von Frauenhofer, William Smith,…

46 Testing precision
Counting false positives (precision per range):
4900 – 4999: 0.90
9900 – 9999: 0.88
14900 – 14999: 0.96
19900 – 19999: 0.97
Povijest Jugoslavije (1918 - 1991), Oeuvre Poetique (1925 - 1965), Alabama Wills (1808 – 1870), Black Tennesseans (1900 - 1930), Nippon Porcelain (1891 - 1921), Personal Favorites (1977 - 1998), Wheeling Glass (1829 - 1939), Political Impact (1770 - 1814), Movie Set (1959 - 1980), Transatlantic Dialogues (1775 - 1815), Sailing Navy (1775 - 1854), Home Children (1869 - 1930), Peace Pilgrim (1908 - 1981), Briton Riviere (1840 - 1920), La Regle (1917 - 1947), Farm Tractors (1890 - 1960), Western Warfare (1775 - 1882), Le Peintre (1877 - 1968), Exakta Cameras (1933 - 1978), Offene Briefe (1945 - 1968), Portraitmatilde Muti (1862 - 1943), Nature Morte (1946 - 1993), Dessins Inconnus (1901 - 1954), Jacques Lacan-Seminaires (1952 - 1980), Legendary Parties (1922 - 1972), Memory Joggers (1940 - 1989), Klondike Ho (1897 - 1997), Events From (1907 - 1977)
Estimated precision for first 5000: 0.90

47 Some observations
- Composers dominate the top for some centuries.
- Recently died persons have a relatively high score.
- Person names consisting of only one word, such as the pseudonyms Voltaire, Caravaggio, and Nadar, are not yet found.
- Likewise, names consisting of four or more words are not yet found, such as Joost van den Vondel.
- Also, persons who died as teenagers are not found, such as Jeanne d’Arc and Anne Frank.
- More advanced approximate pattern matching is required to better cluster the name variations of one person and potential errors in years.

48 Concluding remarks
- Enumeration search offers an interesting approach to find more-of-the-same, since it is generally applicable.
- The famous-persons case study indicates that non-trivial results can already be obtained with simple techniques.
- Further research: extend the case study to also include information on nationality, profession, etc. of persons. Automatically search for biographic data.
- Other intended application domains: music and medical domain.

49 Fun Section
Election of ‘De Grootste Nederlander’: Vincent van Gogh

50 Fun Section
Persons who were born and died in the same years:
- Sir Christopher Wren (1632 – 1723), Anthony van Leeuwenhoek (1632 – 1723)
- Leo Tolstoy (1828 - 1910), Henri Dunant (1828 - 1910)
- Edouard Manet (1832 - 1883), Gustave Dore (1832 - 1883)
- JRR Tolkien (1892 – 1973), Pearl Buck (1892 – 1973)
- Miles Davis (1926 – 1991), Klaus Kinski (1926 – 1991)

