CIKM Recognition and Classification of Noun Phrases in Queries for Effective Retrieval Wei Zhang 1 Shuang Liu 2 Clement Yu 1 Chaojing Sun 3 Fang Liu 4 Weiyi Meng 5 1 Department of Computer Science, University of Illinois at Chicago 2 Ask.com 3 Broadcom Corporation 4 Microsoft 5 Department of Computer Science, Binghamton University
Outline: Motivation; Our definitions of the phrases; Proper noun and dictionary phrase recognition; Simple and complex phrase recognition; Experimental results
Motivation Terms in a query are related semantically E.g. “John Smith” Recognize this relationship Partition the query terms into groups (phrases) Document retrieval using phrases Adding phrases into searching and ranking
Types of Noun Phrases Phrases that have fixed writing formats Names of Locations, people, companies, … Well defined concepts. E.g. “computer science” Freely written phrases Not formally defined but used in the real language
Four Types of Noun Phrases Proper Noun (PN) A noun phrase that names a specific person, place or thing. First letters of the content words are capitalized E.g. “John Smith”, “Atlantic Ocean” Dictionary Phrase (DP) A phrase that has a definition in a dictionary, excluding PN These two types may overlap “Atlantic Ocean” They cannot replace each other E.g. “Lina’s Pizza”, “public transportation”
Four Types of Noun Phrases Simple Noun Phrase (SNP) A grammatically valid noun phrase other than PN and DP 2 words E.g. “white car”, “good hotel” Complex Noun Phrase (CNP) A grammatically valid noun phrase other than PN and DP 3 or more words May contain PN/DP/SNP E.g. “small white car”, “city public transportation”
Noun Phrase Recognition General procedure Recognize PN and dictionary phrases first Then simple and complex noun phrases An n-word query Check the original query Check the 2 (n-1)-term arrays … Check the (n-1) 2-term arrays In total n*(n-1)/2 candidates E.g. “World Trade Organization” “World Trade” and “Trade Organization”
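The candidate enumeration on this slide can be sketched in Python. This is a minimal illustration rather than the authors' implementation; the function name `phrase_candidates` and the whitespace tokenization are assumptions made for the example.

```python
def phrase_candidates(query):
    """Return every contiguous run of two or more query words, longest first."""
    words = query.split()  # assumption: simple whitespace tokenization
    n = len(words)
    candidates = []
    for length in range(n, 1, -1):            # lengths n, n-1, ..., 2
        for start in range(n - length + 1):   # every start position for that length
            candidates.append(" ".join(words[start:start + length]))
    return candidates


candidates = phrase_candidates("World Trade Organization")
print(candidates)
```

For an n-word query the loop produces 1 + 2 + … + (n-1) = n*(n-1)/2 candidates, which matches the count stated on the slide.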
Noun Phrase Recognition Tools for phrase recognition Dictionaries (Wikipedia, WordNet) Large text corpus (Google for experiments) Parsers (Minipar, Collins parser) and POS tagger
PN and DP Recognition Wikipedia For proper nouns and dictionary phrases DP: existence of the entry page PN: content words in the first instance of the phrase in the main text should be capitalized
PN and DP Recognition WordNet For PN and DP recognition DP: defined in a dictionary PN: has a hypernym of city, province, country, organization, geographic area, person, syndrome, region, building, or nation.
PN and DP Recognition Minipar For PN recognition only (1) “PN” label in the parse tree (2) Semantic label of person, country, corpname, location, corpdesig, fname, gname, or date
PN and DP Recognition List of first names, last names and rules First_initial last_name First_initial middle_initial last_name First_name middle_initial last_name First_name last_name
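The four name rules can be checked with a few lines of Python. This is only a sketch: the tiny `FIRST_NAMES` and `LAST_NAMES` sets stand in for the real name lists, and treating any single letter (with an optional period) as an initial is an assumption.

```python
import re

# Hypothetical stand-ins for the first-name and last-name lists.
FIRST_NAMES = {"john", "mary"}
LAST_NAMES = {"smith", "farrell"}


def is_initial(token):
    """A single letter, optionally followed by a period (assumed format)."""
    return re.fullmatch(r"[A-Za-z]\.?", token) is not None


def is_first_name(token):
    return token.lower() in FIRST_NAMES


def is_last_name(token):
    return token.lower() in LAST_NAMES


def matches_name_rule(phrase):
    """True if the phrase fits one of the four rules listed on the slide."""
    tokens = phrase.split()
    if len(tokens) == 2:
        # First_initial last_name, or First_name last_name
        first, last = tokens
        return (is_initial(first) or is_first_name(first)) and is_last_name(last)
    if len(tokens) == 3:
        # First_initial middle_initial last_name, or First_name middle_initial last_name
        first, middle, last = tokens
        return ((is_initial(first) or is_first_name(first))
                and is_initial(middle) and is_last_name(last))
    return False
```

A call such as `matches_name_rule("J. R. Smith")` succeeds through the second rule, while `"Smith John"` fails because its last token is not in the last-name list.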
PN and DP Recognition Text corpus For less well-known PNs Three instances, first letters of the content words capitalized Not a sub-phrase of a longer PN “if you choose windows by Vista Window Company, …” “if you choose windows by Super Vista Window Company, …”
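A rough Python sketch of the corpus test follows. It assumes the corpus has already been reduced to a list of text snippets, uses a small hypothetical stop-word list to decide which words are content words, and approximates "sub-phrase of a longer PN" by checking whether the neighbouring word is also capitalized. That last heuristic is a simplification introduced here, not something the slide specifies.

```python
import re

STOPWORDS = {"of", "the", "and", "for", "in"}  # hypothetical non-content words


def count_pn_instances(phrase, texts):
    """Count capitalized occurrences that are not embedded in a longer capitalized run."""
    words = phrase.lower().split()
    k = len(words)
    count = 0
    for text in texts:
        tokens = re.findall(r"[A-Za-z']+", text)
        for i in range(len(tokens) - k + 1):
            window = tokens[i:i + k]
            if [t.lower() for t in window] != words:
                continue
            # Every content word must start with an upper-case letter.
            capitalized = all(t[0].isupper() for t in window
                              if t.lower() not in STOPWORDS)
            # Crude longer-PN test: a capitalized neighbour on either side.
            prev_cap = i > 0 and tokens[i - 1][0].isupper()
            next_cap = i + k < len(tokens) and tokens[i + k][0].isupper()
            if capitalized and not prev_cap and not next_cap:
                count += 1
    return count


def is_corpus_pn(phrase, texts, min_instances=3):
    """Apply the three-instance threshold from the slide."""
    return count_pn_instances(phrase, texts) >= min_instances
```

With three snippets of the form “if you choose windows by Vista Window Company, …”, the phrase “Vista Window Company” passes; in “… by Super Vista Window Company, …” it is rejected because “Super” precedes it.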
PN and DP Recognition Overlapped phrases Search all words together Count the instances of each phrase in the returned documents e.g. “Native American Casino” “Native American” and “American Casino” Compare ( Count(“Native American”), Count(“American Casino”) )
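The comparison of instance counts can be illustrated as below. The sketch assumes the documents returned for the full query are available as plain strings, that the phrase with more instances is the one kept, and that ties go to the first phrase; the slide only says the counts are compared, so the selection and tie-breaking rules are assumptions. The sample documents are invented for the example.

```python
def count_instances(phrase, documents):
    """Count case-insensitive occurrences of the phrase across the documents."""
    needle = phrase.lower()
    return sum(doc.lower().count(needle) for doc in documents)


def resolve_overlap(phrase_a, phrase_b, documents):
    """Keep the overlapping phrase with more instances (ties go to phrase_a)."""
    count_a = count_instances(phrase_a, documents)
    count_b = count_instances(phrase_b, documents)
    return phrase_a if count_a >= count_b else phrase_b


# Invented stand-ins for the documents returned by searching all query words.
returned_docs = [
    "Native American casino revenue grew last year",
    "a Native American tribe opened a new casino",
    "Native American history and culture",
]
winner = resolve_overlap("Native American", "American Casino", returned_docs)
```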
SNP and CNP Recognition Only check the phrase candidates that are not sub-phrases of a recognized PN/DP and do not overlap with a recognized PN/DP
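One way to read this filter is in terms of word positions in the query. The sketch below represents each candidate and each recognized PN/DP as a half-open `(start, end)` span and drops a candidate when it shares words with a recognized phrase without fully enclosing it. Keeping candidates that enclose a whole PN/DP is an interpretation based on the earlier statement that a CNP may contain PN/DP/SNP; the function names are hypothetical.

```python
def candidate_spans(n):
    """All (start, end) spans covering two or more of n words, longest first."""
    return [(start, start + length)
            for length in range(n, 1, -1)
            for start in range(n - length + 1)]


def keep_candidate(span, recognized_spans):
    """Drop spans inside, equal to, or partially overlapping a recognized PN/DP."""
    start, end = span
    for r_start, r_end in recognized_spans:
        shares_words = start < r_end and r_start < end
        strictly_contains = (start <= r_start and r_end <= end
                             and (start, end) != (r_start, r_end))
        if shares_words and not strictly_contains:
            return False
    return True


words = "city public transportation schedule".split()
recognized = [(1, 3)]  # "public transportation", assumed already recognized as a DP
kept = [" ".join(words[s:e])
        for s, e in candidate_spans(len(words))
        if keep_candidate((s, e), recognized)]
```

Here “city public” and “transportation schedule” are discarded because they cut across the dictionary phrase, while “city public transportation” remains a CNP candidate.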
SNP and CNP Recognition Implicit phrases “and” / “or” “main and contributing factor” “main factor” “contributing factor”
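The expansion of an implicit phrase can be sketched for the pattern shown on the slide, "modifier and/or modifier head". The code assumes the shared head is the last word and handles only the first conjunction; both are simplifications chosen for the illustration.

```python
def expand_implicit(phrase):
    """Expand 'A and/or B head' into ['A head', 'B head']; return [] otherwise."""
    tokens = phrase.split()
    for i, token in enumerate(tokens):
        if token.lower() in ("and", "or"):
            left, right = tokens[:i], tokens[i + 1:]
            # Need a modifier before the conjunction and a modifier plus head after it.
            if not left or len(right) < 2:
                return []
            head = right[-1]  # assumption: the last word is the shared head
            return [" ".join(left + [head]), " ".join(right)]
    return []


expanded = expand_implicit("main and contributing factor")
```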
SNP and CNP Recognition Head word replacement Replace the whole phrase by its head word Collins parser Label the noun phrases E.g. NP/sedan(head word): Compact/JJ Best/JJS Sedan/NN
SNP and CNP Recognition Phrase verification To verify that a phrase is used in the real world For CNP: it also means to find all the words in a text window “Colin Farrell wallpaper” and “wallpaper of Colin Farrell”
SNP and CNP Recognition Overlapped phrases Two potential SNP/CNP: Search all words, compare the numbers of the instances. “sony dvd handycam” “sony dvd” and “dvd handycam”
Document Retrieval Using Phrases Search a phrase in a document Exact match: PN/DP Search all words in a text window: SNP/CNP
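The two matching modes can be contrasted in a short Python sketch. The window size of 10 tokens and the `\w+` tokenization are assumptions; the slides do not state the actual window length.

```python
import re


def tokenize(text):
    """Lower-case word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def exact_match(phrase, document):
    """PN/DP: the phrase words must appear consecutively and in order."""
    words, tokens = tokenize(phrase), tokenize(document)
    k = len(words)
    return any(tokens[i:i + k] == words for i in range(len(tokens) - k + 1))


def in_text_window(phrase, document, window=10):
    """SNP/CNP: all phrase words must fall inside some window of tokens, in any order."""
    needed = set(tokenize(phrase))
    tokens = tokenize(document)
    return any(needed <= set(tokens[start:start + window])
               for start in range(len(tokens)))


doc = "Download a free wallpaper of Colin Farrell here"
window_hit = in_text_window("Colin Farrell wallpaper", doc)
exact_hit = exact_match("Colin Farrell wallpaper", doc)
```

The document matches “Colin Farrell wallpaper” under the window test but not under exact matching, which is the behaviour the “wallpaper of Colin Farrell” example calls for.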
Document Retrieval Using Phrases Sim(Query, Doc) = (Sim_P, Sim_T) Phrase similarity Sim_P: Sim_P(P_i) = idf(P_i), Sim_P = sum over the query phrases P_i found in the document of Sim_P(P_i) Term similarity Sim_T: Okapi/BM-25 similarity Document ranking D1 is ranked higher than D2 if (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
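The ranking rule is a lexicographic comparison: phrase similarity decides first, and term similarity only breaks ties. Python tuples compare this way, so a sketch is short. The idf values, the precomputed term scores (standing in for Okapi/BM-25) and the document records below are invented for illustration.

```python
def phrase_similarity(matched_phrases, idf):
    """Phrase similarity: sum of idf over the query phrases found in the document."""
    return sum(idf[p] for p in matched_phrases)


def rank_documents(docs, idf):
    """Sort by (phrase similarity, term similarity), both descending."""
    return sorted(
        docs,
        key=lambda d: (phrase_similarity(d["phrases"], idf), d["term_sim"]),
        reverse=True,
    )


# Invented values; term_sim stands in for a precomputed Okapi/BM-25 score.
idf = {"white car": 3.0, "good hotel": 1.5}
docs = [
    {"id": "D1", "phrases": ["white car"], "term_sim": 4.0},
    {"id": "D2", "phrases": ["white car"], "term_sim": 5.0},
    {"id": "D3", "phrases": ["good hotel"], "term_sim": 9.0},
    {"id": "D4", "phrases": ["white car", "good hotel"], "term_sim": 0.1},
]
ranking = [d["id"] for d in rank_documents(docs, idf)]
```

D3 has the highest term score but ranks last because its phrase similarity is the lowest, and D2 beats D1 only through the term-similarity tie-break.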
Experimental Results Phrase recognition experiments Tuned by using TREC queries
Experimental Results Phrase recognition experiments Tested by using Web queries
Experimental Results Performance of individual tools Wikipedia is better than WordNet and Minipar Need for a complete dictionary Collins parser alone is not enough for SNP/CNP recognition Lack of real world usage information
Experimental Results Document retrieval experiments Ad-hoc TREC 6, 7 and 8, robust TREC 12, 13 and 14 1. Retrieval without using phrases 2. Using Wikipedia for PN/DP and just Collins parser for SNP/CNP 3. Using phrases from the full recognition algorithm 33% MAP increase and 44.27% GMAP increase from 1 to 2 5.8% MAP increase and 12.58% GMAP increase from 2 to 3
Conclusions Our algorithm can effectively recognize the four types of phrases in short Web queries The recognized phrases help improve retrieval effectiveness
Questions?