Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge
Xia Hu, Nan Sun, Chao Zhang, Tat-Seng Chua
Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009)
Speaker: Chien-Liang Wu

Outline
- Motivation
- Framework of the feature constructor
- Hierarchical Resolution
- Feature Generation
- Feature Selection
- Evaluation

Motivation
- Aggregated search gathers the search results from various resources and presents them in a succinct format.
- A key issue in aggregated search technology: "How should the information be presented to the user?"
- The traditional way of browsing search results, a ranked list, is inconvenient for users trying to locate what interests them.

Motivation (contd.)
- Some commercial aggregated search systems, such as DIGEST and Clusty, provide clustering of relevant search results, making the information more systematic and manageable.
- Short texts (snippets, product descriptions, QA passages and image captions) play important roles in Web and IR applications, so successfully processing short texts is essential for aggregated search systems.
- Short texts consist of a few phrases or 2–3 sentences, which presents great challenges for clustering.

Framework
[Framework diagram with three stages: (1) hierarchical resolution, (2) feature generation, (3) feature selection]
Example query: "The Dark Knight"
Google snippet: "Jul 18, It is the best American film of the year so far and likely to remain that way. Christopher Nolan's The Dark Knight is revelatory, visceral..."

Hierarchical Resolution
- Exploits the internal semantics of short texts: preserves contextual information while avoiding data sparsity.
- Uses NLP tools to construct a syntax tree, then decomposes the original text into a three-level, top-down hierarchical structure: segment level, phrase level and word level.

Three Levels
- Segment level
  - The text is split into segments with the help of punctuation.
  - Segments are generally ambiguous and often fail to convey the exact information needed to represent the short text.
- Phrase level
  - Shallow parsing divides each segment into a series of chunks; stemming and stop-word removal are applied to the phrases.
  - The NP and VP chunks are employed as phrase-level features.

Three Levels (contd.)
- Word level
  - Word-level features are built by decomposing the phrase-level features directly: the non-stop words contained in the NP and VP chunks are chosen, and the remaining meaningless words in the short texts are removed.
- Original feature set
  - Features are selected at the phrase level and the word level.
  - Phrase level: retains the original contextual information. Word level: avoids the problem of data sparseness.
A minimal sketch of the three-level decomposition follows.
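The slides do not name the parser used, so the sketch below is only illustrative: it assumes NLTK's tokenizer and POS tagger, plus a hypothetical hand-written chunk grammar standing in for the authors' shallow parser.

```python
import re
from nltk import pos_tag, word_tokenize, RegexpParser
from nltk.corpus import stopwords

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("stopwords")

# Toy chunk grammar (an assumption, not the paper's parser configuration).
GRAMMAR = r"""
  NP: {<DT|PRP\$>?<JJ.*>*<NN.*>+}   # simple noun-phrase chunks
  VP: {<VB.*><RB.*>?<JJ.*>*}        # simple verb-phrase chunks
"""
CHUNKER = RegexpParser(GRAMMAR)
STOP = set(stopwords.words("english"))

def hierarchical_resolution(text):
    # Segment level: split on punctuation.
    segments = [s.strip() for s in re.split(r"[.,;:!?]+", text) if s.strip()]
    phrases, words = [], []
    for segment in segments:
        tree = CHUNKER.parse(pos_tag(word_tokenize(segment)))
        for chunk in tree.subtrees(lambda t: t.label() in ("NP", "VP")):
            chunk_words = [w for w, _tag in chunk.leaves()]
            phrases.append(" ".join(chunk_words))        # phrase level
            words += [w for w in chunk_words
                      if w.lower() not in STOP]          # word level
    return segments, phrases, words

print(hierarchical_resolution(
    "Christopher Nolan's The Dark Knight is revelatory, visceral"))
```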

Feature Generation
- Goal: build semantic relationships with other relevant concepts, e.g. relating "The Dark Knight" to "Batman".
- Two steps:
  1. Select seed phrases from the internal semantics.
  2. Generate external features from the seed phrases.

Seed Phrase Selection
- The features at the segment level and phrase level are used to construct the seed phrases.
- Redundancy problem, e.g.:
  - Segment-level feature: "Christopher Nolan's The Dark Knight is revelatory visceral"
  - Its three phrase-level features: [NP Christopher Nolan's], [NP The Dark Knight] and [NP revelatory visceral]
- To eliminate this information redundancy, the semantic similarity between each phrase-level feature and its parent segment-level feature is measured.

Semantic Similarity Measure
- A phrase–phrase semantic similarity algorithm that uses a co-occurrence double check in Wikipedia to reduce semantic duplicates:
  - Download the XML corpus of Wikipedia.
  - Build a Solr index over all XML articles.
- Let P = {p_1, p_2, ..., p_n}, where P is a segment-level feature and each p_i is a phrase-level feature.
- InfoScore(p_i): the semantic similarity between p_i and {p_1, p_2, ..., p_n}.

Semantic Similarity Measure (contd.)
- Given two phrases p_i and p_j, use p_i and p_j separately as queries to retrieve the top C Wikipedia pages, and count:
  - f(p_i): total occurrences of p_i in the top C Wikipedia pages retrieved by query p_i.
  - f(p_i|p_j): total occurrences of p_i in the top C Wikipedia pages retrieved by query p_j.
- Variants of three popular co-occurrence measures are computed from these counts (a hedged reconstruction follows).
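The formulas themselves did not survive in the transcript. Assuming the "three popular co-occurrence measures" are the usual Jaccard, Overlap and Dice coefficients, with the double-check counts standing in for set sizes, a plausible reconstruction is:

```latex
% Hedged reconstruction: |A| \approx f(p_i), |B| \approx f(p_j), and the
% intersection |A \cap B| \approx f(p_i \mid p_j) + f(p_j \mid p_i).
\mathrm{Jaccard}(p_i, p_j) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
\qquad
\mathrm{Overlap}(p_i, p_j) = \frac{|A \cap B|}{\min(|A|, |B|)}
\qquad
\mathrm{Dice}(p_i, p_j) = \frac{2\,|A \cap B|}{|A| + |B|}
```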

Semantic Similarity Measure (contd.)
- The three similarity scores are normalized into values in the [0, 1] range.
- A linear combination of the three normalized measures gives the pairwise similarity, presumably Sim(p_i, p_j) = α·Jaccard + β·Overlap + γ·Dice with α + β + γ = 1 (the formula itself is missing from the transcript; this form is consistent with the setting α = β = 1/3 used in the evaluation).

Semantic Similarity Measure (contd.)
- For each segment-level feature P, the information scores of its child phrase-level features are ranked, and the phrase-level feature p* that duplicates the most information of P is removed.
A sketch of this scoring-and-pruning step is shown below.
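A minimal sketch of the scoring-and-pruning step, assuming the counts f(·) and f(·|·) have already been collected from the Wikipedia index; the dictionary layout, the pairwise aggregation and the uniform 1/3 weights are illustrative assumptions.

```python
def _jaccard(fa, fb, fab):
    denom = fa + fb - fab
    return fab / denom if denom > 0 else 0.0

def _overlap(fa, fb, fab):
    denom = min(fa, fb)
    return fab / denom if denom > 0 else 0.0

def _dice(fa, fb, fab):
    denom = fa + fb
    return 2 * fab / denom if denom > 0 else 0.0

def info_score(counts, p_i, phrases, alpha=1/3, beta=1/3, gamma=1/3):
    """Similarity between p_i and its sibling phrase-level features.

    counts[(a, a)] holds f(a); counts[(a, b)] holds f(a|b), the occurrences
    of phrase a in the top-C pages retrieved by query b.
    """
    score = 0.0
    for p_j in phrases:
        if p_j == p_i:
            continue
        fa, fb = counts[(p_i, p_i)], counts[(p_j, p_j)]
        fab = counts[(p_i, p_j)] + counts[(p_j, p_i)]   # double check
        score += (alpha * _jaccard(fa, fb, fab)
                  + beta * _overlap(fa, fb, fab)
                  + gamma * _dice(fa, fb, fab))
    return score

def prune_most_redundant(counts, phrases):
    # Drop the phrase-level feature p* whose information is most
    # duplicated by its siblings under the same segment-level feature.
    scores = {p: info_score(counts, p, phrases) for p in phrases}
    p_star = max(scores, key=scores.get)
    return [p for p in phrases if p != p_star]
```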

Feature Generator
- For each seed phrase, retrieve the top w Wikipedia articles and collect as external features:
  1. The titles and bold terms (links) in the retrieved Wikipedia pages.
  2. The key phrases extracted from the Wikipedia pages by the Lingo algorithm.
- Example: for "in his car", WordNet synsets supply "auto", "automobile", "autocar".
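For the WordNet side of the example, a minimal sketch using NLTK; the naive choice of the last token as the head word is an assumption, and NLTK's WordNet may return a slightly different synonym set than the slide's example.

```python
from nltk.corpus import wordnet as wn

# Requires: nltk.download("wordnet")

def wordnet_expansions(phrase):
    # Expand the head noun of a seed phrase with its WordNet synonyms.
    head = phrase.lower().split()[-1]      # naive head-word choice
    lemmas = set()
    for synset in wn.synsets(head, pos=wn.NOUN):
        for lemma in synset.lemma_names():
            if lemma.lower() != head:
                lemmas.add(lemma.replace("_", " "))
    return lemmas

print(wordnet_expansions("in his car"))
# e.g. {'auto', 'automobile', 'machine', 'motorcar'}
```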

Feature Selection
- Overzealous external features can have an adverse impact on effectiveness: they dilute the influence of the valuable original information.
- Empirical rules refine the unstructured features obtained from Wikipedia pages (see the sketch below):
  - Remove seed phrases that are too general (more than 10,000 occurrences).
  - Transform features used for Wikipedia management or administration, e.g. "List of hotels" → "hotels", "List of twins" → "twins".
  - Apply phrase-sense stemming with the Porter stemmer.
  - Remove features related to chronology, e.g. "year", "decade" and "centuries".
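A sketch of these empirical rules; the exact regular expression and the chronology word list are assumptions.

```python
import re
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
CHRONOLOGY = {"year", "years", "decade", "decades", "century", "centuries"}
GENERALITY_CAP = 10_000          # occurrence threshold from the slide

def refine_feature(feature, occurrence_count):
    # Rule 1: drop features that are too general.
    if occurrence_count > GENERALITY_CAP:
        return None
    # Rule 2: strip Wikipedia management prefixes ("List of hotels" -> "hotels").
    feature = re.sub(r"^list of\s+", "", feature, flags=re.IGNORECASE)
    # Rule 4: drop chronology-related features.
    if feature.lower() in CHRONOLOGY:
        return None
    # Rule 3: phrase-sense stemming with the Porter stemmer.
    return " ".join(STEMMER.stem(token) for token in feature.split())
```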

External Feature Collection
- Construct an (n_1 + n_2)-dimensional feature space of n_1 original features and n_2 external features, where θ controls the balance (presumably θ = n_2 / (n_1 + n_2), since θ = 0 means no external features and θ = 1 means no original features).
- One seed phrase s_i (0 < i ≤ m) may generate k external features {f_i1, f_i2, ..., f_ik}; one feature f_i* is selected for each seed phrase.
- The top n_2 − m features are then extracted from the remaining external features based on their frequency.
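A sketch of how the external part of the feature space could be assembled under these constraints; the per-candidate scores and the θ-to-n_2 conversion follow the hedged reading above.

```python
from collections import Counter

def collect_external_features(candidates_per_seed, total_dim, theta):
    """candidates_per_seed: {seed_phrase: [(external_feature, score), ...]}
    total_dim: n1 + n2, the full feature-space dimension
    theta: proportion of external features (0 -> none, 1 -> all external)
    """
    n2 = round(theta * total_dim)        # assumes theta = n2 / (n1 + n2)

    # One best external feature f_i* per seed phrase s_i.
    selected = []
    for _seed, candidates in candidates_per_seed.items():
        if candidates:
            best_feature, _score = max(candidates, key=lambda c: c[1])
            if best_feature not in selected:
                selected.append(best_feature)

    # Fill the remaining n2 - m slots with the most frequent leftovers.
    leftovers = Counter(
        feature
        for candidates in candidates_per_seed.values()
        for feature, _score in candidates
        if feature not in selected)
    selected += [f for f, _ in leftovers.most_common(max(n2 - len(selected), 0))]
    return selected[:n2]
```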

Evaluation Datasets
- Reuters
  - Texts containing more than 50 words are removed, and clusters with fewer than 5 texts or more than 500 texts are filtered out.
  - This leaves 19 clusters comprising 879 texts.
- Web Dataset
  - Ten hot queries are taken from Google Trends and the top 100 snippets for each query are retrieved.
  - This yields a 10-category Web Dataset with 1000 texts.

Clustering Methods
- Two clustering algorithms: K-means and EM.
- Six text representation methods (a sketch of the baseline pipeline follows):
  - BOW (baseline 1): the traditional bag-of-words model with the tf-idf weighting scheme.
  - BOW+WN (baseline 2): BOW integrated with additional features from WordNet.
  - BOW+Wiki (baseline 3): BOW integrated with additional features from Wikipedia.
  - BOW+Know (baseline 4): BOW integrated with additional features from both Wikipedia and WordNet.
  - BOF: the bag of original features extracted with the hierarchical view.
  - SemKnow: the proposed framework.
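A minimal sketch of the BOW baseline with scikit-learn; the paper's EM variant is not specified, so a Gaussian mixture is used here as a stand-in, and in practice the tf-idf vectors would likely need dimensionality reduction before the mixture step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_bow_baseline(texts, n_clusters, method="kmeans", seed=0):
    # BOW baseline: tf-idf weighted bag of words.
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    if method == "kmeans":
        return KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    # Stand-in for EM clustering: a Gaussian mixture on dense vectors.
    return GaussianMixture(n_components=n_clusters,
                           random_state=seed).fit_predict(X.toarray())
```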

Evaluation Criteria
- F1 measure: F1 = 2 · (precision · recall) / (precision + recall)
- Average accuracy
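The slide leaves average accuracy undefined. A common definition for clustering evaluation, assumed here, maps each cluster to a gold label and counts the correctly assigned texts:

```latex
% delta is the Kronecker delta, c_i the cluster assigned to text i,
% \ell_i its gold label, and map(.) a cluster-to-label mapping
% (e.g. majority vote); this definition is an assumption.
\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n}\delta\big(\mathrm{map}(c_i),\,\ell_i\big)
```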

Performance Evaluation
- Parameter settings: C = 100, w = 20, α = β = 1/3, θ = 0.5.
- [Table: results of the six representation methods using the K-means algorithm]
- [Table: results of the six representation methods using the EM algorithm]

Effect of External Features
- With a small external feature set size (θ = 0.2 or θ = 0.3), SemKnow achieves its best performance.
- [Figures: Reuters and Web Dataset results using the K-means algorithm]

Optimal Results
- [Table: optimal results using the two clustering algorithms]

Detailed Analysis
- [Figure: the feature space for the example snippet]