Text Mining: Fast Phrase-based Text Indexing and Matching Khaled Hammouda, Ph.D. Student PAMI Research Group University of Waterloo Waterloo, Ontario,

Presentation transcript:

Text Mining: Fast Phrase-based Text Indexing and Matching
Khaled Hammouda, Ph.D. Student
PAMI Research Group, University of Waterloo, Waterloo, Ontario, Canada
LORNET Theme 4

The Problem
- Input: text documents from the Web / LOR (web documents, discussion articles, ...)
- Task: automatic clustering/grouping
- Example groups: Programming Languages, Database Systems, Pattern Recognition, Data Mining
- How do we judge similarity?

Clustering Documents
Group similar documents together:
- Maximize intra-cluster similarity
- Minimize inter-cluster similarity
- Need to accurately calculate document similarity
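The two clustering objectives can be checked numerically. The sketch below is only an illustration: the similarity matrix is made-up data, and `avg_intra_inter` is a hypothetical helper name, not something from the presentation. It averages similarity over pairs that share a cluster and over pairs that do not.

```python
from itertools import combinations

def avg_intra_inter(sim, labels):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity).

    sim is a symmetric matrix of pairwise document similarities,
    labels[i] is the cluster assigned to document i.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        # A pair counts as intra-cluster when both documents share a label.
        (intra if labels[i] == labels[j] else inter).append(sim[i][j])

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)

# Made-up similarities for four documents (assumption, for illustration only).
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]

good = avg_intra_inter(sim, [0, 0, 1, 1])  # groups the similar documents
bad = avg_intra_inter(sim, [0, 1, 0, 1])   # splits the similar documents
print("good clustering:", good)
print("bad clustering:", bad)
```

A good grouping shows a high first value and a low second value; the bad grouping shows the reverse.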

Document Similarity
How similar is each document to every other document?
Very time consuming: O(n^2) pairwise comparisons.
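To make the quadratic cost concrete, the short sketch below counts the unordered document pairs that an all-pairs comparison has to visit, n(n-1)/2. The document names and the helper `pair_count` are illustrative assumptions.

```python
from itertools import combinations

def pair_count(n):
    """Number of unordered document pairs among n documents: n(n-1)/2."""
    return n * (n - 1) // 2

docs = ["d1", "d2", "d3", "d4"]        # placeholder document ids
pairs = list(combinations(docs, 2))    # every pair compared exactly once

print(len(pairs), pair_count(len(docs)))
print(pair_count(10_000))              # grows quadratically with n
```

Doubling the collection size roughly quadruples the number of comparisons, which is why the slides look for a way to match documents without visiting every pair from scratch.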

Document Similarity
Information Theoretic Measure (Dekang '98):
- How do we intersect every pair of documents without sacrificing efficiency?
- What features should we intersect?
  - Words
  - Phrases
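The choice between intersecting words and intersecting phrases can be illustrated with a small sketch. It uses a plain set-overlap ratio (Jaccard) as a stand-in, which is an assumption made for illustration and not the information-theoretic measure cited on the slide; word bigrams stand in for phrases, and all function names are hypothetical.

```python
def word_features(text):
    """Set of individual words in the text."""
    return set(text.lower().split())

def phrase_features(text, n=2):
    """Set of contiguous n-word sequences, used here as simple phrases."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b):
    """Shared features divided by all features (Jaccard ratio)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

d1 = "river rafting trips"
d2 = "trips rafting river"   # same words, different order

word_sim = overlap(word_features(d1), word_features(d2))
phrase_sim = overlap(phrase_features(d1), phrase_features(d2))
print(word_sim, phrase_sim)
```

The two texts look identical when only words are intersected, while phrase features separate them because word order (local context) is taken into account.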

Fast Phrase-based Document Indexing and Matching: Document Index Graph

Document Index Graph Structure
- A model based on a digraph representation of the phrases in the document set
- Nodes correspond to unique terms
- Edges maintain phrase representation
- A phrase is a path in the graph
- The model is an inverted list (terms → documents)
- Nodes carry term weight information for each document in which they appear
- Shared phrases can be matched efficiently

Phrase-based Features
- Phrases: more informative feature than individual words → local context matching
- Represent sentences rather than words
- Facilitate phrase-matching between documents
- Achieve accurate document pair-wise similarity
- Avoid high-dimensionality of vector space model
- Allow incremental processing
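The structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the edge bookkeeping as (document, sentence, position) entries, and the example sentences are all hypothetical. Terms act as nodes, consecutive word pairs act as edges, and adding a document reports the maximal multi-word phrases it shares with documents already indexed.

```python
from collections import defaultdict

class DocumentIndexGraph:
    """Toy phrase index: nodes are terms, edges are consecutive word pairs."""

    def __init__(self):
        # term -> {doc_id: frequency}; a simple stand-in for term weights
        self.term_freq = defaultdict(lambda: defaultdict(int))
        # (w1, w2) -> {(doc_id, sentence_no, position_of_w1)}
        self.edges = defaultdict(set)

    def add_document(self, doc_id, sentences):
        """Index a document incrementally.

        Returns {other_doc_id: set of maximal shared phrases (2+ words)}.
        """
        shared = defaultdict(set)
        new_entries = []
        for sent_no, sentence in enumerate(sentences):
            words = sentence.lower().split()
            for w in words:
                self.term_freq[w][doc_id] += 1
            pairs = list(zip(words, words[1:]))
            for i, edge in enumerate(pairs):
                for other, s, p in self.edges.get(edge, ()):
                    # Skip starts that extend to the left: they belong
                    # to a longer match that began at an earlier word.
                    if i > 0 and (other, s, p - 1) in self.edges.get(pairs[i - 1], ()):
                        continue
                    # Follow the path while the other document continues it.
                    k = 1
                    while (i + k < len(pairs)
                           and (other, s, p + k) in self.edges.get(pairs[i + k], ())):
                        k += 1
                    shared[other].add(" ".join(words[i:i + k + 1]))
                new_entries.append((edge, (doc_id, sent_no, i)))
        # Register the new document's edges only after matching,
        # so a document never matches against itself.
        for edge, entry in new_entries:
            self.edges[edge].add(entry)
        return dict(shared)

# Made-up sentences for illustration.
dig = DocumentIndexGraph()
first = dig.add_document("d1", ["river rafting vacation plan"])
second = dig.add_document("d2", ["wild river rafting vacation", "fishing trips"])
third = dig.add_document("d3", ["fishing vacation plan"])
print(first, second, third)
```

Because matching happens while a document is inserted, each new document is compared only against index entries it actually touches, which is the property the slides refer to as incremental processing.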

[Figure: Document Index Graph example; surviving labels: river rafting, river, vacation plan, trips]

Phrase-based Document Indexing
[Figures: Document Index Graph (internal structure); Document Index Graph (size scalability); Document Index Graph (time performance)]

Effect of using phrase-based similarity over individual words
[Figures: Effect of using phrase similarity (F-measure); Effect of using phrase similarity (Entropy)]
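For readers unfamiliar with the two evaluation measures named in the figure titles, the sketch below computes them using definitions that are common in the document clustering literature. That the presentation uses exactly these definitions is an assumption, and the labels are made-up data. A higher F-measure and a lower entropy indicate a better clustering.

```python
import math
from collections import Counter

def f_measure(classes, clusters):
    """Class-size-weighted best F score of each true class over all clusters."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint[(c, k)]
            if n_ck:
                precision, recall = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_c / n * best
    return total

def entropy(classes, clusters):
    """Cluster-size-weighted entropy of the class mix inside each cluster."""
    n = len(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for (c, k), n_ck in joint.items():
        p = n_ck / cluster_sizes[k]
        total -= cluster_sizes[k] / n * p * math.log(p)
    return total

classes = ["a", "a", "b", "b"]   # made-up ground-truth labels
perfect = [0, 0, 1, 1]           # clusters match the classes
mixed = [0, 1, 0, 1]             # every cluster mixes both classes

print(f_measure(classes, perfect), entropy(classes, perfect))
print(f_measure(classes, mixed), entropy(classes, mixed))
```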

Applications
- Grouping search engine results on-the-fly (incremental processing)
- Creating taxonomies of documents (Yahoo! and Open Directory style)
- Implementing “Find Related” or “Find Similar” features of information retrieval systems
- Automatic generation of descriptive phrases about a set of documents (i.e. labeling clusters)
- Detecting plagiarism

Collaboration
- Provide Data Mining services (primarily text mining) for other groups
- Opportunity for collaboration with U of Saskatchewan:
  - I-Help Discussion System
  - Course Delivery Tools
- Others are welcome

Publications

Journal Publications
- K. Hammouda and M. Kamel, “Efficient Phrase-based Document Indexing for Web Document Clustering”, IEEE Transactions on Knowledge and Data Engineering. Accepted, September
- K. Hammouda and M. Kamel, “Document Similarity Using a Phrase Indexing Graph Model”, Knowledge and Information Systems. Springer. Accepted, May

Conference Publications
- K. Hammouda and M. Kamel, “Incremental Document Clustering Using Cluster Similarity Histograms”, The 2003 IEEE/WIC International Conference on Web Intelligence (WI 2003), pp , Halifax, Canada, October 2003
- K. Hammouda and M. Kamel, “Phrase-based Document Similarity Based on an Index Graph Model”, The 2002 IEEE International Conference on Data Mining (ICDM'02), pp , Maebashi, Japan, December

Available at:

Questions
Instant Messaging
- MSN Messenger:

Text Documents: Phrasal Indexing and Cohesive Document Clustering

Phrase-based Text Indexing and Matching
- A model based on a digraph representation of the phrases in the document set
- Nodes correspond to unique words
- Edges maintain phrase representation
- Phrases: more informative feature than individual words → local context matching
- Facilitate phrase-matching between documents
- Achieve accurate document pair-wise similarity
- Avoid high-dimensionality of traditional vector space model
- Allow incremental processing

Document Clustering
- Similarity Histogram-based Clustering (SHC)
- Clusters are represented using a concise statistical representation called similarity histograms
- Maximize cluster coherency by maintaining high similarity distributions in cluster histograms
- Enhance a cluster at any time by re-distributing documents among clusters
- Both original and receiving clusters benefit from tighter similarity distributions

[Figures: Phrase-based Document Index Graph; Document Clustering Using Similarity Histograms]
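The similarity histogram idea can be sketched as follows. The slide does not give the exact rule, so the bin count, the 0.5 similarity threshold, the `max_drop` tolerance, and the acceptance test are assumptions chosen for illustration, and all function names are hypothetical. A cluster is summarized by the histogram of its pairwise document similarities, and a candidate document is accepted only if the share of high similarities does not degrade much.

```python
def similarity_histogram(sims, bins=10):
    """Count pairwise similarities (values in [0, 1]) per equal-width bin."""
    counts = [0] * bins
    for s in sims:
        counts[min(int(s * bins), bins - 1)] += 1
    return counts

def histogram_ratio(sims, threshold=0.5, bins=10):
    """Fraction of pairwise similarities in bins at or above the threshold."""
    if not sims:
        return 1.0  # assumption: an empty cluster counts as fully coherent
    counts = similarity_histogram(sims, bins)
    return sum(counts[int(threshold * bins):]) / len(sims)

def accepts(cluster_sims, new_sims, threshold=0.5, max_drop=0.05):
    """Accept a document if the cluster's ratio drops by at most max_drop.

    cluster_sims: pairwise similarities already inside the cluster
    new_sims: similarities between the candidate and each cluster member
    """
    old = histogram_ratio(cluster_sims, threshold)
    new = histogram_ratio(cluster_sims + new_sims, threshold)
    return new >= old - max_drop

cluster = [0.85, 0.75, 0.65]     # made-up similarities among three documents
close_doc = [0.82, 0.71, 0.66]   # candidate similar to every member
far_doc = [0.15, 0.25, 0.12]     # candidate dissimilar to every member

print(similarity_histogram(cluster))
print(accepts(cluster, close_doc), accepts(cluster, far_doc))
```

Keeping only the histogram per cluster means the coherence check needs the candidate's similarities to the members and nothing else, which fits the incremental processing emphasized in the slides.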