Information Organization: Overview
IO: What
What is Information Organization?
- Systematic arrangement of items
  - group similar items together
  - assign meaning to groups
  - determine relationships between groups
  - assign items to groups
[Figure: the same items arranged under three alternative groupings (Grouping 1, Grouping 2, Grouping 3), each ordering the attributes size (Big/Small), shape (Square/Circle), and color (Blue/Red) differently]
Search Engine
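The idea that one set of items supports several groupings can be sketched in a few lines of Python. This is only an illustration: the item list and the `group_by` helper are hypothetical and not part of the slides.

```python
from collections import defaultdict

# Hypothetical items, each described by size, shape, and color.
items = [
    {"size": "big", "shape": "square", "color": "blue"},
    {"size": "small", "shape": "circle", "color": "red"},
    {"size": "big", "shape": "circle", "color": "blue"},
    {"size": "small", "shape": "square", "color": "red"},
]

def group_by(items, keys):
    """Arrange items into nested groups, one level per attribute in `keys`."""
    if not keys:
        return list(items)
    groups = defaultdict(list)
    for item in items:
        groups[item[keys[0]]].append(item)
    # Recurse so that each group is split again by the remaining attributes.
    return {value: group_by(members, keys[1:]) for value, members in groups.items()}

# Two different organizations of the same four items.
by_size_then_shape = group_by(items, ["size", "shape"])
by_color_then_size = group_by(items, ["color", "size"])
```

Changing the order of `keys` changes which attribute sits at the top of the hierarchy, which is one way to read the difference between the groupings in the figure.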
IO: Why
Why organize information?
Why do we put certain things in certain places?
- Closet
  - Seasonal groups
  - Pants vs. shirts
  - Color groups
  - Favorite vs. non-favorite
  → To find things more easily: Information Retrieval (IR)
- Taxonomy
  - Food
    - Good: sweet taste, smells like milk
    - Bad: too hot, hard to chew
  → To make sense of the world: Knowledge Discovery (KD)
IO: How
How do we organize information?
- General approach
  - anticipate how an item is searched for, e.g. by subject, date, author
  - look for common features among items
  - determine what an item is about
- Classification
  - identification/creation of classes
  - assignment of items into classes
- Clustering
  - group similar items together
What to do when the information to organize is massive?
- 10,000 books
- 100,000 journal papers
- 1,000,000 web pages
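Anticipating how an item is searched for amounts to building one lookup per access point. A minimal sketch, assuming hypothetical catalog records and a hypothetical `build_access_points` helper:

```python
from collections import defaultdict

# Hypothetical catalog records; titles, subjects, and authors are made up.
records = [
    {"title": "Book A", "subject": "cooking", "author": "Lee", "year": 1999},
    {"title": "Book B", "subject": "cooking", "author": "Kim", "year": 2005},
    {"title": "Book C", "subject": "gardening", "author": "Lee", "year": 2005},
]

def build_access_points(records, fields):
    """Build one lookup table per field: field value -> titles with that value."""
    index = {field: defaultdict(list) for field in fields}
    for record in records:
        for field in fields:
            index[field][record[field]].append(record["title"])
    return index

index = build_access_points(records, ["subject", "author", "year"])
print(index["subject"]["cooking"])
print(index["author"]["Lee"])
```

Each lookup is built once and then answers a query with a single dictionary access, which is what keeps this approach workable as the collection grows.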
Machine Learning: Introduction
What is Machine Learning?
- A computer program learns if it improves its performance at some task through experience (T. Mitchell, 1997).
- Learning is any change in a system that allows it to perform better the second time on repetition of the same task or on a task drawn from the same population (H. Simon, 1983).
How can systems improve?
- By acquiring new knowledge
  - acquiring new facts
  - acquiring new skills
- By adapting their behavior
  - solving problems more accurately
  - solving problems more efficiently
Machine Learning: Introduction
Which is different? Which are similar?
How is learning possible?
- Because there are regularities in the world.
ML: Classification vs. Clustering
Classification
- Task is to learn to assign instances to predefined classes
- Supervised learning
  - data has to specify what we are trying to learn (the classes)
  - requires training data: predefined classes and classified items
Clustering
- Task is to learn a classification from the data
  - no predefined classification is required
- Unsupervised learning
  - data doesn't specify what we are trying to learn (the clusters)
- Clustering algorithms divide a data set into natural groups (clusters)
  - items in the same cluster are similar to each other and share certain properties
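The contrast can be made concrete with two small learners over the same kind of data. The sketch below uses a nearest-centroid classifier for the supervised case and a basic k-means loop for the unsupervised case; both are illustrative choices rather than algorithms named in the slides, and the points, labels, and function names are hypothetical.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def nearest(point, centroids):
    """Return the key of the centroid closest to `point` (Euclidean distance)."""
    return min(centroids, key=lambda k: math.dist(point, centroids[k]))

# Supervised: the training data names the classes we want to learn.
def train_classifier(labeled):
    by_class = {}
    for point, label in labeled:
        by_class.setdefault(label, []).append(point)
    return {label: mean(pts) for label, pts in by_class.items()}

# Unsupervised: no labels; groups emerge from the data alone.
def kmeans(points, k, iterations=10):
    centroids = {i: points[i] for i in range(k)}  # naive init: first k points
    for _ in range(iterations):
        clusters = {i: [] for i in range(k)}
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = {i: mean(pts) if pts else centroids[i]
                     for i, pts in clusters.items()}
    return clusters

# Hypothetical 2-D data.
labeled = [((1, 1), "small"), ((1, 2), "small"), ((8, 8), "big"), ((9, 8), "big")]
model = train_classifier(labeled)
predicted = nearest((2, 1), model)  # assign a new instance to a predefined class

clusters = kmeans([(1, 1), (9, 8), (1, 2), (8, 8)], k=2)
```

The classifier returns one of the class names supplied in training, while `kmeans` returns anonymous groups numbered 0 and 1 whose meaning still has to be interpreted.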
IO for IR: Clustering
- Document Clustering
  - Cluster Hypothesis: documents having similar contents tend to be relevant to the same query
  - Rank clusters by query-cluster similarity
  - Cluster documents based on vector similarity
  - Post-retrieval clustering, e.g. Scatter-Gather
- Keyword Clustering
  - Automatic thesaurus construction
  - Query expansion
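Clustering documents by vector similarity and then ranking clusters against a query can be sketched as follows. This assumes raw term-frequency vectors, cosine similarity, and a simple single-pass clustering with a fixed threshold; the slides do not specify any of these choices, and the documents and function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Term-frequency vector of a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def single_pass_cluster(docs, threshold=0.3):
    """Add each document to its most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": summed term vector, "docs": [...]}
    for doc in docs:
        vec = vectorize(doc)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["docs"].append(doc)
            best["centroid"].update(vec)  # Counter.update adds the counts
        else:
            clusters.append({"centroid": Counter(vec), "docs": [doc]})
    return clusters

def rank_clusters(query, clusters):
    """Order clusters by query-cluster similarity, most similar first."""
    q = vectorize(query)
    return sorted(clusters, key=lambda c: cosine(q, c["centroid"]), reverse=True)

# Hypothetical mini-collection.
docs = [
    "jaguar speed car engine",
    "car engine repair",
    "jaguar cat habitat jungle",
    "jungle cat habitat",
]
clusters = single_pass_cluster(docs)
ranked = rank_clusters("cat habitat", clusters)
print(ranked[0]["docs"])
```

Under the cluster hypothesis, returning the whole top-ranked cluster surfaces documents that are close to the query in content even when they share few query terms.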
IO for IR: Classification
- Document Categorization
  - classify documents into manually defined categories
  - supports hierarchical browsing, query expansion via relevance feedback
- Document Indexing
  - assign keywords to documents
  - automatic indexing with controlled vocabulary, metadata generation
- Document Filtering
  - e.g. news delivery, email spam filtering
- Query Classification
  - collection selection
  - algorithm selection
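Document filtering such as spam detection is a classification task over predefined categories. The sketch below uses a multinomial naive Bayes classifier with add-one smoothing, which is one common choice and not something the slides prescribe; the training messages, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the class with the highest log-probability for `text`."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data with two predefined categories.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report due monday", "ham"),
]
model = train(training)
print(classify("claim free money", model))
print(classify("monday meeting report", model))
```

The same structure applies to document categorization with more than two classes, since `classify` scores every label seen in training.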