
Apollo – Automated Content Management System
Srikanth Kallurkar, Quantum Leap Innovations
Work performed under AFRL contract FA C-0052
©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved.

Capabilities
- Automated domain-relevant information gathering – gathers documents relevant to domains of interest from the web or proprietary databases.
- Automated content organization – organizes documents by topics, keywords, sources, time references, and features of interest.
- Automated information discovery – assists users with automated recommendations on related documents, topics, keywords, sources, ...

Comparison to the existing manual information gathering method (what most users do currently)
The user performs a "keyword search" against a generalized search index; the goal is to maximize the results for a user keyword query.
1. Develop information need (user)
2. Form keywords (user)
3. Search (query sent through the search engine interface)
4. Results (data returned from the generalized search index)
5. Examine results (user)
6. Satisfied? Yes: take a break. No: continue.
7. Refine query (conjure up new keywords) and search again
7a. Give up
Diagram legend: user task / search engine task / data.

Apollo Information Gathering method (what users do with Apollo)
The focus is on informative results seeded by a user-selected combination of features; the user explores, filters, and discovers documents assisted by Apollo features (vocabulary, location, time, sources, ...).
1. Develop information need (user)
2. Explore features (user)
3. Filter (through the Apollo interface)
4. Results (data returned from the specialized domain model)
5. Examine results (user)
6. Satisfied? Yes: take a break. No: continue.
7. Discover new/related information via Apollo features
Diagram legend: user task / Apollo task / data.

Apollo Architecture

Apollo Domain Modeling (behind the scenes)
1. Bootstrap the domain.
2. Define the domain, topics, and subtopics.
3. Get training documents (Option A, B, or A+B): build representative keywords, query search engine(s), and curate (optional). A: from the web. B: from a specialized domain repository (select a small sample).
4. Build the domain signature: identify salient terms per domain, topic, and subtopic, and compute a classification threshold (a sketch of this step follows the list).
5. Organize documents (Option A, B, or A+B): filter documents, classify them into the defined topics/subtopics, and extract features (vocabulary, location, time, ...). A: from the web. B: from a specialized domain repository.
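The slides do not spell out how the signature is weighted or how the threshold is set; the sketch below is a minimal illustration of step 4, assuming simple term-frequency weights and a threshold derived from the training documents' own scores.

```python
# Minimal sketch of step 4 (assumed weighting): build a domain signature of
# salient terms from training documents and derive a classification threshold.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_signature(training_docs, top_k=100):
    """Salient terms = most frequent terms in the training sample (assumption)."""
    counts = Counter()
    for doc in training_docs:
        counts.update(tokenize(doc))
    total = sum(counts.values()) or 1
    return {term: n / total for term, n in counts.most_common(top_k)}

def signature_score(doc, signature):
    """Overlap score between a document and the domain signature."""
    return sum(signature.get(t, 0.0) for t in tokenize(doc))

def classification_threshold(training_docs, signature, factor=0.5):
    """Threshold as a fraction of the mean training score (assumed rule)."""
    scores = [signature_score(d, signature) for d in training_docs]
    return factor * (sum(scores) / len(scores))

# Usage: decide whether a newly crawled document belongs to the domain.
training = ["global warming raises sea levels", "greenhouse gas emissions keep growing"]
signature = build_signature(training)
threshold = classification_threshold(training, signature)
candidate = "new report on greenhouse gas emissions and warming trends"
print(signature_score(candidate, signature) >= threshold)
```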

Apollo Data Organization
Snapshot of the Apollo process to collect a domain-relevant document: a document arrives from a data source (e.g., a web site or a proprietary database) and may be a published article, news report, journal paper, etc. Apollo checks whether the document is in the domain; if not, it is discarded. If it is, Apollo classifies it into the defined domain topics/subtopics, extracts features (domain-relevant vocabulary, locations, time references, sources, ...), and stores the document. A sketch of this decision follows below.
Snapshot of the Apollo process to evolve domain-relevant libraries: many data sources feed the collection/organization process, which maintains an Apollo library of domain-relevant documents per domain (Domain A, Domain B, Domain C, ...); within each library, documents are organized by features (e.g., Feature A: Doc 1, Doc 2, ..., Doc N).
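A minimal sketch of that per-document decision, assuming the same kind of signature/threshold scoring as in the previous sketch; the extract_features function here is a hypothetical stand-in, since the slides only name the feature types.

```python
# Minimal sketch (assumptions noted above): collection decision for one
# document, feature extraction, and storage in a feature-indexed library.
def in_domain(doc_text, signature, threshold):
    # Overlap score against the domain signature, as in the sketch above.
    return sum(signature.get(t, 0.0) for t in doc_text.lower().split()) >= threshold

def extract_features(doc_text):
    # Hypothetical stand-in: Apollo extracts domain vocabulary, locations,
    # time references, sources, and so on.
    return set(doc_text.lower().split())

library = {}        # doc_id -> document text (the domain library)
feature_index = {}  # feature -> set of doc_ids that contain it

def collect(doc_id, doc_text, signature, threshold):
    """Store the document only if it falls inside the domain; otherwise discard it."""
    if not in_domain(doc_text, signature, threshold):
        return False
    library[doc_id] = doc_text
    for feature in extract_features(doc_text):
        feature_index.setdefault(feature, set()).add(doc_id)
    return True
```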

Apollo Information Discovery
1. The user selects a feature via the Apollo interface (e.g., the phrase "global warming" in the domain "climate change").
2. Apollo builds the set of documents from the library that contain the feature (a set of n documents containing the phrase "global warming").
3. Apollo collates all other features from that set and ranks them by domain relevance.
4. The user is presented with co-occurring features (e.g., the phrases "greenhouse gas emissions" and "ice core" co-occur with "global warming") and explores the documents containing them.
The user can use the discovered features to expand or restrict the focus of the search based on their driving interests; a sketch of the ranking step follows this list.
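A minimal sketch of steps 2–3, assuming a feature index (feature to document ids) like the one in the previous sketch and using a plain co-occurrence count as the relevance ranking; the slides do not state Apollo's actual ranking function.

```python
# Minimal sketch (assumed ranking): find features co-occurring with a selected
# feature and rank them by how many shared documents contain them.
from collections import Counter

def discover_related(selected_feature, feature_index, doc_features):
    """feature_index: feature -> set of doc ids; doc_features: doc id -> set of features."""
    co_counts = Counter()
    for doc_id in feature_index.get(selected_feature, set()):
        for feature in doc_features[doc_id]:
            if feature != selected_feature:
                co_counts[feature] += 1
    return co_counts.most_common()  # ranked (feature, count) pairs

# Usage with a toy "climate change" library
doc_features = {
    1: {"global warming", "greenhouse gas emissions"},
    2: {"global warming", "ice core"},
    3: {"sea level rise"},
}
feature_index = {}
for doc_id, feats in doc_features.items():
    for f in feats:
        feature_index.setdefault(f, set()).add(doc_id)

print(discover_related("global warming", feature_index, doc_features))
# [('greenhouse gas emissions', 1), ('ice core', 1)]
```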

Illustration: Apollo Web Content Management Application for the domain "Climate Change"

"Climate Change" Domain Model
- Vocabulary (phrases, keywords, idioms) is identified for the domain from training documents collected from the web; it forms the building blocks of the domain model.
- Modeling error arises from noise in the training data and can be reduced by input from human experts.

Apollo Prototype (screenshot callouts)
- Keyword filter
- Extracted "keywords" or phrases across the collection of documents
- Extracted "locations" across the collection of documents
- Document results of filtering
- Domain
- Automated document summary

Inline Document View (screenshot callouts)
- Filter interface
- Additional features
- Features extracted only for this document

Expanded Document View (screenshot callouts)
- Features extracted for this document
- Cached text of the document

Automatically Generated Domain Vocabulary
- Vocabulary is collated across the domain library.
- Font size and thickness show domain importance (a small illustration follows this list).
- Importance changes as the library changes.
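As a small illustration of the font-size idea (the actual mapping Apollo uses is not stated on the slide), a term's display size can be scaled linearly from its collated importance across the library:

```python
# Minimal sketch (assumed mapping): scale a term's font size linearly with its
# collated importance across the domain library.
def font_size(importance, lowest, highest, min_px=10, max_px=36):
    if highest == lowest:
        return (min_px + max_px) // 2
    fraction = (importance - lowest) / (highest - lowest)
    return round(min_px + fraction * (max_px - min_px))

weights = {"global warming": 0.9, "ice core": 0.3, "emissions": 0.6}
lo, hi = min(weights.values()), max(weights.values())
print({term: font_size(w, lo, hi) for term, w in weights.items()})
# {'global warming': 36, 'ice core': 10, 'emissions': 23}
```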

Apollo Performance

Experiment Setup
The experiment used the Text REtrieval Conference (TREC) document collection from the 2002 filtering track [1]. Collection statistics:
– The collection contained documents from Reuters Corpus Volume 1.
– There were 83,650 training documents and 723,141 testing documents.
– There were 50 assessor topics and 50 intersection topics. The assessor topics had relevance judgments from human assessors, whereas the intersection topics were constructed artificially from intersections of pairs of Reuters categories; a document is relevant to an intersection topic if both category labels were assigned to it.
– The main metrics were T11F, an FBeta measure with a coefficient of 0.5, and T11SU, a normalized linear utility (a hedged example of both follows below).
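To make the two metrics concrete, the sketch below computes FBeta with beta = 0.5 from relevance counts, together with a scaled linear utility in the spirit of T11SU (credit of +2 for each relevant document retrieved, debit of -1 for each non-relevant one, floored and rescaled). The T11SU constants follow the commonly cited TREC 2002 filtering track definition and are not taken from the slide, so treat them as assumptions.

```python
# Hedged sketch of the two metrics named on the slide. The T11SU constants
# (utility +2/-1, floor at -0.5) are an assumption from the TREC 2002 track.

def f_beta(relevant_retrieved, retrieved, relevant, beta=0.5):
    """FBeta; T11F uses beta = 0.5, weighting precision over recall."""
    if relevant_retrieved == 0 or retrieved == 0 or relevant == 0:
        return 0.0
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def t11su(relevant_retrieved, retrieved, relevant, min_nu=-0.5):
    """Scaled linear utility: +2 per relevant retrieved, -1 per non-relevant retrieved."""
    utility = 2 * relevant_retrieved - (retrieved - relevant_retrieved)
    normalized = utility / (2 * relevant)  # divide by the maximum possible utility
    return (max(normalized, min_nu) - min_nu) / (1 - min_nu)

# Example: 30 relevant documents retrieved out of 50 retrieved, 80 relevant overall
print(f_beta(30, 50, 80))  # ~0.54
print(t11su(30, 50, 80))   # 0.5
```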

Experiment
Each topic was set up as an independent domain in Apollo. Only the set of relevant documents from the topic's training set was used to create the topic signature. The topic signature was used to produce a vector, called the filter vector, comprising single-word terms weighted by their ranks. A comparison threshold was calculated from the mean and standard deviation of the cross products of the training documents with the filter vector; different distributions were assumed to estimate appropriate thresholds. In addition, the number of documents to be selected was set to a multiple of the training sample size. The entire testing set was indexed using Lucene, and for each topic the documents were compared against the topic filter vector via the cross product, in the document order prescribed by TREC. A hedged sketch of the scoring and thresholding appears below.
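A minimal sketch of the scoring and thresholding just described, assuming a simple rank-based weighting for the filter vector and a threshold of mean minus one standard deviation of the training-document scores; the slide gives neither the exact weighting nor the exact threshold rule, and the Lucene indexing step is omitted here.

```python
# Minimal sketch (assumptions noted above): rank-weighted filter vector,
# cross (dot) product scoring, and a threshold from the training scores.
import math
from collections import Counter

def filter_vector(relevant_training_docs, top_k=50):
    """Single-word terms weighted by rank: the top-ranked term gets weight 1.0."""
    counts = Counter()
    for doc in relevant_training_docs:
        counts.update(doc.lower().split())
    ranked = [term for term, _ in counts.most_common(top_k)]
    return {term: (len(ranked) - i) / len(ranked) for i, term in enumerate(ranked)}

def cross_product(doc, vector):
    """Score a document against the filter vector."""
    return sum(vector.get(t, 0.0) for t in doc.lower().split())

def comparison_threshold(relevant_training_docs, vector, k=1.0):
    """Mean minus k standard deviations of the training scores (assumed rule)."""
    scores = [cross_product(d, vector) for d in relevant_training_docs]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean - k * std

# Usage: select a test document if its score clears the topic threshold.
relevant = ["climate change policy report", "climate emissions policy debate"]
vector = filter_vector(relevant)
threshold = comparison_threshold(relevant, vector, k=1.0)
print(cross_product("climate change policy announcement", vector) >= threshold)
```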

Initial Results
Initial results show that Apollo's filtering effectiveness is very competitive with the TREC benchmarks. Precision and recall can be improved by leveraging additional components of the signatures.
Results table (50 assessor topics): average recall, average precision, and average T11F (FBeta) for Apollo versus the TREC benchmark system KerMit [2] (the numeric values did not survive in this transcript).

2. Cancedda et al., "Kernel Methods for Document Filtering," in NIST Special Publication 500-251: Proceedings of the Eleventh Text Retrieval Conference (TREC 2002), Gaithersburg, MD, 2002.

Topic Performance

Apollo Filtering Performance
Apollo's training time was linear in the number and size of the training documents (number of training documents vs. average training time). On average, the filtering time per document was constant (average test time).