
1 Apollo – Automated Content Management System. Srikanth Kallurkar, Quantum Leap Innovations. Work performed under AFRL contract FA8750-06-C-0052. ©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved.

2 Capabilities
Automated domain-relevant information gathering – gathers documents relevant to domains of interest from the web or proprietary databases.
Automated content organization – organizes documents by topics, keywords, sources, time references, and features of interest.
Automated information discovery – assists users with automated recommendations on related documents, topics, keywords, sources, ...

3 Comparison to the existing manual information-gathering method (what most users do currently). The user performs a "keyword search"; the goal is to maximize the results for a user keyword query. (Index: user task, search engine task, data.)
1. Develop information need (user).
2. Form keywords (user).
3. Search via the search engine interface.
4. Results returned from the generalized search index (data).
5. Examine results (user).
6. Satisfied? If yes, take a break.
7. If no, refine the query (conjure up new keywords) and search again, or 7a. give up.

4 Apollo information-gathering method (what users do with Apollo). The user explores, filters, and discovers documents assisted by Apollo features; the focus is on informative results seeded by a user-selected combination of features. Features include vocabulary, location, time, sources, ... (Index: user task, Apollo task, data.)
1. Develop information need (user).
2. Explore features (user).
3. Filter via the Apollo interface.
4. Results returned from the specialized domain model (data).
5. Examine results (user).
6. Satisfied? If yes, take a break.
7. If no, discover new/related information via Apollo features and continue.

5 Apollo Architecture

6 Apollo Domain Modeling (behind the scenes)
1. Bootstrap the domain.
2. Define the domain, topics, and subtopics.
3. Get training documents (Option A, B, or both): A. from the web (build representative keywords, query search engine(s), curate (optional)); B. from a specialized domain repository (select a small sample).
4. Build the domain signature: identify salient terms per domain, topic, and subtopic, and compute the classification threshold (a sketch of this step follows below).
5. Organize documents (Option A, B, or both; from the web or the specialized domain repository): filter documents, classify them into the defined topics/subtopics, and extract features (vocabulary, location, time, ...).
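The slides do not show how the signature-building step works internally. The following is a minimal Python sketch under stated assumptions: salience is approximated by normalized term frequency, the threshold is the mean of training scores minus k standard deviations, and the names (build_signature, classification_threshold, k) are hypothetical, not Apollo's actual API.

```python
import math
from collections import Counter

def build_signature(training_docs, top_n=200):
    """Collect salient terms for a domain/topic from its training documents.
    Salience is approximated here by normalized term frequency; Apollo's
    actual weighting scheme is not described on the slide."""
    counts = Counter()
    for tokens in training_docs:          # each doc is a list of tokens
        counts.update(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.most_common(top_n)}

def score(tokens, signature):
    """Score a document as the sum of the signature weights of its terms."""
    return sum(signature.get(t, 0.0) for t in tokens)

def classification_threshold(training_docs, signature, k=1.0):
    """Derive the classification threshold from the training scores
    (mean minus k standard deviations; k is an assumed tuning constant)."""
    scores = [score(d, signature) for d in training_docs]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean - k * std
```

A test document would then be accepted into the domain when its score against the signature clears the threshold.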

7 Apollo Data Organization
Snapshot of the Apollo process to collect a domain-relevant document: a document (e.g. a published article, news report, or journal paper) arrives from a data source (e.g. a web site, proprietary database, ...). Apollo checks whether the document is in the domain; if not, it is discarded. If it is, Apollo classifies it into the defined domain topics/subtopics, extracts features (domain-relevant vocabulary, locations, time references, sources, ...), and stores the document.
Snapshot of the Apollo process to evolve domain-relevant libraries: across many data sources and documents, the collection/organization process builds an Apollo library of domain-relevant documents per domain (Domain A, Domain B, Domain C, ...), organized by features (Feature A: Doc 1, Doc 2, ..., Doc N). A sketch of the per-document step follows below.
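A minimal sketch of that per-document collection step, assuming a simple in-memory library keyed by domain name. The Document class and the domain object's methods (is_in_domain, classify, extract_vocabulary, extract_locations, extract_time_references) are hypothetical placeholders for Apollo's internal components, not its real interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    source: str                          # e.g. web site, proprietary database
    features: dict = field(default_factory=dict)

def collect(doc, domain, library):
    """Per-document collection step from the slide: in-domain check, topic
    classification, feature extraction, then storage in the domain's library.
    All domain.* methods are assumed placeholders."""
    if not domain.is_in_domain(doc.text):            # not in domain -> discard
        return False
    doc.features["topics"] = domain.classify(doc.text)
    doc.features["vocabulary"] = domain.extract_vocabulary(doc.text)
    doc.features["locations"] = domain.extract_locations(doc.text)
    doc.features["times"] = domain.extract_time_references(doc.text)
    library.setdefault(domain.name, []).append(doc)  # evolve the domain library
    return True
```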

8 Apollo Information Discovery
The user selects a feature via the Apollo interface (e.g. the phrase "global warming" from the domain "climate change").
Apollo builds the set of documents from the library that contain the feature (a set of n documents containing the phrase "global warming").
Apollo collates all other features from that set and ranks them by domain relevance.
The user is presented with the co-occurring features (e.g. the phrases "greenhouse gas emissions" and "ice core" co-occurring with "global warming") and explores the documents containing them.
The user can use discovered features to expand or restrict the focus of the search based on driving interests. A sketch of this step follows below.
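A minimal sketch of the co-occurrence step, reusing the hypothetical library structure from the collection sketch above. It ranks co-occurring features by raw count; the slide says Apollo ranks by domain relevance, whose definition is not given, so the ranking function here is an assumption.

```python
from collections import Counter

def co_occurring_features(library, domain_name, selected, feature_type="vocabulary"):
    """Build the set of documents containing the selected feature and rank the
    other features that co-occur with it (by count, as a stand-in for Apollo's
    domain-relevance ranking)."""
    docs = [d for d in library.get(domain_name, [])
            if selected in d.features.get(feature_type, [])]
    counts = Counter()
    for d in docs:
        for f in d.features.get(feature_type, []):
            if f != selected:
                counts[f] += 1
    return docs, counts.most_common()

# e.g. docs, ranked = co_occurring_features(library, "climate change", "global warming")
# might surface phrases such as "greenhouse gas emissions" alongside the documents.
```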

9 Illustration: Apollo Web Content Management Application for the domain "Climate Change"

10 "Climate Change" Domain Model
Vocabulary (phrases, keywords, idioms) is identified for the domain from training documents collected from the web; these are the building blocks of the domain model.
Modeling error stems from noise in the training data and can be reduced by input from human experts.

11 Apollo Prototype (screenshot callouts: keyword filter; extracted keywords/phrases across the collection of documents; document results of filtering; domain; automated document summary; extracted locations across the collection of documents)

12 Inline Document View (screenshot callouts: filter interface; additional features; features extracted only for this document)

13 Expanded Document View (screenshot callouts: features extracted for this document; cached text of the document)

14 Automatically Generated Domain Vocabulary
The vocabulary is collated across the domain library; font size and thickness show domain importance, and importance changes as the library changes. A sketch of such a mapping follows below.
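A minimal sketch of mapping a term's importance to a display size, assuming linear interpolation between two point sizes. The function name, the interpolation, and the point sizes are all assumptions; the slide only states that font size and thickness reflect domain importance.

```python
def font_size(weight, min_weight, max_weight, min_pt=10, max_pt=36):
    """Map a term's domain-importance weight to a font size for the
    vocabulary display (linear scaling between min_pt and max_pt)."""
    if max_weight == min_weight:
        return min_pt
    return min_pt + (weight - min_weight) / (max_weight - min_weight) * (max_pt - min_pt)
```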

15 Apollo Performance

16 Experiment Setup
The experiment setup comprised the Text Retrieval Conference (TREC) document collection from the 2002 filtering track [1]. The document collection statistics were:
– The collection contained documents from Reuters Corpus Volume 1.
– There were 83,650 training and 723,141 testing documents.
– There were 50 assessor and 50 intersection topics. The assessor topics had relevance judgments from human assessors, whereas the intersection topics were constructed artificially from intersections of pairs of Reuters categories; the relevant documents are those to which both category labels have been assigned.
– The main metrics were T11F (FBeta with a coefficient of 0.5) and T11SU (a normalized linear utility); a sketch of the F-beta computation follows below.
1. http://trec.nist.gov/data/filtering/T11filter_guide.html
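For reference, a small Python sketch of the F-beta measure described above, with beta = 0.5 as stated on the slide. T11SU's utility normalization is defined in the TREC-11 filtering guidelines linked above and is not reproduced here.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta as used for T11F (beta = 0.5 weights precision more heavily
    than recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```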

17 Experiment
Each topic was set up as an independent domain in Apollo. Only the relevant documents from the topic's training set were used to create the topic signature. The topic signature was used to output a vector, called the filter vector, comprising single-word terms weighted by their ranks. A comparison threshold was calculated from the mean and standard deviation of the cross products of the training documents with the filter vector; different distributions were assumed to estimate appropriate thresholds. In addition, the number of documents to be selected was capped at a multiple of the training sample size. The entire testing set was indexed using Lucene. For each topic, documents were compared against the topic filter vector via the cross product, in the document order prescribed by TREC. A sketch of this run appears below.
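A minimal Python sketch of that per-topic run, under two stated assumptions: the slide's "cross product" is interpreted as an inner product of matching term weights, and the distributional assumptions are collapsed into a single tuning constant k. The function names are hypothetical and this does not reproduce the Lucene indexing step.

```python
import statistics

def dot(doc_vector, filter_vector):
    """Sum of products of matching term weights (the slide's 'cross product',
    read here as an inner product)."""
    return sum(w * filter_vector.get(term, 0.0) for term, w in doc_vector.items())

def make_threshold(training_vectors, filter_vector, k=1.0):
    """Threshold from the mean and standard deviation of the training documents'
    scores against the filter vector; k stands in for the distributional
    assumptions mentioned on the slide."""
    scores = [dot(v, filter_vector) for v in training_vectors]
    return statistics.mean(scores) - k * statistics.pstdev(scores)

def run_topic(filter_vector, threshold, test_stream, max_selected):
    """Process the test documents in the TREC-prescribed order, selecting those
    whose score clears the threshold, up to a cap set as a multiple of the
    training sample size (max_selected)."""
    selected = []
    for doc_id, doc_vector in test_stream:           # (id, {term: weight}) pairs
        if len(selected) >= max_selected:
            break
        if dot(doc_vector, filter_vector) >= threshold:
            selected.append(doc_id)
    return selected
```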

18 Initial Results
Initial results show that Apollo's filtering effectiveness is very competitive with TREC benchmarks. Precision and recall can be improved by leveraging additional components of the signatures.

50 Assessor Topics           Avg. Recall   Avg. Precision   Avg. T11F (FBeta)
Apollo                       0.35          0.63             0.499
TREC benchmark: KerMit [2]   -             0.43             0.495

2. Cancedda et al., "Kernel Methods for Document Filtering," in NIST Special Publication 500-251: Proceedings of the Eleventh Text Retrieval Conference, Gaithersburg, MD, 2002.

19 Topic Performance

20 Apollo Filtering Performance
Apollo's training time scaled linearly with the number and size of the training documents (number of training documents vs. average training time). On average, the filtering time per document was constant (average test time).

