Connecting the Docs: Integrating Information from Multiple Documents Presentation to ASIS&T PNC Annual Meeting Mark Wasson Senior Architect, Research Scientist LexisNexis New Technology Research firstname.lastname@example.org May 14, 2004
Connecting the Docs - Mark Wasson2 Talk Outline Introduction Search and retrieval, classification and indexing Clustering and summarization Extraction and aggregation Record linkage Analysis, visualization and discovery Closing remarks, Q&A References and related materials
Introduction
What is Information Integration? Pull together an appropriate amount of information about some subject matter (company, person, topic, product, event, etc.) into a single information product Key steps –Target some subject matter –Find relevant information across all relevant sources –Focus on the particularly useful information –Connect information about the target found in different documents, sources –Eliminate redundant information –Package the information
Search and Retrieval, Classification and Indexing
Search and Retrieval Search basics –Choose sources, search tools –Formulate query –Submit search –Review results –Refine and repeat as appropriate The result is generally a set of documents
Search and Retrieval Accuracy – all over the place –Recall (completeness) –Precision (correctness) What impacts results? –What you are searching for –Ambiguity, synonymy, variants –Source size and focus –Search functionality –Search engine algorithms, coverage –Data annotations and enhancements –Searcher's skills, knowledge of the topic User still must analyze search results
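The two accuracy measures on the slide above can be made concrete with a small sketch. It assumes the full set of relevant documents is known, which is rarely true outside an evaluation setting; the function name and document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Precision (correctness): share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall (completeness): share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the source.
p, r = precision_recall(
    ["d1", "d2", "d3", "d9"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
)
```

Here precision is 3/4 and recall is 3/6: the search is mostly correct but misses half of what it should have found.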
Google Mark Wasson
Google Mark Wasson Results 57 references in Top 100 (April 22, 2004) –About me –My papers –My pictures –Conference programs and attendees lists –Cites to my papers –Links to my site and pictures Using the retrieval results –Need to know a lot about me to select, connect the 57 –Look at most to get a fairly complete profile –Look at more than a few to get a solid introduction (unless you turn up a really good page early on)
Categorization and Indexing Map documents to a taxonomy of topics Current state of the technology –State of the art at 90-95% accuracy (recall, precision) –Many at 80-85% accuracy –Often designed to work with human editors –Academic research community skeptical Big commercial applications –Inxight/Factiva: machine learning technology/editorial hybrid –LexisNexis SmartIndexing: knowledge-based approach –Thomson-West CaRE (used in West km): machine learning-based approach
Categorization and Indexing Pros and Cons Pros –Creates sets of related documents –Higher accuracy (recall and precision) –With good organization and UI, can support ease of search, retrieval Cons –Coverage gaps –Incompatible scopes –Different recall, precision priorities And you're still dealing with documents
Clustering and Summarization
Statistical Document Clustering Find sets of potentially related documents –Create a feature representation for each document: words, phrases, equivalences, variants, frequencies; classifications; publication attributes –Compare, score feature similarity –Cluster most similar documents together You're still working with documents –Select most representative documents, one or more of those closest to a cluster's centroid
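The clustering steps on the slide above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any product mentioned in the talk: the features are plain word counts, similarity is cosine similarity, and clustering is a simple single-pass scheme with a hypothetical threshold of 0.3. All function names and sample headlines are invented for the example.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words term frequencies (a deliberately simple feature set)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Sum of the member vectors; cosine similarity ignores the scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cluster(docs, threshold=0.3):
    """Single pass: join the most similar cluster, or start a new one."""
    vecs = [features(d) for d in docs]
    clusters = []  # each cluster is a list of document indexes
    for i, v in enumerate(vecs):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, centroid([vecs[j] for j in c]))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters, vecs


def representative(members, vecs):
    """Index of the member document closest to the cluster centroid."""
    cen = centroid([vecs[j] for j in members])
    return max(members, key=lambda j: cosine(vecs[j], cen))


# Hypothetical headlines, two per story.
docs = [
    "Stocks fall as oil prices rise",
    "Oil prices rise and stocks fall sharply",
    "Local team wins championship game",
    "Championship game won by local team",
]
clusters, vecs = cluster(docs)
reps = [representative(c, vecs) for c in clusters]
```

With these four headlines the two stories end up in separate clusters, and `reps` holds one representative document index per cluster. A production system would add the richer features the slide lists (phrases, variants, classifications, publication attributes) and a more robust clustering algorithm.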
Clusters and Centroids: dots are documents, ovals are clusters, Xs are centroids. Picture from CS5604 – Information Storage and Retrieval class notes, Ed Fox, Virginia Tech, http://ei.cs.vt.edu/~cs5604/
Google News
Google News Integrates information at the document level –Finds, retrieves, organizes, presents today's news –Enough info is provided to give a nice overview –Links are provided for those who want the details Beginning to go beyond documents –Sub-document: headlines, leading sentences, pictures –Across documents: story ranking based on cluster attributes; representative documents are selected
The Information Unit Information takes lots of forms –Documents –Paragraphs –Sentences –Sentence fragments –Headlines, other document components –Tables –Databases –Directories –Lists –Facts –Ideas –Relationships (within, across documents)
Multidocument Summarization Identify related documents and create a single summary that captures their highlights –Document classification and clustering –Statistical sentence analysis –Extract key sentences, sentence fragments –Recombine the extracted information –Natural language analysis and generation to improve readability
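The statistical sentence analysis and extraction steps above can be illustrated with a rough sketch. It assumes the documents are already known to be related, scores each sentence by the average collection-wide frequency of its words, and skips sentences that mostly repeat one already chosen. It does not attempt the recombination or language generation steps. The stopword list, the overlap threshold and the sample texts are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, very short stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for", "by"}


def words(text):
    """Lowercased content words of a sentence."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(docs, max_sentences=2, max_overlap=0.5):
    """Pick high-scoring sentences across all documents, skipping near-repeats."""
    sentences = [
        s.strip()
        for d in docs
        for s in re.split(r"(?<=[.!?])\s+", d)
        if s.strip()
    ]
    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        ws = words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    summary = []
    for s in sorted(sentences, key=score, reverse=True):
        ws = set(words(s))
        # Redundant if most of its words already appear in a chosen sentence.
        redundant = any(
            len(ws & set(words(t))) / max(len(ws), 1) > max_overlap for t in summary
        )
        if not redundant:
            summary.append(s)
        if len(summary) == max_sentences:
            break
    return summary


# Two hypothetical reports of the same event.
docs = [
    "The city council approved the new budget. Critics said the budget ignores schools.",
    "The city council approved the budget on Monday. The mayor praised the vote.",
]
summary = summarize(docs)
```

The second report's opening sentence restates the first report's, so the redundancy check drops it and the summary draws on a different sentence instead.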
Columbia Newsblaster Daily Page
Columbia Newsblaster Summary, Links
Extraction and Aggregation
Extraction and Aggregation Find related pieces of information across a document collection and package those pieces into a single information product Information can be spread across lots of sources Information can be found in lots of formats Information is not always explicitly linked
LexisNexis Company Dossiers Users want good information about companies Company information is found in numerous news, directory, financial, government, legal and other sources –Literally dozens of searches needed to find everything Company names are not always used consistently across sources –Need ability to create a common search key across content, e.g., normalized form of company names Information is presented in free text, lists, tables, databases and directory entry formats –Need ability to find and extract important information
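The idea of a normalized company name as a common search key can be sketched as follows. This is only an illustration of the general technique, not the normalization LexisNexis uses: the suffix list is a small hypothetical sample, the company names are invented, and real data would also need handling for abbreviations, former names and subsidiaries.

```python
import re
from collections import defaultdict

# Hypothetical sample of legal suffixes to drop; a real list would be far longer.
SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc",
}


def normalize_company(name):
    """Reduce a company name variant to a lowercase search key."""
    text = name.lower().replace(".", "")      # "Inc." -> "inc"
    text = re.sub(r"[^a-z0-9& ]", " ", text)  # other punctuation -> space
    tokens = text.split()
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
    # Strip trailing legal suffixes, but never empty the name entirely.
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Group hypothetical records from different sources under one key.
records = [
    ("news", "Acme Widgets, Inc."),
    ("directory", "ACME WIDGETS INCORPORATED"),
    ("legal", "The Acme Widgets Co."),
    ("news", "Bolt & Nut Ltd"),
]
by_key = defaultdict(list)
for source, name in records:
    by_key[normalize_company(name)].append(source)
```

All three Acme variants map to the key `acme widgets`, so one lookup can pull together the news, directory and legal records.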
Company Dossier
Company Dossier (cont.)
Company Dossier (cont.)
Company Dossier (cont.)
Company Dossier (cont.)
Record Linkage
Record Linkage Record linkage techniques are used to connect related records when there is no explicit key –Data lacks explicit keys, such as ID numbers, normalized company names, etc. –Data lacks consistent features, such as unique names, presence of address or phone number, etc. Combine feature extraction and analysis –Identify, extract, normalize features as evidence –Compare features across records, looking for a preponderance of evidence of relatedness –Apply other heuristics, e.g., top-ranked, score threshold
Westlaw Profiler-related Research Users want background information on attorneys, judges and expert witnesses Information about attorneys and judges found in case law, jury verdicts, directories, etc. Information about expert witnesses found in jury verdicts, medical publications, news, websites, etc. People's names are problematic –Many people with the same name –Variation is common But the set of attorneys and judges is somewhat defined by directories.
Westlaw Profiler-related Research (cont.) Link judges, attorneys between case law and West Legal Directory (Dozier & Haschart, 2000) Case law feature extraction –Find critical sections within cases –For each attorney, attempt to extract first name, middle name, last name, name suffix, firm name, city, state –For each judge, attempt to extract first name, middle name, last name, name suffix, court, date –Package features into Template Records West Legal Directory feature extraction –Extract similar features from directory entries for judges and attorneys –Package features into Biography Records
Westlaw Profiler-related Research (cont.) Match Template Records to Biography Records –Attempt to match normalized features between pairs of records to create a match probability score –For given attorney or judge Template Record, the match to Biography Record with highest match probability score is likely correct match Additional heuristics –The dates must be compatible –Highest match probability score must exceed threshold –No match is made if a tie score occurs
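The matching logic and the three heuristics above (compatible dates, a score threshold, no match on a tie) can be sketched as below. This is not the model from Dozier & Haschart (2000): a hand-weighted agreement score stands in for their match probability, and the weights, threshold, field names and date rule are all assumptions made for the example.

```python
# Hypothetical feature weights; the cited work derives a match probability instead.
WEIGHTS = {"last": 0.4, "first": 0.25, "middle": 0.1, "city": 0.15, "state": 0.1}


def score(template, bio):
    """Sum the weights of the normalized features on which two records agree."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        t, b = template.get(field), bio.get(field)
        if t and b and t.lower() == b.lower():
            total += weight
    return total


def dates_compatible(template, bio):
    # Assumed rule: the case cannot predate the year the attorney was admitted.
    return template["year"] >= bio["admitted"]


def link(template, bios, threshold=0.6):
    """Return the id of the best-matching Biography Record, or None."""
    scored = sorted(
        ((score(template, b), b["id"]) for b in bios if dates_compatible(template, b)),
        reverse=True,
    )
    if not scored or scored[0][0] < threshold:
        return None  # nothing exceeds the threshold
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # tie score: make no match
    return scored[0][1]


# Hypothetical Template Record extracted from a case, and directory entries.
template = {"first": "Jane", "last": "Doe", "city": "Seattle", "state": "WA", "year": 1998}
bios = [
    {"id": "B1", "first": "Jane", "last": "Doe", "city": "Seattle", "state": "WA", "admitted": 1990},
    {"id": "B2", "first": "John", "last": "Doe", "city": "Seattle", "state": "WA", "admitted": 1985},
    {"id": "B3", "first": "Jane", "last": "Doe", "city": "Seattle", "state": "WA", "admitted": 2005},
]
match = link(template, bios)
```

In this example B3 agrees on every name and location feature but is excluded by the date check, leaving B1 as the single top-scoring candidate. Refusing to match on ties or low scores trades recall for precision, which is consistent with the priority a directory-linking application would likely have.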
Westlaw Profiler-related Research (cont.) Attorney match accuracy –99% precision, 92% recall Judge match accuracy –98% precision, 90% recall Common causes of errors –Marriage-based name changes –Spelling errors in the data –Gaps in the directory, such as past positions See Dozier et al. (2003) for similar work with expert witness-related information
Analysis, Visualization and Discovery
From Integration to Exploration and Discovery Analytical, visualization and discovery tool uses –Summarize key information in a document set –Find and explain interesting facts, relationships and patterns in a document set –Discover previously unknown information Key components –Extract entities, co-occurrence patterns, subject-verb-object relationships –Coreference resolution, name variant linkage –Statistical analysis –Link analysis –Report generation tools –Data visualization tools
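Of the components listed above, co-occurrence patterns are the simplest to illustrate. The sketch below assumes entity extraction and name variant linkage have already been done, so each document is represented by a list of normalized entity names; it then counts how many documents mention each pair together. The function name and the entities are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(doc_entities):
    """Count, for each entity pair, the documents in which both appear."""
    pairs = Counter()
    for entities in doc_entities:
        # Sort and deduplicate so each unordered pair is counted once per document.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical extracted entities, one list per document.
doc_entities = [
    ["Acme Widgets", "Jane Doe", "Seattle"],
    ["Jane Doe", "Acme Widgets"],
    ["Seattle", "Bolt & Nut"],
]
pairs = cooccurrence(doc_entities)
strongest = pairs.most_common(1)[0]
```

The resulting pair counts are the kind of edge weights a link analysis or relationship map could be built on, with the most frequent pairs drawn as the strongest connections.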
Insightful's InFact Concept Graph Example from Insightful website
ClearForest's ClearResearch Relations Map Example from ClearForest website
Closing Remarks
Closing Thoughts We have solved the information overload problem! Content has exploded –Web: 0 pages > 1 billion pages > 6 billion pages? –Subscription services: Elsevier, Factiva, LexisNexis, Westlaw, lots of others –Deep web: 500 times bigger than surface web Even if we solve retrieval, classification, indexing –Amount of highly relevant material often overwhelming
Closing Thoughts Information integration is coming (some is here!) –Information retrieval –Document categorization and indexing –Document clustering –Entity identification –Information extraction –Relationship extraction –Information aggregation –Record linkage –Multidocument summarization –Analytical tools –Data visualization –Knowledge discovery
The End Any questions? Mark Wasson email@example.com http://www.emarkwasson.com (206) 728-7109 Product and service names are trademarks or registered trademarks of their holders.
References and Related Materials
References and Related Materials ClearForest –ClearForest, http://www.clearforest.com –ClearResearch, http://www.clearforest.com/Products/Analytics/ClearResearch.asp Columbia –Columbia Natural Language Processing Group, http://www.cs.columbia.edu/nlp/ –Columbia Newsblaster, http://newsblaster.cs.columbia.edu/ –Schiffman et al. (2002). Experiments in Multidocument Summarization. 2002 Human Language Technology Conference. –McKeown et al. (2003). Columbia's Newsblaster: New Features and Future Directions. 2003 Human Language Technology-North American Association for Computational Linguistics Conference.
References and Related Materials Google –Google, http://www.google.com –Google News, http://news.google.com Insightful –Insightful, http://www.insightful.com –Insightful InFact, http://www.insightful.com/products/infact/ Inxight –Inxight, http://www.inxight.com –Inxight classification, http://www.inxight.com/products/smartdiscovery/ –Hersey (2003). Factiva Reaps Benefits from Automatic Text Classification – An End User Case Study. 3rd Workshop on Operational Text Classification Systems.
References and Related Materials LexisNexis –LexisNexis, http://www.lexisnexis.com –LexisNexis Company Dossier, http://www.lexisnexis.com/companydossier/ –Wasson (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Language Technology Joint Conference: ANLP-NAACL 2000.
References and Related Materials Thomson-West –Thomson-West, http://west.thomson.com –Westlaw Profiler, http://west.thomson.com/store/product.asp?product%5Fid=Westlaw+Profiler&catalog%5Fname=wgstore –Dozier & Haschart (2000). Automatic Extraction and Linking of Person Names in Legal Text. RIAO-2000. –Dozier et al. (2003). Creation of an Expert Witness Database Through Text Mining. 9th International Conference on Artificial Intelligence and Law. –Dabney et al. (2003). West km 2.0 – Classifying Document Collections with CaRE. Thomson-West white paper.