Phrase Matching: Assessing Document Similarity for NASA Scientists and Engineers
Beini Ouyang, Department of Computer Science
2005-3-28

Presentation transcript:

Phrase Matching: Assessing Document Similarity for NASA Scientists and Engineers
Beini Ouyang
Department of Computer Science, The University of Alabama
Advisor: Dr. Randy K. Smith

Outline
• Problem & Motivation
• Background & Related Work
• Approach & Uniqueness
• Results and Contributions

Problem & Motivation
• Problem
  • NASA scientists and engineers deal with hundreds of thousands of technical standards from hundreds of organizations.
• They need to know
  • The most current and relevant information
  • What related knowledge is available
• A mechanism is needed to help answer the following questions:
  • Are similar technical standards available?
  • Is there training material related to this standard?
  • Have lessons learned related to this standard been documented?

Problem & Motivation
Figure 1. Relationship between Lessons Learned, Training Material and Technical Standards

Motivation
• A lot of work has been done on document search.
• Existing work exploits matching strategies to locate similar documents.
  • Matching is generally based on the frequency of single words.
  • Single words are either supplied as keywords or generated by indexing the document of interest.
• Result:
  • Search efficiency and precision degrade as document size and the number of documents grow.
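The single-word matching described on this slide can be illustrated with a minimal sketch. It is not the code of any tool named in this talk; the function names `single_word_index` and `shared_word_score`, the tokenization rule, and the overlap score are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def single_word_index(text, top_n=10):
    """Return the top_n most frequent words in text as (word, count) pairs."""
    # Assumption: words are lowercase runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words).most_common(top_n)


def shared_word_score(index_a, index_b):
    """Sum, over words present in both indexes, the smaller of the two counts."""
    counts_a, counts_b = dict(index_a), dict(index_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(min(counts_a[w], counts_b[w]) for w in shared)


doc_a = "pressure vessel standard for pressure testing"
doc_b = "test report on vessel pressure"
score = shared_word_score(single_word_index(doc_a), single_word_index(doc_b))
```

Because any document that shares a few common words receives a non-zero score, a scheme like this tends to return more candidates as the collection grows, which is the degradation the slide points to.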

Motivation
• We propose an approach that emphasizes word-phrase indexes over single-word indexes.
• Goal: find fewer, but more precisely related, documents.
• Phrase-based search will be used to refine the results.

BACKGROUND & RELATED WORK
• Background
  • NASA's Technical Standards Program (NTSP) provides access to over 1600 NASA agency-wide preferred technical standards, over 45,000 standards from other government groups, and more than 95,000 standards from over 145 national and international Standards Development Organizations (SDOs), committees and working groups.
  • The Lessons Learned and Best Practices (LLBP) collection includes NASA-published lessons and links to over 30 lessons-learned databases from government and non-government organizations.

BACKGROUND & RELATED WORK
• The SA_MetaMatch tool was developed to aid the discovery and linking of related standards and lessons-learned documents.
• The SA_MetaMatch tool is a component of the larger Standard Advisors Project.
• SA_MetaMatch was designed for finding similar documents in NASA experience databases using single-word scoring across document metadata.

BACKGROUND & RELATED WORK
• Related Work: SA_MetaMatch
  • First, the scheme builds metadata elements adopted from the Dublin Core (DC). The main focus is then on integrating Dublin Core with the metadata for each document.
  • After the indices are extracted and generated from the document content, they serve as a benchmark for finding possibly related documents.
  • In addition, SA_MetaMatch adopts a word-scoring mechanism for ranking the result documents.
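As a rough illustration of a metadata record built on Dublin Core, the sketch below uses the fifteen element names of the Dublin Core Metadata Element Set, Version 1.1. The `make_record` helper, the list-valued fields, and the idea of placing generated index terms in `subject` are assumptions made for this example; the slides do not show how SA_MetaMatch stores its metadata.

```python
# The fifteen elements of the Dublin Core Metadata Element Set, Version 1.1.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields):
    """Build a record with one list of values per Dublin Core element.

    Hypothetical helper: every element is optional and repeatable, so each
    element maps to a (possibly empty) list.
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    record = {name: [] for name in DC_ELEMENTS}
    for name, value in fields.items():
        record[name] = list(value) if isinstance(value, (list, tuple)) else [value]
    return record


# Assumption: index terms generated from the document body go into "subject".
record = make_record(
    title="Example Technical Standard",
    subject=["pressure", "vessel", "testing"],
    language="en",
)
```

Keeping every element list-valued means a record can hold several index terms or creators without changing its shape.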

BACKGROUND & RELATED WORK
Fig. 2 Generate / Edit Metadata Screen
Fig. 3 Class Diagram for SA_MetaGen

BACKGROUND & RELATED WORK
• SA_MetaMatch
  • It is an effective tool for locating similar documents.
  • However, it returns a large set of unrelated documents.
  • Matching on single-word index files finds a large number of documents.
  • This slows down the search for large documents.

APPROACH & UNIQUENESS
• Word-phrase indexing can play a more significant role in matching documents than single-word indexes.
• This research explores a phrase-based indexing extension to SA_MetaMatch.
• This extension is expected to improve results for NASA NTSP.

APPROACH & UNIQUENESS
• The approach taken includes:
  • Generating the phrase and word index metadata.
    • Phrase length plays an important role in the indexing and matching process. As a heuristic, this work begins with a four-word phrase limit. The steps are:
      • Starting from the position of each word in the document.
      • Recursively generating phrases in terms of word position.
      • Limiting the phrase length.
      • Matching only the top 20 phrases, chosen from phrases that occur more than once.
  • Adding a phrase weight score mechanism. A phrase carries more weight than a raw index word, so the scheme can give more specific results than the previous single-word weight score mechanism.
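The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SA_MetaMatch extension itself: the four-word limit, the top-20 cut-off and the frequency-greater-than-1 filter come from the slide, while the tokenization, the use of a loop instead of recursion, the minimum phrase length of two words, the `phrase_weight` value of 3.0, and the overlap-based score are hypothetical choices.

```python
import re
from collections import Counter

MAX_PHRASE_LEN = 4   # four-word phrase limit (from the slide)
TOP_PHRASES = 20     # only the top 20 phrases are matched (from the slide)
MIN_FREQ = 2         # phrase frequency must be greater than 1 (from the slide)


def tokenize(text):
    """Assumption: words are lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_index(text):
    """Count every 2- to 4-word phrase by start position; keep the top repeated ones."""
    words = tokenize(text)
    counts = Counter()
    for start in range(len(words)):
        for length in range(2, MAX_PHRASE_LEN + 1):
            if start + length > len(words):
                break
            counts[" ".join(words[start:start + length])] += 1
    repeated = [(p, c) for p, c in counts.most_common() if c >= MIN_FREQ]
    return dict(repeated[:TOP_PHRASES])


def match_score(target, candidate, phrase_weight=3.0):
    """Shared-word overlap plus shared-phrase overlap, with phrases weighted higher."""
    t_words, c_words = Counter(tokenize(target)), Counter(tokenize(candidate))
    word_score = sum(min(t_words[w], c_words[w]) for w in t_words.keys() & c_words.keys())
    t_phr, c_phr = phrase_index(target), phrase_index(candidate)
    phrase_score = sum(min(t_phr[p], c_phr[p]) for p in t_phr.keys() & c_phr.keys())
    return word_score + phrase_weight * phrase_score


text = "technical standards program supports technical standards program users"
index = phrase_index(text)
```

With this scoring, a candidate that merely reuses the same words in a different order earns only the word score, while a candidate that repeats the same phrases is ranked higher, which is the refinement the phrase weight is meant to provide.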

RESULTS AND CONTRIBUTIONS
Fig. 4 Single word index frequency
Fig. 5 Phrase word index frequency

RESULTS & CONTRIBUTIONS
• Preliminary results indicate that phrase-based indexing achieves better results than single-word indexing for certain types of documents.
• Our results indicate that phrase-based indexing and matching is most beneficial when examining large documents.
• The cost of generating the phrase index, amortized against the improved matching precision, is justified when the target document and the searched documents are large.
• Future work:
  • Examining the 4-word phrase heuristic
  • Assessing our weighting scheme

REFERENCES
• P. Gill, W. Vaughan, and D. Garcia, “Lessons Learned and Technical Standards: A Logical Marriage,” ASTM Standardization News, November
• J. W. Cooper and J. M. Prager, “Anti-Serendipity: Finding Useless Documents and Similar Documents,” Proceedings of the 33rd Hawaii International Conference on System Sciences, Maui, HI, January 2000.
• C. Yau and S. Hawker, “SA_MetaMatch: Document Discovery Through Document Metadata and Indexing,” Proceedings of the 42nd Annual ACM Southeast Regional Conference, Huntsville, AL, April 2-3,
• DCMI, Dublin Core Metadata Element Set, Version 1.1: Reference Description, 2 June 2003.

Thanks!