1 Phrase Matching: Assessing Document Similarity for NASA Scientists and Engineers
Beini Ouyang
Department of Computer Science, The University of Alabama
bouyang@cs.ua.edu
Advisor: Dr. Randy K. Smith
2005-3-28

2 Outline
• Problem & Motivation
• Background & Related Work
• Approach & Uniqueness
• Results and Contributions

3 Problem & Motivation
• Problem
  • Scientists and engineers must deal with hundreds of thousands of technical standards from hundreds of organizations.
  • They need to know
    • The most current and relevant information
    • What related knowledge is available
  • A mechanism is needed to help answer the following questions
    • Are there similar technical standards available?
    • Is there training material related to this standard?
    • Have lessons learned related to this standard been documented?

4 Problem & Motivation
Figure 1. Relationship between Lessons Learned, Training Material and Technical Standards

5 Motivation
• A lot of work has been done on document search
  • Matching strategies are exploited to locate similar documents
  • These strategies are generally based on the frequency of single words
  • Single words are either supplied as keywords or generated by indexing the document of interest
• Result:
  • Search efficiency and precision degrade as document size and the number of documents grow
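To make the single-word baseline concrete, here is a minimal sketch of frequency-based matching. It is an illustration only, not the scoring used by any NASA tool: the function names, the tokenization rule, and the choice of summing the smaller of the two frequencies for each shared word are all assumptions.

```python
import re
from collections import Counter


def word_index(text):
    """Build a single-word index: lowercase word -> number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def shared_word_score(index_a, index_b):
    """Assumed score: for each shared word, add the smaller of its two counts."""
    # Counter & Counter keeps each common key with the minimum of its counts.
    return sum((index_a & index_b).values())


if __name__ == "__main__":
    standard = word_index("thermal vacuum test for thermal control")
    lesson = word_index("vacuum test procedure")
    print(shared_word_score(standard, lesson))
```

Because every shared word contributes to the score, common words such as "test" can link documents that are otherwise unrelated, which is the precision problem described above.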

6 Motivation
• We propose an approach that emphasizes word-phrase indexes over single-word indexes.
  • Goal: finding fewer but more precisely related documents
  • Phrase-based search will be used to refine the results

7 BACKGROUND & RELATED WORK
• Background
  • NASA’s Technical Standards Program (NTSP) provides access to over 1600 NASA agency-wide preferred technical standards, over 45,000 standards from other government groups, and more than 95,000 standards from over 145 national and international SDOs (Standards Development Organizations), committees and working groups.
  • The Lessons Learned and Best Practices (LLBP) include NASA published lessons and links to over 30 lessons-learned databases from government and non-government organizations.

8 BACKGROUND & RELATED WORK
• The SA_MetaMatch tool was developed to aid the discovery and linking of related standards and lessons learned documents.
• The SA_MetaMatch tool is a component of the larger Standard Advisors Project.
• SA_MetaMatch was designed for finding similar documents in NASA experience databases using single-word scoring across document metadata.

9 BACKGROUND & RELATED WORK
• Related Work: SA_MetaMatch
  • First, the scheme builds metadata elements adopted from the Dublin Core (DC), with the main focus on integrating Dublin Core with the metadata for each document.
  • After the indices are extracted and generated from document content, they serve as a benchmark for finding possibly related documents.
  • In addition, SA_MetaMatch adopts a word-scoring mechanism for ranking the resulting documents.
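The following sketch shows one way a metadata record with a word index could drive ranking. It does not reproduce SA_MetaMatch: the class name, the three Dublin Core elements chosen (identifier, title, subject), the document identifiers, and the scoring rule are all hypothetical and serve only to illustrate the idea of ranking candidates against a target document's index.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical record: a few Dublin Core elements plus a word index."""
    identifier: str  # Dublin Core "identifier"
    title: str       # Dublin Core "title"
    subject: str     # Dublin Core "subject"
    word_index: Counter = field(default_factory=Counter)


def rank_candidates(target, candidates):
    """Return (score, identifier) pairs, best match first, dropping zero scores."""
    scored = [
        # Assumed word score: sum of the smaller count for every shared word.
        (sum((target.word_index & c.word_index).values()), c.identifier)
        for c in candidates
    ]
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)


if __name__ == "__main__":
    target = MetadataRecord("STD-1", "Weld inspection", "welding",
                            Counter({"weld": 3, "inspection": 2}))
    pool = [
        MetadataRecord("LL-1", "Weld crack lesson", "welding",
                       Counter({"weld": 1, "crack": 2})),
        MetadataRecord("STD-2", "Weld inspection criteria", "welding",
                       Counter({"weld": 4, "inspection": 1})),
        MetadataRecord("TR-3", "Budget training", "finance",
                       Counter({"budget": 1})),
    ]
    print(rank_candidates(target, pool))
```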

10 BACKGROUND & RELATED WORK
Fig. 2 Generate / Edit Metadata Screen
Fig. 3 Class Diagram for SA_MetaGen

12 BACKGROUND & RELATED WORK
• SA_MetaMatch
  • An effective tool for locating similar documents
  • However, it returns a large set of unrelated documents.
  • Matching against single-word index files finds a large number of documents and slows down the search for large documents.

13 APPROACH & UNIQUENESS
• Word-phrase indexing can play a more significant role in matching documents than single-word indexes.
• This research explores a phrase-based indexing extension to SA_MetaMatch.
• This extension is expected to improve results for NASA NTSP.

14 APPROACH & UNIQUENESS
• The approach taken includes:
  • Generating the phrase and word index metadata.
    • Phrase length plays an important role in the indexing and matching process. As a heuristic, this work begins with a four-word phrase limit. The approach taken is:
      • Starting from the position of each word in the document.
      • Recursively generating phrases in terms of word position.
      • Limiting the phrase length.
      • Matching only the top 20 phrases, among those with a frequency greater than 1.
  • Adding a phrase weight score mechanism. A phrase carries more weight than the raw index, so the results can be more specific than those of the previous single-word weight score mechanism.
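The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: phrases are generated with plain loops rather than recursion, phrases are taken to be runs of two to four adjacent words, tokenization ignores punctuation (so phrases may span sentence boundaries), and the `phrase_weight` value of 3.0 and all function names are hypothetical. Only the four-word limit, the top-20 cutoff, and the frequency-greater-than-1 filter come from the slides.

```python
import re
from collections import Counter

MAX_PHRASE_WORDS = 4  # four-word phrase limit stated in the slides
TOP_PHRASES = 20      # only the top 20 phrases are matched


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_index(text, max_words=MAX_PHRASE_WORDS, top=TOP_PHRASES):
    """Index phrases of 2..max_words adjacent words, keyed by word position."""
    words = tokenize(text)
    counts = Counter()
    for start in range(len(words)):             # each word position
        for length in range(2, max_words + 1):  # grow the phrase up to the limit
            if start + length > len(words):
                break
            counts[" ".join(words[start:start + length])] += 1
    # Keep only phrases that occur more than once, then the most frequent ones.
    repeated = Counter({p: c for p, c in counts.items() if c > 1})
    return dict(repeated.most_common(top))


def weighted_score(doc_a, doc_b, phrase_weight=3.0):
    """Assumed score: shared-word count plus a heavier shared-phrase count."""
    words_a, words_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    phrases_a, phrases_b = phrase_index(doc_a), phrase_index(doc_b)
    word_part = sum((words_a & words_b).values())
    phrase_part = sum(min(phrases_a[p], phrases_b[p])
                      for p in phrases_a.keys() & phrases_b.keys())
    return word_part + phrase_weight * phrase_part


if __name__ == "__main__":
    a = "thermal vacuum test plan. thermal vacuum test report."
    b = "thermal vacuum test setup and thermal vacuum test results"
    print(phrase_index(a))
    print(weighted_score(a, b))
```

With a phrase weight above 1, two documents that share repeated multi-word phrases score well above documents that merely share isolated words, which is the intended effect of weighting phrases more heavily than single-word entries.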

15 RESULTS AND CONTRIBUTIONS
Fig. 4 Single Word Index Frequency
Fig. 5 Phrase Word Index Frequency

16 RESULTS & CONTRIBUTIONS
• Preliminary results indicate that phrase-based indexing achieves better results than single-word indexing for certain types of documents.
• Our results indicate that phrase-based indexing and matching are most beneficial when examining large documents.
• The amortized cost of generating the phrase index is justified by the improved matching precision when the target document and search documents are large.
• Future work:
  • Examining the four-word phrase heuristic
  • Assessing our weighting scheme

17 REFERENCES
• P. Gill, W. Vaughan, and D. Garcia, “Lessons Learned and Technical Standards: A Logical Marriage,” ASTM Standardization News, November 2001. http://www.astm.org
• J. W. Cooper and J. M. Prager, “Anti-Serendipity: Finding Useless Documents and Similar Documents,” Proceedings of the 33rd Hawaii International Conference on System Sciences, Maui, HI, January 2000.
• C. Yau and S. Hawker, “SA_MetaMatch: Document Discovery Through Document Metadata and Indexing,” Proceedings of the 42nd Annual ACM Southeast Regional Conference, Huntsville, AL, April 2-3, 2004.
• DCMI, “Dublin Core Metadata Element Set, Version 1.1: Reference Description,” 2 June 2003.

18 Thanks!

