GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)


1 GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)

2 Roadmap
- Problem definition
- Motivation
- Solution framework
- Demo
- Conclusion

3 Problem definition
The purpose of an automatic glossary compiler is to aid in the construction of a list of definitions across a large collection of documents.
A definition is a concise description of what an entity is.
Challenges:
- Multiple ways to phrase a definition
- A single term can have multiple definitions
- Definitions need to be clustered

4 Motivation
Benefit for everyone:
- Construct a glossary without marking index words by hand;
- Quickly look up the definition of a term in a book, a journal article, a set of books, or a collection of papers on a particular topic.
No similar tool currently exists.

5 Solution framework
- Query processing: Yahoo API;
- Definition extraction: MINIPAR;
- Clustering algorithm: K-means;
- Technology: IE Toolbar.

6 Page processing
Goals:
- Fetch pages for a given query
  - Use multi-threading to accelerate fetching
- Convert multiple formats (e.g., PDF files) into text format
- Filter
  - Remove HTML tags, incomplete tokens…
  - Detect sentence boundaries
  - Remove garbage
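The filter goals above can be illustrated with a minimal sketch. It is not the presenters' implementation: the function names, the list of block-level tags, and the garbage thresholds (`min_words`, `min_alpha`) are all assumptions chosen for illustration, and the regex-based tag stripping does not handle scripts or stylesheets.

```python
import html
import re

# Block-level tags become line breaks so headings and paragraphs stay separate.
BLOCK_RE = re.compile(r"</?(?:p|div|br|li|ul|ol|tr|td|table|h[1-6])\b[^>]*>", re.I)
TAG_RE = re.compile(r"<[^>]+>")
# Naive sentence boundary: ., ! or ? followed by whitespace and a capital or quote.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[\"'A-Z])")


def strip_tags(markup):
    """Remove HTML tags, decode entities, and return non-empty text lines."""
    text = BLOCK_RE.sub("\n", markup)
    text = TAG_RE.sub("", text)
    text = html.unescape(text)
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.split("\n")]
    return [line for line in lines if line]


def split_sentences(line):
    """Split one text line into sentences at the naive boundaries."""
    return [s.strip() for s in BOUNDARY_RE.split(line) if s.strip()]


def is_garbage(sentence, min_words=4, min_alpha=0.6):
    """Assumed heuristic: too few words or too few letters means garbage."""
    if len(sentence.split()) < min_words:
        return True
    letters = sum(ch.isalpha() for ch in sentence)
    return letters / len(sentence) < min_alpha


def clean_page(markup):
    """Tag removal, then sentence segmentation, then garbage cleaning."""
    sentences = []
    for line in strip_tags(markup):
        sentences.extend(s for s in split_sentences(line) if not is_garbage(s))
    return sentences
```

Navigation fragments and footers tend to fail the word-count check, while full sentences pass; real pages would likely need a more careful HTML parser and a tuned garbage test.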

7 Page processing (cont.)
[Flowchart: query string → Process Query (Yahoo API) → result set → Fetch URL → pdf? Convert to TXT / html? Remove Tag → Sentence Segmentation → Garbage Cleaning → pages.TXT]
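The fetch step and the pdf/html branch of the page-processing flow could look roughly like the sketch below. It assumes a thread pool from Python's standard library in place of whatever threading the original tool used; `fetch_url`, `detect_format`, and `fetch_all` are hypothetical names, and the Yahoo API query itself is left out because its interface is not described in the slides.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_url(url, timeout=10.0):
    """Download one page; return b"" on any network or URL error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError):
        return b""


def detect_format(url, payload):
    """Guess the format from the PDF magic bytes or the URL suffix."""
    if payload.startswith(b"%PDF") or url.lower().endswith(".pdf"):
        return "pdf"
    return "html"


def fetch_all(urls, fetch=fetch_url, max_workers=8):
    """Fetch all URLs concurrently; results keep the input order.

    Pages that could not be downloaded (empty payload) are dropped.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = list(pool.map(fetch, urls))
    return [
        (url, detect_format(url, data), data)
        for url, data in zip(urls, payloads)
        if data
    ]
```

The `fetch` parameter is injectable so the pipeline can be exercised without network access; each returned tuple would then be routed to a PDF-to-text converter or to tag removal depending on the detected format.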

8 Definition extraction
Dependency parser (MINIPAR):
- Based on the theory of dependency grammars;
- Broad-coverage parser;
- Output is a parse tree representing head-modifier relations.
Generic definition patterns:
- Use generic semantic patterns to overcome syntactic variability (expressing the same meaning with the same set of words through different syntactic structures of a sentence);
- Extensible, easily coded in XML, and require minimal knowledge of linguistics.

9 Definition extraction “Data Mining, also known as knowledge discovery in data bases, is the process of automatically searching large volumes of data for patterns.”

10 Definition extraction
Simple and complex definitions:
- Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts;
- Data Mining can be defined as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data”.
Simple and complex terms being defined:
- Data Mining;
- Core of comparative genome analysis.
Extensible;
High accuracy (limited by the parser).
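To make the idea of definition patterns concrete, here is a deliberately simplified sketch. It swaps the dependency-parse patterns described in the slides for surface regular expressions, so it is not how MINIPAR-based extraction works; the two patterns, the `ALIAS` fragment, and the function name `extract_definition` are assumptions for illustration only.

```python
import re

# Optional appositive such as ", also known as knowledge discovery in data bases,"
ALIAS = r"(?:,\s*also (?:known|referred to) as\s+(?P<alias>[^,]+),)?"

PATTERNS = [
    # "<Term> is a/an/the <definition>."
    re.compile(
        r"^(?P<term>[A-Z][\w\- ]*?)" + ALIAS
        + r"\s+is\s+(?P<definition>(?:a|an|the)\s.+?)\.?$"
    ),
    # "<Term> can be defined as <definition>."
    re.compile(
        r"^(?P<term>[A-Z][\w\- ]*?)" + ALIAS
        + r"\s+can be defined as\s+(?P<definition>.+?)\.?$"
    ),
]


def extract_definition(sentence):
    """Return term, aliases, and definition, or None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            alias = match.group("alias")
            return {
                "term": match.group("term").strip(),
                "aliases": [alias.strip()] if alias else [],
                "definition": match.group("definition").strip('"'),
            }
    return None
```

Surface patterns like these miss the "complex" case from the slides (the sentence beginning "Although it is usually used..."), where the term is buried mid-sentence; handling that kind of syntactic variability is the stated reason for matching on dependency parse trees instead.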

11 Clustering
Algorithm:
- K-means;
Similarity measure:
- Vector space model;
Challenges:
- Define k;
- Define similarity measure.
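A small sketch of K-means over a vector space model follows. The slides do not say how terms are weighted, how similarity is computed, or how centroids are seeded, so the unit-length term-frequency vectors, cosine similarity, and farthest-first seeding used here are all assumptions; the function names are hypothetical, and `k` must not exceed the number of texts.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term-frequency vector, scaled to unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for w, v in vec.items():
            total[w] += v
    return {w: v / len(vectors) for w, v in total.items()}


def kmeans(texts, k, max_iter=20):
    """Cluster texts into k groups; returns one cluster label per text."""
    vectors = [vectorize(t) for t in texts]
    # Farthest-first seeding (an assumption; keeps the sketch deterministic).
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(farthest)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each text goes to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels
```

Both challenges named on the slide show up directly as inputs here: `k` has to be supplied by the caller, and swapping `vectorize` or `cosine` for another weighting or similarity measure changes which definitions end up grouped together.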

12 Demo

13 Thank you! Questions?

