Search and Decoding Final Project Identify Type of Articles Using Property of Perplexity By Chih-Ti Shih Advisor: Dr. V. Kepuska.

2007/12/13 Chih-Ti Shih

Project Outline
- Project objective
- Building specialized corpora using the Bootcat toolkit
- Building language models using the CMU language toolkit
- Computing the perplexity of the corpora
- Project result

Project Objective: Introduction to Perplexity
- The best way to measure the performance of a language model is end-to-end evaluation.
- End-to-end evaluation is expensive and time-consuming.
- Perplexity is the most common evaluation metric and provides a fast, efficient way to evaluate the performance of a language model.

Perplexity (1)
- Perplexity is computed from the probability of the test set.
- It is normalized by the number of words.
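For reference, the standard textbook definition that matches these two labels can be written as follows. The symbols are notation assumed for this sketch, not taken from the slide: $W = w_1 w_2 \ldots w_N$ is the test set and $N$ is its number of words.

```latex
\[
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
\]
```

A higher probability for the test set therefore gives a lower perplexity.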

Perplexity (2): Bigram example
- Uses the probability of W_{i-1} followed by W_i.
- Normalized by the total number of words.
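Under a bigram model, the standard form of the perplexity that these labels describe is sketched below. The notation is assumed for this sketch: $w_i$ is the $i$-th word of the test set, $N$ is the total number of words, and $P(w_i \mid w_{i-1})$ is the probability of $w_i$ given the preceding word.

```latex
\[
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
\]
```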

Project Objective
- Inverse application: use perplexity to identify the content of an article or paper.
- The lower the perplexity, the closer the content of the test corpus is to that of the training corpus.
- A test corpus from the same field as the training corpus will show relatively low perplexity compared with corpora from other fields.
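The idea can be illustrated with a minimal sketch: train one model per field, score an article against each, and pick the field with the lowest perplexity. Everything here is an assumption for illustration: the toy training texts, the function names, and the add-one smoothed bigram model, which stands in for the 5-gram models built with the CMU toolkit in the project.

```python
import math
from collections import Counter

def tokenize(text, vocab):
    """Lowercase, split on whitespace, and map unknown words to <unk>."""
    words = [w if w in vocab else "<unk>" for w in text.lower().split()]
    return ["<s>"] + words

def train_bigram(text, vocab):
    """Count bigram contexts and bigrams in one training text."""
    words = tokenize(text, vocab)
    return Counter(words[:-1]), Counter(zip(words, words[1:]))

def perplexity(model, text, vocab):
    """Bigram perplexity with add-one smoothing (an assumption of this sketch)."""
    contexts, bigrams = model
    words = tokenize(text, vocab)
    log_prob = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        log_prob += math.log(p)
    # Normalize by the number of predicted words (article must be non-empty).
    return math.exp(-log_prob / (len(words) - 1))

def identify(training_texts, article):
    """Return the field whose model gives the article the lowest perplexity."""
    vocab = {"<s>", "<unk>"}
    for text in training_texts.values():
        vocab.update(text.lower().split())
    scores = {
        field: perplexity(train_bigram(text, vocab), article, vocab)
        for field, text in training_texts.items()
    }
    return min(scores, key=scores.get), scores

# Toy training texts, invented for this example only.
TOY_CORPORA = {
    "business": "the stock market fell as bond prices rose and the stock index dropped",
    "history": "the roman empire fell after the long war and the empire was divided",
}

field, scores = identify(TOY_CORPORA, "the stock market dropped")
print(field, scores)
```

With real corpora the same selection rule applies; only the language model behind `perplexity` changes.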

Project Objective
- A specialized corpus needs to be built for each field.
- In this project, three specialized corpora are built: Business, History, and Computer Engineering.
- To test the approach, 12 articles (4 from each of the 3 fields) are chosen as test corpora.

Building Specialized Corpus using Bootcat toolkit
Steps:
1. Select seeds.
2. Generate n-tuples.
3. Retrieve URLs.
4. Fetch the corresponding pages and build the corpus.
5. Check the corpus content and remove unwanted information.

Building Specialized Corpus using Bootcat toolkit: select seed
- The seeds, or keywords, of each corpus are the main factor that directly affects how specialized the corpus is. The more specific the seeds, the more specialized the corpus can be.

Seeds of Business corpus:
Business, Finance, Credit, Loan, Stock, Dow, Nasdaq, Currency, Mutual Funds, ETFs, Bonds, Investing, Taxes, Rea Estate, Property, Wall Street, S&P500, DJIA, Gas price, DAX, Trade, Great Depression, Credit Card, Investment, Market

Building Specialized Corpus using Bootcat toolkit: tuples
- Tuples are generated randomly from the seeds.
- No word may be repeated within the same tuple.

Business tuples:
Dow Business
"Great Depression" Finance
Stock Business
Property S&P500
Property Dow
Nasdaq Taxes
Market DJIA
"Gas price" Bonds
ETFs Bonds
"Gas price" Taxes
"Gas price" Credit
Bonds ETFs
Dow ETFs
"Gas price" "Wall Street"
Loan Trade
Property "Wall Street"
Finance Credit
DJIA ETFs
"Rea Estate" Stock
Property ETFs
Stock DJIA
Bonds Business
Investing Nasdaq
"Credit Card" Loan
Finance "Wall Street"
Investing "Rea Estate"
Credit Market
Investing "Credit Card"
Property "Rea Estate"
Credit Loan
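The tuple rule above (random draws, no seed repeated within a tuple) can be sketched in a few lines. This is not the Bootcat implementation; the function name, the tuple size, and the short seed list are assumptions for illustration.

```python
import random

def generate_tuples(seeds, tuple_size, count, rng=None):
    """Draw `count` random tuples; a seed never repeats within one tuple."""
    rng = rng or random.Random()
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    # random.sample draws without replacement, so each tuple has distinct seeds.
    return [tuple(rng.sample(unique_seeds, tuple_size)) for _ in range(count)]

# A few of the Business seeds from the slide, used here only as sample input.
seeds = ["Business", "Finance", "Credit", "Loan", "Stock", "Dow", "Wall Street"]
for t in generate_tuples(seeds, tuple_size=2, count=5, rng=random.Random(0)):
    print(" ".join(f'"{s}"' if " " in s else s for s in t))
```

The same tuple may still be drawn twice across the list, since only repeats inside a tuple are ruled out.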

Building Specialized Corpus using Bootcat toolkit: collect information from Yahoo!
- Send the tuples to Yahoo! and collect the URLs from the search result pages.
- Remove repeated URLs.
- Retrieve the article from each URL.
- Manually remove unwanted information.
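The "remove repeated URLs" step amounts to order-preserving deduplication. A minimal sketch follows; the function name and the example.com URLs are placeholders, not taken from the project.

```python
def unique_urls(urls):
    """Remove repeated URLs while keeping the first occurrence of each."""
    return list(dict.fromkeys(urls))

# Placeholder URLs for illustration only.
collected = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
]
print(unique_urls(collected))
```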

Building Specialized Corpus using Bootcat toolkit
- CBusiness_Corpus_50k.txt: 50k-word Business corpus.
- CBusiness_Corpus_100k.txt: 100k-word Business corpus.
- CBusiness_Corpus_200k.txt: 200k-word Business corpus.
- CHistory_Corpus_50k.txt: 50k-word History corpus.
- CHistory_Corpus_100k.txt: 100k-word History corpus.
- CHistory_Corpus_200k.txt: 200k-word History corpus.
- CComputereng_Corpus_50k.txt: 50k-word Computer Engineering corpus.
- CComputereng_Corpus_100k.txt: 100k-word Computer Engineering corpus.
- CComputereng_Corpus_200k.txt: 200k-word Computer Engineering corpus.

Building language model using CMU LM toolkit
- Build a list of every word that occurs in the training corpus, along with its number of occurrences.
- Build a vocabulary file that contains the 20,000 most frequent words.
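These two steps can be mirrored in plain Python to show what they produce. This is a sketch of the idea, not the CMU toolkit's own implementation; the function names and the whitespace tokenization are assumptions.

```python
from collections import Counter

def word_frequencies(text):
    """Count how often each whitespace-separated word occurs."""
    return Counter(text.split())

def build_vocabulary(freqs, top_n=20000):
    """Keep the top_n most frequent words, returned in sorted order."""
    return sorted(word for word, _ in freqs.most_common(top_n))

freqs = word_frequencies("stock market stock bond market stock")
print(freqs)                       # word -> number of occurrences
print(build_vocabulary(freqs, 2))  # the two most frequent words
```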

Building language model using CMU LM toolkit
- Generate N-grams.
- In this project, 5-grams are used.
- Build the language model.

Business corpus, test messages B1 to B4:

                      B1      B1%     B2      B2%     B3      B3%     B4      B4%
Perplexity            704.9           846.77          526.39          589.85
No. of 5-gram hits    4       0.27    15      2.77    9       0.97    2       0.28
No. of 4-gram hits    37      2.48    11      2.03    20      2.16    17      2.4
No. of 3-gram hits    93      12.96   67      12.36   143     15.44   96      13.56
No. of 2-gram hits    648     43.52   209     38.56   402     43.41   309     43.6
No. of 1-gram hits    607     40.77   240     44.28   352     38.01   284     40.11
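As a small illustration of what "generating N-grams" means, the sketch below lists every run of n consecutive words in a token sequence. The helper name and sample sentence are assumptions; the toolkit's own n-gram files use a different, more compact representation.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the stock market fell sharply today".split()
for gram in ngrams(tokens, 5):
    print(gram)
```

A sequence shorter than n yields no n-grams, which is one reason lower-order counts (4-grams down to 1-grams) are also needed when scoring test text.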

Building language model using CMU LM toolkit
- Calculate the perplexity of each test article against each training corpus.
- A better model assigns a higher probability to the test data, which lowers the perplexity.
- For each field, average the perplexity over the 3 corpora of that field, which differ only in size.
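The averaging step can be sketched as follows. The function name is hypothetical, and the three perplexity values are made-up placeholders, not results from the project.

```python
from statistics import mean

def average_perplexity(per_size):
    """Average one article's perplexity over the corpora of a single field."""
    return mean(per_size.values())

# Placeholder perplexities for one article against the 50k, 100k and 200k
# corpora of one field (invented numbers, for illustration only).
example = {"50k": 700.0, "100k": 650.0, "200k": 600.0}
print(average_perplexity(example))
```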

Project Result

Test article type: Business
Test corpus                    B1          B2           B3           B4
Avg. PP of Business corpus     698.61      104.25       698.61       104.25
Avg. PP of History corpus      959.91      977.98       959.91       977.98
Avg. PP of Computer corpus     803.62      393.19       803.62       393.19
Identified type                Business    Business     Business     Business

Test article type: History
Test corpus                    H1          H2           H3           H4
Avg. PP of Business corpus     579.403     843.503      668.057      576.917
Avg. PP of History corpus      520.373     663.913      714.975      827.95
Avg. PP of Computer corpus     628.535     865.558      734.805      686.505
Identified type                History     History      Business     Business

Test article type: Computereng
Test corpus                    C1          C2           C3           C4
Avg. PP of Business corpus     1580.403    663.303      1017.26      777.63
Avg. PP of History corpus      1868.145    950.668      1070.14      1035.12
Avg. PP of Computer corpus     1598.153    473.123      859.79       779.373
Identified type                Business    Computereng  Computereng  Business
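The "identified type" row follows from picking the corpus with the lowest average perplexity for each article. The sketch below applies that rule to the four History test articles, using the values as read from the table above; the variable and function names are assumptions.

```python
# Average perplexities of the History test articles, as read from the table.
HISTORY_ARTICLES = {
    "H1": {"Business": 579.403, "History": 520.373, "Computer": 628.535},
    "H2": {"Business": 843.503, "History": 663.913, "Computer": 865.558},
    "H3": {"Business": 668.057, "History": 714.975, "Computer": 734.805},
    "H4": {"Business": 576.917, "History": 827.95, "Computer": 686.505},
}

def identify_type(avg_pp):
    """Pick the corpus type with the lowest average perplexity."""
    return min(avg_pp, key=avg_pp.get)

for article, avg_pp in HISTORY_ARTICLES.items():
    print(article, identify_type(avg_pp))
```

H1 and H2 come out as History, while H3 and H4 come out as Business because the Business corpus gives them the lowest perplexity.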

Project Result
- There are 12 test articles in total: 8 of them are identified correctly and 4 of them are identified incorrectly. Thus, the error rate is about 33%. Please refer to /perplexity.xls for the detailed experimental results.
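The error rate can be checked with a short calculation. The per-article labels below are as read from the result tables (lowest average perplexity per article) and should be treated as an assumption of this sketch.

```python
# True field of each test article: B1-B4, H1-H4, C1-C4.
true_types = ["Business"] * 4 + ["History"] * 4 + ["Computereng"] * 4

# Identified field of each article, in the same order.
identified = (
    ["Business"] * 4
    + ["History", "History", "Business", "Business"]
    + ["Business", "Computereng", "Computereng", "Business"]
)

errors = sum(t != p for t, p in zip(true_types, identified))
error_rate = errors / len(true_types)
print(f"{errors} of {len(true_types)} wrong, error rate {error_rate:.0%}")
```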

Possible ways to improve the result
- Remove the most common words from the vocabulary, because words such as "the", "and", and "it" are not related to any specialized field.
- Adjust the size of the training corpus. The best ratio between the training corpus and the test corpus is usually 1:10, which can serve as a target for dynamically changing the size of the training corpus.
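The first suggestion can be sketched as a filter over the vocabulary. The stop-word set here contains only the three words named above, and the function name is hypothetical; a real list would be longer.

```python
# Only the three example words from the slide; a real list would be longer.
STOP_WORDS = {"the", "and", "it"}

def remove_stop_words(vocabulary, stop_words=STOP_WORDS):
    """Drop common function words, comparing case-insensitively."""
    return [word for word in vocabulary if word.lower() not in stop_words]

print(remove_stop_words(["The", "stock", "and", "bond", "it", "market"]))
```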

Reference
- Bootcat toolkit: Simple Utilities for Bootstrapping Corpora and Terms from the Web, by Marco Baroni and Silvia Bernardini. http://sslmit.unibo.it/~baroni/bootcat.html
- CMU language toolkit: Carnegie Mellon University. http://www.speech.cs.cmu.edu/

Questions?
