Search and Decoding Final Project
Identify Type of Articles Using Property of Perplexity
By Chih-Ti Shih
Advisor: Dr. V. Kepuska


2007/12/13 Chih-Ti Shih
Project Outline
• Project objective
• Building a specialized corpus using the BootCaT toolkit
• Building a language model using the CMU language toolkit
• Computing the perplexity of the corpora
• Project results

Project Objective – Introduction to Perplexity
• The best way to measure the performance of a language model is end-to-end evaluation.
• End-to-end evaluation is expensive and time-consuming.
• Perplexity is the most common evaluation metric and provides a fast, efficient way to evaluate the performance of a language model.

Perplexity - 1
• Perplexity is the inverse probability of the test set, normalized by the number of words.
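The two labels on this slide match the standard definition of perplexity, which can be written as below. The notation is an assumption and is not taken from the slide: W = w_1 w_2 ... w_N is the test set and N is its number of words.

```latex
PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
```

A higher test-set probability therefore gives a lower perplexity.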

Perplexity - 2
• Bigram example: the test-set probability is built from the probability of each word w_i following the previous word w_{i-1}, and the result is normalized by the total number of words.
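As a rough illustration of the bigram case, the sketch below computes perplexity in plain Python. It is not the CMU toolkit's implementation. The function name `bigram_perplexity` and the add-one smoothing are assumptions made here so that unseen word pairs do not get zero probability.

```python
import math
from collections import Counter


def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed bigram model."""
    if len(test_tokens) < 2:
        raise ValueError("need at least two test tokens to form a bigram")

    # Count how often each word appears as a context and each word pair occurs.
    contexts = Counter(train_tokens[:-1])
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab_size = len(set(train_tokens) | set(test_tokens))

    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = 0.0
    for prev, word in pairs:
        # P(word | prev) with add-one smoothing (an assumption of this sketch).
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)

    # Normalize by the number of predicted words, then invert.
    return math.exp(-log_prob / len(pairs))


train = "the stock market fell the stock price rose".split()
in_domain = "the stock market rose".split()
out_of_domain = "the king ruled the land".split()
print(bigram_perplexity(train, in_domain))
print(bigram_perplexity(train, out_of_domain))
```

The in-domain sentence shares bigrams with the training text, so it receives the lower perplexity of the two.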

Project Objective
• Inverse application: use perplexity to identify the content of an article or paper.
• The lower the perplexity, the closer the content of the test corpus is to that of the training corpus.
• A corpus from the same field will show relatively low perplexity compared with corpora from other fields.

Project Objective
• Specialized corpora from different fields need to be built.
• In this project, 3 specialized corpora are built: Business, History and Computer Engineering.
• To test the approach, 12 articles (4 from each of the 3 fields) are chosen as test corpora.

Building Specialized Corpus using BootCaT toolkit
Steps:
1. Select seeds.
2. Generate n-tuples.
3. Retrieve URLs.
4. Fetch the corresponding pages and build the corpus.
5. Check the corpus content and remove unwanted information.

Building Specialized Corpus using BootCaT toolkit: select seeds
• The seeds (keywords) of each corpus are the main factor that determines how specialized the corpus is. The more specific the seeds, the more specialized the corpus can be.
Seeds of Business corpus: Business, Finance, Credit, Loan, Stock, Dow, Nasdaq, Currency, Mutual Funds, ETFs, Bonds, Investing, Taxes, Rea Estate, Property, Wall Street, S&P500, DJIA, Gas price, DAX, Trade, Great Depression, Credit Card, Investment, Market

Building Specialized Corpus using BootCaT toolkit: tuples
• Tuples are generated randomly from the seeds.
• No word may repeat within the same tuple.
Business tuples:
Dow Business "Great Depression" Finance Stock Business Property S&P500 Property Dow Nasdaq Taxes Market DJIA "Gas price" Bonds ETFs Bonds "Gas price" Taxes "Gas price" Credit Bonds ETFs Dow ETFs "Gas price" "Wall Street" Loan Trade Property "Wall Street" Finance Credit DJIA ETFs "Rea Estate" Stock Property ETFs Stock DJIA Bonds Business Investing Nasdaq "Credit Card" Loan Finance "Wall Street" Investing "Rea Estate" Credit Market Investing "Credit Card" Property "Rea Estate" Credit Loan
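The tuple generation rule (random draws, no repeated word inside a tuple) can be sketched in a few lines of Python. This is an illustration only and not BootCaT's own script. The helper names `make_tuples` and `to_query`, the tuple size and the number of tuples are all assumptions.

```python
import random


def make_tuples(seeds, tuple_size, count, rng=None):
    """Draw `count` random tuples of `tuple_size` distinct seeds."""
    rng = rng or random.Random()
    # Drop duplicate seeds so that sampling without replacement
    # guarantees no word repeats inside a tuple.
    unique_seeds = list(dict.fromkeys(seeds))
    return [tuple(rng.sample(unique_seeds, tuple_size)) for _ in range(count)]


def to_query(tup):
    """Join a tuple into a search query, quoting multi-word seeds."""
    return " ".join(f'"{s}"' if " " in s else s for s in tup)


seeds = ["Business", "Finance", "Dow", "Gas price", "Wall Street", "Bonds"]
tuples = make_tuples(seeds, tuple_size=3, count=5, rng=random.Random(0))
for t in tuples:
    print(to_query(t))
```

Quoting multi-word seeds mirrors how entries such as "Gas price" appear in the tuple list above.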

Building Specialized Corpus using BootCaT toolkit: collect information from Yahoo!
• Send the tuples to Yahoo! as queries and collect the URLs from the search result pages.
• Remove duplicate URLs.
• Retrieve the articles from each URL.
• Manually remove unwanted information.
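The duplicate removal step can be sketched as follows. The function name `unique_urls` and the example URLs are hypothetical, and the search and download steps are left out.

```python
def unique_urls(urls):
    """Remove repeated URLs while keeping the order of first appearance."""
    cleaned = (u.strip() for u in urls)
    # dict.fromkeys keeps only the first occurrence of each key, in order.
    return list(dict.fromkeys(u for u in cleaned if u))


collected = [
    "http://a.example/page1",
    "http://b.example/page2",
    "http://a.example/page1 ",
    "",
]
print(unique_urls(collected))
```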

Building Specialized Corpus using BootCaT toolkit: resulting corpus files
• CBusiness_Corpus_50k.txt: 50k-word business corpus.
• CBusiness_Corpus_100k.txt: 100k-word business corpus.
• CBusiness_Corpus_200k.txt: 200k-word business corpus.
• CHistory_Corpus_50k.txt: 50k-word history corpus.
• CHistory_Corpus_100k.txt: 100k-word history corpus.
• CHistory_Corpus_200k.txt: 200k-word history corpus.
• CComputereng_Corpus_50k.txt: 50k-word Computer Engineering corpus.
• CComputereng_Corpus_100k.txt: 100k-word Computer Engineering corpus.
• CComputereng_Corpus_200k.txt: 200k-word Computer Engineering corpus.

Building language model using CMU LM toolkit
• Build a list of every word that occurs in the training corpus, along with its number of occurrences.
• Build a vocabulary file that contains the most frequent words.
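These two steps amount to a word count followed by a cutoff. The Python sketch below shows the idea; it is not the toolkit's code. The whitespace tokenization, the lowercasing and the names `word_frequencies` and `build_vocabulary` are assumptions.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each whitespace-separated word occurs."""
    return Counter(text.lower().split())


def build_vocabulary(freqs, top_n):
    """Keep the top_n most frequent words as the vocabulary."""
    return [word for word, _ in freqs.most_common(top_n)]


freqs = word_frequencies("the stock rose and the bond fell and the market closed")
print(freqs.most_common(3))
print(build_vocabulary(freqs, top_n=2))
```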

Building language model using CMU LM toolkit
• Generate the N-grams.
• In this project, 5-grams are used.
• Build the language model.
Table (cell values not preserved): Business corpus against test messages B1 to B4. Columns: B1, B1%, B2, B2%, B3, B3%, B4, B4%. Rows: perplexity; number of 5-gram hits; number of 4-gram hits; number of 3-gram hits; number of 2-gram hits; number of 1-gram hits.
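The N-gram generation step can be illustrated with a short Python sketch that collects every n-gram from order 1 up to 5. It only shows what is being counted and does not reproduce the toolkit's file formats. The function names `ngrams` and `ngram_counts` are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, max_n=5):
    """Count the n-grams of every order from 1 up to max_n."""
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}


tokens = "a b c d e f".split()
counts = ngram_counts(tokens)
print(ngrams(tokens, 5))
print({n: len(c) for n, c in counts.items()})
```

A text shorter than n words simply yields no n-grams of that order.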

Building language model using CMU LM toolkit
• Calculate the perplexity of each test article against each training corpus.
• The better model assigns a higher probability to the test data, which lowers the perplexity.
• For each field, average the perplexity over the 3 corpora of different sizes.
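The averaging and the final decision can be sketched as below. The function name `identify_field` is an assumption, and the perplexity values are made-up placeholders rather than the project's measurements.

```python
from statistics import mean


def identify_field(perplexities):
    """Pick the field whose corpora give the lowest average perplexity.

    perplexities maps a field name to the perplexities of one test
    article against that field's 50k, 100k and 200k corpora.
    """
    averages = {field: mean(values) for field, values in perplexities.items()}
    best = min(averages, key=averages.get)
    return best, averages


# Hypothetical numbers for a single test article.
example = {
    "Business": [310.0, 280.0, 250.0],
    "History": [620.0, 600.0, 580.0],
    "Computereng": [500.0, 470.0, 440.0],
}
field, averages = identify_field(example)
print(field, averages)
```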

Project Result
Three result tables, one per test article type (numeric values not preserved):
• Test corpus B1, B2, B3, B4. Test article type: Business.
• Test corpus H1, H2, H3, H4. Test article type: History.
• Test corpus C1, C2, C3, C4. Test article type: Computereng.
Rows in each table: Avg. PP of Business corpus; Avg. PP of History corpus; Avg. PP of Computer corpus; Identified type.
Surviving "Identified type" entries: Business (B table); History, Business (H table); Business, Computereng, Business (C table).

Project Result
• There are 12 test articles in total: 8 are correctly identified and 4 are not, so the error rate is about 33%.
• Please refer to /perplexity.xls for the detailed experiment results.
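The error rate follows directly from the 12 test articles and the 8 correct identifications:

```latex
\text{error rate} = \frac{12 - 8}{12} = \frac{4}{12} \approx 33\%
```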

Possible ways to improve the result
• Remove the most common words from the vocabulary, because words such as "the", "and" and "it" are not related to any specialized field.
• Adjust the size of the training corpus. Usually, the best ratio between the training corpus and the test corpus is 1:10. We can use it as a target to dynamically change the size of the training corpus.
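The first suggestion can be sketched as a simple filter on the vocabulary. The function name `remove_common_words` and the three-word stop list are assumptions for illustration; a real stop list would be longer.

```python
def remove_common_words(vocabulary, stop_words):
    """Drop field-independent words from a vocabulary list."""
    stop = {w.lower() for w in stop_words}
    return [w for w in vocabulary if w.lower() not in stop]


vocabulary = ["The", "stock", "and", "bond", "it", "market"]
print(remove_common_words(vocabulary, ["the", "and", "it"]))
```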

Reference
• BootCaT toolkit: Simple Utilities for Bootstrapping Corpora and Terms from the Web, by Marco Baroni and Silvia Bernardini.
• CMU language toolkit: Carnegie Mellon University.

Questions?