Learning a Monolingual Language Model from a Multilingual Text Database Rayid Ghani & Rosie Jones School of Computer Science Carnegie Mellon University.

Slides:



Advertisements
Similar presentations
The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
Advertisements

Prioritizing User-session-based Test Cases for Web Applications Testing Sreedevi Sampath, Renne C. Bryce, Gokulanand Viswanath, Vani Kandimalla, A.Gunes.
Searchable Web sites Recommendation Date : 2012/2/20 Source : WSDM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh Jia-ling 1.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Tour the World: building a web-scale landmark recognition engine ICCV 2009 Yan-Tao Zheng1, Ming Zhao2, Yang Song2, Hartwig Adam2 Ulrich Buddemeier2, Alessandro.
Personalizing Search via Automated Analysis of Interests and Activities Jaime Teevan Susan T.Dumains Eric Horvitz MIT,CSAILMicrosoft Researcher Microsoft.
IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.
Constructing and Evaluating Web Corpora: ukWaC Adriano Ferraresi University of Bologna Aston University Postgraduate Conference.
Carnegie Mellon 1 Maximum Likelihood Estimation for Information Thresholding Yi Zhang & Jamie Callan Carnegie Mellon University
A Markov Random Field Model for Term Dependencies Donald Metzler and W. Bruce Croft University of Massachusetts, Amherst Center for Intelligent Information.
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.
ECOC for Text Classification Hybrids of EM & Co-Training (with Kamal Nigam) Learning to build a monolingual corpus from the web (with Rosie Jones) Effect.
The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
Recall: Query Reformulation Approaches 1. Relevance feedback based vector model (Rocchio …) probabilistic model (Robertson & Sparck Jones, Croft…) 2. Cluster.
Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University.
1 LM Approaches to Filtering Richard Schwartz, BBN LM/IR ARDA 2002 September 11-12, 2002 UMASS.
ACL 2005 WORKSHOP ON BUILDING AND USING PARALLEL TEXTS (WPT-05), Ann Arbor, MI. June Competitive Grouping in Integrated Segmentation and Alignment.
Dg.o conference 2006 Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie.
Carnegie Mellon Exact Maximum Likelihood Estimation for Word Mixtures Yi Zhang & Jamie Callan Carnegie Mellon University Wei Xu.
Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web Rayid Ghani Accenture Technology Labs, USA Rosie Jones.
University of Kansas Department of Electrical Engineering and Computer Science Dr. Susan Gauch April 2005 I T T C Dr. Susan Gauch Personalized Search Based.
Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
A Fast Regular Expression Indexing Engine Junghoo “John” Cho (UCLA) Sridhar Rajagopalan (IBM)
Distributed Information Retrieval Jamie Callan Carnegie Mellon University
Federated Search of Text Search Engines in Uncooperative Environments Luo Si Language Technology Institute School of Computer Science Carnegie Mellon University.
Language Identification in Web Pages Bruno Martins, Mário J. Silva Faculdade de Ciências da Universidade Lisboa ACM SAC 2005 DOCUMENT ENGENEERING TRACK.
Donghui Xu Spring 2011, COMS E6125 Prof. Gail Kaiser.
Multi-Style Language Model for Web Scale Information Retrieval Kuansan Wang, Xiaolong Li and Jianfeng Gao SIGIR 2010 Min-Hsuan Lai Department of Computer.
Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.
Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.
Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.
CLEF 2005: Multilingual Retrieval by Combining Multiple Multilingual Ranked Lists Luo Si & Jamie Callan Language Technology Institute School of Computer.
Improved search for Socially Annotated Data Authors: Nikos Sarkas, Gautam Das, Nick Koudas Presented by: Amanda Cohen Mostafavi.
Efficient Minimal Perfect Hash Language Models David Guthrie, Mark Hepple, Wei Liu University of Sheffield.
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
A Language Independent Method for Question Classification COLING 2004.
YZUCSE SYSLAB A Study of Web Search Engine Bias and its Assessment Ing-Xiang Chen and Cheng-Zen Yang Dept. of Computer Science and Engineering Yuan Ze.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
1 Web-Page Summarization Using Clickthrough Data* JianTao Sun, Yuchang Lu Dept. of Computer Science TsingHua University Beijing , China Dou Shen,
1 Thi Nhu Truong, ChengXiang Zhai Paul Ogilvie, Bill Jerome John Lafferty, Jamie Callan Carnegie Mellon University David Fisher, Fangfang Feng Victor Lavrenko.
Information Retrieval Lecture 6 Introduction to Information Retrieval (Manning et al. 2007) Chapter 16 For the MSc Computer Science Programme Dell Zhang.
Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China.
Institute of Computing Technology, Chinese Academy of Sciences 1 A Unified Framework of Recommending Diverse and Relevant Queries Speaker: Xiaofei Zhu.
Clarity Cross-Lingual Document Retrieval, Categorisation and Navigation Based on Distributed Services
Extracting Keyphrases from Books using Language Modeling Approaches Rohini U AOL India R&D, Bangalore India Bangalore
Carnegie Mellon Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan,
1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,
Search and Decoding Final Project Identify Type of Articles Using Property of Perplexity By Chih-Ti Shih Advisor: Dr. V. Kepuska.
Latent Topic Modeling of Word Vicinity Information for Speech Recognition Kuan-Yu Chen, Hsuan-Sheng Chiu, Berlin Chen ICASSP 2010 Hao-Chin Chang Department.
Comparing Word Relatedness Measures Based on Google n-grams Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ Faculty of Computer Science Dalhousie University,
Copyright © 2013 by Educational Testing Service. All rights reserved. Evaluating Unsupervised Language Model Adaption Methods for Speaking Assessment ShaSha.
Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.
Designing Factorial Experiments with Binary Response Tel-Aviv University Faculty of Exact Sciences Department of Statistics and Operations Research Hovav.
Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
Why Decision Engine Bing Demos Search Interaction model Data-driven Research Problems Q & A.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
Intelligent Database Systems Lab Presenter : YU-TING LU Authors : Hsin-Chang Yang, Han-Wei Hsiao, Chung-Hong Lee IPM Multilingual document mining.
© NCSR, Frascati, July 18-19, 2002 CROSSMARC big picture Domain-specific Web sites Domain-specific Spidering Domain Ontology XHTML pages WEB Focused Crawling.
Data mining in web applications
Yiming Yang1,2, Abhay Harpale1 and Subramanian Ganaphathy1
Panos Ipeirotis Luis Gravano
Presentation transcript:

Learning a Monolingual Language Model from a Multilingual Text Database Rayid Ghani & Rosie Jones School of Computer Science Carnegie Mellon University CIKM 2000

Sample Docs

What is a LM?  Probability distribution over terms in the language  Most common LMs are n-gram models  Unigram Model: For a doc d that is composed of words w 1,w 2,…,w n :

How to create a LM?  Acquire a (representative) Corpus in the language  Apply standard language modeling methods (Chen & Goodman 1996) the … antelope The 390 Are 350 Is 330 … Antelope 1

What corpora are available?  Explicit, marked up corpora: Linguistic Data Consortium languages [Liebermann and Cieri 1998]  Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese Excite languages Google languages AltaVista languages Lycos languages

Motivating Question  Can we quickly get a LM of contemporary usage for some minority language e.g. Tagalog? Slovene? Persian? (Tagalog is spoken in the Phillipines by 39 million people; Slovene is spoken in Slovenia by 2 million people)

Our Solution  Acquire a Corpus of the language  Apply standard language modeling methods  Acquire Construct a Corpus of the language efficiently with minimum human effort

How to construct our corpus?  Expensive Method Spider all of the web (~1.5 Billion Pages) and use a language filter  Less Expensive Method Selective Spidering

Task  Given: 1 Document in Target Language 1 Document in Other Language Access to a Web Search Engine  Create a Language Model of the Target Language quickly with no human effort

Experimental Data  498 documents in Tagalog 133 geocities.com 29 maxpages.com 29 bhepatitis.com 45 under.ph: 12.edu.ph, 16.com.ph, 14.net.ph, 4.gov.ph  16,000 distractors in English and Brazilian Portuguese  Common vocabulary “at”: Tagalog for “and” “the”, “homepage”: Internet English

Algorithm LM Builder Query Generator WWWSeed Docs Language Filter

Algorithm 1.Select one seed document from each of M and O. 2.Build language models for M' (LM M’ ) and O’ (LM O' ) from the seed documents. 3.Sample a document from the database. 4.Use the language filter to decide whether to add the new document to the list of documents in M, or those in O. 5.Update the language models for LM M’ and LM O'. 6.If the stopping criterion has not been reached, go back to Step 3

Methodology -- Adaptive Query-based Sampling  Query-based sampling [Callan et al. 99]  Adaptive Sampling -- used in oceanography [Curtin et al. 1993], environmental monitoring [Thompson and George 1995] Target class is rare Sampling is expensive Target examples clustered together

Language Identification  Simple filter – uses the current LMs  Does not require any more prior knowledge than the seed documents  Improves with time

Query Methods Query MethodSample Query Random Most-Frequent+sa Unigram+kanyang Most-Frequent-exclude-Most- Frequent +sa –the Unigram-exclude-Most-Frequent+kanyang –the Unigram-exclude-Unigram+kanyang -more

Evaluation  Average of three runs  True Unigram Model constructed from all 498 Tagalog documents  Take sampled collection and compare to True Unigram Model  Metrics Kullback Leibler divergence (relative entropy, distance between probability distributions) Percent vocabulary coverage Cumulative term frequency

Conclusions  Can automatically build a corpus and thus a LM for a minority language  Short queries successful  Mostly, single document adequate starting point

Future Work  Stopping Criteria  Synthetic Data  More sophisticated filtering  Other languages  Bigger Data sets (Web)  Incorporate hyper-link information  Use a Reinforcement Learning framework and learn the query parameters  Release code

 URL for this presentation  URL for the project