Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.


1 Mining the Web to Create Minority Language Corpora. Rayid Ghani (Accenture Technology Labs - Research), Rosie Jones (Carnegie Mellon University), Dunja Mladenic (J. Stefan Institute, Slovenia)

2 Who Needs a Language Specific Corpus? Language technology applications: language modeling, speech recognition, machine translation. Linguistic and socio-linguistic studies. Multilingual retrieval.

3 What Corpora are Available? Explicit, marked up corpora: Linguistic Data Consortium, 20 languages [Liberman and Cieri 1998]. Search engines (implicit language-specific corpora; European languages, Chinese and Japanese): Excite, 12 languages; Google, 25 languages; AltaVista, 25 languages; Lycos, 25 languages.

4 BUT what about Slovenian? Or Tagalog? Or Tatar? You’re just out of luck!

5 The Human Solution: Start from Yahoo->Slovenia… Crawl www.*.si. Search on the web, look at documents, modify query, analyze documents, modify query,… Repetitive, time-consuming, and requires reasonable familiarity with the language.

6 Task Given: 1 document in the target language, 1 other document (negative example), and access to a Web search engine. Create a corpus of the target language quickly with no human effort.

7 Algorithm [Diagram; components: Seed Docs, Query Generator, WWW, Language Filter]

8 [Diagram; components: Initial Docs, Word Statistics, Build Query, Web, Filter, Relevant, Non-Relevant, Learning]

9 Query Generation: Examine the current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones. A query consists of m inclusion terms and n exclusion terms, e.g. +intelligence +web -military
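As a minimal sketch of the query format on this slide, the helper below joins inclusion and exclusion terms into a `+term -term` string. The function name `build_query` is hypothetical, and the exact operator syntax a given search engine accepts is an assumption.

```python
def build_query(inclusion_terms, exclusion_terms):
    """Format terms in the +include / -exclude style shown on the slide."""
    included = [f"+{term}" for term in inclusion_terms]
    excluded = [f"-{term}" for term in exclusion_terms]
    return " ".join(included + excluded)


# m = 2 inclusion terms, n = 1 exclusion term
print(build_query(["intelligence", "web"], ["military"]))
```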

10 Query Term Selection Methods: Uniform (UN): select k words randomly from the current vocabulary. Term-Frequency (TF): select the top k words ranked according to their frequency. Probabilistic TF (PTF): select k words with probability proportional to their frequency.
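The three methods on this slide can be sketched roughly as follows. Here `counts` is assumed to map each vocabulary word to its frequency. The function names, the alphabetical tie-breaking, and sampling without replacement in PTF are illustrative assumptions, since the slides do not specify them.

```python
import random
from collections import Counter


def select_uniform(counts, k, rng):
    """UN: pick k distinct words uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))


def select_tf(counts, k):
    """TF: pick the k most frequent words (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def select_ptf(counts, k, rng):
    """PTF: draw k distinct words, each draw weighted by word frequency."""
    remaining = dict(counts)
    chosen = []
    while remaining and len(chosen) < k:
        words = list(remaining)
        weights = [remaining[w] for w in words]
        pick = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # assumption: no word is selected twice
    return chosen


counts = Counter("je in je web in je".split())
rng = random.Random(0)  # seeded only to make the sketch repeatable
print(select_tf(counts, 2))
print(select_uniform(counts, 2, rng))
print(select_ptf(counts, 2, rng))
```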

11 Query Term Selection Methods: RTFIDF: select the top k words according to their RTFIDF scores. Odds-Ratio (OR): select the top k words according to their Odds-Ratio scores. Probabilistic OR (POR): select k words with probability proportional to their Odds-Ratio scores.
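The slides do not give the scoring formula. One common form of the odds-ratio score is log[P(w|rel)(1 - P(w|non)) / ((1 - P(w|rel))P(w|non))], and the sketch below uses it as an assumption. Estimating the probabilities from smoothed document frequencies (the `alpha` parameter) is also an assumption, made so that no probability is exactly 0 or 1. Each document is represented as a set of words.

```python
import math


def odds_ratio(word, relevant_docs, nonrelevant_docs, alpha=0.5):
    """Log odds ratio of `word` occurring in relevant vs. non-relevant docs."""
    def p(docs):
        containing = sum(1 for doc in docs if word in doc)
        # Smoothed estimate keeps the probability strictly inside (0, 1).
        return (containing + alpha) / (len(docs) + 2 * alpha)

    p_rel = p(relevant_docs)
    p_non = p(nonrelevant_docs)
    return math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))


def rank_by_odds_ratio(relevant_docs, nonrelevant_docs):
    """Return the vocabulary sorted from highest to lowest odds-ratio score."""
    vocab = set().union(*relevant_docs, *nonrelevant_docs)
    return sorted(
        vocab,
        key=lambda w: (-odds_ratio(w, relevant_docs, nonrelevant_docs), w),
    )


relevant = [{"je", "in", "web"}, {"je", "za"}]
nonrelevant = [{"the", "web"}, {"the", "and"}]
ranked = rank_by_odds_ratio(relevant, nonrelevant)
print(ranked[:3])   # highest-scoring words: candidates for inclusion terms
print(ranked[-3:])  # lowest-scoring words: possible exclusion terms
```

Treating the lowest-scoring words as exclusion terms is one plausible reading; the slides do not say how exclusion terms are chosen.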

12 3. Language Filter: TextCat (van Noord’s implementation), trained on a handful of documents. Manually evaluated by sampling 100 Slovenian documents and found to be 99% accurate. Contains models for 60 languages.
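TextCat is based on character n-gram frequency profiles compared with an "out-of-place" rank distance (Cavnar and Trenkle's method). The sketch below is a simplified illustration of that idea, not TextCat itself: the function names, the n-gram range (1 to 3), the profile size, and the tiny training samples are all assumptions chosen for brevity.

```python
from collections import Counter


def ngram_profile(text, max_n=3, size=300):
    """Return the most frequent character n-grams, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:size]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Tiny illustrative training samples; a real filter would use more text.
SL_SAMPLE = "to je kratek primer besedila v slovenskem jeziku in nekaj besed"
EN_SAMPLE = "this is a short example of text in the english language and some words"
profiles = {"sl": ngram_profile(SL_SAMPLE), "en": ngram_profile(EN_SAMPLE)}

# In the corpus-building loop, a page counts as relevant when the predicted
# language equals the target language.
is_relevant = classify("primer besedila v slovenskem jeziku", profiles) == "sl"
print(is_relevant)
```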

13 Evaluation Goal: Collect as many relevant documents as possible while minimizing the cost. Cost: number of total documents retrieved from the Web; number of distinct queries issued to the search engine. Evaluation measures: percentage of retrieved documents that are relevant; number of relevant documents retrieved per unique query.
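The two evaluation measures can be computed from a simple retrieval log. In this sketch the log format, a list of `(query, is_relevant)` pairs with one entry per retrieved document, and the name `evaluate` are assumptions made for illustration.

```python
def evaluate(log):
    """Return (percent relevant, relevant docs per unique query).

    `log` holds one (query, is_relevant) pair per retrieved document.
    """
    total = len(log)
    relevant = sum(1 for _, is_relevant in log if is_relevant)
    unique_queries = len({query for query, _ in log})
    percent_relevant = 100.0 * relevant / total if total else 0.0
    per_query = relevant / unique_queries if unique_queries else 0.0
    return percent_relevant, per_query


log = [("+je", True), ("+je", False), ("+in", True), ("+in", True)]
print(evaluate(log))  # (75.0, 1.5)
```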

14 Experimental Setup: Language: Slovenian. Initial documents: 1 web page in Slovenian, 1 in English. Search engine: AltaVista.

15 Results

16 Results – Precision at 3000 Percentage of Target Docs after 3000 Docs Retrieved

17 Results – Docs Per Query

18 Results - Summary: In terms of documents: for lengths 1-3, Odds-Ratio works best. In terms of queries: Odds-Ratio is consistently better than the others. Long queries are usually very precise but do not result in a lot of documents (low recall).

19 Results - Num of Docs Retrieved

20

21 Results – Num of Queries

22

23 Further Experiments: Comparison to AltaVista’s “More Like This”: better performance than AltaVista’s feature. Keywords: similar results when initializing with keywords instead of documents. Other languages: similar results with Croatian, Czech and Tagalog.

24 Comparison with AltaVista’s “More Like This” Feature: Our query-generation mechanism, using 5 inclusion terms and 5 exclusion terms with Odds-Ratio scoring, outperforms AltaVista’s “Find Similar Pages” feature.

25 Effect of Initial Documents: No visible differences until over 1,000 documents have been retrieved.
# | Type | Length | Vocab Size
1 | Formal | 568 | 354
2 | News | 716 | 446
3 | Informal | 90 | 74

26 Initializing with Keywords: Obtaining entire documents in a language may not always be possible. Used six different sets of 10 keywords obtained from Slovenian speakers as seed documents, with Odds-Ratio term selection (OR-3). Each keyword set resulted in performance comparable to using entire documents.

27 Other Languages: Tried the same approach with Croatian, Tagalog and Czech. Odds-Ratio outperformed the other term-selection methods for all these languages.

28 Other Languages
Method | Language | Target Docs at 1000 Total Docs
TF-3 | Slovenian | 178
PTF-3 | Slovenian | 646
OR-3 | Slovenian | 835
TF-3 | Croatian | 39
PTF-3 | Croatian | 410
OR-3 | Croatian | 677
TF-3 | Czech | 385
PTF-3 | Czech | 451
OR-3 | Czech | 743
TF-3 | Tagalog | 440
PTF-3 | Tagalog | 359
OR-3 | Tagalog | 664

29 Other Languages – Sample Queries Typical queries that most commonly found positive documents in our experiments

30 Conclusions: Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines. Not sensitive to initial “seed” documents. System and corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder

31 Conclusions: Automatic query generation is a cheap way to collect language corpora. Odds-Ratio term selection works well. Mostly independent of “seed” documents. Can be “seeded” with a handful of keywords in a language.

32 Ideas for Future Work: Explore other term-selection methods. Move from a language-specific corpus to a topic-specific corpus, as an alternative to focused spidering. Find documents matching a user profile (Personal Agent).

33 Fixed Query Parameters: Fix query lengths and vary term-selection methods. Fix term-selection methods and vary query lengths. Results (Ghani et al., SIGIR 2001): Odds-Ratio works well overall; long queries are precise but with low recall.

34 Algorithm
1. Initialization
2. Generate query terms from relevant and non-relevant documents
3. Retrieve a document using the query from Step 2
4. Use the language filter to add the new document to the relevant or non-relevant set of documents
5. Update frequencies and scores
6. Return to Step 2
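The six steps above can be sketched as a loop. Everything concrete here is an assumption for illustration: the `search`, `is_target`, and `select_terms` callables stand in for a real search engine, the language filter, and a term-selection method, and the toy "web" is a four-item list.

```python
from collections import Counter


def build_corpus(seed_relevant, seed_nonrelevant, search, is_target,
                 select_terms, max_queries=10):
    # Step 1: initialize word statistics from the two seed documents.
    rel_counts = Counter(seed_relevant.split())
    non_counts = Counter(seed_nonrelevant.split())
    corpus, seen = [], set()
    for _ in range(max_queries):  # Step 6: return to Step 2
        # Step 2: generate inclusion and exclusion terms.
        include, exclude = select_terms(rel_counts, non_counts)
        # Step 3: retrieve one unseen document (None if nothing matches).
        doc = search(include, exclude, seen)
        if doc is None:
            continue
        seen.add(doc)
        # Steps 4 and 5: filter by language, then update the statistics.
        if is_target(doc):
            corpus.append(doc)
            rel_counts.update(doc.split())
        else:
            non_counts.update(doc.split())
    return corpus


# Toy stand-ins (hypothetical) for the search engine and term selection.
TOY_WEB = ["je dan lep", "je noc", "the day is nice", "je the mix"]


def toy_search(include, exclude, seen):
    for doc in TOY_WEB:
        words = set(doc.split())
        if doc not in seen and set(include) <= words and not (set(exclude) & words):
            return doc
    return None


def toy_select_terms(rel_counts, non_counts):
    def top(counts, other):
        candidates = [w for w in counts if w not in other]
        return sorted(candidates, key=lambda w: (-counts[w], w))[:1]
    return top(rel_counts, non_counts), top(non_counts, rel_counts)


corpus = build_corpus("je je dan", "the the day", toy_search,
                      lambda doc: "je" in doc.split(), toy_select_terms)
print(corpus)
```

With a deterministic selector like this toy one, the loop keeps issuing the same query once matching documents run out, which is one motivation for the probabilistic selection methods (PTF, POR).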

35 1. Initialize: Given documents in the target and non-target languages, calculate various statistics over the words in each set.
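The slide does not say which statistics are computed. As a hedged sketch, the helper below collects two plausible ones for each document set: term frequency (total occurrences) and document frequency (number of documents containing the word). The name `word_statistics` and whitespace tokenization are assumptions.

```python
from collections import Counter


def word_statistics(docs):
    """Return (term frequency, document frequency) counters for a doc set."""
    term_freq = Counter()
    doc_freq = Counter()
    for text in docs:
        words = text.lower().split()
        term_freq.update(words)      # every occurrence counts
        doc_freq.update(set(words))  # each document counts once per word
    return term_freq, doc_freq


target_tf, target_df = word_statistics(["je dan je", "je noc"])
other_tf, other_df = word_statistics(["the day is nice"])
print(target_tf.most_common(2), target_df["je"])
```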

