1 Jaime Carbonell and Raj Reddy Carnegie Mellon University January 12, 2006 Talk presented at International Conf on Data Mining, Nov 28, 2005 and MSR India.

1 Jaime Carbonell and Raj Reddy, Carnegie Mellon University, January 12, 2006. Talk presented at International Conf on Data Mining, Nov 28, 2005 and MSR India TechVista Symposium, Jan 12, 2006. The Million Book Digital Library Project: Research Issues in Data Mining and Text Mining

2 Digital Libraries and Universal Access to Information  Create a Universal Digital Library containing all the books ever published  Unfortunately many of the books are in English  Not readable by over 80% of the population

3 Information Overload  If we read a book every day  we can only read, at most, 40,000 books in a lifetime  Having millions of books online and accessible creates an information overload  “We have a wealth of information and scarcity of (human) attention!” (Herbert Simon)  Multilingual search technology can help to reduce the overload  permits users to search very large databases quickly and reliably  independent of language and location

4 Understanding Language  Books in non-native languages remain incomprehensible to most people  Translation and Summarization essential for world wide use  Current translation systems are not yet perfect  Significant improvements in language understanding systems in the past few decades  Systems based on statistical and linguistic techniques have shown significant performance improvements  improve performance using machine learning  Digitization projects will act as test bed  for validating Language Understanding Systems Research  e.g. The Million Book Digital Library Project

5 The Million Book Digital Library  Collaborative venture among many countries including USA, China and India  So far 400,000 books have been scanned in China and 200,000 in India  Content is made freely available around the globe  Those wishing to see the Video in the next slide should download from http://www.rr.cs.cmu.edu/MSRI.zip

The Grand Challenge Create Access to  All published works online  Instantly available  In any language  Anywhere in the world  Searchable, browsable, navigable  By humans and machines

The Challenge:

One Step at a Time…  Million Book DL  Only about 1% of all the world’s books  Harvard University: 12M  Library of Congress: 30M  OCLC catalog: 42M  All Multilingual Books: ~100M  At the rate of digitization of the last decade it would take 100 years!

Million Book Project: Issues  Time  At one page per second (20,000 pages per day shift), it will take 100 years (200 working days per year) to scan a million books of 400 pages each  Cost  100M books at US$100 per book would cost $10B  Even in India and China the cost will be $1B  The annual cost is currently expected to be close to $10M per year with support from US, India and China.  Selection  Selection of appropriate books for scanning is time consuming and expensive
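The time and cost figures on this slide follow from simple arithmetic. The sketch below reproduces them, assuming a single scanner running one day shift; the constant and function names are illustrative and not part of the talk.

```python
# Figures taken from the slide; names are illustrative.
PAGES_PER_DAY = 20_000          # about one page per second over a day shift
WORKING_DAYS_PER_YEAR = 200
BOOKS = 1_000_000
PAGES_PER_BOOK = 400


def years_to_scan(books: int, pages_per_book: int,
                  pages_per_day: int, days_per_year: int) -> float:
    """Years needed by one scanner working a single shift per day."""
    total_pages = books * pages_per_book
    pages_per_year = pages_per_day * days_per_year
    return total_pages / pages_per_year


def total_cost(books: int, cost_per_book: int) -> int:
    """Total cost in US dollars at a flat per-book price."""
    return books * cost_per_book


years = years_to_scan(BOOKS, PAGES_PER_BOOK, PAGES_PER_DAY, WORKING_DAYS_PER_YEAR)
cost_100m_books = total_cost(100_000_000, 100)

print(years)            # 100.0
print(cost_100m_books)  # 10000000000, i.e. $10B
```

Under the same assumptions, running 100 such scanners in parallel would bring the million-book target down to about one year.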

Million Book Project: Issues (cont)  Logistics  Each container holds 10,000 to 20,000 books. Shipping and handling costs about $10,000 per container  Meta Data  Accessing and/or creating Meta data requires professionals trained in Library science  Optical Character Recognition Technology  Essential for searching, translation and summarization  Many languages don’t have OCR

Million Book Project: Status  21 Centers in India  17 centers in China  1 Center in Egypt  Planned: Australia and Europe  About 600,000 books scanned  About 120,000+ accessible on the web from India  http://dli.iiit.ac.in/  Uses 8TB of storage  10 TB server at CMU Library planned for July 2005  1,000,000 books by the end of 2007  Capacity to scan a million pages a day expected to be operational by the end of 2006

Million Book Project: Policy Challenges  Compensating for Creative Works  5% out of copyright  92% out-of-print and in-copyright  3% in-print and in-copyright  Options  Tax Credit  Usage based Government funded compensation  Analogous to Public Lending Right in UK and Australia  Usage charges to the user  Compulsory Licensing  Digital Submissions to National Archives of all books that are “born-digital”

Million Book Project: Research Challenges  Providing Access to Billions everyday  Distributed Cached Servers in every country and region  Easy to use interfaces for Billions  Text Mining Challenges  Multilingual Information Retrieval  Summarization  Text Categorization  Named-Entity identification  Novelty Detection  Translation

27 What is Text Mining  Search documents, web, news  Categorize by topic, taxonomy  Enables filtering, routing, multi-text summaries, …  Extract names, relations, …  Summarize text, rules, trends, …  Detect redundancy, novelty, anomalies, …  Predict outcomes, behaviors, trends, … Who did what to whom and where?

28 Data Mining vs. Text Mining  Data: relational tables  DM universe: huge  DM tasks:  DB “cleanup”  Taxonomic classification  Supervised learning with predictive classifiers  Unsupervised learning: clustering, anomaly detection  Visualization of results  Text: HTML, free form  TM universe: 10^3 × DM  TM tasks:  All the DM tasks, plus:  Extraction of roles, relations and facts  Machine translation for multi-lingual sources  Parse NL-query (vs. SQL)  NL-generation of results

29 New Bill of Rights  Get the right information  To the right people  At the right time  On the right medium  In the right language  With the right level of detail

30 Relevant Text Mining Technologies  “…right information”: IR (search engines)  “…right people”: Classification, routing  “…right time”: Anticipatory analysis  “…right medium”: Info extraction, speech  “…right language”: Machine translation  “…right level of detail”: Summarization

31 “…right information” Information Retrieval

32 Beyond Pure Relevance in IR  Information Retrieval Maximizes Relevance to Query  What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?  Novelty is approximated by non-redundancy!  We really want to maximize relevance to the query, given the user profile and interaction history:  $P(U(f_i, \ldots, f_n) \mid Q \;\&\; \{C\} \;\&\; U \;\&\; H)$ where Q = query, {C} = collection set, U = user profile, H = interaction history  ...but we don’t yet know how. Darn.

33 Maximal Marginal Relevance vs Standard Information Retrieval [Figure: a query and a set of documents, comparing the documents selected by MMR IR with those selected by Standard IR]
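The contrast between MMR and standard retrieval can be sketched in code. The example below is a minimal illustration of the published Maximal Marginal Relevance criterion, which greedily picks the document maximizing lam * sim(d, query) - (1 - lam) * max sim(d, already selected). It assumes raw term-count vectors and cosine similarity for both terms; the helper names `bow`, `cosine`, and `mmr_rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mmr_rank(query: str, docs: list[str], lam: float = 0.5, k: int = 3) -> list[int]:
    """Greedy MMR: trade relevance to the query against redundancy
    with the documents already selected. Returns document indices."""
    q = bow(query)
    vecs = [bow(d) for d in docs]
    selected: list[int] = []
    candidates = set(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(vecs[i], q)
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected


docs = [
    "digital library books online",
    "digital library books online",      # exact duplicate of the first
    "library translation research",
]
print(mmr_rank("digital library", docs, lam=0.5, k=2))  # [0, 2]
print(mmr_rank("digital library", docs, lam=1.0, k=2))  # [0, 1]
```

With lam = 1.0 the ranking reduces to pure relevance and returns the duplicate; with lam = 0.5 the duplicate is penalized and the less similar but non-redundant document is chosen instead.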

34 “…right information” Novelty Detection

35 Detecting Novelty in Streaming Data  Find the first report of a new event  (Unconditional) Dissimilarity with Past  Decision threshold on most-similar story  (Linear) temporal decay  Length-filter (for teasers)  Cosine similarity with standard weights
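The ingredients listed on this slide can be combined into a small first-story detector. The following is a sketch only: the threshold, decay window, and minimum length are assumed values, raw term counts stand in for the "standard weights" (whose formula is not preserved in this transcript), and `detect_first_stories` is a hypothetical name.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def detect_first_stories(stories: list[str], threshold: float = 0.3,
                         window: int = 5, min_len: int = 3) -> list[int]:
    """Return indices of stories flagged as the first report of an event.

    A story is novel when its decayed similarity to the most similar
    past story stays below `threshold` (all parameters are assumptions).
    """
    past: list[tuple[int, Counter]] = []
    novel: list[int] = []
    for i, text in enumerate(stories):
        vec = Counter(text.lower().split())
        if sum(vec.values()) < min_len:      # length filter for teasers
            continue
        best = 0.0
        for j, old in past:
            decay = max(0.0, 1.0 - (i - j) / window)   # linear temporal decay
            best = max(best, decay * cosine(vec, old))
        if best < threshold:                 # decision on most-similar story
            novel.append(i)
        past.append((i, vec))
    return novel


stream = [
    "plane crash near new york",
    "investigators examine plane crash near new york",
    "breaking",
    "earthquake strikes northern japan",
]
print(detect_first_stories(stream))  # [0, 3]
```

The second story is a follow-up to the first and is suppressed, the one-word teaser is filtered out by length, and the earthquake report is flagged as a new event.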

36 New First Story Detection Directions  Topic-conditional models  e.g. “airplane,” “investigation,” “FAA,” “FBI,” “casualties” → topic, not event  “TWA 800,” “March 12, 1997” → event  First categorize into topic, then use maximally-discriminative terms within topic  Rely on situated named entities  e.g. “Arcan as victim,” “Sharon as peacemaker”

37 Link Detection in Texts  Find texts (e.g. news stories) that mention the same underlying events.  Could be combined with novelty (e.g. something new about an interesting event).  Techniques: text similarity, NE’s, situated NE’s, relations, topic-conditioned models, …

38 “…right people” Text Categorization

39 Text Categorization  Assign labels to each document or web-page  Labels may be topics such as Yahoo-categories  finance, sports, News → World → Asia → Business  Labels may be genres  editorials, movie-reviews, news  Labels may be routing codes  send to marketing, send to customer service

40 Text Categorization Methods  Manual assignment  as in Yahoo  Hand-coded rules  as in Reuters  Machine Learning (dominant paradigm)  Words in text become predictors  Category labels become “to be predicted”  Predictor-feature reduction (SVD, χ², …)  Apply any inductive method: kNN, NB, DT, …
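As a concrete instance of the machine-learning route, here is a minimal k-nearest-neighbour (kNN) text categorizer in which words act as predictors and category labels are predicted by majority vote. It is a sketch under simplifying assumptions: raw term counts, cosine similarity, no feature reduction, and a tiny invented training set; the function names are hypothetical.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Words in the text become the predictor features."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(text: str, training: list[tuple[str, str]], k: int = 3) -> str:
    """Label `text` by majority vote among its k most similar training docs."""
    vec = bow(text)
    scored = sorted(((cosine(vec, bow(doc)), label) for doc, label in training),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Invented toy training data, for illustration only.
training = [
    ("stocks fell as interest rates rose", "finance"),
    ("bank profits and stock markets", "finance"),
    ("the team won the football match", "sports"),
    ("player scores goal in final match", "sports"),
]

print(knn_classify("stock markets rally as rates fall", training))  # finance
print(knn_classify("football team scores goal", training))          # sports
```

The same skeleton works for routing codes or genres: only the labels in the training pairs change.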

41 Multi-tier Event Classification

42 “…right medium” Named-Entity identification

43 Named-Entity identification  Purpose: to answer questions such as:  Who is mentioned in these 100 Society articles?  What locations are listed in these 2000 web pages?  What companies are mentioned in these patent applications?  What products were evaluated by Consumer Reports this year?

44 Named Entity Identification  President Clinton decided to send special trade envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from China, and Mr. Hideto Suzuki from Japan’s Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency stabilization.

45 Methods for NE Extraction  Finite-State Transducers w/variables  Example output: FNAME: “Bill” LNAME: “Clinton” TITLE: “President”  FSTs Learned from labeled data  Statistical learning (also from labeled data)  Hidden Markov Models (HMMs)  Exponential (maximum-entropy) models  Conditional Random Fields [Lafferty et al]

46 Named Entity Identification: Extracted Named Entities (NEs)
People               Places
President Clinton    Singapore
Mickey Kantor        Japan
Ms. Xuemei Peng      China
Mr. Hideto Suzuki    Australia
Mr. Langford
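A hand-written finite-state pattern with variables, in the spirit of the TITLE / FNAME / LNAME output shown earlier, can be approximated with a regular expression. The sketch below is an assumption-laden illustration rather than the method used in the talk: the pattern covers only names introduced by a small, hard-coded set of titles, so untitled names such as Mickey Kantor are missed, which is one reason learned models are preferred.

```python
import re

# Hypothetical pattern: TITLE, optional FNAME, then LNAME.
PERSON = re.compile(
    r"\b(?P<title>President|Ms\.|Mr\.)\s+"
    r"(?:(?P<fname>[A-Z][a-z]+)\s+)?"
    r"(?P<lname>[A-Z][a-z]+)"
)


def extract_people(text: str) -> list[dict]:
    """Return one record per titled person mention found in `text`."""
    return [
        {"TITLE": m.group("title"), "FNAME": m.group("fname"), "LNAME": m.group("lname")}
        for m in PERSON.finditer(text)
    ]


sample = (
    "President Clinton decided to send special trade envoy Mickey Kantor "
    "to the special Asian economic meeting in Singapore this week. "
    "Ms. Xuemei Peng, trade minister from China, and Mr. Hideto Suzuki "
    "from Japan's Ministry of Trade and Industry will also attend. "
    "The Australian representative, Mr. Langford, will not attend."
)

for person in extract_people(sample):
    print(person)
```

On this sample the pattern finds President Clinton, Ms. Xuemei Peng, Mr. Hideto Suzuki, and Mr. Langford, with FNAME left as None when only a last name follows the title.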

47 Role Situated NE’s  Motivation: It is useful to know roles of NE’s:  Who participated in the economic meeting?  Who hosted the economic meeting?  Who was discussed in the economic meeting?  Who was absent from the economic meeting?

48 Emerging Methods for Extracting Relations  Link Parsers at Clause Level  Based on dependency grammars  Probabilistic enhancements [Lafferty, Venable]  Island-Driven Parsers  GLR* [Lavie], Chart [Nyberg, Placeway], LC-Flex [Rosé]  Tree-bank-trained probabilistic CF parsers [IBM, Collins]  Herald the return of deep(er) NLP techniques.  Relevant to new Q/A from free-text initiative.  Too complex for inductive learning (today).

49 Relational NE Extraction  Example: (Who does What to Whom) "John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over Friday’s closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators"

50 Fact Extraction Application  Useful for relational DB filling, to prepare data for “standard” DM/machine-learning methods
Acquirer   Acquiree      Sh.price   Year
__________________________________________
Flexicon   Logi-truck    18         1999
Flexicon   Supplyhouse   30         2001
buy.com    reel.com      10         2000
...        ...           ...        ...
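To show how a free-text report could become one row of such a table, the sketch below pulls acquirer, acquiree, share price, and year out of the Flexicon example with two regular expressions. The patterns are hypothetical and tailored to that one sentence shape; a practical extractor would rely on the parsing or learned methods listed in the talk.

```python
import re

# Hypothetical patterns written for the "tender offer" example only.
OFFER = re.compile(
    r"(?P<acquirer>[A-Z][A-Za-z-]+)(?: (?:Inc|Ltd)\.)? announced a tender offer for "
    r"(?P<acquiree>[A-Z][A-Za-z-]+)(?: (?:Inc|Ltd)\.)? for \$(?P<price>\d+) per share"
)
YEAR = re.compile(r"by Q[1-4] (?P<year>\d{4})")


def extract_acquisition(text: str):
    """Return a dict with one table row, or None if no offer is found."""
    offer = OFFER.search(text)
    if offer is None:
        return None
    year = YEAR.search(text)
    return {
        "acquirer": offer.group("acquirer"),
        "acquiree": offer.group("acquiree"),
        "share_price": int(offer.group("price")),
        "year": int(year.group("year")) if year else None,
    }


report = (
    "John Snell reporting for Wall Street. Today Flexicon Inc. announced a "
    "tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% "
    "premium over Friday's closing price. Flexicon expects to acquire "
    "Supplyhouse by Q4 2001 without problems from federal regulators"
)

row = extract_acquisition(report)
print(row)
# {'acquirer': 'Flexicon', 'acquiree': 'Supplyhouse', 'share_price': 30, 'year': 2001}
```

Rows in this form can be loaded into a relational table and then handled by standard data-mining methods.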

51 “…right language” Translation

52 “…in the Right Language”  Knowledge-Engineered MT  Transfer rule MT (commercial systems)  High-Accuracy Interlingual MT (domain focused)  Parallel Corpus-Trainable MT  Statistical MT (noisy channel, exponential models)  Example-Based MT (generalized G-EBMT)  Transfer-rule learning MT (corpus & informants)  Multi-Engine MT  Omnivorous approach: combines the above to maximize coverage & minimize errors

53 Types of Machine Translation [Figure labels: Source (Arabic), Target (English); Syntactic Parsing, Semantic Analysis, Sentence Planning, Text Generation; Direct: EBMT, Transfer Rules, Interlingua]

54 EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.

55 Multi-Engine Machine Translation  MT Systems have different strengths  Rapidly adaptable: Statistical, example-based  Good grammar: Rule-Based (linguistic) MT  High precision in narrow domains: KBMT  Minority Language MT: Learnable from informant  Combine results of parallel-invoked MT  Select best of multiple translations  Selection based on optimizing combination of:  Target language joint-exponential model  Confidence scores of individual MT engines

56 Illustration of Multi-Engine MT
Source segment          Translation 1           Translation 2              Translation 3
El punto de descarge    The drop-off point      The discharge point        Unload of the point
se cumplirá en          will comply with        will self comply in        will take place at
el puente Agua Fria     The cold Bridgewater    the “Agua Fria” bridge     the cold water of bridge

57 State of the Art in MEMT for New “Hot” Languages  We can do now:  Gisting MT for any new language in 2-3 weeks (given parallel text)  Medium quality MT in 6 months (given more parallel text, informant, bi-lingual dictionary)  Improve-as-you-go MT  Field MT system in PCs  We cannot do yet:  High-accuracy MT for open domains  Cope with spoken-only languages  Reliable speech-speech MT (but BABYLON is coming)  MT on your wristwatch

58 “…right level of detail” Summarization

59 Document Summarization: Types of Summaries
Task                                              Query-relevant (focused)                 Query-free (generic)
INDICATIVE, for filtering (Do I read further?)    Filter search engine results             Short abstracts
CONTENTFUL, for reading in lieu of full doc       Solve problems for busy professionals    Executive summaries

60 Conclusion
