Text Mining -- Extraction. Web-Based Information Architectures, MSEC 20-760, Mini II. Jaime Carbonell.
General Topic: Text Extraction. Motivation: Text Mining; Context-Free Entity Extraction; Role-based Entity Extraction; Relational Extraction; eBusiness Applications
Text Mining (1) The Need to Process Text Automatically Text is meant to be read by humans, not programs. Most useful information is stored as text. (100 times as much online text as online DBs) HTML web pages are text (with structuring tags). Data Mining (covered later) operates on data tables (i.e. numbers, fixed fields, adherence to data models).
Text Mining (2) The Need to Process Text Automatically We need text => data table transducers. General Natural Language Understanding is still too hard. But can we solve simpler but useful sub-problems? Yes – categorization of text by topic and extraction of certain kinds of information from free text or HTML-structured text are both possible.
Text Mining (3) Components of Text Mining Categorization by topic or genre (introduced here; see Prof. Yang’s lecture) Fact extraction from text (topic of this class) Data Mining from DBs or extracted facts (later lecture on Data Mining)
Text Categorization (1) Definition Assign labels to each document or web-page Labels may be topics such as Yahoo-categories e.g. "finance," "sports," "news>world>asia>business" Labels may be genres e.g. "editorials" "movie-reviews" "news" Labels may be binary e.g. "interesting-to-me" "not-interesting-to-me"
Text Categorization (2) Methods Manual assignment (as in Yahoo) Hand-coded rule-based assignment (as in Reuters) (Usually: if the document contains a given boolean combination of words, then assign it a specified category.)
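A minimal Python sketch of the hand-coded, rule-based method, in which each category fires on a boolean combination of words. The rules and category names here are hypothetical illustrations, not the rules used at Reuters or Yahoo.

```python
import re

# Hypothetical rules: each category maps to a boolean test over the
# set of words that occur in the document.
RULES = {
    "finance": lambda w: ("stock" in w or "shares" in w) and "price" in w,
    "sports": lambda w: "game" in w and ("score" in w or "team" in w),
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def categorize(text):
    """Return every category whose boolean rule fires on the document."""
    words = tokenize(text)
    return [label for label, rule in RULES.items() if rule(words)]
```

For example, `categorize("The stock price rose sharply")` fires only the "finance" rule. A document may receive several labels or none, since the rules are independent of one another.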
Text Categorization (3) Methods Learning of document-label assignment function –Most new applications rely on machine learning –k-Nearest Neighbors (simple, powerful) See Prof. Yang’s lecture –Decision-tree induction (most common method) –Support-vector machines (newest method)
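As an illustration of learning a document-label assignment function, here is a small k-Nearest Neighbors sketch in Python. It assumes a bag-of-words representation with cosine similarity and a majority vote; the training documents are invented for the example, and the details of the method covered in Prof. Yang’s lecture may differ.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_label(text, training, k=3):
    """Label a document by majority vote among its k most similar neighbors."""
    query = vectorize(text)
    ranked = sorted(training,
                    key=lambda ex: cosine(query, vectorize(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented training set of (document, label) pairs.
TRAINING = [
    ("stocks fell as bond prices rose", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sports"),
    ("the striker scored two goals in the match", "sports"),
]
```

With this toy data, `knn_label("bond prices and interest rates", TRAINING)` returns "finance", because two of the three nearest training documents carry that label.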
Named Entity Identification I (1) Purpose To answer questions such as: Who is mentioned in these 100 Society articles? What locations are listed in these 2000 web pages? What companies are mentioned in these patent forms? What products were evaluated by Consumer Reports this year?
Named Entity Identification I (2) Example President Clinton decided to send special trade envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from China, and Mr. Hideto Suzuki from Japan’s Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency stabilization.
Named Entity Identification I (3) Extracted Named Entities (NEs)
PEOPLE              PLACES
__________________________________________
President Clinton   Singapore
Mickey Kantor       Japan
Ms. Xuemei Peng     China
Mr. Hideto Suzuki   Australia
Mr. Langford
Named Entity Identification II Finite-State Machines (1) Definition of Finite State Acceptor (FSA) An FSA is a directed graph With a "start" node With one or more "accepting" nodes
Named Entity Identification II Finite-State Machines (2) Definition of Finite State Acceptor (FSA) With link labels matching input items –exact-match link labels e.g. "China" matching only "China" –wildcard (?) match e.g. "?" matches "100" or "China" or... –feature-match e.g. CAP matches any capitalized word –list-membership match e.g. if HON-LIST := (Mr, Ms, Dr, President,...) it would match any of those words in the input
Named Entity Identification II Finite-State Machines (3) Definition of Finite State Acceptor (FSA) With an input source (e.g. string of words) Outputs "YES" or "NO"
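The FSA definition can be sketched in Python as follows. This is an illustrative sketch under assumptions: the class name, the numeric state names, and the sample acceptor (an honorific followed by one capitalized word) are hypothetical, and honorifics are matched after stripping a trailing period.

```python
HON_LIST = {"Mr", "Ms", "Dr", "President"}

# Link-label matchers: each takes one input token and returns True/False.
def exact(word):
    return lambda tok: tok == word           # exact-match link

def wildcard(tok):
    return True                              # "?" matches any token

def cap(tok):
    return tok[:1].isupper()                 # CAP: capitalized word

def in_list(words):
    return lambda tok: tok.rstrip(".") in words   # list-membership link

class FSA:
    def __init__(self, start, accepting, links):
        self.start = start
        self.accepting = set(accepting)
        self.links = links                   # list of (source, matcher, target)

    def accepts(self, tokens):
        """Follow all matching links; answer "YES" or "NO"."""
        states = {self.start}
        for tok in tokens:
            states = {dst for src, match, dst in self.links
                      if src in states and match(tok)}
            if not states:
                return "NO"
        return "YES" if states & self.accepting else "NO"

# Hypothetical acceptor: HON-LIST link from state 0 to 1, CAP link from 1 to 2.
NAME_FSA = FSA(start=0, accepting=[2],
               links=[(0, in_list(HON_LIST), 1), (1, cap, 2)])
```

Here `NAME_FSA.accepts(["Mr.", "Langford"])` answers "YES", while a lone "Langford" is rejected because no link from the start node matches it. Tracking a set of current states lets the same code handle graphs where more than one link matches a token.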
Named Entity Identification III Finite-State Machines Definition of A Finite State Transducer (FST) An FSA with variable binding Outputs "NO" or "YES"+variable-bindings Variable bindings encode recognized entity e.g. "YES "
[Figure: Finite State Acceptor (FSA). A start state and an accept state, with links labeled CAP and HON-LIST.]
[Figure: Finite State Transducer (FST). Links labeled CAP, HON-LIST, CAP, with variable bindings HON :=, FirstName :=, LastName :=.]
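A transducer of this kind can be sketched in Python by attaching a variable name to each link, so that a successful traversal returns "YES" plus the bindings. The state numbering, the function names, and the extra link that lets an honorific be followed directly by a last name are assumptions made for illustration.

```python
HON_LIST = {"Mr", "Ms", "Dr", "President"}

def is_hon(tok):
    return tok.rstrip(".") in HON_LIST       # HON-LIST link

def is_cap(tok):
    return tok[:1].isupper()                 # CAP link

# Each link is (source state, matcher, variable to bind, target state).
LINKS = [
    (0, is_hon, "HON", 1),
    (1, is_cap, "FirstName", 2),
    (2, is_cap, "LastName", 3),
    (1, is_cap, "LastName", 3),   # assumed shortcut: honorific + last name only
]
ACCEPTING = {3}

def transduce(tokens):
    """Return ("YES", bindings) if the tokens are accepted, else ("NO", {})."""
    paths = [(0, {})]                         # (current state, bindings so far)
    for tok in tokens:
        paths = [(dst, {**bound, var: tok})
                 for state, bound in paths
                 for src, match, var, dst in LINKS
                 if src == state and match(tok)]
        if not paths:
            return ("NO", {})
    for state, bound in paths:
        if state in ACCEPTING:
            return ("YES", bound)
    return ("NO", {})
```

For the input `["Ms.", "Xuemei", "Peng"]` the sketch binds HON, FirstName, and LastName to the three tokens in order; for `["Mr.", "Langford"]` it binds only HON and LastName.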
Role-Situated Named Entities (1) Motivation It is useful to know the roles of NEs, e.g.: Who participated in the economic meeting? Who hosted the economic meeting? Who was discussed in the economic meeting? Who was absent from the economic meeting?
Role-Situated Named Entities (2) How do we Assign Roles to Entities? Instead of one FSM, use a trio of FSMs (left context, entity, right context) – Where left and right context help assign role
Role-Situated Named Entities (3) Example If = Then entity.role = ABSENT If = Then entity.role = HOST
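One way to sketch context-based role assignment in Python is to stand in for the left-context and right-context FSMs with regular expressions, which are also finite-state. The two rules below are hypothetical, written to fit the wording of the earlier trade-meeting example; they are not the slide's original patterns.

```python
import re

# Hypothetical rules: (left-context pattern, right-context pattern, role).
# An empty left pattern means "any left context".
ROLE_RULES = [
    (r"", r"^,? will not attend", "ABSENT"),
    (r"", r"^,? (who|which) is hosting", "HOST"),
]

def assign_role(text, entity):
    """Return the role of the first rule whose contexts surround the entity."""
    pos = text.find(entity)
    if pos < 0:
        return None
    left, right = text[:pos], text[pos + len(entity):]
    for left_pat, right_pat, role in ROLE_RULES:
        if re.search(left_pat, left) and re.search(right_pat, right):
            return role
    return None

TEXT = ("Singapore, who is hosting the meeting, will probably be represented "
        "by its ministers. The Australian representative, Mr. Langford, "
        "will not attend.")
```

With these rules, `assign_role(TEXT, "Mr. Langford")` gives ABSENT and `assign_role(TEXT, "Singapore")` gives HOST. In a full system the entity string would come from the entity FSM rather than being passed in by hand.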
Relational Information Extraction (1) Motivation It is useful to know who is doing what to whom
Relational Information Extraction (2) Example "John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over Friday’s closing price. Flexicon expects to acquire Supplyhouse by Q without problems from federal regulators"
Relational Information Extraction (3) The extraction system is a template of FSMs [Corporate-acquisition [acquirer ] [acquiree
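As a rough Python sketch of template filling for the tender-offer example, a single regular expression can play the role of the slot FSMs. The pattern, the company-suffix list, and the `share_price` slot name are assumptions for illustration; the acquirer and acquiree slot names follow the template above.

```python
import re

# A company name is assumed to be one capitalized word plus an optional suffix.
COMPANY = r"[A-Z]\w+(?: (?:Inc|Ltd|Corp)\.)?"

ACQUISITION = re.compile(
    r"(?P<acquirer>" + COMPANY + r") announced a tender offer for "
    r"(?P<acquiree>" + COMPANY + r") for \$(?P<share_price>\d+(?:\.\d+)?) per share"
)

def extract_acquisition(text):
    """Fill a Corporate-acquisition template, or return None if no match."""
    m = ACQUISITION.search(text)
    if m is None:
        return None
    return {
        "type": "Corporate-acquisition",
        "acquirer": m["acquirer"],
        "acquiree": m["acquiree"],
        "share_price": float(m["share_price"]),   # assumed extra slot
    }

REPORT = ("John Snell reporting for Wall Street. Today Flexicon Inc. announced "
          "a tender offer for Supplyhouse Ltd. for $30 per share.")
```

On `REPORT`, the sketch fills the acquirer slot with "Flexicon Inc." and the acquiree slot with "Supplyhouse Ltd.". A single pattern covers only one phrasing; a practical system would need one pattern per way of announcing an acquisition.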
Fact Extraction: State of the Art (1) Observations Entity => entity+roles => relation templates Increasing richness of information extracted But not equivalent to language understanding Only pre-determined info types extracted
Fact Extraction: State of the Art (2) Observations Useful for relational DB filling
Acquirer    Acquiree      Sh.price   Year
__________________________________________
Flexicon    Logi-truck
Flexicon    Supplyhouse
buy.com     reel.com
Fact Extraction: State of the Art (3) Technical Approaches Manually-built ad-hoc extraction "rules" Manually-built FSTs Feature-based training from labeled instances (Naive Bayes, Decision Trees) Hidden Markov Models FSTs with feedback-driven tuning
Applications of Text Extraction I (1) Financial auto-response –e.g. "What is the balance of account N ?" –First categorize as balance-request –Then extract account number
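The two steps of the auto-response (categorize, then extract) might look like this in Python. The keyword test and the account-number format (a run of four or more digits after the word "account") are assumptions, since the actual format is not given.

```python
import re

def is_balance_request(text):
    """Step 1: categorize. Hypothetical rule: mentions balance and account."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "balance" in words and "account" in words

# Step 2: extract. Assumed format: digits following the word "account".
ACCOUNT = re.compile(r"account\s+(?:number\s+)?(\d{4,})", re.IGNORECASE)

def handle(text):
    """Return a filled balance-request record, or None for other messages."""
    if not is_balance_request(text):
        return None
    m = ACCOUNT.search(text)
    return {"category": "balance-request",
            "account": m.group(1) if m else None}
```

A message that is categorized as a balance request but has no recognizable account number yields a record with `account` set to `None`, which a caller could route to a person.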
Applications of Text Extraction I (2) Financial Template filling from bank order –e.g. "Please transfer 100,000 USD from N to checking account A tomorrow" –First categorize as transfer
Applications of Text Extraction I (3) Financial Template filling from bank order –Then extract: [account-transfer ] –Then employee checks template and adds/corrects information such as missing date (e.g. if the system cannot interpret "tomorrow")
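A hedged Python sketch of the transfer template: the pattern fills the slots it can and records which ones are left for the employee to complete. The slot names, the account identifiers, and the rule that only ISO-style dates are understood are all assumptions for illustration.

```python
import re

TRANSFER = re.compile(
    r"transfer\s+(?P<amount>[\d,]+)\s+(?P<currency>[A-Z]{3})\s+"
    r"from\s+(?P<source>\w+)\s+to\s+(?:checking\s+account\s+)?(?P<target>\w+)"
    r"(?:\s+on\s+(?P<date>\d{4}-\d{2}-\d{2}))?"   # relative dates are not parsed
)

def fill_transfer_template(text):
    """Fill an account-transfer template; list unfilled slots for review."""
    m = TRANSFER.search(text)
    if m is None:
        return None
    template = {
        "type": "account-transfer",
        "amount": int(m["amount"].replace(",", "")),
        "currency": m["currency"],
        "source": m["source"],
        "target": m["target"],
        "date": m["date"],          # None when the date was not understood
    }
    # Slots the employee must add or correct by hand.
    template["needs_review"] = [k for k, v in template.items() if v is None]
    return template
```

For an order ending in "tomorrow", the date slot stays empty and appears in `needs_review`, matching the manual-check step described above.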
Applications of Text Extraction II (1) Informational For all seminar announcements in BB extract time/title/speaker/location From messages about proposed meetings extract time/participants/location
Applications of Text Extraction II (2) Large-scale Web applications Build DB of all job openings –Categorize web pages as job descriptions –Extract company/date/salary/level/... –Fill in relational DB with extracted info Whizbang! (a Pittsburgh eCompany) is doing just this via its flipdog.com site Build DB of all web-posted resumes, first categorizing pages as resumes, then extracting key fields name/expertise/...
Applications of Text Extraction II (3) Corporate Intelligence Extract key facts from competitors' web sites –New products offered –Any changes to prices, sales, etc. Extract key facts about customers of competitors