Contents
  Overview of our text mining work
  Text mining for individual text
  Text mining for discussion text
  Text mining for e-mail
  Discussion on e-mail mining
  Pair-mail
  Three levels of e-mail mining targets
  Preliminary study of e-mail mining
Text Mining
  Text mining has become one of the most influential areas of natural language processing research.
  Text mining has been extended to various domains:
    CRM (Customer Relationship Management)
    Biomedical domain
    Web pages
    Discussion records
    Patents
Text Mining for Individual Text
  IBM TAKMI (Nasukawa, Nagano, 1999)
  Unstructured data:
    Call Taker: James
    Date: Aug. 30, 2002
    Duration: 10 min.
    CustomerID: ADC00123
    Q: cust sys has stopped working.
    A: checked cust bios and it need updated. …
  Structured data:
    [Call Taker] James
    [Date] 2002/08/30
    [Duration] 10 min.
    [CustomerID] ADC00123
    [Noun] Customer
    [Software] BIOS
    [Subj...Verb] customer system..stop
    [SW..Problem] BIOS..need
  Linguistic analysis: tagging, dependency analysis, named entity extraction, intention analysis
  Dictionaries: category dictionary, synonym dictionary
  Visualization & interactive mining
  (Other figure labels: Original Data, Meta Data, Category, Item, Mining)
  Mining target: individual text
  Mining unit:
    > texts
    > category-labeled items extracted from text using NLP
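The step from the unstructured call record to the bracketed fields can be sketched roughly as below. This is only an illustration of the metadata part of the slide, not TAKMI itself: the `SYNONYMS` table, `to_structured`, and `normalize` are hypothetical names, and the linguistic analysis (tagging, dependency analysis, intention analysis) is not reproduced.

```python
from datetime import datetime

RECORD = """\
Call Taker: James
Date: Aug. 30, 2002
Duration: 10 min.
CustomerID: ADC00123
Q: cust sys has stopped working.
A: checked cust bios and it need updated."""

# Hypothetical synonym dictionary; the actual TAKMI dictionaries are not shown in the slides.
SYNONYMS = {"cust": "customer", "sys": "system", "bios": "BIOS"}

def to_structured(record):
    """Split 'Key: value' lines into fields and normalize the date to YYYY/MM/DD."""
    fields = {}
    for line in record.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    # Assumes English month abbreviations as in the example record.
    parsed = datetime.strptime(fields["Date"], "%b. %d, %Y")
    fields["Date"] = parsed.strftime("%Y/%m/%d")
    return fields

def normalize(text):
    """Replace abbreviated words using the synonym dictionary."""
    words = text.rstrip(".").split()
    return " ".join(SYNONYMS.get(w.lower(), w) for w in words)

fields = to_structured(RECORD)
question = normalize(fields["Q"])
```

Here `fields["Date"]` takes the normalized form shown on the slide, and `question` expands "cust sys" to "customer system" before any further analysis.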
TAKMI Client GUI
  (Screenshot panels: Mining History, Document List, Distribution Analysis View, Other Mining Views)
Text Mining for Discussion Records
  Discussion Mining (Murakami, Nagao, 2001)
  (Figure labels: Mail A, Mail B, Mail C; quotation from Mail A, comment on the quotation; quotation from Mail B, comment on the quotation; Thread; Summary; Linguistic Annotation)
  Mining target: discussion records
  Mining unit:
    > summarized texts based on thread structure
    > mail graph structures
Text Mining for E-mail
  Private e-mail data
  Mail messages carry various structured data:
    Sender (From), Receiver (To, cc, bcc), Time Stamp, Mail unique ID, Referential ID, etc.
  Independent and relational documents are mixed in e-mail data.
    Independent: F.Y.I., invitation, CFP, etc.
    Relational: mailing list, inquiry, request, etc.
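The structured fields listed on this slide can be read from a raw message with Python's standard `email` package. The sketch below assumes that "Mail unique ID" corresponds to the `Message-ID` header and "Referential ID" to `In-Reply-To` / `References`; the sample message and the `structured_fields` helper are made up for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Made-up sample message for illustration.
RAW = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Cc: Carol <carol@example.com>
Date: Mon, 05 May 2003 10:15:00 +0900
Message-ID: <m2@example.com>
In-Reply-To: <m1@example.com>
References: <m1@example.com>
Subject: Re: draft

Thank you for the draft.
"""

def structured_fields(raw):
    """Extract sender, receivers, time stamp, and ID headers from one message."""
    msg = message_from_string(raw)

    def addrs(name):
        # get_all returns every occurrence of the header, or the default if absent.
        return [addr for _, addr in getaddresses(msg.get_all(name, []))]

    return {
        "sender": addrs("From"),
        "receivers": addrs("To") + addrs("Cc") + addrs("Bcc"),
        "timestamp": parsedate_to_datetime(msg["Date"]),
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
        "references": (msg["References"] or "").split(),
    }

fields = structured_fields(RAW)
```

A message with no `In-Reply-To` value would be a candidate "independent" document in the sense of this slide, although that alone does not decide its type.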
Properties of e-mail messages
  (Figure: message types placed on two axes, Private/Public and Independent/Relational)
  Message types: private mail without c.c., private mail with c.c., F.Y.I., spam, memo, mailing list, schedule
  Mining areas: discussion mining, e-mail mining, text mining
  Text types: discussion, BBS, etc.; discourse; paper, report, etc.
E-mail mining
  Not suitable for annotation
    Shorter threads than discussion records
    Lack of information such as discourse structure
    Fewer participants than in discussion records
  AND
  Need to consider scalability
  A new concept of the e-mail mining target is required.
Pair-mail
  A pair-mail is formed by a reference link (reply-to information).
  Each reply-to link forms one pair-mail.
  A pair-mail carries reference-type information based on the contents of the previous and next mail:
    Question/Answer, Imperative/Action, Action/Regards, etc.
  (Figure label: reply-to)
Mining Target (mining units)
  Three levels of mining target in mail data
    1st level: e-mail, an individual e-mail as a single substance
    2nd level: pair-mail, a pair of e-mails linked by a reply-to relation
    3rd level: thread, a chain of e-mail messages
  (Figure axis: scalability, from high to low)
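The three levels can be illustrated with a small sketch. It assumes each mail is known only by its own ID and the ID of the mail it replies to (for example taken from an `In-Reply-To` header); the toy IDs and the `pair_mails` / `threads` helpers are hypothetical.

```python
from collections import defaultdict

# Toy data: (message id, id of the mail it replies to, or None).
MAILS = [
    ("m1", None),
    ("m2", "m1"),
    ("m3", "m2"),
    ("m4", None),
]

def pair_mails(mails):
    """2nd level: one (previous, reply) pair per reply-to link."""
    known = {mid for mid, _ in mails}
    return [(parent, mid) for mid, parent in mails if parent in known]

def threads(mails):
    """3rd level: group mails by the root reached by following reply-to links."""
    parent_of = dict(mails)
    known = set(parent_of)

    def root(mid):
        seen = set()  # guards against malformed cyclic links
        while parent_of[mid] in known and mid not in seen:
            seen.add(mid)
            mid = parent_of[mid]
        return mid

    groups = defaultdict(list)
    for mid, _ in mails:
        groups[root(mid)].append(mid)
    return list(groups.values())

singles = [mid for mid, _ in MAILS]   # 1st level: individual e-mails
pairs = pair_mails(MAILS)
chains = threads(MAILS)
```

Pairs need only one lookup per reply link, whereas threads require following whole chains, which is one way to read the scalability axis on the slide.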
Examples of mail mining
  Mail data for one month (May 2003)
  Business-related mails
    discussion with a co-author of my paper
    meeting invitations
  Mail magazines and mailing list messages are received in another account.
  Includes messages I sent.
  Volume: 380 mail messages (19 mail messages per working day)
Thread Properties
  Thread structure is extracted from the header information (Reference ID).
  Average thread length: 1.60 mail messages (238 threads), but most mail messages are of the individual type.
  Average length excluding individual mails: 3.09 mail messages (68 threads).
  Most threads are shorter than 3 messages.
  There are only 16 long threads (over 4 messages).
  The average number of participants in long threads (more than 4 messages) is 3.5.
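Statistics of this kind can be computed from a list of thread lengths. The sketch below uses made-up lengths, not the original data; `thread_stats` is a hypothetical helper, and "long" is taken to mean more than 4 messages as on the slide.

```python
from statistics import mean

def thread_stats(lengths, long_over=4):
    """Summarize thread lengths; threads of length 1 are individual mails."""
    multi = [n for n in lengths if n > 1]
    return {
        "threads": len(lengths),
        "avg_length": mean(lengths),
        "multi_threads": len(multi),
        "avg_length_multi": mean(multi) if multi else 0.0,
        "long_threads": sum(1 for n in lengths if n > long_over),
    }

# Made-up example: three individual mails and three real threads.
stats = thread_stats([1, 1, 1, 2, 3, 6])

# Consistency check on the reported figures, assuming the 238 - 68 = 170
# remaining threads are single messages.
overall_avg = round(380 / 238, 2)
estimated_total = round(170 + 68 * 3.09)
```

Under that assumption the reported numbers agree with each other: 380 messages over 238 threads gives about 1.60, and 170 individual mails plus 68 threads averaging 3.09 messages gives about 380.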
Changes in the number of thread participants
  Expansion of the number of participants
    general information
  No member in the c.c. field
    special topics between sender and receiver
  Considering pair-mail properties (e.g., the shift in the number of participants) helps to extract the relevant information.
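The shift in the number of participants across one pair-mail can be computed as a set difference. This is a minimal sketch, assuming each mail is represented as a dict with `from`, `to`, and `cc` fields; the field names, addresses, and the `participant_shift` helper are hypothetical.

```python
def participants(mail):
    """All addresses involved in one mail (sender, To, and c.c.)."""
    return {mail["from"], *mail["to"], *mail["cc"]}

def participant_shift(first, second):
    """Compare the participants of the two mails in a pair-mail."""
    before, after = participants(first), participants(second)
    return {
        "before": len(before),
        "after": len(after),
        "added": sorted(after - before),
        "dropped": sorted(before - after),
    }

# Made-up pair-mail: the reply adds one person in c.c.
first = {"from": "alice@example.com", "to": ["bob@example.com"], "cc": []}
second = {"from": "bob@example.com", "to": ["alice@example.com"],
          "cc": ["carol@example.com"]}

shift = participant_shift(first, second)
```

An expanding participant set or an empty c.c. field could then be used as a feature of the pair-mail, in line with the slide; how such features map to content types is not specified in the source.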
Pair-mail Extraction
  Pair-mails are extracted when the second mail contains a given expression,
    e.g., a gratitude expression such as "Thank you".
  These pair-mails contain something related to that expression.
    In this example, the gratitude expression is the result of some action in the previous mail.
  (Figure: Action in the previous mail, "thank you..." in the reply)
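A minimal sketch of this extraction, assuming each pair-mail is given as a (previous body, reply body) tuple: the phrase list in `GRATITUDE` is a hypothetical example, not the expression set used in the study.

```python
import re

# Hypothetical gratitude expressions; the original expression list is not given.
GRATITUDE = re.compile(r"\b(thank you|thanks|many thanks)\b", re.IGNORECASE)

def gratitude_pairs(pairs):
    """Keep pair-mails whose second (reply) mail contains a gratitude expression."""
    return [(prev, reply) for prev, reply in pairs if GRATITUDE.search(reply)]

# Made-up pair-mails for illustration.
PAIRS = [
    ("I have uploaded the revised draft.", "Thank you for the quick update."),
    ("Can we meet on Friday?", "Friday at 3 pm works for me."),
    ("Please find the minutes below.", "Thanks, I will read them today."),
]

hits = gratitude_pairs(PAIRS)
actions = [prev for prev, _ in hits]  # candidate actions in the previous mail
```

The previous mail of each hit is where the action that prompted the gratitude would be looked for.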
Result of pair-mail extraction
  Extracted: 106 pair-mails
  Most of the expressions are found in the previous mail included as an attachment, so data cleansing is required.
  In the rest of the results, we can find the action described in the previous mail.
  About 40% express gratitude for actions described in the mail (8% for information), and 10% are for real-world actions.
  5% are platitudinous expressions.
Summary
  Text mining for e-mail
    Text mining for individual and relational text
    Introduces a new mining unit
  Three levels of e-mail mining targets
    single mail, pair-mail, thread
  Preliminary study of e-mail mining
    Pair-mail information is important in threads.
    Data cleansing is needed: remove signatures, attachments, etc.
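The data cleansing step could look roughly like the sketch below. It rests on two assumptions that the slides do not state: the previous mail appears in the reply as lines starting with ">", and the signature begins at a line consisting of "--" (optionally followed by a space). `clean_body` is a hypothetical helper, and removal of real file attachments is not covered.

```python
def clean_body(body):
    """Drop quoted lines and everything from the signature delimiter onward."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":             # assumed signature delimiter
            break
        if line.lstrip().startswith(">"):     # assumed quotation of the previous mail
            continue
        kept.append(line)
    return "\n".join(kept).strip()

# Made-up reply whose quoted part also contains a gratitude expression.
REPLY = """Thank you.
> please send the file
> thank you in advance
-- 
Alice
"""

cleaned = clean_body(REPLY)
```

After cleansing, an expression search sees only the text written in the reply itself, which avoids counting expressions that come from the included previous mail.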