1 IBM India Research Lab
Unstructured information integration through data-driven similarity discovery
Rema Ananthanarayanan, IBM India Research Lab
Yuen Yee Lo, Nuance Communications Inc, US
Berthold Reinwald, IBM Almaden Research Lab, US
Sreeram Balakrishnan, IBM Software Group, US

2 Outline
- Motivation
- Background
- Our approach
- Experiment details
- Conclusion

3 Motivation
- A large share of business data (>80%) is unstructured:
  - Email, spreadsheets, reports, documents
  - Customer feedback via phone, email, or online forms
  - Web pages, blogs, etc.
- Unstructured data:
  - Carries no schema information
  - Where schema information exists, it comes from different sources
- Automatic information integration and querying across heterogeneous data sources (structured and unstructured) is a major enterprise challenge.

4 Motivation
- Instance-value-based data integration, augmented with a schema-based approach, appears more amenable to complete end-to-end automation:
  - Feasibility
  - Enhanced performance

5 Background
- Integration of structured data
  - Entity matching
  - Record matching
  - Table matching
- Integration of unstructured data
  - Instance level, e.g. personal information on a desktop
- Entity retrieval over unstructured data
  - Instance based
  - Metadata based

6 Our approach
- Identify groups of related data sets using various text processing and data mining techniques.
- Identify attributes of the related data sets by comparison with domain-specific reference sets and by keyword generation.
- Present a unified view of the various data sets, as a single repository, for subsequent querying.

7 Definitions
- Data set: A collection of data items or documents, depending on the context.
- Signature: Specific characteristics of a data set that are pertinent to evaluating its similarity with other data sets, e.g. char-level, word-level.
- Top-level data type: A top-level characterization of the data in the data set, e.g. alpha-numeric, plain text, ...

8 System architecture (diagram not reproduced in this transcript)

9 Step 1: Top-level classification of document types
* An NLP tool, LanguageWare, was used to classify the unstructured text.
** A and B (labels from the slide's figure) could be significant where there are many product names and product details in the text.

10 Step 1: Signature generation
Signatures are constructed at various levels of granularity:
- Character n-gram
- Morpheme n-gram
  - Generated with the TexHyphenation package
  - A morpheme is the smallest linguistic unit that has semantic meaning, e.g. sentence -> {sen, tence}
- Word n-gram
  - LanguageWare is used for tokenization, e.g. New York -> {New_York}
Observations:
- Name and address data appear more amenable to character trigram signatures.
- Word n-gram signatures appear representative for plain text.
- Product data sets required special signatures based on the alphanumeric patterns in the data sets.
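The signature levels above can be illustrated with a minimal Python sketch. It is not the authors' implementation: a plain regular expression stands in for LanguageWare tokenization, morpheme n-grams are omitted, and `alnum_pattern` is a hypothetical guess at what an alphanumeric-pattern signature for product data might look like.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def word_ngrams(text, n=1):
    """Count word n-grams; a simple regex replaces LanguageWare tokenization (assumption)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter("_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def alnum_pattern(token):
    """Hypothetical product-code signature: letters become 'A', digits '9', the rest is kept."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in token)


trigrams = char_ngrams("New York")       # character trigram signature
bigrams = word_ngrams("New York", n=2)   # word bigram signature
pattern = alnum_pattern("TP-600X")       # made-up product code
```

Each function returns a bag of features for one piece of text; a data set's signature would then be the sum of these counters over all of its items.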

11 Step 2: Text similarity measure
Cosine similarity with tf-idf weighting is used to compute similarity across data sets. For two data sets $D_i$ and $D_j$, the cosine similarity is computed as
$$\mathrm{sim}(D_i, D_j) = \frac{\sum_t w_{t,D_i}\, w_{t,D_j}}{\sqrt{\sum_t w_{t,D_i}^2}\,\sqrt{\sum_t w_{t,D_j}^2}}$$
where $w_{t,D_i}$ is the weight of term $t$ in data set $D_i$, $w_{t,D_i} = tf_{t,D_i} \cdot idf_t$, and $idf_t$ is the inverse document frequency, $idf_t = \log(N / n_t)$, with $N$ the total number of documents and $n_t$ the number of documents containing term $t$.
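A small Python sketch of this measure follows. It assumes each data set is treated as a single document when counting document frequencies, which the slide does not spell out, and the toy data sets are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(datasets):
    """datasets: dict of name -> list of terms. Returns name -> {term: tf * idf}."""
    n_docs = len(datasets)
    doc_freq = Counter()
    for terms in datasets.values():
        doc_freq.update(set(terms))  # each data set counts once per term
    vectors = {}
    for name, terms in datasets.items():
        tf = Counter(terms)
        vectors[name] = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse weight vectors; 0.0 if either has zero norm."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy data sets, already reduced to word-level terms.
toy = {
    "a": ["bill", "late", "bill"],
    "b": ["bill", "late"],
    "c": ["movie", "great"],
}
vecs = tfidf_vectors(toy)
sim_ab = cosine(vecs["a"], vecs["b"])
sim_ac = cosine(vecs["a"], vecs["c"])
```

Note that a term occurring in every data set gets $idf_t = \log(1) = 0$ and so contributes nothing to the similarity.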

12 Step 3: Generating the clusters
- Identify the number of clusters using the gap statistic technique.
- Cluster the data sets based on the similarity scores.
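The slide does not name the clustering algorithm, so the sketch below uses a deliberately simple stand-in: single-linkage grouping, where two data sets end up in the same cluster if a chain of pairwise similarities at or above a threshold connects them. The gap statistic is not reproduced here; the threshold plays the role of deciding how many clusters emerge. The function name, scores, and threshold are all illustrative assumptions.

```python
def cluster_by_threshold(names, similarity, threshold):
    """Single-linkage grouping with union-find.

    names: list of data set names.
    similarity: dict mapping (name_a, name_b) -> score.
    Returns a sorted list of clusters, each a sorted list of names.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Invented similarity scores between four unstructured-text sources.
scores = {
    ("UST1", "UST2"): 0.9,
    ("UST2", "UST3"): 0.7,
    ("UST1", "UST3"): 0.2,
    ("UST3", "UST4"): 0.1,
}
groups = cluster_by_threshold(["UST1", "UST2", "UST3", "UST4"], scores, 0.5)
```

With these made-up scores, UST1 and UST3 land in the same cluster through UST2 even though their direct similarity is low, which is the characteristic behaviour of single linkage.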

13 Step 4: Relating the clusters to the reference data
- Compute the similarity of each reference data set with each cluster.
- Map each cluster to the reference data set to which it is closest, provided the similarity exceeds a threshold.
Advantages of using reference data sets:
- Similarity with the reference data set helps us validate the clustering.
- It helps us identify new topics as they arise since the corpora ...
- Extensive sets of reference data would help enhance the level of automation that is possible.
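The mapping rule above can be sketched as follows. The function name, cluster and reference labels, scores, and threshold are hypothetical; the sketch assumes the cluster-to-reference similarities have already been computed, for instance with the cosine measure from Step 2.

```python
def map_clusters_to_references(cluster_ref_similarity, threshold):
    """Map each cluster to its most similar reference data set.

    cluster_ref_similarity: dict of cluster -> {reference: score}.
    A cluster whose best score is below the threshold maps to None,
    marking it as a candidate new topic.
    """
    mapping = {}
    for cluster, ref_scores in cluster_ref_similarity.items():
        best_ref = max(ref_scores, key=ref_scores.get) if ref_scores else None
        if best_ref is not None and ref_scores[best_ref] >= threshold:
            mapping[cluster] = best_ref
        else:
            mapping[cluster] = None
    return mapping


# Invented scores for two clusters against two reference data sets.
example = {
    "cluster_0": {"names": 0.82, "addresses": 0.31},
    "cluster_1": {"names": 0.12, "addresses": 0.18},
}
assignment = map_clusters_to_references(example, threshold=0.5)
```

Clusters left unmapped are the ones worth inspecting by hand, since they may correspond to topics not yet covered by any reference data set.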

14 Application scenario
A service provider supports customers on multiple channels:
- Phone
- Email
- Outsourced call centers
- Directly from point-of-sale
- ...
The provider also receives feedback from customers through these channels, in various formats.
In a specific scenario, there are 8 sources of information, each with one or more sources of unstructured text (referred to as USTi).
The goal is to provide a consolidated view of all the data gathered from the various call centers.

15 Sample data from various unstructured text sources
- UST5: Wrong commitments on X by salesperson. Bill not recd on time and sales person made wrong commitment. Says the service is good.
- UST7: Sub wants to know about scheme X. Sub is getting error XXX. Customer wanted to have details regarding scheme Y; agent took the details and updated the cust rgding the same.
- UST8: Able to communicate clearly and professionally. Clarity and communication can be better.
- UST10: Customer issues resolved online. Partial issues resolved online.
- UST11: Spoke to X today; as per him he has not received the welcome kit till today; plz check and do whatever is necessary. Customer is not satisfied with the explanation given by the sales executive; induction visit also has not happended. Customer suggests to book an appointment before visiting.

16 Experiment: Test data sets
Real-life data from a service provider, gathered from 8 input sources, in different formats and at different points in time:
- Information provided by the customer when registering for service
- Information provided when the customer logged a complaint at the call center
- Information gathered by the service provider periodically, by calling the customer
- ...
Additional data, obtained from various sources, was added to augment our available data and validate the proposed techniques:
- Product data from a hardware and software vendor
- Movie reviews downloaded from the web
- Presidential speeches (Natural Language Toolkit)
- Financial documents: annual and quarterly filings of companies

17 Details of the experiment (cont.)
- Top-level categories were determined.
- Word-level signatures were used to generate the similarity measures across each pair of data sets.
- The gap statistic was used to estimate the number of clusters.
- The actual clusters were then computed.
- The output, in terms of both the number of clusters and their membership, matched our observations.

18 Sample query: Return all feedback received from each customer from all sources
Customer | Data set | Feedback
Id1 | UST17 | No issues
Id1 | UST14 | Officer has time till tomorrow morning
Id1 | UST11 | Installation engr done work neatly
Id1 | UST4 | Customer says good service
Id1 | UST2 | DSL speed line very slow and every half hour disconnects
Id2 | UST17 | Pls honor when committed
Id2 | UST14 | Problem X was reported, was committed 4 times that one of customer care execs will turn up to his place but no one had turned up
Id2 | UST1 | Overall it is good
Id2 | UST4 | Now no problem
Id3 | UST13 | Spoken to Mr X - as per him, satisfied with services
Id3 | UST1 | Good
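Once the sources are integrated, a query of this kind reduces to grouping feedback records by customer. The sketch below assumes a hypothetical record layout of (customer id, source, feedback) tuples; the slides do not describe how the unified repository is actually stored or queried.

```python
from collections import defaultdict


def feedback_by_customer(records):
    """Group (customer_id, source, feedback) tuples by customer, keeping input order."""
    view = defaultdict(list)
    for customer_id, source, feedback in records:
        view[customer_id].append((source, feedback))
    return dict(view)


# A few rows taken from the sample query result on the slide.
rows = [
    ("Id1", "UST17", "No issues"),
    ("Id3", "UST13", "Spoken to Mr X - as per him, satisfied with services"),
    ("Id1", "UST4", "Customer says good service"),
    ("Id3", "UST1", "Good"),
]
consolidated = feedback_by_customer(rows)
```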

19 Conclusions
- Existing techniques from NLP, data mining, and related areas have been combined to address the important problem of heterogeneous information integration.
- We describe a data-driven approach that is more scalable for future automation.
- Larger and more diverse data sets would help further validate our techniques.
- More extensive reference data sets would also help recognize new data sets as they arise.

20 Thank you. For more information, please contact:
