
1 Information and Telecommunication Technology Center (ITTC) University of Kansas SmartXAutofill Intelligent Data Entry Assistant for XML Documents Danico Lee April 7, 2005

2 Background – XML Technology
• XML is a mark-up language for data representation and data exchange
• Characteristics and advantages of XML:
  - Users from different professions can define their own tags and attribute names
  - Allows people in the same field to exchange data and information
  - XML document structures can be nested to any level of complexity
  - An XML document can contain an optional description of its grammar for performing structural validation
• XML is important in today’s high-volume data-collection environments
  - As of 10/26/2004, 1,556,009 people worldwide were using a single XML application tool
  - Lots of software applications for XML, e.g. SOAP, XML Spy, Microsoft Office 2003
  - Major companies are putting their data in XML
• Many professional groups have developed their own XML ontologies, e.g. OMF for meteorologists and CML for chemists

3 XML Document - Example

4 Background - Autofill Technology
• Currently 70 million workers, or 59% of working adults in the U.S., complete forms on a regular basis
• The data entry process is tedious, error-prone, time-consuming and person-power intensive
  - Most businesses continue to process almost 80% of their forms manually (according to Verity Inc.)
• Autofill and auto-complete technologies ease the burden of data entry by automatically predicting and suggesting values for empty data fields
• Problems with current autofill technologies:
  - Require a perfect match with the historical data, e.g. AOL
  - Or require previously stored templates, e.g. Roboform
  - Are mostly for web-based forms
  - Can only handle simple data, e.g. name and address in online shopping forms; no support for complex XML structures
  - Inaccurate

5 Motivation
• XML is the primary standard for data representation and data exchange
• Most businesses continue to process almost 80% of their forms manually
• The data entry process for XML documents is tedious, error-prone, time-consuming and person-power intensive
• Current software tools for XML only simplify the implementation process
  - Information for XML documents still needs to be entered manually
• Previous software tools for assisting data entry:
  - Inaccurate
  - Do not support complex XML grammars

6 Approach
• Our goal: reduce the burden on the user by automating data entry into XML documents
• SmartXAutofill: an intelligent data entry assistant for predicting and automating inputs for XML documents
  - Based on the contents of historical document collections in the same XML domain
• Incorporates an ensemble classifier that integrates multiple internal classification algorithms into a single architecture
• Each internal classifier uses approximate techniques from Machine Learning to predict and suggest a value for an empty XML field
  - Approximate match: predict the empty node values by approximately matching the values in a partially filled document against the values in a historical collection of XML documents, e.g. probabilistically
  - Very different from current autofill systems, which require a perfect match between the incomplete document and the values of stored documents
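
The slides do not name the internal classifiers, so the following is only an illustrative sketch of what an approximate-match suggestor could look like, not the SmartXAutofill implementation. It assumes documents are flattened to dictionaries of field name to value; the function name `suggest_by_overlap`, the scoring rule (one vote per historical document, plus one per field that agrees with the partially filled document) and the sample data are all hypothetical.

```python
from collections import Counter


def suggest_by_overlap(partial, field, history, top_k=3):
    """Rank candidate values for `field` using approximate matching.

    Every historical document that has a value for `field` votes for it.
    The vote grows with the number of already-entered fields on which the
    historical document agrees with `partial`, so a perfect match is not
    required (assumed scoring rule, chosen for illustration only).
    """
    votes = Counter()
    for doc in history:
        if field not in doc:
            continue
        overlap = sum(
            1
            for name, value in partial.items()
            if name != field and doc.get(name) == value
        )
        votes[doc[field]] += 1 + overlap
    return [value for value, _ in votes.most_common(top_k)]


# Made-up sample data: four past documents and one partially filled form.
history = [
    {"bldg": "Hall A", "room": "246"},
    {"bldg": "Hall A", "room": "246"},
    {"bldg": "Hall A", "room": "129"},
    {"bldg": "Hall B", "room": "2001"},
]
suggestions = suggest_by_overlap({"bldg": "Hall A"}, "room", history)
```

The document from "Hall B" still contributes a low-ranked candidate, which is the difference from a perfect-match autofill: disagreement lowers a candidate's score instead of excluding it.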

7 Overview
1. User enters data into an XML form and moves the cursor to an empty field
2. SmartXAutofill examines the data entered
3. SmartXAutofill examines the historical XML collection
4. Machine Learning algorithms predict what the data value should be
5. Weighting System learns and improves from past performance by rewarding algorithms that make correct predictions
6. Voting System forms a consensus decision
7. SmartXAutofill returns one or more suggestions for the current field
8. User selects one of the SmartXAutofill suggestions or enters another value

8 Underlying Technology – Ensemble Learning
• Problem: it is impossible to predict which classification algorithm will work best for which type of document
• Solution: ensemble classifier
  - A collection of classification algorithms; each classifier provides predictions for the value of an XML node
  - Learns which individual algorithms provide better predictive accuracy for different XML domains and for different nodes in the XML documents in these domains
  - Adapts itself to the specific XML collection, and performs better than any individual predictive algorithm
• Boosting is one of the most widely used ensemble methods

9 Underlying Technology – Ensemble Learning (cont’d)
• Our ensemble boosts the internal classifiers based on their past performance by weighting the individual classifiers
  - Previous work in boosting combined the same type of classifier, learned by the same methodology, but trained on different examples
  - Our ensemble combines different types of classifiers into an integrated classification framework
• Extra feature: the collection of XML documents used for prediction is constrained by a “time window”
  - Only the N latest documents are used
  - N is defined by the user
  - Allows the system to adapt itself to the type of documents being entered recently
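
The "time window" can be pictured as a bounded buffer that keeps only the N most recently entered documents. The sketch below is one possible reading, with a hypothetical class name `TimeWindow`; the slides do not say how the window is stored.

```python
from collections import deque


class TimeWindow:
    """Keeps only the N latest documents for use in prediction."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("window size N must be at least 1")
        # A deque with maxlen silently drops the oldest item when full.
        self._docs = deque(maxlen=n)

    def add(self, document):
        """Record a newly completed document."""
        self._docs.append(document)

    def documents(self):
        """Return the retained documents, oldest first."""
        return list(self._docs)


window = TimeWindow(n=3)
for number in range(1, 6):
    window.add({"id": number})
recent = window.documents()
```

After five additions with N = 3, only the documents with ids 3, 4 and 5 remain available to the classifiers.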

10 System Architecture – Ensemble Learning
• Ensemble Learning
  - Multiple internal Machine Learning algorithms integrated into a single architecture
• Weighting System
  - Different classification algorithms show different predictive accuracy for different nodes
  - Overall accuracy of the ensemble is improved by learning how internal classifiers perform on each node
  - Weigh each classifier by past performance
• Suggestion Aggregator
  - Voting forms a consensus decision on which value is suggested by the internal classifiers
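
As a rough illustration of weighting each classifier by past performance on each node, the sketch below tracks hits and attempts per (classifier, node) pair. The update rule is an assumption: the slides only say that correct predictions are rewarded and that all classifiers start with the same weight. Here the weight is a smoothed historical accuracy, (correct + 1) / (total + 2), and the names `NodeWeights`, `record` and `weight` are hypothetical.

```python
from collections import defaultdict


class NodeWeights:
    """Per-node weights for each internal classifier (illustrative only)."""

    def __init__(self):
        self._correct = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, classifier, node_path, was_correct):
        """Store the outcome of one prediction for one node."""
        key = (classifier, node_path)
        self._total[key] += 1
        if was_correct:
            self._correct[key] += 1

    def weight(self, classifier, node_path):
        """Smoothed accuracy; every unseen pair starts at the same 0.5."""
        key = (classifier, node_path)
        correct = self._correct.get(key, 0)
        total = self._total.get(key, 0)
        return (correct + 1) / (total + 2)


weights = NodeWeights()
weights.record("A", "/talk/place/room", True)
weights.record("A", "/talk/place/room", True)
weights.record("B", "/talk/place/room", False)
```

With these three outcomes, classifier A's weight for the room node rises to 0.75, classifier B's falls to one third, and neither classifier's weight for any other node changes.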

11 Ensemble Weighted Voting Example
• Three classifiers provide three suggestions each
• All classifiers have the same weight initially
• Classifier weights are modified based on their performance for different nodes in the XML domain
• Classifier A makes three suggestions:
  - the top one receives a rank value of 3
  - the second one receives a rank value of 2
  - the third one receives a rank value of 1
• Rank values are multiplied by the weight of the classifier and then normalized by the sum of the weights of all the classifiers
• The suggestion with the highest score is the one selected by the ensemble and presented to the user
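
The voting rule on this slide can be sketched as follows. The function name `weighted_vote`, the sample values and the alphabetical tie-break are assumptions of this sketch. Contributions from different classifiers for the same value are summed, which matches the demo's description of a voting bar built from each suggestor's contribution.

```python
from collections import defaultdict


def weighted_vote(suggestions, weights):
    """Combine ranked suggestions from several classifiers.

    suggestions: classifier name -> list of values, best first
    weights:     classifier name -> non-negative weight
    Returns (value, score) pairs sorted by descending score.
    """
    total_weight = sum(weights[name] for name in suggestions)
    if total_weight <= 0:
        raise ValueError("the classifier weights must sum to a positive number")
    scores = defaultdict(float)
    for name, ranked in suggestions.items():
        for position, value in enumerate(ranked):
            # With three suggestions the rank values are 3, 2 and 1.
            rank_value = len(ranked) - position
            scores[value] += rank_value * weights[name] / total_weight
    # Ties are broken alphabetically (a choice made for this sketch).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


suggestions = {
    "A": ["x", "y", "z"],
    "B": ["y", "x", "z"],
    "C": ["x", "z", "y"],
}
equal = weighted_vote(suggestions, {"A": 1.0, "B": 1.0, "C": 1.0})
skewed = weighted_vote(suggestions, {"A": 1.0, "B": 8.0, "C": 1.0})
```

With equal weights, "x" wins because two of the three classifiers rank it first. Once classifier B carries most of the weight, its top choice "y" overtakes "x", which is how learned weights change the consensus.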

12 SmartXAutofill Demo
• Editor for the element “Title”
• Drop-down box containing the best suggested values
• Pop-up menu for adding new elements
• Editor for the element “place”, which has two child elements, “room” and “bldg”
• Node Information: displays data about the currently selected element
• Suggestion Information: displays the top-ranked suggestions from each suggestor for the currently selected element
• Voting Information: displays a bar for each possible suggestion; colored components show the contribution of each suggestor in the vote
• Weight Information: displays the historical accuracy of each suggestor for the currently selected element

13 Testing Approach
• To span the size and complexity dimensions, XML document data were collected from 11 domains
  - APAIS, BioMed, CALL, iProClass, PSD, NASA, NREF, SPROT, UniProf, UWM, and WSU
  - Size ranged from around 50 to 5000 documents
  - Between 20 and 420 nodes per document
• Document collections were randomly separated into two sets: seed (10% of the collection or 100 documents) and training collections
• Seed collection: historical information for making predictions
• Training collection: trained the ensemble by modifying its weights based on the accuracy of the suggestions
  - Continuously trained the learning component and tested the system
  - Documents were randomly selected and all nodes were suggested in random order
  - Documents from the training collection were added to the seed collection after use
  - Note: a classifier does not make a suggestion for a particular field if there is no historical data for it or if every previous value for the field was unique, e.g. abstracts of papers
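
The test procedure can be pictured as a loop that predicts each field of a training document from the seed collection, scores the prediction, and then moves the document into the seed. The sketch below is a simplification under several assumptions: documents are flat dictionaries, a single stand-in classifier (most frequent past value) replaces the ensemble, the weight update is omitted, and the seed size is passed in explicitly because the slide does not say how "10% of the collection or 100 documents" is chosen. All function names are hypothetical.

```python
import random
from collections import Counter


def split_collection(documents, seed_size, rng):
    """Randomly separate a collection into seed and training sets."""
    shuffled = list(documents)
    rng.shuffle(shuffled)
    return shuffled[:seed_size], shuffled[seed_size:]


def most_frequent(history, field):
    """Stand-in classifier: most common past value, or None to abstain.

    Abstains when there is no history for the field or when every past
    value is distinct (a single past value counts as distinct here).
    """
    values = [doc[field] for doc in history if field in doc]
    if not values or len(set(values)) == len(values):
        return None
    return Counter(values).most_common(1)[0][0]


def run_trial(seed, training, rng):
    """Suggest every field of every training document in random order."""
    history = list(seed)
    attempted = correct = 0
    for doc in rng.sample(training, len(training)):
        fields = list(doc)
        rng.shuffle(fields)
        for field in fields:
            guess = most_frequent(history, field)
            if guess is None:
                continue  # no suggestion made for this field
            attempted += 1
            if guess == doc[field]:
                correct += 1
        history.append(doc)  # used documents join the seed collection
    return attempted, correct


rng = random.Random(0)
collection = [{"bldg": "Hall A", "room": "246"} for _ in range(10)]
seed, training = split_collection(collection, seed_size=2, rng=rng)
attempted, correct = run_trial(seed, training, rng)
```

In a fuller version, the point where `correct` is incremented is also where each internal classifier's weight for that node would be adjusted.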

14 Test Results for Different Domains

15 Weights of selected XML nodes from iProClass domain

16 Test Result Discussion
• Different classification algorithms perform better for different domains
• Ensemble classifier performed at least as well as the best performing internal classification algorithm for a domain
• Different classifiers are preferred for different nodes

17 Our Technology - SmartXAutofill
• First methodology proven to intelligently predict, suggest and autofill data for XML documents
• Learns and adapts itself to any XML domain without the need for custom algorithms
• “Time window” allows the technology to adapt itself to the particular set of XML documents being filled at that time
• Speeds up the data entry process for XML documents by 20% to 99%

