Duane Searsmith Automated Learning Group National Center for Supercomputing Applications University of Illinois Office: (217) 244-9129.


1 Text Mining with D2K/T2K
Duane Searsmith, Automated Learning Group, National Center for Supercomputing Applications, University of Illinois
dsears@ncsa.uiuc.edu, Office: (217) 244-9129, http://alg.ncsa.uiuc.edu
Michael Welge, Director, welge@ncsa.uiuc.edu
Loretta Auvil, Project Manager, lauvil@ncsa.uiuc.edu, (217) 265-8021
July 9, 2004

2 alg | Automated Learning Group
Outline
- Text Mining Brief Intro
  - Unsupervised
  - Supervised
  - Information Extraction
  - …
- ALG Technology Pieces
- Demonstrations
- Discussion

3 What is text mining?
In simplified and practical terms, it is the extraction of a relatively small amount of information of interest from a massive amount of text data. But …
- You might not know what you’re looking for. Discovering patterns in the haystack (clustering, mining associations).
- How to recognize a needle. Sifting through the haystack (model building, supervised learning).
- Just the facts, please. Enumerating the make and model of every needle (information extraction).

4 Common Tasks for Text Mining & Analysis
- Information retrieval
- Automatic grouping (clustering) of documents
- (Active) Classification
- Information extraction
- Topic detection and tracking
- Automatic summarization
- “Understanding” text and question answering
- Machine Translation

5 Text Preprocessing
Preprocessing (Text -> Numeric Representation)
- Tokenization
- Sentence Splitting
- Part-of-Speech Tagging
- Term Normalization (Stemming)
- Filtering (Stops)
- Chunking
- Term Extraction
- Filtering (Again)
- Term Weighting
- Other Transformations
Resource Taxing
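A few of the preprocessing steps above (tokenization, stop filtering, stemming, term weighting) can be sketched in plain Python. This is an illustration only and does not use T2K modules: the stop list, the crude suffix-stripping `stem` function, and the TF-IDF weighting scheme are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Illustrative stop list (assumption); real systems use much larger lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def tfidf(docs):
    """Turn raw documents into sparse {term: weight} vectors (TF * IDF)."""
    term_lists = [preprocess(d) for d in docs]
    n_docs = len(term_lists)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


print(preprocess("The clustering of documents"))  # ['cluster', 'document']
```

A term that occurs in every document gets weight zero under this scheme, which is one simple way to express the second filtering pass after term extraction.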

6 Clustering: Document Self-Organization
- Unsupervised method
- Agglomerative (bottom up)
- Quadratic time complexity
- Sampling
- Random Partition
- Hard vs. Soft
The notion basic to all of these approaches is some heuristic for measuring similarity between documents and document groups (term co-occurrence).
Diagram labels: Strongly Similar Arcs Kept; Weakly Similar Arcs Broken
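The "strongly similar arcs kept, weakly similar arcs broken" idea can be sketched as a similarity graph cut at a threshold: every document pair is compared (hence the quadratic cost), arcs at or above the threshold are kept, and the connected components become clusters. This is equivalent to cutting a single-link agglomerative clustering at that threshold. Cosine similarity over sparse term vectors, the threshold value, and all function names are assumptions for illustration, not the T2K implementation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_link_clusters(vectors, threshold):
    """Keep arcs with similarity >= threshold; return connected components."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of documents: O(n^2) similarity computations.
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)  # strongly similar arc kept

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


docs = [
    {"text": 1, "mining": 1},
    {"text": 1, "mining": 2},
    {"image": 1, "pixel": 1},
    {"image": 2, "pixel": 1},
]
print(single_link_clusters(docs, threshold=0.5))  # [[0, 1], [2, 3]]
```

This produces a hard clustering: each document lands in exactly one group. A soft method would instead allow a document to belong to several groups.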

7 How to Recognize a Needle
To classify your data you often need to build a model. To build a model you typically need examples from a “teacher”, metaphorically speaking. Finding good examples can be hard. T2K can also use active learning to help find good examples faster, making model building easier.
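One common form of active learning is uncertainty sampling: ask the "teacher" to label the examples the current model is least sure about. The slides do not say which strategy T2K uses, so the sketch below is an assumption. `most_uncertain` and the `predict_proba` callback (which returns the model's estimated probability of the positive class) are hypothetical names.

```python
def most_uncertain(unlabeled, predict_proba, batch_size=1):
    """Return the examples whose predicted probability is closest to 0.5."""
    ranked = sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))
    return ranked[:batch_size]


# Toy stand-in for a trained classifier: document id -> P(relevant).
scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10, "doc_d": 0.40}

# The two documents nearest the decision boundary go to the teacher next.
print(most_uncertain(list(scores), scores.get, batch_size=2))  # ['doc_b', 'doc_d']
```

In a full loop, the newly labeled examples would be added to the training set, the model retrained, and the selection repeated.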

8 Pattern Mining
- Finding frequent item sets -> Rule Discovery
- Many methods: Apriori, CHARM, FP-growth, CLOSET
- Working with Jiawei Han and students Hwanjo Yu and Xiaolei Li
- Application: topic tree construction
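Of the methods named above, Apriori is the simplest to sketch: grow candidate item sets one item at a time and discard any candidate that has an infrequent subset. The code below is a minimal, unoptimized illustration; treating each document as a set of terms and using an absolute support count are assumptions, and the function name is hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


docs_as_term_sets = [
    {"text", "mining", "cluster"},
    {"text", "mining"},
    {"text", "cluster"},
    {"image"},
]
result = apriori(docs_as_term_sets, min_support=2)
print(result[frozenset({"text", "mining"})])  # 2
```

Rules such as "mining -> text" can then be derived by comparing the support of an item set with the support of its subsets.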

9 Just the Facts Please
Finding a document that has the information you need is often not the end goal. To extract information you must first recognize it: you need to build a model, and that means you need to have examples.
Levels of IE: What’s hard and what’s harder?

10 D2K

11 D2K Overview
D2K Features
- Extension of existing API
  - Provides the capability to programmatically connect modules and set properties.
  - Allows D2K-driven applications to be developed.
  - Provides ability to pause and restart an itinerary.
- Enhanced Distributed Computing
  - Allows modules that are re-entrant to be executed remotely.
  - Uses Jini services to look up distributed resources.
  - Includes interface for specifying the runtime layout of a distributed itinerary.
- Processor Status Overlay
  - Shows utilization of distributed computing resources.
- Distributed Checkpointing
- Resource Manager
  - Provides a mechanism for treating selected data structures as if they were stored in global memory.
  - Provides memory space that is accessible from multiple modules running locally as well as remotely.
- Batch Processing / Web Services

12 D2K/T2K/I2K - Data, Text, and Image Analysis Information Visualization

13 The Technology Pieces
- The Engine (distributed, parallelized, persistent)
- Core Modules (building blocks)
- T2K is a specialized set of modules for text analysis
- I2K is a specialized set of modules for image analysis
- D2K Toolkit (rapid development environment)
- ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
- Other D2K-driven applications (StreamLined, EMO, …)
Diagram labels: D2K Engine; Core Modules; T2K; I2K; Applications; Toolkit

14 T2K Core 1.0 (Beta)
T2K Core
- Tokenization
- POS Tagging
- Stemming
- Chunking
- Filters
- Term Weighting
- Supervised / Unsupervised Learning
- GATE Integration
- Pattern Mining
- Text Streams
- Summarization

15 ThemeWeaver

16 ThemeWeaver: Prototype Text Clustering Application
- Hard clustering algorithms
  - Modified k-means (3 sampling methods)
- Soft clustering
  - Suffix tree based algorithm
- Can be used for longer documents
- Visualizations
  - “Single link” graph representation
  - Dendrogram cluster tree
  - Clusters over time
- Drill down and backtrack UI
- D2K/T2K Driven
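For reference, plain k-means (Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The sketch below is the textbook version, not ThemeWeaver's modified variant; the slides do not describe the modifications or the three sampling methods. Passing the initial centroids explicitly is an assumption made here to keep the example deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centroids, final_clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```

Each point ends up in exactly one cluster, which is what makes this a hard clustering method.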

17 The ALG Team
Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge
Students: Tyler Alumbaugh, Bradley Berkin, Jacob Biehl, John Cassel, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Young-Jin Lee, Xiaolei Li, Brian Navarro, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu

18 * Demo / Discussion *

