Duane Searsmith Automated Learning Group National Center for Supercomputing Applications University of Illinois Office: (217) 244-9129.

Presentation transcript:

Text Mining with D2K/T2K
Duane Searsmith, Automated Learning Group, National Center for Supercomputing Applications, University of Illinois. Office: (217)
Michael Welge, Director
Loretta Auvil, Project Manager, (217)
July 9, 2004

alg | Automated Learning Group
Outline
- Text Mining Brief Intro
  - Unsupervised
  - Supervised
  - Information Extraction
  - …
- ALG Technology Pieces
- Demonstrations
- Discussion

What is text mining?
In simplified, practical terms, it is the extraction of a relatively small amount of information of interest from a large mass of text data.
But …
- You might not know what you’re looking for: discovering patterns in the haystack (clustering, mining associations).
- How to recognize a needle: sifting through the haystack (model building, supervised learning).
- Just the facts, please: enumerating the make and model of every needle (information extraction).

Common Tasks for Text Mining & Analysis
- Information retrieval
- Automatic grouping (clustering) of documents
- (Active) Classification
- Information extraction
- Topic detection and tracking
- Automatic summarization
- “Understanding” text and question answering
- Machine Translation

Text Preprocessing
Preprocessing (Text -> Numeric Representation)
- Tokenization
- Sentence Splitting
- Part-of-Speech Tagging
- Term Normalization (Stemming)
- Filtering (Stops)
- Chunking
- Term Extraction
- Filtering (Again)
- Term Weighting
- Other Transformations
Resource Taxing
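A few of the steps listed above (tokenization, stop filtering, stemming, term weighting) can be sketched in plain Python. This is only an illustration of the general idea, not T2K code: the stop list, the crude suffix-stripping `stem` function, and the tf * log(N / df) weighting are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    term_lists = [preprocess(d) for d in docs]
    n_docs = len(term_lists)
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors

docs = [
    "Mining text is the extraction of patterns",
    "Clustering documents and mining associations",
    "The needle in the haystack",
]
vectors = tfidf(docs)
```

Terms that occur in many documents (here the stem shared by the first two) receive a lower weight than terms unique to one document, which is the usual motivation for the weighting step.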

Clustering: Document Self-Organization
- Unsupervised method
- Common to all of these approaches is some heuristic for measuring similarity between documents and document groups (term co-occurrence)
- Agglomerative (bottom up)
  - Quadratic time complexity
- Sampling
- Random Partition
- Hard vs. Soft
Figure labels: Strongly Similar Arcs Kept; Weakly Similar Arcs Broken
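The idea of keeping strongly similar arcs and breaking weak ones can be illustrated with a small sketch. It assumes cosine similarity over raw term counts and a fixed threshold, neither of which is stated in the slides; the resulting connected components are what single-link agglomerative clustering yields when cut at that threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_link_clusters(vectors, threshold):
    """Keep arcs with similarity >= threshold; clusters are connected components."""
    n = len(vectors)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Comparing every pair of documents is the quadratic cost.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

docs = [
    "text mining finds patterns in text",
    "mining patterns from text collections",
    "image analysis of satellite photos",
    "satellite image segmentation",
]
vectors = [Counter(d.split()) for d in docs]
clusters = single_link_clusters(vectors, threshold=0.3)
```

Raising the threshold breaks more arcs and produces more, smaller clusters; this is a hard clustering, since each document ends up in exactly one group.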

How to Recognize a Needle
To classify your data you often need to build a model. To build a model you typically need examples from a “teacher”, metaphorically speaking. Finding good examples can be hard. T2K can also use active learning to help find good examples faster, making model building easier.
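One common form of active learning is uncertainty sampling: the model asks the teacher to label the example it is least sure about. The slides do not say which strategy T2K uses, so the sketch below is only an illustration; the word-count model and the function names `train`, `prob_positive`, and `most_uncertain` are hypothetical.

```python
from collections import Counter

def train(labeled):
    """Count how often each word appears under each label (0 or 1)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in labeled:
        counts[label].update(text.lower().split())
    return counts

def prob_positive(counts, text):
    """Share of word evidence pointing to label 1, with add-one smoothing."""
    pos = neg = 1  # smoothing, so text with no known words scores 0.5
    for word in text.lower().split():
        pos += counts[1][word]
        neg += counts[0][word]
    return pos / (pos + neg)

def most_uncertain(counts, pool):
    """Uncertainty sampling: pick the pool item whose score is closest to 0.5."""
    return min(pool, key=lambda text: abs(prob_positive(counts, text) - 0.5))

labeled = [
    ("clustering documents by topic", 1),
    ("mining patterns from documents", 1),
    ("weather forecast for tuesday", 0),
    ("football scores and weather", 0),
]
pool = [
    "topic clustering of documents",
    "tuesday football forecast",
    "weather patterns and documents",
]
model = train(labeled)
query = most_uncertain(model, pool)  # the example to send to the teacher next
```

In a full loop, the teacher's label for `query` would be appended to `labeled`, the model retrained, and the next query selected, so that labeling effort goes to the examples that are most informative.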

Pattern Mining
- Finding frequent item sets -> Rule Discovery
- Many methods: Apriori, Charm, FPGrowth, CLOSET
- Working with Jiawei Han and students Hwanjo Yu and Xiaolei Li
- Application: topic tree construction
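Of the methods named above, Apriori is the simplest to sketch: frequent itemsets are grown one item at a time, and any candidate with an infrequent subset is pruned. The version below is a minimal textbook illustration, not the ALG implementation; the toy transactions and the `confidence` helper are assumptions added to show the step from itemsets to rules.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from support counts."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Each "transaction" is the set of terms in one toy document.
docs = [
    {"text", "mining", "cluster"},
    {"text", "mining", "model"},
    {"text", "cluster"},
    {"mining", "model"},
]
frequent = apriori(docs, min_support=2)
rule_conf = confidence(frequent, {"model"}, {"mining"})
```

Here every document containing "model" also contains "mining", so the rule model -> mining has confidence 1.0, while the pair {"text", "model"} is dropped for appearing only once.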

Just the Facts Please
Finding a document that has the information you need is often not the end goal. To extract information you must first recognize it: you need to build a model, and that means you need to have examples.
Levels of IE: What’s hard and what’s harder?

D2K

D2K Overview
D2K Features
- Extension of existing API
  - Provides the capability to programmatically connect modules and set properties.
  - Allows D2K-driven applications to be developed.
  - Provides ability to pause and restart an itinerary.
- Enhanced Distributed Computing
  - Allows modules that are re-entrant to be executed remotely.
  - Uses Jini services to look up distributed resources.
  - Includes interface for specifying the runtime layout of a distributed itinerary.
- Processor Status Overlay
  - Shows utilization of distributed computing resources.
- Distributed Checkpointing
- Resource Manager
  - Provides a mechanism for treating selected data structures as if they were stored in global memory.
  - Provides memory space that is accessible from multiple modules running locally as well as remotely.
- Batch Processing / Web Services

D2K/T2K/I2K - Data, Text, and Image Analysis
Information Visualization

The Technology Pieces
- The Engine (distributed, parallelized, persistent)
- Core Modules (building blocks)
- T2K is a specialized set of modules for text analysis
- I2K is a specialized set of modules for image analysis
- D2K Toolkit (rapid development environment)
- ThemeWeaver is an independent application that uses the D2K engine to run algorithms constructed from T2K modules. It is a demonstration platform.
- Other D2K driven applications (StreamLined, EMO, …)
Diagram labels: D2K Engine, Core Modules, T2K, I2K, Toolkit, Applications

T2K Core 1.0 (Beta)
T2K Core
- Tokenization
- POS Tagging
- Stemming
- Chunking
- Filters
- Term Weighting
- Supervised / Unsupervised Learning
- GATE Integration
- Pattern Mining
- Text Streams
- Summarization

ThemeWeaver

ThemeWeaver: Prototype Text Clustering Application
- Hard clustering algorithms
  - Modified K-means (3 sampling methods)
- Soft clustering
  - Suffix tree based algorithm
  - Can be used for longer documents
- Visualizations
  - “Single link” graph representation
  - Dendrogram cluster tree
  - Clusters over time
- Drill down and backtrack UI
- D2K/T2K Driven
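For reference, plain k-means (Lloyd's algorithm) seeded from a random sample of the data can be sketched as follows. The slides do not describe ThemeWeaver's modifications or its three sampling methods, so this shows only the unmodified baseline on toy two-dimensional points; the function name and parameters are assumptions.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on dense vectors, seeded from a random sample."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (hard clustering).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Two well-separated toy groups standing in for document vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assignment, centroids = kmeans(points, k=2)
```

With real documents the points would be term-weight vectors, and the choice of initial sample can change the result, which is one reason a sampling strategy matters.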

The ALG Team
Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge
Students: Tyler Alumbaugh, Bradley Berkin, Jacob Biehl, John Cassel, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Young-Jin Lee, Xiaolei Li, Brian Navarro, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu

* Demo / Discussion *