Text mining and machine learning: examples from life
Evgeny Klochikhin, PhD
American Institutes for Research
Tech Talk - DCDataFest 2015

Rule #1: TEXT IS NOT NUMBERS
Example: The down is falling down.
© 2015 Evgeny Klochikhin, PhD American Institutes for Research
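
A minimal sketch of why the example sentence resists purely numeric treatment. It assumes a simple lowercase, whitespace-based tokenization (an assumption for illustration, not something the talk specifies). Under a plain word count, the noun "down" (presumably feathers) and the adverb "down" end up in the same counter, so the count alone cannot tell the two senses apart.

```python
from collections import Counter

sentence = "The down is falling down."

# Assumed tokenization: lowercase, drop the final period, split on whitespace.
tokens = sentence.lower().rstrip(".").split()

# A plain frequency count keeps no information about word sense or position.
counts = Counter(tokens)

# Both uses of "down" are merged into a single entry.
print(counts["down"])
```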

Rule #2: METHOD DEPENDS ON APPLICATION
Use cases:
- Text categorization
- Validation of record linkage
- Knowledge discovery
- Document clustering and classification

Use case #1: Text categorization
- Where do the categories come from?
- Do we have a definite number of classes, or do we let the machine decide?
- Are there any additional variables (e.g. metadata)?
Choices: topic modeling, information retrieval, machine classification

Use case #2: Knowledge discovery
- Do we know what knowledge we want to discover?
- Is there a 'gold standard' data set, or ground truth?
Choices: information retrieval/NLP, active learning, machine classification

Rule #3: MAKE SURE SOFTWARE IS ROBUST
Examples:
- Topic modeling: Mallet vs gensim
- Explicit Semantic Analysis: EasyESA vs esalib2

Rule #4: NOTHING IS FULLY AUTOMATED
Humans should always be involved (curate, validate, ground truth)
Examples:
- General corpora: Mechanical Turk and Crowdflower
- Scientific corpora: expert curators

Implementation: usual steps
- Data collection
- Data organization
- Data cleaning
- Pre-processing: remove common stop words, tokenize, TF-IDF
- Apply method
- Post-processing: validation and evaluation
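
The pre-processing step (stop-word removal, tokenization, TF-IDF) can be sketched in plain Python. This is only an illustration under stated assumptions: the toy corpus, the stop-word list, the regex tokenizer and the particular weighting formula (term frequency times log(N / document frequency)) are all choices made here, not taken from the talk. Libraries typically use smoothed variants of this formula.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
docs = [
    "The goal of this project is topography estimation",
    "The food industry is searching for food safety technologies",
    "Edible films protect the food",
]
STOP_WORDS = {"the", "of", "this", "is", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


tokenized = [tokenize(d) for d in docs]
weights = tf_idf(tokenized)
```

In this sketch a term that occurs in every document gets weight zero, which is the intended effect: words that do not discriminate between documents carry no signal.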

TOPIC MODELING

What is text: 'bag-of-words'
Vector space representation of text: every word has its unique id (e.g., 'microscopy'=0, 'afm'=1, 'topography'=2, 'nanoscale'=3, etc.) and the number of occurrences within the document.

Example document:
Award : Systems Approach to Dynamic Atomic Force Microscopy
Abstract: The goal of this project is to establish a framework for model based simultaneous topography and parameter estimation in the amplitude modulation atomic force microscopy (AFM). Parametric models of tip-sample interaction that are amenable to real-time identification will be developed. Harmonic balance and power balance tools will be incorporated towards the estimation of the model parameters. The amplitude and phase dynamics based on the model will be developed, which will be used to validate the model with experimental data and subsequently used for control design purposes. These methods will be used to study yeast cells. A framework for non-parametric reconstruction of tip-sample interaction potential will be researched. Limitations on how well amplitude modulated AFM can decipher different sample interactions will be studied…

[Figure labels: # of instances, word IDs]
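
A minimal sketch of the bag-of-words representation described on this slide. The four word ids are the ones given on the slide. The regex tokenizer, the decision to ignore out-of-vocabulary words and the function name `bag_of_words` are assumptions made for illustration.

```python
import re
from collections import Counter

# Word ids from the slide; a real vocabulary would cover every word in the corpus.
vocabulary = {"microscopy": 0, "afm": 1, "topography": 2, "nanoscale": 3}

# Two fragments of the abstract quoted on the slide.
text = (
    "simultaneous topography and parameter estimation in the amplitude "
    "modulation atomic force microscopy (AFM). "
    "amplitude modulated AFM can decipher different sample interactions"
)


def bag_of_words(text, vocabulary):
    """Return sorted (word_id, count) pairs for the in-vocabulary words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())


vector = bag_of_words(text, vocabulary)
print(vector)  # [(0, 1), (1, 2), (2, 1)]
```

Each pair is (word id, number of instances). Word order is discarded, which is why the representation is called a "bag".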

What is topic modeling (D. Newman)
The topic model is an algorithm that automatically learns topics (themes) from a collection of documents.
- It works by observing words that tend to co-appear in documents, for example gene and dna, or climate and warming
- The topic model assumes each document exhibits multiple topics
- The topic model learns topics directly from the text
Each topic is displayed by showing its top-20 words, for example:
- dark_matter cosmological cosmology universe dark_energy lensing survey CMB redshift cosmic mass galaxy scale galaxies gravitational measurement power_spectrum parameter observation structure...
- This is a topic about Dark Matter, Dark Energy and Cosmology
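
The co-appearance signal mentioned above can be illustrated with a simple document-level co-occurrence count. This is not a topic model: it is only a sketch of the raw signal such a model exploits, and the toy documents and the function name `cooccurrence_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy documents, already tokenized.
docs = [
    ["gene", "dna", "sequence"],
    ["gene", "dna", "protein"],
    ["climate", "warming", "ocean"],
    ["climate", "warming", "gene"],
]


def cooccurrence_counts(docs):
    """Count, for each unordered word pair, how many documents contain both."""
    pairs = Counter()
    for doc in docs:
        # sorted(set(...)) makes each pair appear once per document, in a fixed order.
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs


pairs = cooccurrence_counts(docs)
print(pairs[("dna", "gene")], pairs[("climate", "warming")], pairs[("dna", "warming")])
```

Pairs such as gene/dna co-occur in several documents while dna/warming never do. A topic model goes further by assigning words and documents to probabilistic topics instead of counting pairs.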

Examples
Columns: abstract excerpt, top-3 topics, probability scores

Abstract 1: Engineering for food safety and quality
Excerpt: The food industry is one of the most conservative among industries in the United States; it is experiencing, like never before, the need for change, for innovation. Consumers are much more demanding and better educated in terms of food quality and nutritional aspects, regulatory agencies are searching for technologies that offer better products with greater safety…
Top-3 topics (probability score):
- pathogen foodborne safety farm contamination control intervention food-borne borne reduce (0.32)
- poultry campylobacter jejuni chicken salmonella broiler egg colonization avian vaccine (0.32)
- symptom abdominal treatment vomiting cramp protect patient dos vaccine testing (0.16)

Abstract 2: Edible coatings to improve food quality and food safety and minimize packaging cost
Excerpt: An edible film resembles plastic film wrap but is formed from renewable edible protein (e.g., milk protein) and/or polysaccharide (e.g., cornstarch). Edible films can be used as food wraps or formed into pouches for foods, thus reducing use of synthetic plastic films. Edible films can also be formed directly on the surfaces of the food as coatings to protect or enhance the food in some manner, becoming part of the food and remaining on the food through consumption...
Top-3 topics (probability score):
- produce fresh outbreak coli contamination pathogen spinach lettuce salmonella o (score missing)
- mycotoxin aflatoxin fungi fungal grain aspergillus feed flavus toxin fusarium (0.15)
- detection rapid phase method detect pathogen assay sensor sensitive biosensor (0.09)
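
The "top-3 topics" and "probability scores" columns can be read as the three largest entries of a document's topic distribution. The sketch below assumes such a distribution is already available as a dictionary. The short topic labels and the two smallest scores are hypothetical, while 0.32, 0.32 and 0.16 are the scores shown for the first abstract.

```python
# Hypothetical topic distribution for one document (topic label -> probability).
doc_topics = {
    "pathogen / foodborne": 0.32,
    "poultry / salmonella": 0.32,
    "packaging": 0.12,  # hypothetical
    "symptom / treatment": 0.16,
    "other": 0.08,  # hypothetical
}


def top_topics(doc_topics, n=3):
    """Return the n (label, probability) pairs with the highest probability."""
    ranked = sorted(doc_topics.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


top3 = top_topics(doc_topics)
for label, score in top3:
    print(f"{score:.2f}  {label}")
```

Because a document exhibits several topics at once, the top-3 scores usually sum to less than 1; the remainder is spread over the other topics.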

Software
MALLET - Sample steps:
- Import documents:
  bin/mallet import-dir --input /data/topic-input --output topic-input.mallet \
    --keep-sequence --remove-stopwords
- Build the model:
  bin/mallet train-topics --input topic-input.mallet \
    --num-topics [NUMBER] --output-state topic-state.gz
- Infer topics:
  bin/mallet infer-topics --inferencer-filename [FILENAME]