Published by Britton Cunningham
Text mining and machine learning: examples from life
Evgeny Klochikhin, PhD
American Institutes for Research
Tech Talk - DCDataFest 2015

Rule #1: TEXT IS NOT NUMBERS
Example: The down is falling down.
© 2015 Evgeny Klochikhin, PhD, American Institutes for Research

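One way to read the example on this slide: a plain word count treats both occurrences of "down" as the same token, although the first is a noun (feathers) and the second an adverb. The sketch below uses only the Python standard library; the tokenization rule is an assumption made for illustration, not something stated in the talk.

```python
import re
from collections import Counter

sentence = "The down is falling down."

# Lowercase and keep alphabetic runs only (an illustrative tokenizer).
tokens = re.findall(r"[a-z]+", sentence.lower())
counts = Counter(tokens)

# Both senses of "down" collapse into a single count.
print(counts["down"])
```

A count-based representation cannot tell the two senses apart, which is the kind of information loss the rule appears to warn about.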
Rule #2: METHOD DEPENDS ON APPLICATION
Use cases:
- Text categorization
- Validation of record linkage
- Knowledge discovery
- Document clustering and classification

Use case #1: Text categorization
- Where do the categories come from?
- Do we have a definite number of classes, or do we let the machine decide?
- Are there any additional variables (e.g. metadata)?
Choices: topic modeling, information retrieval, machine classification

Use case #2: Knowledge discovery
- Do we know what knowledge we want to discover?
- Is there a ‘gold standard’ data set, or ground truth?
Choices: information retrieval/NLP, active learning, machine classification

Rule #3: MAKE SURE SOFTWARE IS ROBUST
Examples:
- Topic modeling: MALLET vs gensim
- Explicit Semantic Analysis: EasyESA vs esalib2

Rule #4: NOTHING IS FULLY AUTOMATED
Humans should always be involved (curate, validate, ground truth)
Examples:
- General corpora: Mechanical Turk and Crowdflower
- Scientific corpora: expert curators

Implementation: usual steps
- Data collection
- Data organization
- Data cleaning
- Pre-processing: remove common stop words, tokenize, TF-IDF
- Apply method
- Post-processing: validation and evaluation

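The pre-processing step listed above (stop-word removal, tokenization, TF-IDF) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the stop list, the tokenizer, the sample documents, and the `tf * log(N / df)` weighting variant are all chosen here for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "of", "a", "and", "to", "in", "for", "will", "be"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        weights.append({t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

# Made-up sample documents for illustration only.
docs = [
    "Food safety and pathogen control",
    "Atomic force microscopy of yeast cells",
    "Pathogen detection in food",
]
weights = tfidf(docs)
print(weights[0])
```

Terms that appear in fewer documents (such as "safety" here) receive a higher weight than terms shared across documents (such as "food").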
TOPIC MODELING

What is text: ‘bag-of-words’
Vector space representation of text: every word has its unique id (e.g., ‘microscopy’=0, ‘afm’=1, ‘topography’=2, ‘nanoscale’=3, etc.) and the number of occurrences within the document.
Example document:
Award 0814615: Systems Approach to Dynamic Atomic Force Microscopy
Abstract: The goal of this project is to establish a framework for model based simultaneous topography and parameter estimation in the amplitude modulation atomic force microscopy (AFM). Parametric models of tip-sample interaction that are amenable to real-time identification will be developed. Harmonic balance and power balance tools will be incorporated towards the estimation of the model parameters. The amplitude and phase dynamics based on the model will be developed, which will be used to validate the model with experimental data and subsequently used for control design purposes. These methods will be used to study yeast cells. A framework for non-parametric reconstruction of tip-sample interaction potential will be researched. Limitations on how well amplitude modulated AFM can decipher different sample interactions will be studied…
[Figure: the abstract represented as word IDs paired with # of instances]

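The bag-of-words idea described above, where each word gets a unique id and a count per document, can be sketched as follows. The function names and the two short sample documents are hypothetical; only the four example words and their ids come from the slide.

```python
import re
from collections import Counter

def build_vocabulary(docs):
    """Assign each distinct word an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in re.findall(r"[a-z]+", doc.lower()):
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bag_of_words(doc, vocab):
    """Return sorted (word_id, count) pairs for one document."""
    words = re.findall(r"[a-z]+", doc.lower())
    counts = Counter(vocab[w] for w in words if w in vocab)
    return sorted(counts.items())

# Hypothetical mini-corpus built from the example words on the slide.
docs = ["microscopy afm topography nanoscale", "afm topography afm"]
vocab = build_vocabulary(docs)
bow = to_bag_of_words(docs[1], vocab)
print(vocab)
print(bow)
```

Word order is discarded: the second document is reduced to "id 1 appears twice, id 2 appears once".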
What is topic modeling (D. Newman)
The topic model is an algorithm that automatically learns topics (themes) from a collection of documents:
- It works by observing words that tend to co-appear in documents, for example gene and dna, or climate and warming
- The topic model assumes each document exhibits multiple topics
- The topic model learns topics directly from the text
Each topic is displayed by showing its top-20 words, for example:
- dark_matter cosmological cosmology universe dark_energy lensing survey CMB redshift cosmic mass galaxy scale galaxies gravitational measurement power_spectrum parameter observation structure...
- This is a topic about Dark Matter, Dark Energy and Cosmology

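The co-appearance signal mentioned above can be illustrated by counting how often two words occur in the same document. This is not a topic model, only a sketch of the raw evidence such a model draws on; the four tokenized documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up tokenized documents for illustration only.
docs = [
    ["gene", "dna", "sequence"],
    ["gene", "dna", "protein"],
    ["climate", "warming", "ocean"],
    ["climate", "warming", "gene"],
]

pair_counts = Counter()
for tokens in docs:
    # Count each unordered word pair once per document.
    for pair in combinations(sorted(set(tokens)), 2):
        pair_counts[pair] += 1

top_pairs = pair_counts.most_common(2)
print(top_pairs)
```

In this toy corpus the pairs (dna, gene) and (climate, warming) co-appear most often, which is the pattern a topic model would group into two themes.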
Examples

Abstract: Engineering for food safety and quality
Excerpt: The food industry is one of the most conservative among industries in the United States; it is experiencing, like never before, the need for change, for innovation. Consumers are much more demanding and better educated in terms of food quality and nutritional aspects, regulatory agencies are searching for technologies that offer better products with greater safety…
Top-3 topics (probability score):
- 0.32: pathogen foodborne safety farm contamination control intervention food-borne borne reduce
- 0.32: poultry campylobacter jejuni chicken salmonella broiler egg colonization avian vaccine
- 0.16: symptom abdominal treatment vomiting cramp protect patient dos vaccine testing

Abstract: Edible coatings to improve food quality and food safety and minimize packaging cost
Excerpt: An edible film resembles plastic film wrap but is formed from renewable edible protein (e.g., milk protein) and/or polysaccharide (e.g., cornstarch). Edible films can be used as food wraps or formed into pouches for foods, thus reducing use of synthetic plastic films. Edible films can also be formed directly on the surfaces of the food as coatings to protect or enhance the food in some manner, becoming part of the food and remaining on the food through consumption...
Top-3 topics (probability score):
- 0.53: produce fresh outbreak coli contamination pathogen spinach lettuce salmonella o157
- 0.15: mycotoxin aflatoxin fungi fungal grain aspergillus feed flavus toxin fusarium
- 0.09: detection rapid phase method detect pathogen assay sensor sensitive biosensor

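Picking the top-3 topics for a document, as in the examples above, amounts to sorting that document's topic probabilities. A minimal sketch, assuming the model output is available as a mapping from topic id to probability; the function name, the topic ids, and the numbers are hypothetical.

```python
def top_topics(doc_topics, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    return sorted(doc_topics.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical distribution over five topic ids for one document.
doc_topics = {0: 0.05, 1: 0.53, 2: 0.15, 3: 0.09, 4: 0.18}
best = top_topics(doc_topics)
print(best)
```

Each returned topic id would then be displayed through its top words, as on the slide.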
Software
MALLET - http://mallet.cs.umass.edu/
Sample steps:
- Import documents:
    bin/mallet import-dir --input /data/topic-input --output topic-input.mallet \
      --keep-sequence --remove-stopwords
- Build the model:
    bin/mallet train-topics --input topic-input.mallet \
      --num-topics 100 --output-state topic-state.gz
- Infer topics:
    bin/mallet infer-topics --inferencer-filename [FILENAME]

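The first two MALLET steps above could be scripted from Python by assembling the same command lines as argument lists. This is a sketch only: the helper function names are hypothetical, the flags are exactly those shown on the slide, and nothing is executed here. Running the commands would additionally require a local MALLET installation.

```python
import shlex

def import_dir_command(input_dir, output_file):
    """Argument list for the 'Import documents' step shown on the slide."""
    return ["bin/mallet", "import-dir", "--input", input_dir,
            "--output", output_file, "--keep-sequence", "--remove-stopwords"]

def train_topics_command(input_file, num_topics, state_file):
    """Argument list for the 'Build the model' step shown on the slide."""
    return ["bin/mallet", "train-topics", "--input", input_file,
            "--num-topics", str(num_topics), "--output-state", state_file]

commands = [
    import_dir_command("/data/topic-input", "topic-input.mallet"),
    train_topics_command("topic-input.mallet", 100, "topic-state.gz"),
]
for cmd in commands:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping the arguments as lists rather than one shell string avoids quoting problems when paths contain spaces.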