Unstructured Machine Learning: Providing the link between Genetic Data and Published Research
Dr Tony C Smith, Reel Two, Inc.


1 Unstructured Machine Learning: Providing the link between Genetic Data and Published Research
Dr Tony C Smith
Reel Two, Inc.
9 Hartley Street, Hamilton, New Zealand
+64 7 839 7808
www.reeltwo.com

2 What is Machine Learning?
- creating computer programs that get better with experience
- learning how to make expert judgments
- discovering previously hidden, potentially useful information (data mining)
How does it work?
- the user provides the learning system with examples of the concept to be learned
- an induction algorithm infers a characteristic model of the examples
- the model is used to predict whether or not future novel instances are also examples, and it does this very consistently and very, very quickly!

3 Structured Learning
Mushroom Data
Weight   Damage   Dirt    Firmness   Quality
heavy    high     mild    hard       poor
heavy    high     mild    soft       poor
normal   high     mild    hard       good
light    medium   mild    hard       good
light    clear    clean   hard       good
normal   clear    clean   soft       poor
heavy    medium   mild    hard       poor
...
[Figure: decision tree that tests weight (heavy, light, normal), dirt (mild, clean) and firmness (hard, soft), with leaves labelled good or poor]
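As an illustration of the induction step on structured data, here is a minimal sketch that grows a decision tree from the seven mushroom rows listed on this slide. It uses a plain entropy-based (ID3-style) split rule, which is an assumption: the slides do not say which induction algorithm produced the tree in the figure, and the names `build_tree` and `predict` are hypothetical. With only seven rows, the resulting tree need not match the one on the slide.

```python
from collections import Counter
from math import log2

FIELDS = ("weight", "damage", "dirt", "firmness")

# The seven rows shown on the slide; the last value is the quality label.
ROWS = [
    ("heavy", "high", "mild", "hard", "poor"),
    ("heavy", "high", "mild", "soft", "poor"),
    ("normal", "high", "mild", "hard", "good"),
    ("light", "medium", "mild", "hard", "good"),
    ("light", "clear", "clean", "hard", "good"),
    ("normal", "clear", "clean", "soft", "poor"),
    ("heavy", "medium", "mild", "hard", "poor"),
]


def entropy(rows):
    """Shannon entropy (in bits) of the quality labels in rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())


def partition(rows, column):
    """Group rows by the value they hold in the given column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    return groups


def build_tree(rows, columns):
    """Recursively pick the column whose split leaves the least entropy."""
    labels = Counter(row[-1] for row in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not columns:
        return majority  # leaf: a single quality label

    def remaining_entropy(column):
        groups = partition(rows, column).values()
        return sum(len(g) / len(rows) * entropy(g) for g in groups)

    best = min(columns, key=remaining_entropy)
    rest = [c for c in columns if c != best]
    branches = {value: build_tree(group, rest)
                for value, group in partition(rows, best).items()}
    return {"column": best, "branches": branches, "default": majority}


def predict(tree, example):
    """Walk the tree; unseen values fall back to the node's majority label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["column"]], tree["default"])
    return tree


TREE = build_tree(ROWS, list(range(len(FIELDS))))
print("first test:", FIELDS[TREE["column"]])
print("novel mushroom:", predict(TREE, ("light", "high", "mild", "soft")))
```

On these seven rows the first attribute tested is weight, because heavy and light mushrooms each carry a single quality label. The last line classifies a combination of values that does not appear in the table, which is the "predict future novel instances" step.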

4 Unstructured Learning
- data does not have fixed fields with specific values
- examples: images, continuous signals, expression data, text
- learning proceeds by correlating the presence or absence of any and all salient attributes
Document Classification
- given examples of documents covering some topic, learn a semantic model that can recognize whether or not other documents are relevant
- prioritize them, i.e. quantify “how relevant” documents are to the topic
- not limited to keywords (nor misled by them)
- adapts to the user’s needs (ephemeral or long-term)

5 How Text Mining Works
- Users supply the system with training data: documents that are good examples of the desired category
- The system builds ‘classifiers’: statistical models based on the training data
- The system classifies novel data: it identifies other documents about the desired category
- Results are displayed or stored: files can be viewed, routed to end users or stored in databases
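The four steps on this slide can be sketched in a few lines. The example below is a minimal stand-in, not the Reel Two system: it assumes a two-class multinomial naive Bayes model with add-one smoothing (the slides do not name the statistical model actually used), and the class name `RelevanceClassifier`, its methods, and the training sentences are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class RelevanceClassifier:
    """Two-class multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        """Step 1: the user supplies an example document and its label."""
        self.doc_counts[relevant] += 1
        self.word_counts[relevant].update(tokenize(text))

    def score(self, text):
        """Steps 2-3: log-odds that text belongs to the category."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        totals = {c: sum(self.word_counts[c].values()) for c in (True, False)}
        n_docs = sum(self.doc_counts.values())

        def log_prob(label):
            prior = (self.doc_counts[label] + 1) / (n_docs + 2)
            total = math.log(prior)
            for word in tokenize(text):
                if word in vocab:  # words never seen in training carry no evidence
                    count = self.word_counts[label][word] + 1
                    total += math.log(count / (totals[label] + len(vocab)))
            return total

        return log_prob(True) - log_prob(False)

    def rank(self, documents):
        """Step 4: order documents from most to least relevant."""
        return sorted(documents, key=self.score, reverse=True)


clf = RelevanceClassifier()
clf.train("p38 MAP kinase activation by phosphorylation", relevant=True)
clf.train("kinase cascade phosphorylates target protein", relevant=True)
clf.train("mushroom harvest weight and firmness", relevant=False)
clf.train("patent filing for a garden tool", relevant=False)

NEW_DOCS = ["garden mushroom harvest", "activation of the kinase cascade"]
for doc in clf.rank(NEW_DOCS):
    print(f"{clf.score(doc):+.2f}  {doc}")
```

A positive score means the evidence favours the category and a negative score means it does not, so sorting by score gives the kind of relevance ranking the slides describe. A production system would need far more training data and a richer model than this sketch.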

6 Classification System
- Client-specific categories
- Familiar Windows-style interface
- Drag-and-drop documents to create custom categories
- Classified documents are ranked by relevance
- View contents of individual documents: sentences are highlighted by their relevance to the category

7 The Gene Ontology: A Good First Step
The Initial Problem: Individual curators evaluate data differently
- Protein Modification
- MAPK-KK Cascade
- Activation of p38 MAP Kinase
"While scientists can agree to use the word "kinase," they must also agree to support this by stating how and why they use "kinase," and consistently apply it. Only in this way can they hope to compare gene products and find out if and how they are related."
The Initial Solution: The Gene Ontology (GO), a controlled vocabulary with defined relationships between items
- GO consists of more than 13,000 nodes, or ‘GO Terms’, divided into three main trees: Biological Process, Cellular Component and Molecular Function
- Of these, only about 3,800 GO Terms are ‘active’, that is, terms annotated with more than just one or two publications

8 GO KDS: Filling the gaps in GO
GO is only a partial solution
- Enormous gap between GO-annotated documents (27,000) and the full MEDLINE database (12 million entries)
- Updates lag behind
- Scientists must understand and agree to use the GO
- Knowledge changes and alters definitions
- Using GO “as is” takes too long and delivers too little
The Gene Ontology Knowledge Discovery System (GO KDS) bridges the gap by classifying all of MEDLINE
- New documents are classified as they’re added
- Scientists can now annotate gene targets quickly and reliably
- GO KDS is updated along with GO and MEDLINE

9 GO KDS Interface Tour
- Current GO term(s) open
- Location of listed term in GO
- All sub-terms for the listed term: click on a term to further refine your search
- Enter a keyword to search in this GO category
- Opens abstract in separate window
- Color of stars identifies the GO branch; number of stars indicates confidence of category placement
- Original GO classifications (by domain expert)
- KDS discovers novel classifications

10 GO KDS Key Benefits
- Quickly sort documents into the categories most relevant to the user
- Replace laborious annotation by domain experts with a trainable, automated system
- Discover conceptual links between previously unrelated scientific domains
- Identify key articles for pertinent research
- Integrate public, private and proprietary documents
www.go-kds.com

11 How is document classification useful?
Drug Approval
- Collecting information
- Organizing/Collating documents
- Satisfying approval criteria
Life Science Research
- Finding relevant literature
- Prioritizing articles/reports
- Discovering hidden connections
- Distributing information
Patent preparation
- Searching patent databases
- Collecting relevant documents
- Synthesizing information

12 Intelligent Text Mining: Therapeutic Courses
One Reel Two client is using the Classification System to rapidly sort through large volumes of medical documentation in disparate therapeutic areas.
The Problem: Client must generate E-Learning Courses from hundreds of pages of reports, literature and product documentation supplied by client
Old Solution: Manually read through documents to find paragraphs related to ‘Diagnosis’, ‘Etiology’, ‘Epidemiology’, etc.
New Solution: Use the Reel Two Classification System to build a custom taxonomy, then automatically classify and extract relevant document sections into Therapeutic Area categories

13 Intelligent Text Mining: Patent Analysis
Search patent filings for the ideas or concepts behind one’s analysis
- Explore state of prior art, competitive landscape or ‘innovation gaps’
- Overcome intentionally vague language in patent filings
Example Project: Identifying ‘Mechanism of Action’ in life science patents
Patents are classified according to a taxonomy built by the client:
Alzheimer’s Patents
- MoA: 5-HT Inhibitor
- MoA: Acetylcholinesterase
- MoA: Antioxidant
- MoA: Antiviral…
Sample Output
In an in vitro assay, 2-chloro-5-(3-(R)-pyrrolidinylmethoxy)-3-pyridinecarbaldoxime (Ia) exhibited a Ki value for binding to neuronal nicotinic acetylcholine receptors of 0.012 nM.
ACTIVITY - Analgesic; neuroprotective; nootropic; antiparkinsonian; neuroleptic; tranquilizer; antiinflammatory; antidepressant; anabolic; anorectic; anticonvulsant; uropathic; gastrointestinal; antiaddictive; gynecological.
MECHANISM OF ACTION - Neurotransmitter release modulator.
The Mechanism of Action listed for this patent is "Neurotransmitter release modulator." However, the Classification System identified that this chemical modulator binds to the acetylcholine receptor, which is the true mechanism of action, and classified this patent in “MoA: Acetylcholinesterase”.

14 “Life Science Information Management will form the largest unmet need for IT companies in the 21st Century”
Caroline Kovac, General Manager, IBM Life Sciences

15 Appendix: GO KDS Interface
1. Search for a particular GO term by opening one of the main branches.

16 Appendix
2. ‘Drill down’ through the taxonomy to find a term of interest. Click on that term.

17 Appendix
3. Select the desired GO term. ‘Open’ the category by clicking on ‘new search with this term’.

18 Appendix
4. Scroll down to view abstracts.

19 Appendix
5. Discover conceptual links to other GO categories. Click on the category to add the term to your search.

20 Appendix
6. View the data intersection between GO categories. Scroll through to view abstracts.

21 Appendix
7. GO terms identify concepts embodied in the abstracts, enabling quick review.

22 Appendix
8. Select an abstract of interest, and click to open the complete abstract.

23 Appendix
9. The abstract will open in a new window, allowing you to continue with your search, or to link directly to the journal.

