A Distributed Framework for Computation on the Results of Large Scale NLP

Presentation transcript:

A Distributed Framework for Computation on the Results of Large Scale NLP Christophe Roeder, William A. Baumgartner Jr., Kevin Livingston, Lawrence Hunter (UC Denver)

Questions that could be answered using large corpora
Second source of data for validation/corroboration
– Ligand binding site validation
– Verspoor et al.
Rough ideas/leads to protein-protein interactions (PPI) from co-occurrence
Protein co-occurrence fraction for use in Hanalyzer networks
Mine more, and more recent, knowledge than is available from curated ontologies

Available Tools and Data
Data
– Large corpora: PMC OA, publisher-arranged collections
– Curated ontologies: PRO, GO, etc.
Tools
– UIMA for NLP processing
– Batch schedulers (SGE, Torque) to scale UIMA
– Hadoop to collate data
– RDF to represent knowledge
– Triple store (Franz AllegroGraph) to store and access large amounts of RDF data

Bio Trends: a Sample Integration Project
Function
– Count occurrences of proteins in articles
– Collate by date, and display in a web app
Design
– UIMA over SGE for protein ID, store in RDF files
– Read RDF files and collate with Hadoop
  Call out to AllegroGraph for ID and attribute lookup
– Format resulting data as JSON for availability to the web app
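The collate-and-format steps of this design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `mentions` list, the `protein_A`/`protein_B` identifiers, and the JSON layout are hypothetical stand-ins, not the project's actual RDF data or web-app format, and the Hadoop step is replaced by a single in-process loop.

```python
import json
from collections import Counter

# Hypothetical input: (publication_year, protein_id) pairs standing in for
# the protein mentions that the annotation step would have written out.
mentions = [
    (2008, "protein_A"),
    (2008, "protein_A"),
    (2009, "protein_A"),
    (2009, "protein_B"),
]


def collate_by_year(pairs):
    """Count mentions per protein, broken down by year."""
    counts = {}
    for year, protein in pairs:
        counts.setdefault(protein, Counter())[year] += 1
    return counts


def to_json(counts):
    """Serialize the collated counts; the layout is an assumed one."""
    ordered = {
        protein: dict(sorted(by_year.items()))
        for protein, by_year in sorted(counts.items())
    }
    return json.dumps(ordered)


trend_json = to_json(collate_by_year(mentions))
print(trend_json)
```

Note that `json.dumps` turns the integer years into string keys, which is what a web client would receive.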

Prepare Available Data
Start with raw text: PMC Open Access
– 250k full-text journal articles
Identify (annotate) interesting spans (genes)
– UIMA pipeline; NERs: ABNER, BANNER, etc.; concept mapper on PRO dictionary to normalize
– Output to RDF for various uses

Options to Analyze Data
Load into triple store and query
– Necessary for exploring queries with complex results over the entire graph
– Ex.
Load individual files into in-memory store and query in small groups
– Possible for exploring simple queries over many small regions of the graph: article related
– Easier to federate
Hybrid
– Some data is not available from the RDF files, only from the triple store
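The second option, querying many small per-article graphs independently, can be illustrated with a toy in-memory store. This is a sketch, not a real RDF library or the AllegroGraph API: triples are plain Python tuples, and the `mentions` and `publishedIn` predicates and all identifiers are made up for the example.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])
    ]


def query_each(files, **pattern):
    """Run the same simple pattern against every small per-article graph."""
    return {name: match(triples, **pattern) for name, triples in files.items()}


# Hypothetical per-article triple sets, one per "file".
files = {
    "article1": [
        ("article1", "mentions", "protein_A"),
        ("article1", "publishedIn", "2009"),
    ],
    "article2": [
        ("article2", "mentions", "protein_B"),
        ("article2", "mentions", "protein_A"),
    ],
}

hits = query_each(files, p="mentions", o="protein_A")
print(hits)
```

Because each article's graph is queried on its own, the per-file calls are independent of one another, which is what makes this option easy to spread over many workers.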

Map-Reduce
Inspired by the Lisp functions “map” and “reduce”
– Map applies a function to each element of a list
  (a1, a2, …, an), f(x) → (f(a1), f(a2), …, f(an))
– Reduce combines the elements of a list by applying a function successively
  (a1, a2, …, an), f(x,y) → f(…f(f(a1,a2),a3)…, an)
  (1, 2, 3, 4), + → (((1+2) + 3) + 4)
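Python's built-in `map` and `functools.reduce` behave the same way, so the two definitions above can be tried directly. The squaring function and the four-element list are arbitrary choices for illustration.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]

# map: apply f to every element independently.
squares = list(map(lambda x: x * x, numbers))

# reduce: fold the list from the left, two values at a time.
total = reduce(add, numbers)

# Same fold, but building a string to make the nesting order visible.
grouping = reduce(lambda x, y: f"({x}+{y})", [str(n) for n in numbers])

print(squares, total, grouping)
```

The `grouping` string shows that the fold is left-associative: each step combines the running result with the next element.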

Map Reduce on HashMaps
Map can be used to transform from one kind of (key, value) to a different kind of (key, value)
– (filename, text) → (gene, count)
Reduce, as used here, has the same kind of key and value for its output as for its input. A call to reduce gets all values for a particular key.
– (gene, count) → (gene, count)
– (BRCA1, 1), (BRCA1, 3), (BRCA1, 1) → (BRCA1, 5)
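A single-process sketch of this key-value flow follows. It assumes a hypothetical two-gene dictionary (`GENES`), whitespace tokenization, and made-up file contents chosen so that BRCA1 is counted 1, 3, and 1 times; a real run would use the annotated corpus, and the grouping step would be done by the framework.

```python
from collections import Counter, defaultdict

GENES = {"BRCA1", "TP53"}  # hypothetical dictionary of gene names


def map_fn(filename, text):
    """(filename, text) -> one (gene, count) pair per gene in the file."""
    counts = Counter(tok for tok in text.split() if tok in GENES)
    yield from counts.items()


def reduce_fn(gene, counts):
    """(gene, [count, ...]) -> (gene, count): all values for one key."""
    return gene, sum(counts)


def run(files):
    grouped = defaultdict(list)  # the "shuffle": group values by key
    for filename, text in files.items():
        for gene, count in map_fn(filename, text):
            grouped[gene].append(count)
    return dict(reduce_fn(gene, counts) for gene, counts in grouped.items())


# Made-up inputs for illustration only.
files = {
    "a.txt": "BRCA1 binds TP53",
    "b.txt": "BRCA1 BRCA1 and BRCA1",
    "c.txt": "BRCA1 alone",
}
totals = run(files)
print(totals)
```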

Hadoop: a distributed map-reduce on maps or hash tables
Can divide into parallel-friendly tasks by key
Distributes files over the network
Reduces network traffic by performing computation where the data is
Map is used to move from one key-value type to another.
– From (filename => contents) to (protein-protein, co-occurrence count)
Reduce is used to collate results.
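The co-occurrence job and the idea of dividing work by key can be simulated in one process. This sketch does not use Hadoop's API: the protein dictionary, the file contents, the CRC32-based `partition` function, and the choice of two reducers are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from itertools import combinations

PROTEINS = {"BRCA1", "TP53", "RAD51"}  # hypothetical dictionary


def map_cooccurrence(filename, contents):
    """(filename, contents) -> ((protein, protein), 1) per pair in the file."""
    found = sorted({tok for tok in contents.split() if tok in PROTEINS})
    for pair in combinations(found, 2):
        yield pair, 1


def partition(key, num_reducers):
    """Assign a key to a reducer; all values for one key land together."""
    return zlib.crc32("|".join(key).encode("utf-8")) % num_reducers


def run(files, num_reducers=2):
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for filename, contents in files.items():
        for key, value in map_cooccurrence(filename, contents):
            buckets[partition(key, num_reducers)][key].append(value)
    result = {}
    for bucket in buckets:  # each bucket could be reduced on its own machine
        for key, values in bucket.items():
            result[key] = sum(values)
    return result


# Made-up inputs for illustration only.
files = {
    "article1.txt": "BRCA1 interacts with RAD51",
    "article2.txt": "TP53 and BRCA1 and RAD51",
    "article3.txt": "TP53 alone",
}
cooccurrence = run(files)
print(cooccurrence)
```

Sorting the proteins before pairing gives each unordered pair a single canonical key, so (BRCA1, RAD51) and (RAD51, BRCA1) are never counted separately.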

Results
[Figure panels: PMC OA; Medline Abstracts]

Screen Shot Grants:

Thank You / Questions
Co-authors
– William Baumgartner for data generation
– Kevin Livingston for RDF and Clojure help
Grants and PIs
– Larry Hunter, UCDenver SOM NIH 2R01LM , NIH 2R01LM A1, NIH 5R01GM
– Karin Verspoor, UCDenver SOM NIH R01 LM
– Gully Burns, ISI NSF