The Best Way to Get BIG DATA is By Starting Small Dr. Brand Niemann Director and Senior Data Scientist Semantic Community for Johns Hopkins University School of Medicine and Modus Operandi December 12,
BIG DATA The new Digital Government Strategy is "treating all content as data." So big data = all your content: – But just a small sample to start a pilot. There are many Big Data Technologies to choose from and many early adopters are finding them more expensive than expected: – Use open source-free trials to pilot. There are many Big Data Problems to solve that could “boil the ocean”: – Use a data scientist to help build a team and community for a fast, inexpensive, and small semantic data science pilot. 2
Subcommittee on Networking and Information Technology Research and Development (NITRD Subcommittee) 3 These three activities fostered Semantic Medline on the YarcData Graph Appliance for the White House Big Data Initiative.
Data Science Team Example: Chief Data Science Officer Chief Data Science Officer: – Dr. George Strawn, Director, White House OSTP NITRD/NCO: Semantic Medline could be the “killer” Semantic Web application for the US Federal Government Data Science Team: – Dr. Brand Niemann, Lead – Dr. Tom Rindflesch, NLM Semantic Medline Creator – Professor Kirk Borne, George Mason University Federal Big Data Senior Steering WG Workforce Training Initiative – Tim White, Director, YarcData Federal Global Head – Aaron Bossett, YarcData Federal Solution Architect – Dr. Eric Little, Modus Operandi Chief Scientist 4
Generic Problems How to get Big Data: – Unstructured (Natural Language Processing to Graph-RDF Triples) and Structured (Relational-RDF Triples) Where to store Big Data: – Graph-RDF Triples and Relational What to show about Big Data: – Statistics, Visualizations, and Network Graphs Note: RDF Triples make Big Data smaller, smarter, and integrated! – Semantic Medline on the YarcData Graph Appliance is an example of the best content on the best graph data store with the best visualization results so far (in my humble opinion)! Our Semantic Data Science Team delivered this for the recent White House Big Data Event: See Making the Most of Big Data 5
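As a minimal sketch of the "Relational-RDF Triples" step mentioned above, the Python below turns one relational row into subject-predicate-object triples and prints them in an N-Triples-like form. The namespace `http://example.org/`, the table name, the column names, and the helper names `row_to_triples` and `to_ntriples` are all assumptions made for illustration; they do not come from the presentation, and literal escaping is deliberately left out.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace, used only for this illustration.
BASE = "http://example.org/"


def row_to_triples(table: str, key: str, row: Dict[str, str]) -> List[Triple]:
    """Map one relational row to triples: the key column names the subject,
    every other column becomes a predicate with a literal object."""
    subject = f"<{BASE}{table}/{row[key]}>"
    triples: List[Triple] = []
    for column, value in row.items():
        if column == key:
            continue  # the key is already encoded in the subject
        predicate = f"<{BASE}{table}#{column}>"
        # Simplification: no escaping of quotes or newlines in the literal.
        triples.append((subject, predicate, f'"{value}"'))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical row from a hypothetical "substance" table.
row = {"id": "42", "name": "glutamate", "kind": "neurotransmitter"}
triples = row_to_triples("substance", "id", row)
print(to_ntriples(triples))
```

Each non-key column yields one triple, so a table can gain a column without any schema migration on the triple side; the new column simply shows up as a new predicate.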
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Work Flow 6
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Semantic Medline Database Application 7 See More Information:
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Visualization and Linking to Original Text 8
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Bioinformatics Publication 9 My Note: MySQL database for non-commercial use.
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Semantic Medline at NIH-NLM Current : Web based research tool. Transition: Current systems re-engineered to leverage Urika (less than 5 days). Purpose: Build a platform for users to perform increasingly complex analysis. Immediate Requirement : Replicate current capability. Future: Allow for increasingly complex analysis. Ability to capture and share analytics in addition to sharing data. Tailor Urika to less complex queries. 10
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Graphs and Traditional Technologies Square peg, round hole: Current technology does not support efficient representation, storage, and interaction with complex graph structures. Traditional relational models only add to an already complex structure. Traditional hardware approaches do not support efficient access to highly interconnected graphs. You don’t know what you don’t know: Efficient relational schemas require prior knowledge of the relationships between database fields. Updating and modifying schemas frequently introduces delays and errors. Problems in partitioning the problem: Distributed computing solutions are good, if your problem can be easily partitioned. Graphs are not predictable; accessing graph nodes across large clusters can be unwieldy at best and does not work at scale. 11
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: The YarcData Approach Real-time, Interactive Analytics on Large Graph Problems. Diagram labels: Business Challenge; Large Shared Memory Architecture (Up to 512 TB); XMT2 Massively Multi-Threaded Processors (128 Threads); Scalable IO (Up to 350TB per Hour). 12
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: New Use Cases Schizophrenia – Current therapies target dopamine receptors: not entirely effective; side effects – Basic research is exploring glutamate and its NMDA receptor – Goal: can we use Semantic MEDLINE to discover that research trend in the scientific literature? Cancer – With some exceptions, therapy is not effective and has not progressed significantly in 60 years – Scientific basis: traditionally, cancer cells; more recently, non-cancer cells (immune system) – Immune system and cancer: connection noted in 1863 (Virchow) but not exploited until recently – Goal: look for trends in cancer immunotherapy 13 Note: See Two YouTube Video Demos: Schizo (7 minutes) and Cancer (21 minutes). Discovery Browsing Method for Exploiting Semantic MEDLINE: cooperative reciprocity between system and human. Issue query; inspect graph for “interesting” concept; use selected concept to seed another query; iterate until satisfied.
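The discovery browsing loop described above (issue query, inspect graph, pick a concept, re-seed, iterate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the handful of predications below is a made-up toy set, not Semantic MEDLINE data, and the `choose` callback stands in for the human who inspects the graph. The names `issue_query`, `neighbours`, `browse`, and `pick_first` are hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy predications, invented for this sketch.
PREDICATIONS: List[Triple] = [
    ("antipsychotics", "TREATS", "schizophrenia"),
    ("antipsychotics", "INHIBITS", "dopamine receptor"),
    ("dopamine receptor", "ASSOCIATED_WITH", "schizophrenia"),
    ("glutamate", "INTERACTS_WITH", "NMDA receptor"),
    ("NMDA receptor", "ASSOCIATED_WITH", "schizophrenia"),
    ("glycine", "STIMULATES", "NMDA receptor"),
]


def issue_query(concept: str) -> List[Triple]:
    """Return the subgraph of predications that mention the concept."""
    return [t for t in PREDICATIONS if concept in (t[0], t[2])]


def neighbours(concept: str, graph: List[Triple]) -> Set[str]:
    """Concepts directly linked to `concept` in the returned graph."""
    return {o if s == concept else s for s, _, o in graph}


def browse(seed: str,
           choose: Callable[[str, List[str]], Optional[str]],
           max_rounds: int = 5) -> List[str]:
    """Iterate query -> inspect -> re-seed until `choose` returns None."""
    path = [seed]
    visited = {seed}
    concept = seed
    for _ in range(max_rounds):
        graph = issue_query(concept)                              # issue query
        candidates = sorted(neighbours(concept, graph) - visited)  # inspect graph
        picked = choose(concept, candidates)                      # human judgement
        if picked is None:                                        # satisfied
            break
        path.append(picked)
        visited.add(picked)
        concept = picked                                          # seed next query
    return path


def pick_first(concept: str, candidates: List[str]) -> Optional[str]:
    """Scripted stand-in for the analyst: take the first candidate, if any."""
    return candidates[0] if candidates else None


print(browse("schizophrenia", pick_first))
```

In a real session the `choose` step is where the "cooperative reciprocity" happens: the system proposes candidates and the researcher decides which one is interesting enough to seed the next query.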
Modus Operandi: Mantra, Performance, and Vision Mantra: – Speeding the Discovery, Integration, and Fusion of Information Performance: – SBIR Phase Three Successes: Wave Exploitation Framework (EF) – Wave EF: Government-off-the-shelf (GOTS) technology for intelligence applications that tackles the difficult problem of processing unstructured and semi-structured data – C4ISR Government Customers: U.S. Air Force, U.S. Army, U.S. Marine Corps, U.S. Navy, DARPA, DTRA, Missile Defense Agency, and Intelligence Agencies Vision: – Wave All-Source Semantic Fusion Engine: In development to support individual medical researchers/intelligence analysts to work with big data – Semedy (former Ontoprise founders): Reasoner and Triple Store 14
Modus Operandi: Finding the Right Needle in the Right Haystack Dyson said. “So a lot of what we’re doing is enabling that by making the data sources accessible and searchable.” “Our specialization is what we call ‘semantic technology,’ which is just a way of making the data smarter. We enrich the data with various tags to make it easier to find.” The software also provides what McNeight called data “provenance” which has to do with the traceability back to the source of the data - the really important aspect for intelligence personnel. “We don’t make decisions,” McNeight explained. “We just help (the analyst) to make decisions and to find the right data. He may only be interested in a certain person in a certain location at a certain time. We can bring that back to him across multiple databases.” Source: http://www.spacecoastbusiness.com/modus-operandi-delivers-information-based-intelligence/ 15
Data Science Team Example: President of Modus Operandi President of Modus Operandi: Richard McNeight, President, Masters Degree in Artificial Intelligence & Computer Science, Board of Regents, Florida Institute of Technology University, Recognized for Entrepreneurial Leadership, and Recipient of Florida County Economic Development Grant for Big Medical Data Data Science Team: – Lee Watkins, Director of Bioinformatics & IT JHMI, and Dr. Brand Niemann, Semantic Community, Co-Leads – Dr. Eric Little, Modus Operandi Chief Scientist, Ontology and Wave All-Source Semantic Fusion Engine Development – Bryan Thompson and Michael Personick, SYSTAP Principals, Bigdata® Platform – Tim Barr, YarcData Medical Informatics, and Aaron Bossett, YarcData Federal Solution Architect – Others to be added as needed Advisors: – Dr. Tom Rindflesch, NIH/NLM Semantic Medline Creator – Dr. Richard Ford and Dr. Marco Carvalho, Florida Institute of Technology 16
Wave and the vMDC (virtual metadata catalog – which is a query translator for non-semantic queries). Diagram labels: Generated Semantic Graph (RDF); Trust/Provenance Algorithms; Wave Ingest; Streaming Data; Batch Data; Structured, Semi-structured, Unstructured Data; High Performance Triple Store (Rya); Semantic Reasoner; Accumulo DB; vMDC. 17 An engine that can ingest any kind of data, transform that data into RDF graphs, then do a lot of semantic coolness with those graphs.
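To make the ingest-plus-provenance idea concrete, here is a small Python sketch in which every statement carries the name of the source it came from, so a fact can be traced back to its origin. It is not Wave's implementation or API; the `Statement` type, the `ingest` and `provenance` helpers, the source names, and the records are all hypothetical and exist only to illustrate the concept.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Statement(NamedTuple):
    """A triple plus the source it was ingested from."""
    subject: str
    predicate: str
    obj: str
    source: str


def ingest(source: str, records: Iterable[Dict[str, str]]) -> List[Statement]:
    """Turn flat records into statements, tagging each with its source.
    Assumes every record has an 'id' field naming the subject."""
    statements: List[Statement] = []
    for record in records:
        subject = record["id"]
        for field, value in record.items():
            if field != "id":
                statements.append(Statement(subject, field, value, source))
    return statements


def provenance(store: List[Statement], subject: str,
               predicate: str, obj: str) -> Set[str]:
    """Return every source that asserted the given fact."""
    return {st.source for st in store
            if (st.subject, st.predicate, st.obj) == (subject, predicate, obj)}


# Hypothetical sources and records, invented for this sketch.
store: List[Statement] = []
store += ingest("batch:personnel.csv",
                [{"id": "person:17", "location": "Melbourne, FL"}])
store += ingest("stream:field-reports",
                [{"id": "person:17", "location": "Melbourne, FL",
                  "seen_at": "2012-12-01"}])

print(sorted(provenance(store, "person:17", "location", "Melbourne, FL")))
```

Keeping the source on each statement, rather than on the record, means a fact confirmed by two feeds reports both of them, which is the kind of traceability the slide's "Trust/Provenance Algorithms" label points at.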
BLADE 2.0 Wiki Apps and Visualizations How Wave Drives the BLADE Semantic Wiki and Other Kinds of Analytic Visualizations 18 The wiki is just a way to view the entities in the model and make changes and see related content without having to type any SPARQL code or really know anything about the backend model structure – just point and click at the content you want to see.
Possible Scenario For medicine – the Blade 2.0 Semantic Wiki would allow different researchers to view the data collectively from within their areas of expertise, but connect them to other areas effortlessly. This means – scientist 1 could be looking up information on a given receptor on a cell, while scientist 2 is looking at proteomic information (perhaps not even knowing it is the underlying substance of that cell/receptor). Scientist 3 could add some new information about a given compound that shows reactions at the receptor site scientist 1 is studying. Upon entering that information, scientist 1 would see a new linked piece of data about their receptor related to the compound – and the cool part is scientist 2 would also see information about the connection between their protein structure and that compound. Scientist 3 would see the information about the protein related to their compound as well (since they were only looking at the receptor-compound connection). All 3 would basically have new linked information available to pursue if they wanted. Now imagine being able to do those kinds of joins in near-real-time with a simple tool across the entire corpus of the Semantic Medline data set. Kaboom! Source: Dr. Eric Little, Chief Scientist and Ontologist 19
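The three-scientist scenario above can be mimicked with a tiny in-memory graph: one new link entered by scientist 3 changes what scientists 1 and 2 see. The Python below is a sketch under assumptions, not the BLADE wiki's actual mechanism; the class `LinkedData`, the entity names ("receptor R", "protein P", "compound C"), and the predicate names are placeholders chosen for illustration.

```python
from collections import defaultdict, deque
from typing import Dict, Set, Tuple


class LinkedData:
    """Minimal triple store whose links can be followed in both directions."""

    def __init__(self) -> None:
        self.edges: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.edges[subject].add((predicate, obj))
        self.edges[obj].add((predicate, subject))  # navigable both ways

    def related(self, start: str, max_hops: int = 2) -> Set[str]:
        """Breadth-first search: everything within `max_hops` links of `start`."""
        seen = {start}
        found: Set[str] = set()
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue
            for _, other in self.edges.get(node, ()):
                if other not in seen:
                    seen.add(other)
                    found.add(other)
                    queue.append((other, hops + 1))
        return found


data = LinkedData()
# Scientist 2's proteomic knowledge: the protein underlies the receptor.
data.add("protein P", "component_of", "receptor R")
before = data.related("protein P")

# Scientist 3 enters a new observation about a compound.
data.add("compound C", "reacts_at", "receptor R")
after = data.related("protein P")

print("scientist 2, before:", sorted(before))
print("scientist 2, after: ", sorted(after))
print("scientist 1:        ", sorted(data.related("receptor R")))
print("scientist 3:        ", sorted(data.related("compound C")))
```

Nobody had to declare a join between proteins and compounds; the shared receptor node is what connects them, which is the effect the scenario describes.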
Knowledge Base: Modus Operandi Web Intelligence in MindTouch 20 Practical Example of How to Get BIG DATA By Starting Small with Structured & Unstructured Data as Relational & RDF Triples Stored in Excel and Visualized in Spotfire.
Big Data in Memory: Innovation Story Met Jef Sharp, President, Panève: – Amazing fast access and massive storage – Big Data Supercomputer on My Mobile Device – Johns Hopkins University – Blackbook (CIA Cloud) I suggested: – Greylock Partners - #2 Data Scientist in the World (DJ Patil, Entrepreneur-in-Residence who built the first formal data science team at LinkedIn) Works for In-Q-Tel (Robert Ames, Senior VP for Technology, In-Q-Tel) Works for CIA (Gus Hunt, CTO, CIA) – Who Wants Big Data Supercomputer on Mobile Devices 21
Future: Possibility Panève’s ZettaLeaf & ZettaTree Products Scalable single level storage – Panève’s scalable single level storage model collapses the server, network, and storage by removing software and replacing it with memory system primitives. This eliminates all network and network-processing overhead associated with accessing storage and delivers a 10,000X increase in raw performance. 22