A Semantic Web Strategy for Big Data: ET SIG Presentations/Discussion

A Semantic Web Strategy for Big Data: ET SIG Presentations/Discussion
Dr. Brand Niemann, Director and Senior Enterprise Architect – Data Scientist, Semantic Community http://semanticommunity.info/
AOL Government Blogger http://gov.aol.com/bloggers/brand-niemann/
November 29, 2012
http://semanticommunity.info/Emerging_Technology_SIG_Big_Data_Committee/

Welcome and Agenda
Welcome:
John Geraghty: MITRE and ET SIG Chair
Johan Bos-Beijer: GSA, Government Chair of the Big Data Committee, and SIG GAP
Agenda:
8:30 - Networking
9:00 - ET SIG Business Meeting - John Geraghty
9:30 - Semantic Technology Presentations / Discussion - Brand Niemann (Semantic Community), Eric Little (Orbis Technologies) and Victor Pollara (Noblis)
Remarks by Dr. Tom Rindflesch (NLM) and Dr. George Strawn (OSTP/NITRD)
10:40 - Close
http://semanticommunity.info/Federal_SOA/14th_SOA_for_E-Government_Conference_October_2_2012#SOA_Pilots

Aligning Outcomes, Accelerating Change and Creating Agility in Tough Financial Times
10. Confront Our Reality – Web 2.0 and Semantic Web
9. All About the Data – Data Scientists and Semantic Web
8. Power of Alignment
7. Enterprise View – Semantic Web and BPMN
6. Fix Process First
5. Status Quo No Longer Gets a Bye – Semantic Web
4. Execution Oriented Organization
3. Secure Information Sharing – Semantic Web (Tag Data)
2. All About the Workforce
1. All About Change
David Wennergren, Assistant Deputy Chief Management Officer, Department of Defense
http://semanticommunity.info/AOL_Government/Government_Information_and_Analytics_Summit#KN-1_Aligning_Outcomes.2C_Accelerating_Change_and_Creating_Agility_in_Tough_Financial_Times

CTO Views on Government Big Data and the Top Five Big Data Solutions of 2012
Need a Recipe for the Analyst's Desktop for Real-Time Analytics
Interested in Our Own Metadata, Not Someone Else’s
Really Interested in Just Seven Things: Person, Place, Time, Event, Concept, Importance, and Source
Using the Semantic Web, But Not the Cray Graph Computer Yet
Gus Hunt, Chief Technology Officer for the Chief Information Officer, Central Intelligence Agency
http://semanticommunity.info/@api/deki/files/13765/CIA_Gus_Hunt_Gov_Summit.pdf
http://semanticommunity.info/AOL_Government/Government_Information_and_Analytics_Summit#BD_2_CTO_Views_on_Government_Big_Data_and_the_Top_Five_Big_Data_Solutions_of_2012

Purpose Current projects that are implementing Semantic Web Standards and Technologies.  Reusable solutions for datasets communicated between machines (Big Data). Coordination with the Big Data Committee, Collaboration & Transformation SIG and the Advanced Mobility Working Group. Thoughts on next steps and resources for interested organizations.

The Semantic Web The Semantic Web, a collection of technologies designed to add meaning and enable intelligent search across the Web, is now taking shape as a strategy to help navigate the challenges of Big Data, including how to find insights and leverage the ever-increasing volume of data available to us from sensors, social media posts, videos, pictures, and purchase transactions, just to name a few. Source: Turning Big Data Into Big Benefits, MAKE IT A TRIPLE - A Semantic Web Strategy for Big Data by Frank P. Coyle, Cutter IT Journal Vol. 25, No. 10, October 2012.

Big Data Big Data is not a precise term; rather it’s a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Source: Leadership Council for Information Advantage. Big Data: Big Opportunities to Create Business Value. EMC, 2011 http://www.emc.com/microsites/cio/ar...ties-Value.pdf.

Big Data Options for working with Big Data include data mining and warehousing, statistical analysis, machine learning, column-oriented analytics databases (such as those used by Zynga and Groupon), and the increasingly popular parallel processing options built around the open source Hadoop project and Google’s MapReduce frameworks for distributing processing across thousands of computationally independent computers. However, such options are most effective when the Big Data is homogeneous and can be efficiently partitioned across multiple computers. If there are multiple data sources, stored in a variety of formats, then a large-scale Hadoop MapReduce effort will not be able to uncover linkages, since each computer in a Hadoop cluster will only be searching a subset of the data. What is needed is a technology suited to making linked data connections. And for that, we turn to the Semantic Web.

Big Data Big data is best described by four Vs: Volume (massive size), Velocity (real-time), Variety (unstructured, structured, video, etc.), and Value (worth the hardware, software, and talent to deal with). Social media is all the stuff that takes up so much of our valuable time (Email, Tweets, Facebook, LinkedIn, etc.). Data Science Analytics is getting something useful from all of that, communicating it effectively to those who are less technical than we are, and justifying it to those who are more technical. http://semanticommunity.info/AOL_Government/Big_Data_Innovation

Searching For Meaning: The Semantic Web The word “semantics” comes to us from the Greek semantika, the study of meaning. At the core of the Semantic Web is a data model called the “Resource Description Framework” (RDF), a simple language for stating simple facts about things. At the core of RDF are mini-sentences called “triples.” These mini-sentence triples consist of a subject (what we are talking about), a predicate (or property), and an object (or value). 
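The subject/predicate/object structure can be illustrated with a minimal Python sketch. The `Triple` type, the `as_graph` helper, and the example names (alice, bob, Sushi) are assumptions made for illustration; they are not part of RDF itself or of the original slides, and real RDF identifies resources with URIs rather than bare strings.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style mini-sentence: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts, for illustration only.
triples = [
    Triple("alice", "knows", "bob"),
    Triple("alice", "likes", "Sushi"),
    Triple("bob", "likes", "Sushi"),
]


def as_graph(facts):
    """Index triples by subject so each subject maps to its outgoing edges."""
    graph = defaultdict(list)
    for t in facts:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


graph = as_graph(triples)
```

Because the object of one triple can be the subject of another, a flat list of mini-sentences like this already describes a graph, which is what the "web of connections" on the next slide refers to.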

RDF Triples Give Rise to a Web of Connections

SPARQL: The Semantic Query Language But RDF triples do not stand alone. Asking questions about RDF triple data is the province of SPARQL, an RDF query language capable of querying RDF databases. SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) includes capabilities to retrieve and manipulate data stored in RDF format. SPARQL was standardized by the W3C as an official Recommendation in 2008. A search for “Individuals who know each other and like sushi” can be resolved by the following SPARQL query:
    SELECT ?x ?y
    WHERE {
      ?x :knows ?y .
      ?x :likes :Sushi .
      ?y :likes :Sushi .
    }
MY COMMENT: SPARQL does not yet support queries over physically distributed data sources.
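To show roughly how a query like the sushi example is resolved, here is a small pure-Python sketch that matches the three patterns against an in-memory set of triples. It is not a SPARQL engine: the `match` and `query` functions and the sample data are hypothetical, and a real triple store would add indexing, prefixes, and the rest of the language.

```python
# Hypothetical data set; the names are illustrative, not from the slides.
TRIPLES = {
    ("alice", "knows", "bob"),
    ("alice", "likes", "Sushi"),
    ("bob", "likes", "Sushi"),
    ("carol", "knows", "alice"),
    ("carol", "likes", "Pizza"),
}


def match(pattern, binding, triples):
    """Yield bindings that extend `binding` so `pattern` matches some triple.

    Pattern terms starting with '?' are variables; all others are constants.
    """
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def query(patterns, triples):
    """Join the patterns left to right, in the spirit of a basic graph pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b for old in bindings for b in match(pattern, old, triples)]
    return bindings


sushi_friends = query(
    [("?x", "knows", "?y"), ("?x", "likes", "Sushi"), ("?y", "likes", "Sushi")],
    TRIPLES,
)
pairs = sorted((b["?x"], b["?y"]) for b in sushi_friends)
```

With this sample data, carol is excluded because she does not like sushi, so only the alice/bob pair satisfies all three patterns.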

The Challenge of Big Data At the 2012 World Economic Forum Annual Meeting in Davos, Switzerland, Big Data was a featured topic, with a report entitled “Big Data, Big Impact.” In March 2012, the US federal government announced a $200 million research program for Big Data computing. Even Dilbert’s boss in the Scott Adams comic strip has gotten into the act, describing Big Data as “It comes from everywhere. It knows all.” No longer do we need to work hard to get data into our systems; the hard part now is database queries that attempt to aggregate data from disparate sources with similar time stamps.

Opening Statement by Ralph Hughes
Looking Past The Hype
MapReduce Is Not Our Silver Bullet
Replacing Fear With Discipline
“We must remind ourselves that new technologies frequently get overhyped by the media and vendors, and that our search for a silver bullet often leads to profound disappointment. We will need time and discipline to see what Big Data can realistically offer. A disciplined approach should begin with compelling use cases that express clearly attainable business impacts. Only by articulating realistic objectives can we rationally choose a technical solution from the several competing Big Data technologies. Moreover, any Big Data solutions must integrate into our existing strategies for “not-so-Big Data,” so that the information flood from the coming “Internet of everything” calmly fills our carefully architected BI ecosystems with usable data rather than washing them away.”
MY COMMENT: My AOL Gov story on the Big Data Innovation Conference in Boston last September was that most Hadoop implementations were costing companies 50 times more than expected. http://semanticommunity.info/AOL_Government/Big_Data_Innovation
Source: Turning Big Data into Big Benefits, Cutter IT Journal Vol. 25, No. 10 October 2012.

Big Data Breakdown Much of the Big Data comes at us from a wide variety of sources not well suited for storage in relational tables. Structurally, the existing corpus of data in the world falls into three categories:
Structured Data: Spreadsheets and Data Tables
Semi-Structured Data: XML and JSON
Unstructured Data: Information that does not fit easily into a set of database tables or XML or JSON.
Estimates are that 80% of the data in organizations is unstructured. How can companies leverage the knowledge contained in the disparate data sources to their advantage? One approach is to use semantic technology, specifically RDF triples and SPARQL.

Data Falls Into Three Categories: Structured, Semi-structured, and Unstructured

From Data To Actionable Knowledge Converting data to triples opens the door to integration with other linked data. Unstructured data makes things a bit more complex, as its lack of associated semantic tags or indicators gives us little to work with. However, when the unstructured data has a Web URL, we can at least begin to talk about the data using our mini-sentence triples. MY NOTE: Semantically enhance the unstructured data. With our data in a common RDF format, we can now utilize SPARQL to ask questions and uncover useful relationships. MY COMMENT: There are other ways to do this without SPARQL!
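The conversion described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the helper names (`row_to_triples`, `json_to_triples`), the identifiers such as `product/42`, and the example URL are all hypothetical, and a production conversion would mint proper URIs and use a shared vocabulary for the predicates.

```python
import json


def row_to_triples(subject, row):
    """Structured data: each column of a table row becomes one triple."""
    return [(subject, column, value) for column, value in row.items()]


def json_to_triples(subject, node):
    """Semi-structured data: walk nested JSON objects, one triple per field.

    Nested objects get a derived subject such as 'order/1/customer'.
    """
    out = []
    for key, value in node.items():
        if isinstance(value, dict):
            child = f"{subject}/{key}"
            out.append((subject, key, child))
            out.extend(json_to_triples(child, value))
        else:
            out.append((subject, key, value))
    return out


# Hypothetical inputs, for illustration only.
row = {"name": "Widget", "price": 9.99}
doc = json.loads('{"id": 1, "customer": {"name": "Alice"}}')

triples = row_to_triples("product/42", row) + json_to_triples("order/1", doc)

# Unstructured data: with only a URL, we can still state facts *about* it.
triples.append(("http://example.org/report.pdf", "mentions", "product/42"))
```

Once all three kinds of source are expressed as triples, they sit in one list and can be queried together, which is the integration benefit the slide describes.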

Converting Data to Triples Opens Door to Integration with Other Linked Data

Semantic Data For Free One side benefit of moving data to a common RDF triple format is the ability to make connections with other linked data sets stored as Semantic Web triples. One popular, widely accessed triple store is DBpedia, which is based on content extracted from Wikipedia. DBpedia allows users to query relationships and properties associated with Wikipedia resources, including links to other related data sets. As of August 2012, the English version of the DBpedia knowledge base describes 3.77 million things. The data set consists of 1.89 billion pieces of information (RDF triples), out of which 400 million were extracted from the English edition of Wikipedia; 1.46 billion were extracted from other language editions. From this data set, information spread across multiple pages can be extracted. For example, book authorship can be linked from pages about the work or the author. MY COMMENT: Put it into one Wiki Page and the multiple linked data indices into a Spreadsheet and Spotfire!
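A rough sketch of why a common triple format makes this linking cheap is shown below. The `external` set is a hand-written stand-in for triples from a public data set; the identifiers and predicate names are illustrative and have not been checked against DBpedia's actual vocabulary, and no network call is made.

```python
# Hypothetical local triples, e.g. from an in-house catalogue.
local = {
    ("item/17", "sameAs", "dbpedia:Moby-Dick"),
    ("item/17", "inStock", "yes"),
}

# Hand-written stand-in for triples published by an external linked data set.
external = {
    ("dbpedia:Moby-Dick", "author", "dbpedia:Herman_Melville"),
    ("dbpedia:Herman_Melville", "birthPlace", "dbpedia:New_York_City"),
}

# Both sides use the same triple format, so integration is a set union.
merged = local | external


def objects(subject, predicate, triples):
    """All objects o such that (subject, predicate, o) is in the data."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate, obj, triples):
    """All subjects s such that (s, predicate, obj) is in the data."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def authors_of_item(item, triples):
    """Follow item -> sameAs -> external resource -> author."""
    return {
        author
        for resource in objects(item, "sameAs", triples)
        for author in objects(resource, "author", triples)
    }


item_authors = authors_of_item("item/17", merged)
works_by_melville = subjects("author", "dbpedia:Herman_Melville", merged)
```

The same authorship fact is reachable starting from the work or from the author, and the local data alone cannot answer either question until it is merged with the external triples.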

Data Models Come Full Circle At first glance, converting data from relational data tables and other structured sources into simple triples may seem like a radical idea. Recall that triples combine to form a network of linked data. Databases fall into three major types: hierarchical, network, and relational. Hierarchical databases were limited in their ability to connect data elements across different hierarchies. Network databases are similar to hierarchical databases, except that they allow links between hierarchies, mirroring the graphs that arise from triples. Because of Oracle and other companies, the word “database” has come to mean “relational database” for most business IT folks. Today, the rise of Big Data is challenging the hegemony of the relational model because the relational data model has difficulty incorporating semi-structured or unstructured data into its tables. Thus, the thinking about data models is coming back around to the utility of the linked-data, network model.

Data Models Come Full Circle MY COMMENT: Combining triples from multiple relational data tables is difficult because most tables do not contain the semantic relationships; these must be added, usually with rules, to create a semantic model. MY COMMENT: High-quality statistical data tables are a better place to start than open data because they and their related tables (e.g. the 1500 or so in the US Annual Statistical Abstract) are more likely to have semantic relationships.

My 5-Step Method So what I like to do to illustrate (data science) and explain (data journalism) is the following (like a recipe):
1. Put the Best Content into a Knowledge Base (e.g. MindTouch): The Japan Statistical Yearbook 2012
2. Put the Knowledge Base into a Spreadsheet (Excel): Linked Data to Subparts of the Knowledge Base
3. Put the Spreadsheet into a Dashboard (Spotfire): Data Integration and Interoperability Interface
4. Put the Dashboard into a Semantic Model (Excel): Data Dictionaries and Models
5. Put the Semantic Model into Dynamic Case Management (Be Informed): Structured Process for Updating Data in the Dashboard

Real-time Analytics for Small Data, Big Data, and Huge Data! http://semanticommunity.info/AOL_Government/AWS_Public_Sector_Summit_2012

A Japan METI Open Data Dashboard The Open Government Data/Linked Data Initiatives in the US (Data.gov) and elsewhere started with the wrong data. I recommended that Data.gov start with the best US government data from the Federal Statistical Agencies (e.g. US Census Bureau) and make it the standard practice of high-quality data and metadata rendered in the new linked open data way. Experience has shown that to be the case, with the statistical community calling Data.gov and similar efforts essentially ‘IT projects.’ The solution is to first render a country's high-quality statistical data in the new way and then try to render the open data the same way as much as possible. I have done this for the US and Europe, and now for Japan. http://semanticommunity.info/A_Japan_METI_Open_Data_Dashboard

Framework for the Evolution of Working with Open Government Data http://semanticommunity.info/Build_SEMIC.EU_in_the_Cloud/SEMIC_2012_Conference

Big Data Committee Knowledge Capture and Work Force Training (GMU) MY COMMENT: This uses Semantic Technologies for: Well-defined Web Addresses, Searching, and Linking. http://semanticommunity.info/Emerging_Technology_SIG_Big_Data_Committee http://semanticommunity.info/A_Spotfire_Gallery/Spotfire_Learning_Network

Basic Concerns | Problem Statements | Example Solutions
Semantic Web | It is too new! | Be Informed does not call it Semantic Technology. Gartner and Forrester call it Dynamic Case Management.
Government | It does not want to change! | Dutch government funded ($40K) Be Informed for ten years to develop the leading eGov solution that its government agencies and businesses are required to use.
Contractors | They do not want to share the risk! | Be Informed does fixed price work.

Dutch frontrunners in global E-government According to the 2012 United Nations E-government Survey rankings, the Netherlands ranks first among European countries, and second worldwide. Be Informed is proud to have contributed considerably to this success.
Environmental Desk Online: Be Informed facilitated an online portal for permit application, as part of a government project bringing hundreds of building permit application processes down to 1 process leading to 1 permit. Government benefits include savings of approximately € 100 million in administrative costs in the first year alone, providing one-stop shopping for citizens and companies. More information: http://www.omgevingsloketonline.nl
Be Informed operates at the heart of these developments. Its business process platform has facilitated a number of citizen-centric initiatives which crossed the boundaries of various public institutions in the citizens’ best interest.
New to Holland: More information: http://www.newtoholland.nl
Regelhulp: More information: http://www.regelhulp.nl
Milieu Centraal: More information: http://www.milieucentraal.nl
Source: PDF

Eric Little Eric Little is currently Director of Information Management at Orbis Technologies, Inc., in Orlando, FL. He received a Ph.D. in Philosophy and Cognitive Science in 2002 from the University at Buffalo, State University of New York. He later held a Post-Doctoral Fellowship in the University at Buffalo’s Department of Industrial Engineering, developing ontologies for multisource information fusion applications (2002-04). Dr. Little then worked for several years as Assistant Professor of Doctoral Studies in Health Policy & Education and Director of the Center for Ontology and Interdisciplinary Studies at D'Youville College, Buffalo, NY (2004-2009). He left academia in 2009 to work as Chief Knowledge Engineer at the Computer Task Group (CTG) before joining Orbis.

Victor Pollara Dr. Pollara is a Senior Principal Scientist at Noblis in the Health Innovation mission area. He applies several decades of experience in theoretical computer science, bioinformatics, knowledge extraction from text, and algorithm design to develop computational solutions for complex, data-driven problems. His current work is focused on applying formal modeling and semantic technologies to large, heterogeneous data sets and experimenting with Noblis’ Cray XMT2 as a multi-billion triplestore server.

Tom Rindflesch Thomas C. Rindflesch has a Ph.D. in linguistics from the University of Minnesota and conducts research in natural language processing at the National Library of Medicine. He leads a research group focused on exploiting the Library’s resources to support development of advanced information management technologies in the biomedical domain.

George Strawn Dr. George O. Strawn is the Director of the National Coordination Office (NCO) for the Federal government’s multiagency Networking and Information Technology Research and Development (NITRD) Program. He also serves as the Co-Chair of the NITRD Subcommittee of the National Science and Technology Council. The NCO reports to the Office of Science and Technology Policy (OSTP) within the Executive Office of the President.