Data Science for DTIC Data Ecosystem Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist Semantic Community


1 Data Science for DTIC Data Ecosystem. Dr. Brand Niemann, Director and Senior Data Scientist/Data Journalist, Semantic Community. December 13, 2014

2 Overview. Master Data Repository Tool RFI: Due December 30th; GPO's FDsys: 1 Billion Items Served and Ready For More; Big Data: Google Page Rank; Data Science: Questions, Data Mining, Invert Bath Tub, and Digital Government Strategy; DTIC Site Map, Thesaurus, and Subject Categories; Knowledge Base: MindTouch; Knowledge Base Spreadsheet Linked Data Index: Excel; Analytics and Visualization: Spotfire; Semantic Search: Semantic Insights; Results: Conclusions and Next Steps

3 Master Data Repository Tool RFI: Due December 30th. DTIC’s goal is to consolidate, unify, manage, control, search, analyze, and disseminate scientific and technical data using a single tool. If they really want one tool, then GPO's FDsys is probably the closest to that of anything I have worked on in the past. DTIC Needs Data Scientists to Build a Data Ecosystem with Data Science. Map RFI Requirements to Digital Government Strategy to Show How This Data Science Pilot Meets and Exceeds Those Requirements.

4 GPO's FDsys: 1 Billion Items Served and Ready For More. But, “We are not resting on our laurels,” said GPO’s Chief Technology Officer Ric Davis. A major refresh of FDsys is now in the planning stages, which will include an updated search engine and improved support for mobile devices. FDsys uses Extensible Markup Language (XML) and an ISO standard format for archival information to enable searching across multiple collections, a feature not available in the original GPO website, GPO Access. “It was really a flat store of files with a search engine on top of it,” LaPlant said of the old Access. “We needed something to better manage and preserve.” The agency is evaluating cloud-based technology for FDsys as part of its upcoming major refresh, along with a new open source search engine, Solr, which promises fault-tolerant performance on a large scale. In 2012 GPO began replacing its 30-year-old composition engine called Microcomp with XML Professional Publisher (XPP) to enable the direct XML formatting of documents for both electronic and print publication. This eliminated the step of transforming documents for publication in XML. My Comment: So I taught XML Training at GPO, showed them how to “author once and use many” (print, CD, Web, mobile), and suggested MindTouch, the state-of-the-art Wiki with Solr (Lucene) in the Amazon Cloud that I use for all of my work, so I am still ahead of them after 15 years! I use four tools: MindTouch, Spotfire, Semantic Insights, and Be Informed.

5 Big Data: Google Page Rank. PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. PageRank is now one of 200 ranking factors that Google uses to determine a page’s popularity. Google Panda is one of the other strategies Google now relies on to rank popularity of pages. Even though PageRank is no longer directly important for SEO purposes, the existence of back-links from more popular websites continues to push a webpage higher up in search rankings. My Comment: Why not create big data pages that are data in relational and graph format?
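
The link-counting idea on this slide can be illustrated with a small sketch. This is not Google's production ranking, only the textbook power-iteration form of PageRank; the function name `pagerank`, the damping value 0.85, the iteration count, and the three-page toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank (illustrative sketch only).

    `links` maps each page to the list of pages it links to.
    The damping factor and iteration count are assumed defaults.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph: pages "a" and "b" both link to "c"; "c" links to "a".
    toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 3))
```

In the toy graph, page "c" receives links from two pages and ends up with the highest score, while "b", which nothing links to, keeps only the baseline share. That matches the slide's assumption that more important pages tend to receive more links.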

6 Data Science: Questions, Data Mining, Invert Bath Tub, and Digital Government Strategy. Answer Four Questions: How is the data collected? Where is it stored? What are the results? Why should we believe them? Follow Data Mining Process: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. Invert the Activity Level Bathtub: Collection (Easy and Fast), Analysis (Maximize Time Spent), Communications (Easy and Fast). Digital Government Strategy: Unstructured is Structured; Unstructured and Structured Are Integrated; Well-defined URLs; Content (XML, Java, and APIs with Non-Web Formats Like PDF Converted); Data Ecosystem.

7 DTIC Site Map

8 DTIC Thesaurus

9 DTIC Subject Categories

10 Knowledge Base: MindTouch. Data Science for DTIC Data Ecosystem

11 Knowledge Base Spreadsheet Linked Data Index: Excel

12 Analytics and Visualization: Spotfire (Web Player)

13 Semantic Search: Semantic Insights. My Note: I requested use of Research Assistant and Research Librarian on DTIC Content.

14 Results: Conclusions and Next Steps. A Data Scientist Has Built a DTIC Data Ecosystem That Answers Four Basic Questions, Supports Data Mining, Inverts the Bath Tub, and Complies With the Digital Government Strategy. The DTIC Data Ecosystem Was Built From the DTIC Web Site Map and Satisfies the RFI Requirements. The DTIC Data Ecosystem Provides Semantic Search and Visualizations in MindTouch, Excel, and Spotfire. Semantic Community Has Requested the Use of Research Assistant and Research Librarian Betas from Semantic Insights For Use on DTIC Content.
