Data Science for Agency Initiatives 2015 Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist Semantic Community Data Science Data Science.

1 Data Science for Agency Initiatives 2015
Dr. Brand Niemann, Director and Senior Data Scientist/Data Journalist, Semantic Community
Data Science / Data Science for Agency Initiatives 2015
August 3, 2015

2 Activities
906 Members on July 28th and 13 New Members on July 22nd, a Daily Record!
Member Mary Galvin had John Patrick Junior this month.
Dr. Tom Rindflesch, NIH/NLM: Semantic Medline on August 17th, on Glucan.
Data Science for EPA Hydraulic Fracturing Webinar, September 1st.
OSTP/NSF Data Science Meetup of Meetups, November 6th, Ballston, VA.
Steve Hanmer, Mission Source, is co-planning the Data Science for Data Act Datathon Meetup. He attended the Data Act Datathon and Forum this week and will report.
Jonathan Hines, ORNL science writer, is doing a story on Semantic Medline and the ORNL CADES (Compute and Data Environment for Science).
Dr. David Booth, Yosemite Project (Semantic Interoperability of EHRs) and Cambridge Semantic Web Meetup Founder, has accepted to speak, with the date TBD.
Attended the Algorithms for Geospatial Data Analysis and Data Owls Meetups.

3 Algorithms for Geospatial Data Analysis and Data Owls Meetups
I am not able to help with a blog for the Wednesday Meetup because there is not enough information to write one. My slide 3 (which I posted to your Meetup) shows the information I need for a blog and collect beforehand for my own Meetup blogs. In this case, my research since the Meetup shows that both authors could have accessed and used the actual data from the EIA. An example of what I mean is my data science blog for our Monday, August 3rd Meetup: Listen to CFPB Data Manager, get Consumer Complaint Database, and see Data Science on that data set!

4 Data Mining - Data Science - Data Publication Process
Data Mining Process: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Data Science Process: Data Preparation, Data Ecosystem, Data Story
Data Science Questions: How was the data collected? Where is the data stored? What are the data results? Why should we believe the data results?
Data Science Data Publication: Knowledge Base, Spreadsheet Index, Web & PDF Tables to Spreadsheet, Data Browser, Dynamically Linked Adjacent Visualizations

5 Data Science Data Publication: Data Browser

6 Data Science Data Publication: Dynamically Linked Adjacent Visualizations

7 USGS geochem.csv Data Problem 1
Sophia,
In Brand Niemann's presentation to the Big Data group, he mentioned trouble with geographic coordinates in the file geochem.csv located at http://mrdata.usgs.gov/geochem/geochem.csv. I've examined this file in Microsoft Excel 2010, plotting the latitude against longitude, and I don't see any anomalies. If there is any other information that might help to clarify the problem Brand had, I'd be happy to investigate further, but with the available evidence it looks like a software problem with the tools he was using.
Peter
My Note: I also did a scatter plot in Spotfire when the Map Tool did not work.
Peter (cc Brand),
Thanks for following up on this. I have included Brand so that he can reply with a more thorough response. I was also very interested to know why there was a discrepancy with the geographic coordinates. It would be helpful to know the source of the issue.
Thanks, Sophia

8 USGS geochem.csv Data Problem 1
Peter,
The problem is that geochem.csv treats Latitude and Longitude as Categorical Data and not Numerical Data, as does, say, MRDS.csv, etc. A sophisticated program like Spotfire is sensitive to that important difference.
Brand
Brand,
Your statement makes no sense. CSV files are plain text, with the rows specified as lines, and the columns delimited by commas. There is no type information, no category information, nothing at all to which a program reading these data can be "sensitive" other than the actual values in the field. Instead, it is the obligation of the person operating the software to understand the information in the data file, and apply that understanding in the use of software. That includes substantive knowledge of the meaning of the fields as well as the simple technical observations that one can make by examining the values contained in each field. That's why we have documentation. So first of all, when you had trouble, you should have investigated further with other software (Excel, for example), then you should have contacted me if you continued to have trouble using the data. It was irresponsible for you to claim that the problem you encountered is in the data.
Peter
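Peter's point that a CSV file carries no type information can be illustrated with a minimal Python sketch. The sample text and column names below are hypothetical and are not taken from geochem.csv; the sketch only shows that a CSV reader hands back strings, and that treating a field as a number is a separate, explicit step.

```python
import csv
import io

# Hypothetical sample rows; NOT taken from geochem.csv.
SAMPLE = "site,latitude,longitude\nA,38.95,-77.37\nB,40.01,-105.27\n"


def read_rows(text):
    """Parse CSV text into a list of dicts; every value comes back as str."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE)

# Collect the Python type names of all parsed values.
value_types = {type(v).__name__ for row in rows for v in row.values()}
print(value_types)

# Converting to numbers is an explicit decision by the user or the tool.
coords = [(float(r["latitude"]), float(r["longitude"])) for r in rows]
print(coords)
```

Any "numeric" or "categorical" label is therefore assigned by the reading program from the values it sees, not stored in the file.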

9 USGS geochem.csv Data Problem 2
Peter,
Please download a free trial of Spotfire and import the two csv files, geochem.csv and MRDS.csv, and you will see what I am talking about. I can come to the USGS and show you this if you would like. This is data science.
Brand
Brand,
You have to understand the data, and you have to use the data responsibly. It is not up to the software to do that work for you. My suspicion is that your program treated the coordinates differently than numbers because some of the rows have no coordinates. They're the geochemical analyses of materials standards used to ensure that the sample measurements are correct, and are used by knowledgeable specialists to assess the accuracy and precision of the data values. But you didn't look at the data, otherwise you would have seen this. That's not science of any sort. A scientist examines the evidence with which he or she works, and tries to understand what the evidence is, where it came from, and what it means.
Peter
Peter,
I did look at multiple USGS data sets with the premier data science tool (IMHO) and reported what I found. I am telling you how you could verify my results and learn something about data science. The choice is up to you.
Brand
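Peter's suspicion, that rows without coordinates lead a tool to stop treating the column as numeric, can be sketched as follows. This is an illustration under stated assumptions, not a description of how Spotfire or any other product infers types: the sample data, the `infer_kind` helper, and its `skip_blank` flag are all hypothetical.

```python
import csv
import io

# Hypothetical data: the "STD-1" row stands in for a standards record
# that has no coordinates. NOT taken from geochem.csv.
SAMPLE = (
    "sample_id,latitude,longitude\n"
    "S1,38.95,-77.37\n"
    "STD-1,,\n"
    "S2,40.01,-105.27\n"
)


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_kind(values, skip_blank):
    """Naive rule: 'numeric' only if every considered value parses as a float."""
    if skip_blank:
        values = [v for v in values if v.strip()]
    if values and all(is_number(v) for v in values):
        return "numeric"
    return "categorical"


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
latitudes = [r["latitude"] for r in rows]

strict = infer_kind(latitudes, skip_blank=False)   # blank cell counts against the column
lenient = infer_kind(latitudes, skip_blank=True)   # blank cell treated as missing
print(strict, lenient)
```

Under this naive rule the same column is labeled differently depending only on how blank cells are handled, which is consistent with both correspondents seeing different behavior from the same file.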

10 Data Science Data Curation for Sustainable Data Science Meetups of Meetups
I just finished four data science ecosystems:
RDA Climate Data Challenge (July 15): http://semanticommunity.info/Data_Science/Data_Science_for_RDA_Climate_Change_Data_Challenge
RDA Information Week 2016 (Ebola Response and Nepal Earthquake) (July 17): http://semanticommunity.info/Data_Science/Data_Science_for_Global_Ebola_Response_Data
USDA Microsoft Innovation Challenge (July 27): http://semanticommunity.info/Data_Science/Big_Data_Science_for_Precision_Farming_Business#Story
US Data Act (July 28): http://semanticommunity.info/Data_Science/Data_Science_for_the_DataAct_Datathon

11 Collaboration for Data Science Win-Wins
USDA Open Government Data Training, Innovation Competition, and Online Course in Data-Driven Farming: http://semanticommunity.info/Data_Science/Big_Data_Science_for_Precision_Farming_Business#Story
Many Curated Government Data Sets and Data Science Products: http://semanticommunity.info
Pick an Agency and/or a Data Set and Look for a Meetup on That: http://www.meetup.com/Federal-Big-Data-Working-Group/
Mentor Startups Partnership with Eastern Foundry: http://www.meetup.com/Federal-Big-Data-Working-Group/events/223140032/

12 USDA Collaboration Chronology
March 16th: USDA CIO and ACDO on Open Data Plan and Roundtable Meetup
March 25th: Government Technology & Innovation Incubator for Big Data Analytics II Meetup at Eastern Foundry
May 18th: USDA Data Science MOOC Meetup
May 21st: USDA Open Data Quarterly Submission to OMB on USDA Data Usage provided (USDA Data Science MOOC)
July 21st: Data-Driven Farming Online Course Announced by HeatSpring and Semantic Community
July 27th: USDA Microsoft Innovation Challenge Submission on Farm Data Dashboards
July 29th: Partnerships Sought for Data-Driven Farming Online Course
September 17th: Big Data Science for Precision Farming Business Online Course Meetup and Commercial Examples: Farmers Business Network, FarmLogs, etc.
October 26th-December 18th: Data-Driven Farming Online Course with Partners

13 https://www.farmersbusinessnetwork.com/

14 Agenda
6:30 p.m. Welcome and Introduction (New Tutorial and Mentoring). Slides: Data Science for Agency Initiatives 2015
7:15 p.m. Brief Member Introductions
7:30 p.m. Chad Tompkins, Section Chief, Data Section, Office of Consumer Response (suggested by Linda F. Powell, Chief Data Officer, Consumer Financial Protection Bureau): Consumer Complaint Database. Slides (not cleared for public release)
8:15 p.m. Open Discussion
8:45 p.m. Networking
9:00 p.m. Depart
Listen to CFPB Data Manager, get Consumer Complaint Database, and see Data Science on that data set!

