Data Science – ITWS/CSCI/ERTH 4350/6350


1 Data Science – ITWS/CSCI/ERTH 4350/6350
Brief review of data and information acquisition (curation) and metadata/provenance – management, data and metadata formats. Peter Fox, Data Science – ITWS/CSCI/ERTH 4350/6350 Review, September 26, 2017

2 Reading Assignments Changing Science: Chris Anderson
Rise of the Data Scientist Where to draw the line What is Data Science? An example of Data Science If you have never heard of Data Science BRDI activities Data policy Self-directed study (answers to the quiz) Fourth Paradigm, Digital Humanities

3 Rise of the Data Scientist

4 Metaphor Anatomy is the study of the structure of body parts and the relationships between them. Physiology is the study of the function of body parts and of the body as a whole.

5 Overused Venn diagram of the intersection of skills needed for Data Science (Drew Conway)
Anatomy Physiology ? Missing Anatomy

6 Data Science Anatomy (as an individual)
Data Life Cycle – Acquisition, Curation and Preservation Data Management and Products Forms of Analysis, Errors and Uncertainty Technical tools and standards

7 Data Science Physiology (in a group)
Definition of Science Hypotheses, Guiding Questions Finding and Integrating Datasets Presenting Analyses and Viz. Presenting Conclusions

8 Needs (this is our mantra)
Scientists should be able to access a global, distributed knowledge base of scientific data that: appears to be integrated; appears to be locally available. But… data is obtained by multiple means (models and instruments), using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) metadata. It may be inconsistent, incomplete, evolving, and distributed. And it is created in a manner to facilitate its generation, NOT its use. And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, and inflexible and unsustainable implementation technology.

9 Back to the TSI time series…
Many other examples

10 Data pipelines: we have problems
Data is coming in faster, in greater volumes and in more forms, outstripping our ability to perform adequate quality control. Data is being used in new ways, and we frequently do not have sufficient information on what happened to the data along the processing stages to determine whether it is suitable for a use we did not envision. We often fail to capture, represent and propagate manually generated information that needs to go with the data flows. Each time we develop a new instrument, we develop a new data ingest procedure, collect different metadata and organize it differently; the result is then hard to use with previous projects. The task of event determination and feature classification is onerous, and we don't do it until after we get the data. And now much of the data is on the Internet/Web (good or bad?)

11 Fox VSTO et al.

12 Yes, it all was/is about Provenance
Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility, or be verified, explained, etc. Who? What? Where? Why? When? How?
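
The six questions above can be captured as a small structured record. The following is only a minimal sketch: the `ProvenanceRecord` class and its field names are hypothetical, and every value is a made-up placeholder loosely based on the spectrometer example, not data from the lecture.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProvenanceRecord:
    """Hypothetical record answering Who? What? Where? Why? When? How?"""
    who: str
    what: str
    where: str
    why: str
    when: str  # kept as an ISO 8601 string so it serializes unchanged
    how: str


# All values below are illustrative placeholders.
record = ProvenanceRecord(
    who="Observer name",
    what="Relative intensity spectrum of a white-light source",
    where="Teaching laboratory",
    why="Characterize a colored plastic film",
    when="2017-09-26T10:00:00-04:00",
    how="Spectrometer read out by its acquisition software",
)

# Serialize the record so it can travel with the data through the pipeline.
provenance_json = json.dumps(asdict(record), indent=2)
print(provenance_json)
```

Keeping the record as plain JSON is one possible design choice; it lets any later stage of the pipeline read or extend it without depending on this class.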

13 Data Management reading
Moore et al., Data Management Systems for Scientific Applications, IFIP Conference Proceedings; Vol. 188, pp. 273 – 284 (2000) Data Management and Workflows Metadata and Provenance Management Provenance Management in Astronomy Web Data Provenance for QA W3C PROV

14 Management Creation of logical collections Physical data handling
Interoperability support Security support Data ownership Metadata collection, management and access. Persistence Knowledge and information discovery Data dissemination and publication. Derived from: Reagan Moore, Data Management Systems for Scientific Applications, Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software, IFIP Conference Proceedings, Vol. 188, pp. 273 – 284, Kluwer, B.V., Deventer, The Netherlands (2000)

15 Modes of collecting data, information
Observation Measurement Generation Driven by Questions Research idea Exploration

16 Acquisition Learn / read what you can about the developer of the means of acquisition Even if it is you (the observer) Beware of bias!!! Document things Have a checklist (see Management) and review it often Be mindful of who or what comes after your step in the data pipeline

17 Example 2 ‘The goal of the data collection was to explore the relative intensity of the wavelengths in a white-light source through a colored plastic film. By measuring this we can find properties of this colored plastic film.’ ‘We used a special tool called a spectrometer to measure the relative intensity of this light. It’s connected to a computer and records all values by using a software program that interacts with the spectrometer.’ Lessons Noise from external light, inexperience with the software, needed to get help from experienced users, more metadata than expected, software used different logical organization, ...

18 Example 3 ‘The goal of my data collection exercise was to observe and generate historical stock price data of large financial firms within a specified time frame of the years 2007 to This objective was primarily driven by general questions and exploration purposes – in particular, a question I wanted to have answered was how severe the ramifications of the economic crisis were on major financial firms.’ Lessons Irregularities in data due to company changes (buy-out, bankrupt), no metadata – had to create it all, quality was very high, choice of sampling turned out to be crucial, …

19 Data (and Metadata) Formats
We covered some (not all) ASCII, UTF-8, ISO Self-describing formats Table-driven Markup languages and other web-based Database Graphs “Unstructured”

20 The Examples – important to watch the explanation of these slides in the recorded lecture
MONTHLY.PLT and MONTHLY

21 Example – good or bad? MONTHLY.PLT and MONTHLY

22 Example – good or bad? Where is the data? Where is the provenance?
SW48-T0271.asc

23 Example – good or bad? SW48-T0271.asc

24 RDF http://www.w3.org/RDF/ - Resource Description Framework
Read the introduction and overview Graph representation and encoding RDF the model and RDF/XML the encoding Many tools, and very good language support Is the foundation of ‘data on the web’, see JSON-LD (JSON for Linked Data) We cover this more in a later class Semantic web – a way of looking at the web as a web of data not a web of documents, 3 parts: Data representation – RDF Querying – SPARQL Context – OWL (Web Ontology Language) Best way to represent data in a way that also preserves semantics
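
To make the split between "RDF the model" and its encodings concrete, here is a minimal sketch that holds a graph as subject-predicate-object triples and writes it out in the N-Triples line format. It deliberately uses no RDF library; the helper names `uri`, `literal` and `to_ntriples` are hypothetical, and every `http://example.org/` identifier is invented for illustration. Only the Dublin Core `creator` property URI is a real vocabulary term.

```python
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
EX = "http://example.org/"  # made-up namespace for illustration


def uri(value):
    """Encode an IRI term the way N-Triples expects: <...>."""
    return f"<{value}>"


def literal(value):
    """Encode a plain string literal, escaping backslash, quote and newline."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


# The model: a set of (subject, predicate, object) statements.
triples = [
    (uri(EX + "dataset/1"), uri(DC_CREATOR), uri(EX + "person/observer")),
    (uri(EX + "dataset/1"), uri(EX + "vocab/measures"), literal("relative intensity")),
]


def to_ntriples(statements):
    """One encoding of the model: one statement per line, ending in ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


ntriples = to_ntriples(triples)
print(ntriples)
```

The same list of triples could equally be written as RDF/XML or JSON-LD; only the serialization step would change, not the model.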

25 Metadata formats Fall into three categories
Unstructured and disconnected With the data ‘Close’ to the data See the ASCII example and contrast this with the netCDF example Structure around metadata is very important Vocabulary (constraints) are also very useful We dream of contextual metadata…

26 Dublin Core DCMI is an open organization engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models. ISO Standard of February 2003; ANSI/NISO Standard Z of May 2007; IETF RFC 5013 of August 2007. Metadata element set; metadata terms
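
As a rough illustration of what a Dublin Core description looks like in practice, the sketch below builds a tiny XML record using three elements from the Dublin Core element set (`title`, `creator`, `date`). The wrapper element name `metadata` and the helper `dc_record` are assumptions made for this example; the values come from the title slide of this lecture.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dc_record(fields):
    """Build a record with one dc:* child per (element name, value) pair."""
    root = ET.Element("metadata")  # wrapper name is an arbitrary choice
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


record = dc_record(
    {
        "title": "Data Science - ITWS/CSCI/ERTH 4350/6350",
        "creator": "Peter Fox",
        "date": "2017-09-26",
    }
)

dc_xml = ET.tostring(record, encoding="unicode")
print(dc_xml)
```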

27 Time ISO 8601 specifies numeric representations of date and time.
It helps to avoid confusion in international communication due to different national notations and increases the portability of computer user interfaces. Good read: In XML encodings, see xsd:dateTime
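
A short sketch of ISO 8601 in use, assuming Python's standard `datetime` module. The date is the lecture date from the title slide; the time of day and the fixed UTC-4 offset labelled "EDT" are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone standing in for US Eastern Daylight Time (assumption).
edt = timezone(timedelta(hours=-4), "EDT")

# Lecture date from the title slide; the 14:30 time is made up.
lecture = datetime(2017, 9, 26, 14, 30, 0, tzinfo=edt)

# Write: an unambiguous, sortable, numeric representation.
stamp = lecture.isoformat()
print(stamp)  # 2017-09-26T14:30:00-04:00

# Read it back and normalize to UTC for exchange between systems.
parsed = datetime.fromisoformat(stamp)
utc_stamp = parsed.astimezone(timezone.utc).isoformat()
print(utc_stamp)  # 2017-09-26T18:30:00+00:00
```

Because the offset is part of the string, a reader in any country recovers the same instant without knowing the writer's local notation.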

28 Spatial representation
ISO 19115:2003 defines the schema required for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data. ISO 19115:2003 is applicable to: the cataloguing of datasets, clearinghouse activities, and the full description of datasets; geographic datasets, dataset series, and individual geographic features and feature properties. From

29 More markup languages GML - Geography Markup Language – developed as a way to standardize geographic representations (to facilitate interoperability). ISO 19136:2007. Stores data and metadata. Because it focuses on coordinates, it is important for representing structural elements, such as points, lines and polygons, used in a specific discipline. Features application schema to represent roads, rivers, etc. Is stored statically as well as generated dynamically

30 Markup languages KML – Keyhole Markup Language – developed as an interlingua for a specific application, i.e. Google Earth. Currently stores data and metadata. XML tags and nesting provide for embedding structure and associations between metadata and data. Uses other markup languages, e.g. GML. Currently, KML 2.2 utilizes certain geometry elements derived from GML; these elements include point, line string, linear ring, and polygon. Can contain links (external) to other content. Increasingly it is generated dynamically rather than used as a storage format. KMZ – compressed version of KML
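
Since KML is increasingly generated dynamically, a minimal sketch of generating it may help. The code below builds a single `Placemark` with a `Point` in the KML 2.2 namespace using only the standard library. The function name `placemark_kml`, the placemark name and the coordinates are illustrative assumptions. Note that KML writes coordinates as longitude,latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ET.register_namespace("", KML_NS)  # serialize KML as the default namespace


def placemark_kml(name, lon, lat):
    """Return a KML document string holding one named point."""
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    # Metadata (the name) and data (the geometry) sit side by side,
    # associated by being nested in the same Placemark.
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Made-up site name and position, for illustration only.
kml_text = placemark_kml("Example site", -73.68, 42.73)
print(kml_text)
```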

31 Provenance in this data pipeline Provenance is metadata in context
What context? Who you are? What you are asking? What you will use the answer for? As soon as you even think about semantics and knowledge encoding/ representation, knowledge is everywhere…. Fox VSTO et al.

32 At the least Keyword-value pair
Obs_start_time=“Mon 1 Sep :22:30 EDT” Observer=“Peter Fox” But preferably more structure…
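
One way to move from flat keyword-value pairs toward "more structure" is sketched below. The parser `parse_pairs` and the nested layout are hypothetical choices made for this example, and the start time is an illustrative ISO 8601 value, not the one from the slide.

```python
import json


def parse_pairs(text):
    """Parse lines of the form key="value" into a flat dictionary."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip().strip('"')
    return pairs


# Illustrative input; the timestamp is a made-up ISO 8601 value.
raw = 'Obs_start_time="2017-09-01T10:22:30-04:00"\nObserver="Peter Fox"'
flat = parse_pairs(raw)

# The same information with explicit structure: the observer becomes an
# entity with its own attributes instead of a bare string.
structured = {
    "observation": {
        "start_time": flat["Obs_start_time"],
        "observer": {"name": flat["Observer"]},
    }
}

structured_json = json.dumps(structured, indent=2)
print(structured_json)
```

The flat form is easy to write, but the nested form leaves room to attach further context (for example an affiliation for the observer) without inventing new ad hoc keys.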

