Open data and data curation


1 Open data and data curation
Hamish James
Statistics New Zealand

2 Outline
Setting the scene
Open data
How open data and data curation are related

3 Quick definitions
[Diagram labels: information; structured, unstructured; digital, analogue; data; open data; data curation]

4 Defining data
Data consists of sets of structured values that can be organised, analysed and manipulated by a software application or some other means of calculation. This includes data collected directly through surveys and administrative systems, as well as data created or compiled by aggregating or reanalysing other sources. A defining characteristic of data is that it is machine-readable.

This definition of data places no restrictions on the topics data may refer to, or on the purposes towards which data may be used. Data may be collected about anything. Some examples of data include sets of measurements captured by scientific instruments, responses extracted from questionnaires and forms, and structured catalogue entries about physical or digital objects, such as books and artworks. The broad scope of what may be treated as data is also expanding as modern information technology and management techniques provide ways of creating structure in previously unstructured information sources. With such a wide definition of data, readers should bear in mind that their understanding of the term may be different from that of others.

5 Open data, data curation
Open data is a philosophy based on the idea that data is more valuable if more people can use it, and that technology has made the cost of sharing data negligible.
Data curation is a field of research and work focusing on the long-term management of data, built on the argument that the opportunity cost of losing data is high.
Open data highlights benefits.
Data curation worries about costs.

6 data knowledge value

7 Focus of open data activities
Data collected and held by governments
Data collected or generated through publicly funded research
DataPrinciples

8 Reasons to make data open
The underlying purposes of making publicly funded data more accessible are to:
inform decision making by government, businesses and communities
increase transparency and accountability in government decision making
assist informed participation by the public in government decision making
promote economic development through the innovative application of data collected for one purpose to other tasks
gain greater value from research data

9 Barriers to reuse of government data
Agency culture (reluctance or hostility to data sharing)
Funding constraints
Ensuring data confidentiality
Shared ownership
Poor dissemination practices

Last year we did what government departments do best: we produced a paper!
Barrier to reuse of government held data
The paper drew on a review of published international and national commentary and evidence along with a small number of New Zealand government agencies that are already making data available for reuse.

Government agencies may be uninterested in, or actively opposed to, the release of information and data. Although most evidence relates to access to official information or freedom of information laws, it seems likely that a culture of restricting access to information will also encompass restricting access to data.

Data reuse initiatives can require significant funding. The costs of developing and maintaining data discovery and access systems are much easier to measure than the benefits that flow from data reuse, and this may make it difficult to construct strong business cases for investment.

Concerns that allowing data reuse may breach privacy or confidentiality can make agencies reluctant to allow data to be reused. General legislation, such as the Privacy Act, and specific legislation, such as the Statistics Act, impose requirements on government agencies to protect data.

A lack of attention to how data is disseminated can greatly complicate reuse of the data. The adoption of standard machine-readable formats, consistent discovery metadata, and simple licensing arrangements will make it easier for others to reuse data.

Barriers to the reuse of data can emerge where government agencies do not have sole ownership of the data they wish to make available. Data that is produced as a result of a public-private partnership can be difficult to make publicly available for reuse because of the potential threat to the commercial interests of the private partner. Data may have direct sale value, or may embody intellectual property of value. Maori could view a range of government-held datasets as holding customary knowledge and expect access to be restricted.

10 Open Government Data Principles
Government data shall be considered open if it is made public in a way that complies with the principles below:
Complete. All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
Primary. Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
Timely. Data is made available as quickly as necessary to preserve the value of the data.
Accessible. Data is available to the widest range of users for the widest range of purposes.
Machine processable. Data is reasonably structured to allow automated processing.
Non-discriminatory. Data is available to anyone, with no requirement of registration.
Non-proprietary. Data is available in a format over which no entity has exclusive control.
License-free. Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
OECD Principles and Guidelines for Access to Research Data from Public Funding

11 Characteristics of open data
Free and open access to the data
Freedom to redistribute the data
Freedom to reuse the data
No restriction of the above based on who someone is (e.g. their nationality) or their field of endeavour (e.g. commercial or non-commercial)
c.f.

12

13 Creative Commons licence conditions
Attribution
Share-alike
No derivative works
Non-commercial
NZGOAL (New Zealand Government Open Access and Licensing Framework) advocates the use of Creative Commons licences

14 Linked data
Linked data uses semantic web approaches (especially RDF) to describe data and make it accessible to machines – a web of linked data
RDF ‘triples’ are used to describe things
Subject – predicate – object
Hamish – is a – presenter
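The subject – predicate – object pattern can be sketched with plain Python tuples. This is only an illustration under stated assumptions: the `ex:` prefixed names and the `objects_of` helper are hypothetical, and no real RDF library or vocabulary is being used.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers; a real linked dataset would use full URIs.
triples = [
    Triple("ex:Hamish", "ex:isA", "ex:Presenter"),
    Triple("ex:Hamish", "ex:worksFor", "ex:StatisticsNewZealand"),
]


def objects_of(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in graph
            if t.subject == subject and t.predicate == predicate]


print(objects_of(triples, "ex:Hamish", "ex:isA"))
```

Because every statement has the same three-part shape, statements from different sources can be pooled into one list and queried together, which is roughly the idea behind a "web of linked data".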

15 Linking Open Data dataset cloud

16 What is missing?

17 46
Loss of knowledge about the dataset is the key problem we face. Datasets are easy to break apart or put together. It is not enough to be able to find a dataset, you also need to be able to drill down into the dataset and examine individual records or variables. This number means nothing without context. For datasets, that context is provided by creating metadata for the dataset.

18 Data needs context
Simple examples of metadata for a data value. The loss of one piece of this metadata could render the data useless.
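One way to picture metadata for a single data value is a record that carries the value together with its context. The sketch below is hypothetical: the field names (`variable`, `unit`, `population`, `reference_period`) are assumptions chosen for illustration, not the metadata shown on the slide.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DescribedValue:
    """A data value plus the context needed to interpret it (assumed fields)."""
    value: float
    variable: Optional[str] = None
    unit: Optional[str] = None
    population: Optional[str] = None
    reference_period: Optional[str] = None

    def missing_context(self):
        """List the metadata fields that have not been supplied."""
        return [f.name for f in fields(self)
                if f.name != "value" and getattr(self, f.name) is None]


# A bare number, as on the previous slide: every piece of context is missing.
bare = DescribedValue(46)
print(bare.missing_context())
```

A value is only safe to reuse when `missing_context()` returns an empty list; losing any one field puts the reader back in the position of guessing.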

19 Examples
“Which town or city in the UK has the highest proportion of students?”
“Which town or city in the UK is home to one or more university campuses whose registered full or part time (non-distance) students divided by the local population gives the largest percentage?”
ed-data-and-reality.html

20 re/use
render: Technology: Hardware, Formats, Software
explain: Documentation: Standards, Meaning, Interpretation

21 data knowledge value Technology to render data
Documentation to explain

22 What is missing? Context
Data is not self-describing
Who provides the description?
What does it cost to provide the description?
How much of the description is held as tacit knowledge?
Expert’s personal knowledge
Rules and meaning encoded into the data and software

23 I'm not sure exactly what it was that you are thinking of
I'm not sure exactly what it was that you are thinking of? However, here are just a few cases where I had to do some extra work when the information was not available:

Firstly, in the Ag Economic Survey I had to trawl through correspondence in off-site boxes to 'cement' together a list of the codes and categories that related to the pre-1989 local govt areas used in the valuation rolls, which was not in cars (that took about hours work spread over weeks). The problem was that often you had parts of lists but not a list where you had codes against descriptions. At least this information was able to be used for the older Ag Production later.

Farmtypes lists at the front of the Ag Production publications in the 1970s and 80s were not always what was in the actual data or published tables, but by doing SAS frequency analysis I reconciled totals against the published tables to confirm the codes. Similar analysis was done for the breeds data some years against published tables.

The Agriculture Production (Annual Census of Farms Year Ended 30 June 1979) study had no questionnaire or other proxy for linecodes, so we did 1980 and 1978 before and then did frequency analysis on all three and confidently confirmed most linecodes using published tables and the straddling years for reality checks. This took about 20 hours work just to do this. Then the work was as per other similar ingests.

General comment is that when you are not familiar with data it takes a lot of analysis to try and 'validate' linecode and category descriptions. It gets easier after the initial investment of time if there are common variables across years, but then suddenly they all change and you have to slow down and get familiar with a new set of linecodes.

I have attached 1979 DDI for now. Let me know if you need any more, cheers (See attached file: 5081_MD.xml)
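The reconciliation described in the message (totalling values per undocumented code and comparing them with published tables) can be sketched as follows. The original work used SAS; this is a Python stand-in, and the codes, categories and figures are invented purely for illustration.

```python
from collections import defaultdict


def totals_by_code(records):
    """Sum the recorded values for each code found in the data."""
    totals = defaultdict(int)
    for code, value in records:
        totals[code] += value
    return dict(totals)


def match_codes(records, published):
    """Map each undocumented code to the published category whose total it equals.

    Codes with no match, or with more than one candidate, are left out
    so they can be checked by hand.
    """
    matches = {}
    for code, total in totals_by_code(records).items():
        candidates = [name for name, pub_total in published.items()
                      if pub_total == total]
        if len(candidates) == 1:
            matches[code] = candidates[0]
    return matches


# Invented illustration data: (code, value) pairs and published category totals.
records = [("101", 120), ("101", 80), ("102", 50), ("103", 7)]
published = {"Sheep": 200, "Beef cattle": 50}

print(match_codes(records, published))
```

Exact matching on totals is a deliberate simplification. In practice published tables may be rounded or suppressed, which is one reason the message mentions using the straddling years as a reality check.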

24 Data curation
Data curation involves:
Data management
Adding value to data
Data sharing for re-use
Data preservation for later re-use
[Key: open data, data curation]

25 Digital Curation Centre

26 DDI Alliance

27 Open data brings benefits and risks
more users
highlights data curation failures
justifies data curation costs
pressure for more user support
expands expert community
increases risk of poor analysis

28 Complementary ideas
Actively curated data will:
Remain technologically accessible
Be easier to understand (and therefore use)
Data curation will benefit from data being made more open:
Data that is in active use tends to remain usable
Widely used data is better understood than isolated data

29 Thank you
Contact details
Hamish James
Manager, Information Management

