Beyond Barriers Exporting data quality assessments from Spain Arturo H. Ariño 1,2, Francisco Pando 3, Javier Otegui 1,4, Cristina Villaverde 3, Katya Cezón.


1 Beyond Barriers Exporting data quality assessments from Spain Arturo H. Ariño 1,2, Francisco Pando 3, Javier Otegui 1,4, Cristina Villaverde 3, Katya Cezón 3 (1) Department of Environmental Biology, University of Navarra, Pamplona, Spain (2) Science Committee of the Global Biodiversity Information Facility (GBIF), Copenhagen, Denmark (3) The Spanish coordinating unit of GBIF (GBIF.ES), CSIC, Madrid, Spain (4) Now at the University of Colorado-Boulder, Boulder, CO, USA Biodiversity Information Standards (TDWG) Florence, Italy, October 2013

2 DATA QUALITY (DQ) IN SPAIN Collaborative, convergent effort GBIF.ES acting on DQ since its inception: – Realization of DQ issues as data came in – Focus on training; courses and seminars to help prevention – Tools to facilitate DQ control Research group at UNAV – Conceptual interest in DQ – In-depth assessment of DQ in available data – Tools to facilitate DQ analysis National and global vision

3 WHAT, WHERE, WHEN OUR TARGET DATA Primary Biodiversity Data Record PBR

4 PBR: HIGH DQ, FIT FOR ANY USE Megaptera novaeangliae Adult female, live Off North Truro, MA, USA N, W :47 GMT Arturo H. Ariño Aboard Dolphin VI Canon Eos 450D, 200 mm lens

5 PBR: DQ JUST ENOUGH FOR WHAT USE? Nautilus pompilus 4 specimens Off Palau Islands 1921 Legit: unknown Det.: J.A. Salinas Collection: JDR at MZNA

6 FITNESS-FOR-USE
FFU (1) defines whether data can be used for a specific purpose
Useful compromise for publishing data
FFU not equal to data quality
Quality | Fitness-for-use
Intrinsic to data | Depends on intended use
Conceptual | Pragmatical
Good quality predicting good FFU | Good FFU not predicting good quality
(1) Hill et al. 2010
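
The distinction between quality and fitness-for-use can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three intended uses, the fields each one requires, and the sample record are hypothetical, and only the field names follow Darwin Core terms. The point is that the same record can be fit for one purpose and unfit for another.

```python
from typing import Mapping, Optional

# Hypothetical field requirements per intended use (illustrative only).
REQUIRED_FIELDS = {
    "species_checklist": ("scientificName",),
    "distribution_map": ("scientificName", "decimalLatitude", "decimalLongitude"),
    "phenology_study": ("scientificName", "eventDate"),
}


def is_fit_for_use(record: Mapping[str, Optional[str]], use: str) -> bool:
    """Return True when every field required by `use` is present and non-empty."""
    return all((record.get(field) or "").strip() for field in REQUIRED_FIELDS[use])


# Invented record: it has a name and a year, but no coordinates.
record = {
    "scientificName": "Nautilus",
    "eventDate": "1921",
    "decimalLatitude": None,
    "decimalLongitude": None,
}

# Evaluate the same record against each intended use.
fitness = {use: is_fit_for_use(record, use) for use in REQUIRED_FIELDS}
```

In this sketch the record is usable for a checklist or a coarse phenology question but not for mapping, although nothing about the record itself changed.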

7 PUBLISHERS: IT’S ABOUT ‘OUR’ DATA Publisher: Swedish. Publisher: German. Publisher: French. Publisher: British. Publisher: Norwegian. Publisher: Parisien. Publisher: Spanish. Otegui, Robles & Ariño, eBiosphere, London, UK.

8 SO, DO WE PUBLISH RIGHT? Publishers take some responsibility for ensuring DQ If DQ not enough, at least inform users about FFU: – Disclose what you know – Research what you do not know – Fix what you can Corollary: Pursue DQ for the data you’re responsible for

9 FFU and DQ ASSESSMENT (UNAV) In 2006 AHA started analyzing our own DB for FFU, creating pattern-detection visualizations – First reported in TDWG-2006 (St. Louis) In 2008 we started to analyze raw & processed GBIF data – Building on works by Chapman, Yesson, Wieczorek, etc., changing scope and perspective. Started publishing reports (state of the data) in 2009 Commissioned by GBIF.ES to analyze Spanish data (2010) Teamed up with GBIF-Sec, 2011 Created BIDDSAT tool, 2012

10 GBIF.ES AND DQ Main concern, as gbif.es hosts and manages data for many providers Strong training program, started in 2007 Develops a validation tool to help publishers ensure standards compliance (DarwinTest) Produces a DQ index for the data being published through the node (ICA) Maintains a repository for DQ: Biodiversity Data Quality Hub (BDQ)

11 THEMES Analytics, assessments DQ and FFU tools Training

12 ASSESSMENTS: AN EXAMPLE Well-known geographical biases Spatial patterns appearing in the geographical distributions Were they consistent? Discovery of time-related issues

13 Distribution of records through day of year in gbif.es Modified from Otegui, Ariño, Encinas,Pando 2013
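
A day-of-year distribution like the one on this slide can be computed with a few lines of code. The sketch below is not the authors' analysis: the sample dates are invented, and the threshold of three times the median daily count is an arbitrary assumption chosen only to show how a pile-up on a single day could be flagged.

```python
from collections import Counter
from datetime import date
from statistics import median


def day_of_year_counts(dates):
    """Count records per day of year (1 to 366)."""
    return Counter(d.timetuple().tm_yday for d in dates)


def overrepresented_days(counts, factor=3.0):
    """Days whose count exceeds `factor` times the median daily count.

    The factor is an arbitrary, illustrative threshold.
    """
    typical = median(counts.values())
    return sorted(day for day, n in counts.items() if n > factor * typical)


# Invented sample: five records dated 1 January, four scattered elsewhere.
sample = [
    date(1990, 1, 1), date(1991, 1, 1), date(1992, 1, 1),
    date(1993, 1, 1), date(1994, 1, 1),
    date(1990, 5, 14), date(1991, 6, 2), date(1992, 7, 21), date(1993, 9, 9),
]

counts = day_of_year_counts(sample)
flagged = overrepresented_days(counts)
```

Whether a flagged day reflects real sampling effort or a data-entry artifact still has to be judged by someone who knows the source collections.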

14 THE PYRENEAN RANGE > data points > species > localities

15 PATTERNED DATA GROWTH All georeferenced data at 1/100 degree – Occurrences Modified from Ariño et al. 2012

16 PATTERNED DATA GROWTH All georeferenced data at 1/100 degree – Occurrences 1970 Modified from Ariño et al. 2012

17 PATTERNED DATA GROWTH All georeferenced data at 1/100 degree – Occurrences 1980 Modified from Ariño et al. 2012

18 PATTERNED DATA GROWTH All georeferenced data at 1/100 degree – Occurrences 1990 Modified from Ariño et al. 2012

19 PATTERNED DATA GROWTH All georeferenced data at 1/100 degree – Occurrences 2000 Modified from Ariño et al. 2012

20 PATTERNED DATA GROWTH All georeferenced data at 1/100 degree – Occurrences 2010 Modified from Ariño et al. 2012

21 Timeless Data All georeferenced data at 1/100 degree – Occurrences Modified from Ariño et al. 2012

22 Preserved specimens Observations Pre Post-2000 UNDATED

23 Modified from Otegui et al, 2013 All GBIF data Some date element wrong Some date element missing All date elements missing
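
The date categories named on this slide can be expressed as a small classifier. This is a sketch under assumptions, not the procedure used in the cited paper: the function name, the `max_year` cut-off, and the rule that a wrong element takes precedence over a missing one are all choices made here for illustration.

```python
import calendar
from typing import Optional


def classify_date(year: Optional[int], month: Optional[int],
                  day: Optional[int], max_year: int = 2013) -> str:
    """Classify a record's date elements into one of four categories."""
    parts = (year, month, day)
    if all(p is None for p in parts):
        return "all date elements missing"

    # A present element that is out of range counts as wrong.
    if year is not None and not 1 <= year <= max_year:
        return "some date element wrong"
    if month is not None and not 1 <= month <= 12:
        return "some date element wrong"
    if day is not None:
        if year is not None and month is not None:
            # Use the real month length, which accounts for leap years.
            max_day = calendar.monthrange(year, month)[1]
        else:
            max_day = 31
        if not 1 <= day <= max_day:
            return "some date element wrong"

    if any(p is None for p in parts):
        return "some date element missing"
    return "complete"


examples = {
    "undated": classify_date(None, None, None),
    "year_only": classify_date(1921, None, None),
    "impossible_day": classify_date(2001, 2, 30),
    "full_date": classify_date(2010, 7, 15),
}
```

Tallying these labels over a whole dataset gives the kind of breakdown the slide shows.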

24 Bad data Good data

25 Otegui, Ariño, Gaiji & Chavan, 2013 Publishers with wrong records One publisher Impact of single publishers on general DQ-T

26 Recalculated from Ariño & Otegui, 2010 and Otegui et al. 2013

27 CONTROL AND FFU TOOLS AT INDEXING Gaiji et al., 2013

28 TOOLING: CURB PROBLEMS, HUNT ISSUES Pre-Upload Data validation: DarwinTest Overall DQ assessment: ICA Offence identification: BIDDSAT

29 DarwinTest MS-Access-based validation tool Easy GUI, uses forms Checks and validates DarwinCore files for many common error types Allows correction directly from the validation forms Allows automated coordinate transformation and generalization/obfuscation for sensitive data
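
DarwinTest itself is an MS-Access application, so the following is not its code. It is only a minimal Python sketch of two ideas from this slide: a range check on coordinates, and generalization of coordinates for sensitive data. The function names, the choice of rounding to one decimal place, and the sample coordinates are assumptions made for illustration.

```python
def coordinates_valid(lat: float, lon: float) -> bool:
    """Check that a latitude/longitude pair lies within the valid decimal-degree ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def generalize(lat: float, lon: float, decimals: int = 1) -> tuple:
    """Coarsen coordinates by rounding to `decimals` decimal places.

    One decimal place is roughly a 0.1-degree grid; the right precision
    for sensitive records is a policy decision, not fixed here.
    """
    return (round(lat, decimals), round(lon, decimals))


# Invented coordinates, not taken from the presentation.
precise = (42.0537, -70.1123)

is_valid = coordinates_valid(*precise)
obfuscated = generalize(*precise)
```

Rounding is the simplest form of generalization; it hides the exact locality while keeping the record usable at coarse spatial scales.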

30 DarwinTest DOWNLOADS Source code at SourceForge under CCAS license:

31 DarwinTest IN ACTION

32 The ICA Apparent Quality Index Three components (taxonomy, georeferencing, dates) Calculated on dataset from DT Improving over time
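
The slides do not give the ICA formula, so the sketch below should be read as a hypothetical stand-in rather than the actual index. It assumes that each of the three components (taxonomy, georeferencing, dates) is scored as the fraction of records passing a simple presence check, and that the overall index is their unweighted mean. The checks, the weighting, and the sample records are all invented.

```python
def has_name(record):
    return bool(record.get("scientificName"))


def has_coordinates(record):
    return (record.get("decimalLatitude") is not None
            and record.get("decimalLongitude") is not None)


def has_date(record):
    return bool(record.get("eventDate"))


# Component checks: hypothetical, one per component named on the slide.
COMPONENTS = {
    "taxonomy": has_name,
    "georeferencing": has_coordinates,
    "dates": has_date,
}


def apparent_quality_index(records):
    """Return per-component scores (0 to 1) and their unweighted mean."""
    scores = {
        name: sum(1 for r in records if check(r)) / len(records)
        for name, check in COMPONENTS.items()
    }
    return scores, sum(scores.values()) / len(scores)


# Invented dataset of four records.
dataset = [
    {"scientificName": "A", "decimalLatitude": 1.0, "decimalLongitude": 2.0, "eventDate": "2001-05-01"},
    {"scientificName": "B", "decimalLatitude": 3.0, "decimalLongitude": 4.0, "eventDate": "1999"},
    {"scientificName": "C", "eventDate": "1985-10-12"},
    {"scientificName": "D"},
]

scores, index = apparent_quality_index(dataset)
```

Recomputing such an index after each upload is one way to track whether a dataset is improving over time.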

33 RULE-BASED FILTERS: AUTOMATED REJECTION

34 PATTERN-BASED FILTERS: DQ SPOTTING [Table of latitude and longitude pairs; values not recoverable apart from 30.37] Otegui & Ariño, Proceedings of the TDWG 2009 Annual Conference, Montpellier, FR

35 - + Otegui et al., 2012 FILTERS CANNOT GET ALL

36 BIDDSAT Tool to detect space-time and other patterns Applicable to data publishers sharing data through GBIF Uses tailored visualizations Open source: https://github.com/jotegui/BIDDSAT Bioinformatics, DOI: /bioinformatics/BTS359

37

38 DATA COMPLETENESS [Plot: number of collections by percentage of completeness, 0 to 100] Source: BIDDSAT

39 DATA COMPLETENESS [Plot: number of collections by percentage of completeness, 0 to 100] Wrong implementation of exchange standards (DwC) – solvable Data loss – not solvable Limited room for improvement Source: BIDDSAT
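
A completeness plot of this kind can be approximated as follows. This is a sketch only and does not reproduce BIDDSAT: the definition of completeness as the share of non-empty values across a chosen set of fields, the 10-point bins, and the two sample collections are assumptions made here.

```python
from collections import Counter

# Fields whose presence is measured; an illustrative choice.
FIELDS = ("scientificName", "eventDate", "decimalLatitude")


def completeness_percent(records, fields=FIELDS):
    """Percentage of non-empty values across `fields` for one collection."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return 100.0 * filled / total


def histogram(percentages, bin_width=10):
    """Number of collections per completeness bin, keyed by the bin's lower bound."""
    bins = Counter(
        min(int(p // bin_width) * bin_width, 100 - bin_width) for p in percentages
    )
    return dict(sorted(bins.items()))


# Invented collections: A is fully populated, B is half empty.
collections = {
    "A": [
        {"scientificName": "x", "eventDate": "2000", "decimalLatitude": 1.0},
        {"scientificName": "y", "eventDate": "2001", "decimalLatitude": 2.0},
    ],
    "B": [
        {"scientificName": "x", "eventDate": "", "decimalLatitude": None},
        {"scientificName": "y", "eventDate": "1990", "decimalLatitude": None},
    ],
}

percentages = {name: completeness_percent(recs) for name, recs in collections.items()}
bins = histogram(percentages.values())
```

A measure like this cannot tell whether an empty field reflects a mapping error in the exchange standard (solvable) or information that was never recorded (not solvable); that distinction needs the source database.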

40 Chronhorogram. [Plot axes: day of year from 1/Jan to 31/Dec, grouped by season (Winter, Spring, Summer, Fall); Year, with 1750 marked] Introduced by Ariño & Otegui, 2008, TDWG

41 Source: BIDDSAT

42 Hebdogram. [Plot legend: - to +] Introduced by Ariño & Otegui, Proceedings of TDWG

43 2008/05 Data Provider Codename: BORODIN

44 Data Provider Codename: BORODIN 2009/09 2008/05

45 TRAINING: PROMOTE DQ Extensive training at gbif.es, first seminar in 2007 International – strong recruitment in the Americas Materials made publicly available Repository of DQ materials: BDQ

46 DQ TRAINING Started 2007 On-site workshops: – III GBIF Workshop on Biodiversity Database Quality (2009) – Etc. On-line workshops: – E-learning at GBIF.ES: IV Workshop on Biodiversity Database Quality (2013) – Etc. Video recordings of workshops

47 DQ TRAINING

48 DQ TRAINING: ON SITE

49 DQ TRAINING: ONLINE Started 2010, 7 courses so far Enrollment: 130 students, 16 countries Two components: – ATutor, open during the actual courses – AContent, permanent repository for the courses SCORM package

50 BIODIVERSITY DATA QUALITY HUB DQ Resource Locator. Proposed at the GBIF European Node Meeting, 2011 Compatible with GBIFS’ ORC Includes tools, thesauri, training materials, experiences, Prezi presentation Allows for resource submission

51

52

53 EXPORTING DQA AND DQC BENEFITS ALL Gaps can be discovered: effort steered – Completeness – Taxonomical overlook Meaningful patterns identified – Some patterns are artifacts – Actual patterns separated from noise

54

55 Fungi TAXONOMY TREEMAP Recalculated from Gaiji et al., 2013

56 CLASSIFICATION ACCORDING TO: ‘Scaryplot’, from Ariño & Robles, 2006 Redrawn from Ariño, Otegui & Robles, 2009 PROVIDERSP2K GBIF RECORDS SAMPLE EVOLUTIONARY DATA

57 SPAIN’S STAKES AT PBR: DATA IS USED Number of papers using data retrieved from GBIF Report by Ariño, 2013 for the GBIF Science Committee (unpublished)

58 CITED LITERATURE
Ariño AH, Otegui J (2008) Sampling Biodiversity Sampling. In Weitzman A, Belbin L (eds), Proceedings of TDWG (2008). Biodiversity Information Standards (TDWG), Fremantle, AU, p
Ariño AH, Otegui J (2009) Meta-análisis de los datos de biodiversidad suministrados a través de gbif.es. Universidad de Navarra; Available:
Ariño AH, Otegui J, Villarroya A, Pérez de Zabalza A (2012) Primary Biodiversity Data Records in the Pyrenees. Environmental Engineering and Management Journal, 11(6), 1059–1075.
Ariño AH, Robles E (2008) Variable-Level Nomenclators. In Belbin L, Rissoné A, Weitzman A (eds), Proceedings of TDWG 2006, St Louis, Missouri. P. 53.
Gaiji S, Chavan V, Ariño AH, Otegui J, Hobern D, Sood R, Robles E (2013) Content assessment of the primary biodiversity data published through GBIF network: Status, Challenges and Potentials. Biodiversity Informatics, 8(2):
Hill AW, Otegui J, Ariño AH, Guralnick RP (2010) GBIF Position Paper on Future Directions and Recommendations for Enhancing Fitness-for-Use Across the GBIF Network. GBIF (p. 25). Global Biodiversity Information Facility.
Otegui J, Ariño AH (2009) Have Standards Enhanced Biodiversity Data? Global correction and acquisition patterns. In Weitzman A (ed), Proceedings of TDWG, Montpellier, FR. P
Otegui J, Ariño AH (2013) BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics, 28(16):
Otegui J, Ariño AH, Encinas MA, Pando F (2013) Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). PLoS ONE 8(1): e doi: /journal.pone
Otegui J, Ariño AH, Chavan V, Gaiji S (2013) On the Dates of the GBIF Mobilised Primary Biodiversity Data Records. Biodiversity Informatics, 8(1), 173–184.
Otegui J, Robles E, Ariño AH (2009) Noise in Biodiversity Data. Poster presented at e-Biosphere 09. e-Biosphere Conference 2009, Conference Abstracts, London; McLeod N. & J. Edwards, eds. P. 190.

59 THE END. THANK YOU. WITH SPECIAL THANKS TO: THE PEOPLE AT THE GBIF SECRETARIAT (COPENHAGEN, DENMARK); THE PEOPLE AT THE SPANISH COORDINATION UNIT OF GBIF (GBIF.ES), ROYAL BOTANICAL GARDEN, HIGHER COUNCIL OF SCIENTIFIC RESEARCH (MADRID, SPAIN); THE PEOPLE AT THE DEPARTMENT OF ENVIRONMENTAL BIOLOGY (AMBIUN), THE UNIVERSITY OF NAVARRA (PAMPLONA, SPAIN). PARTS OF THIS WORK PRODUCED BY THE FRIENDS OF THE UNIVERSITY OF NAVARRA ASSOCIATION. No bytes were seriously harmed while preparing this PPTX. (And copies exist of those who actually were anyway). This file used 1328 watt-hours and 16 cups of black coffee. All images, plots and analyses by the authors except where otherwise noted. PPTX © 2013 A.H. Ariño, University of Navarra. BIDDSAT: WWW.UNAV.ES/UNZYEC/MZNA/BIDDSAT/, WWW.NCBI.NLM.NIH.GOV/PUBMED/ SOON IN A PDF NEAR YOU.

