TDWG- Lisbon Oct 2003 Data Cleaning Tools and Methodologies Arthur D. Chapman Australia / Brazil Centro de Referência em Informação Ambiental.


1 TDWG- Lisbon Oct 2003 Data Cleaning Tools and Methodologies Arthur D. Chapman Australia / Brazil Centro de Referência em Informação Ambiental

2 Background: ERIN/CRIA, speciesLink, FAPESP/Biota

3 Species Data: Museum/Herbarium, Observation, Survey

4 Data Error: Names, Geocode, Altitude, Collectors, Dates

5 Adding Data to the Database: Software –Biota –BRAHMS –Specify –BioLink –EGaz –Etc.

6 On-line Tools: BioGeomancer (http://www.biogeomancer.org/); CRIA-localidade (http://www.cria.org.br/localidade/); Guidelines –MANIS (http://dlp.cs.Berkeley.edu/manis/GeorefGuide.html) –HISPID –Data Cleaning and Validation

7 Data quality - fitness for use

8 Recording Accuracy and Error: Additional Accuracy Fields –Preferably in meters (Point-Radius); Documenting Validation tests –Who –What –How

9 Methods for geocode validation: Internal Database Checks; Outliers in Geographic Space - GIS; Outliers in Environmental Space - Models; Statistical outliers

10 Internal Database Checks: Internal inconsistencies; Checking one field against another –Text location vs geocode; Checking one database against another –Gazetteers –DEM –Collectors

11 Geographic outliers - GIS: Country, State, named district, etc.

12 Geographic outliers - GIS

13 Geographic Outliers - GIS: Collectors – location vs date

14 Environmental Outliers: Cumulative Frequency Curves

15 Acacia orites - 19 records - 9 Temperature parameters; Reverse Jack-knife

16 Outliers in climate space: T = 0.95(√n) + 0.2, where ‘n’ is the number of records
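The threshold on this slide can be worked through in a short Python sketch. Only the formula T = 0.95(√n) + 0.2 comes from the slide. The gap statistic below (gap to the next sorted value, weighted by distance from the mean and divided by the standard deviation) is an assumption about how a reverse jack-knife test might be scored, and the function names are hypothetical.

```python
import math
import statistics


def critical_value(n):
    """Threshold from the slide: T = 0.95 * sqrt(n) + 0.2."""
    return 0.95 * math.sqrt(n) + 0.2


def reverse_jackknife(values):
    """Flag values separated from the rest by an unusually large gap.

    ASSUMPTION: the score for each gap is
    gap * (distance of the outer value from the mean) / standard deviation.
    The slide gives only the threshold, not this statistic.
    """
    x = sorted(values)
    n = len(x)
    if n < 3:
        return []
    mean = statistics.mean(x)
    sd = statistics.stdev(x)
    if sd == 0:
        return []
    t = critical_value(n)
    flagged = []
    for i in range(n - 1):
        gap = x[i + 1] - x[i]
        if x[i] < mean:
            # Gap on the low side: everything at or below x[i] is suspect.
            if gap * (mean - x[i]) / sd > t:
                flagged.extend(x[: i + 1])
        else:
            # Gap on the high side: everything above x[i] is suspect.
            if gap * (x[i + 1] - mean) / sd > t:
                flagged.extend(x[i + 1:])
    return sorted(set(flagged))


# Illustrative data only: 18 tightly grouped temperatures plus one stray value.
temps = [20.0 + 0.1 * i for i in range(18)] + [35.0]
print(round(critical_value(len(temps)), 3))
print(reverse_jackknife(temps))
```

For the 19 records mentioned on the previous slide, the threshold works out to roughly 4.34.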

17 FloraMap: CIAT (Colombia); PCA; Cluster Analysis; $US100; Modelling; 10-minute grids

18 Principal Components Analysis - FloraMap: Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data. A. Principal Components Analysis. B. Specimen record. C. Mapped specimen. D. Climate profile.

19 Cluster Analysis - FloraMap: Image from FloraMap (Jones and Gladkov 2001) showing use of Cluster Analysis to identify an outlier in Rauvolfia littoralis specimen data. A. Cluster Analysis. B. Principal Components Analysis. C. Mapped specimen. D. Climate profile. E. Specimen record.

20 Diva-GIS: Free; Simple GIS; Modelling (BIOCLIM/Domain); Data Cleaning Tools

21 Diva-GIS – Coordinate Check: Using Diva-GIS to check coordinates by comparing a file of point specimen records (red) against a polygon of Bolivian provinces. Input dialogue box is shown at A, where it can be seen that “STATE” in the point file has been set to the equivalent “DEPARTMENT” in the polygon file (Hijmans et al. 2003).

22 Points outside Polygon – Diva-GIS: Results from Diva-GIS (Hijmans et al. 2003) showing point records that fall outside all polygons in the Bolivian provinces polygon file. The highlighted record shows the linking between the results dialogue box and the mapped record.

23 Mismatched Provinces – Diva-GIS: Results from Diva-GIS (Hijmans et al. 2003) showing point records that do not match set relationships between the specimen point file and the polygon of Bolivian provinces. The highlighted record is one where the geocoding on the specimen record causes it to fall in the wrong province.
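The two checks described on the preceding slides (points that fall outside every polygon, and points that fall in a different province from the one recorded) both come down to a point-in-polygon test. The following is a minimal sketch using ray casting in plain Python. It is not Diva-GIS's implementation: the function names, the record fields (`lon`, `lat`, `state`) and the two square "provinces" are hypothetical and chosen only for illustration.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring.

    `ring` is a list of (lon, lat) vertices; the last vertex is implicitly
    joined back to the first.
    """
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def check_record(record, regions):
    """Compare a record's stated region with the region its geocode falls in."""
    containing = [
        name
        for name, ring in regions.items()
        if point_in_polygon(record["lon"], record["lat"], ring)
    ]
    if not containing:
        return "outside all polygons"
    if record["state"] not in containing:
        return "mismatch: falls in " + containing[0]
    return "ok"


# Hypothetical regions: two adjacent squares standing in for province polygons.
regions = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
for rec in (
    {"state": "A", "lon": 5, "lat": 5},
    {"state": "A", "lon": 15, "lat": 5},
    {"state": "A", "lon": 25, "lat": 5},
):
    print(rec, check_record(rec, regions))
```

A production check would read real administrative boundaries with a GIS library and handle polygons with holes or multiple parts, which this sketch ignores.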

24 Assign Coordinates – Diva-GIS: Results from Diva-GIS (Hijmans et al. 2003) showing point records with geocodes automatically assigned. A. Unambiguous geocodes found by the program and assigned. B. Ambiguous geocodes identified. C. Appropriate geocodes not found.

25 Multiple possibilities – Diva-GIS: Results from Diva-GIS (Hijmans et al. 2003) showing alternate geocodes for a record where use of the Gazetteer has produced a number of credible alternatives.

26 Cumulative Frequency Curves - Diva-GIS: Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5 percentile.

27 Bioclimatic Envelope – Diva-GIS: Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Bioclimatic Envelope from BIOCLIM to identify outliers in climate space. In this case the percentile cut off is set at 95. Red points on the envelope correspond with red points on the map, green points in the envelope correspond with yellow points on the map.
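A percentile envelope of the kind shown on the two slides above can be sketched in a few lines of Python. This is an illustration of the general idea, not BIOCLIM's or Diva-GIS's algorithm. It assumes that a cut off of 95 means the central 95% of values is kept for each climate variable (2.5% trimmed from each tail), and the function names and the `temp` field are hypothetical.

```python
def percentile(sorted_vals, p):
    """Percentile (0-100) of an ascending list, by linear interpolation."""
    if not sorted_vals:
        raise ValueError("no values")
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def envelope_outliers(records, variables, cutoff=95.0):
    """Return {record index: [variables outside the envelope]}.

    ASSUMPTION: `cutoff` is the central share of values kept per variable,
    so cutoff=95 puts the limits at the 2.5 and 97.5 percentiles.
    """
    tail = (100.0 - cutoff) / 2.0
    flagged = {}
    for var in variables:
        vals = sorted(r[var] for r in records)
        low = percentile(vals, tail)
        high = percentile(vals, 100.0 - tail)
        for idx, rec in enumerate(records):
            if rec[var] < low or rec[var] > high:
                flagged.setdefault(idx, []).append(var)
    return flagged


# Illustrative data only: 21 records with a single climate variable.
records = [{"temp": float(i)} for i in range(21)]
print(envelope_outliers(records, ["temp"]))
```

Any percentile envelope flags the most extreme records by construction, so flagged records are candidates to check against the map, not confirmed errors.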

28 ANUCLIM: $AUD1000 (with data files); Modelling (BIOCLIM / ESOCLIM); Cumulative Frequency Curves; Parameter Extremes

29 Cumulative Frequency - ANUCLIM: Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the species accumulation curve with an identified outlier (labelled “bad”). Information from the “bad” record is displayed at the top of the log file (from Houlder et al. 2000).

30 Parameter extremes - ANUCLIM: Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the parameter extremes (top) and associated species accumulation curve (bottom) (from Houlder et al. 2000).

31 Statistical Tests: Outliers in Latitude; Outliers in Altitude; Outliers in collector's range per day or week –Especially 17th, 18th and 19th Century collections
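The collector's range test can be illustrated with a small Python sketch: order one collector's records by date, measure the great-circle distance between consecutive localities, and flag pairs that imply an implausible daily travel distance. The 50 km/day limit, the record fields (`id`, `date`, `lat`, `lon`) and the sample records are all hypothetical. A sensible limit depends on the collector, the terrain and the era.

```python
import math
from datetime import date


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def implausible_moves(records, max_km_per_day=50.0):
    """Return (earlier id, later id) pairs whose implied travel rate is too high.

    ASSUMPTION: max_km_per_day is an illustrative limit, not a published value.
    Records collected on the same day are treated as one day apart.
    """
    ordered = sorted(records, key=lambda r: r["date"])
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        days = max((cur["date"] - prev["date"]).days, 1)
        dist = haversine_km(prev["lat"], prev["lon"], cur["lat"], cur["lon"])
        if dist / days > max_km_per_day:
            flagged.append((prev["id"], cur["id"]))
    return flagged


# Hypothetical records for a single 19th-century collector.
records = [
    {"id": 1, "date": date(1820, 3, 1), "lat": -23.0, "lon": -47.0},
    {"id": 2, "date": date(1820, 3, 2), "lat": -23.1, "lon": -47.0},
    {"id": 3, "date": date(1820, 3, 3), "lat": -33.0, "lon": -47.0},
]
print(implausible_moves(records))
```

A flagged pair does not say which of the two records is wrong. It only marks a location or date worth checking against the collector's known itinerary.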

32 Thank You… Questions?