
Erroneous Distribution Data Identification Using Outlier Detection Techniques W. Zhuang, Y. Zhang, J.F. Grassle Rutgers, the State University of New Jersey,


1 Erroneous Distribution Data Identification Using Outlier Detection Techniques W. Zhuang, Y. Zhang, J.F. Grassle Rutgers, the State University of New Jersey, USA

2 Overview. Review of OBIS DQ issues. Review of existing DQ methods. Case study: detecting outliers in multidimensional data. Discussion and future directions.

3 Data Quality (DQ). DQ problems can arise at every step of the data life cycle:

4 DQ problems (I). Data gathering: instrument failures; false identifications; geo-referencing errors. Data storage: key metadata missing; erroneous data entry; database default values masquerading as real values.

5 DQ problems (II). Data delivery: data corruption due to encoding conversion. Data integration: duplicated records. Data retrieval: missing values. Data analysis/cleaning: inappropriate models used, etc.

6 DQ solving: a process-based approach. DQ solving is an essential component of data analysis and thus part of the data life cycle. A. It builds a foundation for analysis and modeling. B. It provides feedback to improve the whole data life cycle. C. It could lead to more DQ problems if not carefully executed.

7 DQ solving methods. Harvest metadata close to the data. Built-in integrity checks and double data entry. Model-based approach: a) statistical; b) heuristic.

8 OBIS DQ Study. Metadata-related problems. DQ of scientific names. Integrity checking. Redundant record detection. Outlier detection: a case study. Outliers sometimes represent erroneous data. We are examining data mining tools for detecting erroneous data points.
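The integrity checking and redundant record detection listed on this slide could look roughly like the sketch below. It is only an illustration: the `Record` layout, the field names, the sample rows, and the rounding tolerance are assumptions made here, not the OBIS schema or the authors' procedure.

```python
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    # Hypothetical record layout for illustration only (not the OBIS schema).
    name: str
    lat: Optional[float]
    lon: Optional[float]
    date: str


def integrity_problems(rec: Record) -> List[str]:
    """Return human-readable problems found in a single record."""
    problems = []
    if not rec.name.strip():
        problems.append("missing scientific name")
    if rec.lat is None or rec.lon is None:
        problems.append("missing coordinates")
    else:
        if not -90.0 <= rec.lat <= 90.0:
            problems.append("latitude out of range")
        if not -180.0 <= rec.lon <= 180.0:
            problems.append("longitude out of range")
        if rec.lat == 0.0 and rec.lon == 0.0:
            # (0, 0) is a typical database default masquerading as a value.
            problems.append("coordinates (0, 0): possible default value")
    return problems


def redundant_groups(records: List[Record], decimals: int = 4) -> List[List[int]]:
    """Group indices of records sharing name, rounded position and date."""
    groups = {}
    for i, rec in enumerate(records):
        if rec.lat is None or rec.lon is None:
            continue  # cannot compare positions; left to the integrity check
        key = (rec.name.strip().lower(),
               round(rec.lat, decimals), round(rec.lon, decimals), rec.date)
        groups.setdefault(key, []).append(i)
    return [idx for idx in groups.values() if len(idx) > 1]


# Made-up sample rows, for demonstration only.
records = [
    Record("Gadus morhua", 43.5, -69.2, "1998-06-01"),
    Record("gadus morhua ", 43.50001, -69.2, "1998-06-01"),
    Record("Gadus morhua", 0.0, 0.0, "1998-06-02"),
    Record("", 95.0, 10.0, "1998-06-03"),
]
for i, rec in enumerate(records):
    print(i, integrity_problems(rec))
print("redundant groups:", redundant_groups(records))
```

Both functions only report candidates; nothing is deleted, so a reviewer can still decide what to do with each flagged record.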

9 DBSCAN: a clustering tool. DBSCAN is density-based in feature space. It deals with high-dimensional data. There is no need to specify the number of clusters. It identifies outliers during the clustering process. It is a fast algorithm and freely available. Reference: M. Ester, H.P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases.

10 A diagram of DBSCAN. [Figure: core, border and outlier points; Eps = 1 unit, MinPts = 5]
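The core, border and outlier roles shown in the diagram can be reproduced with a small pure-Python sketch of the DBSCAN idea. This is a brute-force illustration written for this transcript, not the authors' implementation: the function names and the sample points are assumptions, and the defaults mirror the diagram (a radius of 1 unit and MinPts = 5, counting the point itself as its own neighbour).

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps=1.0, min_pts=5):
    """Return (kinds, clusters) for a list of coordinate tuples.

    kinds[i] is 'core', 'border' or 'outlier';
    clusters[i] is a cluster id, or -1 for outliers.
    """
    n = len(points)
    neighbours = [region_query(points, i, eps) for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    clusters = [-1] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or clusters[i] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        clusters[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if clusters[q] == -1:
                    clusters[q] = cluster_id
                    if is_core[q]:
                        frontier.append(q)  # only core points expand a cluster
        cluster_id += 1
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif clusters[i] != -1:
            kinds.append("border")  # reachable from a core point, not dense itself
        else:
            kinds.append("outlier")
    return kinds, clusters


# Made-up points: a dense patch, one point on its edge, one far away.
points = [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4), (-0.4, 0.0), (0.0, -0.4),
          (1.3, 0.0), (5.0, 5.0)]
kinds, clusters = dbscan(points, eps=1.0, min_pts=5)
for p, kind, c in zip(points, kinds, clusters):
    print(p, kind, c)
```

The neighbour search here is quadratic in the number of points; implementations meant for large datasets normally use a spatial index instead.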

11 Total points distribution

12 Result from DBSCAN

13 Limitation of the method. Geographical outliers may be used to identify erroneous points in survey data, but may not be suitable for museum collections or literature-based data records. Are there other methods to identify erroneous distribution data? How about using environmental data as proxies?
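One way to read the environmental-proxy question is sketched below: take an environmental value attached to each record of a species and flag values that sit far from the rest. The transcript does not say which variables or statistics the authors used, so everything here is an assumption: the temperature values are made up, and the modified z-score (median and median absolute deviation, with the conventional 0.6745 scaling and a 3.5 cutoff) is simply one common robust rule.

```python
from statistics import median


def mad_outliers(values, k=3.5):
    """Indices of values whose modified z-score exceeds k.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a few extreme values do not distort the scale.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; this simple rule
        # cannot rank the rest, so nothing is flagged.
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > k]


# Hypothetical water temperatures (deg C) at the record locations of one species.
temperatures = [4.1, 4.3, 3.9, 4.0, 4.2, 4.4, 18.5]
for i in mad_outliers(temperatures):
    print("candidate record", i, "temperature", temperatures[i])
```

Flagged records are candidates for review only; removing them automatically would bake an environmental model into the data before any analysis is done.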

14 Can we get some more information?

15 Limitations of using environmental variables. Risk of imposing a rigid model at the time of pre-processing. Risk of losing valuable outliers. Risk of circular logic in later analyses.

16 Discussions. Why don’t you use more environmental variables? Can you use DBSCAN on environmental variables directly?

17 Possible improvements. Define multiple methods as DQ components. Assign bootstrap weights. Present outlier candidates to experts. Update weights based on user feedback.
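The four steps on this slide could be wired together roughly as in the sketch below. The slide gives no details, so this is a guess at the workflow rather than the authors' design: "bootstrap weights" is read here as equal starting weights, the method names and record ids are hypothetical, and the multiplicative update rule is an arbitrary choice for illustration.

```python
def rank_candidates(flags, weights):
    """Score each record by the weighted share of methods that flagged it.

    flags:   {method_name: set of record ids flagged by that method}
    weights: {method_name: non-negative weight}
    Returns [(record_id, score)] with the highest score first.
    """
    total = sum(weights.values())
    scores = {}
    for method, ids in flags.items():
        for rid in ids:
            scores[rid] = scores.get(rid, 0.0) + weights[method] / total
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def update_weights(flags, weights, rid, is_error, rate=0.5):
    """Return new weights after an expert verdict on record `rid`.

    Methods that flagged a confirmed error gain weight, methods that
    flagged a valid record lose weight, and the others are unchanged.
    """
    new = dict(weights)
    for method, ids in flags.items():
        if rid in ids:
            new[method] *= (1 + rate) if is_error else (1 - rate)
    return new


# Hypothetical DQ components and the record ids each one flagged.
flags = {"geographic_dbscan": {1, 2}, "environmental": {2, 3}, "duplicates": {4}}
weights = {name: 1.0 for name in flags}  # assumed starting ("bootstrap") weights

print(rank_candidates(flags, weights))   # candidates shown to the expert
weights = update_weights(flags, weights, rid=2, is_error=True)
weights = update_weights(flags, weights, rid=4, is_error=False)
print(rank_candidates(flags, weights))   # re-ranked after feedback
```

Keeping the expert verdict as the only input that changes the weights matches the point that expert knowledge should remain the decisive factor.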

18 Summary. Many data quality problems can arise throughout the data life cycle. Preliminary checking can eliminate many simple errors. Expert knowledge should be integrated and be the decisive factor in DQ solving. Data mining techniques may act as metal detectors, so that experts can focus on a narrowed-down group of candidates.

