CSS 420 Introduction to GIS

Data Quality and Error

Error
Whenever you work with spatial data (or any data, for that matter), you will deal with some sort of error due to the many steps involved in creating spatial data. Spatial data is just an abstraction of what is really there. Because of this abstraction, we can expect error due to:
How we conceptualize the data in the first place
How we collect the data
How we present the data
Additionally, there are other sources of error such as:
Obvious errors
Errors in natural variation
Errors in data processing

Data Quality
GIS IS A GARBAGE MAGNIFIER
GARBAGE IN / GARBAGE OUT
MOST FAILED GIS PROJECTS ARE DUE TO POOR PLANNING AND POOR DATA QUALITY

Obvious Error
The errors we just discussed are illustrative of the general types of obvious errors you will encounter when using geospatial information. As a geospatial analyst, you will have to give thought to how to correct those errors before proceeding with a project. You should also always approach a project with these obvious sources of error firmly in your mind. Therefore, when given a task to perform, and the associated data, the following should act as a good checklist:
Is the data current?
Were the data mapped at the correct scale? Do they have the same accuracies?
What is the resolution of the data? Will it support the kinds of analysis we want to perform?
Do we have all the data for the project areas, or is there some data missing?
If we need other data sets, are they available, or will we have trouble getting them?

Components of Data Quality
Positional Accuracy
Attribute Accuracy
Logical Consistency
Resolution
Completeness

Spatial Accuracy
As we previously stated, positional accuracy relates to the coordinate values for the geographic objects. But even positional accuracy is divided into two different categories:
Absolute accuracy: refers to the actual X,Y coordinates of a geographic object. If one knows the correct position of the geographic object, one can compare it with the position represented in the geographic database. Typically, absolute accuracy is measured as the total difference in position for an object, or as the difference in the X coordinate and the difference in the Y coordinate.
Relative accuracy: refers to the displacement of two or more points on a map (in both distance and angle), compared to the displacement of those same points in the real world.
The figures on the right show two different maps of the Cornell campus and the City of Ithaca. The top map, a USGS quadrangle, has an absolute accuracy of around 40 feet. That is, the coordinates for a building on the quadsheet are probably within 40 feet of their real world coordinates. The bottom map, a photogrammetrically derived map of the same area, has an absolute accuracy of about 2.5 feet.
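The absolute accuracy measure described above can be sketched in a few lines of Python. This is only an illustration: the function name `absolute_error` and the coordinates are hypothetical, not values taken from the Cornell maps.

```python
import math

def absolute_error(mapped_xy, true_xy):
    """Return (dx, dy, total) positional error between a mapped and a true point."""
    dx = mapped_xy[0] - true_xy[0]   # difference in the X coordinate
    dy = mapped_xy[1] - true_xy[1]   # difference in the Y coordinate
    total = math.hypot(dx, dy)       # straight-line (total) difference
    return dx, dy, total

# Hypothetical coordinates, in feet
surveyed = (1000.0, 2000.0)   # assumed "true" position
mapped = (1030.0, 2040.0)     # position stored in the geographic database

dx, dy, total = absolute_error(mapped, surveyed)
print(f"dx = {dx} ft, dy = {dy} ft, total = {total} ft")
```

Reporting dx and dy separately shows whether the error has a consistent direction, which the total distance alone hides.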

Relative Accuracy
Even though the USGS quadrangle has much less absolute accuracy than the photogrammetrically derived map, if we were to zoom into an area and measure the distance between two points, the relative distance and the angle would be fairly similar. In this case, the distance along Tower Road is only about 15 feet different, and the azimuth of the road is virtually identical.
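A rough sketch of a relative accuracy comparison follows. The point coordinates are made up for illustration (they are not the Tower Road measurements), and `distance_and_azimuth` is a hypothetical helper. The two maps disagree on where the points are, yet nearly agree on the distance and direction between them.

```python
import math

def distance_and_azimuth(p1, p2):
    """Distance and azimuth (degrees clockwise from north) from p1 to p2."""
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    distance = math.hypot(dx, dy)
    azimuth = math.degrees(math.atan2(dx, dy)) % 360.0
    return distance, azimuth

# Hypothetical endpoints of the same road segment on two maps, in feet
quad_a, quad_b = (5000.0, 3000.0), (6515.0, 3000.0)     # less accurate map
photo_a, photo_b = (4965.0, 3020.0), (6465.0, 3020.0)   # more accurate map

quad_dist, quad_az = distance_and_azimuth(quad_a, quad_b)
photo_dist, photo_az = distance_and_azimuth(photo_a, photo_b)

print(f"distance difference: {abs(quad_dist - photo_dist):.1f} ft")
print(f"azimuth difference:  {abs(quad_az - photo_az):.1f} degrees")
```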

Positional Accuracy

Attribute Accuracy

Logical Consistency
Representation of data that does not make sense
Road in the water
Contours that cross or end
Features on steep slopes

Resolution
Generalization may improperly represent size and shape
Cartographic Aesthetics
Entire regions may be eliminated (islands, peninsulas, etc.)

Completeness
Fragmented coverage of many developing countries
Soils
Vegetation
Must determine methods for uniformity

Obvious Errors
The statement “to err is human” is very applicable to creating spatial data. Humans make a lot of errors, and typing the wrong value into a computer is a common one. However, there are other sources of obvious error besides human error:
Age: A map is a representation of real-world objects at a given point in time. The reliability of a dataset typically goes down as it gets older. This is especially true of data that changes frequently, such as housing within a city. Many GIS projects take years to complete, and it is entirely possible that much of the data collected at the beginning of a project will be out of date by the end of the project.
Map Scale: In general, larger scale maps show more detail than smaller scale maps. Larger scale maps also tend to have greater accuracy than smaller scale maps, especially maps within the “same family”, such as the 1:250,000, 1:100,000 and 1:24,000 USGS maps. Computers and GIS software really don’t care what data you give them. A GIS will process any of your data, whether the processing is appropriate or not. You can therefore combine data from different scales rather easily, but doing so may not be a good idea because of the different accuracies of the products.
Data Format: The way we represent data also presents an obvious source of error. For example, a raster map of landuse represented by 10 meter grid cells will differ significantly from a raster map of landuse represented by 100 meter grid cells. The following is a grid of landuse values around Ithaca, New York. You can see the differences in representation between a map with 10 meter grid cells, 30 meter grid cells, and 100 meter grid cells.
Areal Coverage: Many data sets may not have uniform coverage. That is, there may be pieces missing in one section.
Accessibility: Not all data sets are equally accessible. For example, land resource data may be available in one country but considered a state secret in another. Also, due to the events of September 11, 2001, some data are unavailable for security reasons.

Problems with Age The following maps show the different land cover types between 1968 and You can see how the data has changed over 30 years, and why using older data might present a problem.

Obvious Sources of Error
Areal Coverage
Many data sets do not have a uniform coverage of information
Figure panels: NASSAU COUNTY BASEMAP, SUFFOLK COUNTY PARCELS

Problems with Format
You can see the different ways in which data is represented when using different formats. In this case, 10, 30, and 100 meter grid cells are used.
Question: How is the land cover different between the different formats?
A. There isn’t any noticeable difference
B. There is greater detail in the 100 meter grid cells
C. The 100 meter grid cells are more generalized
Figure panels: 10 meter, 30 meter, 100 meter

Errors Due to Natural Variation
You can see why each of the previous error types is called an obvious error. But there are other types of errors that are not so obvious, and are often overlooked. Nonetheless, you will have to be aware of these kinds of errors too. These errors are termed errors in natural variation, and take the form of:
Positional Errors Due to Natural Variation: There are natural variations in materials that might make them less accurate. For example, a paper map changes size with humidity, expanding as it absorbs moisture and shrinking as it dries. The change in the material is virtually unnoticeable to a user but, depending upon the scale of the map, the real-world errors could be quite large.
QUESTION: A small amount of map shrinkage will cause greater error in a:
Small scale map
Large scale map
Variations Due to Equipment: Some equipment may not measure information correctly, or may have slight variations from measurement to measurement. For example, a temperature gauge or pH meter may give slightly different readings when measuring the same location. If you’ve ever measured your blood pressure on one of the automatic machines in the drug store, you have probably noticed that two readings taken one after another can be different. While some of this is based on your own fluctuations in blood pressure, the machines themselves have some variability. The variations of measurements are often related to two important concepts called precision and accuracy.

Errors Resulting from Natural Variations from Original Measurements
Positional Accuracy
  Result of poor field work, media shrinkage and expansion, poor vectorization (line digitizing)
  Correction through rubbersheeting
Accuracy of Content
  Attribute errors caused by miscoding, or faulty equipment (thermometer, pH meter)
Sources of Variation in Data
  Data entry or output faults

Errors Resulting from Natural Variations from Original Measurements
Measurement Error
  Accuracy vs. Precision
    Accuracy: extent to which an estimated value approaches the true value
    Precision: measure of dispersion of observations about a mean
  Accuracy vs. Precision example
Laboratory Errors
  Results of World-wide Laboratory Exchange Program
  Results for the same soil samples analyzed in different laboratories varied by more than 11% for clay content

Accuracy and Precision
Accuracy is defined as the displacement of a plotted point from its true position in relation to an established standard, while precision is the degree of perfection, or repeatability, of a measurement. For mapping, accuracy is associated with the position of an object relative to its true position. Precision is then the ability to repeat a measurement, or how likely you are to return to the same location time and time again. The figures to the right illustrate the differences between accuracy and precision. Therefore, if there are natural variations in either the instruments used for measurement or the object you are measuring, the accuracy or precision may be affected.
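One way to make the distinction concrete is to summarize repeated readings with two numbers: the offset of their mean from the true value (accuracy) and their spread about the mean (precision). The sketch below assumes two hypothetical pH meters measuring a sample whose true pH is taken to be 7.0; all readings are invented for illustration.

```python
import statistics

def accuracy_and_precision(measurements, true_value):
    """Return (bias, spread): bias reflects accuracy, spread reflects precision."""
    bias = statistics.mean(measurements) - true_value   # closeness to the true value
    spread = statistics.stdev(measurements)             # dispersion about the mean
    return bias, spread

true_ph = 7.0                      # assumed true value
meter_a = [7.9, 8.0, 8.1, 8.0]     # tightly grouped, but offset: precise, not accurate
meter_b = [6.4, 7.6, 6.8, 7.2]     # centered on the truth, but scattered: accurate, not precise

bias_a, spread_a = accuracy_and_precision(meter_a, true_ph)
bias_b, spread_b = accuracy_and_precision(meter_b, true_ph)
print(f"meter A: bias {bias_a:+.2f}, spread {spread_a:.2f}")
print(f"meter B: bias {bias_b:+.2f}, spread {spread_b:.2f}")
```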

Errors Arising Through Processing
Numerical Errors in the Computer
  Numerical precision
  PC ARC/INFO is Single Precision
  Some GIS use integer values to store coordinates, and large areas may not be stored precisely
  Scaling a triangle
Faults Arising Through Topological Analysis
  Assumes:
    Source data is uniform
    Digitizing procedures are infallible
    Map overlay is only concerned with line intersection
    Boundaries can be sharply defined and drawn

Raster to Vector
GIS allows you to convert features between raster and vector formats. For example, we can take a raster feature and convert it to vector format. Or, we can take a vector feature and convert it to raster. But, as the examples show, depending upon the resolution of the features, the representation of the geographic objects may be quite different. In some cases, you can see how the raster version of the map actually caused some buildings to “merge” together.
Figure panels: Vector data of buildings; Vector data converted to raster with 10’ grid cells; Raster data converted back to vector, using 10’ grid cells

Errors in Data Processing
Digitizing Data: Once again, scale presents a problem with digitized data. On a soil map drawn at a scale of 1:100,000, a 1 mm wide line (the thickness of a sharp pencil) would actually represent 100 meters on the ground. Or, as shown in the example below, the road edge on the USGS quadrangle is actually 4 meters wide in some spots.
Spatial Analysis: Some GIS functions such as overlay present problems such as ambiguous locations and “sliver polygons”. Converting data from raster to vector format will also introduce errors. Each of these examples is shown in the illustrations below.
Figure caption: Width of edge of pavement is greater than 4 meters

Errors Associated with Spatial Analysis
Errors in Digitizing a Map
  Source errors
  Distortion
  Boundaries drawn on a map have a “thickness”
    A 1 mm line is 1.25 m wide on a 1:1250 map and 100 m wide on a 1:100000 map
    Estimates show that 10% of a 1:24000 soil map may represent the boundary lines alone
Digital Representation
  Curves are approximated by many vertices
  Boundaries are not absolute, but should have a confidence interval
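The ground width of a drawn line follows directly from the map scale: multiply the line width by the scale denominator. A minimal sketch, where `ground_width_m` is a hypothetical helper name:

```python
def ground_width_m(line_width_mm, scale_denominator):
    """Ground distance, in meters, covered by a line drawn on a 1:scale_denominator map."""
    return line_width_mm * scale_denominator / 1000.0   # mm on the ground -> meters

for denominator in (24_000, 100_000):
    width = ground_width_m(1.0, denominator)
    print(f"1 mm line on a 1:{denominator} map covers {width} m on the ground")
```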

Sliver Polygons
In the following example, there are two polygons. When we overlay the two of them, the resulting polygon has not only the logical intersection between the two polygons, but also many small polygons that arise because the polygon boundaries are represented slightly differently in each layer. These smaller polygons, or sliver polygons, represent spatial errors in the data.

Boundary Problems
Definitely in
Definitely out
Possibly in
Possibly out
Ambiguous (on the digitized border line)

Polygon Overlay and Boundary Intersection
McAlpine & Cook Study (1971)
  3 1: maps with 7, 42, and 101 polygons
  Overlay derived 304 polygons
  38% of polygons were less than 3.8 sq. km
Sliver Polygons
“Fuzzy” Creep
Error Zones
Rasterizing a Vector Map

“Union” of Dryden Land Use and Tompkins County Parcels
661 polygons sq. m Smallest Polygon sq. m Clipped Parcels 2831 polygons sq. m Smallest polygon 9.2 m 99.% above 500 sq. m Union Coverage 6335 polygons sq. m Smallest Polygon sq. m 13% of polygons less than 500 sq. m

Error Propagation Universal Soil Loss Equation
A = R * K * L * S * C * P
A  annual loss
R  measure of erosion  297 +/- 72
K  erodibility of soil  .1 +/- .05
L  slope  /- .045
S  slope percent  /- .122
C  cultivation parameter  .5 +/- .15
P  protection measures  .5 +/- .1
Testing showed that using multiplicative models with data that could not be accurately specified may not be appropriate
Better to utilize additive models because the error propagation is lower
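To see why a multiplicative model amplifies uncertainty, a first-order propagation rule can be applied: for a product of independent factors, the relative errors combine in quadrature. The sketch below is an illustration under stated assumptions, not the slide's own calculation. It treats each "+/-" value as one standard deviation, assumes the factors are independent, and uses only the four factors whose values are complete in the transcript (R, K, C, P), so the result understates the error of the full equation.

```python
import math

# (mean, standard deviation) for the factors with complete values on the slide
factors = {
    "R": (297.0, 72.0),
    "K": (0.1, 0.05),
    "C": (0.5, 0.15),
    "P": (0.5, 0.1),
}

def product_relative_error(factors):
    """First-order relative error of a product of independent factors."""
    return math.sqrt(sum((sd / mean) ** 2 for mean, sd in factors.values()))

for name, (mean, sd) in factors.items():
    print(f"{name}: relative error {sd / mean:.0%}")

rel = product_relative_error(factors)
print(f"product of these four factors: relative error about {rel:.0%}")
```

Under these assumptions the combined relative error is larger than that of any single factor, which is the behavior the slide warns about.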

Map Accuracy Assessment

Goal of Map Accuracy
The purpose of accuracy assessment is to allow a potential user to determine the map's "fitness for use" for their application
Spatial Accuracy
Thematic Accuracy
Topological Accuracy
Temporal Accuracy

What Kinds of Map Accuracy?
Don’t be surprised if an experienced geospatial analyst gives you a puzzled look when you say the data is very accurate. The reason for the puzzled look is that there are many different categories of map accuracy. The different categories are:
Spatial Accuracy: refers to the positional/coordinate accuracies within geospatial data. Maps created at different scales will have different levels of generalization, and subsequently different positional accuracies.
Thematic Accuracy: refers to the accuracies of the attributes that describe a geographic feature. Depending upon how information was collected, there can be misinterpretation of particular geographic objects, or errors in entering the data in the computer.
Topological Accuracy: refers to the geometric connectivity of the data. Poorly digitized data may include gaps, or unconnected line segments.
Temporal Accuracy: refers to how accurate the information is over a given period of time. Obviously, a map is only a snapshot of reality for the time in which the data was collected. Therefore, some assessment of how the geographic objects may change over time is important.

Criteria for Accuracy Assessment
Don’t be surprised if early in your career you hear something like: “we looked at the data, and it looks pretty good”, or “the map seems to be pretty accurate”. All too often, data is accepted for projects based on a purely subjective view of the data. There have been many occasions where I have personally worked on a project where the data was previously accepted because it “looked pretty good”. Unfortunately, I often had to tell the project manager that the data quality was insufficient to support the GIS application they wanted to perform. The amount of “bad” data that actually gets accepted by GIS organizations would be funny, if it weren’t so tragic. And in many cases, the taxpayers who originally funded the data collection effort are unaware that their hard-earned tax dollars were wasted in this way. There must be a better method for assessing accuracy than just saying “the data looks pretty good”. Any map accuracy method you perform should therefore meet the following criteria:
Scientifically sound: In other words, accuracy assessment should not be “voodoo” science. It should be based on sound methods of statistics, or other measurements. You want people to have the ability to “recreate” your method, just to validate the results.
Economically feasible: The cost for accuracy assessment should be only a fraction of the cost for collecting the data in the first place. But cost isn’t just money, it’s also time. An accuracy assessment should also be able to be performed in a relatively short period of time.
Nationally acceptable: You should try to adopt procedures that are already in place, and accepted. For example, many mapping organizations already have established guidelines for accuracy.
The criteria for the accuracy assessment should reflect the need to balance the requirements for rigor and defensibility with practical limitations of cost and time. For example, a method for accuracy assessment that winds up costing more than the original data collection itself is probably too extreme.

Thematic Accuracy
Geographic objects typically have attribute information associated with them. But a real question is whether the attributes are any good. The basic purpose of thematic map accuracy is to compare the attribute information related to the geographic data with the actual attribute information in the real world. For example, in the land cover map we are creating in this course, it is important to know if the areas that you reported as forest are actually forest, or another type of land cover. When performing thematic accuracy assessment, we should be focused on the:
Nature of the errors: in other words, what kinds of information are confused? Did we confuse agriculture with forest, or extractive with developed land?
Frequency of the errors: how often do the errors occur?
Magnitude of errors: how bad are the errors? For example, if we confuse old-growth with second-growth forest, perhaps that is not as bad as confusing water with forest.
Source of errors: we also want to understand why the error occurred. Perhaps there is something in our process we can change that would enable us to avoid the error in the future?
The remainder of this section will place thematic map accuracy assessment in the context of the work we have been conducting in this course. Therefore, our frame of reference will be the land cover map we’ve been producing. However, the concepts are easily transferable to other kinds of data.

Criteria for Assessment
Scientific and programmatic criteria for the assessment: The criteria for the accuracy assessment reflect the need to balance the requirements for rigor and defensibility with practical limitations of cost and time. In general, the assessment methods must be:
scientifically sound
economically feasible
nationally applicable
coordinated and consistent with other federal efforts

Thematic Accuracy
The basic idea is to compare the predicted classification (supervised or unsupervised) of each pixel with the actual classification as discovered by ground truth.
Four kinds of accuracy information:
Nature of the errors: what kinds of information are confused?
Frequency of the errors: how often do they occur?
Magnitude of errors: how bad are they? E.g., confusing old-growth with second-growth forest is not as 'bad' an error as confusing water with forest.
Source of errors: why did the error occur?

Simple Confusion Matrix
                      Ground Classification
Map Classification      A     B     C
A                      10     2     3
B                       0    20     0
C                       4     1    10
Overall map accuracy = total on diagonal / grand total

Map Accuracy Terms
Errors of omission = incorrect in column / total in column. Map producer's accuracy is the complement (1 - error of omission). Measures how well the map maker was able to represent the ground features.
Errors of commission = incorrect in row / total in row. Map user's accuracy is the complement (1 - error of commission). Measures how likely the map user is to encounter correct information while using the map.

Determining Map Accuracy
A statistical test of the classification accuracy for the whole map or individual cells is possible using the kappa index of agreement. This is like a chi-square (χ²) test except that it accounts for chance agreement.

Accuracy Assessment
Overall Accuracy: total number of correctly classified elements divided by the total number of reference elements
Accuracies of Individual Categories
  Producer’s Accuracy: number of correctly classified elements divided by the reference elements for that category (omission)
  User’s Accuracy: number of correctly classified elements in each category divided by the total elements that were classified in that category (commission)

Simple Confusion Matrix
                      Ground Classification
Map Classification      A     B     C
A                      10     2     3
B                       0    20     0
C                       4     1    10
Overall map accuracy = total on diagonal / grand total

Results of Example
Overall accuracy: (10+20+10)/50 = 40/50 = 80%
Error of commission for class A: (2+3)/(10+2+3) = 5/15 = 33% error
Error of omission for class A: (0+4)/(10+0+4) = 4/14 = 29% error
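The worked results above can be reproduced with a short script. The 3 x 3 matrix below is an assumption: the zero cells and the diagonal value for class C do not appear in the transcript and are inferred from the worked results (a total of 50, a diagonal sum of 40, and the row and column sums used for class A). Rows are map classes and columns are ground classes.

```python
classes = ["A", "B", "C"]
# Rows = map classification, columns = ground classification.
# Assumed cell values, inferred from the worked results (see lead-in).
matrix = [
    [10, 2, 3],
    [0, 20, 0],
    [4, 1, 10],
]

n = len(classes)
grand_total = sum(sum(row) for row in matrix)
diagonal = sum(matrix[i][i] for i in range(n))
overall_accuracy = diagonal / grand_total

def commission_error(i):
    """Incorrect in row i / total in row i (complement of user's accuracy)."""
    row_total = sum(matrix[i])
    return (row_total - matrix[i][i]) / row_total

def omission_error(j):
    """Incorrect in column j / total in column j (complement of producer's accuracy)."""
    col_total = sum(matrix[r][j] for r in range(n))
    return (col_total - matrix[j][j]) / col_total

print(f"overall accuracy: {overall_accuracy:.0%}")
for k, name in enumerate(classes):
    print(f"class {name}: commission {commission_error(k):.0%}, omission {omission_error(k):.0%}")
```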

Accuracy Example

What is Cohen’s Kappa?
A measure of agreement that compares the observed agreement to the agreement expected by chance if the observer ratings were independent
Expresses the proportionate reduction in error generated by a classification process, compared with the error of a completely random classification
For perfect agreement, kappa = 1
A value of .82 would imply that the classification process was avoiding 82% of the errors that a completely random classification would generate

kappa is 1 for perfectly accurate data (all N cases on the diagonal), zero for accuracy no better than chance
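As a sketch of the calculation, kappa can be computed as (observed agreement - chance agreement) / (1 - chance agreement), where chance agreement comes from the row and column totals. The matrix is the simple three-class example used earlier, with its zero cells and class C diagonal assumed rather than taken from the transcript.

```python
# Rows = map classification, columns = ground classification (assumed cell values)
matrix = [
    [10, 2, 3],
    [0, 20, 0],
    [4, 1, 10],
]

def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix of counts."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Agreement expected if map and ground labels were independent
    expected = sum(row_totals[i] * col_totals[i] for i in range(n)) / total ** 2
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa(matrix)
print(f"kappa = {kappa:.2f}")
```

Kappa comes out lower than the 80% overall accuracy because part of the agreement on the diagonal would be expected by chance alone.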

Accuracy Calculations for Onscreen Digitizing Method
Rows: Map Data. Columns: Field Data.
Field Data: Agr Dev Ext For Wet Total User’s Accuracy
Agriculture 49 1 6 57 86%
Developed 2 11 8 21 52%
Extractive 100%
Forest 12 7 87 5 112 78%
Wetland 4 80%
Total 63 19 102 10 201
Producer’s Accuracy 58% 85% 40%
Overall accuracy is 78%, kappa = 0.64
Overall accuracy for digitizing tablet method is 76%, kappa = 0.59

Accuracy Examples
Example error matrix derived from New Mexico GAP project (Thompson et al. 1996). Rows represent mapped cover types and columns represent observed cover type. Sample size is indicated in the last column. Mapped types have been aggregated for simplicity.
Columns: 1000 2000 3000 4000 5000 6000 9000 n
1000 Tundra 44.9 26.53 2.04 6.12 20.41 49
2000 Forest 1.12 73.18 5.03 9.5 7.82 3.35 179
3000 Woodland 3.17 16.93 31.75 13.76 21.69 2.12 10.58 189
4000 Shrubland 4.86 9.93 31.35 43.49 0.22 10.15 453
5000 Grassland 4.48 11.3 32.2 42.64 0.43 8.96 469
6000 Riparian 13.68 7.37 21.05 17.89 32.63 95
9000 Other 1.82 14.29 15.2 5.17 65.53 329
Total 1763

Accuracy Examples
Table 4. Accuracy by cover type for the example presented in Table 3.
Cover type | Number polygons | Number sampled | Accuracy | Standard error
1000 Tundra | 124 | 49 | 44.90% | 5.53%
2000 Forest | 3922 | 179 | 73.18% | 3.23%
3000 Woodland | 5456 | 189 | 31.75% | 3.33%
4000 Shrubland | 6077 | 453 | 31.35% | 2.10%
5000 Grassland | 12693 | 469 | 42.64% | 2.24%
6000 Riparian | 292 | 95 | 17.89% |
9000 Other | 1836 | 329 | 63.53% | 2.40%

Fuzzy Accuracy Assessment
There is a fundamental problem with the confusion matrix: the ground data may not be just 'correct' but 'somewhat correct'... a problem of classification
(1) absolutely wrong
(2) understandable but wrong
(3) reasonable, acceptable but there are better answers
(4) good answer
(5) absolutely right

Fuzzy Accuracy Assessment
Confusion matrix is expanded to answer two more precise questions: How frequently is the map category the best possible choice? How frequently is the map category acceptable?
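A simplified sketch of the two questions follows. It assumes each sample site has been given one rating, on the 1 to 5 scale listed earlier, for the category shown on the map. Treating a rating of 5 as "best possible choice" and a rating of 3 or higher as "acceptable" is an assumption made here for illustration, and the ratings themselves are invented.

```python
# Hypothetical ratings (1 = absolutely wrong ... 5 = absolutely right),
# one per sample site, for the category shown on the map
ratings = [5, 4, 3, 2, 5, 1, 4, 5, 3, 2]

BEST_RATING = 5         # assumed meaning of "best possible choice"
ACCEPTABLE_RATING = 3   # assumed threshold for "acceptable"

best_fraction = sum(r == BEST_RATING for r in ratings) / len(ratings)
acceptable_fraction = sum(r >= ACCEPTABLE_RATING for r in ratings) / len(ratings)

print(f"map category is the best choice at {best_fraction:.0%} of sites")
print(f"map category is acceptable at {acceptable_fraction:.0%} of sites")
```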

Errors in Data Processing
If the standard phrase garbage in / garbage out applies to computer processing, then GIS can be thought of as a garbage magnifier. That is, if you put a little garbage in, you will get a lot of garbage out! Some of the data processing errors you might expect to find include:
Numerical errors in the computer: Different GIS products may store their data with different precision. Performing mathematical operations in single precision format, versus double precision or integer format, will create different results, as shown in the following example.
Digitizing data: When digitizing a curve, a user can place many vertices to approximate the curve, or only a few vertices, as shown below.
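The effect of numerical precision on stored coordinates can be sketched with the Python standard library. The northing value is hypothetical, and `to_single` is a helper defined here to round-trip a number through a 4-byte single precision float. Python's own floats are double precision.

```python
import struct

def to_single(value):
    """Round-trip a number through a 4-byte single precision float."""
    return struct.unpack("f", struct.pack("f", value))[0]

northing = 4697123.456          # hypothetical UTM northing in meters (double precision)
stored = to_single(northing)    # the same coordinate as single precision would hold it

print(f"double precision: {northing}")
print(f"single precision: {stored}")
print(f"difference:       {abs(stored - northing):.3f} m")
```

At this magnitude single precision can only represent values in steps of 0.5 m, so millimeter-level detail in the coordinate is lost on storage.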
