Presentation on theme: "Data, pre-processing and exploration"— Presentation transcript:
1 Chapter Three: Data, Pre-processing and Exploration
2 Chapter Overview
- Data, data types and operations
- Properties of various data sets
- Data source and data warehouse
- Issues of data quality
- Data pre-processing operations
- Data summary and visualisation
- Online analytic processing (OLAP)
- Data exploration and visualisation in Weka
3 Data, Data Types and Operations
Data objects and attributes:
- Data object or instance: an individual, independent recording of a real-life object or event, characterised by its recorded values on a fixed set of features or attributes
- Feature or attribute: a specific property or characteristic of the data object
- Measurement: assigning a valid value to an attribute according to an appropriate measurement scale
- Collection: gathering measurement results or recorded values
4 Data, Data Types and Operations
Data objects and attributes (cont'd)
An example record: 123, "John Smith", "03/02/1990", 20, "male", 1.82, 78
- ID number: collected
- Name: collected
- Birthday: collected
- Age: calculated
- Gender: collected
- Body height: measured
- Body weight: measured
5 Data, Data Types and Operations
Data objects and attributes (cont'd)
Measurement and measurement errors:
- Precision: the closeness of repeated measurements to one another, represented by their standard deviation, e.g. repeated measurement of body temperature
- Bias: a systematic deviation of measurements from the quantity intended to be measured, only detectable when an external reference is available, e.g. bias in a weighing instrument
- Accuracy: the closeness of a measurement to the true value, indicated by the number of significant digits used, e.g. measuring money in pounds vs. pence
Collection errors:
- Incorrect data recording at the point of entry, e.g. "Hongpo Do" entered instead of "Hongbo Du"
6 Data, Data Types and Operations
Attribute domain types and operations
Categorical/qualitative types:
- Nominal, e.g. Gender (M, F): a set of names with no concept of order or difference. Operators applicable: =, ≠. Any 1:1 transformation is permissible, e.g. remapping ID 11 to e901
- Ordinal, e.g. Grade (A, B, C, D, E): a set of names with order but no concept of difference. Operators applicable: =, ≠, <, >, ≤, ≥. Any order-preserving transformation is permitted, e.g. Grade: A → First, B → Second, C → Third, D → Pass, E → Bare Pass
7 Data, Data Types and Operations
Attribute domain types and operations
Numeric/quantitative types:
- Interval, e.g. Temperature in °C: a set of numeric values where both order and difference exist. Operators applicable: =, ≠, <, >, ≤, ≥, +, −. Examples: temperature (°F and °C), calendar year. Transformations of the form new = a × old + b are permitted, e.g. °F → °C
- Ratio, e.g. Length: a set of numeric values with order, difference and ratio; the set has an absolute zero. Operators applicable: =, ≠, <, >, ≤, ≥, +, −, ×, ÷. Transformations of the form new = a × old are permitted, e.g. metres → feet
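The two permitted transformation forms can be illustrated with a short sketch (function names are my own, chosen for illustration):

```python
def f_to_c(f):
    # Interval-scale transformation: new = a*old + b, with a = 5/9, b = -160/9
    return (f - 32) * 5 / 9

def metres_to_feet(m):
    # Ratio-scale transformation: new = a*old only, so zero must map to zero
    return m * 3.28084

print(f_to_c(212))        # boiling point of water in Celsius
print(metres_to_feet(0))  # the absolute zero is preserved
```

Note that an interval scale allows a shift (°F and °C place zero differently), while a ratio scale does not: 0 metres must remain 0 feet.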
8 Data Sets
Various forms:
- Table of records: relational table, join of relational tables, numerical spreadsheet (data matrix), Boolean strings (document-term matrix)
- Ordered data: time series and temporal sequences, data sequences, spatial data
- Graph-based data
- Non-record-based data
9 Data Sets
Various forms (illustrated on the slide): a data matrix, a relational table, a transaction database, a web structure of pages and links, spatial data, and a data sequence (e.g. the DNA string GGTTCCGCCTTCAGCC CCGCGCCCGCAGGG…)
10 Data Sets
Properties:
- Type: file structure, e.g. ARFF for Weka, DAT for See5
- Size: measured as the total number of records or total number of bytes, e.g. small (MB), medium (GB) and large (TB)
- Dimensionality: number of attributes
- Sparsity: values skewed towards some extremes or sub-ranges; asymmetric values (some are more important than others)
- Resolution: the right level of data detail, related to the intended purpose
11 Data Sets
Properties (example insurance data set):
- Type: ARFF
- Dimensionality: 7
- Asymmetric: Y/N
- Skewed?
- Resolution: detailed
- Size: … records
12 Data Source and Data Warehouse
Sources of data:
- Local data sources: local operational systems from different departments
- Third-party external data sources
- Enterprise/organisational data warehouse
A data warehouse is an organisational database for decision making:
- A central data repository separate from operational systems
- Enforces organisation-wide data consistency and integration
- Provides data details as well as data summarisation
- Provides data values as well as meta-data
- Equipped with data analysis and reporting tools
- Serves as a data source for data mining
13 Data Source and Data Warehouse
Star schema for a data warehouse:
- Central fact table
- Dimension tables
- Limited use of join operations
Example: fact table Supply(s#, p#, pj#, qty) with dimension tables Part(p#, pname, weight, colour), Supplier(s#, sname, city, status) and Project(pj#, jname, status, date)
14 Issues of Data Quality
Main quality indicators:
- Accuracy: data recorded with sufficient precision and little bias
- Correctness: data recorded without errors or spurious objects
- Completeness: whether any parts of the data records are missing
- Consistency: compliance with established rules and constraints
- Redundancy: unnecessary duplicates
Use the indicators to quantify the quality of a data set, and improve quality where possible.
15 Issues of Data Quality
Some examples:
- Accuracy and correctness: the road accident reports in Exercise 1.3(c)
- Completeness: the UK family expenditure surveys in Exercise 1.3(a); incompleteness introduced by data integration using outer join operations
- Consistency in questionnaires, e.g. on eating fruit and vegetables. Q1: "give the fruit & veg portions consumed yesterday": 2. Q2: "give the fruit & veg portions consumed today": 3. Q3: "do you eat more today than yesterday?": No (the answers contradict each other)
- Redundancy: a local company's database of 40,000 records about 15,000 client companies
16 Issues of Data Quality
Why is quality important? "Garbage in, garbage out!"
- Total data quality control requires a cultural change (compare with total product quality control)
- For data mining, tackling quality issues at the data source cannot always be expected; instead, clean the data as much as possible and develop more tolerant mining solutions
- Data quality is relevant to the intended purpose of data mining, e.g. do spelling errors in student names really matter when only the increase/decrease of student numbers in particular subject areas over the years is of interest?
17 Data Pre-processing
Overview
Purpose: speedy, cost-effective and high-quality outcomes of data mining
Pre-processing tasks (not all independent of each other):
- Data aggregation
- Data sampling
- Dimension reduction
- Feature selection
- Feature creation
- Discretisation/binarisation
- Variable transformation
- Dealing with missing values
18 Data Pre-processing
Data aggregation
- What: summarising low-level data details into higher-level abstractions
- Why: to reduce mining time, to rescale data values, and to discover more stable patterns
- How: by generalisation using a given concept hierarchy; by applying aggregate functions (e.g. count, sum, average); by dropping some attributes
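A minimal sketch of aggregation by applying an aggregate function, assuming day-level transaction records (the data and field names are made up for illustration):

```python
from collections import defaultdict

# Daily transaction details: (date, branch, amount)
transactions = [
    ("1999-03-01", "Milton Keynes", 120),
    ("1999-03-15", "Milton Keynes", 80),
    ("1999-04-02", "Buckingham", 200),
]

# Aggregate: sum amounts up from the day level to the (month, branch) level
monthly = defaultdict(int)
for date, branch, amount in transactions:
    monthly[(date[:7], branch)] += amount

print(dict(monthly))
# {('1999-03', 'Milton Keynes'): 200, ('1999-04', 'Buckingham'): 200}
```

Truncating the date string walks one step up a concept hierarchy (day → month); summing is the aggregate function applied along the way.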
19 Data Pre-processing
Data sampling
- What: selecting a subset of the given data set
- Why: to make it possible to use sophisticated mining algorithms within a time limit
- Caution: the sample must be representative of the original data set
- How: random sampling; stratified sampling; progressive sampling; with or without replacement
(Diagram: data population → sampling method → selected subset)
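Stratified sampling can be sketched as follows (a hypothetical helper, assuming records carry a class label); unlike plain random sampling, it guarantees each stratum keeps its share:

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=42):
    """Sample (without replacement) the same fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

data = [("yes", i) for i in range(90)] + [("no", i) for i in range(10)]
sample = stratified_sample(data, key=lambda r: r[0], fraction=0.1)
# The rare "no" class keeps its 10% share: 9 "yes" records and 1 "no" record
```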
20 Data Pre-processing
Feature selection
- What: reducing dimensionality by selecting a subset of attributes
- Purposes: to remove/reduce redundant features; to remove irrelevant features that carry no useful information for the mining task
- How: manually, with common sense and domain knowledge; by letting the mining solution select suitable features (the embedded approach); by filter and wrapper approaches
(Diagram: attributes → subset selection → subset evaluation → stopping criterion; if not OK, loop back to subset selection; if OK, validate the selected subset with the mining task)
21 Data Pre-processing
Data dimension reduction
- What: reducing the redundancy implied among attributes, e.g. are all 9600 dimensions of a 120x80 pixel image necessary?
- Curse of dimensionality: as dimensionality increases, data become more diverse, and any patterns become less significant and more peculiar; the processing time may increase substantially
- Why: to reduce redundancy and the effects of the curse
- How: linear algebra techniques such as principal component analysis (PCA), independent component analysis (ICA) and singular value decomposition (SVD); feature selection (as described before)
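A minimal PCA sketch via SVD, assuming NumPy is available (the synthetic data and function name are for illustration only):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                        # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # n x k reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.5], [0.5, 0.2]])
Z = pca(X, 1)   # two correlated features reduced to one component
```

The first component is the direction of maximum variance, so the single retained column captures at least as much variance as any original attribute on its own.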
22 Data Pre-processing
Feature creation
- What: creating a new set of features from the original features
- Purpose: in the new feature space, meaningful and relevant patterns can be extracted more easily, and the number of features may be reduced
- How: using feature extraction methods to derive new features from existing ones, e.g. extracting colour, texture and shape from an image of pixel values; mapping data to a new space, e.g. a wavelet transformation of image pixel values into a frequency domain; constructing new features from existing ones using domain knowledge, e.g. using transaction dates to construct a customer tenure feature that indicates the customer's loyalty to the company
23 Data Pre-processing
Data discretisation
- What: converting continuous attribute values to discrete categorical values
- Purposes: some data mining solutions require it; it can give better mining results (though not always)
- How: decide how many categories to have and where the split points should be, then map values to categories
(Diagram: determine the number and locations of the split points t1, t2, t3, t4, then map the values within each sub-range to a category label)
24 Data Pre-processing
Data discretisation (cont'd)
Discretisation methods:
- Unsupervised: without reference to the class attribute; normally used for clustering and association rule mining, e.g. equal width, equal depth, clustering
- Supervised: with respect to the outcome of the class attribute; normally used for classification. Simple methods sort the records by the class attribute and then discretise the attribute values within each class. Sophisticated methods choose split points so that each interval purifies the class outcome, e.g. using entropy to measure the degree of purity and deciding the split points recursively, similar to decision tree induction
- Merging methods: merging small intervals into larger ones until a stopping criterion is met
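The two unsupervised methods above can be sketched as follows (helper names are my own); note how a single outlier distorts the equal-width split points but not the equal-depth ones:

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Put (roughly) the same number of values into each of k bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

print(equal_width_bins(list(range(10)), 2))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(equal_depth_bins([1, 100, 2, 3, 200], 2))  # [0, 1, 0, 0, 1]
```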
25 Data Pre-processing
Data binarisation
- What: converting discrete categorical values to binary Boolean attribute values
- Purpose: the same as for discretisation
- How: convert the m categorical values to integers in [0, m−1] and encode each as a binary number of n bits, where n = ⌈log2 m⌉; or use m asymmetric binary variables, one to represent each of the m values
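Both encodings can be sketched briefly (function names are my own, chosen for illustration):

```python
import math

def binary_encode(values):
    """Encode m categories as ceil(log2 m)-bit binary strings."""
    cats = sorted(set(values))
    n = max(1, math.ceil(math.log2(len(cats))))
    index = {c: i for i, c in enumerate(cats)}
    return [format(index[v], f"0{n}b") for v in values]

def one_hot(value, cats):
    """One asymmetric binary variable per category."""
    return [1 if value == c else 0 for c in cats]

print(binary_encode(["A", "B", "C", "D"]))  # ['00', '01', '10', '11']
print(one_hot("B", ["A", "B", "C", "D"]))   # [0, 1, 0, 0]
```

The bit encoding is compact but makes unrelated categories share bits; the one-hot form avoids that at the cost of m variables.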
26 Data Pre-processing
Variable transformation
- What: transforming all values of an attribute to other values
- Purposes: to remove the effect of outlier values; to make the resulting data visualisation more interpretable; to make the values more comparable
- How: transformation using a function, e.g. log(x); standardisation/normalisation, e.g. division by range
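Both kinds of transformation can be sketched as follows (helper names are my own):

```python
import math

def min_max(values):
    """Division-by-range normalisation: rescale values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Compress a wide dynamic range, damping the effect of large values."""
    return [math.log10(v) for v in values]

print(min_max([2, 4, 6]))           # [0.0, 0.5, 1.0]
print(log_transform([1, 10, 100]))  # [0.0, 1.0, 2.0]
```

After min-max normalisation, attributes measured on very different scales (say, age and income) become directly comparable.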
27 Data Pre-processing
Handling missing values
- What: treating attributes with null values
- Purposes: to improve data quality; to get better mining results
- How: elimination (may not always be possible); using a sensible default, e.g. setting Spending Amount to 0; data imputation with the average, median or mode of the whole data population, or of the nearest neighbours; postponing the handling and making the mining methods adaptive to missing values
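Imputation from the whole population can be sketched as (the helper name is my own; None stands for a missing value):

```python
from statistics import mean, median

def impute(values, how=mean):
    """Replace None with the mean (or median, or mode) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = how(observed)
    return [fill if v is None else v for v in values]

print(impute([1, None, 3]))                   # [1, 2, 3]
print(impute([1, None, 3, 100], how=median))  # median resists the outlier 100
```

Imputing from nearest neighbours works the same way, except that `observed` is restricted to the records most similar to the one with the gap.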
28 Data Exploration
Exploring data before mining
Knowing the data is essential for successful data mining.
Purposes:
- Better understanding of the characteristics of the data
- Better decisions over data pre-processing tasks
- Possibly even discovering some hidden patterns
Categories of data exploration techniques:
- Summary statistics: using a small set of descriptors to describe the characteristics of a large data set
- Data visualisation: using graphical or tabular forms to reveal hidden data patterns
- Online analytic processing (OLAP)
Data exploration is also known as exploratory data analysis (EDA).
29 Data Exploration
Summary statistics
- Frequency and mode for categorical attributes: the frequency of each value; the mode is the most frequently occurring value
- Percentiles for ordinal or continuous attributes: given an attribute x and an integer p (0 ≤ p ≤ 100), the percentile xp is a value of x such that p% of the observed values of x are less than xp
- Mean and median for continuous attributes: the median is a better indication of the "average" when the data distribution is skewed or outliers are present; trimmed mean and median (after trimming the top and bottom p% of values)
30 Data Exploration
Summary statistics (cont'd)
Measures of spread:
- Range
- Variance (σ²)
- Standard deviation (σ)
- Absolute average deviation (AAD)
Multivariate summary statistics:
- Mean vector
- Covariance matrix
- Correlation
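A few of these descriptors, including the trimmed mean, can be sketched with the standard library (the helper name is my own):

```python
from statistics import mean, median, pvariance, pstdev

def trimmed_mean(values, p):
    """Mean after discarding the bottom and top p% of the sorted values."""
    k = int(len(values) * p / 100)
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k] if k else ordered)

data = [1, 2, 3, 4, 100]
print(mean(data), median(data))  # the outlier 100 drags the mean up to 22
print(trimmed_mean(data, 20))    # trimming 20% restores a sensible centre: 3
print(pvariance(data), pstdev(data))
```

Comparing the mean (22) with the median (3) on this tiny set shows exactly why the slide recommends the median for skewed or outlier-laden data.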
31 Data Exploration
Data visualisation
Rationale: human eyes are good at spotting patterns, particularly visual patterns.
Major ways of visualising data:
- Tabular form
- Graphical form
- Points and links
Principles:
- The visual representation must be related to the data types of the attributes
- Visualise the data as well as all its implicit relationships
- The visualisation must be comprehensible
- The visualisation of data must tell the truth
32 Data Exploration
Data visualisation techniques: pie chart, bar chart, stem & leaf plot, scatter plot, parallel dimension chart, star dimension chart
33 Data Exploration
Online analytic processing (OLAP)
- An interactive reporting tool
- Treats a data set as a multidimensional hypercube
- Fast operation and fast result delivery
A typical OLAP query: "For each product, find its market share in its category today minus its market share in its category in 1994."
(The result table of the query is shown on the slide.)
34 Data Exploration
OLAP: multidimensional hypercube
(Diagram: a cube with dimensions month (Jan–Dec), branch (Buckingham, Milton Keynes, Northampton) and year (1998–2000); the highlighted cell (March, Milton Keynes, 1999) holds the measure Total Customers = 5 together with the customer names.)
35 Data Exploration
OLAP: hierarchies
(Diagram: over the same cube of branches and years, the month dimension rolls up to seasons: January–March to winter, April–June to spring, July–September to summer, and October–December to autumn.)
36 Data Exploration
OLAP: operations
- Pivoting: selecting attributes to define the cube; visually rotating the cube to show a face
- Slicing and dicing: selecting a part of a cube; visually slicing a segment of a cube along a dimension
- Rolling up: moving up along a hierarchy
- Drilling down: moving down along a hierarchy
- Aggregate functions are performed while rolling up or drilling down
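Rolling up along the month-to-season hierarchy from the previous slide can be sketched as (the cell data and helper name are made up for illustration):

```python
from collections import defaultdict

# Month -> season hierarchy, as on the hierarchies slide
SEASON = {m: s for s, months in {
    "winter": ("Jan", "Feb", "Mar"), "spring": ("Apr", "May", "Jun"),
    "summer": ("Jul", "Aug", "Sep"), "autumn": ("Oct", "Nov", "Dec"),
}.items() for m in months}

def roll_up(cells):
    """Roll (month, branch, qty) cells up to (season, branch), summing qty."""
    out = defaultdict(int)
    for month, branch, qty in cells:
        out[(SEASON[month], branch)] += qty
    return dict(out)

cells = [("Jan", "Buckingham", 2), ("Feb", "Buckingham", 3), ("Apr", "Buckingham", 1)]
print(roll_up(cells))  # {('winter', 'Buckingham'): 5, ('spring', 'Buckingham'): 1}
```

Drilling down is the inverse operation: it returns from the coarse (season, branch) cells to the finer month-level cells that were summed.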
37 Data Exploration in Weka Explorer
ARFF file format:
- Schema section: data set name; numeric attribute names and types; categorical attribute names and values
- Data section: one data record per line; values separated by ","; "?" represents an unknown value
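A minimal ARFF file illustrating the two sections (attribute names are in the spirit of Weka's bundled weather data set):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, ?, yes
```

The `@relation`/`@attribute` lines form the schema section; everything after `@data` is the data section, with `?` marking the unknown humidity in the last record.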
38 Data Exploration in Weka Explorer
A glance at an opened data set (screenshot on the slide): summary statistics and a visualisation of the value distribution
39 Data Exploration in Weka Explorer
Visualisation in Weka (limited; screenshot on the slide)
40 Data Exploration in Weka Explorer
Filters for pre-processing:
- Many filters, divided into supervised/unsupervised and attribute/instance filters
- Use Choose to select a filter, followed by parameter setting in the command line
41 Chapter Summary
- The domain types determine the validity of the operations applied
- Transformation from one domain to another must preserve the domain characteristics
- Data sets can be of various forms and come from different sources
- A data warehouse serves as a data source for data mining
- Data quality is relevant to the intended application purpose
- Data pre-processing operations are essential for good mining
- Knowing the data is important for good data mining; understanding of data is achieved via exploring, summarising and visualising the data
- OLAP serves as a data exploration and summarisation tool
42 References
Read Chapter 3 of Data Mining Techniques and Applications.
Useful further reference: Tan, P.-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapters 2 and 3.