Chapter Overview
- Roles of data, information and knowledge
- Background of data mining
- What is data mining?
- Main data mining objectives
- Data mining and other related disciplines
- Current state of data mining
- Promises and challenges
- A brief preview of the data mining tool Weka

Data, Information and Knowledge
- Data (D)
  - Isolated factual recordings of separate objects and events
  - Enables the recording of seen events
- Information (I)
  - Facts in a meaningful context, represented by relationships between isolated data items
  - Enables responding to seen events
- Knowledge (K)
  - Verified information that has been accommodated into the business process
  - Enables the anticipation of unseen events

Data Mining: The Background
Computerisation of operations in commercial, governmental and scientific organisations has resulted in large volumes of operational data, e.g.
- Itemised telephone bills
- Bank statements
- Supermarket transactions
- Share prices
- Scientific experimental data sets
- Published web pages
- CCTV video footage
- ……

Data Mining: The Background
- Facts:
  - Storing the data is an operational necessity
  - Storing the data has become easy and affordable
  - Data acquisition is fully or partially automatic and fast
- Consequences:
  - The speed of data comprehension does not match the speed of data acquisition
  - Many commercial database management systems (DBMSs) are not equipped with data comprehension and analysis tools
  - We may be data rich, but information poor

Data Mining: The Background
An intriguing quotable quote:
“I know half the money I spend on advertising is wasted, but I can never find out which half!”
Lord Leverhulme, President of Unilever

Data Mining: What It Is
Knowledge discovery in databases (KDD) refers to the efficient process of searching through large volumes of raw data in databases to find potentially useful information that is implicitly embedded in the data. Data mining is an integral step of KDD that discovers hidden patterns from an input data set.
- Useful information: leading to a course of action or an understanding of the data
- Non-trivial implicit information: not the raw data, nor the result of a simple data summary
- Real-life databases: not laboratory-generated data sets
- Efficient novel discovery methods: expected to be scaled up and applied to large databases

Data Mining: Useful Information
- Example 1 (a well-known example, not a joke):
  - Customers who purchase beer are also likely (say 90%) to purchase nappies.
- Example 2 (may already be in practical use in credit card applications):
  - If Customer's Salary is between 20,000 and 40,000 pounds and Customer has a house, then Customer is a safe customer.

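
Both examples can be checked mechanically once data is available. The sketch below is illustrative only: the baskets are invented, `support` and `confidence` are the usual association-rule measures (terms assumed here, not taken from the slide), and the inclusive salary bounds in `is_safe_customer` are an assumption. The "90%" in Example 1 plays the role of the confidence value.

```python
# Toy baskets invented for illustration; they are not real sales data.
transactions = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"beer", "bread"},
    {"milk", "nappies"},
    {"beer", "nappies", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def is_safe_customer(salary, has_house):
    """Example 2 as a predicate; inclusive bounds are an assumption."""
    return 20_000 <= salary <= 40_000 and has_house


rule_support = support({"beer", "nappies"}, transactions)
rule_confidence = confidence({"beer"}, {"nappies"}, transactions)
print(f"beer -> nappies: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
print("safe customer?", is_safe_customer(30_000, True))
```

A rule is usually considered interesting only when both measures are high enough: confidence alone can be misleading if very few baskets contain the antecedent.
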
Data Mining: Non-trivial Information
Putting the “search for information” into a spectrum, from the low end of sophistication to the high end:
- Data retrieval (low end)
  - Retrieval of stored data
  - Trivial data aggregation
  - Written in standard SQL
- Online analytical processing (OLAP)
  - Interactive reporting on stored data
  - Summarisation and drilling along different attributes
  - Written in extended SQL
- Data mining (high end)
  - Discovery of hidden and embedded patterns
  - Discovery algorithms
  - Written in a programming language, probably with the assistance of SQL

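
The three points on the spectrum can be contrasted on one small table. This is a minimal sketch using Python's built-in `sqlite3` module; the `sales` table and its rows are hypothetical. SQLite does not offer the extended SQL used for OLAP, so step 2 stands in for summarisation with a plain `GROUP BY`, which the slide would still place at the low end. Step 3 uses simple pair counting as a stand-in for a discovery algorithm.

```python
import sqlite3
from collections import Counter
from itertools import combinations

# Hypothetical sales table held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (basket_id INTEGER, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "beer", 4.0), (1, "nappies", 9.0),
        (2, "beer", 4.0), (2, "nappies", 9.0), (2, "milk", 1.0),
        (3, "milk", 1.0), (3, "bread", 2.0),
    ],
)

# 1. Data retrieval: fetch stored rows exactly as recorded.
basket_one = conn.execute(
    "SELECT item FROM sales WHERE basket_id = 1 ORDER BY item"
).fetchall()

# 2. Summarisation: aggregate along one attribute.
revenue_by_item = dict(
    conn.execute("SELECT item, SUM(price) FROM sales GROUP BY item").fetchall()
)

# 3. Pattern discovery: SQL supplies the rows, program code searches them
#    for item pairs that occur together in the same basket.
baskets = {}
for basket_id, item in conn.execute("SELECT basket_id, item FROM sales"):
    baskets.setdefault(basket_id, set()).add(item)

pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

most_common_pair, count = pair_counts.most_common(1)[0]
print(basket_one, revenue_by_item, most_common_pair, count)
```
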
Data Mining: Real-life Databases
Characteristics of a real-life database:
- The size may be extremely large
- The dimensionality can be very high
- Attributes can be of different data types
- Data quality can be very poor
- Data may exist in pieces, isolated in different systems
- Value distributions can be extremely skewed
- Database content can be dynamic and evolving
- Data may lack a traditional record-based structure
- Data are held on secondary storage media

Data Mining: Efficient Algorithms
- Discovering interesting patterns supported by given facts can be computationally hard because many discoveries are combinatorial problems. Trivial algorithms may take too long.
- A discovery algorithm is considered efficient if its execution time and memory requirement are comparable to those of sorting algorithms; otherwise, it is unlikely to scale up well enough to cope with large data sets.
- Efficient discovery algorithms may be hard to find. Using advanced hardware, optimising the implementation of the algorithms and developing approximate solutions can be viable alternatives.

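
To see why trivial algorithms may take too long, consider counting the candidate item combinations a brute-force search would have to examine. The sketch below is an illustration under stated assumptions: candidate itemsets are taken to be all non-empty subsets of the distinct items, and `n * log2(n)` is used as a rough yardstick for sorting cost. The function names are hypothetical.

```python
import math
from itertools import combinations


def count_candidate_itemsets(n_items):
    """Number of non-empty subsets of n_items distinct items: 2**n - 1."""
    return 2 ** n_items - 1


def enumerate_itemsets(items):
    """Brute-force enumeration of every non-empty itemset (small inputs only)."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def sorting_cost(n_records):
    """Rough n * log2(n) comparison count, used as the sorting yardstick."""
    return n_records * math.log2(n_records) if n_records > 1 else 0.0


for n in (10, 20, 40):
    print(n, count_candidate_itemsets(n), round(sorting_cost(n)))
```

The candidate count doubles with every extra item, while the sorting yardstick grows only slightly faster than linearly, which is why exhaustive enumeration stops being practical long before sorting does.
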
Data Mining Objectives
- Classification
  - Using existing data to form a classification model and then using the model to assign an appropriate class label to a data record (e.g. safe vs. risky customers)
- Estimation
  - Similar to classification, but assigns a value to an output variable of a data record (e.g. estimated house value)
- Prediction
  - Similar to classification and estimation, but more concerned with the future outcome of the output (e.g. tomorrow’s weather)
- Description
  - General description of data characteristics (e.g. customer profile)

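
The difference between classification and estimation shows up in the type of the output: a class label versus a numeric value. The sketch below uses k-nearest neighbours purely as a convenient stand-in model; the slide does not prescribe any technique. The customer records, the `credit_limit` target and all function names are hypothetical, and the features are left unscaled for brevity.

```python
from collections import Counter

# Hypothetical training records:
# (salary in thousands, owns_house as 0/1), class label, credit limit
train = [
    ((25, 1), "safe", 2500),
    ((30, 1), "safe", 3000),
    ((35, 0), "risky", 1000),
    ((15, 0), "risky", 500),
    ((38, 1), "safe", 4000),
]


def nearest(query, records, k):
    """Return the k records whose features are closest to query."""
    def squared_distance(record):
        features = record[0]
        return sum((a - b) ** 2 for a, b in zip(features, query))
    return sorted(records, key=squared_distance)[:k]


def classify(query, records, k=3):
    """Classification: majority class label among the k nearest records."""
    labels = Counter(label for _, label, _ in nearest(query, records, k))
    return labels.most_common(1)[0][0]


def estimate(query, records, k=3):
    """Estimation: mean numeric target among the k nearest records."""
    values = [value for _, _, value in nearest(query, records, k)]
    return sum(values) / len(values)


new_customer = (28, 1)
print(classify(new_customer, train), round(estimate(new_customer, train), 2))
```

Prediction would use the same machinery with an output that lies in the future, for example training on past records to forecast tomorrow's value.
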
Data Mining & Other Disciplines
Data mining comes from three existing areas: statistics, machine learning and databases.
- Machine Learning (Artificial Intelligence): inductive & deductive learning methods
- Statistics: data analysis theories, methods and measures
- Database Management: fast storage structures & retrieval operations

Data Mining: Current State
- Many data mining algorithms have been developed or adapted
- Many data mining software tools have been built and are in use
- A cross-industry methodology has been formed
- Besides general solutions, more application-oriented data mining solutions are being developed
- More and more organisations are either doing their own data mining or hiring consultants to do the job
- Data mining has been extended to web mining and text mining

Data Mining: Current State
- Some nuisances
  - Mining cookies
  - Spyware and miningware
  - Intrusion on privacy
- Some serious problems
  - “Big Brother is watching”
  - Unfair advantages in trading practice, e.g. high-frequency trading (HFT)
  - Abuse of personal data
  - Ethical concerns

Data Mining: Promises
Areas of data mining application:
- Finance and insurance
- Marketing and sales
- Medicine
- Agriculture
- Society, politics and economics
- Science
- Engineering
- Law enforcement
- Military and intelligence (classified)

Data Mining: Challenges Faced
Some difficult problems to solve:
- Extremely large data sets
- Extremely high dimensionality (the curse of dimensionality)
- Combinatorial problems and fast algorithms
- Meaningful evaluation of the patterns
- Discovery of changing and evolving patterns
- Integration of data mining techniques
- Comprehensibility of patterns
- Data pre-processing
- Mining non-standard complex data such as multimedia materials

Weka: A Brief Introduction
Overview
- Java tool set developed at the University of Waikato (NZ)
- Free to download and used by many
- A wide range of learning and data pre-processing methods and algorithms, with a Java API
- Offers a GUI (Explorer) and a command-line (Simple CLI) interface to the tools
- Experimenter module to assist the evaluation of classification techniques
- KnowledgeFlow module to enable batch-processing style discovery and incremental mining
- Some visualisation facilities

Weka: A Brief Introduction
Weka Explorer
- For investigative, interactive data mining with small data sets
- Preprocess, Classify, Cluster, Associate, Select Attributes and Visualise pages

Weka: A Brief Introduction
Weka Simple CLI
- Weka facilities as Java classes
- Calling the Java functions as commands

Weka: A Brief Introduction
Weka Experimenter
- Comparing the performance of different classification solutions on a collection of data sets

Weka: A Brief Introduction
Weka KnowledgeFlow
- Setting up a flow of knowledge discovery in a diagram
- Overview of the entire discovery project

Chapter Summary
- Importance of data in operations and importance of information and knowledge in decision-making
- Data rich does not mean information rich
- Data mining: automatic or semi-automatic data understanding and decision support
- To classify, to estimate, to predict and to describe
- Data mining closely relates to databases, statistics and machine learning
- Data mining: from technology towards application
- A lot of potential uses and a lot of challenges to face
- Weka: an excellent tool to support teaching data mining

References
- Read Chapter 1 of Data Mining Techniques and Applications
- Useful further references
  - Han & Kamber, Chapter 1
  - Berry & Linoff, Chapter 1 (business-like)
  - Kdnuggets: