

DATA MINING LECTURE 2 What is data?

What is Data Mining?
- Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.
- “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst” (Hand, Mannila, Smyth)
- “Data mining is the discovery of models for data” (Rajaraman, Ullman)
- We can have the following types of models:
  - Models that explain the data (e.g., a single function)
  - Models that predict future data instances
  - Models that summarize the data
  - Models that extract the most prominent features of the data

Why do we need data mining?
- Really huge amounts of complex data generated from multiple sources and interconnected in different ways
  - Scientific data from different disciplines: weather, astronomy, physics, biological microarrays, genomics
  - Huge text collections: the Web, scientific articles, news, tweets, Facebook postings
  - Transaction data: retail store records, credit card records
  - Behavioral data: mobile phone data, query logs, browsing behavior, ad clicks
  - Networked data: the Web, social networks, IM networks, network, biological networks
  - All these types of data can be combined in many ways: Facebook has a network, text, images, user behavior, ad transactions
- We need to analyze this data to extract knowledge
  - Knowledge can be used for commercial or scientific purposes
  - Our solutions should scale to the size of the data

What is Data?
- Collection of data objects and their attributes
- An attribute is a property or characteristic of an object
  - Examples: name, date of birth, height, occupation
  - An attribute is also known as a variable, field, characteristic, or feature
- For each object the attributes take some values
- The collection of attribute-value pairs describes a specific object
  - An object is also known as a record, point, case, sample, entity, or instance
- [Figure: data table labeled with Attributes and Objects]
- Size (n): number of objects
- Dimensionality (d): number of attributes
- Sparsity: number of populated object-attribute pairs
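
The three quantities defined on this slide can be computed directly from a small table. The sketch below is illustrative only: the objects, attribute names, and values are invented, and `None` is assumed to mark an object-attribute pair that is not populated.

```python
# Toy data set (made-up values): each object is a dict of attribute-value pairs.
# None is used here, by assumption, to mark an unpopulated object-attribute pair.
objects = [
    {"name": "Ann", "height": 170, "occupation": "engineer"},
    {"name": "Bob", "height": None, "occupation": "teacher"},
    {"name": "Eve", "height": 165, "occupation": None},
]
attributes = ["name", "height", "occupation"]

n = len(objects)      # size: number of objects
d = len(attributes)   # dimensionality: number of attributes

# Sparsity in the slide's sense: number of populated object-attribute pairs.
populated = sum(
    1 for obj in objects for attr in attributes if obj.get(attr) is not None
)

print(f"n={n}, d={d}, populated pairs={populated} of {n * d}")
```

Counting populated pairs, rather than empty ones, follows the definition given on the slide.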

Types of Attributes
- There are different types of attributes
- Numeric
  - Examples: dates, temperature, time, length, value, count
  - Discrete (counts) vs. continuous (temperature)
  - Special case: binary/Boolean attributes (yes/no, exists/not exists)
- Categorical
  - Examples: eye color, zip codes, strings, rankings (e.g., good, fair, bad), height in {tall, medium, short}
  - Nominal (no order or comparison) vs. ordinal (order but not comparable)

Numeric Relational Data
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points/vectors in a multi-dimensional space, where each dimension represents a distinct attribute
- Such a data set can be represented by an n-by-d data matrix, where there are n rows, one for each object, and d columns, one for each attribute
- Example attributes (columns): Temperature, Humidity, Pressure

Numeric data
- For small dimensions we can plot the data
- We can use geometric analogues to define concepts like distance or similarity
- We can use linear algebra to process the data matrix
- Thinking of numeric data as points or vectors is very convenient
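
As a rough illustration of treating rows of a data matrix as points, the sketch below computes the Euclidean distance between two objects. The readings are invented, and Euclidean distance is only one possible choice of geometric distance; the slides do not prescribe a particular one.

```python
import math

# Hypothetical n-by-d data matrix: one row per object,
# columns are (temperature, humidity, pressure). Values are made up.
data_matrix = [
    [30.0, 0.8, 90.0],
    [32.0, 0.5, 80.0],
    [24.0, 0.3, 95.0],
]

def euclidean(x, y):
    """Euclidean distance between two points with the same number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

n = len(data_matrix)      # number of objects (rows)
d = len(data_matrix[0])   # number of attributes (columns)

# Distance between the first two objects, viewed as points in d-dimensional space.
dist_01 = euclidean(data_matrix[0], data_matrix[1])
print(n, d, round(dist_01, 3))
```

Because the columns here are on very different scales, the pressure column dominates the distance; rescaling attributes before comparing objects is a common remedy.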

Categorical Relational Data
- Data that consists of a collection of records, each of which consists of a fixed set of categorical attributes

ID Number  Zip Code  Marital Status  Income Bracket
…          …         Single          High
…          …         Married         Low
…          …         Divorced        High
…          …         Single          Medium

Mixed Relational Data
- Data that consists of a collection of records, each of which consists of a fixed set of both numeric and categorical attributes

ID Number  Zip Code  Age  Marital Status  Income  Income Bracket
…          …         …    Single          250000  High
…          …         …    Married         30000   Low
…          …         …    Divorced        200000  High
…          …         …    Single          150000  Medium

Mixed Relational Data
- Data that consists of a collection of records, each of which consists of a fixed set of both numeric and categorical attributes

ID Number  Zip Code  Age  Marital Status  Income  Income Bracket  Refund
…          …         …    Single          250000  High            No
…          …         …    Married         30000   Low             Yes
…          …         …    Divorced        200000  High            No
…          …         …    Single          150000  Medium          No

Mixed Relational Data
- Data that consists of a collection of records, each of which consists of a fixed set of both numeric and categorical attributes

ID Number  Zip Code  Age  Marital Status  Income  Income Bracket  Refund
…          …         …    Single          250000  High            …
…          …         …    Married         30000   Low             …
…          …         …    Divorced        200000  High            …
…          …         …    Single          150000  Medium          0

- Boolean attributes can be thought of as both numeric and categorical
- When appearing together with other attributes they make more sense as categorical
- They are often represented as numeric, though

Physical data storage
- Tab- or comma-separated files (TSV/CSV), Excel sheets, relational tables
  - Assume a strict schema and relatively dense data
- Flat file with triplets (record id, attribute, attribute value)
  - A very flexible data format; allows multiple values for the same attribute (e.g., phone number)
- JSON, XML format
  - Standards for data description that are more flexible than relational tables
  - There exist parsers for reading such data
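
To make the triplet format concrete, the following sketch groups (record id, attribute, attribute value) rows back into one record per id. The triplets and the helper name `group_triplets` are hypothetical; every attribute is stored as a list so that repeated attributes, such as several phone entries, are kept.

```python
from collections import defaultdict

# Hypothetical flat-file contents, one (record id, attribute, value) triplet per row.
triplets = [
    (1, "name", "John"),
    (1, "phone", "home"),
    (1, "phone", "office"),   # same attribute repeated for the same record
    (2, "name", "Mary"),
]

def group_triplets(rows):
    """Collect triplets into {record_id: {attribute: [values, ...]}}."""
    records = defaultdict(lambda: defaultdict(list))
    for record_id, attribute, value in rows:
        records[record_id][attribute].append(value)
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {rid: dict(attrs) for rid, attrs in records.items()}

records = group_triplets(triplets)
print(records)
```

Record 2 simply has no `phone` key, which is how this layout copes with sparse data without reserving empty cells.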

Examples

JSON EXAMPLE – Record of a person

    {
      "firstName": "John",
      "lastName": "Smith",
      "isAlive": true,
      "age": 25,
      "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": " "
      },
      "phoneNumbers": [
        { "type": "home", "number": " " },
        { "type": "office", "number": " " }
      ],
      "children": [],
      "spouse": null
    }

XML EXAMPLE – Record of a person

    John Smith … nd Street New York NY … home … fax … male
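
A minimal sketch of reading such a record with a parser, here Python's standard `json` module. The text is an abbreviated copy of the JSON record above: fields whose values are not shown in the slide (postal code, phone numbers) are left out rather than guessed.

```python
import json

# Abbreviated copy of the person record; fields with missing values are omitted.
text = """
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {"streetAddress": "21 2nd Street", "city": "New York", "state": "NY"},
  "phoneNumbers": [{"type": "home"}, {"type": "office"}],
  "children": [],
  "spouse": null
}
"""

person = json.loads(text)

# JSON true/null become Python True/None; objects become dicts, arrays become lists.
phone_types = [entry["type"] for entry in person["phoneNumbers"]]
print(person["firstName"], person["address"]["city"], phone_types)
```

Nested objects and variable-length arrays such as `phoneNumbers` are what make this format more flexible than a fixed-column table.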

Set data
- Each record is a set of items from a space of possible items
- Example: Transaction data
  - Also called market-basket data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Set data
- Each record is a set of items from a space of possible items
- Example: Document data
  - Also called bag-of-words representation

Doc Id  Words
1       the, dog, followed, the, cat
2       the, cat, chased, the, cat
3       the, man, walked, the, dog

Vector representation of market-basket data
- Market-basket data can be represented, or thought of, as numeric vector data
- The vector is defined over the set of all possible items
- The values are binary (the item appears or not in the set)

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

TID  Bread  Coke  Milk  Beer  Diaper
1    1      1     1     0     0
2    1      0     0     1     0
3    0      1     1     1     1
4    1      0     1     1     1
5    0      1     1     0     1

- Sparsity: Most entries are zero. Most baskets contain few items
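
The conversion from baskets to binary vectors can be sketched as follows, using the five transactions from the slide. The column order (Bread, Coke, Milk, Beer, Diaper) follows the slide's table header; the variable names are illustrative.

```python
# The five transactions from the slide, keyed by TID.
baskets = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One vector dimension per possible item, in the slide's column order.
items = ["Bread", "Coke", "Milk", "Beer", "Diaper"]

# Entry is 1 if the item appears in the basket, 0 otherwise.
vectors = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in baskets.items()
}

for tid, vec in vectors.items():
    print(tid, vec)

# Number of non-zero entries across the whole matrix.
nonzero = sum(sum(vec) for vec in vectors.values())
```

With only five items the matrix is fairly dense; with a realistic item space of thousands of products, almost every entry in each row would be zero.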

Vector representation of document data
- Document data can be represented, or thought of, as numeric vector data
- The vector is defined over the set of all possible words
- The values are the counts (number of times a word appears in the document)

Doc Id  Words
1       the, dog, follows, the, cat
2       the, cat, chases, the, cat
3       the, man, walks, the, dog

Doc Id  the  dog  follows  cat  chases  man  walks
1       2    1    1        1    0       0    0
2       2    0    0        2    1       0    0
3       2    1    0        0    0       1    1

- Sparsity: Most entries are zero. Most documents contain few of the words
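
A small sketch of building the count vectors for the three documents on this slide with `collections.Counter`. The vocabulary order follows the slide's column header; the variable names are illustrative.

```python
from collections import Counter

# The three documents from the slide, already split into words.
docs = {
    1: ["the", "dog", "follows", "the", "cat"],
    2: ["the", "cat", "chases", "the", "cat"],
    3: ["the", "man", "walks", "the", "dog"],
}

# One vector dimension per word, in the slide's column order.
vocabulary = ["the", "dog", "follows", "cat", "chases", "man", "walks"]

count_vectors = {}
for doc_id, words in docs.items():
    counts = Counter(words)  # a Counter returns 0 for words it has not seen
    count_vectors[doc_id] = [counts[word] for word in vocabulary]

for doc_id, vec in count_vectors.items():
    print(doc_id, vec)
```

Word order inside each document is discarded, which is what makes this a bag-of-words rather than a sequence representation.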

Physical data storage
- Usually set data is stored in flat files
  - One line per set

I heard so many good things about this place so I was pretty juiced to try it. I'm from Cali and I heard Shake Shack is comparable to IN-N-OUT and I gotta say, Shake Shake wins hands down. Surprisingly, the line was short and we waited about 10 MIN. to order. I ordered a regular cheeseburger, fries and a black/white shake. So yummerz. I love the location too! It's in the middle of the city and the view is breathtaking. Definitely one of my favorite places to eat in NYC.
I'm from California and I must say, Shake Shack is better than IN-N-OUT, all day, err'day

Ordered Data
- Genomic sequence data
  - Data is a long ordered string

Ordered Data
- Time series
  - Sequence of ordered (over “time”) numeric values

Graph Data
- Graph data: a collection of entities and their pairwise relationships
- Examples:
  - Web pages and hyperlinks
  - Facebook users and friendships
  - The connections between brain neurons
- In this case the data consists of pairs: who links to whom

Representation
- Adjacency matrix
  - Very sparse, very wasteful, but useful conceptually

Representation
- Adjacency list
  - Not so easy to maintain

1: [2, 3]
2: [1, 3]
3: [1, 2, 4]
4: [3, 5]
5: [4]

Representation
- List of pairs
  - The simplest and most efficient representation

(1,2) (2,3) (1,3) (3,4) (4,5)
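
The three representations describe the same graph and can be derived from one another. The sketch below starts from the list of pairs on this slide and builds the adjacency list and adjacency matrix. It assumes the graph is undirected, which matches the symmetric adjacency list shown earlier; the function names are illustrative.

```python
# List of pairs from the slide; nodes are numbered 1..5.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]

def to_adjacency_list(pairs):
    """Build {node: sorted list of neighbours}, treating each pair as undirected."""
    adjacency = {}
    for u, v in pairs:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return {node: sorted(nbrs) for node, nbrs in sorted(adjacency.items())}

def to_adjacency_matrix(pairs, num_nodes):
    """Build a num_nodes x num_nodes 0/1 matrix; node i maps to row/column i-1."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in pairs:
        matrix[u - 1][v - 1] = 1
        matrix[v - 1][u - 1] = 1  # symmetric because the graph is assumed undirected
    return matrix

adjacency_list = to_adjacency_list(edges)
adjacency_matrix = to_adjacency_matrix(edges, 5)

print(adjacency_list)
for row in adjacency_matrix:
    print(row)
```

The matrix stores 25 cells for 5 edges, which illustrates why it is described as wasteful for sparse graphs, while the list of pairs stores exactly one entry per edge.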

Types of data: summary
- Numeric data: Each object is a point in a multidimensional space
- Categorical data: Each object is a vector of categorical values
- Set data: Each object is a set of values (with or without counts)
  - Sets can also be represented as binary vectors, or vectors of counts
- Ordered sequences: Each object is an ordered sequence of values
- Graph data: A collection of pairwise relationships