Lecture 4: Data Integration and Cleaning CMPT 733, SPRING 2016 JIANNAN WANG.

Slides:

Advertisements

Similar presentations

String Similarity Measures and Joins with Synonyms

Advertisements

Lukas Blunschi Claudio Jossen Donald Kossmann Magdalini Mori Kurt Stockinger.

Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.

Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.

Effectively Prioritizing Tests in Development Environment

DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17,

Konstanz, Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital.

Combinatorial Pattern Matching CS 466 Saurabh Sinha.

Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON.

Aki Hecht Seminar in Databases (236826) January 2009

Efficient Search in Large Textual Collections with Redundancy Jiangong Zhang and Torsten Suel Review by Newton Alex

ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 6: General Schema Manipulation Operators PRINCIPLES OF DATA INTEGRATION.

Large-Scale Cost-sensitive Online Social Network Profile Linkage.

Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

SESSION 5.1 In this session we will be exploring Pattern Match Queries List-Of-Values Queries, Non-Matching Values in Queries Both the “And” and the “OR”

1 UTeach Professional Development Courses. 2 UTS Step 1 Early exposure to classroom environment (can be as early as a student’s first semester)

Attribute Data in GIS Data in GIS are stored as features AND tabular info Tabular information can be associated with features OR Tabular data may NOT be.

A Comprehensive Computer Application for Managing Sample Planning, Electronic Data Manipulation and Data Validation and Verification in Support of the.

IMSS005 Computer Science Seminar

A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar.

Efficient Exact Similarity Searches using Multiple Token Orderings Jongik Kim 1 and Hongrae Lee 2 1 Chonbuk National University, South Korea 2 Google Inc.

DBease: Making Databases User-Friendly and Easily Accessible Guoliang Li, Ju Fan, Hao Wu, Jiannan Wang, Jianhua Feng Database Group, Department of Computer.

Sanjay Agarwal Surajit Chaudhuri Gautam Das Presented By : SRUTHI GUNGIDI.

GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)

Definition of Computational Science Computational Science for NRM D. Wang Computational science is a rapidly growing multidisciplinary field that uses.

Progressive Approach to Relational Entity Resolution Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra Progressive Approach to Relational Entity Resolution.

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.

Introduction. 2COMPSCI Computer Science Fundamentals.

BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

In this chapter, you learn about the following: ❑ Anomalies ❑ Dependency and determinants ❑ Normalization ❑ A layman’s method of understanding normalization.

Get your hands dirty cleaning data European EMu Users Meeting, 3rd June. - Elizabeth Bruton, Museum of the History of Science, Oxford

Introduction to Science Informatics Lecture 1. What Is Science? a dependence on external verification; an expectation of reproducible results; a focus.

For: CS590 Intelligent Systems Related Subject Areas: Artificial Intelligence, Graphs, Epistemology, Knowledge Management and Information Filtering Application.

Presented by: Aneeta Kolhe. Named Entity Recognition finds approximate matches in text. Important task for information extraction and integration, text.

Most of contents are provided by the website Introduction TJTSD66: Advanced Topics in Social Media Dr.

Ranking objects based on relationships Computing Top-K over Aggregation Sigmod 2006 Kaushik Chakrabarti et al.

1 Resolving Schematic Discrepancy in the Integration of Entity-Relationship Schemas Qi He Tok Wang Ling Dept. of Computer Science School of Computing National.

Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.

B+ Trees: An IO-Aware Index Structure Lecture 13.

CS5604: Final Presentation ProjOpenDSA: Log Support Victoria Suwardiman Anand Swaminathan Shiyi Wei Department of Computer Science, Virginia Tech December.

1 Some Guidelines for Good Research Dr Leow Wee Kheng Dept. of Computer Science.

BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES Andrew Carson and Charles Schafer.

Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.

Lecture 15: Query Optimization. Very Big Picture Usually, there are many possible query execution plans. The optimizer is trying to chose a good one.

Similarity Measurement and Detection of Video Sequences Chu-Hong HOI Supervisor: Prof. Michael R. LYU Marker: Prof. Yiu Sang MOON 25 April, 2003 Dept.

DAY 14: ACCESS CHAPTER 1 RAHUL KAVI October 8,

CS 101 – Course outline Binary representations –Numbers –Text –Images –Linear vs. nonlinear information Excel –Formulas, functions –Tools: Solver, Goal.

CSCD 433/533 Advanced Computer Networks Lecture 1 Course Overview Spring 2016.

Big Data Programming II Course Introduction CMPT 733, SPRING 2016 JIANNAN WANG.

CS 784: Advanced Topics in Data Management This semester’s focus: Data Science AnHai Doan.

Jeffrey D. Ullman Stanford University.  A real story from CS341 data-mining project class.  Students involved did a wonderful job, got an “A.”  But.

INFO 340 Lecture 3 Relational Databases. Based on the relational model, grounded in mathematic set theories. Three basic elements: Relation, Tuple, and.

Zachary Starr Dept. of Computer Science, University of Missouri, Columbia, MO 65211, USA Digital Image Processing Final Project Dec 11 th /16 th, 2014.

COMP9313: Big Data Management Lecturer: Xin Cao Course web site:

Development History Granularity Transformations

CIS 336 strCompetitive Success/tutorialrank.com

CIS 336 str Education for Service-- tutorialrank.com.

Data Mining: Concepts and Techniques Course Outline

Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.

Database Management Systems (CS 564)

Big Data Programming II Course Introduction

Lecture 12: Data Wrangling

Lecture 8-2: Introduction to Data Visualization

CS246: Information Retrieval

Query Optimization.

CMPT 120 Lecture 2 - Introduction to Computing Science – Problem Solving, Algorithm and Programming.

An Efficient Partition Based Method for Exact Set Similarity Joins

Presentation transcript:

Lecture 4: Data Integration and Cleaning CMPT 733, SPRING 2016 JIANNAN WANG

Outline Motivation ◦Asking “Why” before you learn something Data Integration and Cleaning ◦Getting the big picture Entity Resolution ◦Learning how to solve a particular problem 02/02/16JIANNAN WANG - CMPT 733 2

Want to become a data scientist? One definition 02/02/16JIANNAN WANG - CMPT 733 3

Making it More Specific Domain Knowledge ◦Use domain knowledge to ask questions ◦E.g., Will Donald Trump become the president of USA? Math/Statistics ◦Use math/statistics knowledge to come up with a solution ◦E.g., Design an algorithm that can make the prediction based on social media data Computer Science ◦Use programming skills to build a data-processing pipeline ◦E.g., The pipeline takes social media data as input, and outputs the prediction 02/02/16JIANNAN WANG - CMPT No domain knowledge? Then, you have to have strong teamwork and communication skills Want to know more statistics? Read CH 4-7 in “Data Science from Scratch” What is a data-processing pipeline?

Data-processing Pipeline 02/02/16JIANNAN WANG - CMPT What you think you do? What you really do? Analysis/ Modeling Cleaned Data Predictive Model Analysis/ Modeling Visualization Data Cleaning Data Integration Data Collection

Example: Assignment Mark Prediction 02/02/16JIANNAN WANG - CMPT Data Collection ◦Collect background information Data Cleaning ◦Missing values, Inconsistent values Data integration ◦Integrate with CourSys to get assignment marks Modeling ◦Build a linear regression model Visualization ◦Present results to non-technical persons

Outline Motivation ◦Asking “Why” before you learn something Data Cleaning and Integration ◦Getting the big picture 02/02/16JIANNAN WANG - CMPT 733 7

Dirty Data Problems From Stanford Course: 1)parsing text into fields (separator issues) 2)Missing required field (e.g. key field) 3)Different representations (iphone 2 vs iphone 2 nd generation) 4)Fields too long (get truncated) 5)Formatting issues – especially dates 6)Licensing issues/Privacy/ keep you from using the data as you would like? 02/02/16JIANNAN WANG - CMPT 733 8

Data Cleaning Tools Python ◦Missing Data (Pandas)Missing Data ◦Deduplication (Dedup)Deduplication OpenRefine ◦Open-source Software ( ◦OpenRefine as a Service (RefinePro)RefinePro Data Wrangler ◦The Stanford/Berkeley Wrangler research project ◦Commercialized (Trifacta)Trifacta 02/02/16JIANNAN WANG - CMPT 733 9

Data Integration Problem 02/02/16JIANNAN WANG - CMPT First NameLast NameMark MichaelJordan50 KobeBryant48 NameBackground Mike JordanC++, CS, 4 years Kobe BryantBusiness, 2 years Data Source 1 (from CourSys) Data Source 2 (from survey) NameMarkBackground Michael Jordan50C++, CS, 4 years Kobe Bryant48Business, 2 years Data Integration??? Integrated Data

Data Integration: Three Steps Schema Mapping ◦Creating a global schema ◦Mapping local schemas to the global schema Entity Resolution ◦You will learn this in detail later Data Fusion ◦Resolving conflicts based on some confidence scores Want to know more? ◦Anhai Doan, Alon Y. Halevy, Zachary Ives. Principles of Data Integration. Morgan Kaufmann Publishers, 2012.Principles of Data Integration 02/02/16JIANNAN WANG - CMPT

Outline Motivation ◦Asking “Why” before you learn something Data Integration and Cleaning ◦Getting the big picture Entity Resolution ◦Learning how to solve a particular problem 02/02/16JIANNAN WANG - CMPT

Entity Resolution 02/02/16JIANNAN WANG - CMPT

Output of Entity Resolution 02/02/16JIANNAN WANG - CMPT IDProduct NamePrice r1r1 iPad Two 16GB WiFi White$490 r2r2 iPad 2nd generation 16GB WiFi White$469 r3r3 iPhone 4th generation White 16GB$545 r4r4 Apple iPhone 3rd generation Black 16GB$375 r5r5 Apple iPhone 4 16GB White$520, (r 3, r 5 )(r 1, r 2 )

Entity Resolution Techniques 02/02/16JIANNAN WANG - CMPT Jaccard(r1, r2) = 0.9 ≥ 0.8 Matching Jaccard(r4, r8) = 0.1 < 0.8 Non-matching

Similarity-based Suppose the similarity function is Jaccard. Problem Definition 02/02/16JIANNAN WANG - CMPT

Filtering-and-Verification Step 1. Filtering ◦Removing obviously dissimilar pairs Step 2. Verification ◦Computing Jaccard similarity only for the survived pairs 02/02/16JIANNAN WANG - CMPT

How Does Filtering Work? What are “obviously dissimilar pairs”? ◦Two records are obviously dissimilar if they do not share any word. ◦In this case, their Jaccard similarity is zero, thus they will not be returned as a result and can be safely filtered. How can we efficiently return the record pairs that share at least one word? ◦To help you understand the solution, let’s first consider a simplified version of the problem, which assumes that each record only contains one word 02/02/16JIANNAN WANG - CMPT

A simplified version Apple Banana Orange Banana 02/02/16JIANNAN WANG - CMPT Suppose each record has only one word. Write a SQL query to do the filtering. r1r1 r2r2 r3r3 r4r4 r5r5 SELECT T1.id, T2.id FROM Table T1, Table T2 WHERE T1.word = T2.word Output: (r1, r2), (r3, r5) #comparisons: ( n apple ) 2 +(n banana ) 2 +(n orange ) 2 = 4+4+1=9

A general case Suppose each record can have multiple words. 02/02/16JIANNAN WANG - CMPT Apple, Orange Apple Banana Orange, Apple Banana r1r1 r2r2 r3r3 r4r4 r5r5 Flatten Apple Orange Apple Banana Orange Apple Banana r1r1 r1r1 r3r3 r4r4 r5r5 r2r2 r4r4 1.This new table can be thought of as the inverted index of the old table. 2.Run the previous SQL on this new table and remove redundant pairs.

Not satisfied with efficiency? Exploring stronger filter conditions ◦Filter the record pairs that share zero token ◦Filter the record pairs that share one token ◦….◦…. ◦Filter the record pairs that share k tokens Challenges ◦How to develop efficient filter algorithms for these stronger conditions? 02/02/16JIANNAN WANG - CMPT Jiannan Wang, Guoliang Li, Jianhua Feng. Can We Beat The Prefix Filtering? An Adaptive Framework for Similarity Join and Search. Can We Beat The Prefix Filtering? An Adaptive Framework for Similarity Join and Search. SIGMOD 2012:85-96.

Not satisfied with result quality? 02/02/16JIANNAN WANG - CMPT M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003Adaptive duplicate detection using learnable string similarity measures

Summary 02/02/16JIANNAN WANG - CMPT

Assignment 4: Entity Resolution Part 1. Similarity Joins (required) Part 2. Where To Go From Here (optional) 02/02/16JIANNAN WANG - CMPT Deadline: 11:59pm, Feb 15th

About Final Project My initial thoughts ◦You have to integrate data from at least two data sources ◦You have to come up with some interesting questions to investigate ◦You have to give an excellent talk on your project proposal ◦… You will get a detailed project instruction in two weeks 02/02/16JIANNAN WANG - CMPT