A Deferred Cleansing Method for RFID Data Analytics IBM Almaden Research Center Jun Rao Sangeeta Doraiswamy Latha S. Colby University of California at.


A Deferred Cleansing Method for RFID Data Analytics
IBM Almaden Research Center: Jun Rao, Sangeeta Doraiswamy, Latha S. Colby
University of California at Los Angeles: Hetal Thakkar

RFID and Its Applications
Radio Frequency Identification
  – Radio-based barcode
  – Becoming widely used in supply-chain, asset tracking …
  – Standardization based on Electronic Product Code (EPC)

  EPC   Rtime     Reader   Biz_loc
  e1    t1        r1       warehouse
  e2    t1        r1       warehouse
  e1    t hours   r2       distribution center
  e2    t hours   r2       distribution center
  e1    t hours   r3       store
  …     …         …        …

Analytics on RFID data
  – simple: where is e1 at time t1+50?
  – complex: average time spent per hop in the supply chain

RFID Data Tends to be Dirty
Various types of anomalies
  – Physical: radio interference, media type, etc.
      Redundant reads: (e1, t1, r1, l1) (e1, t1+2 secs, r1, l1)
      False reads: (e1, t1, r1, l1) ---> (e1, t1, r2, l2)
      Missing reads: (e1, t1, r1, l1) <--- (e1, t1+3, r2, l2) (e1, t1+10, r3, l3)
  – Logical: tend to be application dependent
      (e1, t1, r1, back room) (e1, t1+2, r2, sales floor) (e1, t1+5, r1, back room) (e1, t1+9, r2, sales floor)
Small number of anomalies ---> large error in analysis
Cleaning RFID data is imperative!

Eager Cleansing vs. Deferred Cleansing
Conventional approach to cleansing is eager
  – At the edge server: de-dup, smoothing, …
  – Before loading into a warehouse (ETL): has more context than the edge
  – Clean once, reuse at query time
  – Typically reduces data size downstream
  – Best strategy if applicable
Sometimes eager cleansing is not applicable
  – Don't know how to clean until analyzing the data
  – More than one cleaned version (app-dependent anomalies)
  – Law enforcement (pharmaceutical e-pedigree tracking)
We propose deferred cleansing
  – Load everything
  – Clean at query time
  – Has runtime overhead, but offers flexibility
  – Complementary to eager cleansing

Overview of Our Approach
[Figure: architecture diagram. Components: user rule, cleansing rules engine, rules table and EPC reads table (in the database), user query, query rewrite engine.]

Outline
  – Cleansing rules and their implementation
  – Query rewrite over cleansing rules
  – Experimental results
  – Conclusion

RFID Data Characteristics
EPC sequences, each of which has all reads of an EPC in rtime order
  – Very useful for cleansing as well as querying

  EPC   Rtime         Reader   Biz_loc
  e1    t1            r1       warehouse
  e1    t1+30 hours   r2       distribution center
  e1    t hours       r3       store
  e2    t1            r1       warehouse
  e2    t hours       r2       distribution center
  …     …             …        …

Many sequence-based languages proposed
But SQL/OLAP (standardized in SQL 99) can do sequence processing!
Duplicate removal, e.g. for (e1, t1, r1, l1) (e1, t1+2 secs, r1, l1):

  with v1 as (
    select biz_loc as loc_current,
           max(biz_loc) over (partition by epc
                              order by rtime asc
                              rows between 1 preceding and 1 preceding) as loc_before
    from R
  )
  select *
  from v1
  where loc_current != loc_before or loc_before is null;
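The duplicate-removal query on this slide can be tried on a toy table. The sketch below is only an illustration: it runs the same window expression through Python's sqlite3 module rather than DB2, assumes the bundled SQLite is 3.25 or newer (window-function support), and uses made-up sample rows with rtime stored as seconds.

```python
import sqlite3

# Hypothetical sample reads: (epc, rtime in seconds, reader, biz_loc).
reads = [
    ("e1", 0, "r1", "warehouse"),
    ("e1", 2, "r1", "warehouse"),  # redundant read at the same location
    ("e1", 108000, "r2", "distribution center"),
    ("e2", 0, "r1", "warehouse"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table R (epc text, rtime integer, reader text, biz_loc text)")
conn.executemany("insert into R values (?, ?, ?, ?)", reads)

# Same idea as the slide: compare each read's location with the location of
# the previous read in the same EPC sequence, and keep it only if it differs
# (or if there is no previous read).
DEDUP_SQL = """
with v1 as (
  select epc, rtime, reader, biz_loc as loc_current,
         max(biz_loc) over (partition by epc
                            order by rtime asc
                            rows between 1 preceding and 1 preceding) as loc_before
  from R
)
select epc, rtime, reader, loc_current
from v1
where loc_current != loc_before or loc_before is null
order by epc, rtime
"""

cleaned = conn.execute(DEDUP_SQL).fetchall()
for row in cleaned:
    print(row)
```

The second read of e1 is dropped because its location equals that of the preceding read; the first read of every EPC survives because its one-row frame is empty and loc_before is null.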

Exploit SQL/OLAP for Sequence-based Cleansing
Pros
  – more efficient (compared with self-joins)
  – standardized (supported by major DB vendors)
  – integrated: parallelism, optimization
Cons
  – complex syntax
Solution
  – specify cleansing rules in a simpler language (based on SQL-TS)
      has impact on query rewrite as well
  – implement rules in DBMS using SQL/OLAP

Cycle Rule
Scenario
[Figure: a case (epc1) read alternately in the back room (X) and on the sales floor (Y); the read sequence [X Y X Y X Y] is cleansed to [X Y].]
Rule (CLUSTER BY epc SEQUENCE BY rtime)
  Pattern:   (A, B, C)    (an ordered list of singleton references)
  Condition: A.biz_loc = C.biz_loc and A.biz_loc != B.biz_loc
  Action:    DELETE B     (B is the target reference)
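To see why the cycle rule turns [X Y X Y X Y] into [X Y], the rule can be simulated directly. This is a minimal sketch in plain Python, not the SQL/OLAP code the system generates; it assumes that A, B, C are adjacent reads in the EPC sequence and that the condition is evaluated against the original (uncleansed) sequence. The function name and the dictionary layout of a read are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def apply_cycle_rule(reads):
    """Delete B wherever adjacent reads (A, B, C) of one EPC satisfy
    A.biz_loc == C.biz_loc and A.biz_loc != B.biz_loc."""
    # CLUSTER BY epc SEQUENCE BY rtime
    ordered = sorted(reads, key=itemgetter("epc", "rtime"))
    kept = []
    for _, group in groupby(ordered, key=itemgetter("epc")):
        seq = list(group)
        for i, b in enumerate(seq):
            # B needs both a predecessor A and a successor C to match.
            if 0 < i < len(seq) - 1:
                a, c = seq[i - 1], seq[i + 1]
                if a["biz_loc"] == c["biz_loc"] and a["biz_loc"] != b["biz_loc"]:
                    continue  # DELETE B
            kept.append(b)
    return kept


locs = ["X", "Y", "X", "Y", "X", "Y"]
reads = [{"epc": "epc1", "rtime": t, "biz_loc": loc} for t, loc in enumerate(locs)]
cleaned_locs = [r["biz_loc"] for r in apply_cycle_rule(reads)]
print(cleaned_locs)  # ['X', 'Y']
```

Only the first and last reads survive: every interior read is flanked by two reads at the other location, while the first read has no A and the last has no C.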

Reader Rule
Scenario
[Figure: docking door (reader D), warehouse (has location tag), forklift (reader X); read r1 (readerD) is followed within t2 mins by read r2 (readerX).]
Rule
  Pattern:   (A, *B)      (B is a set reference)
  Condition: B.reader = 'readerX' and B.rtime – A.rtime < t2 mins
  Action:    DELETE A
SQL/OLAP implementation

  max(case when reader = 'readerX' then 1 else 0 end)
    over (… range between 1 microsecond following and t2 min following)
    as has_readerX_after
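The reader rule's set reference *B has existential semantics: A is deleted if at least one later read within the window comes from readerX. A rough Python equivalent is sketched below; it assumes rtime is a number of minutes, treats the window as strictly after A and strictly shorter than t2, and uses a quadratic scan for brevity where the SQL/OLAP version uses a single windowed aggregate. All names are illustrative.

```python
def apply_reader_rule(reads, window_mins, reader="readerX"):
    """Delete read A if some later read B of the same EPC was taken by
    `reader` less than `window_mins` minutes after A."""
    kept = []
    for a in reads:
        has_reader_after = any(
            b["epc"] == a["epc"]
            and b["reader"] == reader
            and 0 < b["rtime"] - a["rtime"] < window_mins
            for b in reads
        )
        if not has_reader_after:
            kept.append(a)
    return kept


# Hypothetical reads; rtime in minutes.
reads = [
    {"id": "r1", "epc": "e1", "rtime": 0, "reader": "readerD"},
    {"id": "r2", "epc": "e1", "rtime": 3, "reader": "readerX"},
    {"id": "r3", "epc": "e1", "rtime": 90, "reader": "readerD"},
]
kept_ids = [r["id"] for r in apply_reader_rule(reads, window_mins=5)]
print(kept_ids)  # ['r2', 'r3']
```

Here r1 is removed because readerX reads the same EPC three minutes later; r3 stays because nothing follows it.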

Missing Rule
Scenario
[Figure: reads of a case (epcC) and a pallet (epcP) at locations L1, L2, L3.]
Rule r1
  Pattern:   (X, A, Y)
  Condition: A.is_pallet = 1 and (
               (X.is_pallet = 0 and A.biz_loc = X.biz_loc and A.rtime – X.rtime < 5 mins)
               OR
               (Y.is_pallet = 0 and A.biz_loc = Y.biz_loc and Y.rtime – A.rtime < 5 mins) )
  Action:    MODIFY A.has_case_nearby = 1
Rule r2
  Pattern:   (A, *B)
  Condition: A.is_pallet = 0 or (A.has_case_nearby = 0 and B.has_case_nearby = 1)
  Action:    KEEP A
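One way to read the two missing-rule steps is sketched below in plain Python. It rests on several assumptions that the slide does not spell out: the input is the merged, rtime-ordered sequence of one case's reads and its pallet's reads; has_case_nearby starts at 0; a missing neighbour X or Y simply fails its half of the OR; and rtime is in minutes. Function and field names are illustrative.

```python
def apply_missing_rule(seq, max_gap=5):
    """seq: reads of one case and its pallet, sorted by rtime (minutes)."""
    rows = [dict(r, has_case_nearby=0) for r in seq]

    def is_case_nearby(pallet, other):
        # True if `other` is a case read at the pallet's location, close in time.
        return (
            other is not None
            and other["is_pallet"] == 0
            and other["biz_loc"] == pallet["biz_loc"]
            and abs(pallet["rtime"] - other["rtime"]) < max_gap
        )

    # r1: flag pallet reads that have a case read right before or after them.
    for i, a in enumerate(rows):
        if a["is_pallet"] == 1:
            x = rows[i - 1] if i > 0 else None
            y = rows[i + 1] if i + 1 < len(rows) else None
            if is_case_nearby(a, x) or is_case_nearby(a, y):
                a["has_case_nearby"] = 1

    # r2: keep case reads, plus unflagged pallet reads that are followed by
    # some flagged pallet read (they stand in for the missing case read).
    kept = []
    for i, a in enumerate(rows):
        flagged_later = any(b["has_case_nearby"] == 1 for b in rows[i + 1:])
        if a["is_pallet"] == 0 or (a["has_case_nearby"] == 0 and flagged_later):
            kept.append(a)
    return kept


# Hypothetical scenario: the case read at L2 is missing.
seq = [
    {"is_pallet": 1, "biz_loc": "L1", "rtime": 0},
    {"is_pallet": 0, "biz_loc": "L1", "rtime": 1},
    {"is_pallet": 1, "biz_loc": "L2", "rtime": 60},
    {"is_pallet": 1, "biz_loc": "L3", "rtime": 120},
    {"is_pallet": 0, "biz_loc": "L3", "rtime": 121},
]
result = [(r["is_pallet"], r["biz_loc"]) for r in apply_missing_rule(seq)]
print(result)  # [(0, 'L1'), (1, 'L2'), (0, 'L3')]
```

Under these assumptions the pallet read at L2 is kept in place of the missing case read, while the pallet reads at L1 and L3 are dropped because the case itself was read there.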

Query RFID Data over Cleansing Rules
$Q = \sigma_s(R)$
$Q[C]$ is the answer to $Q$ with respect to rule $C$
Naïve implementation: $Q[C] = \sigma_s(\Phi_C(R))$, where $\Phi_C$ cleans its input using rule $C$
Traditional predicate pushdown through a view is not directly applicable
Can we do this: $Q[C] = \Phi_C(\sigma_s(R))$? (incorrect)

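Example 1
[Figure: timeline with ticks t1-2, t1, t1+2 showing reads r1 (readerD) and r2 (readerX) of a case on a forklift; the query range ends at t1.]
Reader rule
  Pattern:   (A, *B)
  Condition: B.reader = 'readerX' and B.rtime – A.rtime < 5 mins
  Action:    DELETE A
Q1: $\sigma_{rtime < t_1}(R)$
$\sigma_s(\Phi_C(R))$: {}
$\Phi_C(\sigma_s(R))$: {r1}
$e_1 = \sigma_{rtime < t_1}(\Phi_C(\sigma_{rtime < t_1 + 5}(R)))$ (expanded rewrite)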
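Example 1 can be replayed in a few lines of Python. The sketch assumes, reading off the timeline, that r1 is taken at t1-2 and r2 at t1+2 (in minutes), and picks an arbitrary value for t1; the helper names are illustrative.

```python
T1 = 100  # hypothetical value for t1, in minutes


def reader_rule(reads, window=5, reader="readerX"):
    """Phi_C for the reader rule: delete A if `reader` reads the same EPC
    less than `window` minutes after A."""
    return [
        a for a in reads
        if not any(
            b["epc"] == a["epc"]
            and b["reader"] == reader
            and 0 < b["rtime"] - a["rtime"] < window
            for b in reads
        )
    ]


def select(reads, pred):
    """sigma: keep the reads satisfying the predicate."""
    return [r for r in reads if pred(r)]


R = [
    {"id": "r1", "epc": "e1", "rtime": T1 - 2, "reader": "readerD"},
    {"id": "r2", "epc": "e1", "rtime": T1 + 2, "reader": "readerX"},
]

# Naive: clean everything, then apply the query predicate.
naive = select(reader_rule(R), lambda r: r["rtime"] < T1)
# Incorrect pushdown: r2 is filtered out before it can invalidate r1.
pushed = reader_rule(select(R, lambda r: r["rtime"] < T1))
# Expanded rewrite: widen the pushed-down predicate by the rule's window,
# clean, then re-apply the original predicate.
expanded = select(
    reader_rule(select(R, lambda r: r["rtime"] < T1 + 5)),
    lambda r: r["rtime"] < T1,
)

print([r["id"] for r in naive])     # []
print([r["id"] for r in pushed])    # ['r1']
print([r["id"] for r in expanded])  # []
```

The expanded rewrite matches the naive answer while still letting the database discard every read at or after t1+5 before cleansing.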

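Example 2
[Figure: timeline with ticks t2-2, t2, t2+2 showing reads r3 (loc1) and r4 (loc1) of a case; the query range starts at t2.]
Duplicate rule
  Pattern:   (E, F)
  Condition: E.biz_loc = F.biz_loc
  Action:    DELETE F
Q2: $\sigma_{rtime > t_2}(R)$
$\sigma_s(\Phi_C(R))$: {}
$\Phi_C(\sigma_s(R))$: {r4}
$e_2 = \sigma_{rtime > t_2}(\Phi_C(R \ltimes_{epc} \Pi_{epc}(\sigma_{rtime > t_2}(R))))$ (join-back rewrite, always applicable)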
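The join-back rewrite of Example 2 can be simulated the same way. In this sketch r3 and r4 are assumed to sit at t2-2 and t2+2 (minutes), an extra read r5 of a second EPC is added to show what the join-back filters out, and all names and values are illustrative.

```python
T2 = 100  # hypothetical value for t2, in minutes


def duplicate_rule(reads):
    """Phi_C for the duplicate rule: delete F if the previous read E of the
    same EPC has the same biz_loc."""
    ordered = sorted(reads, key=lambda r: (r["epc"], r["rtime"]))
    kept = []
    for i, f in enumerate(ordered):
        e = ordered[i - 1] if i > 0 else None
        if e and e["epc"] == f["epc"] and e["biz_loc"] == f["biz_loc"]:
            continue  # DELETE F
        kept.append(f)
    return kept


def select(reads, pred):
    """sigma: keep the reads satisfying the predicate."""
    return [r for r in reads if pred(r)]


def after_t2(r):
    return r["rtime"] > T2


R = [
    {"id": "r3", "epc": "e1", "rtime": T2 - 2, "biz_loc": "loc1"},
    {"id": "r4", "epc": "e1", "rtime": T2 + 2, "biz_loc": "loc1"},
    {"id": "r5", "epc": "e2", "rtime": T2 - 10, "biz_loc": "loc1"},
]

naive = select(duplicate_rule(R), after_t2)
pushed = duplicate_rule(select(R, after_t2))  # incorrect: r3 is no longer visible

# Join-back: find the EPCs touched by the query, fetch their complete
# sequences, clean those, then re-apply the query predicate.
epcs = {r["epc"] for r in select(R, after_t2)}
full_sequences = [r for r in R if r["epc"] in epcs]
join_back = select(duplicate_rule(full_sequences), after_t2)

print([r["id"] for r in naive])           # []
print([r["id"] for r in pushed])          # ['r4']
print([r["id"] for r in full_sequences])  # ['r3', 'r4']
print([r["id"] for r in join_back])       # []
```

Join-back needs no knowledge of the rule's conditions, which is why it is always applicable; the price is the extra join that pulls in each qualifying EPC's full sequence.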

Rewrite Summary
Expanded rewrite
  – work at rule level, instead of SQL/OLAP level
  – collect conditions in cleansing rules referencing the target reference
  – keep only position-preserving conditions
  – run transitivity between surviving rule conditions and query conditions
  – predicates derived on the target reference can be pushed down
Choose the rewrite between expanded and join-back
Extended to support multiple rules and join queries
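As a worked instance of the transitivity step, take the reader rule and Q1 from Example 1 (times in minutes). The derivation below is a sketch of how the bound $t_1 + 5$ arises; it is not a general statement of the algorithm.

```latex
% Query predicate on the target reference A, plus the rule condition
% linking the set reference B to A:
A.rtime < t_1 \;\wedge\; B.rtime - A.rtime < 5
  \;\Longrightarrow\; B.rtime < t_1 + 5
% Every row that can act as A or as B for a qualifying A therefore
% satisfies rtime < t_1 + 5, so that predicate can be pushed below \Phi_C:
Q_1[C] = \sigma_{rtime < t_1}\bigl(\Phi_C(\sigma_{rtime < t_1 + 5}(R))\bigr)
```

The original predicate is re-applied on top because the widened range also admits rows between $t_1$ and $t_1 + 5$ that the query itself does not ask for.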

Experimental Setup
RFID Data Schema (table, cardinality in parentheses, columns)
  steps (100):       biz_step, desc, type, comment
  parent (s*50):     child_epc, parent_epc
  locs (13k):        gln, desc, site, state, city, comment
  caseR (s*1.5k):    epc, rtime, reader, biz_loc, biz_step
  EPC_info (s*50):   epc, product, lot, manufacture_date, expiration_date, comment
  product (1,000):   product, manufacturer, comment
  palletR (s*30):    epc, …

Queries and Rules
q1. "Dwell" analysis: average staying time between adjacent locations.

  with v1 as (
    select biz_loc as current_loc,
           rtime,
           max(rtime) over (… rows 1 preceding) as prev_time,
           max(biz_loc) over (… rows 1 preceding) as prev_loc
    from caseR
    where rtime <= T1
  )
  select l1.loc_desc, l2.loc_desc, avg(rtime - prev_time)
  from v1, locs l1, locs l2
  where v1.prev_loc = l1.gln and v1.current_loc = l2.gln
  group by l1.loc_desc, l2.loc_desc

q2. Site analysis

  select p.manufacturer, count(distinct s.type), count(distinct c.reader)
  from caseR c, steps s, locs l, epc_info i, product p
  where c.biz_step = s.biz_step
    and c.biz_loc = l.gln
    and c.epc = i.epc
    and i.product = p.product
    and c.rtime >= T2
    and l.site = 'distribution center 2'
  group by p.manufacturer

Rules
  1. reader, on case reads
  2. duplicate, on case reads
  3. replacing, on case reads
  4. cycle, on case reads
  5. missing, on case+pallet reads
Setup
  – 1 GB base data
  – Varying anomaly percentage, generated by inverting the rules
  – DB2 UDB V8.2
  – Indexes on query attributes
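The core of q1 is a pairing of each read with the previous read of the same EPC. A minimal Python sketch of that computation follows; it omits the rtime <= T1 filter and the join with locs, uses raw biz_loc values as the grouping key, and runs on made-up reads with rtime in hours.

```python
from collections import defaultdict


def dwell_analysis(reads):
    """Average time between consecutive reads of the same EPC, grouped by
    (previous location, current location)."""
    ordered = sorted(reads, key=lambda r: (r["epc"], r["rtime"]))
    totals = defaultdict(lambda: [0.0, 0])  # hop -> [sum of gaps, count]
    for prev, cur in zip(ordered, ordered[1:]):
        if prev["epc"] != cur["epc"]:
            continue  # do not pair reads across EPC sequences
        entry = totals[(prev["biz_loc"], cur["biz_loc"])]
        entry[0] += cur["rtime"] - prev["rtime"]
        entry[1] += 1
    return {hop: total / count for hop, (total, count) in totals.items()}


# Hypothetical reads; rtime in hours.
reads = [
    {"epc": "e1", "rtime": 0, "biz_loc": "warehouse"},
    {"epc": "e1", "rtime": 30, "biz_loc": "distribution center"},
    {"epc": "e1", "rtime": 50, "biz_loc": "store"},
    {"epc": "e2", "rtime": 0, "biz_loc": "warehouse"},
    {"epc": "e2", "rtime": 40, "biz_loc": "distribution center"},
]
dwell = dwell_analysis(reads)
print(dwell[("warehouse", "distribution center")])  # 35.0
print(dwell[("distribution center", "store")])      # 20.0
```

Because every hop is derived from adjacent reads, a single redundant or cyclic read adds spurious hops to the averages, which is why the query is run over cleansed sequences.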

Single Rule, 10% anomalies, Varying Selectivity
Both rewrites are more efficient than naïve
Cleansing overhead comes from sort and scalar aggregates in SQL/OLAP
  – sort required by cleansing is shared by q1
Tradeoffs between expanded and join-back rewrite
  – Expanded can't use all predicates in the query; join-back has to do extra joins
Cleansing overhead amortized over joins and aggregates

10% selectivity, 10% anomalies, Varying Rules
Additional overhead per extra rule is moderate
  – sort required in SQL/OLAP is amortized across multiple rules
"Missing rule" adds the most overhead
  – Has to sort both case reads and pallet reads

Conclusion
Proposed a deferred cleansing approach to RFID data
  – Complementary to eager cleansing
  – Has overhead, but offers flexibility
SQL-TS based cleansing rules for simplicity
SQL/OLAP implementation for efficiency
Two query rewrites exploit query predicates and guarantee correctness
Experimental results show deferred cleansing is affordable for typical analytical queries

Extended SQL-TS
CLUSTER BY (epc) and SEQUENCE BY (rtime) define sequences
Pattern defines an ordered list of references
  – a reference with no * sign refers to a single row
  – a reference with a * sign refers to a set of rows
WHERE clause specifies conditions on attributes in references
  – existential semantics on set references
Action is defined on a singleton reference (target reference)
Syntax

  DEFINE [rule name]
  ON [table name]
  FROM [table name]
  CLUSTER BY [cluster key]
  SEQUENCE BY [sequence key]
  AS [pattern]
  WHERE [condition]
  ACTION [DELETE | MODIFY | KEEP]

Example (duplicate rule)

  AS (A, B)
  WHERE A.biz_loc = B.biz_loc
  DELETE B