Published by Kathleen Fryar. Modified about 1 year ago.

1
1 Architectures and Algorithms for Data Privacy
Dilys Thomas, Stanford University, April 30th, 2007
Advisor: Rajeev Motwani

2
2 RoadMap Motivation for Data Privacy Research Sanitizing Data for Privacy Privacy Preserving OLAP K-Anonymity/ Clustering for Anonymity Probabilistic Anonymity Masketeer Auditing for Privacy Distributed Architectures for Privacy

3
3 Motivation 1: Data Privacy in Enterprises
Health: Personal medical details, Disease history, Clinical research data
Banking: Bank statement, Loan details, Transaction history
Finance: Portfolio information, Credit history, Transaction records, Investment details
Insurance: Claims records, Accident history, Policy details
Outsourcing: Customer data for testing, Remote DB administration, BPO & KPO
Retail Business: Inventory records, Individual credit card details, Audits
Manufacturing: Process details, Blueprints, Production data
Govt. Agencies: Census records, Economic surveys, Hospital records

4
4 Motivation 2: Government Regulations
Country | Privacy Legislation
Australia | Privacy Amendment Act of 2000
European Union | Personal Data Protection Directive 1998
Hong Kong | Personal Data (Privacy) Ordinance of 1995
United Kingdom | Data Protection Act of 1998
United States | Security Breach Information Act (S.B. 1386) of 2002; Gramm-Leach-Bliley Act of 1999; Health Insurance Portability and Accountability Act of 1996

5
5 Motivation 3: Personal Information s Searches on Google/Yahoo Profiles on Social Networking sites Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations Documents on the Computer / Network

6
6 Losses due to Lack of Privacy: ID-Theft 3% of households in the US affected by ID-Theft US $5-50B losses/year UK £1.7B losses/year AUS $1-4B losses/year

7
7 RoadMap Motivation for Data Privacy Research Sanitizing Data for Privacy Privacy Preserving OLAP K-Anonymity/ Clustering for Anonymity Probabilistic Anonymity Masketeer Auditing for Privacy Distributed Architectures for Privacy

8
8 Privacy Preserving Data Analysis i.e. Online Analytical Processing OLAP Computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source Agrawal, Srikant, Thomas SIGMOD 2005

9
9 Privacy Preserving OLAP Motivation Problem Definition Query Reconstruction Inversion method Single attribute Multiple attributes Iterative method Privacy Guarantees Experiments

10
10 Horizontally Partitioned Personal Information
Client C1: original row r1, perturbed to p1
Client C2: original row r2, perturbed to p2
...
Client Cn: original row rn, perturbed to pn
Table T for analysis at server: p1, p2, ..., pn
EXAMPLE: What number of children in this county go to college?

11
11 Vertically Partitioned Enterprise Information
Original Relation D1 (ID, C1): John 1; Alice 5; Bob 18
Perturbed Relation D'1 (ID, C1): John 1; Alice 7; Bob 18
Original Relation D2 (ID, C2 C3): John 279; Alice 536
Perturbed Relation D'2 (ID, C2 C3): John 359; Alice 537
Perturbed Joined Relation D' (ID, C1 C2 C3): John 1359; Alice 7537
EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

12
12 Privacy Preserving OLAP: Problem Definition
Compute: select count(*) from T where P1 and P2 and P3 and ... Pk
E.g. find the number of people with age in [30-50] and salary in [80-150], i.e. COUNT_T(P1 and P2 and P3 and ... Pk)
Goals:
Provide error bounds to the analyst.
Provide privacy guarantees to the data sources.
Scale to a larger number of attributes.

13
13 Perturbation Example: Uniform Retention Replacement
Throw a biased coin (BIAS = 0.2).
Heads: retain the value.
Tails: replace it with a random number from a predefined pdf (in this example, uniformly at random from [1-5]).

14
14 Retention Replacement Perturbation Done for each column The replacing pdf need not be uniform Best to use original pdf if available/ estimable Different columns can have different biases for retention
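The retention replacement step on the last two slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `perturb_column` and `uniform_1_to_5` are hypothetical, and the coin bias is assumed to be the retention probability p.

```python
import random

def perturb_column(values, p, replace_sampler, rng=random):
    """Retain each value with probability p, else replace it with a random draw.

    replace_sampler is any function rng -> value; it plays the role of the
    replacing pdf, which need not be uniform.
    """
    out = []
    for v in values:
        if rng.random() < p:
            out.append(v)                     # heads: retain the true value
        else:
            out.append(replace_sampler(rng))  # tails: replace with a random draw
    return out

def uniform_1_to_5(rng):
    # Replacing pdf from the slide example: uniform at random over [1-5].
    return rng.randint(1, 5)

rng = random.Random(0)                        # seeded only to make the sketch repeatable
original = [1, 3, 2, 5, 4, 1]
perturbed = perturb_column(original, p=0.2, replace_sampler=uniform_1_to_5, rng=rng)
```

Each column would be perturbed by its own call, so different columns can use different retention probabilities and different replacing distributions.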

15
15 Single Attribute Example
What is the fraction of people in this building with age 30-50?
Assume ages lie in a fixed, known range.
Whenever a person enters the building, they flip a coin with heads probability p=0.2.
Heads: report true age (RETAIN).
Tails: report a random number drawn uniformly from that range (PERTURB).
In total, 100 randomized numbers are collected. Of these, 22 are in 30-50.
How many among the original are 30-50?

16
16 Privacy Preserving OLAP Motivation Problem Definition Query Reconstruction Inversion method Single attribute Multiple attributes Iterative method Privacy Guarantees Experiments

17
17 Analysis 80 Perturbed 20 Retained Out of 100 : 80 perturbed (0.8 fraction), 20 retained (0.2 fraction)

18
18 Analysis Contd. 64 Perturbed, NOT Age[30-50] 16 Perturbed, Age[30-50] 20 Retained 20% of the 80 randomized rows, i.e. 16 of them satisfy Age[30-50]. The remaining 64 don’t.

19
19 Analysis Contd.
Since there were 22 randomized rows in [30-50] and 16 of them are perturbed rows, 22 - 16 = 6 of them come from the 20 retained rows.
16 Perturbed, Age[30-50]
64 Perturbed, NOT Age[30-50]
6 Retained, Age[30-50]
14 Retained, NOT Age[30-50]

20
20 Scaling up
Total Rows | Age[30-50]
20 (retained) | 6
100 | ? = 6 × 100/20 = 30
Thus 30 people had age 30-50 in expectation.
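The arithmetic of the last four slides can be written as one small function. This is a sketch under the slide's assumptions (retention probability p, and a fraction b of replaced values landing in the range); the name `estimate_original_count` is illustrative only.

```python
def estimate_original_count(observed, n, p, b):
    """Estimate how many of the n original rows satisfy the predicate.

    observed: rows satisfying the predicate after perturbation
    p:        retention probability
    b:        probability that a replaced value satisfies the predicate
    """
    expected_from_replaced = (1 - p) * b * n      # 0.8 * 0.2 * 100 = 16 in the example
    retained_in_range = observed - expected_from_replaced  # 22 - 16 = 6
    return retained_in_range / p                  # scale the retained rows up to all n rows

estimate = estimate_original_count(observed=22, n=100, p=0.2, b=0.2)
print(round(estimate, 6))
```

With the slide's numbers the estimate comes out at 30, up to floating-point rounding.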

21
21 Multiple Attributes (k=2)
P1 = Age[30-50], P2 = Salary[80-150]
Query | Estimated on T | Evaluated on T'
count(¬P1 ∧ ¬P2) | x0 | y0
count(¬P1 ∧ P2) | x1 | y1
count(P1 ∧ ¬P2) | x2 | y2
count(P1 ∧ P2) | x3 | y3

22
22 Architecture

23
23 Formally: Select count(*) from R where Pred
p = retention probability (0.2 in example)
1-p = probability that an element is replaced by a draw from the replacing p.d.f.
b = probability that an element from the replacing p.d.f. satisfies predicate Pred (0.2 in example)
a = 1-b

24
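24 Transition matrix
\[
\begin{pmatrix} \mathrm{Count}_T(\neg P) & \mathrm{Count}_T(P) \end{pmatrix}
\begin{pmatrix} (1-p)a + p & (1-p)b \\ (1-p)a & (1-p)b + p \end{pmatrix}
=
\begin{pmatrix} \mathrm{Count}_{T'}(\neg P) & \mathrm{Count}_{T'}(P) \end{pmatrix}
\]
i.e. solve xA = y.
A_00 = probability that the original element satisfies ¬P and after perturbation still satisfies ¬P
p = probability it was retained
(1-p)a = probability it was perturbed and satisfies ¬P
A_00 = (1-p)a + p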

25
25 Multiple Attributes
For k attributes, x and y are vectors of size $2^k$.
$x = y A^{-1}$, where $A = A_1 \otimes A_2 \otimes \dots \otimes A_k$ [Tensor Product]
$A_i$ is the transition matrix for column i.
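A small sketch of the inversion method for k = 2 follows, in plain Python so that it is self-contained. The helper names (`transition_matrix`, `inverse_2x2`, `kron`, `vec_mat`) are hypothetical, and the per-column parameters in the usage lines are made-up illustration values. It relies on the standard identity that the inverse of a tensor (Kronecker) product is the tensor product of the inverses.

```python
def transition_matrix(p, b):
    """2x2 transition matrix for one column: rows = original (not P, P), cols = perturbed."""
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def inverse_2x2(m):
    (m00, m01), (m10, m11) = m
    det = m00 * m11 - m01 * m10       # equals p for a transition matrix of this form
    return [[m11 / det, -m01 / det],
            [-m10 / det, m00 / det]]

def kron(m1, m2):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[x * y for x in r1 for y in r2] for r1 in m1 for r2 in m2]

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

# Illustration values (assumed): column 1 has p=0.2, b=0.2; column 2 has p=0.5, b=0.3.
A1 = transition_matrix(0.2, 0.2)
A2 = transition_matrix(0.5, 0.3)

x_true = [40.0, 30.0, 20.0, 10.0]            # counts for (not P1, not P2), (not P1, P2), (P1, not P2), (P1, P2)
y = vec_mat(x_true, kron(A1, A2))            # expected counts on the perturbed table
x_est = vec_mat(y, kron(inverse_2x2(A1), inverse_2x2(A2)))  # x = y A^{-1}
```

Inverting each 2x2 factor separately avoids ever inverting the full $2^k \times 2^k$ matrix directly.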

26
26 Error Bounds
In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9.
Given a perturbation T → T' of a table with n rows, f(T) is $(n, \epsilon, \delta)$ reconstructible by g(T') if $|f(T) - g(T')| < \max(\epsilon, \epsilon f(T))$ with probability greater than $(1 - \delta)$.
$\epsilon f(T) = 2$, $\delta = 0.1$ in the above example.

27
27 Theoretical Basis and Results
Theorem: The fraction, f, of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an $(n, \epsilon, \delta)$ estimator for f if $n > 4 \log(2/\delta)(p\epsilon)^{-2}$, by Chernoff bounds.
Theorem: The vector, x, obtained by matrix inversion is the MLE (maximum likelihood estimator), shown by using the Lagrangian multiplier method and showing that the Hessian is negative.
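To get a feel for the first theorem, the bound can be evaluated numerically. This sketch assumes the bound has the form n > 4 log(2/δ)(pε)^-2 with a natural logarithm; the function name `min_rows` and the sample parameter values are illustrative only.

```python
import math

def min_rows(p, eps, delta):
    """Smallest integer n with n > 4 * ln(2/delta) / (p * eps)**2 (natural log assumed)."""
    bound = 4 * math.log(2 / delta) / (p * eps) ** 2
    return math.floor(bound) + 1

# Illustration values (assumed): retention p=0.2, eps=0.1, delta=0.1.
n_needed = min_rows(p=0.2, eps=0.1, delta=0.1)
print(n_needed)
```

The bound grows as 1/p², so halving the retention probability roughly quadruples the number of rows needed for the same accuracy.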

28
28 Privacy Preserving OLAP Motivation Problem Definition Query Reconstruction Inversion method Iterative method Privacy Guarantees Experiments

29
29 Iterative Algorithm [AS00]
Initialize: $x^0 = y$
Iterate: $x_p^{T+1} = \sum_{q=0}^{t} y_q \dfrac{a_{pq}\, x_p^{T}}{\sum_{r=0}^{t} a_{rq}\, x_r^{T}}$ [By application of Bayes rule]
Stop condition: two consecutive x iterates do not differ much
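The update rule above can be sketched in a few lines of Python. This is an illustration, not the authors' code: x and y are taken to be fractions that sum to 1, the tolerance and iteration cap are assumed values, and the 2x2 matrix reuses the single-attribute example (p = 0.2, b = 0.2).

```python
def iterative_reconstruct(y, A, tol=1e-10, max_iter=10000):
    """Iterate x_p <- sum_q y_q * a_pq * x_p / (sum_r a_rq * x_r), starting from x = y."""
    n = len(y)
    x = list(y)                                   # initialize: x^0 = y
    for _ in range(max_iter):
        # denom[q] = probability of observing state q under the current estimate x
        denom = [sum(A[r][q] * x[r] for r in range(n)) for q in range(n)]
        new_x = [
            sum(y[q] * A[p][q] * x[p] / denom[q] for q in range(n) if denom[q] > 0)
            for p in range(n)
        ]
        # stop condition: two consecutive iterates do not differ much
        if max(abs(u - v) for u, v in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

A = [[0.84, 0.16],     # original not in range -> (not in range, in range)
     [0.64, 0.36]]     # original in range     -> (not in range, in range)
y = [0.78, 0.22]       # observed fractions: 22 of 100 perturbed rows in range
x = iterative_reconstruct(y, A)
```

Because every update multiplies x_p by a non-negative factor, the iterates stay non-negative, which is the constraint that plain matrix inversion does not enforce.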

30
30 Iterative Algorithm
We had proved, Theorem: The Inversion Algorithm gives the MLE.
Theorem [AA01]: The Iterative Algorithm gives the MLE with the additional constraint that $0 \le x_i$ for all $0 \le i \le 2^k - 1$.
This models the fact that the probabilities are non-negative.
Results are better, as shown in the experiments.

31
31 Privacy Preserving OLAP Motivation Problem Definition Query Reconstruction Privacy Guarantees Experiments

32
32 Privacy Guarantees
Say we initially know with probability < 0.3 that Alice's age > 25.
After seeing the perturbed value, we can say so with probability > 0.95.
Then we say there is a (0.3, 0.95) privacy breach.
More subtle differential privacy in the thesis.

33
33 Privacy Preserving OLAP Motivation Problem Definition Query Reconstruction Privacy Guarantees Experiments

34
34 Experiments
Real data: Census data from the UCI Machine Learning Repository.
Synthetic data: multiple columns of Zipfian data, with the number of rows varied from 1000 upward.
Error metric: L1 norm of the difference between x and y, i.e. the L1 norm between 2 probability distributions.
E.g. for 1-dim queries: |x1 – y1| + |x0 – y0|
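The error metric is simple enough to state in code. A minimal sketch; the function name `l1_error` is illustrative, and the sample vectors are taken from the single-attribute example rather than from the experiments.

```python
def l1_error(x, y):
    """L1 distance between two equal-length vectors of fractions."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))

# 1-dim query: |x1 - y1| + |x0 - y0|
error = l1_error([0.7, 0.3], [0.78, 0.22])
```

For two probability distributions this value can never exceed 2, which is reached when they put all their mass on disjoint outcomes.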

35
35 Inversion vs Iterative Reconstruction 2 attributes: Census Data 3 attributes: Census Data Iterative algorithm (MLE on constrained space) outperforms Inversion (global MLE)

36
36 The error in the iterative algorithm flattens out as its maximum value is bounded by 2 Error as a function of Number of Columns: Iterative Algorithm: Zipf Data

37
37 Error as a function of Number of Columns Inversion Algorithm Iterative Algorithm Error increases exponentially with increase in number of columns Census Data

38
38 Error as a function of number of Rows
Error decreases as the number of rows, n, increases.

39
39 Conclusion Possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained The techniques have been tested experimentally on real and synthetic data. More experiments in the paper. Privacy Preserving OLAP is Practical

40
40 RoadMap Motivation for Data Privacy Research Sanitizing Data for Privacy Privacy Preserving OLAP K-Anonymity/ Clustering for Anonymity Probabilistic Anonymity Masketeer Auditing for Privacy Distributed Architectures for Privacy

41
41 Anonymizing Tables: ICDT05 Creating tables that do not identify individuals for research or out-sourced software development purposes Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu

42
42 Probabilistic Anonymity: (submitted) Lodha, Thomas Achieving Anonymity via Clustering: PODS06 Aggarwal, Feder, Kenthapadi, Khuller, Panigrahy, Thomas, Zhu

43
43 Data Privacy Value disclosure: What is the value of attribute salary of person X Perturbation Privacy Preserving OLAP Identity disclosure: Whether an individual is present in the database table Randomization, K-Anonymity etc. Data for Outsourcing / Research

44
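44 Original Dataset
Identifying: SSN, Name. Sensitive: Disease.
SSN | Name | DOB | Gender | Zip code | Disease
614 | Sara | 03/04/76 | F | 94305 | Flu
615 | Joan | 07/11/80 | F | 94307 | Cold
629 | Karan | 05/09/55 | M | 94301 | Diabetes
710 | Harris | 11/23/62 | M | 94305 | Flu
840 | Carl | 11/23/62 | M | 94059 | Arthritis
780 | Amanda | 01/07/50 | F | 94042 | Heart problem
619 | Rob | 04/08/43 | M | 94042 | Arthritis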

45
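45 Randomized Dataset
Identifying: SSN, Name. Sensitive: Disease.
SSN | Name | DOB | Gender | Zip code | Disease
101 | Amy | 03/04/76 | F | 94305 | Flu
102 | Betty | 07/11/80 | F | 94307 | Cold
103 | Clarke | 05/09/55 | M | 94301 | Diabetes
104 | David | 11/23/62 | M | 94305 | Flu
105 | Earl | 11/23/62 | M | 94059 | Arthritis
106 | Finy | 01/07/50 | F | 94042 | Heart problem
107 | George | 04/08/43 | M | 94042 | Arthritis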

46
46 Quasi-Identifiers Uniquely identify you!
Quasi-identifiers: DOB, Gender, Zip code. Sensitive: Disease.
DOB | Gender | Zip code | Disease
03/04/76 | F | 94305 | Flu
07/11/80 | F | 94307 | Cold
05/09/55 | M | 94301 | Diabetes
12/30/72 | M | 94305 | Flu
11/23/62 | M | 94059 | Arthritis
01/07/50 | F | 94042 | Heart problem
04/08/43 | M | 94042 | Arthritis
Quasi-identifiers: approximate foreign keys

47
47 k-Anonymity Model [Swe00]
Modify some entries of the quasi-identifiers so that each modified row becomes identical to at least k-1 other rows with respect to the quasi-identifiers.
Individual records are hidden in a crowd of size k.
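The k-anonymity condition can be checked mechanically by counting how often each quasi-identifier combination occurs. A minimal sketch, assuming rows are dictionaries; the function name `is_k_anonymous` and the small generalized table are illustrative only.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical generalized table: three rows share one age range, two share another.
rows = [
    {"Age": "[25-29]", "Disease": "Flu"},
    {"Age": "[25-29]", "Disease": "Cold"},
    {"Age": "[25-29]", "Disease": "Flu"},
    {"Age": "[35-39]", "Disease": "Diabetes"},
    {"Age": "[35-39]", "Disease": "Arthritis"},
]
two_anonymous = is_k_anonymous(rows, ["Age"], k=2)
three_anonymous = is_k_anonymous(rows, ["Age"], k=3)
```

Only the quasi-identifier columns are compared; the sensitive column is free to differ within a group.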

48
48 Original Table
Name | Age | Salary
Amy | 25 | 50
Brian | 27 | 60
Carol | 29 | 100
David | 35 | 110
Evelyn | 39 | 120

49
49 Suppressing all entries: No Utility
Name | Age | Salary
Amy | * | *
Brian | * | *
Carol | * | *
David | * | *
Evelyn | * | *

50
50 2-Anonymity with Clustering
Name | Age | Salary
Amy | [25-29] | [50-100]
Brian | [25-29] | [50-100]
Carol | [25-29] | [50-100]
David | [35-39] | [110-120]
Evelyn | [35-39] | [110-120]
Cluster centers published:
27 = (25+27+29)/3, 70 = (50+60+100)/3
37 = (35+39)/2, 115 = (110+120)/2
Clustering formulation: NP Hard
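Given a partition into clusters, the published ranges and cluster centers follow directly from the original table. The sketch below takes the partition as given (finding a good partition is the NP-hard part); `summarize_cluster` is a hypothetical helper name.

```python
def summarize_cluster(rows):
    """Return the generalized ranges and the center of one cluster of (name, age, salary) rows."""
    ages = [age for _, age, _ in rows]
    salaries = [salary for _, _, salary in rows]
    return {
        "age_range": (min(ages), max(ages)),
        "salary_range": (min(salaries), max(salaries)),
        "center": (sum(ages) / len(ages), sum(salaries) / len(salaries)),
    }

table = [
    ("Amy", 25, 50),
    ("Brian", 27, 60),
    ("Carol", 29, 100),
    ("David", 35, 110),
    ("Evelyn", 39, 120),
]
clusters = [table[:3], table[3:]]          # the 2-anonymous partition from the slide
summaries = [summarize_cluster(c) for c in clusters]
```

Every cluster has at least two members, so publishing only the ranges or centers hides each person among at least one other.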

51
51 First cut: ICDT05 Combinatorial Algorithms Nearest-neighbour edge Other edges

52
52 K-Anonymous partitions

53
53 Second cut: PODS06 Metric space clustering
Combinatorial algorithm on graphs: O(k) approximation, Ω(k) lower bound
Assume points in a metric space:
- Factor 2 approximation for maximum cluster size metric
- Matching lower bound of 2
- Constant factor approximations for median and cellular metrics too

54
54 Clustering Metrics 10 points, radius 5 20 points, radius points, radius 15

55
55 r-center Clustering: Minimize Maximum Cluster Size 2d

56
56 Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic}\, d_c + f_c\, y_c \right)$ (sum of cellular cost and facility cost)
Subject to:
$\sum_c x_{ic} \ge 1$ (each point belongs to a cluster)
$x_{ic} \le y_c$ (a cluster must be opened for a point to belong to it)
$0 \le x_{ic} \le 1$ (points belong to clusters positively)
$0 \le y_c \le 1$ (clusters are opened positively)

57
57 Final Cut: Probabilistic Anonymity Formalism and Practical technique to identify a Quasi-Identifier What is the ideal amount of generalization/ suppression to apply to the different columns Techniques to make your published database table conform to laws like HIPAA

58
58 Quasi-identifier
Fruit column: Apple, Guava, Orange, Apple, Banana
0.6 fraction uniquely identified by Fruit. Hence Fruit is a 0.6 quasi-identifier.
0.87 fraction of the U.S. population is uniquely identified by (DOB, Gender, Zipcode), hence it is a 0.87 quasi-identifier.
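The fraction in the fruit example can be computed by counting which values occur exactly once. A minimal sketch; `unique_fraction` is an illustrative name, and rows are assumed to be dictionaries.

```python
from collections import Counter

def unique_fraction(rows, columns):
    """Fraction of rows whose combination of values in `columns` occurs exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

fruits = [{"Fruit": f} for f in ["Apple", "Guava", "Orange", "Apple", "Banana"]]
print(unique_fraction(fruits, ["Fruit"]))   # → 0.6
```

Guava, Orange and Banana each appear once, so three of the five rows are uniquely identified.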

59
59 Quasi-Identifier
Find the probability distribution over D distinct values that maximizes the expected fraction of uniquely identified records.
D distinct values, n rows:
If $D \le n$: $D/(en)$ (skewed distribution)
Else: $e^{-n/D}$ (uniform distribution)

60
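60 Distinct values - Identifier
DOB: $60 \times 365 \approx 2 \times 10^4$
Gender: 2
Zipcode: $10^5$
(DOB, Gender, Zipcode) together has $2 \times 10^4 \times 2 \times 10^5 = 4 \times 10^9$ distinct values
US population $= 3 \times 10^8$
Fraction of singletons $= e^{-3 \times 10^8 / (4 \times 10^9)} = 0.92$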

61
61 Distinct values and K-anonymity
E.g. apply HIPAA to (Age in Years, Zipcode, Gender, Doctor details).
Want $k = 20{,}000 = 2 \times 10^4$ anonymity with $n = 300$ million $= 3 \times 10^8$ people.
The number of distinct values is $D = n/k = 1.5 \times 10^4$.
D = distinct values = z (zipcode) × 100 (age in years) × 2 (gender) = 200z
$1.5 \times 10^4 = 200z$, so $z = 10^2$ approximately.
Retain first two digits of zipcode (retain states).
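The back-of-the-envelope numbers on the last two slides can be reproduced with a few lines. This is only a sketch of the arithmetic; the function names are illustrative, and the rounded inputs are the ones used on the slides.

```python
import math

def singleton_fraction_uniform(n, distinct):
    """Expected fraction of uniquely identified rows, e^(-n/D), for the uniform case D > n."""
    return math.exp(-n / distinct)

def zip_values_allowed(n, k, other_distinct):
    """Distinct zipcode values z such that z * other_distinct = n / k."""
    return (n / k) / other_distinct

distinct = 2e4 * 2 * 1e5                       # DOB * gender * zipcode = 4e9
fraction = singleton_fraction_uniform(3e8, distinct)
z = zip_values_allowed(n=3e8, k=2e4, other_distinct=100 * 2)
```

The exact value of z is 75, which the slide rounds to the nearest power of ten, 10^2, so that the answer maps onto keeping the first two digits of the zipcode.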

62
62 Anonymity Reduce number of distinct values Many more matches when you join with a foreign table Cluster close by values to a single value Use quantiles as placeholders Round all elements to closest quantile

63
63 1 dimensional Anonymity

64
64 Ex 20-Anonymity: Quantiles
200 numbers. Want 20-anonymity, i.e. 10 distinct values.
Sort them.
Round the first 20 to the 10th element, the next 20 to the 30th, and so on; the last 20 to the 190th.
10 different groups, each with 20 elements.
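The rounding step above can be sketched as follows for a single numeric column. This is an illustration of the exact (sort-based) variant only; the function name `quantile_anonymize` is hypothetical, and the demo data is simply the numbers 0 to 199 in shuffled order.

```python
import random

def quantile_anonymize(values, k):
    """Sort, cut into consecutive groups of k, and round each group to its middle element."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = list(values)
    for start in range(0, len(order), k):
        group = order[start:start + k]             # the last group may be smaller than k
        middle = group[(len(group) - 1) // 2]      # e.g. the 10th of 20 sorted elements
        for i in group:
            out[i] = values[middle]
    return out

rng = random.Random(1)
values = rng.sample(range(200), 200)               # 200 distinct numbers in random order
anonymized = quantile_anonymize(values, k=20)
```

With 200 distinct inputs and k = 20 this yields 10 placeholder values, each shared by exactly 20 rows; a production version would need to merge a short final group into its neighbour.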

65
65 Quantiles. Quantiles minimize the distortion introduced as a result of this anonymization Efficient Quantile finding algorithms using a sample

66
66 Using a Sample
Exact quantiles: require sorting the entire dataset.
Instead, create a random sample of size 20.
Use the 2nd, 4th, ..., 20th elements as 10 placeholders.
Round each element to the closest placeholder.
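A sketch of the sample-based variant, under the assumption that the placeholders are taken from the sorted sample: the names `sample_placeholders` and `round_to_closest` are illustrative, and the seed is there only to make the run repeatable.

```python
import bisect
import random

def sample_placeholders(values, sample_size, step, rng):
    """Draw a random sample, sort it, and keep every step-th element (2nd, 4th, ... for step=2)."""
    sample = sorted(rng.sample(values, sample_size))
    return sample[step - 1::step]

def round_to_closest(value, placeholders):
    """Round value to the nearest entry of the sorted placeholder list."""
    i = bisect.bisect_left(placeholders, value)
    candidates = placeholders[max(i - 1, 0):i + 1]   # neighbours on either side
    return min(candidates, key=lambda ph: abs(ph - value))

rng = random.Random(0)
values = list(range(200))
placeholders = sample_placeholders(values, sample_size=20, step=2, rng=rng)
rounded = [round_to_closest(v, placeholders) for v in values]
```

Only the small sample is sorted, so the full dataset is touched once, at the price that group sizes are only approximately equal.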

67
67 Experiments
Efficient algorithms based on randomized algorithms to find quantiles in small space.
10 seconds to anonymize a quarter million rows, or approximately 3 GB per hour, on a machine with a 2.66 GHz processor, 504 MB RAM, Windows XP with Service Pack 2.
An order of magnitude better in running time for a quasi-identifier of size 10 than the previous implementation.
Optimal algorithms to anonymize the dataset.
Scalable:
Almost independent of anonymity parameter k
Linear in quasi-identifier size (previously exponential)
Linear in dataset size

68
68 Time Required scales with database size

69
69 Time required scales with number of columns

70
70 Time vs # Buckets

71
71 Error vs # Buckets

72
72 Masketeer: A tool for data privacy Das, Lodha, Patwardhan, Sundaram, Thomas.

73
73 RoadMap Motivation for Data Privacy Research Sanitizing Data for Privacy Privacy Preserving OLAP K-Anonymity/ Clustering for Anonymity Probabilistic Anonymity Masketeer Auditing for Privacy Distributed Architectures for Privacy

74
74 Auditing Batches of SQL Queries Motwani, Nabar, Thomas PDM Workshop with ICDE 2007 Given a set of SQL queries that have been posed over a database, determine whether some subset of these queries have revealed private information about an individual or a group of individuals

75
75 Database Query Auditing Auditing Aggregate (Sum, Max, Median) queries Perfect Privacy Auditing SQL Queries Auditing a Batch of SQL Queries

76
76 Example
Query: SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'
Audit expression: AUDIT zipcode FROM Patients p WHERE p.disease = 'high blood pressure'
Not suspicious wrt this audit expression
Audit expression: AUDIT disease FROM Patients p WHERE p.zipcode =
Suspicious if someone in that zipcode has diabetes

77
77 Query Suspicious wrt an Audit Expression
A query is suspicious with respect to an audit expression:
If all columns of the audit expression are covered by the query, and
If the audit expression and the query have at least one tuple in common
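The two conditions can be sketched over an in-memory table. This is a simplified illustration, not the auditing algorithm from the paper: it assumes that a query "covers" every column it selects or filters on, and the patient rows and zipcode values (`Z1`, `Z2`) are made up.

```python
def is_suspicious(table, query_columns, query_pred, audit_columns, audit_pred):
    """Apply the two conditions: column coverage and at least one tuple in common."""
    covers = set(audit_columns) <= set(query_columns)
    shares_tuple = any(query_pred(row) and audit_pred(row) for row in table)
    return covers and shares_tuple

# Hypothetical data for illustration only.
patients = [
    {"name": "A", "zipcode": "Z1", "disease": "diabetes"},
    {"name": "B", "zipcode": "Z2", "disease": "high blood pressure"},
]

# Query: SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'
query_columns = ["zipcode", "disease"]        # assumption: predicate columns count as covered
query_pred = lambda row: row["disease"] == "diabetes"

audit_bp = is_suspicious(patients, query_columns, query_pred,
                         ["zipcode"], lambda row: row["disease"] == "high blood pressure")
audit_z1 = is_suspicious(patients, query_columns, query_pred,
                         ["disease"], lambda row: row["zipcode"] == "Z1")
audit_z2 = is_suspicious(patients, query_columns, query_pred,
                         ["disease"], lambda row: row["zipcode"] == "Z2")
```

Here the blood-pressure audit shares no tuple with the query, while the zipcode audit is suspicious only for a zipcode that actually contains a diabetes patient.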

78
78 SQL Auditing
Batch of SQL queries, each of the form:
Project col1, col2, col3, ..., colk From R Where C1 and C2 and C3 and ... Cj
Each Ci is one of: (colm = value), (colm <= value), (colm >= value), (value1 <= colm <= value2)
col1, col2, ..., colk includes the primary key, so that the result of a query can be joined with other results.

79
79 Semantically Suspicious
A query batch Q1, Q2, ..., Qn is said to be suspicious with respect to an audit expression A if an expression combining the results of these queries as base tables is suspicious with respect to A.
Natural extension of a suspicious query to a query batch.

80
80 Syntactically Suspicious A query batch is said to be syntactically suspicious with respect to an audit expression A if there exists an instantiation of the database tables for which it is suspicious wrt A Does not require execution of the queries against the table

81
81 SQL Batch Auditing
Figure: Queries 1 to 4 and the audit expression over a table; the columns of an audited tuple are covered.
A query batch is suspicious wrt an audit expression iff the queries together cover all audited columns of at least one audited tuple: semantically, on table T; syntactically, on some table T.

82
82 Syntactic and Semantic Auditing Checking for semantic suspiciousness has polynomial time algorithm Checking for syntactic suspiciousness is NP complete

83
83 RoadMap Motivation for Data Privacy Research Sanitizing Data for Privacy Privacy Preserving OLAP K-Anonymity/ Clustering for Anonymity Probabilistic Anonymity Masketeer Auditing for Privacy Distributed Architectures for Privacy

84
84 Two Can Keep a Secret: A Distributed Architecture for Secure Database Services Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu CIDR 2005 How to distribute data across multiple sites for (1)redundancy and (2) privacy so that a single site being compromised does not lead to data loss

85
85 Distributing data and Partitioning and Integrating Queries for Secure Distributed Databases Feder, Ganapathy, Garcia-Molina, Motwani, Thomas Work in Progress

86
86 Motivation
Data outsourcing growing in popularity: cheap, reliable data storage and management
1TB $399 (< $0.5 per GB); $5000 for Oracle 10g / SQL Server; $68k/year DB admin
Privacy concerns looming ever larger: high-profile thefts (often insiders)
UCLA lost 900k records; Berkeley lost a laptop with sensitive information; Acxiom, JP Morgan, Choicepoint

87
87 Present solutions
Application level: Salesforce.com On-Demand Customer Relationship Management, $65/User/Month or $995 / 5 Users / 1 Year
Amazon Elastic Compute Cloud: 1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth. Elastic, completely controlled, reliable, secure. $0.10 per instance hour, $0.20 per GB of data in/out of Amazon, $0.15 per GB-month of Amazon S3 storage used
Google Apps for your domain: small businesses, enterprise, school, family or group

88
88 Encryption Based Solution
Figure: the client encrypts its data and stores it at the DSP; a client-side processor turns query Q into Q', receives the "relevant data" and computes the answer.
Problem: Q' ≈ "SELECT *"

89
89 The Power of Two Client DSP1 DSP2

90
90 The Power of Two
Figure: the client-side processor splits query Q into Q1 for DSP1 and Q2 for DSP2.
Key: Ensure Cost(Q1) + Cost(Q2) ≈ Cost(Q)

91
91 SB1386 Privacy { Name, SSN}, { Name, LicenceNo} { Name, CaliforniaID} { Name, AccountNumber} { Name, CreditCardNo, SecurityCode} are all to be kept private. A set is private if at least one of its elements is “hidden”. Element in encrypted form ok

92
92 Techniques
Vertical Fragmentation: partition attributes across R1 and R2. E.g., to obey constraint {Name, SSN}: R1 ← Name, R2 ← SSN. Use tuple IDs for reassembly: R = R1 JOIN R2.
Encoding:
One-time Pad: for each value v, construct a random bit sequence r. R1 ← v XOR r, R2 ← r.
Deterministic Encryption: R1 ← E_K(v), R2 ← K. Can detect equality and push selections with equality predicates.
Random addition: R1 ← v + r, R2 ← r. Can push aggregate SUM.
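Two of the encodings above, the one-time pad and random addition, can be sketched with Python integers. This is a toy illustration under stated assumptions: values are non-negative integers below 2^32, the function names are hypothetical, and a real system would likely add modulo a fixed range rather than over unbounded integers.

```python
import secrets

def one_time_pad_split(v, bits=32):
    """Split v into two shares: R1 stores v XOR r, R2 stores the random pad r."""
    r = secrets.randbits(bits)
    return v ^ r, r

def one_time_pad_join(share1, share2):
    return share1 ^ share2

def additive_split(v, bits=32):
    """Split v into two shares: R1 stores v + r, R2 stores r."""
    r = secrets.randbits(bits)
    return v + r, r

# Pushing SUM to both sites: each site sums its own column, the client subtracts.
salaries = [50, 60, 100]
shares = [additive_split(s) for s in salaries]
sum_at_r1 = sum(s1 for s1, _ in shares)
sum_at_r2 = sum(s2 for _, s2 in shares)
total = sum_at_r1 - sum_at_r2

recovered = one_time_pad_join(*one_time_pad_split(12345))
```

Neither site sees a salary on its own, yet the client recovers the exact SUM from two scalars instead of fetching every row.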

93
93 Example An Employee relation: {Name, DoB, Position, Salary, Gender, , Telephone, ZipCode} Privacy Constraints {Telephone}, { } {Name, Salary}, {Name, Position}, {Name, DoB} {DoB, Gender, ZipCode} {Position, Salary}, {Salary, DoB} Will use just Vertical Fragmentation and Encoding.

94
94 Example (2) {Telephone}{ } {Name, Salary} {Name, Position} {Name, DoB} {DoB, Gender,ZipCode} {Position, Salary} {Salary, DoB} Constraints NameDoB Position Salary Gender TelephoneZipCode R1 R2 Telephone Salary ID

95
95 Partitioning, Execution Partitioning Problem Partition to minimize communication cost for given workload Even simplified version hard to approximate Hill Climbing algorithm after starting with weighted set cover Query Reformulation and Execution Consider only centralized plans Algorithm to partition select and where clause predicates between the two partitions

96
96 Data Privacy Tools worked on Privacy Preserving OLAP Distributed Architecture Data Masking

97
97 Papers
[AST05] Privacy Preserving OLAP. SIGMOD 2005
[AFK+05] Approximation Algorithms for K-Anonymity. ICDT 2005
[AFK+06] Clustering for Anonymity. PODS 2006
[LT07] Probabilistic Anonymity. Submitted

98
98 Other Privacy Papers
[SPG05] Two Can Keep a Secret: A Distributed Architecture for Secure Database Services. CIDR 2005
[SPG04] Enabling Privacy for the Paranoids. VLDB 2004
[DLP+07] Masketeer: A Tool for Data Privacy.
[MNT07] Auditing SQL Queries. PDM Workshop with ICDE 2007

99
99 Thank You!

100
100 Acknowledgements: Stanford Faculty Advisor: Rajeev Motwani Members of Orals Committee: Rajeev Motwani, Hector Garcia-Molina, Dan Boneh, John Mitchell, Ashish Goel Many other professors at Stanford, esp. Jennifer Widom

101
101 Acknowledgements: Projects STREAM: Jennifer Widom, Rajeev Motwani PORTIA: Hector Garcia-Molina, Rajeev Motwani, Dan Boneh, John Mitchell TRUST: Dan Boneh, John Mitchell, Rajeev Motwani, Hector Garcia-Molina RAIN: Rajeev Motwani, Ashish Goel, Amin Saberi

102
102 Acknowledgements: Internship Mentors Rakesh Agrawal, Ramakrishnan Srikant, Surajit Chaudhuri, Nicolas Bruno, Phil Gibbons, Sachin Lodha, Anand Rajaraman

103
103 Acknowledgements: CoAuthors [A-K] Gagan Aggarwal, Rakesh Agrawal, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayank Bawa, Nicolas Bruno, Renato Carmo, Surajit Chaudhuri, Mayur Datar, Prasenjit Das, A A Diwan, Tomás Feder, Vignesh Ganapathy, Prasanna Ganesan, Hector Garcia-Molina, Keith Ito, Krishnaram Kenthapadi, Samir Khuller, Yoshiharu Kohayakawa

104
104 Acknowledgements: CoAuthors [L-Z] Eduardo Sany Laber, Sachin Lodha, Nina Mishra, Rajeev Motwani, Shubha Nabar, Itaru Nishizawa, Liadan Boyen, Rina Panigrahy, Nikhil Patwardhan, Ramakrishnan Srikant, Utkarsh Srivastava, S. Sudarshan, Sharada Sundaram, Rohit Varma, Jennifer Widom, Ying Xu, An Zhu

105
105 Acknowledgements: Others not in previous list Aristides, Gurmeet, Aleksandra, Sergei, Damon, Anupam, Arnab, Aaron, Adam, Mukund, Vivek, Anish, Parag, Vijay, Piotr, Moses, Sudipto, Bob, David, Paul, Zoltan etc. Members of Rajeev’s group, Stanford Theory, Database, Security groups, Also many PhD students of the incoming year Paul etc. and many other students at Stanford Lynda, Maggie, Wendy, Jam, Kathi, Claire, Meredith for administrative help Andy, Miles, Lilian for keeping the machines running! Various outing clubs and groups at Stanford, Catholic community here, SIA, RAINS groups, Ivgrad, DB Movie and Social Committee

106
106 Acknowledgements: More! Jojy Michael, Joshua Easow and families. Roommates: Omkar Deshpande, Alex Joseph, Mayur Naik, Rajiv Agrawal, Utkarsh Srivastava, Rajat Raina, Jim Cybluski, Blake Blailey. Batchmates and professors from IITs. Friends and relatives, grandparents, sister Dina, and parents.

107
107 Data Streams
Traditional DBMS: data stored in finite, persistent data sets
New applications: data input as continuous, ordered data streams
Network and traffic monitoring; Telecom call records; Network security; Financial applications; Sensor networks; Web logs and clickstreams; Massive data sets

108
108 Scheduling Algorithms for Data Streams Minimizing the overhead over the disk system. Motwani, Thomas. SODA 2004 Operator Scheduling in Data Stream Systems – Minimizing memory consumption and latency. Babu, Babcock, Datar, Motwani, Thomas. VLDB Journal 2004 Stanford STREAM Data Manager. Stanford Stream Group. IEEE Bulletin 2003
