1 Architectures and Algorithms for Data Privacy Dilys Thomas Stanford University, April 30th, 2007 Advisor: Rajeev Motwani

2 RoadMap
- Motivation for Data Privacy Research
- Sanitizing Data for Privacy: Privacy Preserving OLAP; K-Anonymity / Clustering for Anonymity; Probabilistic Anonymity; Masketeer
- Auditing for Privacy
- Distributed Architectures for Privacy

3 Motivation 1: Data Privacy in Enterprises
- Health: personal medical details, disease history, clinical research data
- Banking: bank statement, loan details, transaction history
- Finance: portfolio information, credit history, transaction records, investment details
- Insurance: claims records, accident history, policy details
- Outsourcing: customer data for testing, remote DB administration, BPO & KPO
- Retail Business: inventory records, individual credit card details, audits
- Manufacturing: process details, blueprints, production data
- Govt. Agencies: census records, economic surveys, hospital records

4 Motivation 2: Government Regulations
Country          Privacy Legislation
Australia        Privacy Amendment Act of 2000
European Union   Personal Data Protection Directive 1998
Hong Kong        Personal Data (Privacy) Ordinance of 1995
United Kingdom   Data Protection Act of 1998
United States    Security Breach Information Act (S.B. 1386) of 2002
                 Gramm-Leach-Bliley Act of 1999
                 Health Insurance Portability and Accountability Act of 1996

5 Motivation 3: Personal Information  Emails  Searches on Google/Yahoo  Profiles on Social Networking sites  Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations  Documents on the Computer / Network

6 Losses due to Lack of Privacy: ID-Theft 3% of households in the US affected by ID-Theft US $5-50B losses/year UK £1.7B losses/year AUS $1-4B losses/year

8 Privacy Preserving Data Analysis i.e. Online Analytical Processing OLAP Computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source Agrawal, Srikant, Thomas SIGMOD 2005

9 Privacy Preserving OLAP  Motivation  Problem Definition  Query Reconstruction Inversion method Single attribute Multiple attributes Iterative method  Privacy Guarantees  Experiments

10 Horizontally Partitioned Personal Information [Diagram: each client C_i holds an original row r_i and sends a perturbed row p_i; the server collects p_1, p_2, ..., p_n into table T for analysis.] EXAMPLE: What number of children in this county go to college?

11 Vertically Partitioned Enterprise Information
Original Relation D1 (ID, C1): John 1; Alice 5; Bob 18
Perturbed Relation D'1 (ID, C1): John 1; Alice 7; Bob 18
Original Relation D2 (ID, C2 C3): John 279; Alice 536
Perturbed Relation D'2 (ID, C2 C3): John 359; Alice 537
Perturbed Joined Relation D' (ID, C1 C2 C3): John 1359; Alice 7537
EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

12 Privacy Preserving OLAP: Problem Definition
Compute: select count(*) from T where P1 and P2 and P3 and ... Pk
E.g., find the number of people with age in [30-50] and salary in [80-150], i.e. COUNT_T(P1 and P2 and P3 and ... Pk).
Goals: provide error bounds to the analyst; provide privacy guarantees to the data sources; scale to a larger number of attributes.

13 Perturbation Example: Uniform Retention Replacement. Throw a biased coin (BIAS=0.2). Heads: retain the value. Tails: replace it with a number drawn uniformly at random from [1-5] (in general, from a predefined pdf). [Figure: example column values 1 1 3 4 2 5 4 3 1 3 with coin outcomes Tails, Heads, Tails]

14 Retention Replacement Perturbation  Done for each column  The replacing pdf need not be uniform Best to use original pdf if available/ estimable  Different columns can have different biases for retention
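
The retention replacement scheme on the last two slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `retention_replacement`, its parameters, and the choice of a uniform integer replacing distribution are assumptions made for the example.

```python
import random


def retention_replacement(values, p, low, high, rng=None):
    """Perturb one column (hypothetical helper).

    Each value is retained with probability p (heads); otherwise it is
    replaced by a uniform random integer from [low, high] (tails).
    """
    rng = rng or random.Random()
    perturbed = []
    for v in values:
        if rng.random() < p:      # heads: retain the original value
            perturbed.append(v)
        else:                     # tails: replace from the replacing pdf
            perturbed.append(rng.randint(low, high))
    return perturbed


column = [1, 3, 4, 2, 5]
print(retention_replacement(column, p=0.2, low=1, high=5, rng=random.Random(0)))
```

Different columns could call the same helper with different values of `p`, and a non-uniform replacing distribution would only change the `rng.randint` line.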

15 Single Attribute Example What is the fraction of people in this building with age 30-50?
- Assume age is between 0 and 100.
- Whenever a person enters the building, they flip a coin with heads probability p=0.2. Heads: report the true age (RETAIN). Tails: report a random number uniform in 0-100 (PERTURB).
- In total, 100 randomized numbers are collected.
- Of these, 22 are in 30-50.
- How many among the original are in 30-50?

16 Privacy Preserving OLAP  Motivation  Problem Definition  Query Reconstruction Inversion method Single attribute Multiple attributes Iterative method  Privacy Guarantees  Experiments

17 Analysis Out of 100 rows: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction).

18 Analysis Contd. 20% of the 80 perturbed rows, i.e. 16 of them, satisfy Age[30-50]. The remaining 64 do not. Breakdown: 64 perturbed, NOT Age[30-50]; 16 perturbed, Age[30-50]; 20 retained.

19 Analysis Contd. There were 22 randomized rows in [30-50], so 22-16=6 of them come from the 20 retained rows. Breakdown: 16 perturbed, Age[30-50]; 64 perturbed, NOT Age[30-50]; 6 retained, Age[30-50]; 14 retained, NOT Age[30-50].

20 Scaling up
Total Rows   Age[30-50]
20           6
100          ? = 30
Thus 30 people had age 30-50 in expectation.
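
The arithmetic of the single-attribute example can be replayed as a short calculation. The sketch below is only an illustration; the function name and argument names are made up, and `b` stands for the probability that a replaced value falls in the queried range (0.2 here, because [30-50] is 20% of the uniform range 0-100).

```python
def estimate_original_count(n, observed, p, b):
    """Estimate how many of n original rows satisfy the predicate.

    n        -- number of perturbed rows collected
    observed -- perturbed rows that satisfy the predicate
    p        -- retention probability
    b        -- probability that a replaced value satisfies the predicate
    """
    expected_from_replaced = (1 - p) * n * b          # 80 * 0.2 = 16 in the example
    retained_matching = observed - expected_from_replaced   # 22 - 16 = 6
    return retained_matching / p                       # scale 6 of 20 up to 100 rows


print(estimate_original_count(n=100, observed=22, p=0.2, b=0.2))
```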

21 Multiple Attributes (k=2) P1 = Age[30-50], P2 = Salary[80-150]
Query               Estimated on T   Evaluated on T'
count(¬P1 ∧ ¬P2)    x0               y0
count(¬P1 ∧ P2)     x1               y1
count(P1 ∧ ¬P2)     x2               y2
count(P1 ∧ P2)      x3               y3

22 Architecture

23 Formally: Select count(*) from R where Pred
p = retention probability (0.2 in example)
1-p = probability that an element is replaced by the replacing p.d.f.
b = probability that an element from the replacing p.d.f. satisfies predicate Pred (0.2 in example, since [30-50] covers 20% of the uniform range 0-100)
a = 1-b

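24 Transition matrix
\[
\begin{pmatrix} \mathrm{Count}_T(\neg P) & \mathrm{Count}_T(P) \end{pmatrix}
\begin{pmatrix} (1-p)a + p & (1-p)b \\ (1-p)a & (1-p)b + p \end{pmatrix}
=
\begin{pmatrix} \mathrm{Count}_{T'}(\neg P) & \mathrm{Count}_{T'}(P) \end{pmatrix}
\]
i.e. solve $xA = y$.
$A_{00}$ = probability that an original element satisfies $\neg P$ and still satisfies $\neg P$ after perturbation.
$p$ = probability it was retained; $(1-p)a$ = probability it was perturbed and satisfies $\neg P$.
$A_{00} = (1-p)a + p$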
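
For a single attribute the system xA = y is 2x2 and can be inverted by hand. The following sketch builds the transition matrix from p and b and solves for x; the helper names are hypothetical, and the observed vector (78 rows outside the range, 22 inside) is taken from the building example.

```python
def transition_matrix(p, b):
    """2x2 transition matrix; index 0 = predicate false, 1 = predicate true."""
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]


def solve_row_vector_2x2(A, y):
    """Solve x A = y for the row vector x using the explicit 2x2 inverse."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det, A[0][0] / det]]
    return [y[0] * inv[0][0] + y[1] * inv[1][0],
            y[0] * inv[0][1] + y[1] * inv[1][1]]


A = transition_matrix(p=0.2, b=0.2)
x = solve_row_vector_2x2(A, [78, 22])   # counts observed on the perturbed table
print(x)                                 # estimated counts on the original table
```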

25 Multiple Attributes For k attributes:
- $x$, $y$ are vectors of size $2^k$
- $x = y A^{-1}$, where $A = A_1 \otimes A_2 \otimes \cdots \otimes A_k$ [Tensor Product]
- $A_i$ is the transition matrix for column $i$
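
A small pure-Python sketch of the multi-attribute case follows. It is an illustration under assumptions: the helper names are invented, the linear system is solved by plain Gaussian elimination rather than any method named in the talk, and the per-column parameters (p, b) are arbitrary example values.

```python
def transition_matrix(p, b):
    """Per-column 2x2 transition matrix (index 0 = false, 1 = true)."""
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]


def kron(A, B):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def solve_left(A, y):
    """Solve x A = y by Gauss-Jordan elimination on the transposed system."""
    n = len(A)
    # Row c of M encodes sum_r x_r * A[r][c] = y[c].
    M = [[A[r][c] for r in range(n)] + [y[c]] for c in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Two columns with different retention probabilities (example values).
A = kron(transition_matrix(0.2, 0.2), transition_matrix(0.5, 0.3))
y = [40.0, 25.0, 20.0, 15.0]   # hypothetical counts observed on T'
x = solve_left(A, y)            # estimated counts on T, same index order
print(x)
```

With this ordering, index 0 is (not P1, not P2) and index 3 is (P1, P2), matching the k=2 table of queries.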

26 Error Bounds
- In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9.
- Given $T \to T'$, with $n$ rows, $f(T)$ is $(n, \epsilon, \delta)$ reconstructible by $g(T')$ if $|f(T) - g(T')| < \max(\epsilon, \epsilon f(T))$ with probability greater than $(1-\delta)$. $\epsilon f(T) = 2$, $\delta = 0.1$ in the above example.

27 Theoretical Basis and Results
Theorem: The fraction $f$ of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an $(n, \epsilon, \delta)$ estimator for $f$ if $n > 4 \log(2/\delta)(p\epsilon)^{-2}$, by Chernoff bounds.
Theorem: The vector $x$ obtained by matrix inversion is the MLE (maximum likelihood estimator), shown using the Lagrangian multiplier method and by showing that the Hessian is negative.
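
To get a feel for the sample-size condition, read here as n > 4 log(2/δ) (pε)^-2, the bound can be evaluated numerically. The sketch assumes the logarithm is natural, which the slide does not state; the function name and the example parameters are illustrative only.

```python
import math


def min_rows(p, eps, delta):
    """Evaluate 4 * ln(2/delta) / (p * eps)**2 (natural log is an assumption)."""
    return 4 * math.log(2 / delta) / (p * eps) ** 2


# Retention probability 0.2, relative error 0.1, failure probability 0.1.
print(min_rows(p=0.2, eps=0.1, delta=0.1))
```

The bound grows quadratically as either the retention probability or the error tolerance shrinks, which is the trade-off between privacy and accuracy in this scheme.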

28 Privacy Preserving OLAP  Motivation  Problem Definition  Query Reconstruction Inversion method Iterative method  Privacy Guarantees  Experiments

29 Iterative Algorithm [AS00]
Initialize: $x^0 = y$
Iterate: $x_p^{T+1} = \sum_{q=0}^{t} y_q \, \dfrac{a_{pq} x_p^T}{\sum_{r=0}^{t} a_{rq} x_r^T}$ [by application of Bayes' rule]
Stop condition: two consecutive $x$ iterates do not differ much.
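
The update rule can be sketched directly in Python. This is a minimal illustration, not the implementation from [AS00]: the tolerance `tol`, the iteration cap, and the use of an L1 difference as the stop test are assumptions, and the example matrix is the single-attribute one with p=0.2 and b=0.2.

```python
def iterative_reconstruction(A, y, tol=1e-9, max_iters=10000):
    """Bayes-rule iteration: start from x = y and repeat the update until
    two consecutive iterates differ by less than tol in L1 norm."""
    n = len(y)
    x = list(y)
    for _ in range(max_iters):
        # denom[q] = sum_r a_rq * x_r, the predicted mass in perturbed cell q
        denom = [sum(A[r][q] * x[r] for r in range(n)) for q in range(n)]
        new_x = [sum(y[q] * A[p][q] * x[p] / denom[q]
                     for q in range(n) if denom[q] > 0)
                 for p in range(n)]
        if sum(abs(u - v) for u, v in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x


A = [[0.84, 0.16],
     [0.64, 0.36]]            # transition matrix for p = 0.2, b = 0.2
y = [0.78, 0.22]              # observed fractions on the perturbed table
print(iterative_reconstruction(A, y))
```

Because every update multiplies the current estimate by a non-negative factor, the iterates stay non-negative, unlike a plain matrix inversion.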

30 Iterative Algorithm We had proved:
- Theorem: The Inversion Algorithm gives the MLE.
- Theorem [AA01]: The Iterative Algorithm gives the MLE with the additional constraint that $0 < x_i$, $\forall\, 0 < i < 2^k - 1$. This models the fact that the probabilities are non-negative. Results are better, as shown in experiments.

31 Privacy Preserving OLAP  Motivation  Problem Definition  Query Reconstruction  Privacy Guarantees  Experiments

32 Privacy Guarantees Say we initially know with probability < 0.3 that Alice's age > 25. After seeing the perturbed value, we can say so with probability > 0.95. Then we say there is a (0.3, 0.95) privacy breach. More subtle differential privacy in the thesis.

33 Privacy Preserving OLAP  Motivation  Problem Definition  Query Reconstruction  Privacy Guarantees  Experiments

34 Experiments
- Real data: Census data from the UCI Machine Learning Repository, with 32000 rows
- Synthetic data: generated multiple columns of Zipfian data, number of rows varied between 1000 and 1000000
- Error metric: $L_1$ norm of the difference between x and y, i.e. the $L_1$ norm between 2 probability distributions. E.g., for 1-dim queries: $|x_1 - y_1| + |x_0 - y_0|$

35 Inversion vs Iterative Reconstruction 2 attributes: Census Data 3 attributes: Census Data Iterative algorithm (MLE on constrained space) outperforms Inversion (global MLE)

36 Error as a function of Number of Columns: Iterative Algorithm: Zipf Data. The error in the iterative algorithm flattens out, as its maximum value is bounded by 2.

37 Error as a function of Number of Columns: Census Data. [Figure panels: Inversion Algorithm, Iterative Algorithm] Error increases exponentially with the number of columns.

38 Error as a function of number of Rows. Error decreases as the number of rows, n, increases.

39 Conclusion Possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained The techniques have been tested experimentally on real and synthetic data. More experiments in the paper. Privacy Preserving OLAP is Practical

41 Anonymizing Tables: ICDT05 Creating tables that do not identify individuals for research or out-sourced software development purposes Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu

42 Probabilistic Anonymity: (submitted) Lodha, Thomas Achieving Anonymity via Clustering: PODS06 Aggarwal, Feder, Kenthapadi, Khuller, Panigrahy, Thomas, Zhu

43 Data Privacy  Value disclosure: What is the value of attribute salary of person X Perturbation  Privacy Preserving OLAP  Identity disclosure: Whether an individual is present in the database table Randomization, K-Anonymity etc.  Data for Outsourcing / Research

44 Original Dataset (Identifying: SSN, Name. Sensitive: Disease.)
SSN  Name    DOB       Gender  Zip code  Disease
614  Sara    03/04/76  F       94305     Flu
615  Joan    07/11/80  F       94307     Cold
629  Karan   05/09/55  M       94301     Diabetes
710  Harris  11/23/62  M       94305     Flu
840  Carl    11/23/62  M       94059     Arthritis
780  Amanda  01/07/50  F       94042     Heart problem
619  Rob     04/08/43  M       94042     Arthritis

45 Randomized Dataset (Identifying: SSN, Name. Sensitive: Disease.)
SSN  Name    DOB       Gender  Zip code  Disease
101  Amy     03/04/76  F       94305     Flu
102  Betty   07/11/80  F       94307     Cold
103  Clarke  05/09/55  M       94301     Diabetes
104  David   11/23/62  M       94305     Flu
105  Earl    11/23/62  M       94059     Arthritis
106  Finy    01/07/50  F       94042     Heart problem
107  George  04/08/43  M       94042     Arthritis

46 Quasi-Identifiers Uniquely identify you! Quasi-identifiers: approximate foreign keys. (Sensitive: Disease.)
DOB       Gender  Zip code  Disease
03/04/76  F       94305     Flu
07/11/80  F       94307     Cold
05/09/55  M       94301     Diabetes
12/30/72  M       94305     Flu
11/23/62  M       94059     Arthritis
01/07/50  F       94042     Heart problem
04/08/43  M       94042     Arthritis

47 k-Anonymity Model [Swe00]  Modify some entries of quasi-identifiers each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers  Individual records hidden in a crowd of size k

48 Original Table
         Age  Salary
Amy      25   50
Brian    27   60
Carol    29   100
David    35   110
Evelyn   39   120

49 Suppressing all entries: No Utility
         Age  Salary
Amy      *    *
Brian    *    *
Carol    *    *
David    *    *
Evelyn   *    *

50 2-Anonymity with Clustering
         Age      Salary
Amy      [25-29]  [50-100]
Brian    [25-29]  [50-100]
Carol    [25-29]  [50-100]
David    [35-39]  [110-120]
Evelyn   [35-39]  [110-120]
Cluster centers published: 27=(25+27+29)/3, 70=(50+60+100)/3; 37=(35+39)/2, 115=(110+120)/2
Clustering formulation: NP Hard
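
The published ranges and cluster centers in this example can be recomputed from the original table. The sketch below takes the two clusters as given (finding good clusters is the hard part and is not attempted here); the function name and data layout are hypothetical.

```python
def generalize(cluster):
    """Return the published ranges, center and size for one cluster of
    (name, age, salary) rows."""
    ages = [age for _, age, _ in cluster]
    salaries = [salary for _, _, salary in cluster]
    return {
        "age_range": (min(ages), max(ages)),
        "salary_range": (min(salaries), max(salaries)),
        "center": (sum(ages) / len(ages), sum(salaries) / len(salaries)),
        "size": len(cluster),
    }


table = [("Amy", 25, 50), ("Brian", 27, 60), ("Carol", 29, 100),
         ("David", 35, 110), ("Evelyn", 39, 120)]
clusters = [table[:3], table[3:]]          # the clustering shown on the slide
summaries = [generalize(c) for c in clusters]

k = 2
is_k_anonymous = all(s["size"] >= k for s in summaries)
for s in summaries:
    print(s)
print("2-anonymous:", is_k_anonymous)
```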

51 First cut: ICDT05 Combinatorial Algorithms [Figure: weighted graph on the records; legend: nearest-neighbour edges, other edges]

52 K-Anonymous partitions [Figure: graph partitioned into k-anonymous groups]

53 Second cut: PODS06 Metric space clustering Combinatorial algorithm on graphs O(k) approximation Ω(k) lower bound Assume points in a Metric Space - Factor 2 approximation for maximum cluster size metric - Matching Lower bound of 2 - Constant Factor Approximations for median and cellular metrics too

54 Clustering Metrics 10 points, radius 5 20 points, radius 10 50 points, radius 15

55 r-center Clustering: Minimize Maximum Cluster Size 2d

56 Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$ (sum of cellular cost and facility cost)
Subject to:
$\sum_c x_{ic} \ge 1$ (each point belongs to a cluster)
$x_{ic} \le y_c$ (cluster must be opened for a point to belong)
$0 \le x_{ic} \le 1$ (points belong to clusters positively)
$0 \le y_c \le 1$ (clusters are opened positively)

57 Final Cut: Probabilistic Anonymity  Formalism and Practical technique to identify a Quasi-Identifier  What is the ideal amount of generalization/ suppression to apply to the different columns  Techniques to make your published database table conform to laws like HIPAA

58 Quasi-identifier Example column Fruit: Apple, Guava, Orange, Apple, Banana. A 0.6 fraction of the rows is uniquely identified by Fruit (Guava, Orange, Banana), hence Fruit is a 0.6 quasi-identifier. A 0.87 fraction of the U.S. population is uniquely identified by (DOB, Gender, Zipcode), hence it is a 0.87 quasi-identifier.
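
The fraction in the Fruit example is simply the share of rows whose value occurs exactly once. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def unique_fraction(values):
    """Fraction of rows whose value appears exactly once in the column."""
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


fruit = ["Apple", "Guava", "Orange", "Apple", "Banana"]
print(unique_fraction(fruit))
```

For a multi-column quasi-identifier, the same function could be applied to tuples such as (DOB, Gender, Zipcode) built from each row.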

59 Quasi-Identifier Find the probability distribution over D distinct values that maximizes the expected fraction of uniquely identified records. D distinct values, n rows. If $D \le n$: $D/(en)$ (skewed distribution). Else: $e^{-n/D}$ (uniform distribution).

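60 Distinct values - Identifier
- DOB: $60 \times 365 \approx 2 \times 10^4$
- Gender: 2
- Zipcode: $10^5$
- (DOB, Gender, Zipcode) together has $2 \times 10^4 \times 2 \times 10^5 = 4 \times 10^9$ distinct values
- US population $= 3 \times 10^8$
- Fraction of singletons $= e^{-3 \times 10^8 / (4 \times 10^9)} \approx 0.92$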
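
The two-case bound and the (DOB, Gender, Zipcode) estimate can be checked numerically. The sketch reads the bound as D/(en) when D <= n and e^(-n/D) otherwise; the function name is hypothetical and the counts are the rounded figures used in the talk.

```python
import math


def max_unique_fraction(D, n):
    """Upper estimate of the fraction of uniquely identified rows for
    D distinct values and n rows, following the two cases above."""
    if D <= n:
        return D / (math.e * n)
    return math.exp(-n / D)


distinct = 2 * 10**4 * 2 * 10**5      # DOB x Gender x Zipcode
population = 3 * 10**8
print(distinct, max_unique_fraction(distinct, population))
```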

61 Distinct values and K-anonymity
- E.g., apply HIPAA to (Age in Years, Zipcode, Gender, Doctor details).
- Want $k = 20{,}000 = 2 \times 10^4$ anonymity with $n = 300$ million $= 3 \times 10^8$ people.
- The number of distinct values is $D = n/k = 1.5 \times 10^4$.
- $D = z(\text{zipcode}) \times 100(\text{age in years}) \times 2(\text{gender}) = 200z$
- $1.5 \times 10^4 = 200z$, so $z = 75$, i.e. approximately $10^2$.
- Retain the first two digits of the zipcode (retain states).
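
The same back-of-the-envelope calculation, written out as code. The variable names are made up, and rounding log10(z) to decide how many zipcode digits to keep is an assumption used only to mirror the "approximately 10^2" step.

```python
import math

n = 300_000_000          # people
k = 20_000               # desired anonymity
ages = 100               # distinct ages in years
genders = 2

D = n // k                               # distinct values allowed
z = D / (ages * genders)                 # distinct zipcode prefixes allowed
digits = round(math.log10(z))            # zipcode digits to retain (assumed rule)

print(D, z, digits)
```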

62 Anonymity  Reduce number of distinct values Many more matches when you join with a foreign table  Cluster close by values to a single value  Use quantiles as placeholders Round all elements to closest quantile

63 1 dimensional Anonymity Original values: 1 3 4 7 12. Anonymized values: 3 3 3 7 7.

64 Ex 20-Anonymity: Quantiles
- 200 numbers. Want each published value to be shared by at least 20 of them.
- Sort them.
- Round the first 20 to the 10th, the next 20 to the 30th, and so on; the last 20 to the 190th.
- 10 different groups, each with 20 elements.
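
The rounding procedure can be sketched as follows. This is an illustration with invented names; folding any leftover elements into the last group is an assumption that keeps every group at size k or more when the input length is not a multiple of k.

```python
import random


def quantile_anonymize(values, k):
    """Replace each value by the middle element of its group of k
    consecutive values in sorted order (for k=20: 10th, 30th, ...)."""
    n = len(values)
    if n < k:
        raise ValueError("need at least k values")
    order = sorted(range(n), key=lambda i: values[i])
    groups = n // k
    out = [None] * n
    for g in range(groups):
        start = g * k
        end = n if g == groups - 1 else start + k   # last group takes leftovers
        group = order[start:end]
        placeholder = values[group[(len(group) - 1) // 2]]
        for i in group:
            out[i] = placeholder
    return out


data = list(range(1, 201))
random.Random(0).shuffle(data)
anonymized = quantile_anonymize(data, k=20)
print(sorted(set(anonymized)))
```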

65 Quantiles.  Quantiles minimize the distortion introduced as a result of this anonymization  Efficient Quantile finding algorithms using a sample

66 Using a Sample
- Exact quantiles: require sorting the entire dataset.
- Instead, create a random sample of size 20.
- Use the 2nd, 4th, ..., 20th elements (in sorted order) as 10 placeholders.
- Round each element to the closest placeholder.
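
A sample-based version might look like the sketch below. It assumes the sample is sorted before every second element is picked, and it breaks ties toward the smaller placeholder; the function names and the synthetic data are hypothetical.

```python
import bisect
import random


def sample_placeholders(values, sample_size, step, rng):
    """Draw a random sample, sort it, and keep every step-th element."""
    sample = sorted(rng.sample(values, sample_size))
    return sample[step - 1::step]


def round_to_closest(value, placeholders):
    """Return the placeholder nearest to value (placeholders must be sorted)."""
    i = bisect.bisect_left(placeholders, value)
    candidates = placeholders[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda ph: abs(ph - value))


rng = random.Random(0)
data = [rng.randint(0, 1000) for _ in range(200)]
placeholders = sample_placeholders(data, sample_size=20, step=2, rng=rng)
anonymized = [round_to_closest(v, placeholders) for v in data]
print(placeholders)
```

Unlike exact quantiles, the groups produced this way are only approximately equal in size, so the anonymity level holds in a probabilistic sense.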

67 Experiments
- Efficient algorithms, based on randomized algorithms that find quantiles in small space: 10 seconds to anonymize a quarter million rows, or approximately 3 GB per hour, on a machine with a 2.66 GHz processor and 504 MB RAM, running Windows XP with Service Pack 2.
- An order of magnitude better in running time for a quasi-identifier of size 10 than the previous implementation.
- Optimal algorithms to anonymize the dataset.
- Scalable: almost independent of the anonymity parameter k; linear in quasi-identifier size (previously exponential); linear in dataset size.

68 Time Required scales with database size

69 Time required scales with number of columns

70 Time vs # Buckets

71 Error vs # Buckets

72 Masketeer: A tool for data privacy Das, Lodha, Patwardhan, Sundaram, Thomas.

74 Auditing Batches of SQL Queries Motwani, Nabar, Thomas PDM Workshop with ICDE 2007 Given a set of SQL queries that have been posed over a database, determine whether some subset of these queries have revealed private information about an individual or a group of individuals

75 Database Query Auditing  Auditing Aggregate (Sum, Max, Median) queries  Perfect Privacy  Auditing SQL Queries  Auditing a Batch of SQL Queries

76 Example
Query: SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'
Audit expression 1: AUDIT zipcode FROM Patients p WHERE p.disease = 'high blood pressure'. The query is not suspicious wrt this.
Audit expression 2: AUDIT disease FROM Patients p WHERE p.zipcode = 94305. The query is suspicious if someone in 94305 has diabetes.

77 Query Suspicious wrt an Audit Expression  If all columns of audit expression are covered by the query  If the audit expression and the query have one tuple in common
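
The two conditions can be illustrated on a tiny in-memory table. This sketch rests on two assumptions that the slides do not spell out: both conditions must hold together, and a query "covers" a column if it either projects it or references it in its WHERE clause. The table contents and all names are hypothetical.

```python
def is_suspicious(table, query_cols, query_pred, audit_cols, audit_pred):
    """A single query is flagged if it covers every audited column and
    shares at least one tuple with the audit expression."""
    covers = set(audit_cols) <= set(query_cols)
    shares_tuple = any(query_pred(row) and audit_pred(row) for row in table)
    return covers and shares_tuple


patients = [
    {"name": "P1", "zipcode": 94305, "disease": "diabetes"},
    {"name": "P2", "zipcode": 94301, "disease": "flu"},
]

# SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'
query_cols = {"zipcode", "disease"}          # projected plus WHERE columns
query_pred = lambda r: r["disease"] == "diabetes"

audit_bp = is_suspicious(patients, query_cols, query_pred,
                         {"zipcode"}, lambda r: r["disease"] == "high blood pressure")
audit_zip = is_suspicious(patients, query_cols, query_pred,
                          {"disease"}, lambda r: r["zipcode"] == 94305)
print(audit_bp, audit_zip)
```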

78 SQL Auditing
- Batch of SQL queries, each of the form: Project col_1, col_2, col_3, ..., col_k From R Where C_1 and C_2 and C_3 and ... C_j
- Each C_i is one of: (col_m = value), (col_m <= value), (col_m >= value), (value_1 <= col_m <= value_2)
- col_1, col_2, ..., col_k includes the primary key, so that the result of a query can be joined with other results

79 Semantically Suspicious
- A query batch Q_1, Q_2, ..., Q_n is said to be suspicious wrt an audit expression A if an expression combining the results of these queries as base tables is suspicious wrt A.
- Natural extension of a suspicious query to a query batch.

80 Syntactically Suspicious  A query batch is said to be syntactically suspicious with respect to an audit expression A if there exists an instantiation of the database tables for which it is suspicious wrt A  Does not require execution of the queries against the table

81 SQL Batch Auditing [Figure: Query 1 to Query 4 and the audit expression over a table; the columns of an audited tuple are covered.] A query batch is suspicious wrt an audit expression iff the queries together cover all audited columns of at least one audited tuple: semantically, on table T; syntactically, on some table T.

82 Syntactic and Semantic Auditing  Checking for semantic suspiciousness has polynomial time algorithm  Checking for syntactic suspiciousness is NP complete

84 Two Can Keep a Secret: A Distributed Architecture for Secure Database Services Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu CIDR 2005 How to distribute data across multiple sites for (1)redundancy and (2) privacy so that a single site being compromised does not lead to data loss

85 Distributing data and Partitioning and Integrating Queries for Secure Distributed Databases Feder, Ganapathy, Garcia-Molina, Motwani, Thomas Work in Progress

86 Motivation  Data outsourcing growing in popularity Cheap, reliable data storage and management  1TB $399  < $0.5 per GB  $5000 – Oracle 10g / SQL Server  $68k/year DBAdmin  Privacy concerns looming ever larger High-profile thefts (often insiders)  UCLA lost 900k records  Berkeley lost laptop with sensitive information  Acxiom, JP Morgan, Choicepoint  www.privacyrights.org

87 Present solutions
- Application level: Salesforce.com On-Demand Customer Relationship Management. $65/User/Month; $995 / 5 Users / 1 Year
- Amazon Elastic Compute Cloud: 1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth. Elastic, Completely controlled, Reliable, Secure. $0.10 per instance hour; $0.20 per GB of data in/out of Amazon; $0.15 per GB-Month of Amazon S3 storage used
- Google Apps for your domain: Small businesses, Enterprise, School, Family or Group

88 Encryption Based Solution Encrypt Client DSP Client-side Processor Query Q Q’ “Relevant Data”Answer Problem: Q’  “SELECT *”

89 The Power of Two Client DSP1 DSP2

90 The Power of Two DSP1 DSP2 Client-side Processor Query Q Q1 Q2 Key: Ensure Cost (Q1)+Cost (Q2)  Cost (Q)

91 SB1386 Privacy  { Name, SSN}, { Name, LicenceNo} { Name, CaliforniaID} { Name, AccountNumber} { Name, CreditCardNo, SecurityCode} are all to be kept private.  A set is private if at least one of its elements is “hidden”. Element in encrypted form ok

92 Techniques
- Vertical Fragmentation: partition attributes across R1 and R2. E.g., to obey constraint {Name, SSN}: R1 ← Name, R2 ← SSN. Use tuple IDs for reassembly: R = R1 JOIN R2.
- Encoding:
  One-time Pad: for each value v, construct a random bit sequence r; R1 ← v XOR r, R2 ← r.
  Deterministic Encryption: R1 ← E_K(v), R2 ← K. Can detect equality and push selections with an equality predicate.
  Random addition: R1 ← v+r, R2 ← r. Can push aggregate SUM.
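
Two of the encodings, the one-time pad and random addition, are easy to sketch. The code below is only an illustration of the idea of splitting a value into two shares; the function names, the 32-bit pad width, and the bound on the random addend are assumptions, and no claim is made that this matches the system's actual encoding.

```python
import secrets


def split_one_time_pad(v, bits=32):
    """Return (share for R1, share for R2) with share1 = v XOR r."""
    r = secrets.randbits(bits)
    return v ^ r, r


def join_one_time_pad(share1, share2):
    """Recombine the two shares at the client."""
    return share1 ^ share2


def split_random_addition(v, bound=10**9):
    """Return (v + r, r); sums of shares can be computed at each site."""
    r = secrets.randbelow(bound)
    return v + r, r


salaries = [50, 60, 100]
shares = [split_random_addition(s) for s in salaries]
sum_site1 = sum(s1 for s1, _ in shares)    # SUM pushed to the site holding R1
sum_site2 = sum(s2 for _, s2 in shares)    # SUM pushed to the site holding R2
total = sum_site1 - sum_site2               # combined by the client
print(total)
```

Neither site alone sees the salaries, yet the client recovers the exact SUM from two scalar results instead of fetching every row.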

93 Example  An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}  Privacy Constraints {Telephone}, {Email} {Name, Salary}, {Name, Position}, {Name, DoB} {DoB, Gender, ZipCode} {Position, Salary}, {Salary, DoB}  Will use just Vertical Fragmentation and Encoding.

94 Example (2) Constraints: {Telephone}, {Email}, {Name, Salary}, {Name, Position}, {Name, DoB}, {DoB, Gender, ZipCode}, {Position, Salary}, {Salary, DoB}. Attributes: Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode. [Figure: assignment of attributes to fragments R1 and R2; labels: Telephone, Email, Salary, ID]

95 Partitioning, Execution  Partitioning Problem Partition to minimize communication cost for given workload Even simplified version hard to approximate Hill Climbing algorithm after starting with weighted set cover  Query Reformulation and Execution Consider only centralized plans Algorithm to partition select and where clause predicates between the two partitions

96 Data Privacy Tools worked on  Privacy Preserving OLAP  Distributed Architecture  Data Masking

97 Papers  [AST05] Privacy Preserving OLAP. SIGMOD 2005  [AFK+05] Approximation Algorithms for K- Anonymity. ICDT 2005  [AFK+06] Clustering for Anonymity. PODS 2006  [LT07] Probabilistic Anonymity. Submitted

98 Other Privacy Papers  [SPG05] Two Can Keep a Secret: A Distributed Architecture for Secure Database Services. CIDR 2005  [SPG04] Enabling Privacy for the Paranoids. VLDB 2004  [DLP+07] Masketeer: A Tool for Data Privacy.  [MNT07] Auditing SQL Queries. PDM Workshop with ICDE 2007

99 Thank You!

100 Acknowledgements: Stanford Faculty  Advisor: Rajeev Motwani  Members of Orals Committee: Rajeev Motwani, Hector Garcia-Molina, Dan Boneh, John Mitchell, Ashish Goel  Many other professors at Stanford, esp. Jennifer Widom

101 Acknowledgements: Projects  STREAM: Jennifer Widom, Rajeev Motwani  PORTIA: Hector Garcia-Molina, Rajeev Motwani, Dan Boneh, John Mitchell  TRUST: Dan Boneh, John Mitchell, Rajeev Motwani, Hector Garcia-Molina  RAIN: Rajeev Motwani, Ashish Goel, Amin Saberi

102 Acknowledgements: Internship Mentors Rakesh Agrawal, Ramakrishnan Srikant, Surajit Chaudhuri, Nicolas Bruno, Phil Gibbons, Sachin Lodha, Anand Rajaraman

103 Acknowledgements: CoAuthors[A-K] Gagan Aggarwal, Rakesh Agrawal, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayank Bawa, Nicolas Bruno, Renato Carmo, Surajit Chaudhuri, Mayur Datar, Prasenjit Das, A A Diwan, Tomás Feder, Vignesh Ganapathy, Prasanna Ganesan, Hector Garcia-Molina, Keith Ito, Krishnaram Kenthapadi, Samir Khuller, Yoshiharu Kohayakawa

104 Acknowledgements: CoAuthors[L-Z] Eduardo Sany Laber, Sachin Lodha, Nina Mishra, Rajeev Motwani, Shubha Nabar, Itaru Nishizawa, Liadan Boyen, Rina Panigrahy, Nikhil Patwardhan, Ramakrishnan Srikant, Utkarsh Srivastava, S. Sudarshan, Sharada Sundaram, Rohit Varma, Jennifer Widom, Ying Xu, An Zhu

105 Acknowledgements: Others not in previous list  Aristides, Gurmeet, Aleksandra, Sergei, Damon, Anupam, Arnab, Aaron, Adam, Mukund, Vivek, Anish, Parag, Vijay, Piotr, Moses, Sudipto, Bob, David, Paul, Zoltan etc.  Members of Rajeev’s group, Stanford Theory, Database, Security groups, Also many PhD students of the incoming year 2002 -- Paul etc. and many other students at Stanford  Lynda, Maggie, Wendy, Jam, Kathi, Claire, Meredith for administrative help  Andy, Miles, Lilian for keeping the machines running!  Various outing clubs and groups at Stanford, Catholic community here, SIA, RAINS groups, Ivgrad, DB Movie and Social Committee

106 Acknowledgements: More!  Jojy Michael, Joshua Easow and families  Roommates: Omkar Deshpande, Alex Joseph, Mayur Naik, Rajiv Agrawal, Utkarsh Srivastava, Rajat Raina, Jim Cybluski, Blake Blailey  Batchmates and Professors from IITs  Friends and relatives, grandparents  sister Dina, and Parents

107 Data Streams
- Traditional DBMS: data stored in finite, persistent data sets
- New Applications: data input as continuous, ordered data streams. Network and traffic monitoring; telecom call records; network security; financial applications; sensor networks; web logs and clickstreams; massive data sets

108 Scheduling Algorithms for Data Streams  Minimizing the overhead over the disk system. Motwani, Thomas. SODA 2004  Operator Scheduling in Data Stream Systems – Minimizing memory consumption and latency. Babu, Babcock, Datar, Motwani, Thomas. VLDB Journal 2004  Stanford STREAM Data Manager. Stanford Stream Group. IEEE Bulletin 2003
