Presentation on theme: "The Datacentric Grid and the Public/Private Boundary David Skillicorn School of Computing Queens University, Kingston"— Presentation transcript:
The Datacentric Grid and the Public/Private Boundary David Skillicorn School of Computing Queens University, Kingston
Assumption: Big datasets are immovable Theres not enough bandwidth (between storage and network) Theres not enough temporary storage space Some data may not cross political boundaries Some data may not cross social boundaries and distributed They are naturally collected that way There are optimal sizes for data repositories
The main implication of immovable data is that code must move to the data – reversing an assumption thats deeply embedded in computing, from the design of processors all the way to distributed computing. The Datacentric Grid Project at Queens is exploring the implications for data-intensive distributed applications. Some of the infrastructure of the computational grid carries over, but much does not (at least directly). Scale Costs Usage patterns Interfaces are different. D.B. Skillicorn and D. Talia, Mining Large Data Sets on Grids: Issues and Prospects, Computing and Informatics, Special Issue on Grid Computing Vol. 21, No. 4, 1--16, 2002.
1. Architectural implication: Data repositories should have their own tightly-coupled compute engines where code can execute. dataset network attached storage cluster
2. Semantics/programming languages implications: Theres a spectrum of choices about how much functionality exists at repositories, and how much arrives in the mobile code. At one extreme: the mobile code selects from a static library of operations, e.g. Netsolve, web services. - The user interface resembles SQL or SAS diagrams. At the other extreme: the mobile code is fully functional code that can communicate only locally. - The user interface resembles Java, Matlab.
3. Data-mining-specific implications: Caching Results. The result of applying a DM technique to a dataset is a model. Such a model may continue to have value (even to other users), so it makes sense to cache it. The obvious place to cache it is at one of the data repositories. So data repositories accrete models, which may themselves play the role of datasets for other users (e.g. a carefully chosen 10% sample).
Caching execution plans. Data mining tends to be exploratory – computing one model suggests refinements to produce a better model. Or the same model is built on a new days data. The execution plan for the second computation may very similar to that for the first. Rather than recompute the best way to execute the second computation, tweak the plan used to execute the first.
4. Information flow implications: Three reasons to limit information flow: Security Privacy Confidentiality across three boundaries Between user and host Between user and user Across the private/public boundary, i.e. between virtual private grids and public grids
Security: Each party does exactly what it has agreed to do. Computational grids emphasize protecting hosts from malicious users, but protecting users from malicious hosts is just as important. a. Hosts must be protected from malicious code: Authentication (requires an enforcement sidechannel –cannot be open) Sandboxing and playgrounds (still unreliable) Proof carrying code (cumbersome) Fully typed code (getting interesting) C.B. Jay, The constructor calculus. www-staff.it.uts.edu.au/~cbj/ Publications/constructors.ps
b. Users must be protected from malicious hosts: Obfuscated code (fudge!) Multiple execution (e.g. SETI, but expensive) Code instrumentation Private function evaluation (?practical) Program transforms (encrypting constants, mult, add means polynomials can be evaluated secretly) Sander and Tschudin, Protecting Mobile Agents Against Malicious Hosts, LNCS, c. Users must be protected from other users. Virtual machines (?strong enough)
Privacy: The default is that other people dont know what Im doing. Strong emotions about privacy, but a recent invention, outside traditional morality. The `denial vs no comment issue. N.B. how willing people are to surrender privacy for a benefit, e.g. Google search. Users must be protected from snoopy hosts, and snoopy fellow users (also perhaps from snooping in the network). Most of the threats are from logs – can hosts do without them?
Confidentiality: If either party reveals whats being done there are real losses to one or both parties We want to prevent others knowing what we are doing, and/or even that were doing it. Users need protection from hosts and from other users.
How important it is to provide these properties, and what costs can be borne, depend on the grid environment: 1.Virtual private grids, within a single organisation; 2.Public grids, open to all who pay to use them and piggbacking on ordinary internet resources; 3.Applications that originate within private grids but that also use facilities from public grids.
Information flow issues within a virtual private grid: Not important in theory because of the confluence of the interests of the organization and the people. Still relatively easy in practice because of visibility and control. Security: potential consequences of violation to perpetrator. Privacy: implicitly surrendered by individuals. Confidentiality: preserved because information does not flow out of the VPG.
Information flow issues within a public grid: Security: difficult to achieve in either direction (implicit assumption that malicious behavior is readily detectable) – viruses, worms!! Privacy: protected only by the difficulty of actual discovering what one individual is doing. Confidentiality: non-existent. Scientific research is usually conceived of as appropriate for this space. This is a dubious assumption – almost all research has some IP implications. Working in a public space could conceivably be considered public disclosure for patent purposes.
Information flow issues that cross the boundary of private and public grids: We assume that the base of the computation is within a private grid, but that information (and so computation) in the public grid is also required. We want to prevent inference of the goal of the entire computation from: whats being computed in public, and/or where the public computation is going for its data, how much work it does at each site, etc. (traffic analysis)
Examples: A movie distribution company bases next weeks shows on confidential financial information – and public reviews. An oil company bases projections of future supply on an in-house model – and political and weather information. A biochemical research team uses new techniques on a public dataset, and publishes descriptive results – but not the names of interesting genes discovered. A struggling company looks at its cashflow projections – and federal law about Chapter 11 and layoffs.
Two helpful technologies that provide partial solutions: Crowds: A group of users agree to form a crowd. Each request to the outside world is sent at random to one of the crowd. With a certain probability, the recipient either sends it to the outside world, or passes it on to another member of the crowd. This simple scheme makes it easy to hide the true origin of a computation, even with partial collusion among members of the crowd. So crowds can combine different organizations without significant risk. Reiter and Rubin, Crowds: Anonymity for Web Transactions, ACM Transactions on Information and System Security, 1997.
Information grazing. Human-directed data mining is hard; it amounts to knowing what question to ask. Its better if the data-mining system can pose plausible questions and then filter the answers to find the interesting ones. But this means that speculative data mining is happening inside an organization – so that what is visible outside is uncoupled from whats going on inside the organization. cf FUD project at Queens.
Ways to protect whats being modelled: I want to compute f(x) In public, I compute g(x) I need an h such that * h g = f * f is hard to figure out knowing g * g(x) is small relative to x Since h has no access to x, g(x) has to somehow contain all of the useful information about x.
1. h selects part of g(x) – so the real intent is hidden inside the computation of g : Examples: a.Association rules – use a lower support than required and postfilter rules; b.Use more dimensions of a truncated matrix decomposition than required and throw the others away; c.Have g build a classifier with more targets than required and use unions of gs classes at deployment – there are an exponential number of possibilities;
2. In public, compute g1 and g2 in such a way that they are not connected (e.g. using crowds). Examples: a.If g1 and g2 access different rows of datasets, their models can be combined using techniques from parallel data mining; b.If g1 and g2 access different columns of datasets, their models can be combined using techniques from distributed data mining; (Random forests shows that this can be very powerful;) c.g1 and g2 can be deployed using bagging; d.Let g1 and g2 use targets such that their intersections are the real targets; e.Use unsupervised classifiers in public, then learn the mapping from the cluster labels to the desired target in private (adapt techniques from metalearning);
3. Use private function evaluation + private selection General techniques for private function evaluation are known, but they are based on a circuit evaluation trick. Knowledge of the circuit leaks, and translating some part of the functionality into an input doesnt help in our setting. Private selection can be built on top of this trick so that the owner of the dataset doesnt know which values are being selected (so traffic analysis is partially prevented). However, both of these techniques are (at present) computationally intractable. The Sander-Tschudin technique seems to be practical, but only works for functions that can be cast as polynomials (which seems to include some control flow).
Summary: 1.Increasingly its the data rather than the computation that is the performance limiting factor. We need to learn how to move functionality to data, rather than data to functionality. 2.Large-scale, geographically-distributed datasets bring their own special problems, and require grids that are quite different from computational grids. 3.Code that moves to data necessarily executes in environments that are out of the control of the codes owner. This makes problems that were previously almost ignored suddenly critical. 4.The information flow issues require hosts to do what they are told to do; in particular, they must not (a) do something else, or (b) learn what the code they are executing is doing. These are tough, but interesting, problems.