# Data-Driven Decision-Making: The Good, the Bad, and the Ugly

Ruda Kulhavý, Honeywell International, Inc., Automation and Control Solutions, Advanced Technology


## Can We Generate More Value from Data?

Today, a typical data mining project is ad hoc, lengthy, costly, knowledge-intensive, and requires ongoing maintenance. Although the project benefits can be quite significant, the resulting profit is often marginal. The industry is in search of robust methods and reusable workflows that are easy to use, adapt to system and organizational changes, and require no special knowledge from the end user. This is a tough target. What can we offer toward it today?

## Learning from Data: Probabilistic Approach

## Learning from Data

Data: $d_i(k)$, $i = 1, \dots, n$, $k = 1, \dots, N$

- Independent variables: states (disturbance variables) and actions (manipulated variables)
- Dependent variables: responses (controlled variables) and rewards (objective functions)

Goal: learn from the data how the responses and rewards depend on the actions and states.

The data matrix collects $n$ variables (columns) over $N$ observations (rows):

$$ \begin{pmatrix} d_1(1) & \cdots & d_n(1) \\ \vdots & & \vdots \\ d_1(N) & \cdots & d_n(N) \end{pmatrix} $$

## From Data to Probability

A relational database table (fields A, B, C as dimensions; records such as 23, 9, 12) can be aggregated into a data hypercube with cells $i = 1, \dots, L$ and per-cell counts $N_i$. The empirical probability of cell $i$ is

$$ r_i(N) = \frac{N_i}{N}, \qquad N = \sum_{i=1}^{L} N_i, \quad i = 1, \dots, L. $$
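As a small, self-contained sketch of this aggregation step (the record values, bin widths, and helper names are illustrative assumptions, not from the slides), the counts $N_i$ and the empirical probabilities $N_i / N$ can be computed as:

```python
from collections import Counter

def empirical_probability(records, bin_width, num_bins):
    """Aggregate records into data-cube cells by binning each field,
    then normalize the per-cell counts N_i into probabilities N_i / N."""
    def cell(record):
        # Cell label = tuple of bin indices, one per field (dimension).
        return tuple(min(int(x // bin_width), num_bins - 1) for x in record)
    counts = Counter(cell(r) for r in records)
    n = len(records)
    return {c: k / n for c, k in counts.items()}

# Three fields (A, B, C); each axis binned into 4 bins of width 10.
records = [(23, 9, 12), (21, 8, 11), (35, 9, 13), (23, 9, 12)]
p = empirical_probability(records, bin_width=10, num_bins=4)
```

Here three of the four records fall into the same cell, so that cell's empirical probability is 0.75.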

## Probabilistic Data Mining

The workflow runs from the database through a query to a data cube, from the cube to an empirical probability, from there to a smoothed probability, and finally through probability operations (with a possible Monte Carlo approximation) to the answer.

## What Makes Up Problem Dimensionality?

Take a discrete perspective:

- Number of data points: $N = 10^5$ five-minute samples per year
- Number of cells: $L = d^n$ cells, assuming $n$ dimensions, each divided into $d$ cells
- Number of models: $M = d^m$ models, assuming $m$ model parameters, each divided into $d$ cells

These counts can be cut down if strong prior information is available.
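The explosion of $L$ and $M$ is easy to make concrete (the choice of $d = 10$ bins per axis below is an assumed illustration, not from the slides):

```python
# Back-of-the-envelope dimensionality accounting for the slide's quantities.
samples_per_year = 365 * 24 * 12   # five-minute samples: 105,120, i.e. N ~ 10^5

def num_cells(d, n):
    """L = d^n cells: n dimensions with d bins each."""
    return d ** n

def num_models(d, m):
    """M = d^m models: m parameters with d bins each."""
    return d ** m

# With d = 10, six dimensions already give ten times more cells than
# a full year of data has samples.
cells = num_cells(10, 6)
```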

## Macroscopic Prediction

E. T. Jaynes, *Macroscopic Prediction*, 1985: "If any macrophenomenon is found to be reproducible, then it follows that all microscopic details that were not reproduced must be irrelevant for understanding and predicting it."

The Gibbs variational principle is to "predict that final state that can be realized by Nature in the greatest number of ways, while agreeing with your macroscopic information."

## Boltzmann's Solution (1877)

To determine how N gas molecules distribute themselves in a conservative force field such as gravitation, Boltzmann divided the accessible 6-dimensional phase space of a single molecule into equal cells, with $N_i$ molecules in the $i$-th cell. The cells were considered so small that the energy $E_i$ of a molecule did not vary appreciably within a cell, but at the same time so large that each could accommodate a large number $N_i$ of molecules.

## Boltzmann's Solution (cont.)

Noting that the number of ways this distribution can be realized is the multinomial coefficient

$$ W = \frac{N!}{N_1! \, N_2! \cdots N_L!}, $$

he concluded that the "most probable" distribution is the one that maximizes $W$ subject to the known constraints of his prior knowledge, in this case the total number of particles and the total energy:

$$ \sum_i N_i = N, \qquad \sum_i N_i E_i = E_{\mathrm{total}}. $$

## Boltzmann's Solution (cont.)

If the numbers $N_i$ are large, the factorials can be replaced with the Stirling approximation, which turns $\frac{1}{N}\log W$ into the Shannon entropy

$$ \frac{1}{N}\log W \approx -\sum_i \frac{N_i}{N} \log \frac{N_i}{N}. $$

The solution maximizing $\log W$ can be found by Lagrange multipliers and is the exponential distribution

$$ \frac{N_i}{N} = C \, e^{-\lambda E_i}, $$

where $C$ is a normalizing factor and the Lagrange multiplier $\lambda$ is chosen so that the energy constraint is satisfied.
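A minimal numerical sketch of this construction (the energy levels and the bisection bracket are assumed for illustration): since the mean energy of the distribution $C e^{-\lambda E_i}$ decreases monotonically in $\lambda$, the multiplier can be found by bisection.

```python
import math

def maxent_distribution(energies, mean_energy, lam_lo=-50.0, lam_hi=50.0):
    """Boltzmann-style maximum-entropy distribution p_i = C * exp(-lam * E_i),
    with lam chosen by bisection so the mean-energy constraint holds."""
    def mean_for(lam):
        w = [math.exp(-lam * e) for e in energies]
        z = sum(w)
        return sum(wi * e for wi, e in zip(w, energies)) / z

    lo, hi = lam_lo, lam_hi
    for _ in range(200):          # mean_for is decreasing in lam, so bisect
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > mean_energy:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(-lam * e) for e in energies]
    z = sum(w)
    return [wi / z for wi in w], lam

# Four energy levels; constrain the mean energy to 1.0.
p, lam = maxent_distribution([0.0, 1.0, 2.0, 3.0], mean_energy=1.0)
```

The resulting probabilities decay exponentially with energy, as the closed-form solution predicts.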

## Why Does It Work?

E. T. Jaynes, *Where Do We Stand on Maximum Entropy?*, 1979: "Information about the dynamics entered Boltzmann's equations at two places: (1) the conservation of total energy; and (2) the fact that he defined his cells in terms of phase volume … The fact that this was enough to predict the correct spatial and velocity distribution of the molecules shows that the millions of intricate dynamical details that were not taken into account were actually irrelevant to the predictions …"

## Why Does It Work? (cont.)

E. T. Jaynes, *Where Do We Stand on Maximum Entropy?*, 1979: "Boltzmann's reasoning was super-efficient … Whether by luck or inspiration, he put into his equations only the dynamical information that happened to be relevant to the questions he was asking. Obviously, it would be of some importance to discover the secret of how this came about, and to understand it so well that we can exploit it in other problems …"

## General Maximum Entropy

Start from the empirical probability mass function $r(N)$. Two probability mass functions are declared equivalent, for a given (vector) function $h = (h_1, \dots, h_L)$, if they assign the same expectation to $h$. The equivalence class containing $r(N)$ is

$$ \mathcal{E} = \Big\{ s : \textstyle\sum_i s_i h_i = \sum_i r_i(N) \, h_i \Big\}. $$

## General Maximum Entropy (cont.)

The relative entropy (a.k.a. Kullback–Leibler distance) of $s$ with respect to a reference $s(0)$ is

$$ D(s \,\|\, s(0)) = \sum_i s_i \log \frac{s_i}{s_i(0)}. $$

Minimizing it over the equivalence class gives the minimum relative entropy solution

$$ s_i = C \, s_i(0) \, e^{\lambda \cdot h_i}, $$

where $C$ is a normalizing factor and $\lambda$ is chosen so that the expectation constraint $\sum_i s_i h_i = \sum_i r_i(N)\, h_i$ is satisfied. With a uniform reference $s(0)$, this reduces to maximum entropy.

## Probability Approximation

Approximate the empirical probability vector $r(N)$ with a member $s(\theta)$ of a more tractable family parameterized by a vector $\theta$. Taking a geometric perspective, this can be regarded as a projection of the point $r(N)$ onto a surface of lower dimension.

## Maximum Likelihood

Consider an exponential family $S(m)$ with a fixed "origin" $s(0)$, canonical affine parameter $\theta$, directional sufficient statistic $h = (h_1, \dots, h_L)$, and normalizing factor $C$:

$$ s_i(\theta) = C(\theta) \, s_i(0) \, e^{\theta \cdot h_i}. $$

By the definition of the family, minimizing the relative entropy $D(r(N) \,\|\, s(\theta))$ over $\theta$ is equivalent to maximum likelihood estimation.

## Maximum Likelihood (cont.)

The minimum relative entropy solution is

$$ s_i(\hat\theta) = C(\hat\theta) \, s_i(0) \, e^{\hat\theta \cdot h_i}, $$

where $C$ is a normalizing factor and $\hat\theta$ is chosen so that the expected sufficient statistic matches its empirical value:

$$ \sum_i s_i(\hat\theta) \, h_i = \sum_i r_i(N) \, h_i. $$

## Dual Projections

Maximum likelihood and maximum entropy are dual projections.

## Pythagorean Geometry

The exponential family, the equivalence class, and the dual parametrizations of the exponential family fit together in a Pythagorean geometry for the relative entropy.

## Dual Geometry

- Maximum entropy: the empirical probability is known only up to an equivalence class; the solution is found within an exponential family passing through a reference point.
- Maximum likelihood: the approximating probability is sought within an exponential family; the approximation is found by projecting the empirical probability onto it.

## Bayesian Estimation

The posterior probability vector over models $i = 1, \dots, M$ follows Bayes' rule: the posterior probability of model $i$ is proportional to its prior probability multiplied by the likelihood of the observed data under model $i$.
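A minimal sketch of this update for a finite model set (the priors and log-likelihood values below are made-up illustrations, not from the slides):

```python
import math

def posterior(priors, log_likelihoods):
    """Posterior over models i = 1..M by Bayes' rule:
    p_i proportional to prior_i * likelihood_i, normalized with the
    log-sum-exp trick for numerical stability."""
    logs = [math.log(p) + ll for p, ll in zip(priors, log_likelihoods)]
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    z = sum(weights)
    return [w / z for w in weights]

# Three candidate models with equal priors; the second fits the data best.
post = posterior([1/3, 1/3, 1/3], [-12.0, -10.0, -15.0])
```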

## What If the Model Is Too Complex?

For some real-life problems, the level of detail that needs to be collected on the empirical probability (and, correspondingly, the dimension of the exponential family) is too high, possibly infinite. In such cases, we can either sacrifice the closed-form solution, or take a narrower view of the data: model only the part of system behavior relevant to the problem in question, using a simpler, lower-dimensional model.

## Relevance-Based Weighting of Data

The general idea of relevance weighting is to modify the empirical probability through a weight vector $w$ reflecting the relevance of particular cells to the case at hand:

$$ \tilde r_i(N) \propto w_i \, r_i(N). $$

A popular choice of the weights $w_i$ for a given query vector $x(0)$ is a kernel function that decays with the distance of the cell from the query.
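A minimal sketch with a Gaussian kernel (a common but here assumed choice; the cell centers, counts, and bandwidth are illustrative):

```python
import math

def relevance_weights(cell_centers, query, bandwidth):
    """Gaussian kernel weights: cells near the query point x(0) get weight
    close to 1, distant cells close to 0."""
    return [math.exp(-0.5 * ((c - query) / bandwidth) ** 2)
            for c in cell_centers]

def weighted_empirical(counts, weights):
    """Relevance-weighted empirical probability: reweight the per-cell
    counts and renormalize."""
    w = [n * wi for n, wi in zip(counts, weights)]
    z = sum(w)
    return [x / z for x in w]

centers = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [10, 10, 10, 10, 10]        # a uniform empirical distribution
p = weighted_empirical(counts, relevance_weights(centers, query=1.0, bandwidth=1.0))
```

Although the raw counts are uniform, the weighted distribution concentrates around the query point at 1.0.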

## Local Empirical Distributions

Query-specific empirical distributions, obtained by relevance weighting around each query, are projected onto a query-independent model family (an exponential family); the figure plots a response variable against a predictor variable.

## Local Modeling

Example: forecast heat demand (the forecasted variable) from outdoor temperature and time of day (the explanatory variables) at a query point ("What if …?"), working from a relational database or a multidimensional data cube.

## Data-Centric Technology

One data-centric machinery, operating over states (operating conditions), actions (decisions), and rewards (operating profit, production cost, target matching), supports four tasks, each defined by a query or tested point and its neighborhood in the state/action space:

- Regression: continuous target variable (product demand, product property, performance measure)
- Classification: categorical target variable (discrete event, system fault, process trip)
- Novelty detection: tested variable (corrupt values, unusual responses, new behavior), comparing new data against current and past data
- Optimization: selecting the action for the current state

## Increasingly Popular Approach

- Statistical learning: locally-weighted / nonparametric regression (Cleveland, Bell Labs; Vapnik, AT&T Labs)
- Artificial intelligence: lazy / memory-based learning (Moore, Carnegie Mellon University; Bontempi, University of Brussels)
- System identification: just-in-time / on-demand modeling (Cybenko, Dartmouth College; Ljung & Stenman, Linköping University)

## How Do Humans Solve Problems?

- Expert: "Take everything into account!"
- Sales rep: "Focus on recent experience!"
- Engineer: "Use relevant information!"

## Corresponding Technologies

- Adaptive regression
- Local regression
- Neural network

## Pros and Cons

| Technology | Pros | Cons |
|---|---|---|
| Adaptive regression | Simple adaptation; fast computation; data compression | No actual learning; local description |
| Neural network | Global description; fast lookup; data compression | Slow learning; interference problem; lack of adaptation; difficult to interpret |
| Local regression | Minimum bias; inherent adaptation; easy to interpret | No compact model; no data compression; slower lookup |

## Addressing Dimensionality: No Locality in High Dimensions?

## Limits of Local Modeling

As the cube dimension $n$ increases, it becomes increasingly difficult to do relevance weighting, similarity search, and neighborhood sizing:

- The volume of a unit hypersphere becomes a vanishing fraction of the volume of the unit hypercube.
- The length of the diagonal of the unit hypercube, $\sqrt{n}$, goes to infinity.
- The hypercube increasingly resembles a spherical hedgehog (with $2^n$ spikes).
- When uniformly distributed, most data appear near the cube edges.
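The first two effects can be checked directly (a small sketch; the dimensions sampled are arbitrary): the unit-diameter ball inscribed in the unit cube has volume $\pi^{n/2} / \Gamma(n/2 + 1) \cdot 2^{-n}$, while the cube diagonal is $\sqrt{n}$.

```python
import math

def ball_to_cube_ratio(n):
    """Volume of the unit-diameter n-ball inscribed in the unit n-cube,
    divided by the cube's volume (which is 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) / 2 ** n

ratios = {n: ball_to_cube_ratio(n) for n in (1, 2, 3, 10, 20)}
diagonals = {n: math.sqrt(n) for n in (1, 10, 100)}
```

Already at $n = 10$ the inscribed ball fills less than 0.3% of the cube, so a "spherical" neighborhood misses almost all of it.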

## No Local Data in High Dimensions

(Figure: retrieved data ratio versus cube edge ratio for data living on surfaces of dimension 1, 2, 3, 10, and 100.)

However, in most real-life problems the data is anything but uniformly distributed. Thanks to technology design, integrated control and optimization, and human supervision, the actual number of degrees of freedom is often quite limited.

## Local Modeling Revisited

1. Exploit the data dependence structure: a divide-and-conquer approach. Compare $p(x_1)\,p(x_2)$ against $p(x_1, x_2)$; make use of the Markov property.
2. Discover low-dimensional manifolds on which the data live: feature selection and cross-validation, with the query-point neighborhood defined over an embedded manifold.

## Local Modeling Revisited (cont.)

3. Make use of multiple modes in the data: a tree of production or operating modes, with similarity of modes defined over the tree.
4. Analyze patterns in the population of the cube cells, including the occupancy numbers: estimate the probabilities of symbols generated by an information source, given an observed sequence of symbols, where the symbols are cube-cell labels in a proper encoding.

## Cube Encoding

Label the cells of a 10×10 cube 0–99, row by row (bottom row 0–9, top row 90–99). When the populated cells lie on a line through the cube, then for every two populated cells $i$, $i'$ there exists a natural number $n$ such that $|i - i'| = n\,D$ for a fixed step size $D$.

## General Linear Case

There exist $m$ numbers $D_1, D_2, \dots, D_m$ such that for every two populated cells $i$, $i'$, the absolute difference of the cell labels can be expressed as a weighted sum of the numbers $D_1, D_2, \dots, D_m$ with natural-number weights $n_1, n_2, \dots, n_m$:

$$ |i - i'| = n_1 D_1 + n_2 D_2 + \cdots + n_m D_m. $$

The number $m$ defines the dimension of a hyperplane cutting the cube, on which the data live.
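Whether a set of populated cell labels is consistent with given steps $D_1, \dots, D_m$ can be checked with a small dynamic program (a sketch; the function name and the example cells on a 10×10 grid diagonal are illustrative):

```python
def representable(diff, steps):
    """Check whether the nonnegative integer diff is a weighted sum of the
    given steps with nonnegative integer weights (a coin-problem DP)."""
    reachable = [False] * (diff + 1)
    reachable[0] = True
    for d in steps:
        for v in range(d, diff + 1):
            if reachable[v - d]:
                reachable[v] = True
    return reachable[diff]

# Cells on the main diagonal of the 10x10 grid: labels differ by multiples of 11.
cells = [0, 11, 22, 33, 44]
ok = all(representable(abs(a - b), [11]) for a in cells for b in cells if a != b)
```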

## Symbolic Forecasting

On the same encoded cube, the condition $|i - i'| = n\,D$ acts as a sequence template: the labels of populated cells form a symbol sequence whose admissible continuations are constrained by the template, which can be exploited for forecasting.

## Symbolic Forecasting (cont.)

More questions than answers at the moment:

- What are proper model functions capturing the population patterns and occupancy numbers?
- What is a proper way of approaching the problem: coding theory, algebraic geometry, harmonic analysis?
- Quantization error; the discrete-to-continuous transition …

## Decision-Making Process: Lessons Learnt

## Hypothesis Formulation …

Two of the world's leading economists present quite distinct views of globalization in their new books: Joseph Stiglitz, *Globalization and Its Discontents*, and Jagdish Bhagwati, *In Defense of Globalization*.

## Feature Selection …

*The Wall Street Journal Europe*, Dec 2, 2002, "Globalization Stirs Debate at U.S. Universities":

- Mr. Stiglitz says that in Latin America, growth in the 1990s was slower, at 2.9% a year, than it was during the days of trade protectionism in the 1960s, when the region's annual growth rate was about 5.4%.
- Mr. Bhagwati argues that women's wages in many developing countries have increased as multinational investment has risen.

## Training Data Selection …

*The Wall Street Journal Europe*, Dec 2, 2002, "Globalization Stirs Debate at U.S. Universities":

- Mr. Stiglitz cites a World Bank study showing that the number of people living on less than \$2 a day increased by nearly 100 million during the booming 1990s.
- Mr. Bhagwati argues that the number of people living on less than \$2 a day declined by nearly 500 million between 1976 and 1998.

## Decision Support Rather Than Automation

Since there are multiple ways of phrasing a complex question, multiple answers are more likely than a single, simple one:

- Is globalization a good or bad thing?
- Should a company make an acquisition?
- Should a vendor introduce a new product?
- Should a production plant respond to a market opportunity?
- What will the demand for natural gas in a country be five years from now?

The decision maker and the decision support system work in a loop: the decision maker supplies a hypothesis and data, the system returns goodness of fit and plausible explanations, and consistent feedback closes the loop.

## Humans To Stay in Control

At the moment, computerized data analysis is more likely to be delivered as decision support than as closed-loop control. Success depends to a large extent on effective interaction between humans and computers. For the foreseeable future, the formulation of hypotheses and the interpretation of results are likely to stay with the people. Commercial decision support software should therefore support a typical usage scenario.