# Probabilistic and Bayesian Analytics. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University.

## Presentation transcript:

Copyright © Andrew W. Moore Slide 1 Probabilistic and Bayesian Analytics Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu

Slide 2 Copyright © Andrew W. Moore Probability. The world is a very uncertain place. Thirty years of Artificial Intelligence and Database research danced around this fact. And then a few AI researchers decided to use some ideas from the eighteenth century. We will review the fundamentals of probability.

Slide 3 Copyright © Andrew W. Moore Discrete Random Variables: the Boolean case. A is a Boolean-valued random variable if A denotes an event and there is some degree of uncertainty as to whether A occurs. Examples: A = The US president in 2023 will be male. A = You wake up tomorrow with a headache. A = You have Ebola.

Slide 4 Copyright © Andrew W. Moore Probabilities. We denote P(A), the probability of A, as "the fraction of all occurrences in which A is true". There are several ways to assign probabilities, but we will not discuss them now.

Slide 5 Copyright © Andrew W. Moore Venn diagram for visualizing A. Sample space of all possible occurrences; its area is 1. Occurrences in which A is false; occurrences in which A is true. P(A) = area of the oval region.

Slide 6 Copyright © Andrew W. Moore The Axioms of Probability. 0 <= P(A) <= 1. P(A is true in every occurrence) = 1. P(A is true in no occurrence) = 0. P(A or B) = P(A) + P(B) if A and B are disjoint. These axioms were introduced by the Russian school at the beginning of the 1900s: Kolmogorov, Lyapunov, Khinchin, Chebyshev.

Slide 7 Copyright © Andrew W. Moore Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) The area of A can’t get any smaller than 0 And a zero area would mean no world could ever have A true

Slide 8 Copyright © Andrew W. Moore Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) The area of A can’t get any bigger than 1 And an area of 1 would mean all worlds will have A true

Slide 9 Copyright © Andrew W. Moore Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) A B

Slide 10 Copyright © Andrew W. Moore Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B). On the Venn diagram, P(A or B) and P(A and B) are just areas: simple addition and subtraction.

Slide 11 Copyright © Andrew W. Moore Theorems from the Axioms 0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) From these we can prove: P(not A) = P(~A) = 1-P(A) How?
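One way to fill in the slide's "How?": write True as A or ~A and apply the addition axiom.

```latex
\begin{align*}
1 = P(\text{True}) &= P(A \lor \lnot A) \\
 &= P(A) + P(\lnot A) - P(A \land \lnot A) \\
 &= P(A) + P(\lnot A) - P(\text{False}) = P(A) + P(\lnot A),
\end{align*}
```

so P(~A) = 1 - P(A).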

Slide 12 Copyright © Andrew W. Moore Another important theorem 0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B) How?
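Again a one-step proof: A is the disjunction of the disjoint events A ^ B and A ^ ~B, so the addition axiom gives

```latex
\begin{align*}
P(A) &= P((A \land B) \lor (A \land \lnot B)) \\
 &= P(A \land B) + P(A \land \lnot B) - P(A \land B \land \lnot B) \\
 &= P(A \land B) + P(A \land \lnot B).
\end{align*}
```

The cross term vanishes because B and ~B cannot both hold, so that conjunction is False and has probability 0.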

Slide 13 Copyright © Andrew W. Moore Multivalued Random Variables. Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, ..., v_k}. Thus…

Slide 14 Copyright © Andrew W. Moore An easy fact about Multivalued Random Variables: Using the axioms of probability… 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B). And assuming that A obeys P(A=v_i ^ A=v_j) = 0 if i ≠ j, and P(A=v_1 or A=v_2 or … or A=v_k) = 1. It's easy to prove that P(A=v_1 or A=v_2 or … or A=v_i) = Σ_{j=1..i} P(A=v_j).

Slide 15 Copyright © Andrew W. Moore An easy fact about Multivalued Random Variables: Using the axioms of probability… 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B). And assuming that A obeys P(A=v_i ^ A=v_j) = 0 if i ≠ j, and P(A=v_1 or A=v_2 or … or A=v_k) = 1. It's easy to prove that P(A=v_1 or A=v_2 or … or A=v_i) = Σ_{j=1..i} P(A=v_j). And thus we can prove Σ_{j=1..k} P(A=v_j) = 1.

Slide 16 Copyright © Andrew W. Moore Another fact about Multivalued Random Variables: Using the axioms of probability… 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B). And assuming that A obeys P(A=v_i ^ A=v_j) = 0 if i ≠ j, and P(A=v_1 or A=v_2 or … or A=v_k) = 1. It's easy to prove that P(B ^ (A=v_1 or A=v_2 or … or A=v_i)) = Σ_{j=1..i} P(B ^ A=v_j).

Slide 17 Copyright © Andrew W. Moore Another fact about Multivalued Random Variables: Using the axioms of probability… 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B). And assuming that A obeys P(A=v_i ^ A=v_j) = 0 if i ≠ j, and P(A=v_1 or A=v_2 or … or A=v_k) = 1. It's easy to prove that P(B ^ (A=v_1 or A=v_2 or … or A=v_i)) = Σ_{j=1..i} P(B ^ A=v_j). And thus we can prove P(B) = Σ_{j=1..k} P(B ^ A=v_j).

Slide 18 Copyright © Andrew W. Moore Elementary Probability in Pictures P(~A) + P(A) = 1

Slide 19 Copyright © Andrew W. Moore Elementary Probability in Pictures P(B) = P(B ^ A) + P(B ^ ~A)

Slide 20 Copyright © Andrew W. Moore Elementary Probability in Pictures

Slide 21 Copyright © Andrew W. Moore Elementary Probability in Pictures

Slide 22 Copyright © Andrew W. Moore Conditional Probability. P(A|B) = the fraction of occurrences in which B is true that also have A true. H = "Have a headache", F = "Coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. "Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."

Slide 23 Copyright © Andrew W. Moore Conditional Probability. H = "Have a headache", F = "Coming down with Flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. P(H|F) = fraction of flu occurrences that also have a headache = (#occurrences with flu and headache) / (#occurrences with flu) = (area of "H and F" region) / (area of "F" region) = P(H ^ F) / P(F).

Slide 24 Copyright © Andrew W. Moore Definition of Conditional Probability P(A ^ B) P(A|B) = ----------- P(B) Corollary: The Chain Rule P(A ^ B) = P(A|B) P(B)

Slide 25 Copyright © Andrew W. Moore Probabilistic Inference F H H = “Have a headache” F = “Coming down with Flu” P(H) = 1/10 P(F) = 1/40 P(H|F) = 1/2 One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu” Is this reasoning good?

Slide 26 Copyright © Andrew W. Moore Probabilistic Inference. H = "Have a headache", F = "Coming down with Flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. P(F ^ H) = P(F) P(H|F) = (1/40)(1/2) = 1/80. P(F|H) = P(F ^ H)/P(H) = (1/80)/(1/10) = 1/8 = 0.125.
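The slide's two steps (the chain rule, then the definition of conditional probability) can be checked with exact arithmetic; a minimal Python sketch, with variable names of our choosing:

```python
from fractions import Fraction

p_h = Fraction(1, 10)         # P(H): probability of a headache
p_f = Fraction(1, 40)         # P(F): probability of flu
p_h_given_f = Fraction(1, 2)  # P(H|F)

# Chain rule: P(F ^ H) = P(H|F) P(F)
p_f_and_h = p_h_given_f * p_f
# Definition of conditional probability: P(F|H) = P(F ^ H) / P(H)
p_f_given_h = p_f_and_h / p_h

print(p_f_and_h)    # 1/80
print(p_f_given_h)  # 1/8
```

So the headache only raises the probability of flu from 1/40 to 1/8, not to 1/2: the naive reasoning on the previous slide ignores the prior P(F).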

Slide 27 Copyright © Andrew W. Moore Another way to understand the intuition Thanks to Jahanzeb Sherwani for contributing this explanation:

Slide 28 Copyright © Andrew W. Moore What we just did… P(B|A) = P(A ^ B)/P(A) = P(A|B) P(B)/P(A). This is Bayes Rule. P(B) is called the prior probability, P(A|B) is called the likelihood, and P(B|A) is the posterior probability. Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

Slide 29 Copyright © Andrew W. Moore More General Forms of Bayes Rule: P(B|A) = P(A|B) P(B) / (P(A|B) P(B) + P(A|~B) P(~B)). Conditioning on extra evidence X: P(B|A ^ X) = P(A|B ^ X) P(B|X) / P(A|X).

Slide 30 Copyright © Andrew W. Moore More General Forms of Bayes Rule: for a multivalued B with values v_1, …, v_nB, P(B=v_i|A) = P(A|B=v_i) P(B=v_i) / Σ_{j=1..nB} P(A|B=v_j) P(B=v_j).

Slide 31 Copyright © Andrew W. Moore Useful Easy-to-prove facts: P(A|B) + P(~A|B) = 1, and for a multivalued A, Σ_{k=1..nA} P(A=v_k|B) = 1.

Slide 32 Copyright © Andrew W. Moore The Joint Distribution Recipe for making a joint distribution of M variables: Example: Boolean variables A, B, C

Slide 33 Copyright © Andrew W. Moore Using Bayes Rule in Games. The "Winner" envelope contains a dollar ($1.00) and 4 beads (2 red and 2 black). The "Loser" envelope contains 3 beads (one red and two black) and no money. Question: Someone draws an envelope at random and offers to sell it to you. How much should you pay for the envelope?

Slide 34 Copyright © Andrew W. Moore Using Bayes Rule to Gamble. The "Winner" envelope contains a dollar ($1.00) and 4 beads (2 red and 2 black). The "Loser" envelope contains 3 beads (one red and two black) and no money. Interesting question: Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it is black: how much should you pay for the envelope? Suppose it is red: how much should you pay for the envelope?
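Worked through with Bayes Rule, a black bead should lower your offer and a red bead raise it. A minimal Python sketch (the bead counts are the slide's; the function and variable names are ours):

```python
from fractions import Fraction

p_win = Fraction(1, 2)               # prior: the envelope was drawn at random
p_black_given_win = Fraction(2, 4)   # Winner holds 2 red and 2 black beads
p_black_given_lose = Fraction(2, 3)  # Loser holds 1 red and 2 black beads

def posterior(likelihood_win, likelihood_lose):
    # Bayes rule: P(win | obs) = P(obs|win) P(win) / P(obs)
    num = likelihood_win * p_win
    return num / (num + likelihood_lose * (1 - p_win))

p_win_given_black = posterior(p_black_given_win, p_black_given_lose)
p_win_given_red = posterior(1 - p_black_given_win, 1 - p_black_given_lose)

# The envelope is worth $1 * P(win | observation)
print(p_win_given_black)  # 3/7, so pay about $0.43
print(p_win_given_red)    # 3/5, so pay $0.60
```

A black bead makes the "Loser" envelope more likely (a larger fraction of its beads are black), so the fair price drops below the 50-cent prior value; a red bead pushes it up.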

Slide 36 Copyright © Andrew W. Moore Joint distributions. Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). Example: Boolean variables A, B, C.

| A | B | C |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 1 |

A = 1 if the person is male and 0 if female; B = 1 if the person is married and 0 if not; C = 1 if the person is sick and 0 if not.

Slide 37 Copyright © Andrew W. Moore The Joint Distribution. Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. Example: Boolean variables A, B, C.

| A | B | C | Prob |
|---|---|---|------|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |

Slide 38 Copyright © Andrew W. Moore The Joint Distribution. Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. 3. If you subscribe to the axioms of probability, those numbers must sum to 1. Example: Boolean variables A, B, C.

| A | B | C | Prob |
|---|---|---|------|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |

(The slide also lays these eight probabilities out as areas in a three-set Venn diagram for A, B, C.)
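The recipe can be sketched directly. Here we assume the Prob entry missing from the transcript's table (the row A=1, B=0, C=0) is 0.05, so that the table sums to 1 as step 3 requires:

```python
# The slide's example joint over Boolean A, B, C, keyed by (a, b, c).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 3: the eight numbers must sum to 1.
total = sum(joint.values())
print(total)
```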

Slide 39 Copyright © Andrew W. Moore Data for the JD example. The Census dataset: 48842 instances, a mix of continuous and discrete features (train = 32561, test = 16281); 45222 instances if those with unknown values are removed (train = 30162, test = 15060). Available at: http://ftp.ics.uci.edu/pub/machine-learning-databases Donors: Ronny Kohavi and Barry Becker (1996).

Slide 40 Copyright © Andrew W. Moore Census variables
1-age: continuous.
2-workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3-fnlwgt (final weight): continuous.
4-education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5-education-num: continuous.
6-marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7-occupation:
8-relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

Slide 41 Copyright © Andrew W. Moore Census variables (continued)
9-race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10-sex: Female[0], Male[1].
11-capital-gain: continuous.
12-capital-loss: continuous.
13-hours-per-week: continuous.
14-native-country: nominal.
15-salary: >50K (rich)[2], <=50K (poor)[1].

Slide 42 Copyright © Andrew W. Moore First records of census (train)

```r
> traincensus[1:5,]
  age workcl   fwgt educ educn marital occup relat race sex gain loss hpwk
1  39      6  77516   13    13       3     9     4    1   1 2174    0   40
2  50      2  83311   13    13       1     5     3    1   1    0    0   13
3  38      1 215646    9     9       2     7     4    1   1    0    0   40
4  53      1 234721    7     7       1     7     3    5   1    0    0   40
5  28      1 338409   13    13       1     6     1    5   0    0    0   40
  country salary
1       1      1
2       1      1
3       1      1
4       1      1
5      13      1
>
```

Slide 43 Copyright © Andrew W. Moore Computing frequencies for sex

```r
> countsex=table(traincensus$sex)
> probsex=countsex/32561
> countsex
    0     1
10771 21790
> probsex
        0         1
0.3307945 0.6692055
>
```

Slide 44 Copyright © Andrew W. Moore Computing the joint probabilities

```r
> jdcensus=traincensus[,c(10,13,15)]
> dim(jdcensus)
[1] 32561 3
> jdcensus[1:3,]
  sex hpwk salary
1   1   40      1
2   1   13      1
3   1   40      1
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]
[1] 8220
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.2524492
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.02484567
```

Slide 45 Copyright © Andrew W. Moore

```r
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.0421363
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.01136329
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.3309174
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.09754
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.1336875
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.1070606
```
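The frequency counting the R session does can be sketched in Python on a toy record set; the six records below are made up for illustration, whereas the slides count over the 32561-row census training set:

```python
# Records are (sex, hours-per-week, salary) tuples, as in the R jdcensus frame.
records = [
    (0, 40, 1), (0, 50, 2), (1, 40, 1), (1, 60, 2), (1, 38, 1), (0, 35, 1),
]

def cell_prob(sex, long_hours, salary):
    """Fraction of records falling in one cell of the 2x2x2 joint
    (hours discretized at 40.5, as in the R code)."""
    hits = [r for r in records
            if r[0] == sex and (r[1] > 40.5) == long_hours and r[2] == salary]
    return len(hits) / len(records)

# Analogous to the R dim(...)/dim(...) lines above:
print(cell_prob(0, False, 1))
```

Summing `cell_prob` over all eight cells gives 1, since the cells partition the records.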

Slide 46 Copyright © Andrew W. Moore Using the Joint. Once you have the JD you can ask for the probability of any logical expression involving your attributes.

Slide 47 Copyright © Andrew W. Moore Using the Joint P(Poor Male) = 0.4654

Slide 48 Copyright © Andrew W. Moore Using the Joint P(Poor) = 0.7604

Slide 49 Copyright © Andrew W. Moore Inference with the Joint

Slide 50 Copyright © Andrew W. Moore Inference with the Joint P(Male | Poor) = 0.4654 / 0.7604 = 0.612
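Inference from the joint is just two sums over matching rows. A sketch with the slide's numbers for the "poor" rows (the split of the remaining "rich" mass between the sexes is made up to complete the table):

```python
joint = {
    ("male", "poor"): 0.4654,
    ("female", "poor"): 0.7604 - 0.4654,  # so that P(poor) matches the slide
    ("male", "rich"): 0.12,               # made-up split of the remaining mass
    ("female", "rich"): 0.2396 - 0.12,
}

def prob(pred):
    """Sum the joint over the rows satisfying a predicate."""
    return sum(p for row, p in joint.items() if pred(row))

p_poor = prob(lambda r: r[1] == "poor")
p_male_and_poor = prob(lambda r: r[0] == "male" and r[1] == "poor")
p_male_given_poor = p_male_and_poor / p_poor

print(round(p_male_given_poor, 3))  # 0.612, as on the slide
```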

Slide 51 Copyright © Andrew W. Moore Inference is a big deal. I have this evidence; what is the probability that this conclusion is true? I have a stiff neck: what is the probability that I have meningitis? I see my lights on and it is 9pm: what is the probability that my spouse is already asleep? There are many applications of Bayesian inference in medicine, pharmacy, machine fault diagnosis, etc.

Slide 52 Copyright © Andrew W. Moore Where do Joint Distributions come from? Idea One: Expert Humans Idea Two: Simpler probabilistic facts and some algebra Example: Suppose you knew P(A) = 0.7 P(B|A) = 0.2 P(B|~A) = 0.1 P(C|A^B) = 0.1 P(C|A^~B) = 0.8 P(C|~A^B) = 0.3 P(C|~A^~B) = 0.1 Then you can automatically compute the JD using the chain rule P(A=x ^ B=y ^ C=z) = P(C=z|A=x^ B=y) P(B=y|A=x) P(A=x) In another lecture: Bayes Nets, a systematic way to do this.
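A sketch of that chain-rule computation using the slide's seven numbers (the dictionary names are ours):

```python
# Simpler probabilistic facts, as given on the slide:
p_a = 0.7
p_b_given = {True: 0.2, False: 0.1}                 # P(B|A), P(B|~A)
p_c_given = {(True, True): 0.1, (True, False): 0.8, # P(C|A^B), P(C|A^~B)
             (False, True): 0.3, (False, False): 0.1}

# Chain rule: P(A=x ^ B=y ^ C=z) = P(C=z|A=x ^ B=y) P(B=y|A=x) P(A=x)
joint = {}
for a in (True, False):
    for b in (True, False):
        for c in (True, False):
            pa = p_a if a else 1 - p_a
            pb = p_b_given[a] if b else 1 - p_b_given[a]
            pc = p_c_given[(a, b)] if c else 1 - p_c_given[(a, b)]
            joint[(a, b, c)] = pa * pb * pc

print(sum(joint.values()))  # the eight rows sum to 1
```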

Slide 53 Copyright © Andrew W. Moore Where do Joint Distributions come from? Idea Three: Learn them from data! Prepare to see one of the most impressive learning algorithms you’ll come across in the entire course….

Slide 54 Copyright © Andrew W. Moore Learning a joint distribution. Build a JD table for your attributes in which the probabilities are unspecified, then fill in each row with the fraction of all records that match it.

| A | B | C | Prob |
|---|---|---|------|
| 0 | 0 | 0 | ? |
| 0 | 0 | 1 | ? |
| 0 | 1 | 0 | ? |
| 0 | 1 | 1 | ? |
| 1 | 0 | 0 | ? |
| 1 | 0 | 1 | ? |
| 1 | 1 | 0 | ? |
| 1 | 1 | 1 | ? |

After filling in:

| A | B | C | Prob |
|---|---|---|------|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |

The 0.25 entry (A=1, B=1, C=0), for instance, is the fraction of all records in which A and B are True but C is False.

Slide 55 Copyright © Andrew W. Moore Example of Learning a Joint This Joint was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]

Slide 56 Copyright © Andrew W. Moore Where are we? We have recalled the fundamentals of probability We have become content with what JDs are and how to use them And we even know how to learn JDs from data.

Slide 57 Copyright © Andrew W. Moore Density Estimation. Our Joint Distribution learner is our first example of something called Density Estimation. A Density Estimator learns a mapping from a set of attributes to a probability: Input Attributes -> Density Estimator -> Probability.

Slide 58 Copyright © Andrew W. Moore Density Estimation. Compared with the other two main kinds of models: Input Attributes -> Regressor -> Prediction of real-valued output. Input Attributes -> Density Estimator -> Probability. Input Attributes -> Classifier -> Prediction of categorical output.

Slide 59 Copyright © Andrew W. Moore Evaluating density estimation. For a classifier or a regressor we have a test-set criterion for estimating performance on future data*: test-set accuracy. What is the analogous criterion for a density estimator? (*See the Decision Tree or Cross Validation lecture for more detail.)

Slide 60 Copyright © Andrew W. Moore Evaluating a density estimator. Given a record x, a density estimator M can tell you how likely the record is: P̂(x|M). Given a dataset with R records, a density estimator can tell you how likely the dataset is: P̂(dataset|M) = P̂(x_1 ^ x_2 ^ … ^ x_R | M) = Π_{k=1..R} P̂(x_k|M). (Under the assumption that all records were independently generated from the Density Estimator's JD.)

Slide 61 Copyright © Andrew W. Moore The Auto-mpg dataset. Donor: Quinlan, R. (1993). Number of Instances: 398 minus 6 with missing values = 392 (training: 196, test: 196). Number of Attributes: 9 including the class attribute. Attribute Information: 1. mpg: continuous (discretized at 25: bad/good). 2. cylinders: multi-valued discrete. 3. displacement: continuous (discretized at 200: low/high). 4. horsepower: continuous (discretized at 90: low/high). 5. weight: continuous (discretized at 3000: low/high). 6. acceleration: continuous (discretized at 15: low/high). 7. model year: multi-valued discrete (discretized: 70-74, 75-77, 78-82). 8. origin: multi-valued discrete. 9. car name: string (unique for each instance). Note: horsepower has 6 missing values.

Slide 62 Copyright © Andrew W. Moore The auto-mpg dataset

```
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
.....................................................
27.0 4 140.0 86.00 2790. 15.6 82 1 "ford mustang gl"
44.0 4 97.00 52.00 2130. 24.6 82 2 "vw pickup"
32.0 4 135.0 84.00 2295. 11.6 82 1 "dodge rampage"
28.0 4 120.0 79.00 2625. 18.6 82 1 "ford ranger"
31.0 4 119.0 82.00 2720. 19.4 82 1 "chevy s-10"
```

Slide 63 Copyright © Andrew W. Moore A subset of auto-mpg

```
      levelmpg modelyear maker
 [1,] "bad"    "70-74"   "america"
 [2,] "bad"    "70-74"   "america"
 [3,] "bad"    "70-74"   "america"
 [4,] "bad"    "70-74"   "america"
 [5,] "bad"    "70-74"   "america"
 [6,] "bad"    "70-74"   "america"
 ..................................
 [7,] "good"   "78-82"   "america"
 [8,] "good"   "78-82"   "europe"
 [9,] "good"   "78-82"   "america"
[10,] "good"   "78-82"   "america"
[11,] "good"   "78-82"   "america"
```

Slide 64 Copyright © Andrew W. Moore Mpg: training and test sets

```r
indices_samples=sample(1:392,196)
mpgtrain=smallmpg[indices_samples,]
dim(mpgtrain)
[1] 196 3
mpgtest=smallmpg[-indices_samples,]
# computing the first joint probability (fraction of training records)
a1=dim(mpgtrain[mpgtrain[,1]=="bad" & mpgtrain[,2]=="70-74" & mpgtrain[,3]=="america",])[1]/dim(mpgtrain)[1]
```

Slide 66 Copyright © Andrew W. Moore This probability has been computed considering that any joint probability is at least 1/10^20 = 10^-20.

Slide 67 Copyright © Andrew W. Moore A small dataset: Miles Per Gallon From the UCI repository (thanks to Ross Quinlan) 192 Training Set Records

Slide 68 Copyright © Andrew W. Moore A small dataset: Miles Per Gallon 192 Training Set Records

Slide 69 Copyright © Andrew W. Moore A small dataset: Miles Per Gallon 192 Training Set Records. Dataset can be mpgtrain or mpgtest.

Slide 70 Copyright © Andrew W. Moore Log Probabilities Since probabilities of datasets get so small we usually use log probabilities
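A quick Python illustration of why (the per-record probabilities are made up; real datasets multiply hundreds of small factors):

```python
import math

# Made-up per-record probabilities under some density estimator M.
record_probs = [0.01, 0.002, 0.03, 0.005] * 50   # 200 records

# The log likelihood is a manageable negative number...
log_likelihood = sum(math.log(p) for p in record_probs)
print(log_likelihood)

# ...whereas the direct product underflows double precision to 0.0.
product = math.prod(record_probs)
print(product)
```

Sums of logs stay within floating-point range where the raw product (here around 10^-426) cannot.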

Slide 71 Copyright © Andrew W. Moore A small dataset: Miles Per Gallon 192 Training Set Records. In our case, log P(dataset) = -510.79.

Slide 72 Copyright © Andrew W. Moore Summary: The Good News. We have seen a way to learn a density estimator from data. Density estimators can be used to: rank records by probability, and thereby detect anomalous cases ("outliers"); do inference: compute P(E1|E2); build Bayesian classifiers (covered later).

Slide 73 Copyright © Andrew W. Moore Summary: The Bad News Density estimation by directly learning the joint is trivial, mindless and dangerous

Slide 74 Copyright © Andrew W. Moore Using a test set An independent test set with 196 cars has a worse log likelihood (actually it’s a billion quintillion quintillion quintillion quintillion times less likely) ….Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!

Slide 75 Copyright © Andrew W. Moore Overfitting Density Estimators If this ever happens, it means there are certain combinations that we learn are impossible

Slide 76 Copyright © Andrew W. Moore Using a test set. The only reason that our test set didn't score -infinity is that my code is hard-wired to always predict a probability of at least one in 10^20. We need Density Estimators that are less prone to overfitting.

Slide 77 Copyright © Andrew W. Moore Naïve Density Estimation. The problem with the joint density estimator is that it simply mirrors the training sample. We need something that generalizes more usefully. The naïve model generalizes more effectively: it assumes that each attribute is distributed independently of the others.

Slide 78 Copyright © Andrew W. Moore A note about independence. Assume A and B are Boolean Random Variables. Then "A and B are independent" if and only if P(A|B) = P(A). "A and B are independent" is often notated as A ⊥ B.

Slide 79 Copyright © Andrew W. Moore Independence Theorems. Assume P(A|B) = P(A). Then P(A ^ B) = P(A|B) P(B) = P(A) P(B): since P(A|B) = P(A ^ B)/P(B), independence gives P(A) = P(A ^ B)/P(B). Assume P(A|B) = P(A). Then P(B|A) = P(A ^ B)/P(A) = P(A) P(B)/P(A) (by the previous result) = P(B).

Slide 80 Copyright © Andrew W. Moore Independence Theorems. Assume P(A|B) = P(A). Then P(~A|B) = P(~A ^ B)/P(B) = [P(B) - P(A ^ B)]/P(B) = 1 - P(A ^ B)/P(B) = 1 - P(A) (by independence) = P(~A). Assume P(A|B) = P(A). Then P(A|~B) = P(A ^ ~B)/P(~B) = [P(A) - P(A ^ B)]/[1 - P(B)] = [P(A) - P(A) P(B)]/[1 - P(B)] = P(A)[1 - P(B)]/[1 - P(B)] = P(A).

Slide 81 Copyright © Andrew W. Moore Multivalued Independence. For multivalued Random Variables A and B, A ⊥ B if and only if for all i, j: P(A=v_i | B=v_j) = P(A=v_i), from which you can then prove things like P(A=v_i ^ B=v_j) = P(A=v_i) P(B=v_j).

Slide 82 Copyright © Andrew W. Moore Independently Distributed Data. Let x[i] denote the i'th field of record x. The independently distributed assumption says that for any i, v, u_1, u_2, …, u_{i-1}, u_{i+1}, …, u_M: P(x[i]=v | x[1]=u_1, …, x[i-1]=u_{i-1}, x[i+1]=u_{i+1}, …, x[M]=u_M) = P(x[i]=v). Or in other words, x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}. This is often written as x[i] ⊥ {x[1], …, x[i-1], x[i+1], …, x[M]}.

Slide 83 Copyright © Andrew W. Moore Back to Naïve Density Estimation. Let x[i] denote the i'th field of record x: Naïve DE assumes x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}. Example: Suppose that each record is generated by randomly shaking a green die and a red die. Dataset 1: A = red value, B = green value. Dataset 2: A = red value, B = sum of values. Dataset 3: A = sum of values, B = difference of values. Which of these datasets violates the naïve assumption?

Slide 84 Copyright © Andrew W. Moore Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)?

Slide 85 Copyright © Andrew W. Moore Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)? = P(A|~B^C^~D) P(~B^C^~D) = P(A) P(~B^C^~D) = P(A) P(~B|C^~D) P(C^~D) = P(A) P(~B) P(C^~D) = P(A) P(~B) P(C|~D) P(~D) = P(A) P(~B) P(C) P(~D)
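In code the factorization above collapses to one line; a sketch with made-up marginals for A, B, C, D:

```python
# Made-up marginal probabilities P(A), P(B), P(C), P(D).
p = {"A": 0.3, "B": 0.6, "C": 0.2, "D": 0.9}

# P(A ^ ~B ^ C ^ ~D) = P(A) P(~B) P(C) P(~D) under the naive assumption.
prob = p["A"] * (1 - p["B"]) * p["C"] * (1 - p["D"])
print(prob)
```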

Slide 86 Copyright © Andrew W. Moore Naïve Distribution General Case. Suppose x[1], x[2], …, x[M] are independently distributed. Then P(x[1]=u_1 ^ x[2]=u_2 ^ … ^ x[M]=u_M) = Π_{i=1..M} P(x[i]=u_i). So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. So we can do any inference. But how do we learn a Naïve Density Estimator?

Slide 87 Copyright © Andrew W. Moore Learning a Naïve Density Estimator. Another trivial learning algorithm: estimate P̂(x[i]=u) as the fraction of records in which x[i]=u.
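A minimal sketch of that algorithm in Python (the records and attribute names are made up for illustration):

```python
from collections import Counter

# Records over Boolean attributes A, B, C.
records = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0)]
names = ("A", "B", "C")

# Learning = one frequency count per attribute.
counts = {n: Counter(r[i] for r in records) for i, n in enumerate(names)}
marginals = {n: counts[n][1] / len(records) for n in names}
print(marginals)  # e.g. P(A=1) is the fraction of records with A=1

def naive_prob(row):
    """P(row) under the naive assumption: product of per-attribute marginals."""
    out = 1.0
    for n, v in zip(names, row):
        out *= marginals[n] if v == 1 else 1 - marginals[n]
    return out

print(naive_prob((1, 0, 1)))
```

Unlike the full joint learner, this needs only M counts rather than 2^M cells, which is why it copes with many attributes.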

Slide 88 Copyright © Andrew W. Moore Comparison

| Joint DE | Naïve DE |
|---|---|
| Can model anything | Can model only very boring distributions |
| Can model a noisy copy of the original sample | Cannot do that |
| Given 100 records and more than 6 Boolean attributes will screw up badly | Given 100 records and 10,000 multivalued attributes will be fine |

Slide 89 Copyright © Andrew W. Moore Empirical Results: "MPG". The "MPG" dataset consists of 392 records and 8 attributes. A tiny part of the DE learned by "Joint"; the DE learned by "Naive".


Slide 91 Copyright © Andrew W. Moore Empirical Results: "Weight vs. MPG". Suppose we train only from the "Weight" and "MPG" attributes. The DE learned by "Joint"; the DE learned by "Naive".


Slide 93 Copyright © Andrew W. Moore "Weight vs. MPG": the best that Naïve can do. The DE learned by "Joint"; the DE learned by "Naive".