
1 Probability theory retro
14/03/2017

2 Probability (atomic) events (A) and probability space (Ω). Axioms:
- 0 ≤ P(A) ≤ 1 and P(Ω) = 1 - If A1, A2, … are mutually exclusive events (Ai ∩ Aj = ∅, i ≠ j), then P(∪k Ak) = ∑k P(Ak)

3 - P(A ∪ B) = P(A) + P(B) − P(A ∩ B) - P(A) = P(A ∩ B) + P(A ∩ ¬B)
If A ⊆ B, then P(A) ≤ P(B) and P(B − A) = P(B) − P(A)

4 Conditional probability
Conditional probability is the probability of some event A, given the occurrence of some other event B: P(A|B) = P(A ∩ B)/P(B). Chain rule: P(A ∩ B) = P(A|B)·P(B). Example: A: headache, B: influenza, P(A) = 1/10, P(B) = 1/40, P(A|B) = ?
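A minimal Python sketch of the definition above. P(A) and P(B) are the slide's values, while the joint probability P(A ∩ B) is a made-up number used only for illustration (the slide does not give it):

    # Conditional probability and the chain rule (illustrative numbers only)
    p_a = 1 / 10         # P(A): headache
    p_b = 1 / 40         # P(B): influenza
    p_a_and_b = 1 / 80   # hypothetical P(A ∩ B), assumed for this sketch

    p_a_given_b = p_a_and_b / p_b        # P(A|B) = P(A ∩ B) / P(B)
    print(p_a_given_b)                   # 0.5
    print(p_a_given_b * p_b)             # chain rule recovers P(A ∩ B) = 0.0125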

5 Conditional probability

6 Independence of events
A and B are independent iff P(A|B) = P(A). Corollary: P(A ∩ B) = P(A)P(B) and P(B|A) = P(B)

7 Product rule: for arbitrary events A1, A2, …, An, P(A1 ∩ A2 ∩ … ∩ An) = P(An|A1 ∩ … ∩ An-1)·P(An-1|A1 ∩ … ∩ An-2)·…·P(A2|A1)·P(A1)
Law of total probability: if the events A1, A2, …, An form a complete system of events (a partition of Ω) and P(Ai) > 0 for each i, then P(B) = ∑i=1..n P(B | Ai)P(Ai)
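A small numeric check of the law of total probability; the partition and the conditional probabilities below are invented for the example:

    # Law of total probability over a three-event partition (toy numbers)
    p_A = [0.5, 0.3, 0.2]           # P(A_i), a complete system of events
    p_B_given_A = [0.9, 0.4, 0.1]   # assumed P(B | A_i)

    p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))
    print(p_B)                      # 0.59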

8 Bayes rule P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B)
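Continuing the toy numbers from the previous sketch, Bayes rule turns P(B | A1) into P(A1 | B); all values are illustrative:

    # Bayes rule: P(A_1 | B) = P(B | A_1) P(A_1) / P(B)   (toy numbers)
    p_A = [0.5, 0.3, 0.2]
    p_B_given_A = [0.9, 0.4, 0.1]
    p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))   # evidence, 0.59

    p_A1_given_B = p_B_given_A[0] * p_A[0] / p_B
    print(p_A1_given_B)                                      # ≈ 0.763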

9 Random variable: ξ: Ω → R. Random vectors…

10 Cumulative distribution function (CDF)
F(x) = P(ξ < x); F(x1) ≤ F(x2) if x1 < x2; limx→-∞ F(x) = 0, limx→∞ F(x) = 1; with this definition (P(ξ < x)) F(x) is non-decreasing and left-continuous

11 Discrete vs. continuous random variables
Discrete: its value set forms a finite or countably infinite sequence. Continuous: we assume that a density f(x) exists (for example on an interval (a, b))

12 Probability density functions (pdf)
F(b) − F(a) = P(a < ξ < b) = ∫_a^b f(x)dx; f(x) = F′(x) and F(x) = ∫_{−∞}^x f(t)dt
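A sketch (assuming SciPy is available) checking numerically that F(b) − F(a) equals the integral of the density over (a, b), here for a standard normal:

    from scipy.stats import norm
    from scipy.integrate import quad

    a, b = -1.0, 2.0
    area, _ = quad(norm.pdf, a, b)        # integral of f(x) over (a, b)
    print(area)                           # ≈ 0.8186
    print(norm.cdf(b) - norm.cdf(a))      # F(b) − F(a), same value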

13 Histogram: empirical estimation of a density
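A minimal NumPy sketch of a histogram used as an empirical density estimate; the sample is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=1.0, size=10_000)    # synthetic data

    density, edges = np.histogram(sample, bins=30, density=True)
    # With density=True the bar areas sum to 1, so the bars approximate f(x)
    print(np.sum(density * np.diff(edges)))                 # ≈ 1.0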

14 Independence of random variables
ξ and η are independent iff for any a ≤ b, c ≤ d: P(a ≤ ξ ≤ b, c ≤ η ≤ d) = P(a ≤ ξ ≤ b)·P(c ≤ η ≤ d).

15 Composition (sum) of random variables
Discrete case: let ζ = ξ + η, where ξ and η are independent; then rn = P(ζ = n) = ∑k=−∞..∞ P(ξ = n − k, η = k) = ∑k P(ξ = n − k)P(η = k)

16 Expected value: if ξ can take the values x1, x2, … with probabilities p1, p2, …, then M(ξ) = ∑i xi pi. Continuous case: M(ξ) = ∫_{−∞}^{∞} x f(x)dx
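A sketch of both formulas: the discrete sum directly, and the continuous integral evaluated numerically for a standard normal (SciPy assumed available; the discrete values are arbitrary):

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    # Discrete: M(ξ) = Σ_i x_i p_i
    x = np.array([1.0, 2.0, 3.0])
    p = np.array([0.2, 0.5, 0.3])
    print(np.sum(x * p))                                    # 2.1

    # Continuous: M(ξ) = ∫ x f(x) dx, ≈ 0 for the standard normal
    mean, _ = quad(lambda t: t * norm.pdf(t), -np.inf, np.inf)
    print(mean)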

17 Properties of expected value
M(c) = cM() M( + ) = M() + M() If  and  are independent random variables, then M() = M()M() 16

18 Standard deviation: D(ξ) = (M[(ξ − M(ξ))2])1/2; D2(ξ) = M(ξ2) − M2(ξ)

19 Properties of standard deviation
D2(aξ + b) = a2D2(ξ); if ξ1, ξ2, …, ξn are independent random variables, then D2(ξ1 + ξ2 + … + ξn) = D2(ξ1) + D2(ξ2) + … + D2(ξn)

20 Correlation. Covariance: c = M[(ξ − M(ξ))(η − M(η))]
c = 0 if ξ and η are independent. Correlation coefficient: r = c / (D(ξ)D(η)), the covariance normalised into [−1, 1]
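A quick NumPy sketch of sample covariance and the correlation coefficient; the two variables are synthetic and constructed so that r ≈ 0.8:

    import numpy as np

    rng = np.random.default_rng(1)
    xi = rng.normal(size=5_000)
    eta = 0.8 * xi + 0.6 * rng.normal(size=5_000)   # correlated with xi

    c = np.cov(xi, eta)[0, 1]          # covariance estimate
    r = np.corrcoef(xi, eta)[0, 1]     # r = c / (D(ξ)D(η)), lies in [-1, 1]
    print(c, r)                        # r ≈ 0.8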

21 Well-known distributions
Normal/Gaussian. Binomial: ξ ~ B(n, p), M(ξ) = np, D2(ξ) = np(1 − p), so D(ξ) = (np(1 − p))1/2
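A short check of the binomial mean and variance with SciPy (assumed available); n and p are arbitrary example values:

    from scipy.stats import binom

    n, p = 20, 0.3
    print(binom.mean(n, p), n * p)               # both 6.0
    print(binom.var(n, p), n * p * (1 - p))      # both 4.2
    print(binom.std(n, p))                       # sqrt(np(1-p)) ≈ 2.05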

22 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, with the permission of the authors and the publisher

23 Bayes classification

24 Classification. Supervised learning: based on training examples (E), learn a model which performs well on previously unseen examples. Classification: a supervised learning task of categorising entities into a predefined set of classes

25 Posterior, likelihood, evidence
P(j | x) = P(x | j) . P (j) / P(x) Posterior = (Likelihood. Prior) / Evidence Where in case of two categories Pattern Classification, Chapter 2 (Part 1)

26 Pattern Classification, Chapter 2 (Part 1)

27 Pattern Classification, Chapter 2 (Part 1)

28 Bayes Classifier Decision given the posterior probabilities
x is an observation for which: if P(ω1 | x) > P(ω2 | x), decide that the true state of nature = ω1; if P(ω1 | x) < P(ω2 | x), decide that the true state of nature = ω2. This rule minimizes the probability of error. Pattern Classification, Chapter 2 (Part 1)
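A minimal sketch of this decision rule for two classes; the Gaussian class-conditionals and the priors are assumptions made only for the example, not taken from the slides:

    from scipy.stats import norm

    def posteriors(x, prior1=0.5, prior2=0.5):
        """Return (P(ω1|x), P(ω2|x)) for two assumed Gaussian class-conditionals."""
        like1 = norm.pdf(x, loc=0.0, scale=1.0)   # assumed P(x|ω1) ~ N(0, 1)
        like2 = norm.pdf(x, loc=2.0, scale=1.0)   # assumed P(x|ω2) ~ N(2, 1)
        evidence = like1 * prior1 + like2 * prior2
        return like1 * prior1 / evidence, like2 * prior2 / evidence

    p1, p2 = posteriors(0.7)
    print("decide ω1" if p1 > p2 else "decide ω2")   # decide ω1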

29 Classifiers, Discriminant Functions and Decision Surfaces
The multi-category case: a set of discriminant functions gi(x), i = 1, …, c. The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) ∀ j ≠ i Pattern Classification, Chapter 2 (Part 2)

30 For the minimum error rate, we take
gi(x) = P(i | x) (max. discrimination corresponds to max. posterior!) gi(x)  P(x | i) P(i) gi(x) = ln P(x | i) + ln P(i) (ln: natural logarithm!) Pattern Classification, Chapter 2 (Part 2)

31 Feature space divided into c decision regions
if gi(x) > gj(x) ∀ j ≠ i, then x is in Ri (Ri means: assign x to ωi). The two-category case: a classifier is a “dichotomizer” that has two discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x). Decide ω1 if g(x) > 0; otherwise decide ω2 Pattern Classification, Chapter 2 (Part 2)

32 The computation of g(x)
Pattern Classification, Chapter 2 (Part 2)

33 Discriminant functions of the Bayes Classifier with Normal Density
Pattern Classification, Chapter 2 (Part 1)

34 The Normal Density: univariate density
p(x) = (1/((2π)1/2 σ)) exp[−(1/2)((x − μ)/σ)2], a continuous density which is analytically tractable. A lot of processes are asymptotically Gaussian: handwritten characters or speech sounds can be viewed as an ideal or prototype corrupted by a random process (central limit theorem). Where: μ = mean (or expected value) of x, σ2 = expected squared deviation, or variance Pattern Classification, Chapter 2 (Part 2)
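A sketch writing the univariate density out explicitly and comparing it with scipy.stats.norm; μ and σ are arbitrary example values:

    import numpy as np
    from scipy.stats import norm

    def gauss_pdf(x, mu, sigma):
        # p(x) = 1/(sqrt(2π)·σ) · exp(−½((x − μ)/σ)²)
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    print(gauss_pdf(1.0, mu=0.0, sigma=2.0))
    print(norm.pdf(1.0, loc=0.0, scale=2.0))   # same value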

35 Pattern Classification, Chapter 2 (Part 2)

36 Multivariate density
The multivariate normal density in d dimensions is: p(x) = (1/((2π)d/2 |Σ|1/2)) exp[−(1/2)(x − μ)t Σ−1 (x − μ)], where: x = (x1, x2, …, xd)t (t stands for the transpose), μ = (μ1, μ2, …, μd)t is the mean vector, Σ is the d×d covariance matrix, and |Σ| and Σ−1 are its determinant and inverse, respectively Pattern Classification, Chapter 2 (Part 2)
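A sketch of the d-dimensional density evaluated directly with NumPy and checked against scipy.stats.multivariate_normal; μ, Σ and x are arbitrary example values:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mvn_pdf(x, mu, Sigma):
        d = len(mu)
        diff = x - mu
        norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
        return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.3],
                      [0.3, 1.0]])
    x = np.array([0.5, 0.5])

    print(mvn_pdf(x, mu, Sigma))
    print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # same value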

37 Discriminant Functions for the Normal Density
We saw that minimum error-rate classification can be achieved by using the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi). Case of the multivariate normal: Pattern Classification, Chapter 2 (Part 3)
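A sketch of gi(x) = ln P(x | ωi) + ln P(ωi) with Gaussian class-conditional densities; the two classes' means, covariances and priors are assumptions chosen only to make the example run:

    import numpy as np
    from scipy.stats import multivariate_normal

    means  = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]        # assumed μ_i
    covs   = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]     # assumed Σ_i
    priors = [0.6, 0.4]                                          # assumed P(ω_i)

    def g(x, i):
        # g_i(x) = ln P(x|ω_i) + ln P(ω_i)
        return multivariate_normal(means[i], covs[i]).logpdf(x) + np.log(priors[i])

    x = np.array([1.0, 1.5])
    scores = [g(x, i) for i in range(len(priors))]
    print("assign x to ω%d" % (int(np.argmax(scores)) + 1))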

38 Case Σi = σ2·I (I stands for the identity matrix)
Pattern Classification, Chapter 2 (Part 3)

39 A classifier that uses linear discriminant functions is called “a linear machine”
The decision surfaces for a linear machine are pieces of hyperplanes defined by: gi(x) = gj(x) Pattern Classification, Chapter 2 (Part 3)

40 The hyperplane is always orthogonal to the line linking the means!
Pattern Classification, Chapter 2 (Part 3)

41 The hyperplane separating Ri and Rj is always orthogonal to the line linking the means!
Pattern Classification, Chapter 2 (Part 3)

42 Pattern Classification, Chapter 2 (Part 3)

43 Pattern Classification, Chapter 2 (Part 3)

44 Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!)
The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! Pattern Classification, Chapter 2 (Part 3)

45 Pattern Classification, Chapter 2 (Part 3)

46 Pattern Classification, Chapter 2 (Part 3)

47 Case Σi = arbitrary: the covariance matrices are different for each category. The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids Pattern Classification, Chapter 2 (Part 3)

48 Pattern Classification, Chapter 2 (Part 3)

49 Pattern Classification, Chapter 2 (Part 3)

50 Exercise: select the optimal decision, where Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5) (normal distribution), P(x | ω2) ~ N(1.5, 0.2), P(ω1) = 2/3, P(ω2) = 1/3 Pattern Classification, Chapter 2
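A numerical sketch of the exercise: pick the class with the larger P(x | ωi)P(ωi). The second parameter of N(·, ·) is treated here as the standard deviation; the slide does not state whether it means σ or σ², so adjust the scale arguments if it is the variance:

    from scipy.stats import norm

    priors = {"ω1": 2 / 3, "ω2": 1 / 3}
    likes  = {"ω1": norm(loc=2.0, scale=0.5),    # P(x|ω1) ~ N(2, 0.5)
              "ω2": norm(loc=1.5, scale=0.2)}    # P(x|ω2) ~ N(1.5, 0.2)

    def decide(x):
        scores = {w: likes[w].pdf(x) * priors[w] for w in priors}
        return max(scores, key=scores.get)

    for x in (1.0, 1.5, 1.8, 2.5):
        print(x, decide(x))    # the decision flips to ω2 only near x = 1.5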

51 Parameter estimation Pattern Classification, Chapter 3

52 Data availability in a Bayesian framework
We could design an optimal classifier if we knew: P(ωi) (priors) and P(x | ωi) (class-conditional densities). Unfortunately, we rarely have this complete information! We design a classifier from a training sample: there is no problem with prior estimation, but samples are often too small for class-conditional estimation (large dimension of the feature space!)

53 A priori information about the problem
E.g. assume normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi), characterized by 2 parameters. Estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation. The results are nearly identical, but the approaches are different

54 Parameters in ML estimation are fixed but unknown!
The best parameters are obtained by maximizing the probability of obtaining the samples observed. Bayesian methods view the parameters as random variables having some known distribution. In either approach, we use P(ωi | x) for our classification rule!

55 Use the information provided by the training samples to estimate
 = (1, 2, …, c), each i (i = 1, 2, …, c) is associated with each category Suppose that D contains n samples, x1, x2,…, xn ML estimate of  is, by definition the value that maximizes P(D | ) “It is the value of  that best agrees with the actually observed training sample” 2

56 Let θ = (θ1, θ2, …, θp)t and let ∇θ be the gradient operator
We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ). New problem statement: determine the θ that maximizes the log-likelihood

57 Example: univariate normal density, μ and σ are unknown, i.e. θ = (θ1, θ2) = (μ, σ2) Pattern Classification, Chapter 3

58 Sum over the sample, then solve (1) and (2) for the ML estimates: Pattern Classification, Chapter 3
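The equations referred to as (1) and (2) were images and did not survive extraction; for reference, the standard ML solutions for the univariate Gaussian are the sample mean and the 1/n sample variance:

    \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k ,
    \qquad
    \hat{\sigma}^{2} = \frac{1}{n}\sum_{k=1}^{n} \left(x_k - \hat{\mu}\right)^{2}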

59 Bayesian Estimation: in MLE, θ was assumed to be fixed
In BE, θ is a random variable. The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification. Goal: compute P(ωi | x, D). Given the sample D, the Bayes formula can be written:

60 Pattern Classification, Chapter 1

61 Bayesian Parameter Estimation: Gaussian Case
Goal: estimate μ using the a-posteriori density P(μ | D). The univariate case: μ is the only unknown parameter (μ0 and σ0 are known!)

62 Reproducing density Identifying (1) and (2) yields:
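The equations on this slide were also images; for reference, the standard result (univariate Gaussian with known σ and prior μ ~ N(μ0, σ0²), as in Duda–Hart) is that P(μ | D) is again normal, P(μ | D) ~ N(μn, σn²), with:

    \mu_n = \frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\hat{\mu}_n
          + \frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\mu_0 ,
    \qquad
    \sigma_n^{2} = \frac{\sigma_0^{2}\,\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}} ,
    \qquad
    \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k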

63 The univariate case P(x | D)
P( | D) computed P(x | D) remains to be computed! It provides: (Desired class-conditional density P(x | Dj, j)) Therefore: P(x | Dj, j) together with P(j) and using Bayes formula, we obtain the Bayesian classification rule:

64 Bayesian Parameter Estimation: General Theory
The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are: the form of P(x | θ) is assumed to be known, but the value of θ is not known exactly; our knowledge about θ is assumed to be contained in a known prior density P(θ); the rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to P(x)

65 MLE vs. Bayes estimation
As n → ∞ they become equal! MLE: simple and fast (optimisation vs. numerical integration). Bayes estimation: we can express our uncertainty about the parameters through the prior P(θ)

