1
Sample Midterm question.
Sue wants to build a model to predict movie ratings. She has a matrix of data, where for M movies and U users she has collected ratings between 1 and 5. Not all users have rated all movies. The missing entries are indicated by a “?”. Sue has built a model that can predict the rating of any unrated movie for any user u in the database. The model always predicts a real number, q, between 1 and 5. To further optimize her predictions she decides to train another model that maps her predictions q to new rating estimates q’. To train this model she uses all observed ratings in the database (i.e. ignoring “?”) and fits a neural network as follows: Here “i” runs over all observed ratings, and |.| denotes the absolute value.
1) Derive the gradients.
2) Give pseudo-code for a stochastic gradient descent algorithm for learning “a” and “b”.
3) Given a fixed step-size, explain whether this algorithm will converge after infinitely many gradient updates.
4) Jimmy has another algorithm that also predicts ratings. Sue and Jimmy decide to combine their models and compute a bagged estimate. Calling Sue's prediction q_sue and Jimmy's prediction q_jim, give an expression for a combined prediction using bagging.
5) Explain whether bagging increases or decreases variance and why.
2
Bayesian Learning
Instructor: Max Welling
Read chapter 6 in the book.
3
Probabilities
Building models with probability distributions is important because:
We can naturally include prior knowledge.
We can naturally encode uncertainty.
We can build models that are naturally protected against overfitting.
We define multivariate probability distributions over discrete sample spaces by $P(x) \ge 0$ and $\sum_x P(x) = 1$.
Probability densities are different beasts. They are defined over continuous sample spaces, and we have $p(x) \ge 0$ and $\int p(x)\,dx = 1$.
Can P(x) > 1 for probability densities? How about discrete distributions?
4
Conditional Distributions
A conditional distribution expresses the remaining uncertainty in x, after we know the value for y.
Bayes rule is useful for assessing diagnostic probability from causal probability: P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck: P(m|s) = P(s|m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
Note 1: even though the probability of having a stiff neck given meningitis is very large (0.8), the posterior probability of meningitis given a stiff neck is still very small (why?).
Note 2: P(s|m) only depends on meningitis (a stable fact), but P(m|s) depends on whether e.g. the flu is around.
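The meningitis calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior` and its parameter names are chosen here for illustration and do not come from the slides.

```python
def posterior(likelihood, prior, evidence):
    """Bayes rule: P(cause | effect) = P(effect | cause) * P(cause) / P(effect)."""
    return likelihood * prior / evidence

# Numbers from the slide: P(s|m) = 0.8, P(m) = 0.0001, P(s) = 0.1
p_m_given_s = posterior(likelihood=0.8, prior=0.0001, evidence=0.1)
print(p_m_given_s)
```

The small prior P(m) dominates the result, which is why the posterior stays small even though P(s|m) is large.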
5
(Conditional) Independence
There are two equivalent ways you can test for independence between two random variables: $P(x, y) = P(x)\,P(y)$, or equivalently $P(x \mid y) = P(x)$.
Conditional independence is a very powerful modeling assumption. It says: $P(x, y \mid z) = P(x \mid z)\,P(y \mid z)$.
Note that this does not mean that P(x,y)=P(x)P(y): x and y are independent only given a third variable z.
6
Example C.I.
[Figure: graph with nodes smog, asthma, lung cancer]
Asthma and lung cancer are not independent (more people with asthma also suffer from lung cancer). However, there is a third cause that explains why: smog causes both asthma and lung cancer. Given that we know the presence of smog, asthma and lung cancer become independent. This type of independency can be represented graphically using a graphical model.
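A small numerical sketch can make the smog example concrete. All probabilities below are hypothetical values picked for illustration (none appear in the slides); the only modeling assumption is that smog is the common cause of asthma and lung cancer.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
p_smog = {True: 0.3, False: 0.7}
p_asthma_given_smog = {True: 0.4, False: 0.1}   # P(asthma = True | smog)
p_cancer_given_smog = {True: 0.2, False: 0.05}  # P(cancer = True | smog)

def bern(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true

# joint[(s, a, c)] = P(s) * P(a | s) * P(c | s)
joint = {
    (s, a, c): p_smog[s]
    * bern(p_asthma_given_smog[s], a)
    * bern(p_cancer_given_smog[s], c)
    for s, a, c in product([True, False], repeat=3)
}

def prob(pred):
    """Sum the joint over all outcomes (s, a, c) that satisfy pred."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

# Marginally, asthma and cancer are dependent.
p_a = prob(lambda s, a, c: a)
p_c = prob(lambda s, a, c: c)
p_ac = prob(lambda s, a, c: a and c)

# Given smog = True, they are independent.
p_s = prob(lambda s, a, c: s)
p_a_given_s = prob(lambda s, a, c: s and a) / p_s
p_c_given_s = prob(lambda s, a, c: s and c) / p_s
p_ac_given_s = prob(lambda s, a, c: s and a and c) / p_s

print(p_ac, p_a * p_c)                          # differ
print(p_ac_given_s, p_a_given_s * p_c_given_s)  # agree up to rounding
```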
7
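Bayesian Networks
[Figure: graph with nodes smog, asthma, lung cancer]
To every graphical model corresponds a probability distribution. For the smog example: $P(\text{smog}, \text{asthma}, \text{cancer}) = P(\text{smog})\,P(\text{asthma} \mid \text{smog})\,P(\text{cancer} \mid \text{smog})$.
More generally: $P(x_1, \dots, x_n) = \prod_i P(x_i \mid \mathrm{pa}(x_i))$, where $\mathrm{pa}(x_i)$ denotes the parents of $x_i$ in the graph.
To every graphical model corresponds a list of (conditional) independency relations that we can either read off from the graph, or prove using the corresponding expression.
[Figure: graph with nodes earthquake (Eq), bike falls (Bf), alarm]
In this example we have (writing A for the alarm): $P(Eq, Bf, A) = P(Eq)\,P(Bf)\,P(A \mid Eq, Bf)$.
This implies marginal independence between Eq and Bf: $P(Eq, Bf) = P(Eq)\,P(Bf)$. Prove this.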
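One way to approach the proof is sketched below. It assumes the usual structure for this example, in which earthquake (Eq) and bike falls (Bf) are both parents of the alarm, written here as A (a symbol introduced for this sketch), so that the joint factorizes as $P(Eq)\,P(Bf)\,P(A \mid Eq, Bf)$.

```latex
\begin{align}
P(Eq, Bf) &= \sum_{A} P(Eq, Bf, A) \\
          &= \sum_{A} P(Eq)\, P(Bf)\, P(A \mid Eq, Bf) \\
          &= P(Eq)\, P(Bf) \sum_{A} P(A \mid Eq, Bf) \\
          &= P(Eq)\, P(Bf)
\end{align}
```

The last step uses the fact that a conditional distribution sums to 1 over its own variable. Note that the same argument does not go through once A is observed, since there is then no sum over A.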
8
Naive Bayes Classifier
[Figure: graphical model linking the class label to attribute 1 through attribute 5]
9
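NB Classifier
First we learn the conditional probabilities $P(x_i \mid y)$ and the class probabilities $P(y)$.
To classify we use Bayes rule and maximize over y: $P(y \mid x) = P(y) \prod_i P(x_i \mid y) \,/\, P(x)$.
We can equivalently maximize: $\log P(y) + \sum_i \log P(x_i \mid y)$.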
10
Example: Text
Data consists of documents from a certain class y. Xi is a count of the number of times word “i” is present in the document (bag-of-words).
We describe the probability that a word in a document in class y is equal to vocabulary word “i” to be:
Also, the probability that a document is from class c is given by:
The probability of a document is:
So, classification for a new test document (with unknown c) boils down to:
W is the number of words in the test doc.
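The classification step can be sketched as follows, assuming the per-class word probabilities and class priors are already known. The vocabulary, the numbers, and the names `class_prior`, `word_prob`, `log_score` and `classify` are all hypothetical and chosen for illustration. Scores are computed in log space, and the multinomial coefficient is dropped because it is the same for every class.

```python
import math

# Hypothetical toy parameters (not from the slides).
vocab = ["mouse", "keyboard", "goal", "team"]
class_prior = {"computers": 0.5, "sports": 0.5}
word_prob = {
    "computers": [0.4, 0.4, 0.1, 0.1],  # P(word i | computers)
    "sports": [0.1, 0.1, 0.4, 0.4],     # P(word i | sports)
}

def log_score(counts, c):
    """log P(c) + sum_i x_i * log P(word i | c) for a bag-of-words count vector."""
    return math.log(class_prior[c]) + sum(
        x * math.log(p) for x, p in zip(counts, word_prob[c]) if x > 0
    )

def classify(counts):
    """Return the class with the highest log score."""
    return max(class_prior, key=lambda c: log_score(counts, c))

# Counts of each vocabulary word in a test document: 2x mouse, 1x keyboard, 1x team.
test_doc = [2, 1, 0, 1]
predicted = classify(test_doc)
print(predicted)
```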
11
Learning NB One can maximize the log-probability of the data under the model: Taking derivatives and imposing the normalization constraints, one finds: So learning is really easy: It’s just counting!
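The "just counting" claim can be illustrated with a short sketch. The toy corpus and the helper name `fit_counts` are hypothetical; the estimates are the plain relative frequencies that result from maximizing the log-probability under the normalization constraints, with no smoothing applied.

```python
from collections import Counter

# Hypothetical toy corpus: (class label, list of words).
docs = [
    ("computers", ["keyboard", "screen", "keyboard"]),
    ("computers", ["screen", "software"]),
    ("sports", ["goal", "team", "goal", "goal"]),
]

def fit_counts(docs):
    """Estimate class priors and per-class word probabilities by counting."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = {c: Counter() for c in class_counts}
    for label, words in docs:
        word_counts[label].update(words)
    # P(c) = fraction of documents in class c
    prior = {c: n / len(docs) for c, n in class_counts.items()}
    # P(word | c) = fraction of word tokens in class c equal to that word
    word_prob = {
        c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
        for c, cnt in word_counts.items()
    }
    return prior, word_prob

prior, word_prob = fit_counts(docs)
print(prior)
print(word_prob["computers"])
```

Words never seen in a class (for example "mouse" in "computers" here) simply get no entry, which corresponds to an estimated probability of 0.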
12
Smoothing
With a large vocabulary, there may not be enough documents in class c for every word to appear in the data. E.g. the word “mouse” was not encountered in documents on computers. This means that when we happen to encounter a test document on computers that mentions the word “mouse”, the probability of it belonging to the class computers is 0. This is precisely over-fitting (with more data this would not have happened).
Solution: smoothing (Laplace correction). The smoothed estimates add a number of imaginary extra docs and a number of imaginary extra words in class c to the counts, each distributed according to a smooth a priori estimate.
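A minimal sketch of the word-probability smoothing is given below. The exact formula on the slide is not preserved, so the form used here, (count + pseudo_words * prior_estimate) / (total + pseudo_words), is an assumption, as are the function name, the parameter names, and the toy counts. With a uniform a priori estimate and as many imaginary words as vocabulary entries, it reduces to the familiar add-one correction.

```python
def smoothed_word_prob(count, total, pseudo_words, prior_estimate):
    """Assumed smoothing form: blend observed counts with imaginary extra words."""
    return (count + pseudo_words * prior_estimate) / (total + pseudo_words)

# Hypothetical word counts for the class "computers"; "mouse" was never seen.
counts_computers = {"keyboard": 2, "screen": 2, "software": 1, "mouse": 0}
vocab_size = len(counts_computers)
total = sum(counts_computers.values())

# Without smoothing, the unseen word gets probability 0.
unsmoothed_mouse = counts_computers["mouse"] / total

# Uniform prior estimate 1/V with V imaginary words (add-one smoothing).
smoothed = {
    w: smoothed_word_prob(n, total, pseudo_words=vocab_size,
                          prior_estimate=1 / vocab_size)
    for w, n in counts_computers.items()
}
print(unsmoothed_mouse, smoothed["mouse"])
```

The smoothed probabilities still sum to 1, and no word in the vocabulary is assigned probability 0.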
13
Bayesian Networks
If all variables are observed, learning just boils down to counting.
Sometimes variables are never observed. These are called “hidden” or “latent” variables. Learning is now a lot harder, because plausible fill-in values for these variables need to be inferred.
BNs are very powerful expert systems.
14
Full Bayesian Approaches
The idea is to not fit anything (so you can’t over-fit). Instead we consider our parameters as random variables. If we place a prior distribution on the parameters, we can simply integrate them out (now they are gone!).
Remember though that bad priors lead to bad models; it’s not a silver bullet.
In the limit of large numbers of data-items, one can derive the MDL penalty:
Computational overhead for full Bayesian approaches can be large.
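To illustrate what "integrating out" a parameter means, here is a sketch on a model that is not in the slides: a coin with unknown heads probability theta and a Beta(a, b) prior. The predictive probability of heads is the posterior average of theta, which has the closed form (heads + a) / (heads + tails + a + b). The function names and the numerical check by a midpoint rule are illustrative assumptions.

```python
def predictive_closed_form(heads, tails, a=1.0, b=1.0):
    """P(next flip is heads | data) with theta integrated out under a Beta(a, b) prior."""
    return (heads + a) / (heads + tails + a + b)

def predictive_numeric(heads, tails, a=1.0, b=1.0, steps=100_000):
    """Same quantity by numerical integration over theta (midpoint rule)."""
    num = 0.0
    den = 0.0
    for k in range(steps):
        theta = (k + 0.5) / steps
        # Unnormalized posterior density of theta given the data.
        w = theta ** (heads + a - 1) * (1 - theta) ** (tails + b - 1)
        num += theta * w
        den += w
    return num / den

# Three heads, no tails: the maximum-likelihood estimate of theta would be 1.0.
heads, tails = 3, 0
closed = predictive_closed_form(heads, tails)
numeric = predictive_numeric(heads, tails)
print(closed, numeric)
```

With three heads and no tails, a fitted theta of 1.0 would rule out tails entirely, while the integrated prediction remains below 1. The result does depend on the chosen prior, which is the caveat about bad priors.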
15
Conclusions
Bayesian learning is learning with probabilities and using Bayes rule.
Full Bayesian “learning” marginalizes out parameters.
Naive Bayes models are “generative models” in that you imagine how data is generated and then invert it using Bayes rule to classify new data. A separate model is trained for each class!
All other classifiers seen so far are discriminative. Decision surfaces were trained on all classes jointly.
With many data-items discriminative is expected to be better, but for small datasets generative (NB) is better.
When we don’t know the class label y in NB, we have a hidden variable.
Clustering is like fitting a NB model with hidden class label. MoG uses Gaussian conditional distributions. If we use discrete distributions (q) we are fitting a “mixture of multinomials”. This is a good model to cluster text documents.