Presentation on theme: "Learning Machine Learning Further Introduction from Eleftherios inspired mainly from Ghahramani’s and Bishop’s lectures."— Presentation transcript:

1 Learning Machine Learning Further Introduction from Eleftherios inspired mainly from Ghahramani’s and Bishop’s lectures.

2 Books:
– “Pattern Recognition and Machine Learning”, C. M. Bishop, 2006.
– “Information Theory, Inference, and Learning Algorithms”, D. J. C. MacKay, 2003.
– “Probability Theory: The Logic of Science”, E. T. Jaynes, 2003.
– “Pattern Classification”, Duda, Hart, Stork, 2001.
– “Neural Networks: A Comprehensive Foundation”, Simon Haykin, 1999.
Glossary: http://research.microsoft.com/~minka/statlearn/glossary/
Some maths: http://learning.eng.cam.ac.uk/zoubin/course04/cribsheet.pdf and http://matrixcookbook.com/

3 Types of Learning
1. Supervised learning: with teacher (labelled data)
– Classification
– Regression
2. Unsupervised learning: without teacher (unlabelled data)
– Clustering
3. Reinforcement learning (interactive)
– Markov Decision Processes
– Q-Learning

4 Some more terminology
Pattern Recognition
– Defined as the process whereby a received pattern/signal is assigned to one of a prescribed number of classes (categories, clusters, groups).
Agent (autonomous)
– Exhibits emergent behaviours. Such behaviours are not programmed into the agent by the designer but rather result from the interaction of the agent with the environment.
Autonomous agents:
– Biological agents
– Robotic agents: research agents, industrial agents
– Computational agents: simulated agents (simulating an animal or a robot), Artificial Life agents (digital creatures that interact with each other), software agents

5 Software agent
– Computer programs that perform a certain task and interact with real-world software environments and humans by issuing commands and interpreting the environment’s feedback.
Examples:
– Filtering e-mails.
– Sending routine reminder messages for meetings or seminars.
– Intrusion detection systems, etc.
Complete agent: Toda’s Fungus Eater (example)
The robot operates on a distant planet. Its task is to collect uranium ore, and it feeds on a certain type of fungus.
– Autonomous: too far away for remote control.
– Self-sufficient: it must take care of its own energy supply (fungus).
– Embodied: it exists as a physical system.
– Situated: its knowledge about the environment is acquired through its own sensory system.

6 (Supervised)

7

8 (Unsupervised) Unlabeled Data!

9 Probability Theory
– Joint Probability
– Marginal Probability
– Conditional Probability

10 Rules of Probability
– Sum Rule
– Product Rule
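The two rules, written out for random variables X and Y in the standard (Bishop) notation:

```latex
\text{sum rule:} \quad p(X) = \sum_{Y} p(X, Y)
\qquad\qquad
\text{product rule:} \quad p(X, Y) = p(Y \mid X)\, p(X)
```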

11 Bayes’ Theorem posterior  likelihood × prior
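Bayes’ theorem follows directly from the product rule, with the denominator obtained from the sum and product rules together:

```latex
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)},
\qquad
p(X) = \sum_{Y} p(X \mid Y)\, p(Y)
```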

12 Bayesian ML
Think of ML as learning models of data. Bayesian statistics provides a framework for building intelligent learning systems. Bayes’ rule states that:
P(M|D) = P(D|M) P(M) / P(D)
“the probability of the model given the data, P(M|D), is the probability of the data given the model, P(D|M), times the prior probability of the model, P(M), divided by the probability of the data, P(D)”.

13 The Bayesian framework:
– Start by enumerating all reasonable models of the data and assigning your prior belief P(M) to each of these models.
– Upon observing the data D, evaluate how probable the data were under each of these models to compute the likelihood P(D|M).
– Multiplying this likelihood by the prior and normalizing yields the posterior P(M|D), which encapsulates everything you have learned from the data regarding the possible models under consideration.
– To compare two models M and M', compute their relative probability given the data: P(M)P(D|M) / P(M')P(D|M').
In practice, applying Bayes’ rule exactly is usually impractical because it involves summing or integrating over too large a space of models.
Solution: approximate Bayesian methods
– Laplace approximation
– Variational approximations
– Expectation Propagation
– Markov Chain Monte Carlo
– …
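A minimal numerical sketch of these steps, with entirely hypothetical numbers: three candidate “models” of a coin, each asserting a different bias, a uniform prior P(M), and observed data D = 7 heads in 10 flips.

```python
from math import comb

biases = [0.3, 0.5, 0.7]        # each model M asserts a coin bias
prior = [1 / 3, 1 / 3, 1 / 3]   # P(M): uniform prior belief
heads, flips = 7, 10            # observed data D

# P(D|M): binomial likelihood of the data under each model
likelihood = [comb(flips, heads) * b ** heads * (1 - b) ** (flips - heads)
              for b in biases]

# P(M|D) = P(D|M) P(M) / P(D), with P(D) obtained by normalization
unnormalized = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnormalized)                      # P(D)
posterior = [u / evidence for u in unnormalized]

# Relative probability of two models, as on the slide:
# P(M)P(D|M) / P(M')P(D|M')
bayes_factor = (prior[2] * likelihood[2]) / (prior[1] * likelihood[1])
print(posterior, bayes_factor)
```

After seeing 7 heads, the posterior shifts most of the belief onto the 0.7-bias model, exactly the “multiply by the likelihood and normalize” step described above.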

14 Regression Example
Goal: minimize the sum-of-squares error function.
Solution: regularization, which penalizes large coefficient values.
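A sketch of this example with synthetic data (the target function, polynomial degree, noise level, and regularization coefficient are all illustrative choices, not values from the slides): fit a degree-9 polynomial to noisy samples of sin(2πx), once by plain least squares and once with an L2 penalty on the coefficients.

```python
import numpy as np

# Synthetic data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

degree, lam = 9, 1e-3
Phi = np.vander(x, degree + 1)            # polynomial design matrix

# Unregularized: minimize the sum-of-squares error ||Phi w - t||^2
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Regularized: minimize ||Phi w - t||^2 + lam * ||w||^2
A = Phi.T @ Phi + lam * np.eye(degree + 1)
w_ridge = np.linalg.solve(A, Phi.T @ t)

# The penalty shrinks the huge coefficients that cause over-fitting
print(np.abs(w_ml).max(), np.abs(w_ridge).max())
```

With 10 points and 10 coefficients the unregularized fit interpolates the noise, producing very large coefficients; the penalized solution keeps them small, which is the point of the slide.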

15 Maximum Likelihood
Determine the parameters w by minimizing the sum-of-squares error E(w).

16 MAP
Determine the parameters w by minimizing the regularized sum-of-squares error. Taking the negative logarithm of the posterior over w and minimizing it yields the weights of maximum posterior probability, or maximum a posteriori. Alpha here is a hyperparameter (the precision of the prior over w).
Bayesian Curve Fitting
Over-fitting can be avoided: the effective number of parameters adapts automatically to the size of the data.
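In Bishop’s curve-fitting notation (Gaussian noise with precision β, zero-mean Gaussian prior with precision α over the weights w), taking the negative logarithm of the posterior shows that the MAP solution minimizes the regularized sum-of-squares error:

```latex
-\ln p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})
\;\propto\;
\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^{2}
+ \frac{\alpha}{2}\, \mathbf{w}^{\mathrm{T}} \mathbf{w}
```

so the regularization coefficient of the previous slide corresponds to the ratio λ = α/β.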

17 Bayesian Predictive Distribution
If the noise precision beta is not known, we can use a Gaussian-gamma distribution as a conjugate prior, and the predictive distribution will then be a Student-t distribution.
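The predictive distribution averages the model’s predictions over the posterior of the parameters rather than committing to a single point estimate:

```latex
p(t \mid x, \mathbf{x}, \mathbf{t})
= \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}
```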

18 Information Theory
We begin by considering a discrete random variable x and ask how much information is received when we observe a specific value of this variable. The amount of information can be viewed as the degree of “surprise” on learning the value of x. If we were told that a highly improbable event had occurred, we would receive more information than if we were told that some very likely event had just occurred; if we knew the event was certain to happen, we would receive no information.
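These ideas are captured by the information content h(x) and its expectation, the entropy H[x] (measured in bits when the logarithm is taken base 2):

```latex
h(x) = -\log_2 p(x),
\qquad
H[x] = -\sum_{x} p(x) \log_2 p(x)
```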

19

20 [Figure: plot of the information content h against the probability p, for p between 0 and 1]
H(X,Y) = H(Y|X) + H(X)
For independent x and y: H(x,y) = H(x) + H(y)

21

22 Assume x is discrete with 8 possible states; how many bits are needed to transmit the state of x?
– All states equally likely: 3 bits.
– States with different probabilities: fewer bits on average, by giving shorter codes to the likely states.
The uniform distribution has maximum entropy.
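The 8-state example can be checked directly (the skewed probabilities below are the classic textbook choice of 1/2, 1/4, 1/8, 1/16 and four states at 1/64):

```python
from math import log2

def entropy(p):
    """H[x] = -sum p(x) log2 p(x), in bits; 0 * log 0 is taken as 0."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

uniform = [1 / 8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))  # 3.0 bits: all 8 states equally likely
print(entropy(skewed))   # 2.0 bits: a shorter average code exists
```

The uniform distribution attains the maximum of 3 bits; any non-uniform distribution over the same 8 states has lower entropy.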

23 Also known as the Noiseless Coding Theorem (Shannon, 1948): the entropy is a lower bound on the average number of bits needed to transmit the state of a random variable.

24 Relative Entropy or Kullback-Leibler Divergence
Consider some unknown distribution p(x), and suppose we have modelled it using an approximating distribution q(x). What is the average additional amount of information required to specify the value of x? This quantity is the relative entropy, or Kullback-Leibler (KL) divergence, KL(p||q). It satisfies KL(p||q) >= 0, with equality iff p(x) = q(x), so KL is a measure of the dissimilarity of the two distributions p(x) and q(x). Minimizing the KL divergence is equivalent to maximizing the likelihood.
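A small sketch over a discrete space, with made-up distributions p and q chosen purely for illustration:

```python
from math import log2

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log2( p(x) / q(x) ), in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]    # "true" (unknown) distribution
q = [1/3, 1/3, 1/3]      # our approximating distribution

print(kl_divergence(p, p))  # 0.0: equality iff p == q
print(kl_divergence(p, q))  # > 0: extra bits paid for coding with q
```

Note that KL is not symmetric: KL(p||q) and KL(q||p) generally differ, so it is a divergence rather than a distance.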

25 Mutual Information (MI)
Consider the joint distribution of two sets of variables x and y, given by p(x,y). If x and y are independent, then p(x,y) = p(x)p(y). We can gain some idea of whether they are ‘close’ to being independent by considering the KL divergence between p(x,y) and p(x)p(y): this is the mutual information. It satisfies I(x;y) >= 0, with equality iff x and y are independent, and it is related to the conditional entropy through I(x;y) = H(x) - H(x|y) = H(y) - H(y|x). Mutual information thus represents the reduction in uncertainty about x as a consequence of the new observation y. A major application of MI (fast and robust) is the registration of MRI images (Viola, MIT, 1995).
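The definition as a KL divergence between p(x,y) and p(x)p(y) can be computed directly from a joint probability table (the two 2x2 tables below are invented examples):

```python
from math import log2

def mutual_information(joint):
    """I(x;y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), in bits."""
    px = [sum(row) for row in joint]           # marginal p(x)
    py = [sum(col) for col in zip(*joint)]     # marginal p(y)
    return sum(joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)

independent = [[0.25, 0.25],   # p(x,y) = p(x) p(y) everywhere
               [0.25, 0.25]]
dependent = [[0.4, 0.1],       # observing y reduces uncertainty about x
             [0.1, 0.4]]

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # > 0
```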

26 Probability density functions (pdfs) and cumulative distribution functions (cdfs) for continuous random variables; probability mass functions (pmfs) and their corresponding cdfs for discrete random variables.

27 The Bernoulli distribution: used in a Bernoulli trial, a random experiment with outcome {success, failure}.

28 The beta distribution (conjugate prior of the Bernoulli distribution). The Dirichlet distribution: a generalization of the beta distribution and the conjugate prior of the multinomial distribution.

29

30 The binomial distribution describes the number of successes in a series of independent yes/no trials. The multinomial distribution is as the binomial, but each trial can have more than 2 outcomes. Student’s t-distribution arises in the problem of estimating the mean of a normally distributed population when the sample size is small (t-test); it is much more robust to outliers than the Gaussian.

31

32

33 Take-away message: Learning is about Optimization or Integration, again…

