IGERT External Advisory Board Meeting
Wednesday, March 14, 2007
Institute for Cognitive Sciences, University of Pennsylvania

COGS 501 & COGS 502
A two-semester sequence that aims to provide basic mathematical and algorithmic tools for the study of animal, human, or machine communication.
COGS 501 topics:
1. Mathematical and programming basics: linear algebra and Matlab
2. Probability
3. Model fitting
4. Signal processing
COGS 502 topics:
1. Information theory
2. Formal language theory
3. Logic
4. Machine learning

Challenges
- Diverse student background: from almost nothing to MS-level skills, in both mathematics and programming
- Breadth of topics and applications: normally covered by many courses with many prerequisites
- Lack of suitable instructional materials

Precedent: LING 525 / CIS 558
Computer Analysis and Modeling of Biological Signals and Systems: a hands-on signal and image processing course for non-EE graduate students needing these skills. We will go through all the fundamentals of signal and image processing using computer exercises developed in MATLAB. Examples will be drawn from speech analysis and synthesis, computer vision, and biological modeling.

History
CIS 558 / LING 525, "Digital signal processing for non-EEs":
- started in 1996 by Simoncelli and Liberman
- faced similar problems of student diversity, breadth of topics, and lack of suitable materials
Solutions:
- Matlab-based lab course: concepts, digital methods, applications
- Several tiers for each topic: basic, intermediate, advanced
- Extensive custom-built lecture notes and problem sets
- Lots of individual hand-holding
Results:
- Successful uptake across a wide range of student backgrounds (e.g., from "no math since high school" to "MS in math"; from "never programmed" to "five years in industry")
- Its successor is now a required course in the NYU neuroscience program: "Mathematical Tools for Neural Science"

Mathematical Foundations
Course goal: IGERT students should understand and be able to apply
- models of language and communication
- experimental design and analysis
- corpus-based methods
in research areas including
- sentence processing
- animal communication
- language learning
- communicative interaction
- cognitive neuroscience

COGS 501-2, rev. 0.1
- The problem is somewhat more difficult: the students are even more diverse, and the concepts and applications are even broader.
- Advance preparation was inadequate: lack of pre-prepared lecture notes and problems (except for those derived from other courses), and not enough coordination by the faculty.
- Not enough explicit connections to research.

COGS 501-2: how to do better
- The sequence will start again in Fall 2007.
- Plans for rev. 0.9:
  - Advance preparation of course-specific lecture notes and problem sets
  - Systematic remediation where needed, both for mathematical background and for entry-level Matlab programming
  - Connection to research themes (e.g., sequence modeling, birdsong analysis, artificial language learning), via both historical papers and contemporary research

Research theme: example
"Colorless green ideas sleep furiously"
Shannon 1948, Chomsky 1957, Pereira 2000 (?)

Word sequences: Shannon
C. Shannon, "A mathematical theory of communication", BSTJ, 1948:
... a sufficiently complex stochastic process will give a satisfactory representation of a discrete source. [The entropy] H ... can be determined by limiting operations directly from the statistics of the message sequences ... [Specifically:]
Theorem 6: Let $p(B_i, S_j)$ be the probability of sequence $B_i$ followed by symbol $S_j$ and $p_{B_i}(S_j)$ ... be the conditional probability of $S_j$ after $B_i$. Let
$$F_N = -\sum_{i,j} p(B_i, S_j)\, \log_2 p_{B_i}(S_j)$$
where the sum is over all blocks $B_i$ of $N-1$ symbols and over all symbols $S_j$. Then $F_N$ is a monotonic decreasing function of $N$, ... and $\lim_{N \to \infty} F_N = H$.
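To make the estimator concrete: $F_N$ is the conditional entropy of the next symbol given the preceding $N-1$ symbols, computed from n-gram relative frequencies. A minimal Python sketch (the filename corpus.txt is a placeholder assumption, not part of the slides):

```python
from collections import Counter
from math import log2

def F_N(text, N):
    """Shannon's F_N: conditional entropy (bits/symbol) of the next symbol
    given the preceding N-1 symbols, estimated from n-gram relative frequencies."""
    positions = range(len(text) - N + 1)
    blocks = Counter(text[i:i + N - 1] for i in positions)   # counts of blocks B_i
    ngrams = Counter(text[i:i + N] for i in positions)       # counts of (B_i, S_j)
    total = sum(ngrams.values())
    h = 0.0
    for g, c in ngrams.items():
        p_joint = c / total                 # p(B_i, S_j)
        p_cond = c / blocks[g[:-1]]         # p_{B_i}(S_j)
        h -= p_joint * log2(p_cond)
    return h

text = open("corpus.txt").read().lower()    # placeholder corpus file
for n in range(1, 6):
    print(n, F_N(text, n))                  # F_N decreases toward H as N grows
```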

Word sequences: Chomsky
N. Chomsky, Syntactic Structures, 1957:
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
... It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

Word sequences: Pereira
F. Pereira, "Formal grammar and information theory: together again?", 2000:
[Chomsky's argument] relies on the unstated assumption that any probabilistic model necessarily assigns zero probability to unseen events. Indeed, this would be the case if the model probability estimates were just the relative frequencies of observed events (the maximum-likelihood estimator). But we now understand that this naive method badly overfits the training data. [...] To avoid this, one usually smoothes the data [...] In fact, one of the earliest such methods, due to Turing and Good (Good, 1953), had been published before Chomsky's attack on empiricism ...
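To illustrate the point in the simplest possible way, the sketch below contrasts maximum-likelihood bigram estimates with add-lambda (Lidstone) smoothing, a cruder stand-in for the Good-Turing method Pereira cites; the toy token list is made up:

```python
from collections import Counter

def bigram_probs(tokens, vocab, lam=0.0):
    """P(w2 | w1) from counts; lam=0 gives the maximum-likelihood estimate,
    lam>0 gives add-lambda (Lidstone) smoothing, so unseen bigrams get mass."""
    big = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens[:-1])
    V = len(vocab)
    def p(w2, w1):
        return (big[(w1, w2)] + lam) / (uni[w1] + lam * V)
    return p

tokens = "green ideas sleep . ideas sleep furiously . colorless ideas sleep .".split()
vocab = set(tokens)

p_mle = bigram_probs(tokens, vocab, lam=0.0)
p_smooth = bigram_probs(tokens, vocab, lam=0.5)

print(p_mle("green", "colorless"))      # 0.0: the unseen bigram is ruled out under MLE
print(p_smooth("green", "colorless"))   # small but nonzero under smoothing
```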

Word sequences: Pereira
Hidden variables ... can also be used to create factored models of joint distributions that have far fewer parameters to estimate, and are thus easier to learn, than models of the full joint distribution. As a very simple but useful example, we may approximate the conditional probability $p(y \mid x)$ of occurrence of a word $y$ in the context of a word $x$ as
$$p(y \mid x) = \sum_c p(y \mid c)\, p(c \mid x)$$
where $c$ is a hidden "class" variable for the associations between $x$ and $y$ ... When $(x, y) = (v_i, v_{i+1})$ we have an aggregate bigram model ... which is useful for modeling word sequences that include unseen bigrams. With such a model, we can approximate the probability of a string $p(w_1 \cdots w_n)$ by
$$p(w_1 \cdots w_n) \approx \prod_{i=1}^{n-1} p(w_{i+1} \mid w_i)$$
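A minimal sketch of scoring a string under such an aggregate bigram model, assuming the component tables p(c|x) and p(y|c) are already available (here filled with made-up Dirichlet draws; in practice they come from EM training, as on the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
V, C = 1000, 16                          # vocabulary size and number of hidden classes

# Made-up, row-normalized component tables; placeholders for EM-trained parameters.
p_c_given_x = rng.dirichlet(np.ones(C), size=V)   # shape (V, C): p(c | x)
p_y_given_c = rng.dirichlet(np.ones(V), size=C)   # shape (C, V): p(y | c)

def p_next(x, y):
    """Aggregate bigram probability p(y | x) = sum_c p(y | c) p(c | x)."""
    return float(p_c_given_x[x] @ p_y_given_c[:, y])

def log_p_string(word_ids):
    """log p(w_1 ... w_n) approximated as sum_i log p(w_{i+1} | w_i)."""
    return sum(np.log(p_next(x, y)) for x, y in zip(word_ids, word_ids[1:]))

sentence = [3, 17, 256, 9, 42]           # placeholder word ids
print(log_p_string(sentence))

# Parameter count: 2*V*C values for the factored model, versus V*V for a full bigram table.
print(2 * V * C, "vs", V * V)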

Word sequences: Pereira
Using this estimate for the probability of a string, and an aggregate model with C = 16 trained on newspaper text using the expectation-maximization method, we find that
$$\frac{p(\text{Colorless green ideas sleep furiously})}{p(\text{Furiously sleep ideas green colorless})} \approx 2 \times 10^{5}$$
that is, the grammatical but nonsensical sentence (1) comes out roughly 200,000 times more probable than its scrambled counterpart (2), even though neither had been seen in training.
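For concreteness, here is a hedged sketch of the EM loop for the aggregate bigram model above (not Pereira's actual implementation; the dense V x V count matrix N is a toy stand-in for corpus bigram counts, and a real implementation would iterate only over observed bigrams):

```python
import numpy as np

def train_aggregate_bigram(N, C=16, iters=50, seed=0):
    """EM for an aggregate bigram model p(y|x) = sum_c p(y|c) p(c|x).

    N[x, y] is the observed count of bigram (x, y);
    returns (p_c_given_x, p_y_given_c)."""
    rng = np.random.default_rng(seed)
    V = N.shape[0]
    p_c_given_x = rng.dirichlet(np.ones(C), size=V)      # (V, C)
    p_y_given_c = rng.dirichlet(np.ones(V), size=C)      # (C, V)

    for _ in range(iters):
        # E-step: posterior q[x, y, c] proportional to p(c|x) * p(y|c)
        q = p_c_given_x[:, None, :] * p_y_given_c.T[None, :, :]   # (V, V, C)
        q /= q.sum(axis=2, keepdims=True)

        # M-step: reestimate from expected counts, weighted by observed bigram counts
        ec = N[:, :, None] * q                                     # (V, V, C)
        p_c_given_x = ec.sum(axis=1)                               # sum over y -> (V, C)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
        p_y_given_c = ec.sum(axis=0).T                             # sum over x -> (C, V)
        p_y_given_c /= p_y_given_c.sum(axis=1, keepdims=True)

    return p_c_given_x, p_y_given_c

# Toy usage with made-up counts (a real run would use newspaper-text bigram counts):
V = 50
rng = np.random.default_rng(1)
N = rng.integers(0, 5, size=(V, V)).astype(float)
pcx, pyc = train_aggregate_bigram(N, C=4, iters=20)
print((pcx @ pyc).shape)     # (V, V) table of smoothed p(y | x)
```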

Word sequences: concepts & problems
Concepts:
- Information entropy (and conditional entropy, cross-entropy, mutual information)
- Markov models, N-gram models
- The Chomsky hierarchy (first glimpse)
Problems:
- Entropy estimation algorithms (n-gram, LZW, BW, etc.; a compression-based sketch follows this list)
- LNRE smoothing methods
- EM estimation of hidden variables (learned earlier for Gaussian mixtures)
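The n-gram route is sketched after the Shannon excerpt above; as a complement, a compression-based estimate in the Lempel-Ziv family gives a quick upper bound on the entropy rate. A crude classroom-style illustration using zlib (DEFLATE is LZ77-based; corpus.txt is again a placeholder, and this is not a careful estimator):

```python
import zlib

def compression_entropy_bound(text: str) -> float:
    """Crude upper bound on entropy rate: compressed bits per input byte
    (approximately bits per character for ASCII text)."""
    data = text.encode("utf-8")
    compressed = zlib.compress(data, level=9)
    return 8 * len(compressed) / len(data)

sample = open("corpus.txt", encoding="utf-8").read()   # placeholder corpus file
print(compression_entropy_bound(sample))
```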