
1 CS 4100 Artificial Intelligence Prof. C. Hafner Class Notes April 5, 2012

2 Outline
Presenters on April 12 (one week from today)
– Determined in Python by: import random; random.uniform(1,9), performed 4 times
1. Brackets (Longyear)
2. MARC (Thompson)
3. Senioritis (Luis)
4. Eldridge team
Perceptrons: single-layer neural nets
Brief overview of support vector machines
Continue with NLP concepts/examples

3 Perceptron: a single layer neural net

4 The Perceptron
The Perceptron is a single-layer artificial neural network with only one neuron. The neuron unit computes a linear combination of its real-valued or boolean inputs and passes it through a threshold activation function:
o = Threshold( Σ_{i=0..N} w_i x_i )
where the x_i are the components of an input x^e = ( x^e_1, x^e_2, ..., x^e_N ) drawn from the training set {( x^e, y^e )}, e = 1..D.
Threshold is the activation function defined as follows: Threshold(s) = 1 if s > 0 and –1 otherwise.

5 The Perceptron is sometimes referred to as a threshold logic unit (TLU), since it discriminates the data depending on whether the sum is greater than the threshold value, Σ_{i=0..N} w_i x_i > 0, or less than or equal to it, Σ_{i=0..N} w_i x_i <= 0. In the above formulation we implement the threshold using a pseudo-input x_0 which is always equal to 1; if the threshold value is T, then w_0 (called the bias weight) is set to –T. The Perceptron is strictly equivalent to a linear discriminant, and it is often used as a device that decides whether an input pattern belongs to one of two classes.
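To make the threshold unit concrete, here is a minimal Python sketch (my own illustration, not code from the notes); it assumes the bias trick above, with pseudo-input x_0 = 1 and bias weight w_0 = –T:

```python
def threshold(s):
    """Activation function: 1 if s > 0, otherwise -1."""
    return 1 if s > 0 else -1

def perceptron_output(weights, inputs):
    """o = Threshold(sum_i w_i * x_i); inputs[0] is the pseudo-input, always 1."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return threshold(s)

# Example: threshold T = 1, so the bias weight w0 = -1
w = [-1.0, 0.5, 0.3]            # [w0, w1, w2]
x = [1.0, 2.0, 1.0]             # [x0 (pseudo-input), x1, x2]
print(perceptron_output(w, x))  # 1, since 0.5*2 + 0.3*1 - 1 = 0.3 > 0
```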

6 Linear discriminant

7 Networks of such threshold units can represent a rich variety of functions, while single units alone cannot. For example, every boolean function can be represented by some network of interconnected units. The Perceptron (a single-layer network) can represent most of the primitive boolean functions: AND, OR, NAND and NOR, but it cannot represent XOR.

8 Perceptron Learning Algorithms
As with other linear discriminants, the main learning task is to learn the weights. Unlike the normal linear discriminant, no means or covariances are calculated. Instead of examining sample statistics, sample cases are presented sequentially, and errors are corrected by adjusting the weights. If the output is correct, the weights are not adjusted.
There are several learning algorithms for training single-layer Perceptrons, based on:
– the Perceptron rule
– the Gradient descent rule
– the Delta rule
All these rules define error-correction learning procedures; that is, the algorithm cycles through the training examples over and over, adjusting the weights when the perceptron’s answer is wrong, until the perceptron gets the right answer for all inputs.

9 Initialization: examples {( x^e, y^e )}, e = 1..D; initial weights w_i set to small random values; learning rate parameter η = 0.1 (or another learning rate)
Repeat
  for each training example ( x^e, y^e )
    calculate the output: o^e = Threshold( Σ_{i=0..N} w_i x^e_i )
    if the Perceptron does not respond correctly, update the weights:
      w_i = w_i + η ( y^e – o^e ) x^e_i    // this is the Perceptron Rule
until the termination condition is satisfied.
where: y^e is the desired output, o^e is the output generated by the Perceptron, and w_i is the weight associated with the i-th input x^e_i. Note that if x^e_i is 0 then w_i does not get adjusted.
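A runnable sketch of this procedure (the function and variable names are mine, not from the notes; the termination condition here is simply "no errors, or max_epochs reached"):

```python
import random

def threshold(s):
    return 1 if s > 0 else -1

def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
    """Error-correction training with the Perceptron Rule.
    examples: list of (x, y) pairs, where x has length n_inputs + 1 with
    x[0] == 1 (the pseudo-input for the bias weight) and y is +1 or -1."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]  # small random weights
    for _ in range(max_epochs):                     # cycle through the examples over and over
        errors = 0
        for x, y in examples:
            o = threshold(sum(wi * xi for wi, xi in zip(w, x)))  # o = Threshold(sum_i w_i x_i)
            if o != y:                              # Perceptron did not respond correctly
                for i in range(len(w)):
                    w[i] += eta * (y - o) * x[i]    # w_i = w_i + eta (y - o) x_i
                errors += 1
        if errors == 0:                             # right answer for all inputs
            break
    return w

# Example: learn boolean AND over inputs coded as -1 / +1
and_examples = [([1, -1, -1], -1), ([1, -1, 1], -1), ([1, 1, -1], -1), ([1, 1, 1], 1)]
print(train_perceptron(and_examples, n_inputs=2))
```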

10 The role of the learning parameter is to moderate the degree to which weights are changed at each step of the learning algorithm. During learning the decision boundary defined by the Perceptron moves, and some points that have been previously misclassified will become correctly classified.

11 The Perceptron convergence theorem states that for any data set which is linearly separable, the Perceptron learning rule is guaranteed to find a solution in a finite number of steps. In other words, the Perceptron learning rule is guaranteed to converge to a weight vector that correctly classifies the examples, provided the training examples are linearly separable. A function is said to be linearly separable when its outputs can be discriminated by a function that is a linear combination of features; that is, we can discriminate its outputs with a line or a hyperplane.

12 Example: Suppose a perceptron accepts two inputs x_1 = 2 and x_2 = 1, with weights w_1 = 0.5, w_2 = 0.3, and w_0 = –1 (meaning that the threshold is 1).
The output of the perceptron is:
O = 2 * 0.5 + 1 * 0.3 – 1 = 0.3, which is > 0, therefore the output is 1.
If the correct output is –1, however, the weights are adjusted according to the Perceptron rule as follows:
w_1 = 0.5 + 0.1 * ( –1 – 1 ) * 2 = 0.1
w_2 = 0.3 + 0.1 * ( –1 – 1 ) * 1 = 0.1
w_0 = –1 + 0.1 * ( –1 – 1 ) * 1 = –1.2
The new weights classify this input as follows:
O = 2 * 0.1 + 1 * 0.1 – 1.2 = –0.9
Therefore we have done “error correction”.
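The same arithmetic can be checked directly in Python (a throwaway sketch; the values are the ones on this slide, and the printed results differ from –0.9 only by floating-point rounding):

```python
eta = 0.1
w0, w1, w2 = -1.0, 0.5, 0.3      # w0 is the bias weight (threshold = 1)
x0, x1, x2 = 1.0, 2.0, 1.0       # x0 is the pseudo-input, always 1
y = -1                           # the correct (desired) output

o = 1 if (w0 * x0 + w1 * x1 + w2 * x2) > 0 else -1
print(o)                         # 1: the sum is 0.3 > 0, but the correct answer is -1

# Perceptron rule: w_i = w_i + eta * (y - o) * x_i
w0 += eta * (y - o) * x0         # -1.0 + 0.1 * (-2) * 1 = -1.2
w1 += eta * (y - o) * x1         #  0.5 + 0.1 * (-2) * 2 ~  0.1
w2 += eta * (y - o) * x2         #  0.3 + 0.1 * (-2) * 1 ~  0.1

print(w0 * x0 + w1 * x1 + w2 * x2)   # ~ -0.9: the same input now falls on the -1 side
```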

13 Why XOR is not linearly separable: no single line in the (x_1, x_2) plane can separate the inputs where XOR is true, (0,1) and (1,0), from the inputs where it is false, (0,0) and (1,1).

14 New topic: Hill Climbing (optimization search)
Given an (unknown) numerical function F: R^n → R, find the x that gives the best F by sampling the space.
– Choose a point in n-dimensional space as your current “guess” x = (x_1, ..., x_n)
– Take a small step in k directions
– Choose the direction that results in the maximum improvement in F(x)
– Make that the new guess
– Continue until no more improvement is possible (a plateau) or the result satisfies a criterion
Variant: simulated annealing – random jump at intervals to find a better region
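A small sketch of the idea (my own code, not from the notes; instead of k fixed directions it samples a handful of random directions, which is one common way to realize the step above):

```python
import random

def hill_climb(f, start, step=0.1, n_directions=20, max_iters=1000):
    """Maximize f: R^n -> R by repeatedly stepping in the best sampled direction."""
    x, best = list(start), f(start)
    for _ in range(max_iters):
        candidates = []
        for _ in range(n_directions):                    # try small steps in several directions
            d = [random.uniform(-1, 1) for _ in x]
            cand = [xi + step * di for xi, di in zip(x, d)]
            candidates.append((f(cand), cand))
        cand_val, cand = max(candidates)                 # direction with maximum improvement
        if cand_val <= best:                             # plateau: no more improvement
            break
        x, best = cand, cand_val
    return x, best

# Example: maximize F(x, y) = -(x - 1)^2 - (y + 2)^2, whose maximum is at (1, -2)
F = lambda v: -(v[0] - 1) ** 2 - (v[1] + 2) ** 2
print(hill_climb(F, [0.0, 0.0]))
```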

15 A Brief Overview of Support Vector Machines

16–19 (figure-only slides)

20 LOOCV = “leave one out cross validation”

21 Cross Validation
Suppose we have a model with one or more unknown parameters, and a data set to which the model can be fit (the training data set). The fitting process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This is called overfitting, and it is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available.

22 In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
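A minimal sketch of the K-fold split itself (my own code, not from the notes; setting k equal to the number of observations gives the leave-one-out case on the next slide):

```python
import random

def k_fold_splits(data, k=10, seed=0):
    """Yield (train, validation) pairs; each observation is validated exactly once."""
    items = list(data)
    random.Random(seed).shuffle(items)            # random partition into K subsamples
    folds = [items[i::k] for i in range(k)]       # K roughly equal subsamples
    for i in range(k):
        validation = folds[i]                     # one subsample held out for testing
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# Example: 10-fold cross-validation over 25 items (here we just check the fold sizes;
# in practice you would fit the model on `train` and score it on `validation`)
sizes = [len(val) for _, val in k_fold_splits(range(25), k=10)]
print(sizes, sum(sizes))          # the folds cover all 25 observations exactly once
```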

23 As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. This is the same as K-fold cross-validation with K equal to the number of observations in the original sample. Leave-one-out cross-validation is usually very expensive from a computational point of view because of the large number of times the training process is repeated.

24 (figure-only slide)

25 Natural Language Processing (cont.)
Levels:
– Phonological
– Lexical
– Syntactic
– Semantic
– Pragmatic
What is each level concerned with? What are some of the difficulties?

26 Part of Speech Tagging: A success story for corpus linguistics
Tagging: the process of associating labels with each token in a text
Tags: the labels
Tag Set: the collection of tags used for a particular task
Thanks to: Marti Hearst, Diane Litman, Steve Bird

27 Example from the GENIA corpus Typically a tagged text is a sequence of white-space separated base/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN./.
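Reading this format in Python is nearly a one-liner; a small sketch (my own code), using rsplit so that a token such as "./." still splits on its last slash:

```python
tagged = ("These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ "
          "strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS "
          "targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.")

# rsplit('/', 1) splits on the LAST slash, so the "./." token becomes ('.', '.')
pairs = [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]
print(pairs[:3])      # [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD')]
print(pairs[-1])      # ('.', '.')
```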

28 What does Tagging do?
1. Collapses distinctions
– Lexical identity may be discarded, e.g., all personal pronouns tagged with PRP
2. Introduces distinctions
– Ambiguities may be resolved, e.g., deal tagged with NN or VB
3. Helps in classification and prediction

29 Significance of Parts of Speech
A word’s POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), pronunciation (OBject vs. obJECT), or both (wind)
– Helps in stemming
– Limits the range of following words
– Can help select nouns from a document for summarization
– Basis for partial parsing (chunked parsing)
– Parsers can build trees directly on the POS tags instead of maintaining a lexicon

30 Choosing a tagset
The choice of tagset greatly affects the difficulty of the problem.
Need to strike a balance between:
– getting better information about context
– making it possible for classifiers to do their job

31 Some of the best-known Tagsets
– Brown corpus: 87 tags (more when tags are combined)
– Penn Treebank: 45 tags
– Lancaster UCREL C5 (used to tag the BNC): 61 tags
– Lancaster C7: 145 tags

32 (figure-only slide)

33 The Brown Corpus
– An early digital corpus (1961), Francis and Kucera, Brown University
– Contents: 500 texts, each 2000 words long, from American books, newspapers, magazines
– Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore

34 Penn Treebank
– First large syntactically annotated corpus
– 1 million words from Wall Street Journal
– Part-of-speech tags and syntax trees

35 How hard is POS tagging?
Number of tags:        1      2     3    4   5   6   7
Number of word types:  35340  3760  264  61  12  2   1
In the Brown corpus, 12% of word types are ambiguous and 40% of word tokens are ambiguous.

36 Tagging methods
– Hand-coded
– Statistical taggers (supervised learning)
– Brill (transformation-based) tagger

37 Supervised Learning Approach
Algorithms that “learn” from data see a set of examples and try to generalize from them.
Training set:
– Examples trained on
Test set:
– Also called held-out data or unseen data
– Use this for evaluating your algorithm
– Must be separate from the training set; otherwise, you cheated!
“Gold” standard:
– A test set that a community has agreed on and uses as a common benchmark.

38 Cross-Validation of Learning Algorithms
Cross-validation set:
– Part of the training set, used for tuning parameters of the algorithm without “polluting” (tuning to) the test data.
– You can train on x% of the training data and cross-validate on the remaining (100 − x)%, e.g., train on 90% of the training data and cross-validate (test) on the remaining 10%. Repeat several times with different splits.
– This allows you to choose the best settings to then use on the real test set.
You should only evaluate on the test set at the very end, after you’ve gotten your algorithm as good as possible on the cross-validation set.

39 Strong Baselines
When designing NLP algorithms, you need to evaluate them by comparing to others.
Baseline algorithm:
– An algorithm that is relatively simple but can be expected to do well
– Should get the best score possible by doing the somewhat obvious thing
A tagging baseline: for each word, assign its most frequent tag in the training set.

40 N-Grams
The N stands for how many terms are used:
– Unigram: 1 term (0th order), e.g., P(x_i)
– Bigram: 2 terms (1st order), e.g., P(x_i | x_{i-1})
– Trigram: 3 terms (2nd order), e.g., P(x_i | x_{i-1}, x_{i-2})
– Usually don’t go beyond this
You can use different kinds of terms, e.g.:
– Character-based n-grams
– Word-based n-grams
– POS-based n-grams
We use n-grams to help determine the context in which some linguistic phenomenon happens.
– E.g., look at the words before and after the period to see if it is the end of a sentence or not.
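A tiny sketch of extracting word-based n-grams (my own code; character- or POS-based n-grams work the same way, just over a different token sequence):

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the cat sat on the mat".split()
print(ngrams(words, 1))   # unigrams:  ('the',), ('cat',), ...
print(ngrams(words, 2))   # bigrams:   ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(words, 3))   # trigrams:  ('the', 'cat', 'sat'), ...
```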

41 Unigram Tagger
– Train on a set of sentences
– Keep track of how many times each word is seen with each tag
– After training, associate with each word its most likely tag
Problem: many words are never seen in the training data.
Solution: have a default tag to “back off” to.
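A minimal unigram tagger along these lines (my own sketch, not a library API):

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
    counts = defaultdict(Counter)                  # word -> Counter of tags seen with it
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    # after training, associate each word with its most likely (most frequent) tag
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(words, model, default_tag="NN"):
    """Tag each word; back off to a default tag for words never seen in training."""
    return [(w, model.get(w, default_tag)) for w in words]

train = [[("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
         [("book", "VB"), ("the", "DT"), ("flight", "NN")]]
model = train_unigram_tagger(train)
print(unigram_tag("the flight is unknownword".split(), model))
```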

42 What’s wrong with unigram?
The most frequent tag isn’t always right!
Need to take the context into account:
– Which sense of “to” is being used?
– Which sense of “like” is being used?

43 N-gram (Bigram) Tagging
– For tagging, in addition to considering the token’s type, the tagger also uses the tags of the n preceding tokens as context: what is the most likely tag for word n, given word n−1 and tag n−1?
– The tagger picks the tag which is most likely for that context.

44 N-gram tagger
– Uses the preceding N−1 predicted tags
– Also uses the unigram estimate for the current word
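A sketch of such a tagger (my own minimal version, not the course's): the context is the previously predicted tag plus the current word, and it backs off to the unigram estimate, then to a default tag:

```python
from collections import Counter, defaultdict

def train_bigram_tagger(tagged_sentences):
    """Count (previous tag, word) -> tag and word -> tag from (word, tag) sentences."""
    bigram = defaultdict(Counter)       # (prev_tag, word) -> Counter of tags
    unigram = defaultdict(Counter)      # word -> Counter of tags
    for sent in tagged_sentences:
        prev = "<S>"                    # pseudo-tag marking the start of a sentence
        for word, tag in sent:
            bigram[(prev, word)][tag] += 1
            unigram[word][tag] += 1
            prev = tag
    return bigram, unigram

def bigram_tag(words, bigram, unigram, default_tag="NN"):
    tags, prev = [], "<S>"
    for w in words:
        if (prev, w) in bigram:                           # bigram context seen in training
            t = bigram[(prev, w)].most_common(1)[0][0]
        elif w in unigram:                                # back off to the unigram estimate
            t = unigram[w].most_common(1)[0][0]
        else:
            t = default_tag                               # unseen word
        tags.append((w, t))
        prev = t                                          # context = predicted preceding tag
    return tags

train = [[("book", "VB"), ("that", "DT"), ("flight", "NN")],
         [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("red", "JJ")]]
b, u = train_bigram_tagger(train)
print(bigram_tag("book that flight".split(), b, u))   # [('book', 'VB'), ('that', 'DT'), ('flight', 'NN')]
```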

45 Reading the Bigram table (figure; the labels identify the current word, the predicted POS, and the previously seen tag)

46–48 (figure-only slides)

49 Syntax (cont.)
Define a context free grammar for L:
– A set of terminal (T) symbols (words)
– A set of non-terminal (NT) symbols (phrase types)
– A distinguished non-terminal symbol “S” for sentence
– A set PR of production rules of the form: NT → a sequence of one or more symbols from T ∪ NT
A sentence of L is a sequence of terminal symbols that can be generated by:
– Starting with S
– Replacing a non-terminal N with the RHS of a rule from PR with N as its LHS
– Until no more non-terminals remain
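The generation procedure above is easy to sketch in Python (my own toy grammar and code, not from the notes): terminals are the symbols with no production rule, and we keep replacing non-terminals until only terminals remain.

```python
import random

grammar = {                        # non-terminal -> list of possible right-hand sides
    "S":     [["NP", "VP"]],
    "NP":    [["det", "noun"], ["propN"]],
    "VP":    [["verb"], ["verb", "NP"]],
    "det":   [["the"]],
    "noun":  [["flight"], ["book"]],
    "propN": [["John"]],
    "verb":  [["sees"], ["books"]],
}

def generate(symbol="S"):
    """Start with S and expand non-terminals until only terminals remain."""
    if symbol not in grammar:                       # terminal symbol: an actual word
        return [symbol]
    rhs = random.choice(grammar[symbol])            # pick one production rule for this NT
    return [word for sym in rhs for word in generate(sym)]

print(" ".join(generate()))        # e.g. "John books the flight"
```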

50 Syntax tree for a sentence
Any sentence of L has one or more syntax trees that show how it can be derived from S by repeated applications of the rules in PR.
The syntax tree (also called the parse tree) represents a grammatical structure which may also represent a semantic interpretation.

51 Example: a grammar for arithmetic expressions
S → digit
S → S + S
S → S * S
3 * 4 + 5 has two valid parse trees with different semantics. What are they? (class exercise)
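(Spoiler for the exercise.) The two trees correspond to grouping the * first or the + first, and they evaluate differently; a quick check:

```python
# Tree 1: S -> S + S at the root, * applied below:  ((3 * 4) + 5)
# Tree 2: S -> S * S at the root, + applied below:  (3 * (4 + 5))
print((3 * 4) + 5)    # 17
print(3 * (4 + 5))    # 27: same terminal string, different semantics
```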

52 (figure-only slide)

53 Parsing
Parsing with CFGs refers to the task of assigning syntax trees to input sequences of terminal symbols. A valid parse tree covers all and only the elements of the input and has an S at the top. It doesn’t actually mean that the system can select the semantically “correct” tree from among the possible syntax trees. As with everything of interest, parsing involves a search that requires making choices.
Example: “I SAW A MAN IN THE PARK WITH A TELESCOPE”

54 The problem of “scaling up” – the same as in knowledge representation, planning, etc., but even more difficult

55 A Simple CFG for English
S → NP VP
S → aux NP VP
S → VP
NP → det Nom
NP → propN
NP → Nom
Nom → noun
Nom → adj Nom
VP → verb
VP → verb NP
VP → verb PP
PP → prep NP
propN → John, Boston, United States
det → the, that, ...
noun → flight, book, pizza
adj → small, red, happy
verb → book, see
aux → be, do, have
prep → of, in, at, to, before
Example sentence: Book that flight

56 Top-Down Parsing
Since we are trying to find trees rooted with an S (sentences), start with the rules that give us an S. Then work your way down from there to the words.
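A toy top-down recognizer for a fragment of the slide 55 grammar (my own sketch, not the course's parser; it only recognizes, it does not build trees, and backtracking comes for free from trying every rule and every split):

```python
GRAMMAR = {                        # a fragment of the CFG from slide 55
    "S":   [["VP"], ["NP", "VP"]],
    "NP":  [["det", "Nom"]],
    "Nom": [["noun"]],
    "VP":  [["verb"], ["verb", "NP"]],
}
LEXICON = {"det": {"the", "that"}, "noun": {"flight", "book"}, "verb": {"book", "see"}}

def parse(symbol, pos, words):
    """Yield every index j such that `symbol` can cover words[pos:j] (top-down, depth-first)."""
    if symbol in LEXICON:                               # pre-terminal: must match one word
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:                         # try each rule for this symbol
        ends = {pos}
        for sym in rhs:                                 # expand the right-hand side left to right
            ends = {j for e in ends for j in parse(sym, e, words)}
        yield from ends

words = "book that flight".split()
print(len(words) in set(parse("S", 0, words)))          # True: "Book that flight" parses as an S
```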

57 Top Down Space

58 Bottom-Up Parsing
Of course, we also want trees that cover the input words. So start with trees that link up with the words in the right way. Then work your way up from there.

59 Bottom-Up Space

60 Top-Down vs. Bottom-Up
Top-down:
– Only searches for trees that can be answers, but suggests trees that are not consistent with the words
– Guarantees that the tree starts with S as root
– Does not guarantee that the tree will match the input words
Bottom-up:
– Only forms trees consistent with the words, but suggests trees that make no sense globally
– Guarantees that the tree matches the input words
– Does not guarantee that the parse tree will lead to S as a root
Combine the advantages of the two by doing a search constrained from both sides (top and bottom).

61 Top-Down, Depth-First, Left-to-Right Search + Backtracking (“Bottom up filtering”)

62 Example (cont’d)

63 (figure-only slide)

64 Example (cont’d) (figure-only slide)

65 Problem: Left-Recursion
What happens in the following situation:
– S → NP VP
– S → Aux NP VP
– NP → NP PP
– NP → Det Nominal
– …
with the sentence starting with “Did the flight …”?

66 Avoiding Repeated Work
Parsing is hard, and slow. It’s wasteful to redo stuff over and over and over.
Consider an attempt to top-down parse the following as an NP:
“A flight from Indianapolis to Houston on TWA”

67 (figure-only slide)

68–70 (figure-only slides)

