Slide 2: Genetic and Evolutionary Computation (Discussion: GA, GP)
CIS 530 / 730: Artificial Intelligence, Lecture 34 of 42
Wednesday, 19 November 2008
William H. Hsu, Department of Computing and Information Sciences, Kansas State University
KSOL course page: http://snipurl.com/v9v3
Course web site: http://www.kddresearch.org/Courses/Fall-2008/CIS730
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading for next class: Sections 22.1, 22.6-22.7, Russell & Norvig, 2nd edition

Slide 3: Learning Hidden Layer Representations
Hidden Units and Feature Extraction
- Training procedure: hidden unit representations that minimize error E
- Sometimes backprop will define new hidden features that are not explicit in the input representation x, but which capture properties of the input instances that are most relevant to learning the target function t(x)
- Hidden units express newly constructed features
- Change of representation to linearly separable D'
A Target Function (Sparse aka 1-of-C Coding)
- Can this be learned? (Why or why not?)

Slide 4: Training: Evolution of Error and Hidden Unit Encoding
[Plots: sum of squared output errors error_D(o_k) and hidden unit encodings h_j for input 01000000, 1 ≤ j ≤ 3, versus training epochs]

Slide 5: Training: Weight Evolution
Input-to-Hidden Unit Weights and Feature Extraction
- Changes in first weight layer values correspond to changes in hidden layer encoding and consequent output squared errors
- w_0 (bias weight, analogue of threshold in LTU) converges to a value near 0
- Several changes in first 1000 epochs (different encodings)
[Plot: input-to-hidden weights u_i1, 1 ≤ i ≤ 8, versus training epochs]

Slide 6: Convergence of Backpropagation
No Guarantee of Convergence to a Global Optimum
- Compare: perceptron convergence (to best h ∈ H, provided c ∈ H, i.e., the data are linearly separable)
- Gradient descent converges to some local error minimum (perhaps not the global minimum)
- Possible improvements on backprop (BP), as sketched below:
  - Momentum term (BP variant with slightly different weight update rule)
  - Stochastic gradient descent (BP algorithm variant)
  - Train multiple nets with different initial weights; find a good mixture
- Improvements on feedforward networks
  - Bayesian learning for ANNs (e.g., simulated annealing) - later
  - Other global optimization methods that integrate over multiple networks
Nature of Convergence
- Initialize weights near zero
- Therefore, the initial network is near-linear
- Increasingly non-linear functions become possible as training progresses
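One of the variants listed above, a gradient step with a momentum term, can be sketched in a few lines. This is an illustrative sketch under the usual convention Δw(t) = -lr·grad + momentum·Δw(t-1), not the course's own code; the function and parameter names are assumptions.

```python
# Minimal sketch (assumed convention): gradient descent with a momentum term,
# one of the backprop variants listed above.
import numpy as np

def momentum_update(w, grad, prev_delta, lr=0.05, momentum=0.9):
    """Return (new_weights, delta) for one momentum-based update step."""
    delta = -lr * grad + momentum * prev_delta   # keep a fraction of the last step
    return w + delta, delta

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
delta = np.zeros_like(w)
for _ in range(50):
    w, delta = momentum_update(w, grad=2 * w, prev_delta=delta)
print(w)  # approaches [0, 0]
```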

Slide 7: Overtraining in ANNs
Recall: Definition of Overfitting
- h' worse than h on D_train, better on D_test
Overtraining: A Type of Overfitting
- Due to excessive training iterations
- Avoidance: stopping criterion (cross-validation: holdout, k-fold); see the sketch below
- Avoidance: weight decay
[Plots: error versus epochs, Example 1 and Example 2]
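A holdout-based stopping criterion of the kind mentioned above can be sketched as follows. The names `train_epoch` and `validation_error` are hypothetical stand-ins for one backprop pass and a holdout-error evaluation, and the network is assumed to be a NumPy weight array; this is an assumed setup, not the course's implementation.

```python
# Minimal sketch (assumed setup): holdout-based early stopping to avoid
# overtraining. `net` is a NumPy array of weights updated in place.
def train_with_early_stopping(net, train_epoch, validation_error,
                              max_epochs=1000, patience=20):
    best_err, best_weights, since_best = float("inf"), net.copy(), 0
    for epoch in range(max_epochs):
        train_epoch(net)                    # one backprop pass over D_train
        err = validation_error(net)         # error on the held-out set
        if err < best_err:
            best_err, best_weights, since_best = err, net.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:      # validation error stopped improving
                break
    return best_weights                     # weights from the best validation epoch
```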

Slide 8: Overfitting in ANNs
Other Causes of Overfitting Possible
- Number of hidden units sometimes set in advance
- Too few hidden units ("underfitting")
  - ANNs with no growth
  - Analogy: underdetermined linear system of equations (more unknowns than equations)
- Too many hidden units
  - ANNs with no pruning
  - Analogy: fitting a quadratic polynomial with an approximator of degree >> 2
Solution Approaches
- Prevention: attribute subset selection (using pre-filter or wrapper)
- Avoidance
  - Hold out a cross-validation (CV) set or split k ways (when to stop?)
  - Weight decay: decrease each weight by some factor on each epoch (see the sketch below)
- Detection/recovery: random restarts, addition and deletion of weights, units
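The weight-decay strategy above ("decrease each weight by some factor on each epoch") can be written directly; the learning rate and decay factor below are illustrative assumptions.

```python
# Minimal sketch (assumed convention): weight decay shrinks every weight by a
# small factor on each epoch, in addition to the ordinary gradient step.
import numpy as np

def weight_decay_step(w, grad, lr=0.05, decay=1e-3):
    """One update: gradient descent followed by multiplicative weight decay."""
    w = w - lr * grad          # ordinary gradient-descent step
    return w * (1.0 - decay)   # pull every weight slightly toward zero

w = np.array([0.8, -1.5, 2.0])
w = weight_decay_step(w, grad=np.zeros_like(w))
print(w)  # [ 0.7992 -1.4985  1.998 ] -- weights shrink even with zero gradient
```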

Slide 9: Example: Neural Nets for Face Recognition
- 30 x 32 inputs; outputs: Left, Straight, Right, Up
- 90% accurate learning of head pose, recognizing 1 of 20 faces
- http://www.cs.cmu.edu/~tom/faces.html
[Images: hidden layer weights after 1 epoch and after 25 epochs; output layer weights (including the bias weight w_0) after 1 epoch]

Slide 10: Example: NetTalk (Sejnowski and Rosenberg, 1987)
Early Large-Scale Application of Backprop
- Learning to convert text to speech
  - Acquired model: a mapping from letters to phonemes and stress marks
  - Output passed to a speech synthesizer
- Good performance after training on a vocabulary of ~1000 words
Very Sophisticated Input-Output Encoding
- Input: 7-letter window; determines the phoneme for the center letter and context on each side; distributed (i.e., sparse) representation: 200 bits
- Output: units for articulatory modifiers (e.g., "voiced"), stress, closest phoneme; distributed representation
- 40 hidden units; 10000 weights total
Experimental Results
- Vocabulary: trained on 1024 of 1463 words (informal) and 1000 of 20000 (dictionary)
- 78% accuracy on informal, ~60% on dictionary
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

Slide 11: NeuroSolutions Demo

Slide 12: PAC Learning: Definition and Rationale
Intuition
- Can't expect a learner to learn a concept exactly
  - Multiple consistent concepts
  - Unseen examples could have any label ("OK" to mislabel if "rare")
- Can't always approximate c closely (probability of D not being representative)
Terms Considered
- Class C of possible concepts, learner L, hypothesis space H
- Instances X, each of length n attributes
- Error parameter ε, confidence parameter δ, true error error_D(h)
- size(c) = the encoding length of c, assuming some representation
Definition (restated formally below)
- C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will, with probability at least (1 - δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε
- Efficiently PAC-learnable: L runs in time polynomial in 1/ε, 1/δ, n, size(c)
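For reference, the definition above can be written compactly; this is only a restatement of the slide's wording in standard notation, not an additional result.

```latex
% PAC-learnability, restated from the slide (standard form).
\[
\forall c \in C,\; \forall D \text{ over } X,\;
\forall \epsilon \in (0, \tfrac{1}{2}),\; \forall \delta \in (0, \tfrac{1}{2}):
\quad
\Pr\big[\, error_D(h) \le \epsilon \,\big] \;\ge\; 1 - \delta,
\]
where $h \in H$ is the hypothesis output by learner $L$ on examples drawn i.i.d.
from $D$; $C$ is \emph{efficiently} PAC-learnable if $L$ also runs in time
polynomial in $1/\epsilon$, $1/\delta$, $n$, and $size(c)$.
```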

Slide 13: PAC Learning: Results for Two Hypothesis Languages
Unbiased Learner
- Recall: sample complexity bound m ≥ 1/ε (ln |H| + ln (1/δ))
- Sample complexity not always polynomial
- Example: for the unbiased learner, |H| = 2^|X|
- Suppose X consists of n booleans (binary-valued attributes)
  - |X| = 2^n, |H| = 2^(2^n)
  - m ≥ 1/ε (2^n ln 2 + ln (1/δ))
  - Sample complexity for this H is exponential in n (compared numerically below)
Monotone Conjunctions
- Target function of the form c = x_1 ∧ x_2 ∧ … ∧ x_m (a conjunction of positive literals)
- Active learning protocol (learner gives query instances): n examples needed
- Passive learning with a helpful teacher: k examples (k literals in the true concept)
- Passive learning with randomly selected examples (proof to follow): m ≥ 1/ε (ln |H| + ln (1/δ)) = 1/ε (ln n + ln (1/δ))
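A quick numeric comparison of the two bounds quoted above can be run as follows; the formulas are taken as stated on the slide, and the particular (n, ε, δ) values are illustrative.

```python
# Minimal sketch: evaluate the two sample-complexity bounds quoted above
# (unbiased learner vs. monotone conjunctions, as stated on the slide) for
# small n, to show exponential vs. near-logarithmic growth in n.
from math import log, ceil

def m_unbiased(n, eps, delta):
    # m >= 1/eps * (2^n ln 2 + ln(1/delta)), since ln|H| = 2^n ln 2
    return ceil((2**n * log(2) + log(1 / delta)) / eps)

def m_monotone_conj(n, eps, delta):
    # bound quoted on the slide: m >= 1/eps * (ln n + ln(1/delta))
    return ceil((log(n) + log(1 / delta)) / eps)

for n in (5, 10, 20):
    print(n, m_unbiased(n, eps=0.1, delta=0.1), m_monotone_conj(n, eps=0.1, delta=0.1))
```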

Slide 14: PAC Learning: Monotone Conjunctions [1]
Monotone Conjunctive Concepts
- Suppose c ∈ C (and h ∈ H) is of the form x_1 ∧ x_2 ∧ … ∧ x_m
- n possible variables: each either omitted or included (i.e., positive literals only)
Errors of Omission (False Negatives)
- Claim: the only possible errors are false negatives (h(x) = −, c(x) = +)
- Mistake iff (z ∈ h) ∧ (z ∉ c) ∧ (∃ x ∈ D_test . x(z) = false): then h(x) = −, c(x) = +
Probability of False Negatives
- Let z be a literal; let Pr(Z) be the probability that z is false in a positive x drawn from D
- z in target concept (correct conjunction c = x_1 ∧ x_2 ∧ … ∧ x_m) ⇒ Pr(Z) = 0
- Pr(Z) is the probability that a randomly chosen positive example has z = false (inducing a potential mistake, or deleting z from h if training is still in progress)
- error(h) ≤ Σ_{z ∈ h} Pr(Z)
[Diagram: instance space X showing regions for concept c and hypothesis h with labeled positive and negative examples]

Slide 15: PAC Learning: Monotone Conjunctions [2]
Bad Literals
- Call a literal z bad if Pr(Z) > ε = ε′/n
- z does not belong in h, and is likely to be dropped (by appearing with value true in a positive x ∈ D), but has not yet appeared in such an example
Case of No Bad Literals
- Lemma: if there are no bad literals, then error(h) ≤ ε′
- Proof: error(h) ≤ Σ_{z ∈ h} Pr(Z) ≤ Σ_{z ∈ h} ε′/n ≤ ε′ (worst case: all n literals z remain in h)
Case of Some Bad Literals
- Let z be a bad literal
- Survival probability (probability that it is not eliminated by a given example): 1 − Pr(Z) < 1 − ε′/n
- Survival probability over m examples: (1 − Pr(Z))^m < (1 − ε′/n)^m
- Worst-case survival probability over m examples (n bad literals): n (1 − ε′/n)^m
- Intuition: more chance of a mistake = greater chance to learn

Slide 16: PAC Learning: Monotone Conjunctions [3]
Goal: An Upper Bound for the Worst-Case Survival Probability
- Choose m large enough so that the probability of a bad literal z surviving across m examples is less than δ
- Pr(z survives m examples) = n (1 − ε′/n)^m < δ
- Solve for m using the inequality 1 − x < e^(−x):
  - n e^(−m ε′/n) < δ
  - m > n/ε′ (ln n + ln (1/δ)) examples needed to guarantee the bounds
- This completes the proof of the PAC result for monotone conjunctions
- Nota bene: a specialization of m ≥ 1/ε (ln |H| + ln (1/δ)); n/ε′ = 1/ε
Practical Ramifications (verified numerically below)
- Suppose δ = 0.1, ε′ = 0.1, n = 100: we need 6907 examples
- Suppose δ = 0.1, ε′ = 0.1, n = 10: we need only 460 examples
- Suppose δ = 0.01, ε′ = 0.1, n = 10: we need only 690 examples
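The three example counts above can be reproduced by evaluating the derived bound directly; this sketch just plugs the slide's numbers into the formula.

```python
# Minimal sketch: evaluate m > (n / eps') * (ln n + ln(1/delta)) for the
# slide's three (n, eps', delta) settings.
from math import log

def m_bound(n, eps_prime, delta):
    return (n / eps_prime) * (log(n) + log(1 / delta))

for n, eps_prime, delta in [(100, 0.1, 0.1), (10, 0.1, 0.1), (10, 0.1, 0.01)]:
    print(n, eps_prime, delta, round(m_bound(n, eps_prime, delta), 1))
# Prints roughly 6907.8, 460.5, 690.8 -- quoted on the slide as 6907, 460, 690.
```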

Slide 17: PAC Learning: k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF
k-CNF (Conjunctive Normal Form) Concepts: Efficiently PAC-Learnable
- Conjunctions of any number of disjunctive clauses, each with at most k literals
- c = C_1 ∧ C_2 ∧ … ∧ C_m; C_i = l_1 ∨ l_2 ∨ … ∨ l_k; ln(|k-CNF|) = ln(2^((2n)^k)) = Θ(n^k)
- Algorithm: reduce to learning monotone conjunctions over the n^k pseudo-literals C_i
k-Clause-CNF
- c = C_1 ∧ C_2 ∧ … ∧ C_k; C_i = l_1 ∨ l_2 ∨ … ∨ l_m; ln(|k-Clause-CNF|) = ln(3^(kn)) = Θ(kn)
- Efficiently PAC-learnable? See below (k-Clause-CNF and k-Term-DNF are duals)
k-DNF (Disjunctive Normal Form)
- Disjunctions of any number of conjunctive terms, each with at most k literals
- c = T_1 ∨ T_2 ∨ … ∨ T_m; T_i = l_1 ∧ l_2 ∧ … ∧ l_k
k-Term-DNF: "Not" Efficiently PAC-Learnable (Kind Of, Sort Of…)
- c = T_1 ∨ T_2 ∨ … ∨ T_k; T_i = l_1 ∧ l_2 ∧ … ∧ l_m; ln(|k-Term-DNF|) = ln(k · 3^n) = Θ(n + ln k)
- Polynomial sample complexity, but not polynomial computational complexity (unless RP = NP)
- Solution: don't use H = C! k-Term-DNF ⊆ k-CNF (so let H = k-CNF)

Slide 18: Consistent Learners
General Scheme for Learning
- Follows immediately from the definition of a consistent hypothesis
- Given: a sample D of m examples
- Find: some h ∈ H that is consistent with all m examples
- PAC: show that if m is large enough, a consistent hypothesis must be close enough to c
- Efficient PAC (and other COLT formalisms): show that the consistent hypothesis can be computed efficiently
Monotone Conjunctions
- Used an Elimination algorithm (compare: Find-S) to find a hypothesis h that is consistent with the training set (easy to compute; see the sketch below)
- Showed that with sufficiently many examples (polynomial in the parameters), h is close to c
- Sample complexity gives an assurance of "convergence to criterion" for specified m, and a necessary condition (polynomial in n) for tractability
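A sketch of the Elimination algorithm referenced above, under an assumed encoding of examples as (boolean-tuple, label) pairs; the function and variable names are illustrative, not the course's code.

```python
# Minimal sketch (assumed encoding): Elimination for monotone conjunctions.
# The hypothesis is the set of variable indices still conjoined; positive
# examples delete any literal they make false.
def eliminate(examples, n):
    h = set(range(n))                             # start with all n literals
    for x, label in examples:
        if label:                                 # positive examples only
            h -= {i for i in h if not x[i]}       # drop literals false in x
    return h

# Toy usage: target concept c = x0 AND x2 over n = 3 variables.
D = [((True, True, True), True),
     ((True, False, True), True),
     ((False, True, False), False)]
print(eliminate(D, n=3))  # {0, 2}
```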

Slide 19: VC Dimension: Framework
Infinite Hypothesis Space?
- Preceding analyses were restricted to finite hypothesis spaces
- Some infinite hypothesis spaces are more expressive than others, e.g.,
  - rectangles vs. 17-sided convex polygons vs. general convex polygons
  - a linear threshold (LT) function vs. a conjunction of LT units
- Need a measure of the expressiveness of an infinite H other than its size
Vapnik-Chervonenkis Dimension: VC(H)
- Provides such a measure
- Analogous to |H|: there are bounds for sample complexity using VC(H)

Slide 20: VC Dimension: Shattering a Set of Instances
Dichotomies
- Recall: a partition of a set S is a collection of disjoint sets S_i whose union is S
- Definition: a dichotomy of a set S is a partition of S into two subsets S_1 and S_2
Shattering
- A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S, there exists a hypothesis in H consistent with this dichotomy
- Intuition: a rich set of functions shatters a larger instance space
The "Shattering Game" (An Adversarial Interpretation)
- Your client selects an S (an instance space X)
- You select an H
- Your adversary labels S (i.e., chooses a point c from concept space C = 2^X)
- You must then find some h ∈ H that "covers" (is consistent with) c
- If you can do this for any c your adversary comes up with, H shatters S

Slide 21: VC Dimension: Examples of Shattered Sets
Three Instances Shattered (diagram below)
Intervals
- Left-bounded intervals on the real axis: [0, a), for a ∈ R, a ≥ 0
  - Sets of 2 points cannot be shattered
  - Given 2 points, the adversary can label them so that no hypothesis is consistent
- Intervals on the real axis ([a, b], with a, b ∈ R, b > a): can shatter 1 or 2 points, not 3 (checked in the sketch below)
- Half-spaces in the plane (points non-collinear): can we shatter 1? 2? 3? 4 points?
[Diagrams: three instances shattered in instance space X; number lines for [0, a) and [a, b] hypotheses with +/− labeled regions]
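The claim that intervals [a, b] shatter 2 points but not 3 can be checked by brute force; the grid of candidate endpoints is an assumption made for this finite check, not part of the definition.

```python
# Minimal sketch: brute-force check that closed intervals [a, b] on the real
# line shatter 2 points but not 3.
from itertools import product

def interval_shatters(points):
    grid = sorted({p + d for p in points for d in (-0.5, 0.0, 0.5)})
    for labels in product([True, False], repeat=len(points)):   # every dichotomy
        realizable = any(
            all((a <= p <= b) == want for p, want in zip(points, labels))
            for a in grid for b in grid if a <= b
        )
        if not realizable:
            return False
    return True

print(interval_shatters([1.0, 2.0]))        # True: 2 points can be shattered
print(interval_shatters([1.0, 2.0, 3.0]))   # False: e.g. labeling (+, -, +) fails
```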

Slide 22: Lecture Outline
Readings for Friday
- Finish Chapter 20, Russell and Norvig 2e
- Suggested: Chapter 1, 6.1-6.5, Goldberg; 9.1-9.4, Mitchell
Evolutionary Computation
- Biological motivation: the process of natural selection
- Framework for search, optimization, and learning
Prototypical (Simple) Genetic Algorithm
- Components: selection, crossover, mutation
- Representing hypotheses as individuals in GAs
An Example: GA-Based Inductive Learning (GABIL)
GA Building Blocks (aka Schemas)
Taking Stock (Course Review)

Slide 23: Simple Genetic Algorithm (SGA)
Algorithm Simple-Genetic-Algorithm (Fitness, Fitness-Threshold, p, r, m)
  // p: population size; r: replacement rate (aka generation gap width); m: mutation rate
  P ← p random hypotheses                      // initialize population
  FOR each h in P DO f[h] ← Fitness(h)         // evaluate Fitness: hypothesis → R
  WHILE (Max(f) < Fitness-Threshold) DO
    1. Select: probabilistically select (1 − r)·p members of P to add to P_S
    2. Crossover:
       Probabilistically select (r · p)/2 pairs of hypotheses <h_1, h_2> from P
       FOR each pair DO P_S += Crossover(<h_1, h_2>)   // P_S[t+1] = P_S[t] + offspring
    3. Mutate: invert a randomly selected bit in m · p random members of P_S
    4. Update: P ← P_S
    5. Evaluate: FOR each h in P DO f[h] ← Fitness(h)
  RETURN the hypothesis h in P that has maximum fitness f[h]
(A runnable sketch of this loop follows below.)
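A runnable sketch of the loop above, with fitness-proportionate selection, single-point crossover, and bit-flip mutation. The toy "count the ones" fitness function and the particular parameter values are assumptions for illustration, not part of the algorithm statement.

```python
# Minimal runnable sketch (not the course's code) of the SGA above.
import random

def sga(fitness, threshold, p=50, r=0.6, m=0.02, length=10):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(p)]
    while max(map(fitness, pop)) < threshold:
        weights = [fitness(h) for h in pop]
        # 1. Select: (1 - r) * p members survive, chosen in proportion to fitness
        nexts = [list(h) for h in
                 random.choices(pop, weights=weights, k=round((1 - r) * p))]
        # 2. Crossover: (r * p) / 2 selected pairs each contribute two offspring
        for _ in range(round(r * p / 2)):
            h1, h2 = random.choices(pop, weights=weights, k=2)
            cut = random.randrange(1, length)
            nexts += [h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]]
        # 3. Mutate: flip one random bit in m * p random members of the new population
        for h in random.sample(nexts, round(m * p)):
            i = random.randrange(length)
            h[i] = 1 - h[i]
        pop = nexts                           # 4. Update
    return max(pop, key=fitness)              # 5. Return the fittest hypothesis

best = sga(fitness=sum, threshold=10)         # toy task: evolve the all-ones string
print(best)
```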

Slide 24: GA-Based Inductive Learning (GABIL)
GABIL System [DeJong et al., 1993]
- Given: a concept learning problem and examples
- Learn: a disjunctive set of propositional rules
- Goal: results competitive with those of current decision tree learning algorithms (e.g., C4.5)
Fitness Function: Fitness(h) = (Correct(h))^2
Representation (see the encoding sketch below)
- Rules: IF a_1 = T ∧ a_2 = F THEN c = T; IF a_2 = T THEN c = F
- Bit string encoding: a_1 [10], a_2 [01], c [1]; a_1 [11], a_2 [10], c [0] = 10011 11100
Genetic Operators
- Want variable-length rule sets
- Want only well-formed bit string hypotheses
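The bit-string encoding above can be illustrated with a small helper; `encode_rule` is a hypothetical function written for this sketch, not part of GABIL.

```python
# Minimal sketch (assumed helper): one bit per allowed attribute value, plus
# one bit for the rule's class; rules are concatenated into the hypothesis.
def encode_rule(a1_allowed, a2_allowed, c_value):
    """a1_allowed / a2_allowed are sets of allowed values drawn from {'T', 'F'}."""
    bits = ""
    for allowed in (a1_allowed, a2_allowed):
        bits += ("1" if "T" in allowed else "0") + ("1" if "F" in allowed else "0")
    return bits + ("1" if c_value == "T" else "0")

# IF a1 = T AND a2 = F THEN c = T   ->  10 01 1
# IF a2 = T THEN c = F              ->  11 10 0  (a1 unconstrained)
rule1 = encode_rule({"T"}, {"F"}, "T")
rule2 = encode_rule({"T", "F"}, {"T"}, "F")
print(rule1, rule2)  # 10011 11100
```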

Slide 25: Crossover: Variable-Length Bit Strings
Basic Representation
- Start with two rule-set hypotheses (a1 a2 c | a1 a2 c):
  h1 = 1[0011111]00   (i.e., 10 01 1 11 10 0, with crossover points after bits 1 and 8)
  h2 = 0[11]1010010   (i.e., 01 11 0 10 01 0)
- Idea: allow crossover to produce variable-length offspring
Procedure
- 1. Choose crossover points for h1, e.g., after bits 1, 8
- 2. Now restrict crossover points in h2 to those that produce bit strings with well-defined semantics, e.g., <1, 3>, <1, 8>, <6, 8>
Example
- Suppose we choose <1, 3>
- Result:
  h3 = 11 10 0
  h4 = 00 01 1 11 11 0 10 01 0

Slide 26: GABIL Extensions
New Genetic Operators
- Applied probabilistically
- 1. AddAlternative: generalize the constraint on a_i by changing a 0 to a 1
- 2. DropCondition: generalize the constraint on a_i by changing every 0 to a 1
New Fields
- Add fields to the bit string to decide whether to allow the above operators:
    a1 a2 c | a1 a2 c | AA DC
    01 11 0 | 10 01 0 | 1  0
- So now the learning strategy also evolves!
- aka genetic wrapper

Slide 27: GABIL Results
Classification Accuracy
- Compared to symbolic rule/tree learning methods
  - C4.5 [Quinlan, 1993]
  - ID5R
  - AQ14 [Michalski, 1986]
- Performance of GABIL comparable
  - Average performance on a set of 12 synthetic problems: 92.1% test accuracy
  - Symbolic learning methods ranged from 91.2% to 96.6%
Effect of Generalization Operators
- Result above is for GABIL without AA and DC
- Average test-set accuracy on the 12 synthetic problems with AA and DC: 95.2%

Slide 28: Building Blocks (Schemas)
Problem
- How to characterize the evolution of a population in a GA?
Goal
- Identify the basic building block of GAs
- Describe a family of individuals
Definition: Schema
- String containing 0, 1, * ("don't care")
- Typical schema: 10**0*
- Instances of the above schema: 101101, 100000, …
Solution Approach
- Characterize the population by the number of instances representing each schema
- m(s, t) ≡ number of instances of schema s in the population at time t (see the sketch below)
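Counting m(s, t) is straightforward once schema matching is defined; the population below is illustrative.

```python
# Minimal sketch: count m(s, t), the number of instances of schema s in a
# population at time t, with '*' as the "don't care" symbol.
def matches(schema, individual):
    return all(s == '*' or s == b for s, b in zip(schema, individual))

def m(schema, population):
    return sum(matches(schema, h) for h in population)

pop_t = ["101101", "100000", "111111", "001101"]
print(m("10**0*", pop_t))  # 2 -- the first two strings are instances of the schema
```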

Slide 29: Selection and Building Blocks
Restricted Case: Selection Only
- f̄(t) ≡ average fitness of the population at time t
- m(s, t) ≡ number of instances of schema s in the population at time t
- û(s, t) ≡ average fitness of instances of schema s at time t
Quantities of Interest (standard forms given below)
- Probability of selecting h in one selection step
- Probability of selecting an instance of s in one selection step
- Expected number of instances of s after n selections
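The three quantities are given as equations on the original slide but were lost in this transcript; their standard forms, assuming Mitchell's notation and a population of size n, are:

```latex
% Selection-only quantities (standard forms; the slide's own images were lost).
\[
\Pr(h) \;=\; \frac{f(h)}{\sum_{i=1}^{n} f(h_i)},
\qquad
\Pr(h \in s) \;=\; \frac{\hat{u}(s,t)\, m(s,t)}{n \,\bar{f}(t)},
\qquad
\mathbb{E}\!\left[ m(s, t+1) \right] \;=\; \frac{\hat{u}(s,t)}{\bar{f}(t)}\, m(s,t).
\]
```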

Slide 30: Schema Theorem
Theorem (standard statement given below)
- m(s, t) ≡ number of instances of schema s in the population at time t
- f̄(t) ≡ average fitness of the population at time t
- û(s, t) ≡ average fitness of instances of schema s at time t
- p_c ≡ probability of the single-point crossover operator
- p_m ≡ probability of the mutation operator
- l ≡ length of the individual bit strings
- o(s) ≡ number of defined (non-"*") bits in s
- d(s) ≡ distance between the rightmost and leftmost defined bits in s
Intuitive Meaning
- "The expected number of instances of a schema in the population tends toward its relative fitness"
- A fundamental theorem of GA analysis and design
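The inequality itself appears only as an image on the original slide; in its standard form (e.g., Mitchell, Machine Learning, Ch. 9), using exactly the symbols defined above, it reads:

```latex
% Schema theorem (standard form), using the symbols defined on the slide.
\[
\mathbb{E}\!\left[ m(s, t+1) \right]
\;\ge\;
\frac{\hat{u}(s,t)}{\bar{f}(t)}\, m(s,t)
\left( 1 - p_c \,\frac{d(s)}{l - 1} \right)
\left( 1 - p_m \right)^{o(s)}.
\]
```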

Slide 31: Genetic Programming
Readings / Viewings
- View GP videos 1-3
  - GP1 - Genetic Programming: The Video
  - GP2 - Genetic Programming: The Next Generation
  - GP3 - Genetic Programming: Invention
  - GP4 - Genetic Programming: Human-Competitive
- Suggested: Chapters 1-5, Koza
Previously
- Genetic and evolutionary computation (GEC)
- Generational vs. steady-state GAs; relation to simulated annealing, MCMC
- Schema theory and GA engineering overview
Today: GP Discussions
- Code bloat and potential mitigants: types, OOP, parsimony, optimization, reuse
- Genetic programming vs. human programming: similarities, differences

Slide 32: GP Flow Graph
[Figure adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez, http://www.geneticprogramming.com]

Slide 33: Structural Crossover
[Figure adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez, http://www.geneticprogramming.com; a code sketch follows below]
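Since the figure is not reproduced here, a minimal sketch of structural (subtree) crossover may help. The nested-list program representation and the helper functions are assumptions made for this illustration, not Koza's or the course's implementation.

```python
# Minimal sketch (assumed representation): GP structural crossover swaps
# randomly chosen subtrees between two program trees, encoded as nested lists,
# e.g. ["+", ["*", "x", "x"], 1.0] for x*x + 1.
import random, copy

def all_paths(tree, path=()):
    """Yield every node's path as a tuple of child indices (operands start at 1)."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_paths(child, path + (i,))

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_(tree, path, subtree):
    for i in path[:-1]:
        tree = tree[i]
    tree[path[-1]] = subtree

def structural_crossover(p1, p2):
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    # pick non-root subtrees so each swap has a parent node to attach to
    path1 = random.choice([p for p in all_paths(c1) if p])
    path2 = random.choice([p for p in all_paths(c2) if p])
    s1, s2 = get(c1, path1), get(c2, path2)
    set_(c1, path1, s2)
    set_(c2, path2, s1)
    return c1, c2

parent1 = ["+", ["*", "x", "x"], 1.0]            # x*x + 1
parent2 = ["-", ["sin", "x"], ["/", "x", 2.0]]   # sin(x) - x/2
print(structural_crossover(parent1, parent2))
```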

Slide 34: Structural Mutation
[Figure adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez, http://www.geneticprogramming.com]

Slide 35: Terminology
Evolutionary Computation (EC): Models Based on Natural Selection
Genetic Algorithm (GA) Concepts
- Individual: single entity of the model (corresponds to a hypothesis)
- Population: collection of entities in competition for survival
- Generation: single application of the selection and crossover operations
- Schema, aka building block: descriptor of a GA population (e.g., 10**0*)
- Schema theorem: representation of a schema is proportional to its relative fitness
Simple Genetic Algorithm (SGA) Steps
- Selection
  - Proportionate (aka roulette wheel): P(individual) ∝ f(individual)
  - Tournament: let individuals compete in pairs or tuples; eliminate the unfit ones
- Crossover (see the operator demo below)
  - Single-point: 11101001000 × 00001010101 → { 11101010101, 00001001000 }
  - Two-point: 11101001000 × 00001010101 → { 11001011000, 00101000101 }
  - Uniform: 11101001000 × 00001010101 → { 10001000100, 01101011001 }
- Mutation: single-point ("bit flip"), multi-point
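The crossover examples above can be reproduced with a few small functions; the particular cut points below are chosen to match the slide's strings, whereas a real GA would choose them at random.

```python
# Minimal sketch of the three crossover operators listed above.
def single_point(a, b, cut):
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point(a, b, c1, c2):
    return a[:c1] + b[c1:c2] + a[c2:], b[:c1] + a[c1:c2] + b[c2:]

def uniform(a, b, mask):
    """mask[i] == '1' keeps parent a's bit in the first child (and b's in the second)."""
    x = "".join(ai if m == "1" else bi for ai, bi, m in zip(a, b, mask))
    y = "".join(bi if m == "1" else ai for ai, bi, m in zip(a, b, mask))
    return x, y

p1, p2 = "11101001000", "00001010101"
print(single_point(p1, p2, cut=5))       # ('11101010101', '00001001000')
print(two_point(p1, p2, c1=2, c2=7))     # ('11001011000', '00101000101')
```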

Slide 36: Summary Points
Evolutionary Computation
- Motivation: the process of natural selection
  - Limited population; individuals compete for membership
  - Method for parallelizing and stochastic search
- Framework for problem solving: search, optimization, learning
Prototypical (Simple) Genetic Algorithm (GA)
- Steps
  - Selection: reproduce individuals probabilistically, in proportion to fitness
  - Crossover: generate new individuals probabilistically, from pairs of "parents"
  - Mutation: modify the structure of an individual randomly
- How to represent hypotheses as individuals in GAs
An Example: GA-Based Inductive Learning (GABIL)
Schema Theorem: Propagation of Building Blocks
Next Lecture: Genetic Programming, The Movie

