
CIS 490 / 730: Artificial Intelligence, Lecture 35 of 42, Wednesday, 15 November 2006. Computing & Information Sciences, Kansas State University.


Slide 1: (title slide)

Slide 2: Lecture 35 of 42, Wednesday, 15 November 2006
William H. Hsu, Department of Computing and Information Sciences, KSU
KSOL course page: http://snipurl.com/v9v3
Course web site: http://www.kddresearch.org/Courses/Fall-2006/CIS730
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading for next class: Section 20.5, Russell & Norvig, 2nd edition
Statistical learning; discussion: ANNs and PS7

Slide 3: Lecture Outline
Today's reading: Section 20.1, R&N 2e
Friday's reading: Section 20.5, R&N 2e
Machine learning, continued: review
  Finding hypotheses
    - Version spaces
    - Candidate elimination
  Decision trees
    - Induction
    - Greedy learning
    - Entropy
  Perceptrons
    - Definitions, representation
    - Limitations

Slide 4: Example Trace
[Figure: candidate elimination trace showing the boundary sets S0 through S4 and G0 through G4 being updated on training examples d1 through d4, with S1 = G1 and S2 = G2 at intermediate steps.]

Slide 5: An Unbiased Learner
Example of a biased H
  - Conjunctive concepts with don't-cares
  - What concepts can H not express? (Hint: what are its syntactic limitations?)
Idea
  - Choose H' that expresses every teachable concept, i.e., H' is the power set of X
  - Recall: |A → B| = |B|^|A| (A = X; B = {labels}; H' = A → B)
  - X = {Rainy, Sunny} × {Warm, Cold} × {Normal, High} × {None, Mild, Strong} × {Cool, Warm} × {Same, Change}; labels = {0, 1}
An exhaustive hypothesis language
  - Consider: H' = disjunctions (∨), conjunctions (∧), and negations (¬) over the previous H
  - |H'| = 2^(2·2·2·3·2·2) = 2^96; |H| = 1 + (3·3·3·4·3·3) = 973
What are S, G for the hypothesis language H'?
  - S ← disjunction of all positive examples
  - G ← conjunction of all negated negative examples
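Spelling out the two counts on this slide (plain arithmetic; each factor in |H| is the number of attribute values plus one for the "?" don't-care, and the leading 1 counts the single hypothesis that labels every instance negative):

    |X|  = 2·2·2·3·2·2 = 96
    |H'| = |{0,1}|^|X| = 2^96
    |H|  = 1 + (3·3·3·4·3·3) = 1 + 972 = 973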

Slide 6: Decision Trees
Classifiers: instances (unlabeled examples)
Internal nodes: tests for attribute values
  - Typical: equality test (e.g., "Wind = ?")
  - Inequality and other tests are also possible
Branches: attribute values
  - One-to-one correspondence (e.g., "Wind = Strong", "Wind = Light")
Leaves: assigned classifications (class labels)
Representational power: propositional logic (why?)
[Figure: a decision tree for the concept PlayTennis, with root test Outlook? (Sunny, Overcast, Rain), subtests Humidity? (High, Normal) and Wind? (Strong, Light), and leaf labels Yes, No, Maybe.]
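To make the node/branch/leaf terminology concrete, here is a minimal Python sketch (not from the lecture) of a decision tree as nested dicts; the leaf labels below are illustrative, not necessarily the slide's exact assignments:

    # A decision tree as nested dicts: internal nodes test one attribute,
    # branches are attribute values, leaves are class labels (strings).
    # Leaf assignments below are illustrative only.
    tree = {
        "attribute": "Outlook",
        "branches": {
            "Sunny":    {"attribute": "Humidity",
                         "branches": {"High": "No", "Normal": "Yes"}},
            "Overcast": "Maybe",
            "Rain":     {"attribute": "Wind",
                         "branches": {"Strong": "No", "Light": "Maybe"}},
        },
    }

    def classify(node, instance):
        """Walk from the root to a leaf, following the branch that matches
        the instance's value for each tested attribute."""
        while isinstance(node, dict):                 # internal node
            value = instance[node["attribute"]]       # equality test on one attribute
            node = node["branches"][value]            # follow the matching branch
        return node                                   # leaf = class label

    print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Light"}))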

Slide 7: Example: Decision Tree to Predict C-Section Risk
Learned from medical records of 1000 women; negative examples are cesarean sections.
Prior distribution: [833+, 167-]  0.83+, 0.17-
  Fetal-Presentation = 1: [822+, 116-]  0.88+, 0.12-
    Previous-C-Section = 0: [767+, 81-]  0.90+, 0.10-
      Primiparous = 0: [399+, 13-]  0.97+, 0.03-
      Primiparous = 1: [368+, 68-]  0.84+, 0.16-
        Fetal-Distress = 0: [334+, 47-]  0.88+, 0.12-
          Birth-Weight ≥ 3349:  0.95+, 0.05-
          Birth-Weight < 3347:  0.78+, 0.22-
        Fetal-Distress = 1: [34+, 21-]  0.62+, 0.38-
    Previous-C-Section = 1: [55+, 35-]  0.61+, 0.39-
  Fetal-Presentation = 2: [3+, 29-]  0.11+, 0.89-
  Fetal-Presentation = 3: [8+, 22-]  0.27+, 0.73-

Slide 8: Decision Tree Learning: Top-Down Induction (ID3)
[Figure: two candidate splits of a [29+, 35-] example set. A1: True branch [21+, 5-], False branch [8+, 30-]. A2: True branch [18+, 33-], False branch [11+, 2-].]
Algorithm Build-DT (Examples, Attributes)
  IF all examples have the same label THEN RETURN (leaf node with that label)
  ELSE IF the set of attributes is empty THEN RETURN (leaf with majority label)
  ELSE
    Choose the best attribute A as the root
    FOR each value v of A
      Create a branch out of the root for the condition A = v
      IF {x ∈ Examples: x.A = v} = Ø THEN RETURN (leaf with majority label)
      ELSE Build-DT ({x ∈ Examples: x.A = v}, Attributes ~ {A})
But which attribute is best?
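A minimal, runnable Python sketch of the Build-DT recursion above (my own function and parameter names, not the course's code; the attribute-selection heuristic is passed in as choose_best, and an information-gain version is sketched after slide 12):

    from collections import Counter

    def majority_label(examples):
        """Most common label among (attribute_dict, label) pairs."""
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def build_dt(examples, attributes, domains, choose_best):
        """Top-down induction as in Build-DT above.
        examples: list of (attribute_dict, label) pairs
        domains:  dict mapping each attribute to its list of possible values
        choose_best(examples, attributes): returns the attribute to split on."""
        labels = {label for _, label in examples}
        if len(labels) == 1:                      # all examples share one label
            return labels.pop()
        if not attributes:                        # no attributes left to test
            return majority_label(examples)
        a = choose_best(examples, attributes)     # pick the "best" root attribute
        tree = {"attribute": a, "branches": {}}
        rest = [b for b in attributes if b != a]
        for v in domains[a]:                      # one branch per value of A
            subset = [(x, label) for x, label in examples if x[a] == v]
            if not subset:                        # empty partition: majority of parent
                tree["branches"][v] = majority_label(examples)
            else:
                tree["branches"][v] = build_dt(subset, rest, domains, choose_best)
        return tree

    # Tiny smoke test with a trivial "first attribute" chooser:
    ex = [({"Wind": "Strong", "Humidity": "High"}, "No"),
          ({"Wind": "Light", "Humidity": "High"}, "Yes")]
    dom = {"Wind": ["Strong", "Light"], "Humidity": ["High", "Normal"]}
    print(build_dt(ex, ["Wind", "Humidity"], dom, lambda e, a: a[0]))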

Slide 9: Choosing the "Best" Root Attribute
Objective
  - Construct a decision tree that is as small as possible (Occam's Razor)
  - Subject to: consistency with the labels on the training data
Obstacles
  - Finding the minimal consistent hypothesis (i.e., decision tree) is NP-hard (D'oh!)
  - The recursive algorithm (Build-DT) is a greedy heuristic search for a simple tree and cannot guarantee optimality (D'oh!)
Main decision: the next attribute to condition on
  - Want: attributes that split the examples into sets that are relatively pure in one label
  - Result: closer to a leaf node
  - The most popular heuristic, developed by J. R. Quinlan, is based on information gain and is used in the ID3 algorithm

Slide 10: Entropy: Intuitive Notion
A measure of uncertainty
  The quantity
    - Purity: how close a set of instances is to having just one label
    - Impurity (disorder): how close it is to total uncertainty over labels
  The measure: entropy
    - Directly proportional to impurity, uncertainty, irregularity, surprise
    - Inversely proportional to purity, certainty, regularity, redundancy
Example
  - For simplicity, assume H = {0, 1}, distributed according to Pr(y)
    - Can have more than 2 discrete class labels
    - For continuous random variables: differential entropy
  - Optimal purity for y: either Pr(y = 0) = 1, Pr(y = 1) = 0, or Pr(y = 1) = 1, Pr(y = 0) = 0
  - The least pure distribution is Pr(y = 0) = 0.5, Pr(y = 1) = 0.5, which corresponds to maximum impurity/uncertainty/irregularity/surprise
  - Property of entropy: concave function ("concave downward")
[Figure: H(p) = Entropy(p) plotted against p+ = Pr(y = +), rising from 0 to its maximum of 1.0 at p+ = 0.5 and falling back to 0 at p+ = 1.0.]

Slide 11: Entropy: Information-Theoretic Definition
Components
  - D: a set of examples {<x1, c(x1)>, <x2, c(x2)>, …, <xm, c(xm)>}
  - p+ = Pr(c(x) = +), p- = Pr(c(x) = -)
Definition
  - H is defined over a probability density function p
  - D contains examples whose frequencies of + and - labels indicate p+ and p- for the observed data
  - The entropy of D relative to c is: H(D) ≡ -p+ log_b(p+) - p- log_b(p-)
What units is H measured in?
  - Depends on the base b of the log (bits for b = 2, nats for b = e, etc.)
  - A single bit is required to encode each example in the worst case (p+ = 0.5)
  - If there is less uncertainty (e.g., p+ = 0.8), we can use less than 1 bit per example on average
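A small Python sketch of this definition, assuming base-2 logarithms so the result is in bits (the helper name entropy is mine, and 0 log 0 is taken as 0 as usual):

    import math

    def entropy(p_pos):
        """Binary entropy H = -p+ lg p+ - p- lg p-, in bits; 0 log 0 is taken as 0."""
        p_neg = 1.0 - p_pos
        terms = [p * math.log2(p) for p in (p_pos, p_neg) if p > 0.0]
        return -sum(terms)

    print(entropy(0.5))   # 1.0 bit: maximum uncertainty
    print(entropy(0.8))   # about 0.72 bits: less uncertainty, fewer bits per example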

Slide 12: Information Gain: Information-Theoretic Definition
Partitioning on attribute values
  - Recall: a partition of D is a collection of disjoint subsets whose union is D
  - Goal: measure the uncertainty removed by splitting on the value of attribute A
Definition
  - The information gain of D relative to attribute A is the expected reduction in entropy due to splitting ("sorting") on A:
      Gain(D, A) ≡ H(D) - Σ_{v ∈ values(A)} (|D_v| / |D|) H(D_v)
    where D_v is {x ∈ D: x.A = v}, the set of examples in D where attribute A has value v
  - Idea: partition on A; scale the entropy of each subset D_v by its relative size
Which attribute is best?
[Figure: the same two candidate splits of [29+, 35-] as on slide 8: A1 with True [21+, 5-] / False [8+, 30-], and A2 with True [18+, 33-] / False [11+, 2-].]
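A Python sketch of Gain(D, A) in terms of [+, -] counts, matching the formula above (function and variable names are mine); the final two lines evaluate the A1 and A2 splits from the figure:

    import math

    def entropy_counts(pos, neg):
        """Entropy in bits of a [pos+, neg-] collection; 0 log 0 is taken as 0."""
        total = pos + neg
        h = 0.0
        for count in (pos, neg):
            if count:
                p = count / total
                h -= p * math.log2(p)
        return h

    def information_gain(parent, partitions):
        """parent = (pos, neg) counts at the node; partitions = list of (pos, neg)
        counts, one per value v of A. Returns H(D) - sum of |Dv|/|D| * H(Dv)."""
        total = sum(parent)
        expected = sum((p + n) / total * entropy_counts(p, n) for p, n in partitions)
        return entropy_counts(*parent) - expected

    # The A1 versus A2 comparison from the figure:
    print(information_gain((29, 35), [(21, 5), (8, 30)]))   # split on A1
    print(information_gain((29, 35), [(18, 33), (11, 2)]))  # split on A2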

Slide 13: Constructing a Decision Tree for PlayTennis using ID3 [1]
Selecting the root attribute
  - Prior (unconditioned) distribution: [9+, 5-]
  - H(D) = -(9/14) lg (9/14) - (5/14) lg (5/14) = 0.94 bits
  - Splitting on Humidity: [9+, 5-] → High [3+, 4-], Normal [6+, 1-]
      H(D, Humidity = High) = -(3/7) lg (3/7) - (4/7) lg (4/7) = 0.985 bits
      H(D, Humidity = Normal) = -(6/7) lg (6/7) - (1/7) lg (1/7) = 0.592 bits
      Gain(D, Humidity) = 0.94 - (7/14) · 0.985 - (7/14) · 0.592 = 0.151 bits
  - Splitting on Wind: [9+, 5-] → Strong [3+, 3-], Light [6+, 2-]
      Similarly, Gain(D, Wind) = 0.94 - (8/14) · 0.811 - (6/14) · 1.0 = 0.048 bits
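The same two gains, reproduced numerically as a quick check (a self-contained snippet; the counts come from the slide's Humidity and Wind splits):

    import math

    def H(pos, neg):
        """Entropy in bits of a [pos+, neg-] collection."""
        total = pos + neg
        return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

    prior = H(9, 5)                                            # about 0.940 bits
    gain_humidity = prior - (7/14) * H(3, 4) - (7/14) * H(6, 1)
    gain_wind     = prior - (8/14) * H(6, 2) - (6/14) * H(3, 3)
    print(round(gain_humidity, 3))   # about 0.151
    print(round(gain_wind, 3))       # about 0.048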

Slide 14: Constructing a Decision Tree for PlayTennis using ID3 [2]
Root: Outlook?  examples 1,2,3,4,5,6,7,8,9,10,11,12,13,14 [9+, 5-]
  - Sunny: examples 1, 2, 8, 9, 11 [2+, 3-] → test Humidity?
      High: examples 1, 2, 8 [0+, 3-] → No
      Normal: examples 9, 11 [2+, 0-] → Yes
  - Overcast: examples 3, 7, 12, 13 [4+, 0-] → Yes
  - Rain: examples 4, 5, 6, 10, 14 [3+, 2-] → test Wind?
      Strong: examples 6, 14 [0+, 2-] → No
      Light: examples 4, 5, 10 [3+, 0-] → Yes
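The learned tree, written out as an ordinary Python function (a sketch; attribute and value names follow the slide, the function name is mine):

    def play_tennis(outlook, humidity, wind):
        """Classify one day with the tree learned by ID3 on the 14 PlayTennis examples."""
        if outlook == "Overcast":
            return "Yes"
        if outlook == "Sunny":
            return "Yes" if humidity == "Normal" else "No"
        if outlook == "Rain":
            return "Yes" if wind == "Light" else "No"
        raise ValueError("unknown Outlook value: " + repr(outlook))

    print(play_tennis("Sunny", "High", "Light"))   # No  (examples 1, 2, 8)
    print(play_tennis("Rain", "Normal", "Light"))  # Yes (examples 4, 5, 10)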

Slide 15: Decision Tree Overview
Heuristic search and inductive bias
Decision trees (DTs)
  - Can be boolean (c(x) ∈ {+, -}) or range over multiple classes
  - When to use DT-based models
Generic algorithm Build-DT: top-down induction
  - Calculating the best attribute upon which to split
  - Recursive partitioning
Entropy and information gain
  - Goal: measure the uncertainty removed by splitting on a candidate attribute A
  - Calculating information gain (change in entropy)
  - Using information gain in construction of the tree
  - ID3 ≡ Build-DT using Gain(·)
ID3 as hypothesis space search (in the state space of decision trees)
Next: artificial neural networks (multilayer perceptrons and backpropagation)
Tools to try: WEKA, MLC++

Slide 16: Perceptron Convergence
Perceptron Convergence Theorem
  - Claim: if there exists a set of weights consistent with the data (i.e., the data are linearly separable), the perceptron learning algorithm will converge
  - Proof: well-founded ordering on the search region ("wedge width" is strictly decreasing); see Minsky and Papert, 11.2-11.3
  - Caveat 1: how long will this take?
  - Caveat 2: what happens if the data are not LS?
Perceptron Cycling Theorem
  - Claim: if the training data are not LS, the perceptron learning algorithm will eventually repeat the same set of weights and thereby enter an infinite loop
  - Proof: bound on the number of weight changes until repetition; induction on n, the dimension of the training example vector (Minsky & Papert, 11.10)
How to provide more robustness and expressivity?
  - Objective 1: develop an algorithm that finds the closest approximation (today)
  - Objective 2: develop an architecture that overcomes the representational limitation (next lecture)
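For reference, a minimal sketch of the perceptron learning rule these theorems are about, assuming {+1, -1} targets, a bias folded in as w0, and a fixed learning rate r (all names are mine):

    def train_perceptron(examples, n, r=0.1, max_epochs=100):
        """examples: list of (x, t) with x a length-n tuple of inputs and t in {+1, -1}.
        Returns weights (bias w[0] plus w[1..n]) after convergence or max_epochs."""
        w = [0.0] * (n + 1)                                   # w[0] is the bias
        for _ in range(max_epochs):
            mistakes = 0
            for x, t in examples:
                o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
                if o != t:                                    # misclassified: nudge weights
                    w[0] += r * (t - o)
                    for i, xi in enumerate(x, start=1):
                        w[i] += r * (t - o) * xi
                    mistakes += 1
            if mistakes == 0:                                 # consistent: the convergence case
                return w
        return w                                              # not LS: may cycle (cycling theorem)

    # Linearly separable toy data (logical OR on {0,1} inputs, targets in {+1, -1})
    data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
    print(train_perceptron(data, n=2))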

Slide 17: Gradient Descent: Principle
Understanding gradient descent for linear units
  - Consider the simpler, unthresholded linear unit: o = w · x
  - Objective: find the "best fit" to D
Approximation algorithm
  - Quantitative objective: minimize error over the training data set D
  - Error function: sum squared error (SSE), E(w) ≡ (1/2) Σ_{x ∈ D} (t(x) - o(x))^2
How to minimize?
  - Simple optimization
  - Move in the direction of steepest gradient in weight-error space, computed from the partial derivatives of E with respect to the weights w_i

Slide 18: Gradient Descent: Derivation of the Delta/LMS (Widrow-Hoff) Rule
Definition: gradient
  ∇E(w) ≡ [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
Modified gradient descent training rule
  Δw = -r ∇E(w), i.e., Δw_i = -r ∂E/∂w_i
  For the SSE error of a linear unit, this yields the delta rule used on the next slide:
  Δw_i = r Σ_{x ∈ D} (t(x) - o(x)) x_i

Slide 19: Gradient Descent: Algorithm using the Delta/LMS Rule
Algorithm Gradient-Descent (D, r)
  Each training example is a pair of the form <x, t(x)>, where x is the vector of input values and t(x) is the target output value; r is the learning rate (e.g., 0.05)
  Initialize all weights w_i to (small) random values
  UNTIL the termination condition is met, DO
    Initialize each Δw_i to zero
    FOR each <x, t(x)> in D, DO
      Input the instance x to the unit and compute the output o
      FOR each linear unit weight w_i, DO
        Δw_i ← Δw_i + r (t - o) x_i
    FOR each linear unit weight w_i, DO
      w_i ← w_i + Δw_i
  RETURN final w
Mechanics of the delta rule
  - The gradient is based on a derivative of the error
  - Significance: later, we will use nonlinear activation functions (aka transfer functions, squashing functions)
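A runnable sketch of this batch Gradient-Descent loop for a linear unit (names are mine; a fixed epoch budget stands in for the unspecified termination condition, and the bias term is omitted for brevity):

    import random

    def gradient_descent(D, n, r=0.05, epochs=200):
        """D: list of (x, t) pairs with x a length-n tuple and t a real target.
        Trains an unthresholded linear unit o = w . x with the delta/LMS rule."""
        w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # small random initial weights
        for _ in range(epochs):                               # termination: fixed epoch budget
            delta = [0.0] * n
            for x, t in D:
                o = sum(wi * xi for wi, xi in zip(w, x))      # linear unit output
                for i in range(n):
                    delta[i] += r * (t - o) * x[i]            # accumulate Delta w_i
            w = [wi + dwi for wi, dwi in zip(w, delta)]       # batch update after full pass
        return w

    # Toy usage: recover w = (2, -1) from noiseless data; a small r keeps batch updates stable
    D = [((x1, x2), 2 * x1 - 1 * x2) for x1 in range(-2, 3) for x2 in range(-2, 3)]
    print([round(wi, 3) for wi in gradient_descent(D, n=2, r=0.01)])  # approaches [2.0, -1.0]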

Slide 20: Gradient Descent: Perceptron Rule versus Delta/LMS Rule
LS concepts: can achieve perfect classification
  - Example A: the perceptron training rule converges
Non-LS concepts: can only approximate
  - Example B: not LS; the delta rule converges, but can do no better than 3 correct
  - Example C: not LS; better results from the delta rule
Weight vector w = sum of misclassified x ∈ D
  - Perceptron: minimize w
  - Delta rule: minimize error ≡ distance from the separator
[Figures: Example A, a linearly separable set of + and - points in the (x1, x2) plane; Example B, two + and two - points that no line separates; Example C, a larger non-separable mixture of + and - points.]

Slide 21: Winnow Algorithm
Algorithm Train-Winnow (D)
  Initialize: θ = n, w_i = 1
  UNTIL the termination condition is met, DO
    FOR each <x, t(x)> in D, DO
      1. CASE 1: no mistake: do nothing
      2. CASE 2: t(x) = 1 but w · x < θ: set w_i ← 2 w_i for each i with x_i = 1 (promotion/strengthening)
      3. CASE 3: t(x) = 0 but w · x ≥ θ: set w_i ← w_i / 2 for each i with x_i = 1 (demotion/weakening)
  RETURN final w
Winnow learns linear threshold (LT) functions
Converting to disjunction learning
  - Replace demotion with elimination: change the weight to 0 instead of halving
  - Why does this work?
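A compact Python sketch of Train-Winnow as listed above, for Boolean inputs (names are mine; a fixed number of passes over D stands in for the unspecified termination condition):

    def train_winnow(D, n, rounds=50):
        """D: list of (x, t) with x a length-n tuple of 0/1 values and t in {0, 1}.
        Learns a linear threshold function w . x >= theta with theta = n."""
        theta = n
        w = [1.0] * n
        for _ in range(rounds):
            for x, t in D:
                predicted = sum(wi * xi for wi, xi in zip(w, x)) >= theta
                if predicted == bool(t):
                    continue                                   # Case 1: no mistake
                if t == 1:                                     # Case 2: false negative -> promote
                    w = [wi * 2 if xi == 1 else wi for wi, xi in zip(w, x)]
                else:                                          # Case 3: false positive -> demote
                    w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
        return w

    # Toy usage: the target concept is the disjunction x1 OR x3 over n = 4 Boolean attributes
    D = [((1, 0, 0, 0), 1), ((0, 0, 1, 1), 1), ((0, 1, 0, 1), 0), ((0, 0, 0, 0), 0)]
    print(train_winnow(D, n=4))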

Slide 22: Winnow: An Example
Target: t(x) ≡ c(x) = x1 ∨ x2 ∨ x1023 ∨ x1024
  - Initialize: θ = n = 1024, w = (1, 1, 1, …, 1)
  - w · x ≥ θ, w = (1, 1, 1, …, 1): OK
  - w · x < θ, w = (1, 1, 1, …, 1): OK
  - w · x < θ, w = (2, 1, 1, …, 1): mistake
  - w · x < θ, w = (4, 1, 2, 2, …, 1): mistake
  - w · x < θ, w = (8, 1, 4, 2, …, 2): mistake
  - … w = (512, 1, 256, 256, …, 256): promotions for each good variable
  - w · x ≥ θ, w = (512, 1, 256, 256, …, 256): OK
  - w · x ≥ θ, w = (512, 1, 0, 256, 0, 0, 0, …, 256): mistake
  - Last example: elimination rule (bit mask)
Final hypothesis: w = (1024, 1024, 0, 0, 0, 1, 32, …, 1024, 1024)

Slide 23: NeuroSolutions and SNNS
NeuroSolutions 5.0 specifications
  - Commercial ANN simulation environment (http://www.nd.com) for Windows NT
  - Supports multiple ANN architectures and training algorithms (temporal, modular)
  - Produces embedded systems
  - Extensive data handling and visualization capabilities
  - Fully modular (object-oriented) design
  - Code generation and dynamic link library (DLL) facilities
  - Benefits: portability and parallelism (code tuning, fast offline learning); dynamic linking (extensibility for research and development)
Stuttgart Neural Network Simulator (SNNS) specifications
  - Open source ANN simulation environment for Linux
  - http://www.informatik.uni-stuttgart.de/ipvr/bv/projekte/snns/
  - Supports multiple ANN architectures and training algorithms
  - Very extensive visualization facilities
  - Similar portability and parallelization benefits

