Ch. 11: Optimization and Search. Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC 2009. Some slides from Stephen Marsland, some images from Wikipedia.


1 Ch. 11: Optimization and Search. Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC 2009. Some slides from Stephen Marsland, some images from Wikipedia. Longin Jan Latecki, Temple University, latecki@temple.edu

2 Gradient Descent. We have already used it in perceptron learning. Our goal is to minimize a function f(x), where x = (x_1, …, x_n). Starting from an initial point x_0, we generate a sequence of points x_k that moves downhill to the closest local minimum. The general update is x_{k+1} = x_k + η_k p_k, where p_k is the search direction and η_k the step size.

3 Steepest Gradient Descent. The key question is: what is p_k? We can make the greedy choice and always go downhill as fast as possible, which implies p_k = −∇f(x_k). We then iterate x_{k+1} = x_k + η_k p_k until ∇f(x_k) = 0, which in practice means until ||∇f(x_k)|| < ε.
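The steepest-descent loop above can be sketched in a few lines of Python. The quadratic test function, step size, and tolerance below are illustrative assumptions, not values from the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, eps=1e-6, max_iter=10000):
    """Steepest descent: step along p_k = -grad f(x_k) until ||grad f|| < eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        x = x - eta * g  # x_{k+1} = x_k + eta * p_k with p_k = -grad f(x_k)
    return x

# Minimize f(x, y) = (x - 1)^2 + (y + 2)^2; its gradient is (2(x-1), 2(y+2)).
x_min = gradient_descent(lambda x: np.array([2*(x[0]-1), 2*(x[1]+2)]), [0.0, 0.0])
```

With a fixed step size η = 0.1 the iterates contract toward the minimizer (1, −2) geometrically.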

4 The gradient of the function f(x, y) = −(cos²x + cos²y)² depicted as a vector field on the bottom plane.

5 For example, the gradient of the function is: [formula not captured in the transcript]

6 Recall the Gradient Descent Learning Rule of the Perceptron. Consider a linear perceptron without threshold and with continuous output (not just −1, 1): y = w_0 + w_1 x_1 + … + w_n x_n. Train the w_i's so that they minimize the squared error E[w] = ½ Σ_{d∈D} (t_d − y_d)², where D is the set of training examples. Then w_{k+1} = w_k − η_k ∇E(w_k). Writing w_{k+1} = w_k + Δw_k, we get Δw_k = −η_k ∇E(w_k).

7 Gradient Descent. Gradient: ∇E[w] = [∂E/∂w_0, …, ∂E/∂w_n]. The update moves from (w_1, w_2) to (w_1 + Δw_1, w_2 + Δw_2) with Δw = −η ∇E[w], i.e., Δw_i = −η ∂E/∂w_i. Computing the partial derivative: ∂/∂w_i ½ Σ_d (t_d − y_d)² = Σ_d (t_d − y_d) ∂/∂w_i (t_d − Σ_i w_i x_id) = Σ_d (t_d − y_d)(−x_id).
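The derivation above gives the batch update Δw_i = η Σ_d (t_d − y_d) x_id. A minimal sketch of it in Python; the training data, learning rate, and epoch count are made up for illustration:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=200):
    """Batch gradient descent on E(w) = 0.5 * sum_d (t_d - y_d)^2
    for a linear unit y = w0 + w1*x1 + ... + wn*xn."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias input x_0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        y = Xb @ w
        # delta_w = -eta * grad E = eta * sum_d (t_d - y_d) * x_d
        w += eta * Xb.T @ (t - y)
    return w

# Learn t = 3 + 2*x from noiseless data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = 3 + 2 * X[:, 0]
w = train_linear_unit(X, t)
```

Because the error surface of a linear unit is a quadratic bowl, batch gradient descent with a small enough η converges to the unique minimum.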

8 Gradient Descent Error: Δw_i = −η ∂E/∂w_i

9 Newton Direction. Taylor expansion: f(x_k + p) ≈ f(x_k) + ∇f(x_k)ᵀp + ½ pᵀ∇²f(x_k)p. If f(x) is a scalar function, i.e., f: Rⁿ → R, where x = (x_1, …, x_n), then ∇f(x) = J(x) and ∇²f(x) = H(x), where J is the Jacobian, a vector, and H is the n×n Hessian matrix, defined as follows.

10 Jacobian vector and Hessian matrix: J(x) = ∇f(x) = [∂f/∂x_1, …, ∂f/∂x_n], and H(x) is the n×n matrix with entries H_ij = ∂²f/∂x_i∂x_j.

11 Newton Direction. Since minimizing the quadratic Taylor model over p requires ∇f(x_k) + H(x_k)p = 0, we obtain p_k = −H(x_k)⁻¹∇f(x_k). In x_{k+1} = x_k + η_k p_k the step size is always η_k = 1.
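As a sketch of the Newton iteration with unit step size (the quadratic test function and tolerance are illustrative assumptions):

```python
import numpy as np

def newton_minimize(grad_f, hess_f, x0, eps=1e-8, max_iter=50):
    """Newton's method: p_k = -H(x_k)^{-1} grad f(x_k), step size 1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        # Solve H p = -g rather than forming the inverse explicitly.
        x = x - np.linalg.solve(hess_f(x), g)
    return x

# f(x, y) = x^2 + 3y^2 is quadratic, so a single Newton step finds the minimum.
grad = lambda x: np.array([2*x[0], 6*x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])
x_min = newton_minimize(grad, hess, [5.0, -4.0])
```

On an exactly quadratic function the Taylor model is exact, which is why one step suffices; on general functions Newton's method converges quadratically near a minimum.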

12 Search Algorithms. Example problem: the Traveling Salesman Problem (TSP), introduced on the next slides. We will then explore various search strategies and illustrate them on the TSP:
1. Exhaustive Search
2. Greedy Search
3. Hill Climbing
4. Simulated Annealing

13 The Traveling Salesman Problem. The traveling salesman problem is one of the classical problems in computer science. A traveling salesman wants to visit a number of cities and then return to his starting point. Of course he wants to save time and energy, so he wants to determine the shortest cycle for his trip. We can represent the cities and the distances between them by a weighted, complete, undirected graph. The problem then is to find the shortest cycle (the cycle of minimum total weight that visits each vertex exactly once). Finding the shortest cycle is different from Dijkstra's shortest path, and it is much harder: no polynomial-time algorithm is known.

14 The Traveling Salesman Problem. Importance:
– A variety of scheduling applications can be solved as a traveling salesman problem. Examples: ordering drill positions on a drill press; school bus routing.
– The problem also has theoretical importance, because it represents the class of difficult problems known as NP-hard problems.

15 The Federal Emergency Management Agency. A visit must be made to four local offices of FEMA, going out from and returning to the same main office in Northridge, Southern California.

16 FEMA traveling salesman: network representation.

17 [Network diagram: the main office (Home) and offices 1–4, with edge distances 30, 25, 40, 35, 80, 65, 45, 50, 40.]

18 FEMA – Traveling Salesman. Solution approaches:
– Enumeration of all possible cycles. This results in (m−1)! cycles to enumerate for a graph with m nodes, so only small problems can be solved with this approach.
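Full enumeration can be sketched in Python with `itertools.permutations`; the symmetric 4-city distance matrix below is a hypothetical example, not the FEMA data:

```python
from itertools import permutations

def exhaustive_tsp(dist):
    """Enumerate all cycles that start and end at city 0.
    Fixing the start city leaves (m-1)! orderings to check."""
    m = len(dist)
    best_cost, best_tour = float("inf"), None
    for perm in permutations(range(1, m)):
        tour = (0,) + perm + (0,)
        cost = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if cost < best_cost:
            best_cost, best_tour = cost, tour
    return best_tour, best_cost

# Hypothetical symmetric distance matrix for 4 cities.
dist = [[0, 10, 15, 20],
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
tour, cost = exhaustive_tsp(dist)
```

For symmetric distances each cycle is counted twice (once per direction), which is why only (m−1)!/2 cycles are truly distinct; the sketch above checks all (m−1)! for simplicity.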

19 Exhaustive Search by Full Enumeration. Possible cycles:

Cycle                    Total Cost
1.  H-O1-O2-O3-O4-H      210
2.  H-O1-O2-O4-O3-H      195  ← minimum
3.  H-O1-O3-O2-O4-H      240
4.  H-O1-O3-O4-O2-H      200
5.  H-O1-O4-O2-O3-H      225
6.  H-O1-O4-O3-O2-H      200
7.  H-O2-O3-O1-O4-H      265
8.  H-O2-O1-O3-O4-H      235
9.  H-O2-O4-O1-O3-H      250
10. H-O2-O1-O4-O3-H      220
11. H-O3-O1-O2-O4-H      260
12. H-O3-O1-O2-O4-H      260

For this problem we have (5−1)!/2 = 12 cycles. Symmetric problems need to enumerate only (m−1)!/2 cycles.

20 FEMA – optimal solution. [Network diagram: the same graph (Home and offices 1–4) with the minimum-cost cycle highlighted.]

21 The Traveling Salesman Problem. Unfortunately, no algorithm solving the traveling salesman problem with polynomial worst-case time complexity has been devised yet. This means that for large numbers of vertices, solving the traveling salesman problem exactly is impractical. In these cases, we can use efficient approximation algorithms that determine a cycle whose length may be slightly larger than the optimal one.

22 Greedy Search TSP Solution. Choose the first city arbitrarily, and then repeatedly pick the city that is closest to the current city and that has not yet been visited. Stop when all cities have been visited.
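The nearest-neighbour rule above can be sketched as follows; the distance matrix is again a hypothetical 4-city example:

```python
def greedy_tsp(dist, start=0):
    """Nearest-neighbour heuristic: repeatedly visit the closest unvisited city."""
    m = len(dist)
    tour, unvisited = [start], set(range(m)) - {start}
    while unvisited:
        nxt = min(unvisited, key=lambda c: dist[tour[-1]][c])
        tour.append(nxt)
        unvisited.remove(nxt)
    tour.append(start)  # return to the starting city
    return tour, sum(dist[a][b] for a, b in zip(tour, tour[1:]))

# Hypothetical symmetric distance matrix for 4 cities.
dist = [[0, 10, 15, 20],
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
tour, cost = greedy_tsp(dist)
```

Greedy search is fast (O(m²)) but offers no guarantee: a cheap early choice can force an expensive final leg back to the start.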

23 Hill Climbing TSP Solution. Choose an initial tour randomly. Then keep swapping pairs of cities, keeping a swap only if it decreases the total length of the tour, i.e., if the new distance traveled is less than the previous distance traveled. Stop after a predefined number of swaps, or when no swap has improved the solution for some time. As with greedy search, there is no way to predict how good the solution will be.
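A minimal sketch of swap-based hill climbing, assuming the same hypothetical 4-city distance matrix and an illustrative swap budget:

```python
import random

def tour_length(tour, dist):
    # Close the cycle by returning from the last city to the first.
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def hill_climb_tsp(dist, n_swaps=1000, seed=0):
    """Random restart tour; keep a pairwise city swap only if it shortens the tour."""
    rng = random.Random(seed)
    m = len(dist)
    tour = list(range(m))
    rng.shuffle(tour)
    best = tour_length(tour, dist)
    for _ in range(n_swaps):
        i, j = rng.sample(range(m), 2)
        tour[i], tour[j] = tour[j], tour[i]
        new = tour_length(tour, dist)
        if new < best:
            best = new  # keep the improving swap
        else:
            tour[i], tour[j] = tour[j], tour[i]  # undo a worsening swap
    return tour, best

# Hypothetical symmetric distance matrix for 4 cities.
dist = [[0, 10, 15, 20],
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
tour, cost = hill_climb_tsp(dist)
```

On larger instances hill climbing can stall in a local minimum of the swap neighbourhood, which motivates simulated annealing below.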

24 Exploration and Exploitation. Exploration of the search space is like exhaustive search (always trying out new solutions). Exploitation of the current best solution is like hill climbing (trying local variants of the current best solution). Ideally, we would like a combination of the two.

25 Simulated Annealing TSP Solution. As in hill climbing, keep swapping pairs of cities whenever the new distance traveled is less than the previous distance traveled. In addition, accept a worsening swap when exp(−(new dist − old dist)/T) > p for some random number p drawn uniformly from (0, 1); equivalently, when (old dist − new dist) > T·log(p). After each step, cool the temperature: T = c·T, where 0 < c < 1 (usually 0.8 < c < 1). Thus, we sometimes accept a 'bad' solution, with a probability that shrinks as T falls.
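A sketch of this acceptance rule and cooling schedule in Python; the initial temperature, cooling rate, step budget, and distance matrix are illustrative assumptions:

```python
import math
import random

def tour_length(tour, dist):
    # Close the cycle by returning from the last city to the first.
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def anneal_tsp(dist, T=100.0, c=0.9, n_steps=2000, seed=0):
    """Simulated annealing: always accept improving swaps; accept a worse
    tour with probability exp(-(new - old)/T); then cool T <- c*T."""
    rng = random.Random(seed)
    m = len(dist)
    tour = list(range(m))
    rng.shuffle(tour)
    cost = tour_length(tour, dist)
    for _ in range(n_steps):
        i, j = rng.sample(range(m), 2)
        tour[i], tour[j] = tour[j], tour[i]
        new = tour_length(tour, dist)
        if new < cost or rng.random() < math.exp(-(new - cost) / T):
            cost = new  # accept the swap (improving, or 'bad' but lucky)
        else:
            tour[i], tour[j] = tour[j], tour[i]  # reject: undo the swap
        T *= c  # cooling schedule
    return tour, cost

# Hypothetical symmetric distance matrix for 4 cities.
dist = [[0, 10, 15, 20],
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
tour, cost = anneal_tsp(dist)
```

Early on, a high T makes the exp(−ΔE/T) test accept many worsening swaps (exploration); as T decays the algorithm behaves like hill climbing (exploitation).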

26 Search Algorithms Covered
1. Exhaustive Search
2. Greedy Search
3. Hill Climbing
4. Simulated Annealing

