2004.09.29 - SLIDE 1
IS 202 – FALL 2004
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm, Fall 2004


1 SIMS 202: Information Organization and Retrieval
Math Tutorial
http://www.sims.berkeley.edu/academics/courses/is202/f04/

2 Summation

3 Program of that summation

    public class Sumup {
        public static void main(String[] args) {
            int n = 10;
            int s = 0; // running sum
            int i = 0;
            while (i <= (n - 1)) {
                s = s + i;
                i = i + 1;
            }
            System.out.println(s); // prints 45
        }
    }

0 + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9

or…

    public class Sumup2 {
        public static void main(String[] args) {
            int n = 10;
            int s = 0;
            int i;
            for (i = 0; i <= n - 1; i++)
                s = s + i;
            System.out.println(s); // prints 45
        }
    }

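5

    public class multup {
        public static void main(String[] args) {
            int n = 10;
            int s = 0;
            int i = 1;
            // a needs at least n + 2 entries because the loop reads a[i+1]
            int a[] = {0,1,2,3,4,5,6,7,8,9,10,11};
            while (i <= n) {
                s = s + (a[i] * a[i+1]);
                i = i + 1;
            }
            System.out.println(s); // prints 440 for this array
        }
    }

or…

    public class multup2 {
        public static void main(String[] args) {
            int n = 10;
            int s = 0;
            int i;
            int a[] = {0,1,2,3,4,5,6,7,8,9,10,11};
            for (i = 1; i <= n; i++)
                s = s + (a[i] * a[i+1]);
            System.out.println(s); // prints 440 for this array
        }
    }

The value of s depends on the values in the array “a”.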

6 Simple tf*idf

7 Inverse Document Frequency
IDF provides high values for rare words and low values for common words.
For a collection of 10000 documents (N = 10000)
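The slide's own formula and example values are not preserved in this transcript. As a rough illustration only, the sketch below assumes the common formulation IDF_k = log10(N / n_k), where n_k is the number of documents containing term k. The class name, method name, and the n_k values are hypothetical; only N = 10000 comes from the slide.

```java
// Sketch only: assumes IDF_k = log10(N / n_k), one common formulation.
class IdfSketch {
    /** IDF for a term that occurs in docsWithTerm out of totalDocs documents. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        int N = 10000;                                     // collection size from the slide
        int[] documentFrequencies = {1, 20, 5000, 10000};  // hypothetical n_k values
        for (int nk : documentFrequencies) {
            System.out.printf("n_k = %5d  idf = %.3f%n", nk, idf(N, nk));
        }
    }
}
```

Under this assumption a term found in a single document gets the highest value and a term found in every document gets 0, which matches the "high for rare, low for common" behavior described above.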

8 Similarity Measures
–Simple matching (coordination level match)
–Dice’s Coefficient
–Jaccard’s Coefficient
–Cosine Coefficient
–Overlap Coefficient
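The slide lists these measures by name only. The sketch below uses their usual set-based definitions over two sets of terms A and B, which is an assumption about what the slide showed. The class, method names, and sample terms are hypothetical, and both sets are assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: textbook set-based forms of the five coefficients.
class SetSimilaritySketch {
    /** |A ∩ B|, computed on a copy so the inputs are not modified. */
    static <T> int simpleMatching(Set<T> a, Set<T> b) {
        Set<T> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** 2|A ∩ B| / (|A| + |B|) */
    static <T> double dice(Set<T> a, Set<T> b) {
        return 2.0 * simpleMatching(a, b) / (a.size() + b.size());
    }

    /** |A ∩ B| / |A ∪ B| */
    static <T> double jaccard(Set<T> a, Set<T> b) {
        int common = simpleMatching(a, b);
        return (double) common / (a.size() + b.size() - common);
    }

    /** |A ∩ B| / (sqrt(|A|) * sqrt(|B|)) */
    static <T> double cosine(Set<T> a, Set<T> b) {
        return simpleMatching(a, b) / (Math.sqrt(a.size()) * Math.sqrt(b.size()));
    }

    /** |A ∩ B| / min(|A|, |B|) */
    static <T> double overlap(Set<T> a, Set<T> b) {
        return (double) simpleMatching(a, b) / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        // Hypothetical query and document term sets.
        Set<String> q = new HashSet<>(Arrays.asList("information", "retrieval", "model"));
        Set<String> d = new HashSet<>(Arrays.asList("information", "retrieval", "system", "evaluation"));
        System.out.println("simple matching = " + simpleMatching(q, d));
        System.out.println("dice            = " + dice(q, d));
        System.out.println("jaccard         = " + jaccard(q, d));
        System.out.println("cosine          = " + cosine(q, d));
        System.out.println("overlap         = " + overlap(q, d));
    }
}
```

All four ratios divide the same intersection count by a different measure of set size, so they differ only in how strongly they penalize a size mismatch between the two sets.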

9 tf*idf Normalization
Normalize the term weights (so longer vectors are not unfairly given more weight)
–To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive
Additional parentheses added to clarify the order of operations
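The normalized formula itself is missing from this transcript. One common way to do what the slide describes is to divide each raw tf*idf weight by the Euclidean length of the document's weight vector. The sketch below assumes that approach and assumes idf = log10(N / n_k); the class, method, and parameter names are hypothetical.

```java
// Sketch only: length-normalized tf*idf weights for a single document.
class NormalizedTfIdfSketch {
    /**
     * tf[k] is the frequency of term k in the document, df[k] is the number
     * of documents containing term k, and totalDocs is the collection size N.
     */
    static double[] weights(int[] tf, int[] df, int totalDocs) {
        double[] w = new double[tf.length];
        double sumOfSquares = 0.0;
        for (int k = 0; k < tf.length; k++) {
            w[k] = tf[k] * Math.log10((double) totalDocs / df[k]); // raw tf*idf
            sumOfSquares += w[k] * w[k];
        }
        double length = Math.sqrt(sumOfSquares);
        if (length == 0.0) {
            return w; // every weight is zero, nothing to scale
        }
        for (int k = 0; k < w.length; k++) {
            w[k] = w[k] / length; // each weight now lies between 0 and 1
        }
        return w;
    }

    public static void main(String[] args) {
        int[] tf = {3, 1, 0};        // hypothetical term frequencies
        int[] df = {100, 1000, 10};  // hypothetical document frequencies
        for (double weight : weights(tf, df, 10000)) {
            System.out.printf("%.4f%n", weight);
        }
    }
}
```

After this scaling the squared weights of a document sum to 1, so a long document with many repeated terms no longer outweighs a short one simply because of its length.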

10 Vector Space Similarity
Now, the similarity of two documents is:
This is also called the cosine normalized inner product
–The normalization was done when weighting the terms
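Since the slide says the normalization was already applied when the terms were weighted, the similarity reduces to a plain inner product of the two weight vectors. The sketch below illustrates that reading; the class name, method name, and the two unit-length vectors are hypothetical.

```java
// Sketch only: inner product of two weight vectors that are already
// normalized to unit length, so the result equals their cosine.
class InnerProductSketch {
    static double innerProduct(double[] wi, double[] wj) {
        double sum = 0.0;
        for (int k = 0; k < wi.length; k++) {
            sum += wi[k] * wj[k];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] di = {0.6, 0.8}; // hypothetical unit-length weights
        double[] dj = {0.8, 0.6}; // hypothetical unit-length weights
        System.out.println(innerProduct(di, dj));
    }
}
```

If the inputs were not normalized first, the same sum would grow with vector length, which is the bias the normalization step is meant to remove.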

11 Vector Space Similarity Measure
Combine tf and idf into a similarity measure

12 All in one equation is…
Extra parentheses added to clarify order of operations

13 Computing Similarity Scores
[Figure: plot with axis ticks from 0.2 to 1.0]

14 What’s Cosine Anyway?
“One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi.”
From http://mathworld.wolfram.com/Cosine.html

15 Cosine vs. Degrees
[Figure: plot of cosine against degrees]
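The plot on this slide did not survive in the transcript. A few of its points can be regenerated with the short sketch below, which tabulates the cosine for angles from 0 to 90 degrees. The class and method names are hypothetical, and the 15-degree step is an arbitrary choice.

```java
// Sketch only: tabulates cosine against degrees.
class CosineDegreesSketch {
    /** Math.cos expects radians, so convert from degrees first. */
    static double cosineOfDegrees(double degrees) {
        return Math.cos(Math.toRadians(degrees));
    }

    public static void main(String[] args) {
        for (int degrees = 0; degrees <= 90; degrees += 15) {
            System.out.printf("%3d degrees  cosine = %.3f%n", degrees, cosineOfDegrees(degrees));
        }
    }
}
```

The cosine falls from 1 at 0 degrees to 0 at 90 degrees, which is why a smaller angle between two vectors corresponds to a higher similarity score.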

16 Computing a Similarity Score

17 Vector Space Matching
[Figure: plot showing Q, D1, and D2 in a space with axes Term A and Term B, each running from 0 to 1.0]
$D_i = (d_{i1}, w_{d_{i1}};\ d_{i2}, w_{d_{i2}};\ \ldots;\ d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}};\ q_{i2}, w_{q_{i2}};\ \ldots;\ q_{it}, w_{q_{it}})$
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
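The three vectors above can be compared with the cosine measure discussed on the earlier slides. The sketch below assumes the standard definition, the inner product divided by the product of the two vector lengths, and applies it to Q, D1, and D2 as given. The class and method names are hypothetical.

```java
// Sketch only: cosine similarity of the query against each document vector.
class VectorSpaceMatchSketch {
    static double cosine(double[] x, double[] y) {
        double dot = 0.0;
        double xSquared = 0.0;
        double ySquared = 0.0;
        for (int k = 0; k < x.length; k++) {
            dot += x[k] * y[k];
            xSquared += x[k] * x[k];
            ySquared += y[k] * y[k];
        }
        return dot / (Math.sqrt(xSquared) * Math.sqrt(ySquared));
    }

    public static void main(String[] args) {
        double[] q  = {0.4, 0.8}; // values from the slide
        double[] d1 = {0.8, 0.3};
        double[] d2 = {0.2, 0.7};
        System.out.printf("cos(Q, D1) = %.2f%n", cosine(q, d1));
        System.out.printf("cos(Q, D2) = %.2f%n", cosine(q, d2));
    }
}
```

With these values the scores come out to roughly 0.73 for D1 and 0.98 for D2, so D2 points in nearly the same direction as Q and would rank first.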

