The Vector Space Model (VSM)


1 | The Vector Space Model (VSM)

2 | Documents as Vectors
Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.

3 | The Matrix
Doc 1 : makan makan
Doc 2 : makan nasi
(The example documents are Indonesian: makan = "eat", nasi = "rice".)

4 | The Matrix
Doc 1 : makan makan
Doc 2 : makan nasi
Incidence Matrix (Binary TF), to be filled in:

Term  | Doc 1 | Doc 2
Makan |       |
Nasi  |       |

5 | The Matrix : Binary
Doc 1 : makan makan
Doc 2 : makan nasi
Incidence Matrix (Binary TF):

Term  | Doc 1 | Doc 2
Makan | 1     | 1
Nasi  | 0     | 1
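As a quick illustration (not part of the original deck), the binary incidence matrix on this slide can be built in a few lines of Python; the document names follow the slides:

```python
# Build the binary (incidence) term-document matrix for the two example docs.
docs = {"Doc 1": "makan makan", "Doc 2": "makan nasi"}

vocab = sorted({t for text in docs.values() for t in text.split()})
# 1 if the term occurs in the document at all, else 0
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()}
             for t in vocab}

print(incidence["makan"])  # {'Doc 1': 1, 'Doc 2': 1}
print(incidence["nasi"])   # {'Doc 1': 0, 'Doc 2': 1}
```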

6 | Documents as Vectors
Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.

7 | The Matrix : Binary
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 drawn as vectors on axes "Makan" and "Nasi"]

Incidence Matrix (Binary TF):

Term  | Doc 1 | Doc 2
Makan | 1     | 1
Nasi  | 0     | 1

8 | The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi
From the binary incidence matrix to raw term frequencies (Inverted Index, Raw TF):

Term  | Doc 1 | Doc 2
Makan |       |
Nasi  |       |

9 | The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi
Inverted Index (Raw TF):

Term  | Doc 1 | Doc 2
Makan | 2     | 1
Nasi  | 0     | 1

10 | The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 as raw-count vectors on axes "Makan" and "Nasi"]

Inverted Index (Raw TF):

Term  | Doc 1 | Doc 2
Makan | 2     | 1
Nasi  | 0     | 1
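The raw-TF table above is just per-document term counting; a minimal Python sketch (an illustrative addition, not from the deck):

```python
from collections import Counter

docs = {"Doc 1": "makan makan", "Doc 2": "makan nasi"}
# Counter maps each term to its raw frequency; missing terms count as 0.
tf = {d: Counter(text.split()) for d, text in docs.items()}

print(tf["Doc 1"]["makan"], tf["Doc 2"]["makan"])  # 2 1
print(tf["Doc 1"]["nasi"], tf["Doc 2"]["nasi"])    # 0 1
```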

11 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
From raw term frequencies to logarithmic TF (wt = 1 + log10(tf) for tf > 0, else 0):

Term  | Doc 1 (raw) | Doc 2 (raw) | Doc 1 (log) | Doc 2 (log)
Makan | 2           | 1           |             |
Nasi  | 0           | 1           |             |

12 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
Logarithmic TF (wt = 1 + log10(tf)):

Term  | Doc 1 | Doc 2
Makan | 1.3   | 1
Nasi  | 0     | 1
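The log-TF weighting used in this table can be sketched directly (an illustrative addition, not from the deck):

```python
import math

def log_tf(tf):
    """Logarithmic TF weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(round(log_tf(2), 1))  # 1.3  (makan in Doc 1)
print(log_tf(1))            # 1.0  (makan/nasi occurring once)
print(log_tf(0))            # 0.0  (term absent)
```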

13 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 as log-TF vectors on axes "Makan" and "Nasi"]

Inverted Index (Logarithmic TF):

Term  | Doc 1 | Doc 2
Makan | 1.3   | 1
Nasi  | 0     | 1

14 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
From logarithmic TF to TF-IDF:

Term  | Doc 1 (log TF) | Doc 2 (log TF) | TF-IDF
Makan | 1.3            | 1              |
Nasi  | 0              | 1              |

15 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
IDF = log10(N/df), with N = 2 documents:

Term  | Doc 1 (log TF) | Doc 2 (log TF) | IDF
Makan | 1.3            | 1              | 0
Nasi  | 0              | 1              | 0.3

16 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
TF-IDF = log TF × IDF:

Term  | Doc 1 | Doc 2
Makan | 0     | 0
Nasi  | 0     | 0.3

17 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 as TF-IDF vectors on axes "Makan" and "Nasi"]

Inverted Index (TF-IDF):

Term  | Doc 1 | Doc 2
Makan | 0     | 0
Nasi  | 0     | 0.3

18 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan jagung (jagung = "corn")
Doc 2 : makan nasi
Recompute the weights for the new collection:

Term   | Raw TF (D1, D2) | Log TF (D1, D2) | IDF | TF-IDF (D1, D2)
Makan  | 2, 1            | 1.3, 1          | 0   | 0, 0
Nasi   | 0, 1            |                 |     |
Jagung | 1, 0            |                 |     |

19 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan jagung
Doc 2 : makan nasi

Term   | Raw TF (D1, D2) | Log TF (D1, D2) | IDF | TF-IDF (D1, D2)
Makan  | 2, 1            | 1.3, 1          | 0   | 0, 0
Nasi   | 0, 1            | 0, 1            | 0.3 | 0, 0.3
Jagung | 1, 0            | 1, 0            | 0.3 | 0.3, 0

Makan occurs in every document, so its IDF (and hence its TF-IDF weight) is 0.
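The whole TF-IDF pipeline of these slides fits in a short Python sketch (an illustrative addition, not from the deck); it reproduces the point that makan, appearing in every document, gets weight 0:

```python
import math
from collections import Counter

docs = {"Doc 1": "makan makan jagung", "Doc 2": "makan nasi"}
N = len(docs)

tf = {d: Counter(text.split()) for d, text in docs.items()}
vocab = {w for c in tf.values() for w in c}
# df = number of documents containing the term; idf = log10(N/df)
df = {w: sum(1 for c in tf.values() if w in c) for w in vocab}
idf = {w: math.log10(N / df[w]) for w in vocab}

def tf_idf(term, doc):
    raw = tf[doc][term]
    wt = 1 + math.log10(raw) if raw > 0 else 0.0  # logarithmic TF
    return wt * idf[term]

print(round(tf_idf("makan", "Doc 1"), 2))   # 0.0 (idf of makan = log10(2/2) = 0)
print(round(tf_idf("nasi", "Doc 2"), 2))    # 0.3
print(round(tf_idf("jagung", "Doc 1"), 2))  # 0.3
```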

20 | Documents as Vectors
Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.
The weight can be anything: binary, TF, TF-IDF, and so on.

21 | Documents as Vectors
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors: most entries are zero.

22 How About The Query?

23 Query as Vector too...

24 | VECTOR SPACE MODEL
Key idea 1: Represent documents as vectors in the space.
Key idea 2: Do the same for queries: represent them as vectors in the space.
Key idea 3: Rank documents according to their proximity to the query in this space.

25 PROXIMITY?

26 | Proximity
Proximity = similarity of vectors.
Proximity ≈ inverse of distance.
The documents with the greatest proximity to the query get the highest scores, and therefore rank highest.

27 How to Measure Vector Space Proximity?

28

29 | Proximity
First cut: the distance between two points (i.e., between the end points of the two vectors).
Euclidean distance?
Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

30 | Distance Example
Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous ... (gossip: 90x, jealous: 70x)

31 | Distance Example
Query : gossip jealous

Logarithmic TF and IDF (N = 3):

Term    | Doc 1 | Doc 2 | Doc 3 | Query | IDF
Gossip  | 1     | 0     | 2.95  | 1     | 0.17
Jealous | 0     | 1     | 2.84  | 1     | 0.17

TF-IDF:

Term    | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  | 0.17  | 0     | 0.50  | 0.17
Jealous | 0     | 0.17  | 0.48  | 0.17

32 | Distance Example
Query : gossip jealous
[Plot: Doc 1, Doc 2, Doc 3 and the query as TF-IDF vectors on axes "Gossip" and "Jealous"]

Inverted Index (TF-IDF):

Term    | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  | 0.17  | 0     | 0.50  | 0.17
Jealous | 0     | 0.17  | 0.48  | 0.17

33 | Why Distance is a Bad Idea
The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query and the distribution of terms in Doc 3 are very similar.
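This can be checked numerically with the slides' raw counts; a minimal sketch in Python (an illustrative addition, not from the deck):

```python
import math

# Raw term-count vectors over (gossip, jealous), following the slides' example.
q    = (1, 1)    # query: "gossip jealous"
doc1 = (1, 0)    # gossip only
doc3 = (90, 70)  # same *mix* of terms as the query, just a much longer document

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(euclid(q, doc1), 2))  # 1.0
print(round(euclid(q, doc3), 2))  # 112.61 — huge, despite the similar distribution
```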

34 | So, instead of Distance?
Thought experiment: take a document d and append it to itself. Call this document d′.
"Semantically" d and d′ have the same content.
Yet the Euclidean distance between the two documents can be quite large.

35 | So, instead of Distance?
The angle between the two documents is 0, corresponding to maximal similarity.
[Plot: d and d′ pointing in the same direction on axes "Gossip" and "Jealous"]

36 | Use angle instead of distance
Key idea: Rank documents according to angle with query.

37 | From angles to cosines
The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document.
Rank documents in increasing order of cosine(query, document).
Cosine is a monotonically decreasing function on the interval [0°, 180°].

38 | From angles to cosines

39 But how – and why – should we be computing cosines?

40 a · b = |a| × |b| × cos(θ)
where:
|a| is the magnitude (length) of vector a,
|b| is the magnitude (length) of vector b,
θ is the angle between a and b.
Rearranging: cos(θ) = (a · b) / (|a| × |b|)

41 cos(q, d) = (q · d) / (|q| |d|) = Σi qi di / (√(Σi qi²) × √(Σi di²))
qi is the tf-idf weight (or whatever) of term i in the query.
di is the tf-idf weight (or whatever) of term i in the document.
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

42 | Length normalization
A vector can be (length-)normalized by dividing each of its components by its length.
Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere).
Unit vector = a vector whose length is exactly 1 (the unit length).
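Length normalization can be sketched in a few lines of Python (an illustrative addition, not from the deck); the vectors d and d2 below stand in for d and d′ (d appended to itself):

```python
import math

def normalize(v):
    """Divide each component by the vector's L2 length, giving a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d  = [2.0, 1.0]
d2 = [4.0, 2.0]  # d appended to itself: every component doubles

print(normalize(d))   # same direction...
print(normalize(d2))  # ...essentially identical after normalization
```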

43 | Remember this Case
[Plot: d and d′ on axes "Gossip" and "Jealous", before length normalization]

44 | Length normalization
Here length is the L2 norm: |x| = √(Σi xi²)

45 | Remember this Case
[Plot: after length normalization, d and d′ coincide on the unit circle (axes "Gossip" and "Jealous")]

46 | Length normalization
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
Long and short documents now have comparable weights.

47 After Normalization :
cos(q, d) = (q · d) / (|q| |d|) simplifies, since |q| = |d| = 1.

48 After Normalization : cos(q, d) = q · d = Σi qi di, for q, d length-normalized.

49 | Cosine similarity illustrated

50 | Cosine similarity illustrated
The value of the cosine similarity lies in [0, 1] (for non-negative weights such as tf-idf).
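A minimal cosine-similarity sketch in Python (an illustrative addition, not from the deck), reusing the gossip/jealous counts from the distance example:

```python
import math

def cosine(a, b):
    """cos(a, b) = (a · b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine([1, 0], [1, 0]))              # 1.0 (same direction)
print(cosine([1, 0], [0, 1]))              # 0.0 (orthogonal)
print(round(cosine([1, 1], [90, 70]), 3))  # 0.992 — high despite very different lengths
```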

51 Example?

52 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000 documents

Query:
Term      | tf-raw | tf-wt | df    | idf | tf.idf | n'lize
auto      | 0      | 0     | 5000  | 2.3 | 0      |
best      | 1      | 1     | 50000 | 1.3 | 1.3    |
car       | 1      | 1     | 10000 | 2.0 | 2.0    |
insurance | 1      | 1     | 1000  | 3.0 | 3.0    |
Query length =

53 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Query:
Term      | tf-raw | tf-wt | df    | idf | tf.idf | n'lize
auto      | 0      | 0     | 5000  | 2.3 | 0      | 0
best      | 1      | 1     | 50000 | 1.3 | 1.3    | 0.34
car       | 1      | 1     | 10000 | 2.0 | 2.0    | 0.52
insurance | 1      | 1     | 1000  | 3.0 | 3.0    | 0.78
Query length = √(1.3² + 2.0² + 3.0²) ≈ 3.8

54 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Document:
Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
auto      | 1      | 1     | 2.3 | 2.3    |
best      | 0      | 0     | 1.3 | 0      |
car       | 1      | 1     | 2.0 | 2.0    |
insurance | 2      | 1.3   | 3.0 | 3.9    |
Doc length =

55 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Document:
Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
auto      | 1      | 1     | 2.3 | 2.3    | 0.3
best      | 0      | 0     | 1.3 | 0      | 0
car       | 1      | 1     | 2.0 | 2.0    | 0.5
insurance | 2      | 1.3   | 3.0 | 3.9    | 0.79
Doc length =

56 After Normalization : cos(q, d) = q · d = Σi qi di, for q, d length-normalized.

57 | TF-IDF Example (Sec. 6.4)
Document: car insurance auto insurance
Query: best car insurance

Term      | Query tf.idf | Query n'lize | Doc tf.idf | Doc n'lize | Dot Product
auto      | 0            | 0            | 2.3        | 0.3        | 0
best      | 1.3          | 0.34         | 0          | 0          | 0
car       | 2.0          | 0.52         | 2.0        | 0.5        | 0.26
insurance | 3.0          | 0.78         | 3.9        | 0.79       | 0.62
Score = 0.26 + 0.62 = 0.88
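Since both vectors are length-normalized, the score is just their dot product; a quick check in Python (an illustrative addition, not from the deck), using the normalized weights shown on the slide:

```python
# Normalized tf-idf weights from the slide (query: "best car insurance",
# document: "car insurance auto insurance").
query = {"auto": 0.0, "best": 0.34, "car": 0.52, "insurance": 0.78}
doc   = {"auto": 0.3, "best": 0.0,  "car": 0.5,  "insurance": 0.79}

# Cosine score of length-normalized vectors = dot product.
score = sum(query[t] * doc[t] for t in query)
print(round(score, 2))  # 0.88
```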

58 | Summary – vector space ranking
Represent the query as a weighted tf-idf vector.
Represent each document as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.

59 Cosine similarity amongst 3 documents (Sec. 6.3)
How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

term      | SaS | PaP | WH
affection | 115 | 58  | 20
jealous   | 10  | 7   | 11
gossip    | 2   | 0   | 6
wuthering | 0   | 0   | 38

Note: To simplify this example, we don't do idf weighting.

60 3 documents example contd. (Sec. 6.3)

Log frequency weighting:

term      | SaS  | PaP  | WH
affection | 3.06 | 2.76 | 2.30
jealous   | 2.00 | 1.85 | 2.04
gossip    | 1.30 | 0    | 1.78
wuthering | 0    | 0    | 2.58

After length normalization:

term      | SaS   | PaP   | WH
affection | 0.789 | 0.832 | 0.524
jealous   | 0.515 | 0.555 | 0.465
gossip    | 0.335 | 0     | 0.405
wuthering | 0     | 0     | 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
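The whole three-novels computation can be reproduced end to end in Python (an illustrative addition, not from the deck), from raw counts to the cosine values on the slide:

```python
import math

# Raw term counts from the slide (no idf weighting in this example).
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weights(tf):
    return {t: 1 + math.log10(c) if c > 0 else 0.0 for t, c in tf.items()}

def normalize(w):
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

vecs = {name: normalize(log_weights(tf)) for name, tf in counts.items()}

def cos(a, b):
    # Both vectors are length-normalized, so cosine = dot product.
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```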

