The Vector Space Model (VSM)


1 | The Vector Space Model (VSM)

2 | Documents as Vectors
Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.

3 | The Matrix
Doc 1 : makan makan
Doc 2 : makan nasi
(The example documents are Indonesian: makan = "eat", nasi = "rice".)

4 | The Matrix
Doc 1 : makan makan
Doc 2 : makan nasi
Incidence Matrix (Binary TF), to be filled in:

Term  | Doc 1 | Doc 2
Makan |       |
Nasi  |       |

5 | The Matrix : Binary
Doc 1 : makan makan
Doc 2 : makan nasi
Incidence Matrix (Binary TF):

Term  | Doc 1 | Doc 2
Makan | 1     | 1
Nasi  | 0     | 1
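As a quick illustration (not part of the original deck), the binary incidence matrix on this slide can be built in a few lines of Python; the document names follow the slides:

```python
# Build the binary (incidence) term-document matrix for the two example docs.
docs = {"Doc 1": "makan makan", "Doc 2": "makan nasi"}

vocab = sorted({t for text in docs.values() for t in text.split()})
# 1 if the term occurs in the document at all, else 0
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()}
             for t in vocab}

print(incidence["makan"])  # {'Doc 1': 1, 'Doc 2': 1}
print(incidence["nasi"])   # {'Doc 1': 0, 'Doc 2': 1}
```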

6 | Documents as Vectors
Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.

7 | The Matrix : Binary
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 drawn as vectors on axes "Makan" and "Nasi"]

Incidence Matrix (Binary TF):

Term  | Doc 1 | Doc 2
Makan | 1     | 1
Nasi  | 0     | 1

8 | The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi
From the binary incidence matrix to raw term frequencies (Inverted Index, Raw TF):

Term  | Doc 1 | Doc 2
Makan |       |
Nasi  |       |

9 | The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi
Inverted Index (Raw TF):

Term  | Doc 1 | Doc 2
Makan | 2     | 1
Nasi  | 0     | 1

10 | The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 as raw-count vectors on axes "Makan" and "Nasi"]

Inverted Index (Raw TF):

Term  | Doc 1 | Doc 2
Makan | 2     | 1
Nasi  | 0     | 1
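The raw-TF table above is just per-document term counting; a minimal Python sketch (an illustrative addition, not from the deck):

```python
from collections import Counter

docs = {"Doc 1": "makan makan", "Doc 2": "makan nasi"}
# Counter maps each term to its raw frequency; missing terms count as 0.
tf = {d: Counter(text.split()) for d, text in docs.items()}

print(tf["Doc 1"]["makan"], tf["Doc 2"]["makan"])  # 2 1
print(tf["Doc 1"]["nasi"], tf["Doc 2"]["nasi"])    # 0 1
```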

11 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
From raw term frequencies to logarithmic TF (wt = 1 + log10(tf) for tf > 0, else 0):

Term  | Doc 1 (raw) | Doc 2 (raw) | Doc 1 (log) | Doc 2 (log)
Makan | 2           | 1           |             |
Nasi  | 0           | 1           |             |

12 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
Logarithmic TF (wt = 1 + log10(tf)):

Term  | Doc 1 | Doc 2
Makan | 1.3   | 1
Nasi  | 0     | 1
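The log-TF weighting used in this table can be sketched directly (an illustrative addition, not from the deck):

```python
import math

def log_tf(tf):
    """Logarithmic TF weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(round(log_tf(2), 1))  # 1.3  (makan in Doc 1)
print(log_tf(1))            # 1.0  (makan/nasi occurring once)
print(log_tf(0))            # 0.0  (term absent)
```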

13 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 as log-TF vectors on axes "Makan" and "Nasi"]

Inverted Index (Logarithmic TF):

Term  | Doc 1 | Doc 2
Makan | 1.3   | 1
Nasi  | 0     | 1

14 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
From logarithmic TF to TF-IDF:

Term  | Doc 1 (log TF) | Doc 2 (log TF) | TF-IDF
Makan | 1.3            | 1              |
Nasi  | 0              | 1              |

15 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
IDF = log10(N/df), with N = 2 documents:

Term  | Doc 1 (log TF) | Doc 2 (log TF) | IDF
Makan | 1.3            | 1              | 0
Nasi  | 0              | 1              | 0.3

16 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
TF-IDF = log TF × IDF:

Term  | Doc 1 | Doc 2
Makan | 0     | 0
Nasi  | 0     | 0.3

17 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi
[Plot: Doc 1 and Doc 2 as TF-IDF vectors on axes "Makan" and "Nasi"]

Inverted Index (TF-IDF):

Term  | Doc 1 | Doc 2
Makan | 0     | 0
Nasi  | 0     | 0.3

18 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan jagung (jagung = "corn")
Doc 2 : makan nasi
Recompute the weights for the new collection:

Term   | Raw TF (D1, D2) | Log TF (D1, D2) | IDF | TF-IDF (D1, D2)
Makan  | 2, 1            | 1.3, 1          | 0   | 0, 0
Nasi   | 0, 1            |                 |     |
Jagung | 1, 0            |                 |     |

19 | The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan jagung
Doc 2 : makan nasi

Term   | Raw TF (D1, D2) | Log TF (D1, D2) | IDF | TF-IDF (D1, D2)
Makan  | 2, 1            | 1.3, 1          | 0   | 0, 0
Nasi   | 0, 1            | 0, 1            | 0.3 | 0, 0.3
Jagung | 1, 0            | 1, 0            | 0.3 | 0.3, 0

Makan occurs in every document, so its IDF (and hence its TF-IDF weight) is 0.
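The whole TF-IDF pipeline of these slides fits in a short Python sketch (an illustrative addition, not from the deck); it reproduces the point that makan, appearing in every document, gets weight 0:

```python
import math
from collections import Counter

docs = {"Doc 1": "makan makan jagung", "Doc 2": "makan nasi"}
N = len(docs)

tf = {d: Counter(text.split()) for d, text in docs.items()}
vocab = {w for c in tf.values() for w in c}
# df = number of documents containing the term; idf = log10(N/df)
df = {w: sum(1 for c in tf.values() if w in c) for w in vocab}
idf = {w: math.log10(N / df[w]) for w in vocab}

def tf_idf(term, doc):
    raw = tf[doc][term]
    wt = 1 + math.log10(raw) if raw > 0 else 0.0  # logarithmic TF
    return wt * idf[term]

print(round(tf_idf("makan", "Doc 1"), 2))   # 0.0 (idf of makan = log10(2/2) = 0)
print(round(tf_idf("nasi", "Doc 2"), 2))    # 0.3
print(round(tf_idf("jagung", "Doc 1"), 2))  # 0.3
```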

20 | Documents as Vectors
Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.
The weight can be anything: binary, TF, TF-IDF, and so on.

21 | Documents as Vectors
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors: most entries are zero.

22 How About The Query?

23 Query as Vector too...

24 | VECTOR SPACE MODEL
Key idea 1: Represent documents as vectors in the space.
Key idea 2: Do the same for queries: represent them as vectors in the space.
Key idea 3: Rank documents according to their proximity to the query in this space.

25 PROXIMITY?

26 | Proximity
Proximity = similarity of vectors.
Proximity ≈ inverse of distance.
The documents with the greatest proximity to the query get the highest scores, and therefore rank highest.

27 How to Measure Vector Space Proximity?

28

29 | Proximity
First cut: the distance between two points (i.e., between the end points of the two vectors).
Euclidean distance?
Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

30 | Distance Example
Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous ... (gossip: 90x, jealous: 70x)

31 | Distance Example
Query : gossip jealous

Logarithmic TF and IDF (N = 3):

Term    | Doc 1 | Doc 2 | Doc 3 | Query | IDF
Gossip  | 1     | 0     | 2.95  | 1     | 0.17
Jealous | 0     | 1     | 2.84  | 1     | 0.17

TF-IDF:

Term    | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  | 0.17  | 0     | 0.50  | 0.17
Jealous | 0     | 0.17  | 0.48  | 0.17

32 | Distance Example
Query : gossip jealous
[Plot: Doc 1, Doc 2, Doc 3 and the query as TF-IDF vectors on axes "Gossip" and "Jealous"]

Inverted Index (TF-IDF):

Term    | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  | 0.17  | 0     | 0.50  | 0.17
Jealous | 0     | 0.17  | 0.48  | 0.17

33 | Why Distance is a Bad Idea
The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query and the distribution of terms in Doc 3 are very similar.
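This can be checked numerically with the slides' raw counts; a minimal sketch in Python (an illustrative addition, not from the deck):

```python
import math

# Raw term-count vectors over (gossip, jealous), following the slides' example.
q    = (1, 1)    # query: "gossip jealous"
doc1 = (1, 0)    # gossip only
doc3 = (90, 70)  # same *mix* of terms as the query, just a much longer document

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(euclid(q, doc1), 2))  # 1.0
print(round(euclid(q, doc3), 2))  # 112.61 — huge, despite the similar distribution
```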

34 | So, instead of Distance?
Thought experiment: take a document d and append it to itself. Call this document d′.
"Semantically" d and d′ have the same content.
Yet the Euclidean distance between the two documents can be quite large.

35 | So, instead of Distance?
The angle between the two documents is 0, corresponding to maximal similarity.
[Plot: d and d′ pointing in the same direction on axes "Gossip" and "Jealous"]

36 | Use angle instead of distance
Key idea: Rank documents according to angle with query.

37 | From angles to cosines
The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document.
Rank documents in increasing order of cosine(query, document).
Cosine is a monotonically decreasing function on the interval [0°, 180°].

38 | From angles to cosines

39 But how – and why – should we be computing cosines?

40 a · b = |a| × |b| × cos(θ)
where:
|a| is the magnitude (length) of vector a,
|b| is the magnitude (length) of vector b,
θ is the angle between a and b.
Rearranging: cos(θ) = (a · b) / (|a| × |b|)

41 cos(q, d) = (q · d) / (|q| |d|) = Σi qi di / (√(Σi qi²) × √(Σi di²))
qi is the tf-idf weight (or whatever) of term i in the query.
di is the tf-idf weight (or whatever) of term i in the document.
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

42 | Length normalization
A vector can be (length-)normalized by dividing each of its components by its length.
Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere).
Unit vector = a vector whose length is exactly 1 (the unit length).
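Length normalization can be sketched in a few lines of Python (an illustrative addition, not from the deck); the vectors d and d2 below stand in for d and d′ (d appended to itself):

```python
import math

def normalize(v):
    """Divide each component by the vector's L2 length, giving a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d  = [2.0, 1.0]
d2 = [4.0, 2.0]  # d appended to itself: every component doubles

print(normalize(d))   # same direction...
print(normalize(d2))  # ...essentially identical after normalization
```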

43 | Remember this Case
[Plot: d and d′ on axes "Gossip" and "Jealous", before length normalization]

44 | Length normalization
Here length is the L2 norm: |x| = √(Σi xi²)

45 | Remember this Case
[Plot: after length normalization, d and d′ coincide on the unit circle (axes "Gossip" and "Jealous")]

46 | Length normalization
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
Long and short documents now have comparable weights.

47 After Normalization :
cos(q, d) = (q · d) / (|q| |d|) simplifies, since |q| = |d| = 1.

48 After Normalization : cos(q, d) = q · d = Σi qi di, for q, d length-normalized.

49 | Cosine similarity illustrated

50 | Cosine similarity illustrated
The value of the cosine similarity lies in [0, 1] (for non-negative weights such as tf-idf).
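A minimal cosine-similarity sketch in Python (an illustrative addition, not from the deck), reusing the gossip/jealous counts from the distance example:

```python
import math

def cosine(a, b):
    """cos(a, b) = (a · b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine([1, 0], [1, 0]))              # 1.0 (same direction)
print(cosine([1, 0], [0, 1]))              # 0.0 (orthogonal)
print(round(cosine([1, 1], [90, 70]), 3))  # 0.992 — high despite very different lengths
```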

51 Example?

52 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000 documents

Query:
Term      | tf-raw | tf-wt | df    | idf | tf.idf | n'lize
auto      | 0      | 0     | 5000  | 2.3 | 0      |
best      | 1      | 1     | 50000 | 1.3 | 1.3    |
car       | 1      | 1     | 10000 | 2.0 | 2.0    |
insurance | 1      | 1     | 1000  | 3.0 | 3.0    |
Query length =

53 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Query:
Term      | tf-raw | tf-wt | df    | idf | tf.idf | n'lize
auto      | 0      | 0     | 5000  | 2.3 | 0      | 0
best      | 1      | 1     | 50000 | 1.3 | 1.3    | 0.34
car       | 1      | 1     | 10000 | 2.0 | 2.0    | 0.52
insurance | 1      | 1     | 1000  | 3.0 | 3.0    | 0.78
Query length = √(1.3² + 2.0² + 3.0²) ≈ 3.8

54 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Document:
Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
auto      | 1      | 1     | 2.3 | 2.3    |
best      | 0      | 0     | 1.3 | 0      |
car       | 1      | 1     | 2.0 | 2.0    |
insurance | 2      | 1.3   | 3.0 | 3.9    |
Doc length =

55 | TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Document:
Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
auto      | 1      | 1     | 2.3 | 2.3    | 0.3
best      | 0      | 0     | 1.3 | 0      | 0
car       | 1      | 1     | 2.0 | 2.0    | 0.5
insurance | 2      | 1.3   | 3.0 | 3.9    | 0.79
Doc length =

56 After Normalization : cos(q, d) = q · d = Σi qi di, for q, d length-normalized.

57 | TF-IDF Example (Sec. 6.4)
Document: car insurance auto insurance
Query: best car insurance

Term      | Query tf.idf | Query n'lize | Doc tf.idf | Doc n'lize | Dot Product
auto      | 0            | 0            | 2.3        | 0.3        | 0
best      | 1.3          | 0.34         | 0          | 0          | 0
car       | 2.0          | 0.52         | 2.0        | 0.5        | 0.26
insurance | 3.0          | 0.78         | 3.9        | 0.79       | 0.62
Score = 0.26 + 0.62 = 0.88
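Since both vectors are length-normalized, the score is just their dot product; a quick check in Python (an illustrative addition, not from the deck), using the normalized weights shown on the slide:

```python
# Normalized tf-idf weights from the slide (query: "best car insurance",
# document: "car insurance auto insurance").
query = {"auto": 0.0, "best": 0.34, "car": 0.52, "insurance": 0.78}
doc   = {"auto": 0.3, "best": 0.0,  "car": 0.5,  "insurance": 0.79}

# Cosine score of length-normalized vectors = dot product.
score = sum(query[t] * doc[t] for t in query)
print(round(score, 2))  # 0.88
```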

58 | Summary – vector space ranking
Represent the query as a weighted tf-idf vector.
Represent each document as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.

59 Cosine similarity amongst 3 documents (Sec. 6.3)
How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

term      | SaS | PaP | WH
affection | 115 | 58  | 20
jealous   | 10  | 7   | 11
gossip    | 2   | 0   | 6
wuthering | 0   | 0   | 38

Note: To simplify this example, we don't do idf weighting.

60 3 documents example contd. (Sec. 6.3)

Log frequency weighting:

term      | SaS  | PaP  | WH
affection | 3.06 | 2.76 | 2.30
jealous   | 2.00 | 1.85 | 2.04
gossip    | 1.30 | 0    | 1.78
wuthering | 0    | 0    | 2.58

After length normalization:

term      | SaS   | PaP   | WH
affection | 0.789 | 0.832 | 0.524
jealous   | 0.515 | 0.555 | 0.465
gossip    | 0.335 | 0     | 0.405
wuthering | 0     | 0     | 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
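The whole three-novels computation can be reproduced end to end in Python (an illustrative addition, not from the deck), from raw counts to the cosine values on the slide:

```python
import math

# Raw term counts from the slide (no idf weighting in this example).
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weights(tf):
    return {t: 1 + math.log10(c) if c > 0 else 0.0 for t, c in tf.items()}

def normalize(w):
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

vecs = {name: normalize(log_weights(tf)) for name, tf in counts.items()}

def cos(a, b):
    # Both vectors are length-normalized, so cosine = dot product.
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```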

