Ch 4: Information Retrieval and Text Mining


1 Ch 4: Information Retrieval and Text Mining
Hakam Alomari

2 4.1: Is Information Retrieval a Form of Text Mining?
What is the principal computer specialty for processing documents and text? Information Retrieval (IR). The task of IR is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity: a query is examined and transformed into a vector of values to be compared with the stored documents.

3 Cont. 4.1 In the prediction problem, similar documents are retrieved and their properties are measured, e.g. the class labels are counted to see which label should be assigned to a new document. The objectives of prediction can thus be posed in the form of an IR model in which documents relevant to a query are retrieved, where the query is the new document.

4 Cont. 4.1
Figure 4.1. Key Steps in Information Retrieval: Specify Query → Search Document Collection → Return Subset of Relevant Documents.
Figure 4.2. Key Steps in Predictive Text Mining: Examine Document Collection → Learn Classification Criteria → Apply Criteria to New Documents.

5 Cont. 4.1
Figure 4.2. Key Steps in IR: Specify Query Vector → Match Document Collection → Get Subset of Relevant Documents.
Figure 4.3. Predicting from Retrieved Documents: Specify Query Vector → Match Document Collection → Get Subset of Relevant Documents → Examine Document Properties (simple criteria such as the documents' labels).

6 4.2 Key Word Search The technical goal of prediction is to classify new, unseen documents. Prediction and IR are unified by the computation of document similarity. IR is based on traditional keyword search through a search engine, so using a search engine can be recognized as a special instance of the prediction concept.

7 We enter keywords into a search engine and expect relevant documents to be returned.
These keywords are words in a dictionary created from the document collection and can be viewed as a small document. So we want to measure how similar the new document (the query) is to the documents in the collection.

8 So the notion of similarity is reduced to finding documents with the same keywords as those posed to the search engine. But the objective of the search engine is to rank the documents, not to assign a label, so we need additional techniques to break the expected ties (all retrieved documents match the search criteria).

9 4.3 Nearest-Neighbor Methods
Nearest-neighbor methods compare vectors and measure similarity. In prediction, they collect the k most similar documents and then look at their labels. In IR, they determine whether a satisfactory response to the search query has been found.
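A minimal sketch of the prediction side of this idea, assuming documents are already represented as vectors and some similarity function from Section 4.4 is available (all names here are illustrative):

```python
from collections import Counter

def knn_label(query_vec, documents, labels, similarity, k=3):
    """Assign a label to query_vec by majority vote of its k most similar documents.

    documents: list of document vectors; labels: their class labels;
    similarity: a function (vec_a, vec_b) -> score, e.g. shared word count.
    """
    ranked = sorted(range(len(documents)),
                    key=lambda i: similarity(query_vec, documents[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```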

10 4.4 Measuring Similarity These measures are used to examine how similar documents are; the output is a numerical measure of similarity. Three increasingly complex measures: Shared Word Count, Word Count and Bonus, and Cosine Similarity.

11 4.4.1 Shared Word Count Counts the shared words between documents.
The words: in IR we have a global dictionary in which all potential words are included, with the exception of stopwords. In prediction it is better to preselect the dictionary relative to the label.

12 Computing Similarity by Shared Words
Look at all words in the new document. For each document in the collection, count how many of these words appear. No weighting is used, just a simple count. The dictionary contains true keywords (weakly predictive words removed). The results of this measure are clearly intuitive: no one will question why a document was retrieved.

13 Computing Similarity by Shared Words
Each document is represented as a vector of keywords (zeros and ones). The similarity of two documents is the product of the two vectors: if both documents contain the same keyword, that word is counted (1*1). The performance of this measure depends mainly on the dictionary used.
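A minimal sketch of this measure, assuming documents have already been reduced to binary keyword vectors over the same dictionary:

```python
def shared_word_count(doc_a, doc_b):
    """Shared word count: dot product of two binary keyword vectors.

    doc_a, doc_b: lists of 0/1 values, one entry per dictionary word.
    A word contributes 1 only when it appears in both documents (1*1).
    """
    return sum(a * b for a, b in zip(doc_a, doc_b))

# Example with a hypothetical 5-word dictionary
query = [1, 0, 1, 1, 0]
doc   = [1, 1, 1, 0, 0]
print(shared_word_count(query, doc))  # 2 shared words
```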

14 Computing Similarity by Shared Words
Shared word count is an exact search: it either retrieves or does not retrieve a document. No weighting can be applied to the query terms; given terms A and B, you cannot specify that A is more important than B. Every retrieved document is treated equally.

15 4.4.2 Word Count and Bonus 1/4
TF – term frequency: the number of times a term occurs in a document.
DF – document frequency: the number of documents that contain the term.
IDF – inverse document frequency: idf = log(N/df), where N is the total number of documents.
A vector is a numerical representation of a point in a multi-dimensional space: (x1, x2, ..., xn). The dimensions of the space need to be defined, and a measure on the space needs to be defined.

16 4.4.2 Word Count and Bonus 2/4
Each indexing term is a dimension and each document is a vector: Di = (ti1, ti2, ti3, ..., tiK), where K is the number of words. Document similarity is defined as
Sim(D1, D2) = sum over j = 1..K of w(j), where w(j) = 1 + 1/df(j) if word j occurs in both documents, and w(j) = 0 otherwise.

17 4.4.2 Word Count and Bonus 3/4 The bonus 1/df(j) is a variant of idf: if the word occurs in many documents, the bonus is small. This measure is better than the shared word count because it discriminates between weakly and strongly predictive words.
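A minimal sketch of the word count and bonus measure as defined above, assuming binary keyword vectors and a precomputed list of document frequencies (names are illustrative):

```python
def word_count_with_bonus(doc_a, doc_b, df):
    """Word count and bonus similarity.

    doc_a, doc_b: binary keyword vectors over the same dictionary.
    df: df[j] is the number of collection documents containing word j.
    Each shared word j contributes 1 + 1/df[j]; rarer shared words earn a larger bonus.
    """
    score = 0.0
    for j, (a, b) in enumerate(zip(doc_a, doc_b)):
        if a and b:
            score += 1.0 + 1.0 / df[j]
    return score
```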

18 4.4.2 Word Count and Bonus 4/4
A document space is defined by five terms: hardware, software, user, information, index. The query is "hardware, user, information". The new document vector is compared against each document in the labeled spreadsheet, producing similarity-with-bonus scores (e.g. 2.83, 1.33, 1.5, 2.67).
Figure 4.4. Computing Similarity Scores with Bonus

19 4.4.3 Cosine Similarity: The Vector Space
A document is represented as a vector: (W1, W2, ..., Wn).
Binary: Wi = 1 if the corresponding term is in the document, Wi = 0 if it is not.
TF (term frequency): Wi = tfi, where tfi is the number of times term i occurs in the document.
TF*IDF (term frequency * inverse document frequency): Wi = tfi*idfi = tfi*(1 + log(N/dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection.

20 4.4.3 Cosine Similarity: The Vector Space
vec(D) = (w1, w2, ..., wt)
Sim(d1, d2) = cos(θ) = [vec(d1) · vec(d2)] / (|d1| * |d2|) = [Σj wd1(j) * wd2(j)] / (|d1| * |d2|)
Since w(j) > 0 whenever term j occurs in di, 0 <= sim(d1, d2) <= 1. A document is retrieved even if it matches the query terms only partially.
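A minimal sketch of this computation, assuming the term weights (e.g. tf*idf) have already been assigned to each vector:

```python
import math

def cosine_similarity(vec_d1, vec_d2):
    """Cosine of the angle between two term-weight vectors.

    vec_d1, vec_d2: lists of non-negative weights, one per dictionary term.
    Returns a value between 0 and 1; identical directions give 1.
    """
    dot = sum(w1 * w2 for w1, w2 in zip(vec_d1, vec_d2))
    norm1 = math.sqrt(sum(w * w for w in vec_d1))
    norm2 = math.sqrt(sum(w * w for w in vec_d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```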

21 4.4.3 Cosine Similarity: How to Compute the Weight wj?
A good weight must take two effects into account: quantification of intra-document content (similarity), the tf factor, i.e. the term frequency within a document; and quantification of inter-document separation (dissimilarity), the idf factor, i.e. the inverse document frequency. wj = tf(j) * idf(j)

22 4.4.3 Cosine Similarity TF in the given document shows how important the term is in that document (it makes words that are frequent in the document more important). IDF makes words that are rare across all documents more important. A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low frequency in all other documents. The term weights in a document affect the position of the document vector di = (wi,1, wi,2, ..., wi,t).

23 4.4.3 Cosine Similarity TF-IDF definitions:
fik: number of occurrences of term ti in document Dk
tfik = fik / max(fik): normalized term frequency
dfi: number of documents that contain term ti
idfi = log(N / dfi), where N is the total number of documents
wik = tfik * idfi: term weight
Intuition: rare words get more weight, common words get less weight.

24 Example TF-IDF Given a document containing terms with frequencies Kent = 3, Ohio = 2, University = 1, and assuming a collection of 10,000 documents with document frequencies Kent = 50, Ohio = 1300, University = 250 (using the natural log):
Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
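A short sketch reproducing these numbers with the definitions from the previous slide (natural log, tf normalized by the maximum term frequency in the document):

```python
import math

freqs = {"Kent": 3, "Ohio": 2, "University": 1}      # term counts in the document
df = {"Kent": 50, "Ohio": 1300, "University": 250}   # document frequencies
N = 10_000                                           # documents in the collection

max_f = max(freqs.values())
for term, f in freqs.items():
    tf = f / max_f
    idf = math.log(N / df[term])
    print(f"{term}: tf = {tf:.2f}, idf = {idf:.1f}, tf-idf = {tf * idf:.1f}")
# Matches the figures above up to rounding of the intermediate idf values.
```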

25 4.4.3 Cosine Similarity
Cosine weights: w(j) = tf(j) * idf(j), where idf(j) = log(N / df(j)).

