
1 On Assignment #1
Hongfei Yan, School of EECS, Peking University, 3/17/2015
Some slide credits: Everything Data (spring 2014)

2 Announcements (Mar. 17)
Assignment #2: see website for details.
- DUE: Monday 3/23 11:59pm. Three exercises on preprocessing. All files should be submitted through course.pku.edu.cn.
- DUE: Saturday 3/28 11:59pm. Read chapter 3 of The Elements of Statistical Learning.
Time & place changed:
- Sci bldg, 9:00am--12:00pm, Sunday (Mar. 22, 2015)
- Sci bldg 315, 9:00am--12:00pm, Sunday (starting Mar. 29, 2015)

3 Assignment #1
DUE: Saturday 3/21 11:59pm (changed from Friday 3/20 11:59pm)
- Course enrollment; join the cs410 discussion group; learn OpenRefine and regular expressions. No submission required for this part. Reference:
- Three exercises on document similarity. All files should be submitted through course.pku.edu.cn.
- Read chapters 1-2 of The Elements of Statistical Learning.

4 Three Exercises on Document Similarity
E1: Find two different ways of determining the number of times the word 'situation' appears in Emma.
E2: Working with the strings below as documents and using CountVectorizer with the input='content' parameter, create a document-term matrix.
E3: Using the matrix just created, calculate the Euclidean distance, Jaccard distance, and cosine distance between each pair of documents. Make sure to calculate distance (rather than similarity). Are our intuitions about which texts are most similar reflected in the measurements of distance?
In [65]: text1 = "Indeed, she had a rather kindly disposition."
In [66]: text2 = "The real evils, indeed, of Emma's situation were the power of having rather too much her own way, and a disposition to think a little too well of herself;"
In [67]: text3 = "The Jaccard distance is a way of measuring the distance from one set to another set."

5 CountVectorizer class from scikit-learn package
scikit-learn (Machine Learning in Python) provides tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib.
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer converts a collection of text documents to a matrix of token counts.
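The next slide uses vectorizer and dtm without showing how they were built. A minimal setup sketch, assuming Emma is loaded from NLTK's Gutenberg corpus (the corpus source and variable names are assumptions; the slides do not show this step):

# Setup sketch (assumption: the text of Emma comes from NLTK's Gutenberg corpus)
import numpy as np
from nltk.corpus import gutenberg
from sklearn.feature_extraction.text import CountVectorizer

emma = gutenberg.raw('austen-emma.txt')    # full text of Emma as one string
vectorizer = CountVectorizer(input='content')
dtm = vectorizer.fit_transform([emma])     # 1 x |vocab| sparse count matrix
dtm = dtm.toarray()                        # dense array, as the next slide indexes it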

6 E1: the number of times the word 'situation' appears in Emma
The first way:
In [6]: vocab = vectorizer.get_feature_names()  # a list
In [11]: situation_idx = list(vocab).index('situation')
In [12]: dtm[0, situation_idx]
The second way (vocab must first be converted to a NumPy array, otherwise the boolean mask vocab == 'situation' does not select the column):
In [13]: vocab = np.array(vocab)
In [14]: dtm[0, vocab == 'situation']

7 E2: using CountVectorizer with the input='content' parameter, create a document-term matrix
contents = [text1, text2, text3]
vectorizer2 = CountVectorizer(input='content')
dtm2 = vectorizer2.fit_transform(contents)   # 3 x |vocab| sparse count matrix
dtm2 = dtm2.toarray()                        # one row per text, one column per term
vocab2 = vectorizer2.get_feature_names()
vocab2 = np.array(vocab2)
print(dtm2)
print(vocab2)
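A quick sanity check (not shown on the slide; the print is an assumption) confirms the document-term matrix has one row per text:

print(dtm2.shape)   # (3, number of distinct terms across the three texts)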

8 E3: the Euclidean distance, Jaccard distance, and cosine distance between each pair of documents
- Euclidean distance
- Jaccard index / Jaccard similarity coefficient
- Cosine similarity
These SciPy distance functions operate on two vectors u and v (two boolean 1-D arrays in the Jaccard case). Computing distances over a large collection of vectors is inefficient with these pairwise functions; use pdist for that purpose.
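For reference, a minimal sketch of the pairwise SciPy functions named above; the vectors u and v are made up for illustration:

import numpy as np
from scipy.spatial.distance import euclidean, jaccard, cosine

u = np.array([1, 0, 2, 1])
v = np.array([0, 1, 2, 0])
print(euclidean(u, v))        # sqrt(sum((u - v) ** 2))
print(jaccard(u > 0, v > 0))  # on boolean arrays: 1 - |A ∩ B| / |A ∪ B|
print(cosine(u, v))           # 1 - u·v / (||u|| ||v||)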

9 Jaccard Distance
Obtained by subtracting the Jaccard coefficient from 1: d(A, B) = 1 - |A ∩ B| / |A ∪ B|. A useful dissimilarity measure for sets (and for long strings). Let A and B be two sets, e.g.:
- Words in two documents
- Friends lists of two individuals
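A tiny worked example (the word sets here are made up for illustration):

A = {'real', 'evils', 'situation'}
B = {'jaccard', 'distance', 'situation'}
jaccard_similarity = len(A & B) / len(A | B)   # |A ∩ B| / |A ∪ B| = 1/5 = 0.2
print(1 - jaccard_similarity)                  # Jaccard distance = 0.8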

10 The code for E3
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import pairwise_distances
…
print(np.round(pairwise_distances(dtm2, metric='euclidean'), 2))
# np.sign turns counts into 0/1 presence values, since Jaccard expects boolean data
print(np.round(pairwise_distances(np.sign(dtm2), metric='jaccard'), 2))
print(np.round(pairwise_distances(dtm2, metric='cosine'), 2))

11 text2 and text3 are most similar?
With raw token counts, the computed distances rank text2 and text3 as closest, which contradicts intuition: what they mostly share are high-frequency function words such as "the", "of", and "a", not meaningful content.

12 text1 and text2 are most similar
Passing stop_words='english' removes common function words before counting; with them gone, text1 and text2 (which share content words such as "disposition") come out closest.
contents = [text1, text2, text3]
vectorizer2 = CountVectorizer(input='content', stop_words='english')   # the only change from E2
dtm2 = vectorizer2.fit_transform(contents)
dtm2 = dtm2.toarray()
vocab2 = vectorizer2.get_feature_names()
vocab2 = np.array(vocab2)

13 Visualizing distances
One approach to visualizing distances, multidimensional scaling (MDS), is to assign a point in a plane to each text, making sure that the distance between points is proportional to the pairwise distances we calculated.
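A minimal MDS sketch, assuming dist holds the cosine distance matrix computed on slide 10 (the variable name and plotting details are assumptions; the slides only show the resulting plot):

import matplotlib.pyplot as plt
from sklearn.manifold import MDS

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=1)
pos = mds.fit_transform(dist)                  # one (x, y) point per text
plt.scatter(pos[:, 0], pos[:, 1])
for i, name in enumerate(['text1', 'text2', 'text3']):
    plt.annotate(name, pos[i])
plt.show()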

14 [figure: two-dimensional MDS plot of the three texts]

15 do MDS in three dimensions
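The same idea in three dimensions, under the same assumptions as the 2-D sketch above:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # enables the 3-D projection in matplotlib
from sklearn.manifold import MDS

mds3 = MDS(n_components=3, dissimilarity='precomputed', random_state=1)
pos3 = mds3.fit_transform(dist)           # one (x, y, z) point per text
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pos3[:, 0], pos3[:, 1], pos3[:, 2])
plt.show()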

16 Clustering texts based on distance
The ideas underlying the transition from distances to clusters are, for the most part, common sense: any clustering of texts should place texts that are closer to each other (in the distance matrix) in the same cluster. Ward's method produces a hierarchy of clusterings, and all it requires is a set of pairwise distances.

17 Ward's method
- Start with each text in its own cluster.
- Until only a single cluster remains: find the closest clusters and merge them. The distance between two clusters is the change in the sum of squared distances when they are merged.
- Return a tree containing a record of cluster merges.
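A minimal sketch of Ward clustering on the same distance matrix, using SciPy's hierarchy module (an assumption; the slides do not show this code):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import ward, dendrogram
from scipy.spatial.distance import squareform

linkage_matrix = ward(squareform(dist))    # condensed distances -> merge tree
dendrogram(linkage_matrix, labels=['text1', 'text2', 'text3'])
plt.show()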

18 [figure: dendrogram from Ward clustering of the three texts]

