1 The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Programming Collective Intelligence by Toby Segaran, O’Reilly Media, 2007, ISBN 978-0-596-52932-1

2  A cluster is a group of related things  Automatic detection of clusters is a powerful data discovery tool  Detect similar user interests, buying patterns, clickthrough patterns, etc.  Also applicable to the sciences ▪ In computational biology, find groups (or clusters) of genes that exhibit similar behavior

3  Data clustering is an example of an unsupervised learning algorithm... ...which is an AI technique for discovering structure within one or more datasets  The key goal is to find the distinct group(s) that exist within a given dataset  We don’t know what we’ll find

4 We first need to identify a common set of numerical attributes that we can compare to measure how similar two blogs are. Can we do anything with word frequencies?

5  If we cluster blogs based on their word frequencies, maybe we can identify groups of blogs that are... ...similar in terms of blog content ...similar in terms of writing style ...of interest for searching, cataloging, etc.

6  A feed is a simple XML document containing information about a blog and its entries  Reader apps enable users to read multiple blogs in a single window ▪ Click below to check out the Google Reader blog:

7  Check out these feeds:  http://blogs.abcnews.com/theblotter/index.rdf  http://www.wired.com/rss/index.xml  http://www.tmz.com/rss.xml  http://scienceblogs.com/sample/combined.xml  http://www.neilgaiman.com/journal/feed/rss.xml
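Before any clustering, each feed needs to be pulled down and reduced to word counts. Below is a minimal Python sketch of that step, in the spirit of the feed-counting script in Programming Collective Intelligence; it assumes the third-party feedparser package is installed, and the function name and tokenizer are just illustrations:

```python
# A rough sketch, not the course's exact script: fetch one feed and count words.
import re
import feedparser  # third-party: pip install feedparser

def get_word_counts(url):
    """Parse one RSS/Atom feed and return (blog title, {word: count})."""
    parsed = feedparser.parse(url)
    counts = {}
    for entry in parsed.entries:
        # Use the entry summary when present, falling back to the title.
        text = getattr(entry, 'summary', '') + ' ' + getattr(entry, 'title', '')
        text = re.sub(r'<[^>]+>', ' ', text)          # strip embedded HTML tags
        for word in re.split(r'[^A-Za-z]+', text.lower()):
            if word:
                counts[word] = counts.get(word, 0) + 1
    return parsed.feed.get('title', url), counts

title, counts = get_word_counts('http://www.wired.com/rss/index.xml')
print(title, sorted(counts.items(), key=lambda kv: -kv[1])[:10])
```

Running this for every feed above gives one word-count dictionary per blog, which is the raw material for the stop-word filtering and clustering that follow.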

8  Techniques for avoiding stop words:  Ignore words on a predefined stop list  Select words from within a predefined range of occurrence percentages ▪ Lower bound of 10% ▪ Upper bound of 50% ▪ Tune as necessary
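One way to implement the second technique is to compute, for each word, the fraction of blogs it appears in and keep only words inside the bounds. A minimal sketch, assuming wordcounts maps each blog name to its {word: count} dictionary (for example, built with get_word_counts above); the 10% and 50% bounds are the tunable defaults from the slide:

```python
# Keep only words that occur in more than `lower` and less than `upper`
# of all blogs; very rare and very common (stop) words are dropped.
def select_words(wordcounts, lower=0.10, upper=0.50):
    appears_in = {}
    for counts in wordcounts.values():
        for word in counts:
            appears_in[word] = appears_in.get(word, 0) + 1
    n_blogs = len(wordcounts)
    return [w for w, n in appears_in.items() if lower < n / n_blogs < upper]
```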

9  Study the resulting blog data  Identify any patterns in the data  Which blogs are very similar?  Which blogs are very different?  How can these techniques be applied to other types of search?  Web search?  Enterprise search?

10  Hierarchical clustering is an algorithm that groups similar items together  At each iteration, the two most similar items (or groups) are merged  For example, given five items A–E: [diagram: five separate items A, B, C, D, E]

11  Calculate the distances between all items  Group the two items that are closest:  Repeat! [diagram: A and B, the closest pair, merge into group AB]

12  How do we compare group AB to other items? ▪ Use the midpoint of items A and B [diagram: the midpoint (x) of A and B stands in for AB; further merges form groups ABC and DE]

13  When do we stop? ▪ When we have a top-level group that includes all items [diagram: ABC and DE merge into the top-level group ABCDE]
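Putting the last few slides together, here is a minimal Python sketch of the merging loop (not the book's exact hcluster code): each leaf is one item, each merge records its two children and their distance, and the merged group is represented by the midpoint of its members. The distance argument can be any of the measures introduced on the upcoming slides:

```python
# A sketch of agglomerative (hierarchical) clustering over numeric vectors.
class Cluster:
    def __init__(self, vec, left=None, right=None, dist=0.0, label=None):
        self.vec, self.left, self.right = vec, left, right
        self.dist, self.label = dist, label

def hcluster(rows, labels, distance):
    # Start with one cluster per item (the leaves of the dendrogram).
    clusters = [Cluster(vec, label=lab) for vec, lab in zip(rows, labels)]
    while len(clusters) > 1:
        # Find the pair of clusters that are currently closest together.
        best_pair = (0, 1)
        best_dist = distance(clusters[0].vec, clusters[1].vec)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i].vec, clusters[j].vec)
                if d < best_dist:
                    best_pair, best_dist = (i, j), d
        a, b = clusters[best_pair[0]], clusters[best_pair[1]]
        # Represent the merged group by the midpoint (average) of its members.
        merged = Cluster([(x + y) / 2 for x, y in zip(a.vec, b.vec)],
                         left=a, right=b, dist=best_dist)
        clusters = [c for k, c in enumerate(clusters) if k not in best_pair]
        clusters.append(merged)
    return clusters[0]  # the single top-level group containing every item
```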

14  The hierarchical part is based on the discovery order of clusters  This diagram is called a dendrogram... [dendrogram: leaves A–E joined into AB, DE, ABC, and finally ABCDE]

15  A dendrogram is a graph (or tree)  Distances between nodes of the dendrogram show how similar items (or groups) are  AB is closer (to A and B) than DE is (to D and E), so A and B are more similar than D and E  How can we define closeness?
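The slides draw the dendrogram as a picture; as a rough text-only stand-in, the tree returned by the hcluster sketch above can be printed with indentation, where '-' marks a merged group and each leaf prints its blog label:

```python
# Print the cluster tree as indented text (a crude dendrogram).
def print_dendrogram(c, depth=0):
    indent = '  ' * depth
    if c.left is None and c.right is None:
        print(indent + (c.label or '?'))   # leaf: a single item
    else:
        print(indent + '-')                # internal node: a merged group
        print_dendrogram(c.left, depth + 1)
        print_dendrogram(c.right, depth + 1)

# Hypothetical usage, given blog word-count vectors and a distance function:
# print_dendrogram(hcluster(rows, blog_names, distance))
```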

16  A similarity score compares two distinct elements from a given set  To measure closeness, we need to calculate a similarity score for each pair of items in the set  Options include: ▪ The Euclidean distance score, which is based on the distance formula in two-dimensional geometry ▪ The Pearson correlation score, which is based on fitting data points to a line

17  To find the Euclidean distance between two data points, use the distance formula: distance = √((x₂ – x₁)² + (y₂ – y₁)²)  The larger the distance between two items, the less similar they are  So use the reciprocal of distance as a measure of similarity (but be careful of division by zero)
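A minimal sketch of the Euclidean similarity score, generalized from the two-dimensional formula above to word-count vectors of any length; adding 1 before taking the reciprocal is one common way to avoid dividing by zero when two items are identical:

```python
from math import sqrt

def euclidean_similarity(v1, v2):
    # Straight-line distance between the two vectors.
    distance = sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    # Reciprocal so that larger values mean "more similar".
    return 1 / (1 + distance)
```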

18  The Pearson correlation score is derived by determining the best-fit line for a given set of data points [scatterplot: items plotted on axes v1 and v2 with a best-fit line]  The best-fit line, on average, comes as close as possible to each item  The Pearson correlation score is a coefficient measuring the degree to which items are on the best-fit line

19  The Pearson correlation score tells us how closely items are correlated to one another  1.0 is a perfect match; ~0.0 is no relationship [two scatterplots on axes v1 and v2: a loose fit with correlation score 0.4 and a tighter fit with correlation score 0.8]

20  The algorithm is:  Calculate sum(v1) and sum(v2)  Calculate the sum of the squares of v1 and v2 ▪ Call them sum1Sq and sum2Sq  Calculate the sum of the products of v1 and v2 ▪ (v1[0] * v2[0]) + (v1[1] * v2[1]) +... + (v1[n-1] * v2[n-1]) ▪ Call this pSum

21  Calculate the Pearson score: r = (pSum – (sum(v1) * sum(v2)) / n) / √((sum1Sq – sum(v1)² / n) * (sum2Sq – sum(v2)² / n))  Much more complex, but often better than the Euclidean distance score
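Following those steps, here is a small Python sketch of the Pearson score; it returns the correlation coefficient itself, so 1.0 means a perfect match and values near 0.0 mean no relationship:

```python
from math import sqrt

def pearson(v1, v2):
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1Sq = sum(x ** 2 for x in v1)
    sum2Sq = sum(x ** 2 for x in v2)
    pSum = sum(x * y for x, y in zip(v1, v2))
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    if den == 0:
        return 0.0        # one vector has no variation
    return num / den

# e.g. pearson([1, 2, 3, 4], [2, 4, 6, 8]) == 1.0 (a perfect linear match)
```

Note that when this score is used as a distance for clustering, it is usually flipped (for example, 1 – r), since hierarchical clustering merges the pair with the smallest distance.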

22  Review the blog-data dendrograms  Identify any patterns in the data  Which blogs are very similar?  Which blogs are very different?  How can these techniques be applied to other types of search?  Web search?  Enterprise search?

