JJE: INEX XML Competition Bryan Clevenger James Reed Jon McElroy.


1 JJE: INEX XML Competition Bryan Clevenger James Reed Jon McElroy

2 Introduction  Deal with the large size of the internet by using better categorization techniques  Goal: Optimize search time by grouping pages into clusters  Wikipedia is the data source

3 Problem  Take the Wikipedia data and create an algorithm that clusters it.  Clustering reduces the search space for related information.

4 Solution  If documents share several links, they likely contain similar data.  Focused on the link data set: Link data: 39484 2039 4952 1029 39 1920 10233 30197

5 Overall solution  Determine sub-communities in the graph using Max-Flow/Min-Cut Community Discovery  Heuristics used to find relevant seeds

6 Max Flow – Min Cut  Edge Capacity – similar to edge weight. Represents the “amount” of information that can be pushed along an edge.  Flow – The sum, over the paths from one node to another, of each path's minimum edge capacity.
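The flow definition above can be illustrated with a minimal Python sketch. The toy graph, its node names, and the helper `path_bottleneck` are assumptions made for illustration and do not come from the slides. Note that summing path minimums gives the true maximum flow only when the paths share no edges, as they do here; in general an augmenting-path algorithm such as Ford-Fulkerson is needed.

```python
def path_bottleneck(capacity, path):
    """Return the smallest edge capacity along a path given as a list of nodes."""
    return min(capacity[u][v] for u, v in zip(path, path[1:]))


# Hypothetical toy graph: node -> {neighbor: capacity}.
capacity = {
    "s": {"a": 3, "b": 4},
    "a": {"t": 2},
    "b": {"t": 5},
    "t": {},
}

# Two edge-disjoint paths from s to t.
paths = [["s", "a", "t"], ["s", "b", "t"]]

# Each path can carry at most its bottleneck; the flow is the sum over paths.
total_flow = sum(path_bottleneck(capacity, p) for p in paths)
print(total_flow)
```

Here the path through `a` is limited by the edge of capacity 2 and the path through `b` by the edge of capacity 4.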

7 Max Flow – Min Cut (cont.)  The flow between two nodes in the same cluster should be larger than the flow between two nodes in separate clusters.
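The claim above can be checked on a small example. The following is a sketch, not the presenters' implementation: it uses the Edmonds-Karp variant of Ford-Fulkerson (breadth-first augmenting paths), treats links as undirected, and runs on a made-up graph of two triangles joined by one weak edge. All function names and the edge list are hypothetical.

```python
from collections import deque


def build_residual(edges):
    """Build a residual graph from undirected (u, v, capacity) edges."""
    residual = {}
    for u, v, cap in edges:
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + cap
        residual[v][u] = residual[v].get(u, 0) + cap
    return residual


def bfs_parents(residual, source):
    """Map every node reachable via positive residual capacity to its parent."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, cap in residual[u].items():
            if cap > 0 and v not in parents:
                parents[v] = u
                queue.append(v)
    return parents


def max_flow_and_source_side(edges, source, sink):
    """Return (max flow value, set of nodes on the source side of a min cut)."""
    residual = build_residual(edges)
    total = 0
    while True:
        parents = bfs_parents(residual, source)
        if sink not in parents:
            # No augmenting path left: the reachable nodes form the source side.
            return total, set(parents)
        # Find the bottleneck capacity along the path back from the sink.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical graph: clusters {1, 2, 3} and {4, 5, 6} joined by a weak edge 3-4.
EDGES = [(1, 2, 3), (1, 3, 3), (2, 3, 3),
         (4, 5, 3), (4, 6, 3), (5, 6, 3),
         (3, 4, 1)]

flow_inside, _ = max_flow_and_source_side(EDGES, 1, 2)
flow_across, community = max_flow_and_source_side(EDGES, 1, 6)
print(flow_inside, flow_across, sorted(community))
```

The flow between nodes 1 and 2 (same cluster) is larger than the flow between nodes 1 and 6 (different clusters), and the source side of the minimum cut for the second query recovers the first triangle as a community.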

8 Max Flow – Min Cut (cont.)

9 Max-Flow Community Discovery

10 Implementation

11 Implementation (Parsing)  Links parsed into a Graph. Graph: a HashMap from Document Id to a HashMap of Link Ids to Capacity.  A Links structure was also created: Links[0] = 3244,2645,791 Links[1] = 10293,432,2,1230... Links[max] = 1012
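The nested map described above might be built along these lines. This is a rough sketch under stated assumptions: the slides do not specify the file format, so it assumes each line holds a document ID followed by the IDs it links to, and it assumes a capacity of 1 per link. The function `parse_links` and its `default_capacity` parameter are hypothetical, and Python dicts stand in for the HashMaps.

```python
def parse_links(lines, default_capacity=1):
    """Parse whitespace-separated link lines into {doc_id: {link_id: capacity}}.

    Assumption: the first number on each line is the document ID and the
    remaining numbers are the documents it links to.
    """
    graph = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        doc_id = int(fields[0])
        neighbors = graph.setdefault(doc_id, {})
        for field in fields[1:]:
            link_id = int(field)
            # Repeated links to the same target add up (an assumption).
            neighbors[link_id] = neighbors.get(link_id, 0) + default_capacity
    return graph


# The sample link data shown on the Solution slide.
graph = parse_links(["39484 2039 4952 1029 39 1920 10233 30197"])
print(graph[39484])
```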

12 Implementation (Initialization of Community Seeds)  Using the Links structure, a percentage of the nodes with the most links is chosen as seeds
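One way to pick seeds by link count is sketched below. The slides do not give the percentage used or how ties were broken, so the `fraction` parameter, the tie-break on document ID, and the function name `choose_seeds` are all assumptions; the sketch also ranks by outgoing links only and reads them from a nested dict rather than from the Links structure.

```python
import math


def choose_seeds(graph, fraction=0.1):
    """Return the top `fraction` of documents ranked by number of outgoing links.

    `graph` maps doc_id -> {link_id: capacity}. Ties are broken by the lower
    document ID (an assumption made to keep the result deterministic).
    """
    ranked = sorted(graph, key=lambda doc: (-len(graph[doc]), doc))
    if not ranked:
        return []
    count = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:count]


# Hypothetical toy graph for illustration.
toy_graph = {
    1: {2: 1, 3: 1, 4: 1},
    2: {1: 1},
    3: {},
    4: {1: 1, 2: 1},
}
seeds = choose_seeds(toy_graph, fraction=0.5)
print(seeds)
```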

13 Implementation (Finding Communities)  Idea, why it didn’t work?  robots

14 Implementation (Visualization)  Walrus is an interactive 3D visualization tool that works on large directed graphs.  Input and output parsing.  Clusters grouped by color.

15 Results  The INEX links data was composed of 54,000 nodes and 15 million links  Average running time to retrieve one cluster on a DELL Duo Core 2.0 GHz Pentium Laptop was 5.9 hours  Cluster size is between 2K and 2.5K

16 Results  Visual Images of clusters

17 Conclusion  It worked... kinda.  Looks great!  See pretty pictures.

18 References
[1] INEX 2009 mining track. http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp, October 2009.
[2] The standard maximum flow problem. http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=maxFlow, November 2009.
[3] Walrus - graph visualization tool. http://www.caida.org/tools/visualization/walrus, December 2009.
[4] Mark C. Chu-Carroll. Maximum flow and minimum cut. http://scienceblogs.com/goodmath/2007/08/maximum_flow_and_minimum_cut_1.php, December 2009.
[5] Ford-Fulkerson algorithm. http://en.wikipedia.org/wiki/FordFulkersos_algorithm, October 2009.
[6] Max-flow min-cut theorem. http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem, November 2009.

19 Questions?  O really?

