LINGO Sandra Gama. Internet → endless document collection.

1 LINGO Sandra Gama

2 Internet → endless document collection

3

4 Search Engines

5 NO question answering

6 FAST access to Web content

7 SENSITIVE to query quality

8 we NEED meaningful RESULTS

9

10 GROUPING by Similarity

11 Semantic structure

12 Groups

13 Description

14 Luxury car | Feline, panther family

15 Description QUALITY

16 How to cluster?

17

18 user query → Preprocessing → filtered docs → Phrase extraction → frequent phrases → Cluster-label induction → cluster labels → Cluster-content allocation → clustered documents

19 STAGE 1/4: PREPROCESSING

20 STAGE 1/4: PREPROCESSING
   1. Text segmentation
   2. Stemming
   3. Ignore stop words
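
The three preprocessing steps can be sketched as below. This is only an illustration: the stop-word list and the suffix-stripping `naive_stem` helper are hypothetical stand-ins, not the stemmer or word list used by LINGO.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    # 1. Text segmentation: lower-case and split into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 2. Stemming: reduce each token to a rough stem.
    stems = [naive_stem(token) for token in tokens]
    # 3. Ignore stop words: drop stems whose original token is a stop word.
    return [stem for stem, token in zip(stems, tokens) if token not in STOP_WORDS]


print(preprocess("Software for the sparse singular value decomposition"))
print(preprocess("Matrix computations"))
```

A production system would likely swap in a proper stemmer and a language-specific stop-word list; the structure of the pipeline stays the same.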

21 STAGE 2/4: PHRASE EXTRACTION

22

23

24 Goal

25

26

27

28

29 How it works

30

31 abracadabra (positions 1 to 11). How many non-empty suffixes?
   abracadabra, bracadabra, racadabra, acadabra, cadabra, adabra, dabra, abra, bra, ra, a
   11 suffixes

32 Suffixes of abracadabra$ (positions 1 to 12), sorted:
   Sorted suffix   Index
   a               11
   abra            8
   abracadabra     1
   acadabra        4
   adabra          6
   bra             9
   bracadabra      2
   cadabra         5
   dabra           7
   ra              10
   racadabra       3

33 Suffix array (indices of the sorted suffixes): 11 8 1 4 6 9 2 5 7 10 3
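
A minimal sketch of the suffix-array idea, assuming a naive sort-based construction rather than whatever algorithm LINGO itself uses. It works on characters, as in the abracadabra example; for phrase extraction the same idea would be applied to sequences of words. The helper names are illustrative only.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes, in sorted suffix order."""
    # Naive construction: sort the start positions by the suffix that begins there.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def common_prefix_length(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def longest_repeated_substring(text):
    """Repeated substrings show up as shared prefixes of neighbouring sorted suffixes."""
    sa = suffix_array(text)
    best = ""
    for left, right in zip(sa, sa[1:]):
        a, b = text[left - 1:], text[right - 1:]
        n = common_prefix_length(a, b)
        if n > len(best):
            best = a[:n]
    return best


print(suffix_array("abracadabra"))                # [11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(longest_repeated_substring("abracadabra"))  # abra
```

Sorting brings suffixes with a common prefix next to each other, which is why frequent phrases can be found by scanning neighbouring entries.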

34 STAGE 3/4: CLUSTER-LABEL INDUCTION

35

36

37 A → term × document matrix. Find matrices U, Σ, V such that $A = U \Sigma V^T$

38 Documents:
   D1: Large-scale singular value computations
   D2: Software for the sparse singular value decomposition
   D3: Introduction to modern information retrieval
   D4: Linear algebra for intelligent information retrieval
   D5: Matrix computations
   D6: Singular value cryptogram analysis
   D7: Automatic information organization
   Terms:
   T1: Information
   T2: Singular
   T3: Value
   T4: Computations
   T5: Retrieval
   Phrases:
   P1: Singular value
   P2: Information retrieval

39 Term-document matrix A (rows: terms T1 to T5, columns: documents D1 to D7)
                     D1    D2    D3    D4    D5    D6    D7
   T1: Information   0.00  0.00  0.56  0.56  0.00  0.00  1.00
   T2: Singular      0.49  0.71  0.00  0.00  0.00  0.71  0.00
   T3: Value         0.49  0.71  0.00  0.00  0.00  0.71  0.00
   T4: Computations  0.72  0.00  0.00  0.00  1.00  0.00  0.00
   T5: Retrieval     0.00  0.00  0.83  0.83  0.00  0.00  0.00
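
The slide does not state how the matrix entries are weighted. The values appear consistent with tf-idf weights (idf = log(N / df)) scaled so that every document column has unit length; the sketch below rebuilds the matrix under that assumption. The function name and the simple tokenization are illustrative only.

```python
import math

DOCS = [
    "Large-scale singular value computations",
    "Software for the sparse singular value decomposition",
    "Introduction to modern information retrieval",
    "Linear algebra for intelligent information retrieval",
    "Matrix computations",
    "Singular value cryptogram analysis",
    "Automatic information organization",
]
TERMS = ["information", "singular", "value", "computations", "retrieval"]


def term_document_matrix(docs, terms):
    """Rows are terms, columns are documents; assumed tf-idf with unit-length columns."""
    tokenized = [doc.lower().replace("-", " ").split() for doc in docs]
    n_docs = len(docs)
    # Inverse document frequency: log(N / number of documents containing the term).
    idf = [math.log(n_docs / sum(term in tokens for tokens in tokenized)) for term in terms]
    columns = []
    for tokens in tokenized:
        column = [tokens.count(term) * weight for term, weight in zip(terms, idf)]
        norm = math.sqrt(sum(x * x for x in column)) or 1.0
        columns.append([x / norm for x in column])
    # Transpose so that each row corresponds to one term.
    return [[column[i] for column in columns] for i in range(len(terms))]


A = term_document_matrix(DOCS, TERMS)
for term, row in zip(TERMS, A):
    print(f"{term:13s}", " ".join(f"{x:.2f}" for x in row))
```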

40 Abstract concept matrix (SVD)
   U =
    0.00   0.75   0.00  -0.66   0.00
    0.65   0.00  -0.28   0.00  -0.71
    0.65   0.00  -0.28   0.00   0.71
    0.39   0.00   0.92   0.00   0.00
    0.00   0.66   0.00   0.75   0.00
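
A sketch of the decomposition using NumPy's `numpy.linalg.svd`, assuming the term-document values from slide 39 rounded to two decimals. The signs of singular vectors are not unique, so individual columns may come out negated relative to the slide; the sketch compares absolute values only.

```python
import numpy as np

# Term-document matrix A as read from slide 39 (rows T1..T5, columns D1..D7).
A = np.array([
    [0.00, 0.00, 0.56, 0.56, 0.00, 0.00, 1.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.72, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.83, 0.83, 0.00, 0.00, 0.00],
])

# Thin SVD: U is 5x5, S holds the singular values in descending order, Vt is 5x7.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Sanity check: the factors reproduce A.
assert np.allclose(U @ np.diag(S) @ Vt, A)

# Keep the k strongest abstract concepts (k = 2 here, chosen for illustration).
k = 2
U_k = U[:, :k]
print(np.round(np.abs(U_k), 2))
```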

41 P = (rows: terms T1 to T5, columns: candidate labels)
                     P1    P2    T1    T2    T3    T4    T5
   T1: Information   0.00  0.56  1.00  0.00  0.00  0.00  0.00
   T2: Singular      0.71  0.00  0.00  1.00  0.00  0.00  0.00
   T3: Value         0.71  0.00  0.00  0.00  1.00  0.00  0.00
   T4: Computations  0.00  0.00  0.00  0.00  0.00  1.00  0.00
   T5: Retrieval     0.00  0.83  0.00  0.00  0.00  0.00  1.00
   P1: Singular value, P2: Information retrieval

42 $M = U_k^T P$ (rows: abstract concepts, columns: phrases/single words)
                P1    P2    T1    T2    T3    T4    T5
   Concept 1    0.92  0.00  0.00  0.65  0.65  0.39  0.00
   Concept 2    0.00  0.97  0.75  0.00  0.00  0.00  0.66
   P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval
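
The product can be checked by hand or with a few lines of code. The sketch below assumes k = 2, takes the first two columns of U and the matrix P from the preceding slides, and picks for each abstract concept the candidate with the highest score as its label. The function name is illustrative.

```python
CANDIDATES = [
    "Singular value", "Information retrieval",
    "Information", "Singular", "Value", "Computations", "Retrieval",
]

# First k = 2 columns of U (rows T1..T5), values as shown on slide 40.
U_K = [
    [0.00, 0.75],
    [0.65, 0.00],
    [0.65, 0.00],
    [0.39, 0.00],
    [0.00, 0.66],
]

# P from slide 41: rows T1..T5, columns in the order of CANDIDATES.
P = [
    [0.00, 0.56, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.83, 0.00, 0.00, 0.00, 0.00, 1.00],
]


def concept_label_scores(u_k, p):
    """Compute M = U_k^T P: one row per abstract concept, one column per candidate."""
    n_terms, n_concepts, n_candidates = len(p), len(u_k[0]), len(p[0])
    return [
        [sum(u_k[t][c] * p[t][j] for t in range(n_terms)) for j in range(n_candidates)]
        for c in range(n_concepts)
    ]


M = concept_label_scores(U_K, P)
# Label each concept with its best-matching phrase or word.
labels = [CANDIDATES[max(range(len(row)), key=row.__getitem__)] for row in M]
print(labels)  # ['Singular value', 'Information retrieval']
```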

43 Last step

44 Prune overlapping label descriptions: $Z^T Z$
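
One way to read this step: if the columns of Z are the vectors of the candidate labels, then each entry of Z^T Z is the dot product (similarity) of two labels, and labels that are too similar to a better-scoring one can be dropped. The sketch below assumes that reading. The threshold of 0.3, the example scores, and the function names are hypothetical, chosen only to make the example concrete.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prune_labels(labels, vectors, scores, threshold=0.3):
    """Keep labels in descending score order, skipping any that overlap a kept label."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # dot(vectors[i], vectors[j]) is the (i, j) entry of Z^T Z.
        if all(dot(vectors[i], vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return [labels[i] for i in kept]


# Hypothetical candidates in term space (T1..T5) with illustrative scores.
labels = ["Singular value", "Information retrieval", "Singular"]
vectors = [
    [0.00, 0.71, 0.71, 0.00, 0.00],
    [0.56, 0.00, 0.00, 0.00, 0.83],
    [0.00, 1.00, 0.00, 0.00, 0.00],
]
scores = [0.92, 0.97, 0.65]

print(prune_labels(labels, vectors, scores))
```

Here "Singular" overlaps with the higher-scoring "Singular value" (similarity 0.71) and is pruned, while the two phrases share no terms and are both kept.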

45 STAGE 4/4: CLUSTER-CONTENT ALLOCATION

46 Similarity
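
The slide gives no formula, so the following is only a sketch under stated assumptions: each document is compared with each cluster label by a dot product in term space, and it joins every cluster whose similarity reaches a threshold. The 0.5 threshold, the "Other" group for unassigned documents, and the function name are assumptions for illustration; the matrix and label vectors are the ones from the worked example.

```python
DOC_IDS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]

# Term-document matrix from the worked example (rows T1..T5, columns D1..D7).
A = [
    [0.00, 0.00, 0.56, 0.56, 0.00, 0.00, 1.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.72, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.83, 0.83, 0.00, 0.00, 0.00],
]

# Cluster labels expressed in the same term space.
LABEL_VECTORS = {
    "Singular value": [0.00, 0.71, 0.71, 0.00, 0.00],
    "Information retrieval": [0.56, 0.00, 0.00, 0.00, 0.83],
}


def allocate(a, label_vectors, doc_ids, threshold=0.5):
    """Assign each document to every label whose similarity reaches the threshold."""
    clusters = {name: [] for name in label_vectors}
    clusters["Other"] = []  # assumed catch-all for documents matching no label
    for j, doc in enumerate(doc_ids):
        assigned = False
        for name, vec in label_vectors.items():
            similarity = sum(vec[t] * a[t][j] for t in range(len(vec)))
            if similarity >= threshold:
                clusters[name].append(doc)
                assigned = True
        if not assigned:
            clusters["Other"].append(doc)
    return clusters


print(allocate(A, LABEL_VECTORS, DOC_IDS))
```

Because a document may pass the threshold for more than one label, this scheme allows overlapping clusters.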

47 Cluster Score

48 Evaluation and Results

49 Test Data: 10 categories, 4 subjects

50 Subject           | # docs | Contents
   Movies            | 77     | Information about the BladeRunner movie
   Movies            | 92     | Information about the Lord of the Rings movie
   Health Care       | 77     | Orthopedic equipment and manufactures
   Photography       | 15     | Infrared-photography references
   Computer Science  | 27     | Articles about data warehouses (integrator DBs)
   Computer Science  | 42     | MySQL database
   Computer Science  | 15     | Native XML databases
   Computer Science  | 38     | PostgreSQL database
   Computer Science  | 39     | Java programming language tutorials and guides
   Computer Science  | 37     | VI text editor

51 Identifier | Merged Categories
   G1         | LRings, MySQL
   G3         | LRings, MySQL, Ortho, Infra
   G5         | MySQL, XMLDB, Dware, Postgr, JavaTut, Vi
   G6         | MySQL, XMLDB, Dware, Postgr, Ortho

52 Identifier | Merged Categories
   G1         | Fan fiction/fan art, image galleries, MySQL, wallpapers, LOTR humour, links
   G3         | MySQL, news, information on infrared, image galleries, foot orthotics, Lord of the Rings, movie
   G5         | Java tutorial, Vim page, federated data warehouse, native XML database, Web, Postgresql database
   G6         | MySQL database, federated data warehouse, foot orthotics, orthopedic products, access Postgresql, Web

53 Cluster Contamination Analytical evaluation:

54 LINGO vs. Suffix Tree Clustering

55

56

57

58 Future work

59 Pointer

60 Communication!

61 LINGO: Search Results Clustering. Thank you.

