

1
Data Mining for Hypertext: A Tutorial Survey. Based on a paper by Soumen Chakrabarti, Indian Institute of Technology Bombay. Lecture by Noga Kashti and Efrat Daum. 11/11/01, sdbi – winter 2001.

2
Let's start with definitions. Hypertext: a collection of documents (or "nodes") containing cross-references or "links" which, with the aid of an interactive browser program, allow the reader to move easily from one document to another. Data mining: analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.

3
Two ways of getting information from the Web: clicking on hyperlinks, and searching via keyword queries.

4
Some history: before the popular Web, hypertext was already studied by the ACM, SIGIR, SIGLINK/SIGWEB, and digital-libraries communities. Classical IR (information retrieval) deals with documents, whereas the Web deals with semi-structured data.

5
Some numbers: the Web exceeds 800 million HTML pages on about three million servers. Almost a million pages are added daily. A typical page changes within a few months. Several hundred gigabytes change every month.

6
Difficulties with accessing information on the Web: the usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe; semi-structured data; sheer size and flux; no consistent standard or style.

7
The old search process is often unsatisfactory: deficiency of scale; poor accuracy (low recall and low precision).

8
Better solutions: data mining, machine learning, and NL techniques; statistical techniques for learning structure in various forms from text, hypertext, and semi-structured data.

9
Issues we'll discuss: models; supervised learning; unsupervised learning; semi-supervised learning; social network analysis.

10
Models for text: representations of text using statistical analyses only (bag-of-words): the vector space model, the binary model, and the multinomial model.

11
Models for text (cont.). The vector space model: documents -> tokens -> canonical forms. Each canonical token is an axis in a Euclidean space. The t-th coordinate of document d is n(d,t), the number of times term t occurs in d.

12
The vector space model: normalize each document vector to length 1, so that long and short documents become comparable.
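A minimal sketch of the two steps above, building a raw count vector n(d,t) over a small lexicon and normalizing it to unit (L2) length; the toy lexicon and document are illustrative assumptions, not from the slides:

```python
import math
from collections import Counter

def tf_vector(doc, lexicon):
    """Raw term counts n(d, t) for each term t in the lexicon."""
    counts = Counter(doc.split())
    return [counts[t] for t in lexicon]

def normalize(vec):
    """Scale the vector to unit (L2) length, as on the slide."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else vec

lexicon = ["data", "mining", "web"]
v = normalize(tf_vector("data mining of web data", lexicon))
# v now has Euclidean length 1; 'data' gets the largest coordinate.
```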

13
More models for text. The binary model: a document is a set of terms, a subset of the lexicon; word counts are not significant. The multinomial model: a die with |T| faces, where face t has probability θ_t of showing up when tossed. Having decided the total word count, the author tosses the die repeatedly and writes down the term that shows up each time.

14
Models for hypertext. Hypertext: text with hyperlinks, modeled at varying levels of detail. Example: a directed graph (D, L), where D is the set of nodes/documents/pages and L is the set of links.

15
Models for semi-structured data: a point of convergence for the Web (documents) and database (data) communities.

16
Models for semi-structured data (cont.): topic directories with tree-structured hierarchies. Examples: the Open Directory Project, Yahoo!. Another representation: XML.

17
Supervised learning (classification). Initialization: training data, where each item is marked with a label or class from a discrete finite set. Input: unlabeled data. The algorithm's role: guess the labels of the unlabeled data.

18
Supervised learning (cont.). Example: topic directories. Advantages: they help structure the Web, restrict keyword search, and can enable powerful searches.

19
Probabilistic models for text learning. Let c_1, …, c_m be m classes or topics, each with some training documents D_c. Prior probability of a class: Pr(c) = |D_c| / Σ_i |D_{c_i}|. T: the universe of terms in all the training documents.

20
Probabilistic models for text learning (cont.). Naive Bayes classification. Assumption: for each class c there is a binary text-generator model. Model parameters: Φ_{c,t}, the probability that a document in class c mentions term t at least once.

21
Naive Bayes classification (cont.). Problems: short documents are discouraged, and the estimate of Pr(d|c) is likely to be greatly distorted.

22
Naive Bayes classification (cont.). With the multinomial model: Pr(d|c) ∝ Π_t θ_{c,t}^{n(d,t)}, where n(d,t) is the number of occurrences of term t in d.

23
Naive Bayes classification (cont.). Problems: again, short documents are discouraged; inter-term correlation is ignored; the multiplicative Φ_{c,t} "surprise" factor. Conclusion: both models are effective.
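A compact sketch of multinomial naive Bayes as described above: estimate class priors and smoothed term probabilities θ_{c,t} from labeled documents, then classify by the maximum of log Pr(c) + Σ_t n(d,t) log θ_{c,t}. The toy corpus and Laplace smoothing are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (class, text). Returns priors and smoothed term probs."""
    class_docs = defaultdict(int)
    term_counts = defaultdict(Counter)
    vocab = set()
    for c, text in docs:
        class_docs[c] += 1
        for t in text.split():
            term_counts[c][t] += 1
            vocab.add(t)
    n = sum(class_docs.values())
    priors = {c: k / n for c, k in class_docs.items()}
    theta = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Laplace smoothing avoids zero probabilities for unseen terms.
        theta[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab))
                    for t in vocab}
    return priors, theta, vocab

def classify(text, priors, theta, vocab):
    """Pick argmax_c of log Pr(c) + sum over terms of log theta_{c,t}."""
    best, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for t in text.split():
            if t in vocab:
                score += math.log(theta[c][t])
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("sports", "ball game team"), ("sports", "team win game"),
        ("tech", "web data mining"), ("tech", "web server data")]
priors, theta, vocab = train_nb(docs)
label = classify("data mining team", priors, theta, vocab)
```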

24
More probabilistic models for text learning: parameter smoothing and feature selection; limited-dependence modeling; the maximum entropy technique; support vector machines (SVMs); hierarchies over class labels.

25
Learning relations. An extension of classification: a combination of statistical and relational learning. Improves accuracy. The ability to invent predicates. Can represent the hyperlink graph structure and the word statistics of neighboring documents. Learned rules will not depend on specific keywords.

26
Unsupervised learning. Given: hypertext documents. Goal: discover a hierarchy among the documents. What is a good clustering?

27
Basic clustering techniques. Techniques for clustering: k-means; hierarchical agglomerative clustering.

28
Basic clustering techniques. Documents are represented in an unweighted vector space or a TFIDF vector space. The similarity between two documents is cos(θ), where θ is the angle between their corresponding vectors; equivalently, the distance between the length-normalized vectors.
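The cosine similarity above can be sketched directly; the sample vectors are illustrative assumptions:

```python
import math

def cosine(u, v):
    """cos of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Vectors pointing the same way give similarity 1; orthogonal ones give 0.
s_same = cosine([1, 2, 0], [2, 4, 0])
s_orth = cosine([1, 0, 0], [0, 3, 0])
```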

29
k-means clustering. The k-means algorithm. Input: d_1, …, d_n, a set of n documents, and k, the number of clusters desired (k ≤ n). Output: C_1, …, C_k, a partition of the n documents into k clusters.

30
k-means clustering. The k-means algorithm (cont.). Initialize: guess k initial means m_1, …, m_k. Until there are no changes in any mean: for each document d, assign d to C_i if ||d − m_i|| is the minimum of all k distances; then, for 1 ≤ i ≤ k, replace m_i with the mean of all documents assigned to C_i.
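The loop above (Lloyd's iteration) can be sketched as follows; the random initialization from the data points and the toy 2-D points are illustrative assumptions:

```python
import math
import random

def kmeans(docs, k, iters=100, seed=0):
    """Assign each point to its nearest mean, then recompute the means,
    until the means stop changing (or iters runs out)."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(docs, k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:
            i = min(range(k), key=lambda j: math.dist(d, means[j]))
            clusters[i].append(d)
        new = [[sum(xs) / len(c) for xs in zip(*c)] if c else means[i]
               for i, c in enumerate(clusters)]
        if new == means:
            break
        means = new
    return means, clusters

# Two well-separated pairs of points should end up in two clusters of size 2.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
means, clusters = kmeans(pts, 2)
```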

31
k-means clustering. The k-means algorithm, example (figure): successive re-assignments and updates of the means m_1, m_2 (K=2) and m_1, m_2, m_3 (K=3).

32
k-means clustering (cont.). Problem: high dimensionality. E.g., if each dimension has only two possible values, the size of the vector space is 2 raised to the number of dimensions. Solution: project out some dimensions.

33
Agglomerative clustering. Documents are merged into super-documents or groups until only one group is left. Some definitions: s(d_1, d_2) = the similarity between documents d_1 and d_2; the self-similarity of group A is the average of s(d_1, d_2) over all pairs of distinct documents in A.

34
Agglomerative clustering. The agglomerative clustering algorithm. Input: d_1, …, d_n, a set of n documents. Output: G, the final group, with a nested hierarchy.

35
Agglomerative clustering (cont.). The agglomerative clustering algorithm. Initialize: G := {G_1, …, G_n}, where G_i = {d_i}. While |G| > 1: find A and B in G such that s(A ∪ B) is maximized; G := (G − {A, B}) ∪ {A ∪ B}. Time: O(n²).
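A sketch of the merge loop above. Group similarity is taken as average pairwise document similarity (one common choice; the slides leave s(A ∪ B) abstract), and the numeric "documents" with a closeness-based similarity are illustrative assumptions:

```python
def agglomerative(docs, sim):
    """Repeatedly merge the pair of groups with maximal similarity,
    recording each merged group, until one group remains."""
    groups = [(d,) for d in docs]
    merges = []

    def s(a, b):
        # Average pairwise similarity between members of the merged group.
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(groups) > 1:
        i, j = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: s(groups[ij[0]], groups[ij[1]]))
        merged = groups[i] + groups[j]
        merges.append(merged)
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0], merges

# Toy similarity: closer numbers are more similar.
sims = lambda x, y: 1.0 / (1.0 + abs(x - y))
root, merges = agglomerative([1, 2, 10, 11], sims)
# 1,2 merge first, then 10,11, then everything.
```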

36
Agglomerative clustering (cont.). The agglomerative clustering algorithm, example (figure): documents a–g are merged step by step. Step 1: merge b and c (s(b,c)=0.7). Step 2: merge f and g (s(f,g)=0.6). Step 3: merge b-c with d (s(b-c,d)=0.5). Step 4: merge e with f-g (s(e,f-g)=0.4). Step 5: merge a with b-c-d (s(a,b-c-d)=0.3). Step 6: merge a-b-c-d with e-f-g (s(a-b-c-d,e-f-g)=0.1).

37
Techniques from linear algebra. Documents and terms are represented by vectors in Euclidean space. Applications of linear algebra to text analysis: latent semantic indexing (LSI); random projections.

38
Co-occurring terms. Example: "auto" and "car", "transmission" and "gearbox" co-occur in documents such as "Auto Transmission interchange W/404 to 504??" and "Linear potentiometer for a racing car gearbox".

39
Latent semantic indexing (LSI). Vector space model of documents: let m = |T|, the lexicon size, and let n = the number of documents. Define the m×n term-by-document matrix A, where a_ij = the number of occurrences of term i in document j.

40
Latent semantic indexing (LSI). How do we reduce the matrix? (Figure: in the term-by-document matrix, the rows for "car" and "auto" show similar co-occurrence patterns.)

41
Singular value decomposition (SVD). Let A ∈ R^{m×n}, m ≥ n, be a matrix. The singular value decomposition of A is the factorization A = U D V^T, where U and V have orthonormal columns, U^T U = V^T V = I_n, and D = diag(σ_1, …, σ_n) with σ_i ≥ 0 for 1 ≤ i ≤ n. Then U = [u_1, …, u_n], where u_1, …, u_n are the left singular vectors; V = [v_1, …, v_n], where v_1, …, v_n are the right singular vectors; and σ_1, …, σ_n are the singular values of A.

42
Singular value decomposition (SVD). A A^T = (U D V^T)(V D^T U^T) = U D I D U^T = U D² U^T, so A A^T U = U D² = [σ_1² u_1, …, σ_n² u_n]; that is, A A^T u_i = σ_i² u_i for 1 ≤ i ≤ n, so the columns of U are the eigenvectors of A A^T. Similarly, A^T A = V D² V^T, so the columns of V are the eigenvectors of A^T A. The eigenvalues of A A^T (or A^T A) are σ_1², …, σ_n².

43
Singular value decomposition (SVD). Let A_k = Σ_{i=1}^{k} σ_i u_i v_i^T be the k-truncated SVD. Then rank(A_k) = k, and ||A − A_k||_2 ≤ ||A − M_k||_2 for any matrix M_k of rank k.

44
Singular value decomposition (SVD). Note: A, A_k ∈ R^{m×n}, but A_k can be stored via its k truncated factors rather than all mn entries, a substantial reduction.

45
LSI with SVD. Define a query vector q ∈ R^m, with q_i ≠ 0 if term i is part of the query. Then A^T q ∈ R^n is the answer vector: (A^T q)_j ≠ 0 if document j contains one or more of the query terms. How can we do better?

46
LSI with SVD. Use A_k instead of A: calculate A_k^T q. Now a query on "car" can also return a document containing the word "auto".
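A minimal numeric sketch of the LSI retrieval step above, using NumPy's SVD. The tiny term-by-document matrix (terms car/auto/gearbox, two documents linked only through "gearbox") is an illustrative assumption:

```python
import numpy as np

# Term-by-document matrix (rows: car, auto, gearbox; columns: d1, d2).
# d1 mentions 'car' twice and 'gearbox' once; d2 mentions 'auto' and 'gearbox'.
A = np.array([[2.0, 0.0],    # car
              [0.0, 2.0],    # auto
              [1.0, 1.0]])   # gearbox

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation of A

q = np.array([1.0, 0.0, 0.0])       # query: 'car'
plain = A.T @ q                     # exact retrieval: d2 scores 0
lsi = A_k.T @ q                     # d2 now scores > 0 via shared 'gearbox'
```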

47
Random projections. Theorem (sketch; the slide's exact bounds were lost): let v be a unit vector in R^n, H a randomly oriented k-dimensional subspace through the origin, and X the random variable giving the square of the length of the projection of v on H. Then E[X] = k/n, and if k is chosen large enough (on the order of log n), X is sharply concentrated around k/n with high probability.

48
Random projections: a projection of a set of points onto a randomly oriented subspace incurs only small distortion in inter-point distances. The technique: reduce the dimensionality of the points to speed up distance computations.
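The technique can be sketched with a random Gaussian projection matrix; using entries drawn from N(0, 1/k) (so squared lengths are preserved in expectation) is a standard choice and an assumption here, as are the toy points:

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project points from R^n down to R^k along k random Gaussian
    directions; entries N(0, 1/k) preserve squared norms in expectation."""
    rng = random.Random(seed)
    n = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(n)] for _ in range(k)]
    return [[sum(r[i] * p[i] for i in range(n)) for r in R] for p in points]

# Two points in R^100 at distance 1; their projected distance stays near 1.
pts = [[1.0] + [0.0] * 99, [0.0] * 100]
proj = random_projection(pts, 20)
```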

49
Semi-supervised learning. Real-life applications have a few labeled documents and many unlabeled documents. This setting lies between supervised and unsupervised learning.

50
Learning from labeled and unlabeled documents. Expectation Maximization (EM) algorithm. Initialize: train a naive Bayes classifier using only the labeled data. Repeat EM iterations until near convergence. E step: assign class probabilities Pr(c|d) to all unlabeled documents using the current θ_{c,t} estimates. M step: re-estimate θ_{c,t} from all documents, weighting the unlabeled ones by these class probabilities. The classification error is reduced by a third in the best cases.
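A compact sketch of the EM loop above for semi-supervised naive Bayes. The tiny labeled/unlabeled corpus, the fixed iteration count, and the Laplace smoothing are illustrative assumptions:

```python
import math
from collections import Counter

def train(labeled, unlabeled, classes, iters=5):
    """EM for semi-supervised naive Bayes (a minimal sketch).
    labeled: list of (class, text); unlabeled: list of text."""
    vocab = ({t for _, d in labeled for t in d.split()} |
             {t for d in unlabeled for t in d.split()})
    # Soft class weights Pr(c|d) per unlabeled document, initially uniform.
    weights = [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    for _ in range(iters):
        # M step: re-estimate priors and theta from hard + soft counts.
        prior = {c: sum(1 for cl, _ in labeled if cl == c) +
                    sum(w[c] for w in weights) for c in classes}
        total = sum(prior.values())
        theta = {}
        for c in classes:
            counts = Counter()
            for cl, d in labeled:
                if cl == c:
                    counts.update(d.split())
            for w, d in zip(weights, unlabeled):
                for t in d.split():
                    counts[t] += w[c]
            z = sum(counts.values()) + len(vocab)
            theta[c] = {t: (counts[t] + 1) / z for t in vocab}
        # E step: recompute Pr(c|d) for the unlabeled documents.
        for w, d in zip(weights, unlabeled):
            logp = {c: math.log(prior[c] / total) +
                       sum(math.log(theta[c][t]) for t in d.split())
                    for c in classes}
            m = max(logp.values())
            norm = sum(math.exp(v - m) for v in logp.values())
            for c in classes:
                w[c] = math.exp(logp[c] - m) / norm
    return weights

labeled = [("a", "x x y"), ("b", "z z w")]
unlabeled = ["x y y", "z w w"]
w = train(labeled, unlabeled, ["a", "b"])
# The unlabeled docs are pulled toward the classes sharing their vocabulary.
```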

51
Relaxation labeling. The hypertext model: documents are nodes in a hypertext graph. There are additional sources of information induced by the links. (Figure: a node whose class is unknown, surrounded by linked neighbors whose classes are also unknown.)

52
Relaxation labeling. Notation: c = class, t = term, N = neighbors. In supervised learning: Pr(t|c). In hypertext, using the neighbors' terms: Pr(t(d), t(N(d)) | c). A better model, using the neighbors' classes: Pr(t(d), c(N(d)) | c). But the neighbors' classes are themselves unknown: circularity.

53
Relaxation labeling. Resolving the circularity. Initialize: assign Pr^(0)(c|d) to each document d ∈ N(d_1), where d_1 is a test document, using text only. Then iterate: compute Pr^(r+1)(c|d_1) from the neighbors' current class-probability estimates Pr^(r), until the estimates stabilize.

54
Social network analysis. Social networks arise between academics (by co-authoring and advising), between movie personnel (by directing and acting), between people (by making phone calls), and between web pages (by hyperlinking to other web pages). Applications: Google, HITS.

55
Google's PageRank: p(v) = ε/N + (1 − ε) Σ_{(u,v)∈E} p(u)/OutDegree(u), where (u,v) ∈ E means "u links to v" and N is the total number of nodes in the Web graph. This simulates a random walk on the Web graph and uses the visit rate as a popularity score; the popularity score is precomputed, independent of the query.
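The random-walk score above can be computed by power iteration; the teleport probability of 0.15 and the toy three-page graph are illustrative assumptions:

```python
def pagerank(links, n, eps=0.15, iters=50):
    """Power iteration for p(v) = eps/N + (1-eps) * sum over links (u,v)
    of p(u)/out_degree(u). Assumes every node has at least one out-link."""
    p = [1.0 / n] * n
    out = [len(links.get(u, [])) for u in range(n)]
    for _ in range(iters):
        new = [eps / n] * n
        for u in range(n):
            for v in links.get(u, []):
                new[v] += (1 - eps) * p[u] / out[u]
        p = new
    return p

# Pages 0 and 1 both link to page 2; page 2 links back to page 0.
scores = pagerank({0: [2], 1: [2], 2: [0]}, 3)
# Page 2, with two in-links, ends up with the highest popularity score.
```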

56
Hyperlink induced topic search (HITS). Depends on a search engine to supply an initial set of pages. For each node u in the graph, calculate an authority score a_u and a hub score h_u. Initialize h_u = a_u = 1. Repeat until convergence: a_u = Σ over v linking to u of h_v, and h_u = Σ over v that u links to of a_v; the score vectors are normalized to 1 after each round.
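A sketch of the iteration above; L2 normalization is one concrete choice for "normalized to 1" (the slide does not specify which norm), and the toy graph is an illustrative assumption:

```python
import math

def hits(links, n, iters=50):
    """Iterate a_u = sum of h_v over v linking to u, and
    h_u = sum of a_v over v that u links to; L2-normalize each round."""
    hubs = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        auth = [sum(hubs[u] for u in range(n) if v in links.get(u, []))
                for v in range(n)]
        hubs = [sum(auth[v] for v in links.get(u, [])) for u in range(n)]
        na = math.sqrt(sum(a * a for a in auth)) or 1.0
        nh = math.sqrt(sum(h * h for h in hubs)) or 1.0
        auth = [a / na for a in auth]
        hubs = [h / nh for h in hubs]
    return auth, hubs

# Pages 0 and 1 both point at page 2: 2 is the authority, 0 and 1 are hubs.
auth, hubs = hits({0: [2], 1: [2]}, 3)
```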

57
An interesting page includes links to other interesting pages. The goal: many relevant pages, few irrelevant pages, and fast retrieval.

58
Conclusion. Supervised learning: probabilistic models. Unsupervised learning: techniques for clustering, k-means (top-down) and agglomerative (bottom-up); techniques for reduction, LSI with SVD and random projections. Semi-supervised learning: the EM algorithm and relaxation labeling.

59
Reference: Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (Cutting, Karger, Pedersen, Tukey).
