Data Mining for Hypertext: A Tutorial Survey. Based on a paper by Soumen Chakrabarti, Indian Institute of Technology Bombay.

1 Data Mining for Hypertext: A Tutorial Survey. Based on a paper by Soumen Chakrabarti, Indian Institute of Technology Bombay. Lecture by Noga Kashti and Efrat Daum. 11/11/01, sdbi, winter 2001.

2 Let's start with definitions. Hypertext: a collection of documents (or "nodes") containing cross-references or "links" which, with the aid of an interactive browser program, allow the reader to move easily from one document to another. Data mining: analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.

3 Two ways of getting information from the Web: clicking on hyperlinks; searching via keyword queries.

4 Some history. Before the Web became popular, hypertext was used and studied by the ACM SIGIR, SIGLINK/SIGWEB, and Digital Libraries communities. Classical information retrieval (IR) deals with documents, whereas the Web deals with semi-structured data.

5 Some numbers. The Web exceeds 800 million HTML pages on about three million servers. Almost a million pages are added daily. A typical page changes in a few months. Several hundred gigabytes change every month.

6 Difficulties with accessing information on the Web: the usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe; the data is semi-structured; sheer size and flux; no consistent standard or style.

7 The old search process is often unsatisfactory: deficiency of scale; poor accuracy (low recall and low precision).

8 Better solutions: data mining and machine learning. Natural language (NL) techniques. Statistical techniques for learning structure in various forms from text, hypertext, and semi-structured data.

9 Issues we'll discuss: models; supervised learning; unsupervised learning; semi-supervised learning; social network analysis.

10 Models for text. Representations of text for purely statistical analysis (bag-of-words): the vector space model; the binary model; the multinomial model.

11 Models for text (cont.). The vector space model: documents are split into tokens, and tokens are reduced to canonical forms. Each canonical token is an axis in a Euclidean space. The t-th coordinate of document d is n(d,t), the number of times term t occurs in d.

12 The vector space model: normalize the document length to 1.
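
The two vector-space slides can be illustrated with a minimal Python sketch. It assumes a naive lowercase whitespace tokenizer in place of real canonicalization (no stemming or stop-word removal), and it assumes the Euclidean (L2) norm for "length"; the function names `to_vector` and `normalize` are made up for this illustration.

```python
import math
from collections import Counter

def to_vector(text):
    """Map a document to its term counts n(d, t) with a naive whitespace tokenizer."""
    return Counter(text.lower().split())

def normalize(vec):
    """Scale a term-count vector to unit Euclidean length."""
    length = math.sqrt(sum(c * c for c in vec.values()))
    if length == 0:
        return dict(vec)  # an empty document stays empty
    return {t: c / length for t, c in vec.items()}

d = to_vector("the car and the gearbox")
unit = normalize(d)
```

After normalization, long and short documents on the same topic point in the same direction and have the same length, so they can be compared directly.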

13 More models for text. The binary model: a document is a set of terms, which is a subset of the lexicon; word counts are not significant. The multinomial model: a die with |T| faces, where face t has probability $\theta_t$ of showing up when tossed. Having decided the total word count, the author tosses the die that many times, each time writing down the term that shows up.
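
The die-tossing picture of the multinomial model can be simulated in a few lines. This is only a sketch: the three-term lexicon and its probabilities in `theta` are invented for illustration, and `generate_document` is a hypothetical helper name.

```python
import random

def generate_document(theta, length, rng):
    """Toss a |T|-faced die `length` times; face t shows up with probability theta[t]."""
    terms = list(theta)
    weights = [theta[t] for t in terms]
    return rng.choices(terms, weights=weights, k=length)

theta = {"car": 0.5, "auto": 0.3, "gearbox": 0.2}  # hypothetical term probabilities
rng = random.Random(0)
doc = generate_document(theta, 10, rng)   # multinomial view: a sequence of 10 tosses
binary_view = set(doc)                    # binary view: only term presence is kept
```

The last line shows the difference between the two models: the binary model discards the counts that the multinomial model generates.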

14 Models for hypertext. Hypertext: text with hyperlinks, modeled at varying levels of detail. Example: a directed graph (D, L), where D is the set of nodes (documents, pages) and L is the set of links.

15 Models for semi-structured data: a point of convergence for the Web (documents) and database (data) communities.

16 Models for semi-structured data (cont.). Example: topic directories with tree-structured hierarchies, such as the Open Directory Project and Yahoo!. Another representation: XML.

17 Supervised learning (classification). Initialization: training data, in which each item is marked with a label or class from a discrete finite set. Input: unlabeled data. The algorithm's role: guess the labels of the data.

18 Supervised learning (cont.). Example: topic directories. Advantages: they help structure the collection, restrict keyword search, and can enable powerful searches.

19 Probabilistic models for text learning. Let $c_1, \ldots, c_m$ be m classes or topics, each with a set of training documents $D_c$. Prior probability of a class: $\Pr(c) = |D_c| / \sum_{c'} |D_{c'}|$. T: the universe of terms in all the training documents.

20 Probabilistic models for text learning (cont.). Naive Bayes classification. Assumption: for each class c, there is a binary text generator model. Model parameters: $\Phi_{c,t}$, the probability that a document in class c will mention term t at least once.

21 Naive Bayes classification (cont.). Problems: short documents are discouraged; the estimate of Pr(d|c) is likely to be greatly distorted.

22 Naive Bayes classification (cont.). With the multinomial model: [equation not preserved]

23 Naive Bayes classification (cont.). Problems: again, short documents are discouraged; inter-term correlation is ignored; each term occurrence contributes a multiplicative $\Phi_{c,t}$ 'surprise' factor. Conclusion: both models are effective.
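
A naive Bayes classifier under the multinomial model might look like the following sketch. Several things here are assumptions rather than content from the slides: the per-class term probabilities are called `theta`, they are estimated with add-one (Laplace) smoothing, the multinomial coefficient is dropped because it is the same for every class, and the tiny training set is invented.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (terms, class). Returns class priors and smoothed theta[c][t]."""
    class_docs = Counter(c for _, c in labeled_docs)
    term_counts = {c: Counter() for c in class_docs}
    vocab = set()
    for terms, c in labeled_docs:
        term_counts[c].update(terms)
        vocab.update(terms)
    total = sum(class_docs.values())
    prior = {c: n / total for c, n in class_docs.items()}
    theta = {}
    for c in class_docs:
        denom = sum(term_counts[c].values()) + len(vocab)  # add-one smoothing
        theta[c] = {t: (term_counts[c][t] + 1) / denom for t in vocab}
    return prior, theta

def classify(terms, prior, theta):
    """Pick the class maximizing log Pr(c) + sum over term occurrences of log theta[c][t]."""
    def score(c):
        # Terms never seen in training are skipped.
        return math.log(prior[c]) + sum(
            math.log(theta[c][t]) for t in terms if t in theta[c])
    return max(prior, key=score)

training = [
    ("car gearbox engine".split(), "autos"),
    ("auto transmission car".split(), "autos"),
    ("link page browser".split(), "web"),
    ("hyperlink page web".split(), "web"),
]
prior, theta = train(training)
label = classify("car transmission".split(), prior, theta)
```

Working in log space avoids underflow when many small probabilities are multiplied, which is the practical face of the multiplicative factor mentioned on the slide.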

24 More probabilistic models for text learning: parameter smoothing and feature selection; limited dependence modeling; the maximum entropy technique; support vector machines (SVMs); hierarchies over class labels.

25 Learning relations. An extension of classification: a combination of statistical and relational learning. Improves accuracy. Can invent predicates. Can represent the hyperlink graph structure and the word statistics of neighboring documents. Learned rules do not depend on specific keywords.

26 Unsupervised learning. Given hypertext documents, discover a hierarchy among the documents. What is a good clustering?

27 Basic clustering techniques: k-means; hierarchical agglomerative clustering.

28 Basic clustering techniques. Documents are represented in an unweighted vector space or a TFIDF vector space. Similarity between two documents: $\cos(\theta)$, where $\theta$ is the angle between their corresponding vectors, or the distance between the (length-normalized) vectors.
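
A small sketch of TFIDF weighting and cosine similarity follows. TFIDF has many variants; this one assumes the weight n(d,t) · log(N / df(t)), where df(t) is the number of documents containing t. The three example documents and the helper names are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Weight of t in d = n(d,t) * log(N / df(t))."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: n * math.log(n_docs / df[t]) for t, n in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["car gearbox repair".split(),
        "car gearbox oil".split(),
        "web page link".split()]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])   # two documents sharing terms
sim_far = cosine(vecs[0], vecs[2])     # no shared terms
```

Documents with no terms in common get similarity 0, which is exactly the weakness that the linear-algebra techniques later in the lecture try to address.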

29 k-means clustering. The k-means algorithm. Input: $d_1, \ldots, d_n$, a set of n documents; k, the number of clusters desired ($k \le n$). Output: $C_1, \ldots, C_k$, k clusters containing the n classified documents.

30 k-means clustering: the k-means algorithm (cont.). Initialization: guess k initial means $m_1, \ldots, m_k$. Repeat until no mean changes: (1) for each document d, put d in cluster $C_i$ if $\|d - m_i\|$ is the minimum of all k distances; (2) for $1 \le i \le k$, replace $m_i$ with the mean of all the documents in $C_i$.
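
The loop on this slide can be sketched in plain Python as follows. The sketch assumes dense vectors (tuples of floats) rather than sparse document vectors, takes the initial means as an argument, and stops when the assignments stop changing, which implies that the means have stopped changing too. The six sample points are invented.

```python
def kmeans(points, means, max_iter=100):
    """Plain k-means on dense vectors; `means` holds the k initial guesses."""
    means = [list(m) for m in means]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest mean.
        new_assignment = [
            min(range(len(means)),
                key=lambda i: sum((x - y) ** 2 for x, y in zip(p, means[i])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stable assignments, so the means no longer change
        assignment = new_assignment
        # Update step: replace each mean with the centroid of its cluster.
        for i in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old mean if a cluster is empty
                means[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
assignment, means = kmeans(points, means=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the initial guesses; a poor initialization can leave a cluster empty or converge to a worse partition.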

31 k-means clustering: the k-means algorithm, example. [figure: successive positions of the means $m_1, m_2$ for K=2 and $m_1, m_2, m_3$ for K=3]

32 k-means clustering (cont.). Problem: high dimensionality. For example, if each of 30000 dimensions has only two possible values, the vector space contains $2^{30000}$ points. Solution: project out some dimensions.

33 Agglomerative clustering. Documents are merged into super-documents or groups until only one group is left. Some definitions: $s(d_1, d_2)$ = the similarity between documents $d_1$ and $d_2$; the self-similarity of group A: $s(A) = \frac{1}{|A|(|A|-1)} \sum_{d_1, d_2 \in A,\ d_1 \neq d_2} s(d_1, d_2)$.

34 Agglomerative clustering. The agglomerative clustering algorithm. Input: $d_1, \ldots, d_n$, a set of n documents. Output: G, the final group with a nested hierarchy.

35 Agglomerative clustering (cont.). The agglomerative clustering algorithm. Initialization: $G := \{G_1, \ldots, G_n\}$, where $G_i = \{d_i\}$. While $|G| > 1$: find A and B in G such that $s(A \cup B)$ is maximized; set $G := (G - \{A, B\}) \cup \{A \cup B\}$. Time: $O(n^2)$.
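
A direct, unoptimized sketch of this merge loop is given below. It assumes that the self-similarity of a group is the average pairwise similarity of its members, and it recomputes every candidate union from scratch, so it is much slower than the $O(n^2)$ bound quoted on the slide. The one-dimensional "documents" and the similarity function are invented.

```python
from itertools import combinations

def self_similarity(group, sim):
    """Average pairwise similarity of the documents in `group` (needs >= 2 members)."""
    pairs = list(combinations(sorted(group), 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def agglomerate(docs, sim):
    """Repeatedly merge the two groups whose union is most self-similar.

    Returns the merges in order as (group_a, group_b, s(A u B)) tuples."""
    groups = [frozenset([d]) for d in docs]
    merges = []
    while len(groups) > 1:
        a, b = max(combinations(groups, 2),
                   key=lambda ab: self_similarity(ab[0] | ab[1], sim))
        merges.append((a, b, self_similarity(a | b, sim)))
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    return merges

# Hypothetical one-dimensional "documents"; similarity decays with distance.
positions = {"a": 0.0, "b": 1.0, "c": 1.2, "d": 5.0}
sim = lambda x, y: 1.0 / (1.0 + abs(positions[x] - positions[y]))
merges = agglomerate(list(positions), sim)
```

The list of merges, read in order, is the nested hierarchy (dendrogram): each entry records which two groups were joined and at what similarity.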

36 Agglomerative clustering (cont.). The agglomerative clustering algorithm, example on seven documents a to g. [figure: dendrogram built step by step, similarity axis from 0 to 0.8] Step 1: s(b,c)=0.7. Step 2: s(f,g)=0.6. Step 3: s(b-c,d)=0.5. Step 4: s(e,f-g)=0.4. Step 5: s(a,b-c-d)=0.3. Step 6: s(a-b-c-d,e-f-g)=0.1.

37 Techniques from linear algebra. Documents and terms are represented by vectors in Euclidean space. Applications of linear algebra to text analysis: latent semantic indexing (LSI); random projections.

38 Co-occurring terms. Example: auto and car; transmission and gearbox. "Auto Transmission interchange W/404 to 504?? …" "… Linear potentiometer for a racing car gearbox …"

39 Latent semantic indexing (LSI). Vector space model of documents: let m = |T|, the lexicon size, and let n = the number of documents. Define $A \in \mathbb{R}^{m \times n}$, the term-by-document matrix, where $a_{ij}$ = the number of occurrences of term i in document j.

40 Latent semantic indexing (LSI). How to reduce it? [figure: term-by-document matrix with terms as rows and documents as columns, highlighting the similarity between the rows for 'car' and 'auto']

41 Singular value decomposition (SVD). Let $A \in \mathbb{R}^{m \times n}$, $m \ge n$, be a matrix. The singular value decomposition of A is the factorization $A = U D V^T$, where U and V have orthonormal columns, $U^T U = V^T V = I_n$, and $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_i \ge 0$, $1 \le i \le n$. Then $U = [u_1, \ldots, u_n]$, where $u_1, \ldots, u_n$ are the left singular vectors; $V = [v_1, \ldots, v_n]$, where $v_1, \ldots, v_n$ are the right singular vectors; and $\sigma_1, \ldots, \sigma_n$ are the singular values of A.

42 Singular value decomposition (SVD). $AA^T = (UDV^T)(VD^TU^T) = UDIDU^T = UD^2U^T \Rightarrow AA^TU = UD^2 = [\sigma_1^2 u_1, \ldots, \sigma_n^2 u_n]$. For $1 \le i \le n$, $AA^T u_i = \sigma_i^2 u_i$, so the columns of U are eigenvectors of $AA^T$. Similarly, $A^TA = VD^2V^T$, so the columns of V are eigenvectors of $A^TA$. The corresponding eigenvalues of $AA^T$ (or $A^TA$) are $\sigma_1^2, \ldots, \sigma_n^2$.

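43 Singular value decomposition (SVD). With the singular values ordered $\sigma_1 \ge \ldots \ge \sigma_n$, let $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ be the k-truncated SVD. Then $\mathrm{rank}(A_k) = k$ and $\|A - A_k\|_2 \le \|A - M_k\|_2$ for any matrix $M_k$ of rank k.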

44 Singular value decomposition (SVD). Note: $A, A_k \in \mathbb{R}^{m \times n}$. [figure: reduction]

45 LSI with SVD. Define $q \in \mathbb{R}^m$, a query vector, with $q_i \neq 0$ if term i is part of the query. Then $A^T q \in \mathbb{R}^n$ is the answer vector, with $(A^T q)_j \neq 0$ if document j contains one or more terms in the query. How to do it better?

46 LSI with SVD. Use $A_k$ instead of A: calculate $A_k^T q$. Now a query on 'car' will return a document containing the word 'auto'.
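
The 'car' and 'auto' effect can be reproduced on a toy matrix. The sketch below is pure Python and makes several assumptions: it uses k = 1, it finds the top singular triplet by power iteration on $A^T A$ instead of a full SVD routine, it relies on A being nonnegative so that the all-ones start vector is not orthogonal to the top singular vector, and the 3-term, 2-document matrix is invented.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_triplet(A, iters=200):
    """Power iteration on A^T A; returns (sigma_1, u_1, v_1) for a nonnegative A."""
    At = transpose(A)
    v = [1.0 / math.sqrt(len(At))] * len(At)
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

# Rows are terms (car, auto, gearbox); columns are documents.
A = [[1.0, 0.0],   # "car" appears only in document 0
     [0.0, 1.0],   # "auto" appears only in document 1
     [1.0, 1.0]]   # "gearbox" appears in both
sigma, u, v = top_singular_triplet(A)
A1 = [[sigma * ui * vj for vj in v] for ui in u]   # rank-1 truncated SVD

q = [1.0, 0.0, 0.0]                      # query on the term "car"
full_scores = matvec(transpose(A), q)    # A^T q
lsi_scores = matvec(transpose(A1), q)    # A_1^T q
```

With the full matrix, document 1 scores zero for the query 'car'. With the rank-1 approximation it scores the same as document 0, because both documents share 'gearbox' and the truncation merges the two co-occurrence patterns.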

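47 Random projections. Theorem: let $v \in \mathbb{R}^n$ be a unit vector, H a randomly oriented $\ell$-dimensional subspace through the origin, and X the random variable equal to the square of the length of the projection of v on H. Then $E[X] = \ell/n$, and if $\ell$ is chosen between $\Omega(\log n)$ and $O(\sqrt{n})$, X deviates from $\ell/n$ by more than a relative error $\epsilon$ only with small probability. [exact bound not preserved]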

48 Random projections. A projection of a set of points onto a randomly oriented subspace causes only a small distortion in inter-point distances. The technique: reducing the dimensionality of the points speeds up distance computations.
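
The distance-preserving behavior can be checked numerically. Note that this sketch swaps in a different construction from the theorem: instead of an exactly orthogonal random subspace, it multiplies by a k × n matrix of independent Gaussian entries with variance 1/k, a common stand-in that preserves squared lengths in expectation. The dimensions, the seed, and the helper names are arbitrary choices for illustration.

```python
import math
import random

def random_projection_matrix(n, k, rng):
    """k x n matrix of independent N(0, 1/k) entries (Gaussian stand-in for a random subspace)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]

def project(R, x):
    """Map an n-dimensional point to k dimensions."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in R]

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(42)
n, k = 1000, 200
x = [rng.random() for _ in range(n)]
y = [rng.random() for _ in range(n)]
R = random_projection_matrix(n, k, rng)
ratio = distance(project(R, x), project(R, y)) / distance(x, y)
```

The ratio of projected to original distance should be close to 1; its typical deviation shrinks roughly like $1/\sqrt{2k}$, so larger k gives less distortion at the cost of less compression.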

49 Semi-supervised learning. Real-life applications have a few labeled documents and many unlabeled documents, so they lie between supervised and unsupervised learning.

50 Learning from labeled and unlabeled documents. Expectation Maximization (EM) algorithm. Initialization: train a naive Bayes classifier using only labeled data. Repeat EM iterations until near convergence. E-step: re-estimate the $\Phi_{c,t}$ parameters, letting documents belong probabilistically to several classes [equation not preserved]. M-step: use the $\Phi_{c,t}$ estimates to assign class probabilities Pr(c|d) to all documents that were not labeled. Error is reduced by a third in the best cases.
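
The alternation between assigning class probabilities and re-estimating parameters can be sketched as follows. This is an illustration under stated assumptions, not the lecture's exact procedure: it uses the multinomial naive Bayes model with add-one smoothing, calls the term probabilities `theta`, runs a fixed number of iterations instead of testing convergence, and uses an invented six-document corpus.

```python
import math
from collections import Counter

def estimate(docs, weights, classes, vocab):
    """Smoothed Pr(c) and theta[c][t] from fractional class memberships."""
    prior, theta = {}, {}
    for c in classes:
        mass = sum(w[c] for w in weights)
        prior[c] = (mass + 1) / (len(docs) + len(classes))
        counts = Counter()
        for d, w in zip(docs, weights):
            for t in d:
                counts[t] += w[c]   # each occurrence counts with weight Pr(c|d)
        denom = sum(counts.values()) + len(vocab)
        theta[c] = {t: (counts[t] + 1) / denom for t in vocab}
    return prior, theta

def posterior(d, prior, theta):
    """Pr(c|d) under multinomial naive Bayes, computed stably in log space."""
    logs = {c: math.log(prior[c]) + sum(math.log(theta[c][t]) for t in d)
            for c in prior}
    top = max(logs.values())
    exp = {c: math.exp(v - top) for c, v in logs.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

def em(labeled, unlabeled, iters=10):
    classes = sorted({c for _, c in labeled})
    docs = [d for d, _ in labeled] + unlabeled
    vocab = {t for d in docs for t in d}
    hard = [{c: float(c == y) for c in classes} for _, y in labeled]
    # Initial classifier: trained on the labeled documents only.
    prior, theta = estimate(docs[:len(labeled)], hard, classes, vocab)
    for _ in range(iters):
        # Assign class probabilities to the unlabeled documents.
        soft = [posterior(d, prior, theta) for d in unlabeled]
        # Re-estimate the parameters from labeled and unlabeled documents together.
        prior, theta = estimate(docs, hard + soft, classes, vocab)
    return prior, theta

labeled = [("car gearbox".split(), "autos"), ("web page".split(), "web")]
unlabeled = ["car auto".split(), "auto transmission".split(),
             "page link".split(), "link browser".split()]
prior, theta = em(labeled, unlabeled)
probs = posterior("auto transmission".split(), prior, theta)
```

The document "auto transmission" shares no term with any labeled document, so the initial classifier cannot decide. The unlabeled document "car auto" links 'auto' to the autos class, and repeated iterations propagate that evidence.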

51 Relaxation labeling. The hypertext model: documents are nodes in a hypertext graph. There are other sources of information induced by the links. [figure: hypertext graph with unlabeled nodes marked '?']

52 Relaxation labeling. Notation: c = class, t = term, N = neighbors. In supervised learning: Pr(t|c). In hypertext, using neighbors' terms: Pr(t(d), t(N(d)) | c). A better model, using neighbors' classes: Pr(t(d), c(N(d)) | c). Problem: circularity.

53 Relaxation labeling. Resolving the circularity. Initialization: assign $\Pr^{(0)}(c|d)$ to each document $d \in N(d_1)$, where $d_1$ is a test document (using text only). Iterations: [equation not preserved]

54 Social network analysis. Social networks: between academics, by co-authoring and advising; between movie personnel, by directing and acting; between people, by making phone calls; between web pages, by hyperlinking to other web pages. Applications: Google; HITS.

55 Google. [equation not preserved] where: $u \to v$ means "u links to v"; N = total number of nodes in the Web graph. Google simulates a random walk on the Web graph and uses the result as a popularity score. The popularity score is precomputed, independent of the query.
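
The random-walk popularity score can be sketched with the commonly published PageRank formulation, $p(v) = (1 - d)/N + d \sum_{u \to v} p(u) / \mathrm{outdeg}(u)$. The damping factor d = 0.85, the uniform treatment of pages without out-links, and the three-page graph are assumptions of this sketch, since the slide's own equation is not available.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration for p(v) = (1 - damping)/N + damping * sum_{u -> v} p(u) / outdeg(u)."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling page: the walk jumps to a uniformly random page.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"]}   # hypothetical three-page web
rank = pagerank(links)
```

Page c is linked from both a and b and ends up most popular; page b has no in-links and keeps only the baseline share. The scores depend only on the link graph, which is why they can be precomputed before any query arrives.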

56 Hyperlink induced topic search (HITS). Depends on a search engine. For each node u in the graph, compute an authority score $a_u$ and a hub score $h_u$. Initialize $h_u = a_u = 1$. Repeat until convergence: $a_v := \sum_{u \to v} h_u$; $h_u := \sum_{u \to v} a_v$; the score vectors a and h are normalized to 1.
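
The mutual reinforcement between hubs and authorities can be sketched as below. The sketch assumes the standard updates (authority = sum of hub scores of pages linking in, hub = sum of authority scores of pages linked to), uses the Euclidean norm for the normalization step, and runs a fixed number of iterations. The edge list is invented; in practice the graph would be built around the results returned by a search engine.

```python
import math

def hits(edges, iters=50):
    """edges: list of (u, v) pairs meaning u links to v. Returns (authority, hub) scores."""
    nodes = sorted({n for e in edges for n in e})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # A page is a good authority if good hubs link to it.
        auth = {v: sum(hub[u] for u, w in edges if w == v) for v in nodes}
        # A page is a good hub if it links to good authorities.
        hub = {u: sum(auth[w] for x, w in edges if x == u) for u in nodes}
        # Rescale both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values()))
            for n in scores:
                scores[n] /= norm
    return auth, hub

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

Unlike the precomputed popularity score on the previous slide, these scores are computed on a query-specific subgraph, so they must be recomputed for each query.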

57 Interesting pages include links to other interesting pages. The goal: find many relevant pages and few irrelevant pages, fast.

58 Conclusion. Supervised learning: probabilistic models. Unsupervised learning: techniques for clustering, k-means (top-down) and agglomerative (bottom-up); techniques for dimensionality reduction, LSI with SVD and random projections. Semi-supervised learning: the EM algorithm; relaxation labeling.

59 References: eans.htm; Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (Cutting, Karger, Pedersen, Tukey).
