Fig. 1 (a) The PageRank algorithm (b) The web link structure

Presentation on theme: "Fig. 1 (a) The PageRank algorithm (b) The web link structure"— Presentation transcript:

1 Fig. 1 (a) The PageRank algorithm (b) The web link structure
Fig. 1 shows the PageRank algorithm with random teleports and the web link structure. Construct the column-stochastic matrix M and the matrix A. Calculate the PageRank with random teleports (β = 0.8) for three iterations. Which of the three nodes {Yahoo, Amazon, M'soft} is the most important?

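2 Ans:
Rows and columns are ordered (y, a, m) = (Yahoo, Amazon, M'soft).

$$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix}$$

$$A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} = \begin{bmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{bmatrix}$$

Iteration: $r_{k+1} = A r_k$

       r0      r1      r2      r3
y       1    1.00    0.84   0.776
a       1    0.60    0.60   0.536
m       1    1.40    1.56   1.688

The most important node: m (M'soft), which has the largest score after three iterations.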
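The three iterations above can be checked with a short script. This is a minimal sketch, not part of the original slides: the variable name `beta` for the 0.8 link-following probability and the helper `step` are illustrative assumptions, and exact fractions are used only so that the entries of A match the 7/15, 1/15, and 13/15 shown in the answer.

```python
from fractions import Fraction as F

# Column-stochastic link matrix M; rows/columns ordered (Yahoo, Amazon, M'soft).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(1)],
]
beta = F(4, 5)  # assumed name for the 0.8 probability of following a link
n = len(M)

# A = beta * M + (1 - beta) * (1/n) * (all-ones matrix)
A = [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def step(A, r):
    """One power-iteration step: returns A times r."""
    return [sum(A[i][j] * r[j] for j in range(len(r))) for i in range(len(A))]

r = [F(1)] * n          # r0 = (1, 1, 1)
history = [r]
for _ in range(3):      # three iterations, as the question asks
    r = step(A, r)
    history.append(r)

for k, vec in enumerate(history):
    print(f"r{k}:", [round(float(x), 3) for x in vec])
```

The last printed vector has its largest entry in the third position (M'soft), in agreement with the table.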

3 Fig. 2 Flowchart for feature extraction in text mining.
As shown in Fig. 2, document frequency thresholding is an important step toward feature extraction in text mining.
a) Given the content of documents D1 and D2 shown in Fig. 3, fill in the document-feature matrix.
b) Given N = 10 and a threshold of 1.5, which feature terms are extracted from D1 using inverse document frequency weighting?
c) Given N = 10 and a threshold of 1.0, which feature terms are extracted from D2 using entropy weighting?
P.S. Part c) is computationally heavy; because exam time is limited, this type of question is unlikely to appear on the final exam.
Fig. 3

4 Feature Extraction: Weighting Model (3)
Entropy weighting uses the average entropy of the j-th term, where gf_j is the number of times the j-th term occurs in the whole training document collection. The average entropy is:
-1 if the word occurs once in every document
0 if the word occurs in only one document
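The formula itself is not in this transcript. The sketch below assumes the common log-entropy form, in which the average entropy of term j is (1 / log N) · Σ_i p_ij · log p_ij with p_ij = tf_ij / gf_j. That form is an assumption, chosen because it reproduces the two boundary values stated above; the function name `average_entropy` is illustrative.

```python
import math

def average_entropy(term_freqs):
    """Average entropy of one term under the assumed log-entropy form.

    term_freqs[i] is the term's count in document i; the list must cover
    more than one document and the term must occur at least once.
    """
    n_docs = len(term_freqs)
    gf = sum(term_freqs)          # gf_j: occurrences in the whole collection
    total = 0.0
    for tf in term_freqs:
        if tf > 0:                # 0 * log(0) is treated as 0
            p = tf / gf
            total += p * math.log(p)
    return total / math.log(n_docs)

print(average_entropy([1] * 10))        # term occurs once in every document
print(average_entropy([5] + [0] * 9))   # term occurs in only one document
```

Under this convention the value lies between -1 and 0, so a weight of the form log(tf + 1) × (1 + average entropy) is largest for terms concentrated in a few documents.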

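5 Ans:
a) Document-feature matrix over the terms A, B, K, O, Q, R, S, T, W, X, with one row each for D1 and D2. [Most cell values are missing from this transcript; the surviving entries are 4 and 1 for D1 and 2 for D2.]

b) w_ij = Freq_ij × log2(N / DocFreq_j), with N = 10. Distinct values appearing in the table:

DocFreq_j               10      5      3      2      1
N / DocFreq_j         1.00   2.00   3.33   5.00  10.00
log2(N / DocFreq_j)   0.00   1.00   1.74   2.32   3.32

Freq_ij for D1 (value shown): 4
tf×idf => feature terms: O, R, S, W

c) w_ij = log2(Freq_ij + 1) × (1 − entropy(w_j))

Freq_ij for D2 (values shown): 4, 2, 1
Entropy(w_j) (values shown): 0.4, 0.1, 0.3
w_ij (values shown): 1.39, 1.43, 0.90, 0.00, 0.70, 0.60
=> feature terms: A, B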
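The two weighting formulas in this answer are easy to check numerically. The following is a rough sketch: the helper names are illustrative, the strict "greater than" comparison against the threshold is an assumption, and the pairing of frequencies and entropies with the terms A and B is inferred from the order of the surviving values rather than stated in the transcript. Terms "K" and "Q" and their values are hypothetical filler.

```python
import math

N = 10  # collection size given in the question

def idf(doc_freq, n_docs=N):
    """Inverse document frequency, log2(N / DocFreq_j)."""
    return math.log2(n_docs / doc_freq)

def tfidf(freq, doc_freq, n_docs=N):
    """w_ij = Freq_ij * log2(N / DocFreq_j)."""
    return freq * idf(doc_freq, n_docs)

def entropy_weight(freq, entropy):
    """w_ij = log2(Freq_ij + 1) * (1 - entropy(w_j))."""
    return math.log2(freq + 1) * (1 - entropy)

def select_features(weights, threshold):
    """Keep terms whose weight exceeds the threshold (strict, by assumption)."""
    return [term for term, w in weights.items() if w > threshold]

# IDF lookup for the document frequencies that appear in part b).
for df in (10, 5, 3, 2, 1):
    print(df, round(idf(df), 2))

# Part c): assumed (Freq, Entropy) pairs for A and B, plus hypothetical terms.
d2 = {"A": (4, 0.4), "B": (2, 0.1), "K": (1, 0.1), "Q": (1, 0.3)}
weights = {t: entropy_weight(f, e) for t, (f, e) in d2.items()}
print({t: round(w, 2) for t, w in weights.items()})
print(select_features(weights, 1.0))
```

With a threshold of 1.0, only the two weights 1.39 and 1.43 pass, which is consistent with the answer "A, B".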

6 Data Preprocessing is essential for web usage mining.
a) Explain the four steps of data preprocessing.
b) Given the web page linkage shown in Fig. 4 (c), refine the user sessions shown in Fig. 4 (a).
c) Given the web page linkage shown in Fig. 4 (c), complete the paths in Fig. 4 (b).
Fig. 4

7 Ans:
a) [not included in the transcript]
b) Three sessions: A-B-F-O-G-A-D, L-R, A-B-C-J
   or four sessions: A-B-F-O-G, A-D, L-R, A-B-C-J
c) Four sessions: A-B-F-O-F-B-G, A-D, L-R, A-B-A-C-J
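Path completion as in part c) can be sketched as a stack walk: when the next requested page is not linked from the current one, the user is assumed to have pressed "back" through cached pages until reaching a page that does link to it. The link structure of Fig. 4 (c) is not in this transcript, so the `links` dictionary below is a hypothetical graph chosen only to be consistent with the completed paths in the answer; `complete_path` is an illustrative name.

```python
def complete_path(session, links):
    """Insert the back-navigation steps missing from a logged session.

    session: list of requested pages in log order.
    links:   dict mapping a page to the set of pages it links to.
    """
    stack = []  # pages on the current forward path
    path = []   # completed path, including inferred revisits
    for page in session:
        # Most recent page on the stack that links to the requested page.
        idx = next(
            (i for i in range(len(stack) - 1, -1, -1)
             if page in links.get(stack[i], ())),
            None,
        )
        if idx is None:
            stack = []                             # no referrer: fresh entry
        else:
            path.extend(reversed(stack[idx:-1]))   # inferred "back" revisits
            del stack[idx + 1:]
        stack.append(page)
        path.append(page)
    return path

# Hypothetical link graph, assumed for illustration only.
links = {
    "A": {"B", "C", "D"},
    "B": {"F", "G"},
    "F": {"O"},
    "C": {"J"},
    "L": {"R"},
}

for s in ("ABFOG", "AD", "LR", "ABCJ"):
    print("-".join(complete_path(list(s), links)))
```

On this assumed graph the four sessions from part b) expand to the four completed paths listed in part c).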

