The identification of interesting web sites Presented by Xiaoshu Cai.

1 The identification of interesting web sites Presented by Xiaoshu Cai

2 Introduction Syskill & Webert: a software agent that learns a profile of a user's interests. Users can rate a page at one of three levels: hot (interesting), medium, and cold.

3 Dataset Downloaded from the UCI KDD Archive (Syskill & Webert). Contains HTML web pages and index files with user ratings on four separate subjects. Index file format: file-name | rating | url | date-rated | title
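A minimal sketch of reading one index line in the format listed on this slide. The class name `IndexEntry`, the function name `parse_index_line`, and the sample line are hypothetical; the exact spelling of ratings and dates in the real files is not shown in the slides, so the fields are kept as plain strings.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One row of an index file: file-name | rating | url | date-rated | title."""
    file_name: str
    rating: str
    url: str
    date_rated: str
    title: str


def parse_index_line(line: str) -> IndexEntry:
    # Split on at most four pipes so a title that itself contains '|' stays intact.
    parts = [p.strip() for p in line.split("|", 4)]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    return IndexEntry(*parts)


# Hypothetical sample line, made up for illustration only.
entry = parse_index_line("12 | hot | http://example.com/band.html | 1996-01-01 | Example Band")
```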

4 Aims Predict user ratings for web pages (within a subject category). Compare three prediction algorithms: Bayesian, Nearest Neighbor, and Rocchio’s Algorithm. Analyze the effect of the number of informative features.

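5 Learning user profile Convert the HTML source of a web page into a Boolean feature vector. Each feature has a Boolean value that indicates whether a particular word is present (at least once) or absent in that web page. [Table: one row per file (File No.) and one column per feature f1, f2, f3, f4, ……, holding 0/1 values; the example row reads "1001……" in the extracted text.]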
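The conversion described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the presenter's code: words are lowercased and restricted to runs of ASCII letters, the function names `page_words` and `boolean_vector` are hypothetical, and text inside script or style elements is not filtered out.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_words(html):
    """Return the set of distinct lowercase words in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return set(re.findall(r"[a-z]+", text))


def boolean_vector(html, features):
    """1 if the feature word occurs at least once in the page, else 0."""
    words = page_words(html)
    return [1 if f in words else 0 for f in features]


page = "<html><body><h1>Folk Festival</h1><p>live drums</p></body></html>"
vector = boolean_vector(page, ["folk", "radio", "drums"])
```

Each position in `vector` corresponds to one feature word, in the order given in `features`.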

6 Feature selection In a sample of 20 web pages, there are often 5,000 or more unique words. One would like words that occur frequently in pages on the hotlist but infrequently in pages on the coldlist. Find the k most informative features using information gain.
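A sketch of selecting k words by information gain, assuming the standard definition: the entropy of the ratings minus the expected entropy after splitting the pages on whether a word is present. Each page is represented as a set of words; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(pages, labels, word):
    """Entropy reduction from splitting pages on presence of `word`."""
    present = [lab for words, lab in zip(pages, labels) if word in words]
    absent = [lab for words, lab in zip(pages, labels) if word not in words]
    n = len(labels)
    remainder = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remainder


def top_k_features(pages, labels, k, stop_words=frozenset()):
    """The k words with the highest information gain (ties broken alphabetically)."""
    vocab = set().union(*pages) - set(stop_words)
    ranked = sorted(vocab, key=lambda w: (-information_gain(pages, labels, w), w))
    return ranked[:k]


# Toy data: four pages as word sets, two rated hot and two rated cold.
pages = [{"folk", "the"}, {"folk", "drums", "the"}, {"goat", "the"}, {"sheep", "the"}]
labels = ["hot", "hot", "cold", "cold"]
best = top_k_features(pages, labels, k=1)
```

A word such as "the" that appears on every page gets zero gain, which is why it would never be selected even without a stop list.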

7 Feature selection Eliminate uninformative words that occur frequently: 600 frequently occurring English words (e.g., “the,” “is,” “very,” and “if”) and HTML commands. The table below shows some of the most informative words obtained from a collection of 61 HTML documents on Bands.
musician      variety      listeners  magazine  feature
keyboards     Fall         Festival   Live      spiritual
Women         performance  Awards     maturity  Album
tape          School       folk       Miami     weave
parties       worth        cassettes  showcase  Wire
professional  flair        melody     drums     radio

8 Bayesian Classifier Compute the probability that an example j belongs to class Ci given the values of the attributes of the example. An example is assigned to the class with the highest probability.
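The formula from this slide did not survive in the transcript, so the following is only a sketch of one common reading: a naive Bayes classifier, which assumes the Boolean features are independent given the class and scores each class by its prior times the product of per-feature likelihoods. The Laplace smoothing and the function names are assumptions, not taken from the slides.

```python
import math


def train_naive_bayes(vectors, labels):
    """Estimate P(class) and P(feature = 1 | class) from Boolean vectors."""
    n_features = len(vectors[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, lab in zip(vectors, labels) if lab == c]
        prior = len(rows) / len(vectors)
        # Laplace smoothing (an assumption) keeps every probability strictly
        # between 0 and 1, so the logarithms below are always defined.
        p_one = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2) for i in range(n_features)]
        model[c] = (prior, p_one)
    return model


def classify(model, vector):
    """Return the class with the highest posterior score for `vector`."""
    def log_posterior(c):
        prior, p_one = model[c]
        score = math.log(prior)
        for x, p in zip(vector, p_one):
            score += math.log(p if x else 1.0 - p)
        return score

    return max(model, key=log_posterior)


train_x = [[1, 0], [1, 1], [0, 1], [0, 0]]
train_y = ["hot", "hot", "cold", "cold"]
nb_model = train_naive_bayes(train_x, train_y)
```

Working in log space avoids numeric underflow when many feature probabilities are multiplied together.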

9 Nearest Neighbor Each page is assigned to the class of the most similar example in the training data. Features are binary. The most similar example is the one that has the most feature values in common with the test example.
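The rule on this slide can be sketched directly: count the positions where two Boolean vectors agree and return the label of the training example with the highest count. The function names are hypothetical, and ties are resolved in favor of the earliest training example, which is a choice the slides do not specify.

```python
def matches(a, b):
    """Number of positions where two equal-length vectors hold the same value."""
    return sum(1 for x, y in zip(a, b) if x == y)


def nearest_neighbor(train_vectors, train_labels, test_vector):
    """Label of the training example sharing the most feature values."""
    best = max(
        range(len(train_vectors)),
        key=lambda i: matches(train_vectors[i], test_vector),
    )
    return train_labels[best]


nn_train = [[1, 1, 0], [0, 0, 1]]
nn_labels = ["hot", "cold"]
prediction = nearest_neighbor(nn_train, nn_labels, [1, 0, 0])
```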

10 Rocchio’s Algorithm Compute a TF-IDF weight for each informative word. TF: raw frequency of the term in the document. IDF: inverse document frequency of the term. TF-IDF = TF * IDF. Prototype vector for the interesting class: average TF-IDF for interesting pages - 0.25 * average TF-IDF for uninteresting pages. A page within a certain distance of the prototype vector is considered interesting. The distance threshold is chosen to maximize accuracy on the training set.
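A sketch of the prototype construction and threshold choice described on this slide. Two details are assumptions because the slides do not state them: IDF is taken as log(N / document frequency), and distance is Euclidean. All function names and the toy documents are hypothetical.

```python
import math


def tfidf_vectors(docs, features):
    """docs: list of token lists. Returns one TF-IDF vector per document."""
    n = len(docs)
    df = [sum(1 for d in docs if f in d) for f in features]
    # Assumed IDF form: log(N / df); terms that occur nowhere get weight 0.
    idf = [math.log(n / c) if c else 0.0 for c in df]
    return [[d.count(f) * w for f, w in zip(features, idf)] for d in docs]


def mean_vector(vectors, size):
    if not vectors:
        return [0.0] * size
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def prototype(vectors, interesting, negative_weight=0.25):
    """Mean of interesting vectors minus 0.25 times the mean of the rest."""
    size = len(vectors[0])
    pos = mean_vector([v for v, hot in zip(vectors, interesting) if hot], size)
    neg = mean_vector([v for v, hot in zip(vectors, interesting) if not hot], size)
    return [p - negative_weight * q for p, q in zip(pos, neg)]


def distance(a, b):
    # Euclidean distance (an assumption; the slide only says "distance").
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_threshold(vectors, interesting, proto):
    """Pick the observed distance that maximizes training accuracy."""
    dists = [distance(v, proto) for v in vectors]

    def accuracy(t):
        hits = sum((d <= t) == bool(hot) for d, hot in zip(dists, interesting))
        return hits / len(dists)

    return max(sorted(set(dists)), key=accuracy)


docs = [["folk", "folk", "drums"], ["folk", "radio"], ["goat", "cheese"], ["goat", "goat"]]
interesting = [True, True, False, False]
vecs = tfidf_vectors(docs, ["folk", "goat"])
proto = prototype(vecs, interesting)
threshold = best_threshold(vecs, interesting, proto)
```

A new page would then be called interesting when its TF-IDF vector lies within `threshold` of `proto`.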

11 Experiment Randomly select m training examples and find 128 informative features. Convert the test data to feature vectors. Run 40 trials for each algorithm. Topics used in the experiment:
Topic         Bands    Goats    Sheep    Biomedical
Pages         61       74       70       136
Cold pages %  75.40%   51.35%   72.85%   72.79%

12 Result

13 Effect of number of informative features

15 Future work Investigate improvements to the underlying classification technology. Use lexical knowledge (e.g., a word relationship database) to filter related informative words when choosing features.

16 Thank you! Q&A?

