The identification of interesting web sites Presented by Xiaoshu Cai.

Introduction
Syskill & Webert is a software agent that learns a profile of a user's interests.
Users can rate each page at one of three levels: hot (interesting), medium, or cold.

Dataset
Downloaded from the UCI KDD Archive (Syskill & Webert).
Contains HTML web pages and index files with user ratings on four separate subjects.
Index file format: file-name | rating | url | date-rated | title
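A minimal sketch of reading one index line in the pipe-separated format listed above. The `IndexEntry` type, the `parse_index_line` helper, and the sample line are hypothetical and made up for illustration; the real files may differ in whitespace or date format.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One rated page, with fields in the order given by the index file format."""
    file_name: str
    rating: str
    url: str
    date_rated: str
    title: str


def parse_index_line(line: str) -> IndexEntry:
    # Split on the first four pipes only, so a title containing "|" stays intact.
    parts = [p.strip() for p in line.split("|", 4)]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    return IndexEntry(*parts)


# Hypothetical line, not taken from the dataset.
entry = parse_index_line(
    "12 | hot | http://example.com/band.html | Mon Jan 1 1996 | Example Band"
)
```

Limiting the split to four separators is a design choice that keeps the last field whole if a page title happens to contain a pipe.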

Aims
Predict user ratings for web pages (within a subject category).
Compare three prediction algorithms: the Bayesian classifier, nearest neighbor, and Rocchio’s algorithm.
Analyze the effect of different numbers of informative features.

Learning user profile
Convert the HTML source of a web page into a Boolean feature vector.
Each feature has a Boolean value that indicates whether a particular word is present (at least once) or absent in that web page.
f1 f2 f3 f4 ……
File No. 1001……
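The conversion described on this slide can be sketched as follows. The tag stripping with a regular expression and the helper names `page_words` and `to_feature_vector` are assumptions for illustration; the original system's tokenizer is not described in the slides.

```python
import re
from typing import List, Set


def page_words(html: str) -> Set[str]:
    """Return the set of lowercase words in a page, ignoring HTML tags."""
    text = re.sub(r"<[^>]*>", " ", html)  # crude tag removal, enough for a sketch
    return set(re.findall(r"[a-z]+", text.lower()))


def to_feature_vector(html: str, features: List[str]) -> List[int]:
    """1 if the feature word occurs at least once in the page, else 0."""
    words = page_words(html)
    return [1 if f in words else 0 for f in features]


features = ["musician", "drums", "radio"]
vec = to_feature_vector(
    "<html><b>Drums</b> and a musician, drums again</html>", features
)
```

Note that "drums" appears twice in the sample page but still maps to a single 1, since only presence or absence is recorded.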

Feature selection
In a sample of 20 web pages, there are often 5,000 or more unique words.
The useful words are those that occur frequently in pages on the hotlist but infrequently in pages on the coldlist.
Find the k most informative features using information gain.
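A sketch of ranking words by information gain, assuming the standard definition: the entropy of the class labels minus the weighted entropy after splitting pages on whether the word is present. The function names and the four toy pages are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(word, pages, labels):
    """Reduction in label entropy from knowing whether `word` is in a page."""
    present = [y for words, y in zip(pages, labels) if word in words]
    absent = [y for words, y in zip(pages, labels) if word not in words]
    n = len(labels)
    remainder = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remainder


def top_k_features(pages, labels, k):
    """The k words with the highest information gain (ties broken by name)."""
    vocab = set().union(*pages)
    ranked = sorted(vocab, key=lambda w: (-information_gain(w, pages, labels), w))
    return ranked[:k]


# Toy data: each page is the set of words it contains.
pages = [{"drums", "the"}, {"melody", "the"}, {"tax", "the"}, {"tax", "form", "the"}]
labels = ["hot", "hot", "cold", "cold"]
best = top_k_features(pages, labels, 1)
```

In the toy data, "the" occurs in every page and so has zero gain, while "tax" separates the two classes perfectly.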

Feature selection
Eliminate uninformative words that occur frequently, for example 600 frequently occurring English words (and HTML commands) such as “the,” “is,” “very,” and “if.”
The table below shows some of the most informative words obtained from a collection of 61 HTML documents on Bands.
musician      variety      listeners  magazine  feature
keyboards     Fall         Festival   Live      spiritual
Women         performance  Awards     maturity  Album
tape          School       folk       Miami     weave
parties       worth        cassettes  showcase  Wire
professional  flair        melody     drums     radio

Bayesian Classifier
Estimate the probability that an example j belongs to class Ci given the values of the example's attributes.
An example is assigned to the class with the highest probability.
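One common way to estimate this probability is a naive Bayes model, which assumes the attributes are independent given the class, so the score for class Ci is P(Ci) times the product of the per-attribute probabilities. The sketch below works under that assumption on Boolean feature vectors. The Laplace smoothing and the function names are choices made here for illustration and are not stated in the slides.

```python
import math


def train_naive_bayes(X, y):
    """Estimate P(class) and P(feature present | class) from Boolean vectors."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace smoothing (an assumption) avoids zero probabilities.
        p_present = [
            (sum(r[i] for r in rows) + 1) / (len(rows) + 2)
            for i in range(n_features)
        ]
        model[c] = (prior, p_present)
    return model


def predict(model, x):
    """Return the class with the highest posterior score for vector x."""
    def log_score(c):
        prior, p_present = model[c]
        return math.log(prior) + sum(
            math.log(p if v else 1.0 - p) for p, v in zip(p_present, x)
        )
    return max(model, key=log_score)


# Toy training set: two Boolean features per page.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = ["hot", "hot", "cold", "cold"]
model = train_naive_bayes(X, y)
```

Scores are summed in log space to avoid underflow when many features are multiplied together.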

Nearest Neighbor
Each test page is assigned to the class of the most similar example in the training data.
Features are binary.
The most similar example is the one that has the most feature values in common with the test example.
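A minimal sketch of this rule on binary feature vectors, counting matching positions as the similarity. The function names are hypothetical, and ties are broken in favor of the earliest training example, which is an arbitrary choice made here.

```python
def matches(a, b):
    """Number of positions where two equal-length binary vectors agree."""
    return sum(1 for u, v in zip(a, b) if u == v)


def nearest_neighbor(train_X, train_y, x):
    """Label of the training example sharing the most feature values with x."""
    best = max(range(len(train_X)), key=lambda i: matches(train_X[i], x))
    return train_y[best]


train_X = [[1, 1, 0], [0, 0, 1]]
train_y = ["hot", "cold"]
label = nearest_neighbor(train_X, train_y, [1, 0, 0])
```

Counting agreements means a shared absence of a word contributes as much to similarity as a shared presence.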

Rocchio’s Algorithm
Compute a TF-IDF weight for each informative word.
TF: raw frequency of the term in the document.
IDF: inverse document frequency of the term.
TF-IDF = TF * IDF
Prototype vector for the interesting class: average TF-IDF for interesting pages - 0.25 * average TF-IDF for uninteresting pages.
A page within a certain distance of the prototype vector is considered interesting.
The distance threshold is chosen to maximize accuracy on the training set.
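The steps on this slide can be sketched as below. Two details are assumptions, since the slide does not specify them: IDF is taken as log(N / document frequency), and closeness to the prototype is measured by cosine similarity, so a page is called interesting when its similarity is at or above the threshold. All names and the toy term counts are illustrative.

```python
import math


def idf_weights(docs):
    """IDF per feature as log(N / df); 0 for features that never occur."""
    n = len(docs)
    weights = []
    for i in range(len(docs[0])):
        df = sum(1 for d in docs if d[i] > 0)
        weights.append(math.log(n / df) if df else 0.0)
    return weights


def tfidf(doc, idf):
    return [tf * w for tf, w in zip(doc, idf)]


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def prototype(hot_vecs, cold_vecs, beta=0.25):
    """Average of interesting pages minus beta times average of the rest."""
    return [h - beta * c for h, c in zip(mean(hot_vecs), mean(cold_vecs))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_threshold(scores, labels):
    """Observed score that maximizes training accuracy when used as a cut-off."""
    def accuracy(t):
        hits = sum((s >= t) == (y == "hot") for s, y in zip(scores, labels))
        return hits / len(labels)
    return max(sorted(set(scores)), key=accuracy)


# Toy term counts over three informative words.
docs = [[3, 0, 1], [2, 1, 0], [0, 4, 1], [0, 2, 0]]
labels = ["hot", "hot", "cold", "cold"]
idf = idf_weights(docs)
vecs = [tfidf(d, idf) for d in docs]
proto = prototype(vecs[:2], vecs[2:])
scores = [cosine(v, proto) for v in vecs]
threshold = best_threshold(scores, labels)
```

Searching only the observed scores is sufficient here because training accuracy can only change at those values.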

Experiment
Topics used in the experiment:
Topic         Bands    Goats    Sheep    Biomedical
Pages         61       74       70       136
Cold pages %  75.40%   51.35%   72.85%   72.79%
Randomly select m training examples and find 128 informative features.
Convert the test data to feature vectors.
Run 40 trials for each algorithm.
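A rough sketch of the trial loop, assuming each trial draws m random training examples, tests on the remaining pages, and reports accuracy averaged over all trials. The `run_trials` harness and the `majority_baseline` learner are hypothetical stand-ins; in the actual experiment the learner would also select the informative features from the training pages before classifying.

```python
import random


def run_trials(examples, labels, m, fit_predict, trials=40, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    accuracies = []
    for _ in range(trials):
        rng.shuffle(indices)
        train, test = indices[:m], indices[m:]
        predicted = fit_predict(
            [examples[i] for i in train],
            [labels[i] for i in train],
            [examples[i] for i in test],
        )
        correct = sum(p == labels[i] for p, i in zip(predicted, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / trials


def majority_baseline(train_X, train_y, test_X):
    """Placeholder learner: always predict the most common training label."""
    majority = max(sorted(set(train_y)), key=train_y.count)
    return [majority] * len(test_X)


# Toy data: ten pages, mostly cold.
examples = list(range(10))
labels = ["cold"] * 7 + ["hot"] * 3
mean_accuracy = run_trials(examples, labels, m=4, fit_predict=majority_baseline)
```

Any of the three classifiers can be plugged in through `fit_predict`, as long as it takes training pages, training labels, and test pages and returns one predicted label per test page.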

Result

Effect of number of informative features

Future work
Investigate improvements to the underlying classification technology.
Use lexical knowledge, e.g., a word relationship database, to filter related informative words used as features.

Thank you! Q&A?
