Web Page Classifiers Inmaculada Hernández. Roadmap Introduction Classifiers Taxonomy Evaluation Conclusions & Future Work.


1 Web Page Classifiers Inmaculada Hernández

2 Roadmap: Introduction, Classifiers Taxonomy, Evaluation, Conclusions & Future Work

3 Roadmap: Introduction, Classifiers Taxonomy, Evaluation, Conclusions & Future Work

4 Introduction Page categorization

5 Introduction Wrapper Generation

6 Roadmap: Introduction, Classifiers Taxonomy, Evaluation, Conclusions & Future Work

7 Web Classification Issues
High Dimensionality: Which features? How many? Where do we find them?
High Speed
Training Set: Positive & Negative; Only Positive; None

8 Web Classification Framework: Focus, Feature Analysis, Page Representation, Baseline Classifier, Features, Preprocessing Techniques

9 Hubs & Authorities In dynamic pages, we will use “detail” instead of “authority”

10 Classification Focus
Domain: Sports, Movies, Political, Online Stores…
Functional: Home, Hubs, Detail, Error, No results…
Sentiment: Angry, Sad, Happy…
Spam: Authorized mail, Spam…

11 Web Classification Framework: Focus, Feature Analysis, Page Representation, Baseline Classifier, Features, Preprocessing Techniques

12 Page Representation
Brad – 2 times; Movie – 4 times; Romance – 1 time; Rating – 1 time
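The word counts on this slide correspond to a bag-of-words page representation. A minimal Python sketch of that idea follows; the function name `term_frequencies` and the sample text are hypothetical, and the text was made up only so that its counts match the slide.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Map each lowercase alphanumeric token to its number of occurrences."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical page text (not from the presentation).
page_text = (
    "Movie review: Brad stars in this movie. A romance movie with Brad. "
    "Movie rating below."
)
tf = term_frequencies(page_text)
for word in ("brad", "movie", "romance", "rating"):
    print(word, tf[word])
```

The resulting counter can be used directly as a sparse term-frequency vector for the page.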

13 Web Classification Framework: Focus, Feature Analysis, Page Representation, Baseline Classifier, Features, Preprocessing Techniques

14 Classification Features
Types: Content, Structure, Links, URL, Visual Analysis, Hybrid
Location: On Page, Neighbours

15 Content-Based Features
Features: Word Frequency, Page Size
Example keywords (Movies): movie, Brad Pitt, Rating, Characters, Plot, Release Date, David Fincher
Example keywords (Sports): Tennis, Football, Sports, Rafa Nadal, Matches, Golf
Examples: Hotho02, Selamat04, Pierre01

16 Structure-Based Features Not used Templates Trees Regular Expressions Examples Reis04 Vieira06 Crescenzi01

17 Link-Based Features Incoming links Outcoming links Examples Pierre01 Bar-Yossef02 Blanco07 Web Site

18 URL-Based Features
http://www.amazon.com/shoes-men-women-kids-baby/b/ref=sa_menu_shoe9_gw
http://www.wordreference.com/es/en/translation.asp?spen=newswire
http://www.amazon.com/MP3-Music-Download/b/ref=sa_menu_dmusic2_gw
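The slide only lists example URLs, so the following Python sketch shows one possible way to turn a URL into features. The feature set (host, path tokens, path depth, query parameter names) and the function name `url_features` are assumptions, not the method of any cited proposal. The URLs are the slide's examples with the stray line-break spaces removed.

```python
import re
from urllib.parse import urlparse, parse_qs


def url_features(url):
    """Extract a few simple, illustrative features from a URL string."""
    parsed = urlparse(url)
    # Split the path on common separators and drop empty pieces.
    path_tokens = [t for t in re.split(r"[/\-_.=]+", parsed.path.lower()) if t]
    return {
        "host": parsed.netloc.lower(),
        "path_tokens": path_tokens,
        "path_depth": len([s for s in parsed.path.split("/") if s]),
        "query_keys": sorted(parse_qs(parsed.query).keys()),
    }


shoes = url_features(
    "http://www.amazon.com/shoes-men-women-kids-baby/b/ref=sa_menu_shoe9_gw"
)
translation = url_features(
    "http://www.wordreference.com/es/en/translation.asp?spen=newswire"
)
print(shoes["path_tokens"])
print(translation["query_keys"])
```

Features of this kind need no page download, which is why URL-based classification is attractive when speed matters.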

19 Visual Analysis Features Number of images Line spacing Distance between elements Examples Alvarez07

20 Hybrid Feature Set
Links, Structure, Content
Features: Combination of structure and content
Examples: Caverlee05, Markov08

21 Neutral Features Any Examples Yu03 Yu04 SVM

22 Neighbour-Based Features Anchor text Extended anchor text Headings preceeding anchor Labels Content Examples Cohen02 Furnkraz02

23 Web Classification Framework: Focus, Feature Analysis, Page Representation, Baseline Classifier, Features, Preprocessing Techniques

24 Feature Analysis (Dimensionality Reduction)
Feature Selection: Document Frequency, Mutual Information, Odds Ratio, Cross Entropy, Information Gain, Chi-square
Feature Extraction: Latent Semantic Indexing, Word Clustering
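Two of the feature selection measures named on this slide, document frequency and chi-square, can be sketched in a few lines of Python. The chi-square score below uses the usual 2x2 contingency-table form for a term and a class; the toy documents, labels, and function names are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    return Counter(term for tokens in docs for term in set(tokens))


def chi_square(docs, labels, term, target):
    """Chi-square statistic between presence of `term` and class `target`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1      # term present, target class
        elif has_term:
            b += 1      # term present, other class
        elif in_class:
            c += 1      # term absent, target class
        else:
            d += 1      # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


# Toy corpus (made up for illustration).
docs = [
    {"movie", "rating", "plot"},
    {"movie", "brad", "pitt"},
    {"golf", "rating"},
    {"tennis", "match"},
]
labels = ["movies", "movies", "sports", "sports"]

df = document_frequency(docs)
movie_score = chi_square(docs, labels, "movie", "movies")
rating_score = chi_square(docs, labels, "rating", "movies")
print(df["movie"], movie_score, rating_score)
```

In this toy corpus "movie" separates the two classes perfectly and gets the highest score, while "rating" appears equally in both classes and scores zero, so a selection step would keep the former and drop the latter.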

25 Web Classification Framework: Focus, Feature Analysis, Page Representation, Baseline Classifier, Features, Preprocessing Techniques

26 Baseline Classifier
Statistical: Bayesian Network, K-Nearest Neighbour
Machine Learning: Decision Trees, Genetic Algorithms, Neural Networks (Backpropagation, Self-organizing maps)
Ad-hoc Techniques: TemplateTokens, UFRE, Pagelet
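As an illustration of one baseline from this slide, here is a minimal K-Nearest Neighbour sketch in Python. It assumes pages are represented as term-frequency vectors compared by cosine similarity, which is a common choice but not stated in the slides; the training pages and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, training, k=3):
    """Return the majority label among the k training pages closest to query."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set (made up for illustration).
training = [
    (Counter("movie plot rating".split()), "movies"),
    (Counter("movie brad pitt".split()), "movies"),
    (Counter("golf match".split()), "sports"),
    (Counter("tennis match player".split()), "sports"),
]

movie_label = knn_classify(Counter("brad pitt movie plot".split()), training)
sport_label = knn_classify(Counter("tennis match".split()), training)
print(movie_label, sport_label)
```

All the work happens at query time, since every classification compares the query against the whole training set.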

27 Web Classification Framework: Focus, Feature Analysis, Page Representation, Baseline Classifier, Features, Preprocessing Techniques

28 Preprocessing Techniques
Cleansing & Formatting (jTidy); Tokenization; Stemming; Stop Words; Removal of tags; Removal of rare words
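Most of the steps on this slide can be chained into a small pipeline. The Python sketch below is only illustrative: the stop-word list is a tiny made-up sample, the regular-expression tag removal stands in for a real HTML cleaner such as jTidy (a Java library, not used here), and `crude_stem` merely strips a plural "s" as a placeholder for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "to"}


def strip_tags(html):
    """Remove tags with a crude regular expression."""
    return re.sub(r"<[^>]+>", " ", html)


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Placeholder stemmer: strip a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(pages, min_count=2):
    """Tag removal, tokenization, stop-word removal, stemming, rare-word removal."""
    docs = [
        [crude_stem(t) for t in tokenize(strip_tags(page)) if t not in STOP_WORDS]
        for page in pages
    ]
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]


# Hypothetical pages.
pages = [
    "<h1>The Movies</h1><p>Rating of the movie</p>",
    "<p>A movie rating and plots</p>",
]
cleaned = preprocess(pages)
print(cleaned)
```

Here "plot" is dropped as a rare word because it occurs only once across the collection, while the stop words never reach the stemming step.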

29 Related Work
Author | Classification Type | Approach | Baseline Classifier | Page Representation | Execution | Preprocessing; Feature Analysis
Pierre01 | Domain | Content & Link | K-Nearest Neighbour | Text | Eager | LSI
Hotho02 | Domain | Content | K-Means | Concept Vectors | Lazy | Term Select., COSA (Ontology)
Selamat04 | Domain | Content | Neural Networks | Word Vectors | Eager | Stemming, stop words; PCA, CBPF
Markov08 | Domain | Content & Structural | C4.5, Naïve-Bayes | Graph document representation / Boolean Vector | Eager | Stopwords, Stemming; Minimal frequency threshold for subgraphs
Yu03 & Yu04 | Both | Any | 1-DNF, SVM | Feature Vector | Eager | -
Doorenbos97 | Functional | Structural | - | logical lines | Eager | -
Crescenzi01 | Functional | Structural | UFREs | Tags & Text | Lazy | jTidy, Tokenization
Arasu03 | Functional | Structural | Eq. classes | Tags & Text | Lazy | LFEQ
Bar-Yossef02 | Functional | Structural & Link | Pagelet | Parse Tree | Lazy | -
Reis04 | Functional | Structural | Tree Edit Distance | DOM Tree | Lazy | -
Grumbach99 | Functional | Structural | Mark-up Encoding | Sequences of characters over a finite alphabet | Lazy | -
Flesca05 | Functional | Structural | Disc. Fourier Trans. | DTD | Lazy | Check document conformity to DTD
Vieira06 | Functional | Structural | RTDM-TD | DOM Tree | Lazy | -
Caverlee05 | Functional | Content & Structural & Link | K-Means | Tag Tree | Eager | Tidy
Vidal07 | Functional | Structural & URL | RTDM | DOM Tree | Lazy | -
Blanco07 | Functional | Structural & Link | TemplateTokens | DOM Tree | Eager | Links

30 Roadmap: Introduction, Classifiers Taxonomy, Evaluation, Conclusions & Future Work

31 Evaluation Metrics Precision Recall F-Measure Data Set (Training & Testing) Size Source

32 Data Sets Sources
Standard Data Sets: Reuters (21,578 docs), RCV1 (804,414 docs), TEL-8 (~500 docs), WebKB (8,282 docs), Open Directory (2,656,105 docs)
Reference Webs: Amazon, Yahoo! Sports Pages (?)
Query results
Crawled pages, …

33 Roadmap: Introduction, Classifiers Taxonomy, Evaluation, Conclusions & Future Work

34 Conclusions
Classifiers: Several different proposals; Good results in general
Classifiers Evaluation: Not comparable; Specific Data Sets
Preprocessing: Not frequently applied; Specific techniques
Our focus: Structural, Functional

35 Crawling vs. Virtual Integration (two panels: Crawling, Virtual Integration)

36 Research challenges
Classifiers: Feature Selection, Standard Dataset, Link classifiers
Navigation: Which web page classifiers are better for navigation? Post-filtering

37 Questions?

38 Thanks! Drop by our web site at http://www.tdg-seville.info inmahernandez@us.es

