Practical Issues for Automated Categorization of Web Sites John M. Pierre Metacode Technologies, Inc. 139 Townsend Street San Francisco,

1 Practical Issues for Automated Categorization of Web Sites John M. Pierre jpierre@metacode.com Metacode Technologies, Inc. 139 Townsend Street San Francisco, CA 94107 (Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)

2 Outline
  Project overview
  Web content: Automated Categorization; Feature Selection; Metadata
  Experimental Setup: Data; Targeted Spidering; System Architecture
  Results
  Conclusions

3 Project Overview
  Specific: Categorize a large number of domain names by industry category; NAICS classification scheme; ~30,000 domain names for testing (.com); Text categorization approach
  General: Domain-specific classification; Metadata; Targeted spidering; Feature selection; Classifier training

4 Web Content: Automated Categorization
  Challenges: Vast (over 1 billion pages); Heterogeneous (content, formats, not just HTML); Dynamic (growing, changing)
  Benefits: Good source of information; Accessible!; Machine readable (vs. machine understandable); Semi-structured
  Tools: Classification; Automated classification; Text Categorization/Machine Learning; Intelligent agents
  Related Work:
    Manual: Yahoo!; Open Directory Project; Looksmart
    Automatic: Northern Light; Thunderstone/Texis; Inktomi
    Other: EU Project DESIRE II; Pharos; Attardi, Sebastiani et al.; L. Page et al.; McCallum et al.

5 Web Content: Feature Selection
  Text features (D. Lewis) should be:
    Relatively few in number
    Moderate in frequency of assignment
    Low in redundancy
    Low in noise
    Related in semantic scope to the classes to be assigned
    Relatively unambiguous in meaning
  Preliminary experiment: 1125 web domains; SEC+NAICS training set
  Use metadata if possible; use body text as a last resort!
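The "metadata first, body text as last resort" rule can be sketched in a few lines. This is only an illustration built on Python's standard `html.parser`. The choice of fields (the `<title>` plus the `keywords` and `description` meta tags) and the names `MetaTextExtractor` and `page_features` are assumptions made for this sketch; the slides do not say which fields or tools were actually used.

```python
from html.parser import HTMLParser


class MetaTextExtractor(HTMLParser):
    """Collects metadata text and visible body text separately.

    Assumption: "metadata" means <title> plus meta keywords/description.
    """

    def __init__(self):
        super().__init__()
        self.meta = []          # title and meta-tag content
        self.body = []          # all other visible text
        self._in_title = False
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            if name in ("keywords", "description") and content:
                self.meta.append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.meta.append(text)
        else:
            self.body.append(text)


def page_features(html):
    """Return (source, text): metadata if any was found, else body text."""
    parser = MetaTextExtractor()
    parser.feed(html)
    parser.close()
    if parser.meta:
        return "metadata", " ".join(parser.meta)
    return "body", " ".join(parser.body)
```

Returning the source label alongside the text makes it easy to count how many pages fell back to body text, which is the kind of breakdown a preliminary experiment would need.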

6 Web Content: Metadata

7 Experimental Setup: Targeted Spidering
  [Flowchart. Recoverable elements: Domain name; HTTP Get; Try www.; Frames?; Metatags?; <a href=?; Use live?; 'Query' Pages; Send Query; Yes/No branches. Link keywords: prod, service, about, info, press, news.]
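One piece of the targeted spidering step, choosing which links to follow, might look like the sketch below. The keyword list (prod, service, about, info, press, news) comes from the slide. Everything else is an assumption for illustration: the names `LinkCollector` and `targeted_links`, matching keywords against the URL path only, and staying on the same host. No network access is involved; the function works on HTML that has already been fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Keywords taken from the slide's flowchart.
TARGET_KEYWORDS = ("prod", "service", "about", "info", "press", "news")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def targeted_links(html, base_url, keywords=TARGET_KEYWORDS):
    """Return same-host links whose path contains one of the keywords.

    Assumptions of this sketch: substring match on the lowercased URL
    path, links to other hosts are ignored, duplicates are dropped.
    """
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    selected = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)      # resolve relative links
        parsed = urlparse(url)
        if parsed.netloc != host:
            continue                       # off-site (or mailto:) link
        path = parsed.path.lower()
        if any(k in path for k in keywords) and url not in selected:
            selected.append(url)
    return selected
```

Restricting the crawl to a handful of likely "about the company" pages is what keeps the spider cheap: only a few requests per domain rather than a full site crawl.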

8 Experimental Setup: Data
  Classification scheme: NAICS
    11     Agriculture, Forestry, Fishing and Hunting
    21     Mining
    23     Construction
    31-33  Manufacturing
    42     Wholesale Trade
    44-45  Retail Trade
    48-49  Transportation and Warehousing
    51     Information
    52     Finance and Insurance
    53     Real Estate and Rental and Leasing
    54     Professional, Scientific and Technical Services
    55     Management of Companies and Enterprises
    56     Admin. Support, Waste Mgmt and Remediation Srvcs
    61     Educational Services
    62     Health Care and Social Assistance
    71     Arts, Entertainment & Recreation
    72     Accommodation and Food Services
    81     Other Services (except 92)
    92     Public Administration
    99     Unclassified Establishments
  Test Data: ~30,000 domain names (SIC); ~13,500 pre-classified/content
  Training Data:
    "SEC-NAICS": 1504 SEC 10-K filings (SIC); 426 NAICS labels/descriptions
    "Web pages": 3618 pre-classified domains
  Crosswalk: SIC to NAICS

9 Experimental Setup: System Architecture
  [Diagram. Recoverable elements: The Web; Domain Names; Spider; Text Query; IR Engine; training sets (SEC-NAICS, Web pages); Matching documents; Decision; example output: Foo.com 11, 21, 23.]
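The architecture slide suggests that a site's text is sent as a query against an index of labeled training documents, and that the matching documents feed a decision step that outputs category codes. The sketch below is one plausible reading of that pipeline, not the system's actual implementation. TF-IDF weighting, cosine similarity, the score-weighted vote over the top `k` matches, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def build_index(training_docs):
    """training_docs: list of (category_code, text) pairs.

    Returns (vectors, idf), where vectors is a list of
    (category_code, {term: tf-idf weight}).
    """
    counted = [(code, Counter(tokenize(text))) for code, text in training_docs]
    doc_freq = Counter()
    for _, counts in counted:
        doc_freq.update(counts.keys())
    n_docs = len(counted)
    idf = {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}
    vectors = [(code, {t: c * idf[t] for t, c in counts.items()})
               for code, counts in counted]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query_text, vectors, idf, k=3, top_n=3):
    """Rank categories by summed similarity of the k best-matching docs."""
    counts = Counter(tokenize(query_text))
    query = {t: c * idf[t] for t, c in counts.items() if t in idf}
    matches = sorted(((cosine(query, vec), code) for code, vec in vectors),
                     reverse=True)[:k]
    votes = defaultdict(float)
    for score, code in matches:
        if score > 0:
            votes[code] += score
    return sorted(votes, key=votes.get, reverse=True)[:top_n]
```

Returning a short ranked list rather than a single label mirrors the slide's example, where Foo.com receives several candidate codes.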

10 Results
  P = Precision = (# correctly assigned) / (# assigned)
  R = Recall = (# correctly assigned) / (# total correct)
  F1 = 2PR / (P + R)
  micro-averaged = computed over all categories
  macro-averaged = computed per category, then averaged
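The difference between the two averaging modes is easiest to see in code. The following is a minimal sketch, assuming each domain may carry a set of category codes; the function names and the input layout (dicts mapping domain to a set of codes) are hypothetical. Macro F1 is computed here as the mean of the per-category F1 values, which is one common convention.

```python
from collections import Counter


def f1(p, r):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0


def ratio(num, den):
    """Safe division: returns 0.0 when the denominator is 0."""
    return num / den if den else 0.0


def evaluate(gold, predicted):
    """gold, predicted: dict mapping domain -> set of category codes.

    Only domains present in gold are scored.
    """
    correct, assigned, total = Counter(), Counter(), Counter()
    for domain, true_codes in gold.items():
        pred = predicted.get(domain, set())
        for code in pred:
            assigned[code] += 1
        for code in true_codes:
            total[code] += 1
        for code in pred & true_codes:
            correct[code] += 1

    # Micro: pool the counts over all categories, then compute P and R.
    micro_p = ratio(sum(correct.values()), sum(assigned.values()))
    micro_r = ratio(sum(correct.values()), sum(total.values()))

    # Macro: compute P, R, F1 per category, then average them.
    categories = sorted(set(assigned) | set(total))
    per_p = [ratio(correct[c], assigned[c]) for c in categories]
    per_r = [ratio(correct[c], total[c]) for c in categories]
    per_f = [f1(p, r) for p, r in zip(per_p, per_r)]
    n = len(categories)

    return {
        "micro": (micro_p, micro_r, f1(micro_p, micro_r)),
        "macro": (ratio(sum(per_p), n), ratio(sum(per_r), n),
                  ratio(sum(per_f), n)),
    }
```

Micro-averaging is dominated by the large categories, while macro-averaging gives a rare category the same weight as a common one, so the two can diverge sharply on a skewed category distribution.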

11 Conclusions
  Domain Specific Classification
    Knowledge Gathering: Use of specialized knowledge
    Targeted Spidering: Efficient use of resources; Extract key features, Metadata
    Training: Prior knowledge; Bootstrapping
    Classification: Robust, tolerant of noisy data
  Benefits of Semantic Web
    Better Metadata
    Semantic linking & intelligent spidering

