Practical Issues for Automated Categorization of Web Sites John M. Pierre Metacode Technologies, Inc. 139 Townsend Street San Francisco,

Presentation transcript:

Practical Issues for Automated Categorization of Web Sites John M. Pierre Metacode Technologies, Inc. 139 Townsend Street San Francisco, CA (Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)

Outline
  Project overview
  Web content
    Automated Categorization
    Feature Selection
    Metadata
  Experimental Setup
    Data
    Targeted Spidering
    System Architecture
  Results
  Conclusions

Project Overview
  Specific:
    Categorize large number of domain names by industry category
    NAICS classification scheme
    ~30,000 domain names for testing (.com)
    Text categorization approach
  General:
    Domain specific classification
    Metadata
    Targeted spidering
    Feature selection
    Classifier training

Web Content: Automated Categorization
  Challenges:
    Vast (over 1 billion pages)
    Heterogeneous (content, formats, not just HTML)
    Dynamic (growing, changing)
  Benefits:
    Good source of information
    Accessible!
    Machine readable (vs. machine understandable)
    Semi-structured
  Tools:
    Classification
    Automated classification
    Text Categorization/Machine Learning
    Intelligent agents

Related Work
  Manual: Yahoo!; Open Directory Project; Looksmart
  Automatic: Northern Light; Thunderstone/Texis; Inktomi
  Other: EU Project DESIRE II; Pharos; Attardi, Sebastiani et al.; L. Page et al.; McCallum et al.

Web Content: Feature Selection
  Text features should be (D. Lewis):
    Relatively few in number
    Moderate in frequency of assignment
    Low in redundancy
    Low in noise
    Related in semantic scope to the classes to be assigned
    Relatively unambiguous in meaning
  Preliminary experiment:
    1125 web domains
    SEC+NAICS training set
    Use metadata if possible; use body text as a last resort!
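The "metadata first, body text as last resort" rule from this slide can be sketched in a few lines. This is an illustration only: the function name `select_feature_text`, the choice of the `description` and `keywords` fields, and the returned source tag are assumptions, not details given in the presentation.

```python
def select_feature_text(meta, body_text):
    """Pick the text to feed the classifier.

    meta: dict of metatag name -> content (hypothetical input shape).
    Returns (text, source) where source is "metadata" or "body".
    """
    # Prefer author-supplied metadata when any of it is non-empty.
    parts = [meta.get(key, "").strip() for key in ("description", "keywords")]
    text = " ".join(p for p in parts if p)
    if text:
        return text, "metadata"
    # Fall back to the page body only when no metadata is available.
    return body_text.strip(), "body"
```

Returning the source alongside the text makes it easy to report later how many domains were classified from metadata and how many from body text.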

Web Content: Metadata

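Experimental Setup: Targeted Spidering
  [Flowchart; only the node labels survive in this transcript. Steps: Domain name; HTTP Get; Try www.; Live?; Frames?; Metatags?; <a href=? (links matching prod, service, about, info, press, news); 'Query' Pages; Use; Send Query. Branches are labeled Yes/No.]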
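The labels on this slide suggest a spider that checks each fetched page for metatags, frames, and links whose targets contain words such as prod, service, about, info, press, or news. The sketch below shows one way the page-scanning step might look using Python's standard `html.parser`. The class name `PageScanner`, the function `scan`, and the restriction to `description` and `keywords` metatags are assumptions made for illustration; the fetching and retry logic (HTTP Get, trying a `www.` prefix) is left out.

```python
from html.parser import HTMLParser

# Substrings taken from the slide; a link is "targeted" if its href contains one.
TARGET_KEYWORDS = ("prod", "service", "about", "info", "press", "news")

class PageScanner(HTMLParser):
    """Collects metatags, frame sources, and anchor hrefs from one page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.frames = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") and a.get("content"):
            name = a["name"].lower()
            if name in ("description", "keywords"):  # assumed fields of interest
                self.meta[name] = a["content"]
        elif tag in ("frame", "iframe") and a.get("src"):
            self.frames.append(a["src"])
        elif tag == "a" and a.get("href"):
            self.links.append(a["href"])

def scan(html):
    """Return (metatags, frame sources, targeted links) for an HTML string."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    targeted = [h for h in scanner.links
                if any(k in h.lower() for k in TARGET_KEYWORDS)]
    return scanner.meta, scanner.frames, targeted
```

A caller would follow the frame sources and targeted links for a small, bounded number of extra fetches, which keeps the crawl focused on the pages most likely to describe what the company does.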

Experimental Setup: Data
  Classification scheme: NAICS
    11     Agriculture, Forestry, Fishing and Hunting
    21     Mining
    23     Construction
    31-33  Manufacturing
    42     Wholesale Trade
    44-45  Retail Trade
    48-49  Transportation and Warehousing
    51     Information
    52     Finance and Insurance
    53     Real Estate and Rental and Leasing
    54     Professional, Scientific and Technical Services
    55     Management of Companies and Enterprises
    56     Admin. Support, Waste Mgmt and Remediation Srvcs
    61     Educational Services
    62     Health Care and Social Assistance
    71     Arts, Entertainment & Recreation
    72     Accommodation and Food Services
    81     Other services (except 92)
    92     Public Administration
    99     Unclassified Establishments
  Test data:
    ~30,000 domain names (SIC)
    ~13,500 pre-classified/content
  Training data:
    "SEC-NAICS": 1504 SEC 10-K filings (SIC); 426 NAICS labels/descriptions
    "Web pages": 3618 pre-classified domains
  Crosswalk: SIC to NAICS
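Because some NAICS sectors span a range of two-digit codes (31-33, 44-45, 48-49), mapping a full code to its top-level sector needs a small expansion step. The sketch below shows one way to do that for the sectors listed on this slide. The names `SECTOR_TABLE` and `sector_of` are hypothetical, only a few rows are shown for brevity, and codes not on the slide return `None` rather than a guessed sector.

```python
# A few rows from the slide's sector list; extend with the remaining rows as needed.
SECTOR_TABLE = [
    ("11", "Agriculture, Forestry, Fishing and Hunting"),
    ("21", "Mining"),
    ("23", "Construction"),
    ("31-33", "Manufacturing"),
    ("42", "Wholesale Trade"),
    ("44-45", "Retail Trade"),
    ("48-49", "Transportation and Warehousing"),
    ("51", "Information"),
    ("52", "Finance and Insurance"),
]

def _expand(table):
    """Turn range keys such as '31-33' into one entry per two-digit prefix."""
    lookup = {}
    for key, name in table:
        low, _, high = key.partition("-")
        for n in range(int(low), int(high or low) + 1):
            lookup[f"{n:02d}"] = name
    return lookup

SECTOR_BY_PREFIX = _expand(SECTOR_TABLE)

def sector_of(naics_code):
    """Return the sector name for a NAICS code, or None if the prefix is not listed."""
    return SECTOR_BY_PREFIX.get(str(naics_code)[:2])
```

For example, `sector_of("334111")` resolves through the prefix `33` to Manufacturing.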

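Experimental Setup: System Architecture
  [Diagram; only the labels survive in this transcript. Components: The Web; Domain Names; Spider; IR Engine; Decision; training collections "SEC-NAICS" and "Web pages". Flow labels: Text Query; Matching documents. Example output: Foo.com: 11, 21, 23.]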
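The diagram labels describe a pipeline in which spidered text is sent as a query to an IR engine over labeled training documents, and a decision step turns the matching documents into category codes. The sketch below is one plausible reading of that idea, not the system described in the talk: it assumes TF-IDF weighting with cosine similarity and a score-weighted vote among the top k matches. `TinyIndex`, `categorize`, and the parameters `k` and `top_n` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return "".join(ch.lower() if ch.isalnum() else " " for ch in text).split()

class TinyIndex:
    """Minimal in-memory retrieval index over (category, text) training documents."""

    def __init__(self, labeled_docs):
        self.labels = [category for category, _ in labeled_docs]
        counts = [Counter(tokenize(text)) for _, text in labeled_docs]
        n = len(counts)
        doc_freq = Counter(term for c in counts for term in c)
        self.idf = {t: math.log(1 + n / d) for t, d in doc_freq.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        """Unit-length TF-IDF vector; terms unseen in training are dropped."""
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query, k=3):
        """Return up to k (category, cosine score) pairs with a positive score."""
        q = self._weigh(Counter(tokenize(query)))
        scored = [(sum(w * doc.get(t, 0.0) for t, w in q.items()), i)
                  for i, doc in enumerate(self.vectors)]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.labels[i], s) for s, i in scored[:k]]

def categorize(index, text, k=3, top_n=3):
    """Decision step: sum scores per category among the top-k matching documents."""
    votes = defaultdict(float)
    for label, score in index.search(text, k):
        votes[label] += score
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ranked[:top_n]]
```

Returning a short ranked list mirrors the slide's example, where a single domain receives several candidate codes.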

Results
  P = Precision = # correctly assigned / # assigned
  R = Recall = # correctly assigned / # total correct
  F1 = 2 P R / (P + R)
  micro-averaged = computed over all categories
  macro-averaged = computed per category, then averaged
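The difference between micro- and macro-averaging is easiest to see in code. The sketch below computes both from sets of predicted and true category labels per domain. The function names and the dict-of-sets input shape are assumptions, and macro-averaged F1 is taken here as the mean of per-category F1 values, which is one common convention.

```python
from collections import Counter

def prf(correct, assigned, relevant):
    """Precision, recall, and F1 from raw counts; 0.0 when a denominator is zero."""
    p = correct / assigned if assigned else 0.0
    r = correct / relevant if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def evaluate(predicted, truth):
    """predicted, truth: dict mapping item -> set of category labels.

    Items missing from `predicted` count as having no assigned labels.
    Returns (micro, macro), each a (P, R, F1) tuple.
    """
    correct, assigned, relevant = Counter(), Counter(), Counter()
    for item, true_labels in truth.items():
        pred_labels = predicted.get(item, set())
        assigned.update(pred_labels)
        relevant.update(true_labels)
        correct.update(pred_labels & true_labels)
    # Micro: pool the counts over all categories, then compute once.
    micro = prf(sum(correct.values()), sum(assigned.values()), sum(relevant.values()))
    # Macro: compute per category, then take the unweighted mean.
    categories = sorted(set(assigned) | set(relevant))
    per_category = [prf(correct[c], assigned[c], relevant[c]) for c in categories]
    n = len(per_category)
    macro = (tuple(sum(x[i] for x in per_category) / n for i in range(3))
             if n else (0.0, 0.0, 0.0))
    return micro, macro
```

Micro-averaging is dominated by the most frequent categories, while macro-averaging weights every category equally, so the two can diverge sharply when the category distribution is skewed.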

Conclusions
  Domain Specific Classification
    Knowledge Gathering
      Use of specialized knowledge
    Targeted Spidering
      Efficient use of resources
      Extract key features, Metadata
    Training
      Prior knowledge
      Bootstrapping
    Classification
      Robust, tolerant of noisy data
  Benefits of Semantic Web
    Better Metadata
    Semantic linking & intelligent spidering