CS 345 Data Mining Lecture 1 Introduction to Web Mining.

Slides:



Advertisements
Similar presentations
© 2006 KDnuggets [16/Nov/2005:16:32: ] "GET /jobs/ HTTP/1.1" "
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Web as Network: A Case Study Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.
1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.
Acroterion Search Engine Solutions Presentation The next mousetrap Mom, What's a Library? Search Engine Marketing is Necessity Billions of dollars poured.
Information Retrieval Lecture 8 Introduction to Information Retrieval (Manning et al. 2007) Chapter 19 For the MSc Computer Science Programme Dell Zhang.
Computer Information Technology – Section 3-2. The Internet Objectives: The Student will: 1. Understand Search Engines and how they work 2. Understand.
Principles of IR Hacettepe University Department of Information Management DOK 324: Principles of IR.
Search Engine Marketing Free Traffic for Your Web Site Paul Allen, CEO
Search Engines & Search Engine Optimization (SEO) Presentation by Saeed El-Darahali 7 th World Congress on the Management of e-Business.
WebMiningResearch ASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007.
CS 345A Data Mining Lecture 1
CS 345A Data Mining Lecture 1 Introduction to Web Mining.
Web as Graph – Empirical Studies The Structure and Dynamics of Networks.
Sigir’99 Inside Internet Search Engines: Fundamentals Jan Pedersen and William Chang.
WebMiningResearchASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007 Revised.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
CS 345 Data Mining Online algorithms Search advertising.
Network Science and the Web: A Case Study Networked Life CIS 112 Spring 2009 Prof. Michael Kearns.
Overview of Web Data Mining and Applications Part I
Search Engine Optimization (SEO)
SEARCH ENGINES By, CH.KRISHNA MANOJ(Y5CS021), 3/4 B.TECH, VRSEC. 8/7/20151.
Internet Research Search Engines & Subject Directories.
 Search engines are programs that search documents for specified keywords and returns a list of the documents where the keywords were found.  A search.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
1 1 Lecture 4: Information Retrieval and Web Mining
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
CSE Data Mining, 2002Lecture 11.1 Data Mining - CSE5230 Web Mining CSE5230/DMS/2002/11.
Search Engines & Search Engine Optimization (SEO).
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
1 Search Engine Optimization An introduction to optimizing your web site for best possible search engine results.
Web Searching. How does a search engine work? It does NOT search the Web (when you make a query) It contains a database with info on numerous Web sites.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Lecture 4 Title: Search Engines By: Mr Hashem Alaidaros MKT 445.
Search Engine Marketing The Tools of Online Marketing and Thought Leadership.
Search Engine Optimization 101 What is SEM? SEO? How can I use SEO on my blogs and/or my personal web space?
McLean HIGHER COMPUTER NETWORKING Lesson 7 Search engines Description of search engine methods.
Curtis Spencer Ezra Burgoyne An Internet Forum Index.
Search Engine Marketing SEM = Search Engine Marketing SEO = Search Engine Optimization optimizing (altering/changing) your page in order to get a higher.
1 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? Jiawei Han Simon Fraser University, Canada ACM-SIGMOD’99 Web Mining Panel Presentation.
Mining real world data Web data. World Wide Web Hypertext documents –Text –Links Web –billions of documents –authored by millions of diverse people –edited.
Search Engines By: Faruq Hasan.
1 1 Lecture 4: Information Retrieval and Web Mining
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Information Retrieval (9) Prof. Dragomir R. Radev
Week 1 Introduction to Search Engine Optimization.
Chapter 8: Web Analytics, Web Mining, and Social Analytics
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
1 Chapter 5 (3 rd ed) Your library is an excellent resource tool. Your library is an excellent resource tool.
(Big) data accessing Prof. Wenwen Li School of Geographical Sciences and Urban Planning 5644 Coor Hall
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Data mining in web applications
22C:145 Artificial Intelligence
Search Engine Optimization
Dr. Frank McCown Comp 250 – Web Development Harding University
Web Mining Ref:
Introduction to Web Mining
Data Mining Chapter 6 Search Engines
Agenda What is SEO ? How Do Search Engines Work? Measuring SEO success ? On Page SEO – Basic Practices? Technical SEO - Source Code. Off Page SEO – Social.
CS 345A Data Mining Lecture 1
CS 345A Data Mining Lecture 1
Introduction to Web Mining
CS 345A Data Mining Lecture 1
Presentation transcript:

CS 345 Data Mining Lecture 1 Introduction to Web Mining

What is Web Mining?  Discovering useful information from the World-Wide Web and its usage patterns  Applications Web search e.g., Google, Yahoo,… Vertical Search e.g., FatLens, Become,… Recommendations e.g., Amazon.com Advertising e.g., Google, Yahoo Web site design e.g., landing page optimization

How does it differ from “classical” Data Mining?  The web is not a relation Textual information and linkage structure  Usage data is huge and growing rapidly Google’s usage logs are bigger than their web crawl Data generated per day is comparable to largest conventional data warehouses  Ability to react in real-time to usage patterns No human in the loop

The World-Wide Web  Huge  Distributed content creation, linking (no coordination)  Structured databases, unstructured text, semistructured  Content includes truth, lies, obsolete information, contradictions, …  Our modern-day Library of Alexandria The Web

Size of the Web  Number of pages Technically, infinite  Because of dynamically generated content  Lots of duplication (30-40%) Best estimate of “unique” static HTML pages comes from search engine claims  Google = 8 billion, Yahoo = 20 billion  Lots of marketing hype  Number of unique web sites Netcraft survey says 72 million sites (

Netcraft survey

The web as a graph  Pages = nodes, hyperlinks = edges Ignore content Directed graph  High linkage 8-10 links/page on average Power-law degree distribution

Source: Broder et al, 2000

Power-laws galore  In-degrees  Out-degrees  Number of pages per site  Number of visitors  Let’s take a closer look at structure Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls Bow-tie structure  Not a “small world”

Bow-tie Structure Source: Broder et al, 2000

Searching the Web Content aggregators The Web Content consumers

Ads vs. search results

 Search advertising is the revenue model Multi-billion-dollar industry Advertisers pay for clicks on their ads  Interesting problems How to pick the top 10 results for a search from 2,230,000 matching pages? What ads to show for a search? If I’m an advertiser, which search terms should I bid on and how much to bid?

Sidebar: What’s in a name?  Geico sued Google, contending that it owned the trademark “Geico” Thus, ads for the keyword geico couldn’t be sold to others  Court Ruling: search engines can sell keywords including trademarks  No court ruling yet: whether the ad itself can use the trademarked word(s)

Extracting Structured Data

Extracting structured data

The Long Tail Source: Chris Anderson (2004)

The Long Tail  Shelf space is a scarce commodity for traditional retailers Also: TV networks, movie theaters,…  The web enables near-zero-cost dissemination of information about products  More choices necessitate better filters Recommendation engines (e.g., Amazon) How Into Thin Air made Touching the Void a bestseller

Web Mining topics  Crawling the web  Web graph analysis  Structured data extraction  Classification and vertical search  Collaborative filtering  Web advertising and optimization  Mining web logs  Systems Issues

Web search basics The Web Ad indexes Web crawler Indexer Indexes Search User

Search engine components  Spider (a.k.a. crawler/robot) – builds corpus Collects web pages recursively  For each known URL, fetch the page, parse it, and extract new URLs  Repeat Additional pages from direct submissions & other sources  The indexer – creates inverted indexes Various policies wrt which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.  Query processor – serves query results Front end – query reformulation, word stemming, capitalization, optimization of Booleans, etc. Back end – finds matching documents and ranks them