Published by Augustine Harrison. Modified over 10 years ago.
1
Pay for Placement Search
2
Copyright GoTo.com, 2/19/2001, 2
Agenda
• Search Engines
  - Where did they come from?
  - How do they work?
  - Who’s the biggest?
  - Why GoTo is the coolest.
• What type of stuff do you need to support the web’s 2nd* largest search engine?
  - Architecture, infrastructure, nuts and bolts
  - Performance
  - Operations
• What kind of people (and how many) do you need to do this kind of business?
• Where is the Internet going? What's going to happen to search engines?
*Don’t quote me
3
Ancient History
• The precursors
  - Archie (1990): FTP-based file indexing and retrieval
  - Gopher (1992): document network (non-FTP)
• The early ‘bots (1992-1993)
  - WWW Wanderer (Wandex): servers, then URLs
  - Aliweb: index the web like Archie, with site index retrieval
• Then came the spiders (1993+)
  - WWW Worm
  - Excite (Architext), 2/93 from Stanford
4
All Done? Wrong!
• Problems with spiders:
  - Get lots of data, but no intelligence to map pages to concept space
  - Problem still exists today (spamming)
• The solution? Searchable directories: human-crafted hierarchies.
  - Tradewave Galaxy (1/94)
  - Yahoo! (4/94), Filo and Yang of Stanford
5
I Give Up – Let’s Search Everyone!
• Here come the metasearchers!
  - MetaCrawler, go2net, Dogpile (1995)
  - Mamma
  - Search.com (CNet)
• Spray out searches to several engines, then combine the results
6
The Universe Divides (kinda)
• The crawler-based search engines
  - Lycos (7/94): the wolf spider
  - Infoseek (4/94)
  - AltaVista (12/95)
  - Inktomi (Slurp), HotBot (5/96): the Plains Indians spider myth
  - Google, Northern Light, Excite, FAST, Direct Hit, and more…
• The directory/editorial-based search engines
  - Yahoo! (4/94)
  - LookSmart (5/95)
  - Snap.com
  - ODP (NewHoo), dmoz (1/98)
  - Ask Jeeves (4/97)
  - GoTo (6/98)
7
How Crawlers Work (or don’t)
• Start with a list of URLs (submitted, generated from somewhere)
• For each site
  - Get the base page
  - ‘Catalog’ the page based on crawler-specific implementation
  - Follow links on the page and recurse
• Some details
  - META tags
  - Robots.txt

    # /robots.txt file for http://goto.com/
    # disallow all robots from crawling GoTo
    User-agent: *
    Disallow: /
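The crawl loop and the robots.txt check on this slide can be sketched roughly as follows. This is a minimal illustration, not any real crawler's implementation: the `PAGES` dictionary stands in for the web, the `sketch-bot` agent name is made up, the "catalog" step is a placeholder, and a queue replaces literal recursion. Only the standard library's `urllib.robotparser` is used.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> links found on that page.
PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/private/x"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/private/x": [],
}


def make_robots(lines):
    """Build a robots.txt checker from the file's lines (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def crawl(seeds, robots, agent="sketch-bot"):
    """Visit pages breadth-first, skipping anything robots.txt disallows."""
    seen, cataloged = set(), []
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or not robots.can_fetch(agent, url):
            continue
        seen.add(url)
        cataloged.append(url)              # placeholder for the "catalog" step
        queue.extend(PAGES.get(url, []))   # follow links found on the page
    return cataloged
```

With the GoTo robots.txt shown on the slide (`Disallow: /` for every agent), a well-behaved crawler of this kind would catalog nothing.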
8
Some Search Engine Examples
• Inktomi
  - Infrastructure only: you pay for the search results
  - Used to power Yahoo! (now Google), HotBot, many others
  - Now typically a fall-through placement (bidded or other paid inclusion first, then Inktomi results)
• Google
  - Sergey and Larry
  - Powers Yahoo!, virgin.net, some others
  - Searching for a revenue model
9
Inktomi ‘Slurp’ Crawler
• Slurp characteristics
  - Starts with active submitted URLs
  - Hierarchy of importance
    - Page title
    - Description meta
    - Keyword meta
    - Text in document (not in images)
  - No frames
  - Looks for spoofing tricks (drop page)
  - 4-week full cycle (constant incremental)
  - Many different indices created (for various customers), different depths, etc.
10
Some Cataloging Approaches (cont.)
• Google
  - Backrub/Googlebot crawler
  - PageRank™
    - Page A; pages linking to A: T1..Tn; links on A: C(A)
    - PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
    - ~probability distribution that a random surfer hits a page based on links
  - Cache the documents (no kidding)
  - All kinds of tweaks to the PageRank, including:
    - Domain tweaks (.org, .gov, .edu)
    - Serious bias against large pages
    - Bias against dynamic pages (.asp, .jhtml, .jsp)
  - Check out http://www.searchengineworld.com/google
  - Original design at http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
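The PageRank formula on this slide can be iterated directly. The sketch below is only an illustration of that formula: the three-page graph is made up, `d = 0.85` is the damping value suggested in the original design paper, every page is assumed to have at least one outgoing link, and none of the "tweaks" listed on the slide are modeled.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    out_count = {page: len(links[page]) for page in pages}   # C(T)
    pr = {page: 1.0 for page in pages}                       # starting guess
    for _ in range(iterations):
        updated = {}
        for a in pages:
            incoming = sum(pr[t] / out_count[t] for t in pages if a in links[t])
            updated[a] = (1 - d) + d * incoming
        pr = updated
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
example_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example_graph)
```

In this toy graph C collects links from both A and B, so it ends up with the highest rank, while B, linked only from A, ends up lowest.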
11
Who’s the ‘biggest’ Search Engine
• What is ‘big’
  - Number of documents indexed (SearchEngineWatch, 11/8/200)
  - KEY: GG=Google, FAST=FAST, WT=WebTop.com, INK=Inktomi, AV=AltaVista, NL=Northern Light, EX=Excite, Go=Go (Infoseek).
12
Who’s the ‘biggest’ Search Engine
• What is ‘big’
  - Searches/day
    - Total Web: 500mm/day (ptr estimate)
    - Yahoo!: 100mm
    - AltaVista: 50mm (international too)
    - Google: 50mm
    - Inktomi: 40mm
    - Everyone else: 10mm or fewer
  - Where’s GoTo? Hint
13
Let’s Talk About GoTo
• Basic business model: middlemen for textual advertisements (search results)
  - Advertisers provide us search listings (title, URL, description, bid) for a search term
  - We charge advertisers for user clicks on search listings
  - We serve search listings to our own site (www.goto.com, 5%) and to partner sites (affiliates like AltaVista, AOL, Netscape, CNet, etc., 95%)
• Since we make money when people search (and click), we pay sites to include our listings
• Live auction for search results
14
The Scale of Operations
• Search volume: 70mm+/day, capacity for 210mm/day
• 300mm impressions/day
• 10mm clicks/day: on the scale of a medium/large phone company
• 6mm+ search listings
• 40,000+ advertisers
• Wow
15
Systems Strategic Bombing View
16
It Can’t be that Simple, Right?
• Right!
17
It Can’t be that Simple, Right?
GoTo’s systems seem deceptively simple.
• GoTo’s pay-for-performance search product seems simple to execute: advertisers provide the content in the form of search listings, the content is ordered by bid price, and advertisers are charged for resulting clicks.
• The complexity of these systems comes from the scale of the problem (number of advertisers, search listings, searches per day, etc.), in addition to some non-apparent complications (e.g. fraud detection).
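The "simple" core described on this slide (listings ordered by bid, advertisers charged per click) fits in a few lines. The sketch below is an illustration under assumptions, not GoTo's system: the `Listing` fields follow the title/URL/description/bid tuple from the business-model slide, while the integer-cents unit, the prepaid `balances` dictionary, and charging exactly the bid amount per click are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    advertiser: str
    title: str
    url: str
    description: str
    bid_cents: int  # assumed unit: price per click in cents


def rank_listings(listings):
    """Order listings for one search term, highest bid first (ties keep input order)."""
    return sorted(listings, key=lambda listing: listing.bid_cents, reverse=True)


def charge_click(balances, listing):
    """Deduct the bid from the advertiser's balance and return what is left."""
    balances[listing.advertiser] -= listing.bid_cents
    return balances[listing.advertiser]


# Hypothetical listings competing on a single search term.
listings = [
    Listing("acme", "Acme Flowers", "http://acme.example/", "Fresh flowers", 25),
    Listing("zed", "Zed Florist", "http://zed.example/", "Same-day delivery", 40),
    Listing("bob", "Bob's Blooms", "http://bob.example/", "Local bouquets", 10),
]
results = rank_listings(listings)
```

Everything the later slides cover (account rules, editorial review, fraud filtering, serving at volume) sits around this small core.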
18
Architecture Features
• High availability: Noah’s Ark approach, no single point of failure
  - Load balancers
  - State migration
• Scalability: no architectural changes to scale serving capacity.
• Extensibility: can add search features incrementally.
• Distributed content: multiple sites currently serving all partners.
19
Advertiser Management
20
Advertiser Tools
• DirecTraffic Center®
  - DirecTraffic Center functions: manage account balance, report on activity, real-time bid charges, add/modify/delete search listings
  - ATG/Dynamo (jhtml)/Java, EJB search listing services (BEA/WebLogic), custom cache reporting scheme based on Oracle 8i
21
Advertiser Management Systems
22
Account Monitoring
• The real ‘special sauce’
  - Listens to real-time clicks and monitors account activity to process notifications, automated changes, and status changes
  - Manages credit limits, monthly advertiser budgets, activation and deactivation of accounts, and over 300 different business rules around accounts
  - EJB, WebLogic
23
Editorial Processing
• We are a publishing business
  - 100 editors
  - Workflow for 50,000-100,000 work orders a month
  - Review all listings (with some help)
  - EJB/desktop app (Swing)
24
Fraud Detection and Reporting
25
Event Processing – What Are Events?
• LWES: Light Weight Event System
  - UDP-multicast-based events thrown by front-end systems
  - Events include
    - Searches
    - Clicks (redirects)
    - Navigation
  - Events are key/value pairs
  - ‘Caught’ by separate journaling systems
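To make "key/value events thrown over UDP multicast" concrete, here is a rough sketch. It does not reproduce the actual LWES wire format or API: JSON is used purely as a stand-in encoding, and the multicast group, port, and attribute names are hypothetical. Only Python's standard `json` and `socket` modules are used.

```python
import json
import socket

MCAST_GROUP = "224.1.1.1"  # hypothetical multicast group
MCAST_PORT = 9191          # hypothetical port


def encode_event(event_type, attrs):
    """Serialize an event as a typed bag of key/value pairs (JSON stand-in)."""
    return json.dumps({"type": event_type, "attrs": attrs}, sort_keys=True).encode("utf-8")


def decode_event(payload):
    """Inverse of encode_event, as a journaling listener might apply it."""
    record = json.loads(payload.decode("utf-8"))
    return record["type"], record["attrs"]


def emit(event_type, attrs, group=MCAST_GROUP, port=MCAST_PORT):
    """Fire-and-forget: send one datagram to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(encode_event(event_type, attrs), (group, port))
    finally:
        sock.close()


# Round trip without touching the network.
payload = encode_event("Click", {"term": "flowers", "listing": "L42", "ip": "10.0.0.1"})
event_type, attrs = decode_event(payload)
```

The appeal of this style is that front-end servers never wait on a consumer: they throw the datagram and move on, and any number of journaling systems can listen to the same group.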
26
What do we do with these events?
• Result clicks (i.e. clicks we charge the advertiser for) go to fraud detection
  - Patent-pending system that monitors our web site behavior to detect potentially fraudulent activity.
  - The systems analyze millions of transactions daily for suspicious behavior, whether malicious or benign, and perform sophisticated rule-based and statistically derived event filtering.
  - GoTo’s Fraud Squad of 8 developers and analysts constantly monitors and improves the fraud detection techniques and tools, and manages the issue treatment and resolution processes.
27
More About Fraud
• Fraud detection: attacks and filters
  - Attacks
    - Inadvertent
      - Crawling spiders run amok
      - Advertisers testing their own listings
    - Malicious
      - Stockholders: the revenue goosers
      - Advertiser vs. advertiser
      - Bored crackers
  - Filters
    - Deterministic: rules-based filters covering user sessions, IP addresses and search terms. The deterministic filters catch all the blatant abuses (repetitive clicking, repetitive searching, “speed” clicking).
    - Probabilistic: behavior-pattern-based filters that discard anomalous click groupings. The probabilistic filters are very good at catching subtle abuses of advertiser resources: traversal of consecutive paid listings, randomized but obviously scripted clicking, expensive clicking.
    - Both deterministic and probabilistic filters are routinely updated to reflect changes in site usage patterns.
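A deterministic filter of the kind described on this slide might look like the sketch below. The two rules (repetitive clicking on one listing from one IP, and "speed" clicking from one IP) come from the slide; the thresholds, the tuple layout of a click, and the function name are assumptions made for illustration, and the real rules are not public in this deck.

```python
from collections import defaultdict


def filter_clicks(clicks, max_per_listing=3, min_gap_seconds=2.0):
    """Split clicks into (valid, discarded) using two fixed rules.

    clicks: list of (timestamp_seconds, ip, listing_id), sorted by timestamp.
    Rule 1 (repetitive clicking): more than max_per_listing clicks from one
            IP on one listing.
    Rule 2 ("speed" clicking): a click less than min_gap_seconds after the
            previous click from the same IP.
    """
    per_listing = defaultdict(int)
    last_click_at = {}
    valid, discarded = [], []
    for timestamp, ip, listing_id in clicks:
        per_listing[(ip, listing_id)] += 1
        too_many = per_listing[(ip, listing_id)] > max_per_listing
        too_fast = ip in last_click_at and timestamp - last_click_at[ip] < min_gap_seconds
        last_click_at[ip] = timestamp
        if too_many or too_fast:
            discarded.append((timestamp, ip, listing_id))
        else:
            valid.append((timestamp, ip, listing_id))
    return valid, discarded
```

Rules like these only catch blatant patterns; subtler abuse, such as randomized scripted clicking, is what the slide assigns to the probabilistic filters.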
28
How do you do this in near-real-time?
• Data pipeline
  - The ‘backbone’ of fraud detection
  - A flexible array (~30) of commodity machines that perform simple aggregations and other arithmetic calculations in a networked and coordinated way
  - A control and processing language used to describe the required calculations, processed by the data pipeline machines
• Click scoring
  - Assignment of a click score to click events, classifying them into various ‘buckets’ of validity
  - Formulas that define the ‘buckets’ based on historical patterns of site behavior and analysis of previous fraudulent attempts
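As a rough picture of click scoring, the sketch below scores a measurement by how far it sits from its historical average and then maps the score to a bucket. The deck does not give the actual formulas, so everything here is assumed for illustration: the standard-deviation score, the bucket names, the thresholds, and the sample history of clicks per session.

```python
from statistics import mean, pstdev


def click_score(value, history):
    """Distance of value from the historical mean, in standard deviations."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0
    return abs(value - mean(history)) / sigma


def bucket(score, suspect_at=2.0, invalid_at=4.0):
    """Map a score to a validity bucket (hypothetical names and thresholds)."""
    if score >= invalid_at:
        return "invalid"
    if score >= suspect_at:
        return "suspect"
    return "valid"


# Hypothetical history: clicks per session seen on earlier days.
history = [2, 4, 4, 4, 5, 5, 7, 9]
label = bucket(click_score(10, history))
```

Both steps are plain aggregation and arithmetic (a mean, a deviation, a comparison), which is the kind of work the slide says the pipeline machines perform.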
29
Search Serving Systems
30
Search Serving Systems
31
The Nitty-Gritty
• Search serving platforms:
  - 100+ Sun E420R, 450 MHz (4), 4 GB
  - ATG/Dynamo/Java, and Apache/mod_perl
  - Gigabit site backbone
  - InterNAP
  - Multiple (3) co-location facilities
  - Search serving feeds include HTML and XML, all through HTTP (1.0 or 1.1)
  - Global load balancing (Arrowpoint)
  - Distributed content caching (Akamai)
• Backend platforms:
  - Data repository (16 TB) for search and click events: several (4) Sun E4500/Oracle 8i machines connected to an MTI SAN
  - Fraud detection through an array (3) of Intel/Linux machines, utilizing custom detection systems
  - CRM via Silknet (NT/2000)
  - N-tier application backbone via EJB (WebLogic) servers; application integration all through XML
  - Complete DR site for fast recovery
32
Facilities
• 6 facilities:
  - Search serving sites
    - Global Center, Sunnyvale CA
    - Cable & Wireless, Reston VA
    - ESAT, Dublin, Ireland
  - Offices
    - Pasadena
    - San Mateo
    - Raleigh-Durham
    - London
  - Development & test site
    - Qwest CyberCenter, Burbank CA
  - Backend processing site (new)
    - Las Vegas, Nevada
33
Search Serving Performance
34
Network Operations Center
35
Network Operations Center
36
GoTo Technology Organization
• Three major technology groups (groupings):
  - Development groups (4)
  - Technical operations
  - Architecture and planning
• About 115 people.
• Number/email to remember: me, 626-685-5743, ptryan@goto.com
37
The perils of an open office plan
38
The future…
• Stickiness models are dead
• The vultures are circling…
• The end for ‘search engines’
  - Everyone needs a revenue model
  - Search Portal ?
  - Pay for placement the norm
39
References
• Web sites about search engines
  - www.searchenginewatch.com
  - www.searchengineworld.com
• Services
  - www.wordtracker.com
• Articles