Presentation transcript: "Pay for Placement Search" (Copyright GoTo.com, 2/19/2001)

1 Pay for Placement Search

2 Agenda (Copyright GoTo.com, 2/19/2001)
- Search Engines
  - Where did they come from?
  - How do they work?
  - Who’s the biggest?
  - Why GoTo is the coolest.
- What type of stuff do you need to support the web’s 2nd* largest search engine?
  - Architecture, infrastructure, nuts and bolts
  - Performance
  - Operations
- What kind of people (and how many) do you need to do this kind of business?
- Where is the Internet going? What's going to happen to search engines?
*Don’t quote me

3 Ancient History
- The Precursors
  - Archie (1990) – FTP-based file indexing and retrieval
  - Gopher (1992) – document network (non-FTP)
- The early ‘bots (1992-1993)
  - WWW Wanderer (Wandex) – servers, then URLs
  - Aliweb – index the web like Archie, with site index retrieval
- Then came the spiders (1993+)
  - WWW Worm
  - Excite (Architext), 2/93 from Stanford

4 All Done? Wrong!
- Problems with Spiders:
  - Get lots of data, but no intelligence to map pages to concept space
  - Problem still exists today (spamming)
- The Solution? Searchable Directories. Human-crafted hierarchies.
  - Tradewave Galaxy (1/94)
  - Yahoo! (4/94), Filo and Yang of Stanford

5 I Give Up – Let’s Search Everyone!
- Here Come the Metasearchers!
  - MetaCrawler, go2net, Dogpile (1995)
  - Mamma
  - Search.com (CNet)
- Spray out searches to several engines – combine the results
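The "spray out and combine" step can be sketched as follows. This is a minimal illustration, not how any of the metasearchers named above actually merged results: the reciprocal-rank scoring rule, the function name `merge_results`, and the example URLs are all assumptions made for the sketch.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Combine ranked URL lists from several engines into one ranking.

    Assumed scoring rule: each engine contributes 1/rank for a URL, so a
    URL returned near the top by several engines rises to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists from two engines for the same query.
engine_a = ["http://a.example/", "http://b.example/", "http://c.example/"]
engine_b = ["http://b.example/", "http://d.example/"]
merged = merge_results([engine_a, engine_b])
```

A URL that appears in both lists is counted once in the merged ranking, which is the de-duplication a metasearcher needs before showing combined results.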

6 The Universe Divides (kinda)
- The Crawler-based Search Engines
  - Lycos (7/94) – the wolf spider
  - Infoseek (4/94)
  - AltaVista (12/95)
  - Inktomi (Slurp) – HotBot (5/96) – the plains Indians spider myth
  - Google, Northern Light, Excite, FAST, Direct Hit, and more…
- The Directory/Editorial-based Search Engines
  - Yahoo! (4/94)
  - LookSmart (5/95)
  - Snap.com
  - ODP (NewHoo) -- dmoz (1/98)
  - Ask Jeeves (4/97)
  - GoTo (6/98)

7 How Crawlers Work (or don’t)
- Start with list of URLs (submitted, generated from somewhere)
- For each Site
  - Get the base page
  - ‘Catalog’ the page based on crawler-specific implementation
  - Follow links on page and recurse
- Some Details
  - META tags
  - Robots.txt

    # /robots.txt file for http://goto.com/
    # disallow all robots from crawling GoTo
    User-agent: *
    Disallow: /
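The crawl loop and the robots.txt check described on this slide can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not any real crawler's implementation: the pages live in a hypothetical in-memory dictionary (`PAGES`) so the example runs without network access, "cataloging" is reduced to storing the raw HTML, links are found with a simple regular expression, and the user-agent name `ExampleBot` is made up. The robots.txt parsing uses Python's standard `urllib.robotparser`.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web" standing in for real HTTP fetches.
PAGES = {
    "http://site.example/": '<a href="/about.html">About</a> '
                            '<a href="/private/x.html">X</a>',
    "http://site.example/about.html": '<a href="/">Home</a>',
    "http://site.example/private/x.html": "secret",
}
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, robots_txt, fetch, user_agent="ExampleBot"):
    """Breadth-first crawl from seed_urls, honoring robots_txt rules."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    catalog = {}
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue  # page not found
        catalog[url] = html  # stand-in for crawler-specific cataloging
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)
            if link not in seen:  # follow each link only once
                seen.add(link)
                queue.append(link)
    return catalog


# A site that only fences off /private/ ...
partial = crawl(["http://site.example/"],
                "User-agent: *\nDisallow: /private/\n", PAGES.get)
# ... versus the blanket exclusion shown on the slide.
blocked = crawl(["http://site.example/"],
                "User-agent: *\nDisallow: /\n", PAGES.get)
```

With the slide's `Disallow: /` rule, a well-behaved crawler catalogs nothing at all, which is the point of that example file.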

8 Some Search Engine Examples
- Inktomi
  - Infrastructure only – you pay for the search results
  - Used to power Yahoo! (now Google), HotBot, many others
  - Now typically a fall-through placement (bidded or other paid inclusion first, then Inktomi results)
- Google
  - Sergey and Larry
  - Powers Yahoo!, virgin.net, some others
  - Searching for a revenue model

9 Inktomi ‘Slurp’ Crawler
- Slurp Characteristics
  - Starts with active submitted URLs
  - Hierarchy of Importance
    - Page Title
    - Description meta
    - Keyword meta
    - Text in document (not in images)
  - No frames
  - Looks for spoofing tricks (drop page)
- 4 week full cycle (constant incremental)
  - Many different indices created (for various customers), different depths, etc.

10 Some Cataloging Approaches (cont.)
- Google
  - Backrub/Googlebot crawler
  - PageRank™
    - Page A; pages linking to A: T1..Tn; number of links on A: C(A)
    - PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
    - ~probability distribution that a random surfer hits a page based on links
  - Cache the documents (no kidding)
  - All kinds of tweaks to the PageRank, including:
    - Domain tweaks (.org, .gov, .edu)
    - Serious bias against large pages
    - Bias against dynamic pages (.asp, .jhtml, .jsp)
  - Check out http://www.searchengineworld.com/google
  - Original design at http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
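The PageRank formula on this slide can be evaluated by simple repeated substitution. The sketch below implements only that basic formula, with none of the tweaks the slide lists; the damping value d = 0.85, the fixed iteration count, and the three-page example graph are assumptions for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the set of pages it links to; C(T) is the
    number of outbound links on T. Assumes every page has at least one
    outbound link (no special handling of dangling pages).
    """
    pages = set(links)
    for targets in links.values():
        pages |= set(targets)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": {"B", "C"}, "B": {"C"}, "C": {"A"}})
```

In this small graph C collects links from both A and B and ends up ranked highest, while B, which is reached only through half of A's links, ranks lowest.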

11 Who’s the ‘biggest’ Search Engine
- What is ‘big’
  - Number of documents indexed (SearchEngineWatch, 11/8/200)
  KEY: GG=Google, FAST=FAST, WT=WebTop.com, INK=Inktomi, AV=AltaVista, NL=Northern Light, EX=Excite, Go=Go (Infoseek).

12 Who’s the ‘biggest’ Search Engine
- What is ‘big’
  - Searches/Day – Total Web 500mm/day (ptr estimate)
    - Yahoo! – 100mm
    - AltaVista – 50mm (International too)
    - Google – 50mm
    - Inktomi – 40mm
    - Everyone else – 10mm or fewer
  - Where’s GoTo? Hint

13 Let’s Talk About GoTo
- Basic Business Model – Middlemen for Textual Advertisements (Search Results)
  - Advertisers provide us Search Listings (Title, URL, Description, bid) for a search term
  - We charge advertisers for user clicks on Search Listings
  - We serve search listings to our own site (www.goto.com – 5%) and to other partner sites (affiliates like AltaVista, AOL, Netscape, CNet, etc. – 95%)
- Since we make money when people search (and click), we pay for sites to include our listings
- Live auction for search results

14 The Scale of Operations
- Search Volume – 70mm+/day, capacity for 210mm/day
- 300mm impressions/day
- 10mm clicks/day – Med/Large Phone company
- 6mm+ search listings
- 40,000+ advertisers
- Wow

15 Systems Strategic Bombing View

16 It Can’t be that Simple, Right?
- Right!

17 It Can’t be that Simple, Right?
GoTo’s systems seem deceptively simple.
- GoTo’s pay-for-performance search product seems simple to execute: advertisers provide the content in the form of search listings, the content is ordered by bid price, and advertisers are charged for resulting clicks.
- The complexity of these systems comes from the scale of the problem (number of advertisers, search listings, searches per day, etc.), in addition to some non-apparent complications (e.g. fraud detection).
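The "deceptively simple" core, ordering listings by bid and charging per click, can be sketched as below. Everything here is hypothetical: the `Listing` fields follow the slide's (Title, URL, Description, bid) tuple, but the names, the use of integer cents, and the assumption that a click is charged exactly the bid amount are choices made for the sketch, not details confirmed by the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    advertiser: str
    title: str
    url: str
    description: str
    bid_cents: int  # assumed: bid per click, in whole cents


def rank_listings(listings):
    """Order listings for one search term by bid, highest first.

    sorted() is stable, so equal bids keep their original order.
    """
    return sorted(listings, key=lambda listing: listing.bid_cents, reverse=True)


def charge_click(balances_cents, listing):
    """Debit the advertiser for one click (assumed: charged at the bid)."""
    balances_cents[listing.advertiser] -= listing.bid_cents
    return balances_cents[listing.advertiser]


# Hypothetical listings competing for the term "flowers".
listings = [
    Listing("acme", "Acme Flowers", "http://acme.example/", "Fresh flowers", 25),
    Listing("bloom", "Bloom Co", "http://bloom.example/", "Same-day delivery", 40),
    Listing("petal", "Petal Shop", "http://petal.example/", "Budget bouquets", 10),
]
ranked = rank_listings(listings)
balances = {"acme": 1000, "bloom": 1000, "petal": 1000}
charge_click(balances, ranked[0])  # a user clicks the top result
```

The hard parts the slide alludes to (millions of listings, tens of millions of searches a day, deciding which clicks are legitimate) all sit outside this loop.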

18 Architecture Features
- High Availability -- Noah’s Ark Approach – no single point of failure
  - Load balancers
  - State migration
- Scalability: no architectural changes to scale serving capacity.
- Extensibility: can add search features incrementally.
- Distributed content: multiple sites currently serving all partners.

19 Advertiser Management

20 Advertiser Tools
- DirecTraffic Center®
  - Functions – manage account balance, report on activity, real-time bid charges, add/modify/delete search listings
  - ATG/Dynamo (jhtml)/Java, EJB search listing services (BEA/Weblogic), custom cache reporting scheme based on Oracle 8i

21 Advertiser Management Systems

22 Account Monitoring
- The real ‘special sauce’
  - Listens to real-time clicks and monitors account activity to process notifications, automated changes, status changes
  - Manages credit limits, monthly advertiser budgets, activation and de-activation of accounts, and over 300 different business rules around accounts
  - EJB – Weblogic

23 Editorial Processing
- We are a publishing business
  - 100 editors
  - Workflow of 50,000-100,000 work orders a month
  - Review all listings (with some help)
  - EJB/Desktop App (Swing)

24 Fraud Detection and Reporting

25 Event Processing – What Are Events?
- LWES – Light Weight Event Systems
  - UDP-multicast based events thrown by front end systems
  - Events include
    - Searches
    - Clicks (redirects)
    - Navigation
  - Events are Key/Value pairs
  - ‘Caught’ by separate Journaling Systems
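The idea of throwing key/value events over UDP and catching them in a separate journaling process can be sketched as follows. This is not the LWES wire format or API: the JSON encoding, the function names, and the event fields are assumptions, and the demo sends to a loopback unicast address rather than a multicast group so that it runs on a single machine.

```python
import json
import socket


def encode_event(event_type, attributes):
    """Serialize an event (a type name plus key/value pairs) as UTF-8 JSON."""
    payload = {"type": event_type, "attrs": attributes}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode_event(datagram):
    """Inverse of encode_event: returns (event_type, attributes)."""
    payload = json.loads(datagram.decode("utf-8"))
    return payload["type"], payload["attrs"]


def emit(sock, address, event_type, attributes):
    """Fire-and-forget: one event per UDP datagram, no acknowledgement."""
    sock.sendto(encode_event(event_type, attributes), address)


def demo_loopback():
    """Throw one click event and catch it with a stand-in journaler."""
    journaler = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    journaler.bind(("127.0.0.1", 0))  # let the OS pick a free port
    journaler.settimeout(2.0)
    emitter = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        emit(emitter, journaler.getsockname(), "Click",
             {"term": "flowers", "rank": "1"})
        datagram, _sender = journaler.recvfrom(65535)
        return decode_event(datagram)
    finally:
        emitter.close()
        journaler.close()


if __name__ == "__main__":
    print(demo_loopback())
```

The design point is that the front end never waits on the listener: it sends a datagram and moves on, so journaling load or failure cannot slow down search serving. The trade-off is that UDP gives no delivery guarantee.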

26 What do we do with these events?
- Result clicks (i.e. the clicks we charge advertisers for) go to fraud detection: a patent-pending system that monitors our web site behavior to detect potentially fraudulent activity.
- The systems analyze millions of transactions daily for suspicious behavior, whether malicious or benign, and perform sophisticated rule-based and statistically-derived event filtering.
- GoTo’s Fraud Squad of 8 developers and analysts constantly monitors and improves the fraud detection techniques and tools, and manages the issue treatment and resolution processes.

27 More About Fraud
- Fraud Detection -- Attacks and Filters
  - Attacks
    - Inadvertent
      - Crawling spiders run amok
      - Advertisers testing their own listings
    - Malicious
      - Stockholder -- the revenue goosers
      - Advertiser vs. Advertisers
      - Bored Crackers
  - Filters
    - Deterministic -- rules-based filters covering user sessions, IP addresses and search terms. The deterministic filters catch all the blatant abuses (repetitive clicking, repetitive searching, “speed” clicking).
    - Probabilistic -- behavior-pattern based, these filters discard anomalous click groupings. The probabilistic filters are very good at catching subtle abuses of advertiser resources: traversal of consecutive paid listings, randomized but obviously scripted clicking, expensive clicking.
    - Both deterministic and probabilistic filters are routinely updated to reflect changes in site usage patterns.
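A deterministic filter of the kind described above (catching repetitive clicking) can be sketched as a single rule. The rule itself, the 60-second window, and the click tuple layout are assumptions for illustration; the presentation does not disclose GoTo's actual rules.

```python
def filter_repetitive_clicks(clicks, window_seconds=60):
    """Split clicks into (billable, discarded) using one fixed rule.

    clicks: iterable of (timestamp_seconds, ip, listing_id), sorted by time.
    Assumed rule: a click is discarded if the same IP clicked the same
    listing less than window_seconds earlier.
    """
    last_seen = {}
    billable, discarded = [], []
    for click in clicks:
        timestamp, ip, listing_id = click
        key = (ip, listing_id)
        previous = last_seen.get(key)
        if previous is not None and timestamp - previous < window_seconds:
            discarded.append(click)
        else:
            billable.append(click)
        # Update even for discarded clicks, so sustained rapid clicking
        # keeps being discarded instead of resetting the window.
        last_seen[key] = timestamp
    return billable, discarded


# Hypothetical click stream: one IP hammers listing 7, another clicks once.
clicks = [
    (0, "10.0.0.1", 7),
    (5, "10.0.0.1", 7),
    (50, "10.0.0.1", 7),
    (55, "10.0.0.2", 7),
    (200, "10.0.0.1", 7),
]
billable, discarded = filter_repetitive_clicks(clicks)
```

A rule like this is cheap and predictable, which is why it suits blatant abuse; the subtler patterns the slide mentions need statistical treatment instead.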

28 How do you do this in near-real-time?
- Data Pipeline
  - The ‘backbone’ of fraud detection
  - A flexible array (~30) of commodity machines that perform simple aggregations and other arithmetic calculations in a networked and coordinated way
  - A control and processing language used to describe the required calculations, and processed by the data pipeline machines.
- Click Scoring
  - Assignment of a click score for click events that classifies them into various ‘buckets’ of validity.
  - Formulas that define the ‘buckets’ based on historical patterns of behavior of the site, and analysis of previous fraudulent attempts.
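The two ideas on this slide, a simple aggregation followed by a score that places activity into validity buckets, can be sketched together. This is a toy stand-in, not GoTo's scoring method: the choice of clicks-per-IP as the aggregate, the z-score against historical counts, the thresholds, and the bucket names are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def aggregate_clicks_per_ip(clicks):
    """Simple aggregation step: count clicks per IP address."""
    return Counter(ip for ip, _listing_id in clicks)


def score_counts(counts, historical_counts):
    """Score each IP by how far its count sits above the historical mean."""
    mu = mean(historical_counts)
    sigma = pstdev(historical_counts) or 1.0  # avoid division by zero
    return {ip: (count - mu) / sigma for ip, count in counts.items()}


def bucket(score, suspect_at=2.0, invalid_at=4.0):
    """Map a score to a validity bucket (thresholds are assumed)."""
    if score >= invalid_at:
        return "invalid"
    if score >= suspect_at:
        return "suspect"
    return "valid"


# Hypothetical history: clicks per IP per hour on a normal day.
history = [1, 2, 1, 3, 2, 1, 2]
# Hypothetical current hour: (ip, listing_id) click events.
clicks = ([("10.0.0.1", 7)] * 2) + ([("10.0.0.2", 7)] * 10) + ([("10.0.0.3", 9)] * 4)
scores = score_counts(aggregate_clicks_per_ip(clicks), history)
buckets = {ip: bucket(score) for ip, score in scores.items()}
```

Because each step is a small aggregation or arithmetic formula, work like this can be spread across many inexpensive machines, which matches the pipeline description above.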

29 Search Serving Systems

30 Search Serving Systems

31 The Nitty-Gritty
- Search Serving Platforms:
  - 100+ Sun e420R, 450MHz (4), 4GB
  - ATG/Dynamo/Java, and Apache/mod_perl
  - Gigabit site backbone
  - InterNAP
  - Multiple (3) co-location facilities
  - Search serving feeds include HTML and XML, all through HTTP (1.0 or 1.1)
  - Global Load Balancing (Arrowpoint)
  - Distributed content caching (Akamai)
- Backend Platforms:
  - Data repository (16TB) for search and click events – several (4) e4500 Sun/Oracle 8i machines connected to an MTI SAN
  - Fraud Detection through an array (3) of Intel/Linux machines, utilizing custom detection systems.
  - CRM via Silknet (NT/2000)
  - N-tier application backbone via EJB (Weblogic) servers – application integration all through XML
  - Complete DR site for fast recovery

32 Facilities
- 6 Facilities:
  - Search Serving Sites
    - Global Center – Sunnyvale CA
    - Cable & Wireless – Reston VA
    - ESAT – Dublin, Ireland
  - Offices
    - Pasadena
    - San Mateo
    - Raleigh-Durham
    - London
  - Development & Test Site
    - Qwest CyberCenter – Burbank CA
  - Backend Processing Site (New)
    - Las Vegas, Nevada

33 Search Serving Performance

34 Network Operations Center

35 Network Operations Center

36 GoTo Technology Organization
- Three Major Technology Groups (groupings):
  - Development Groups (4)
  - Technical Operations
  - Architecture and Planning
- About 115 people.
- Number/Email to Remember:
  - Me – 626-685-5743, ptryan@goto.com

37 The perils of an open office plan

38 The future…
- Stickiness models are dead
- The vultures are circling…
- The end for ‘search engines’
  - Everyone needs a revenue model
  - Search -> Portal -> ?
  - Pay for placement the norm

39 References
- Web Sites about Search Engines
  - www.searchenginewatch.com
  - www.searchengineworld.com
- Services
  - www.wordtracker.com
- Articles

