Scaling to the Modern Internet
CSCI 572: Information Retrieval and Search Engines
Summer 2010

Outline
The paradigm shift: BigData
Search Engine Models for BigData
  – MapReduce
  – GFS
Looking forward: what to do with the data
Upcoming technologies
Challenges

Grand Data Challenges
We've talked about the end-to-end search lifecycle
So, now what?
Projects are collecting huge amounts of data
  – Let's take a few examples

The Square Kilometer Array
1 sq. km of antennas
Never-before-seen resolution looking into the sky
700 TB
  – Per second!

NASA DESDynI Mission
16 TB/day
Geographically distributed
10s of 1000s of jobs per day
Tier 1 Earth Science Decadal Mission

How do we scale?
The biggest search engines index on the order of 40B records
  – Size on disk in the s of GB range
  – Web pages and other forms of content are fairly small
What happens when we have:
  – Indexes on the order of 10 x 40B? What about 100x?
  – Large data files that folks want to make available?
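
A back-of-envelope calculation makes the problem concrete (the ~10 KB-per-record figure is an illustrative assumption, not a number from the slides):

    40B records × ~10 KB each    ≈ 400 TB
    10 × 40B records             ≈ 4 PB
    100 × 40B records            ≈ 40 PB
    At roughly 1 TB per 2010-era commodity disk, even the 10x case
    means thousands of drives before any replication; no single
    machine holds it, so data and computation must spread across a cluster.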

One solution: Commodity
Early 2000s
  – Google decides to buy up a bunch of Intel P3 computers with IDE slab disk
  – Super cheap
  – Everyone thought exotic, expensive hardware was the way to do large-scale computing
  – Problem: cheap hardware fails a lot

One solution: Commodity
Solve the reliability problem in software
  – Replicate data across the disks for resiliency
  – Queue up multiple copies of the same job to ensure at least one completes (see the sketch below)
CPU and disk are cheap, and otherwise underused, so why not?
Suggests an infrastructure as the means of dealing with resiliency
  – Developers need to be able to write their code in familiar programming constructs while leveraging the underlying commodity hardware
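
A minimal single-process sketch of the "queue up multiple copies" idea, using only the standard java.util.concurrent library (the task and the replica count of 3 are illustrative assumptions): invokeAny runs several identical attempts and returns the first that succeeds, cancelling the rest.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class RedundantJob {
      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // Queue three copies of the same (possibly failing) task.
        List<Callable<String>> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
          final int id = i;
          replicas.add(() -> {
            if (Math.random() < 0.5) {            // simulated flaky hardware
              throw new RuntimeException("worker " + id + " failed");
            }
            return "result from worker " + id;
          });
        }
        // Returns the first successful result and cancels the other copies;
        // throws ExecutionException only if all three replicas fail.
        System.out.println(pool.invokeAny(replicas));
        pool.shutdown();
      }
    }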

Google: GFS and MapReduce
Two seminal papers published
  – The Google File System: ACM SOSP, 2003
  – MapReduce distributed programming model: OSDI, 2004
Taught the world how Google was able to make use of clusters of 1000s of nodes built on cheap Pentium 3s and IDE disk

Google Infrastructure Infusion
Rewrote their production crawling system on top of GFS and MapReduce
  – Reduced the time to crawl the web by orders of magnitude
  – Allowed developers to write simple map and reduce functions that could then scale out
Users wanted structured data on top of the underlying core
  – Bigtable: OSDI, 2006. A column-oriented database

The Open Source World
Doug Cutting decided in 2006 that the Google papers on MapReduce and GFS were the appropriate guidance for taking his open source search engine project, Nutch, and overcoming its limitations in scaling to multiple computers. He and Mike Cafarella branched Nutch and implemented a version built on a GFS-like system (the Nutch Distributed File System, NDFS) and on MapReduce.

The origin of scalable OSS ecosystems
Once M/R and NDFS were implemented, many folks became interested in just the M/R and NDFS infrastructure
  – Branched off into the Hadoop project
Eventually Mike Cafarella and others decided to implement BigTable => HBase
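
To make the programming model concrete, here is the canonical word-count example written against Hadoop's Java MapReduce API (a minimal sketch; the job-driver setup with input/output paths and job submission is omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map: for each input line, emit (word, 1) for every token.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer it = new StringTokenizer(line.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      // Reduce: the framework groups values by word; sum the 1s for a count.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          ctx.write(word, new IntWritable(sum));
        }
      }
    }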

Assumptions
You have a job that runs for a really long time on sets of independent, "shared nothing" infrastructure
  – Your job is mostly data-independent (i.e., it doesn't have to wait on the results of a prior job to run)
  – "Embarrassingly" parallel
You can program your algorithm or job in M/R
  – Not always the easiest mapping (see the inverted-index sketch below)
  – See http://berlinbuzzwords.de/content/nutch-web-mining-platform-present-and-future for how Nutch did it
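
For search specifically the mapping is natural: building an inverted index is itself a map/reduce. A simplified sketch in the same Hadoop Java API (the tokenization and posting-list format are illustrative; a real Nutch/Lucene indexer does far more):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {
      // Map: (docId, document text) -> (term, docId) for each term.
      public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text docId, Text doc, Context ctx)
            throws IOException, InterruptedException {
          for (String term : doc.toString().toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) ctx.write(new Text(term), docId);
          }
        }
      }

      // Reduce: (term, [docIds]) -> (term, comma-separated posting list).
      public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docIds, Context ctx)
            throws IOException, InterruptedException {
          StringBuilder postings = new StringBuilder();
          for (Text id : docIds) {
            if (postings.length() > 0) postings.append(',');
            postings.append(id.toString());
          }
          ctx.write(term, new Text(postings.toString()));
        }
      }
    }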

Science Data Systems
Need search
  – Have web-scale knowledge bases that need to be made available to scientists
  – Job processing is traditionally not embarrassingly parallel
How do we leverage Hadoop, Nutch, and all of the scalable search technologies?

May-20-10CS572-Summer2010CAM-15 Build out Reusable SDS Infra

Dump the data
Scale out and treat the SDS as the gold source
Make search available as a "service" back to the SDS jobs (a minimal client sketch follows)
Leverage commodity hardware and open source infrastructures
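
In practice, "search as a service" means the SDS jobs query an HTTP endpoint rather than linking a search library directly. A minimal sketch in plain Java; the Solr-style URL and the query are hypothetical placeholders, not a real deployment:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class SearchClient {
      public static void main(String[] args) throws Exception {
        // Hypothetical query and endpoint for illustration only.
        String q = URLEncoder.encode("dataset:MODIS AND year:2009", "UTF-8");
        URL url = new URL("http://search.sds.example/solr/select?q=" + q + "&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);   // raw JSON results from the service
          }
        }
      }
    }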

May-20-10CS572-Summer2010CAM-17 Example: NASA PDS

Where it's going
Amazon
  – Elastic Compute Cloud (EC2)
  – Simple Storage Service (S3)
  – …and many others
Rackspace
Microsoft Azure
Public versus private cloud
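
For a sense of what these services look like to a developer, a hedged sketch of storing a crawl segment in S3 with the (later) AWS SDK for Java v1; the bucket and key names are hypothetical, and credentials are assumed to come from the SDK's default provider chain:

    import java.io.File;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class S3Upload {
      public static void main(String[] args) {
        // Region and credentials are picked up from the default provider chain.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Store a crawl segment in a (hypothetical) bucket;
        // S3 replicates the object for durability behind the scenes.
        s3.putObject("my-crawl-data", "segments/part-00000", new File("part-00000"));
      }
    }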

Clouds vs. Grids: Clouds
Lowest-common-denominator services (compute/store) that are broadly applicable, independent of application domain
Scalability and performance improvements come at an economic cost, amortized
Must provide externally accessible APIs or service interfaces to the internal workings of the cloud to leverage "cloud" in your application; i.e., you aren't "cloud" if you are doing computation and storage locally using UNIX pipes and filters
Does not explicitly deal with virtual organizations
Constructing clouds is hard, and should not be attempted by those without experience in the domain of discourse

Clouds vs. Grids: Grids
Focused on the creation of virtual organizations
Focused on scientific applications
  – At least the successful attempts were
Goal is to provide all the software needed to enable the creation of virtual organizations
  – Very few grid solutions provide services in all 5 of the grid's architectural layers
Grid systems/applications are not built with extensibility in mind
  – More exploratory
  – Focused on the creation of entire "systems" rather than low-level "services"

Challenges
Overcoming the complexity of new programming models
  – It's not terribly easy to program in M/R, or even in newer constructs like cloud services
Testing things at scale is difficult
  – Do you have a 2000-node cluster lying around?
  – Do you have the $$$ to pay for it on EC2?
  – Makes it hard to integrate patches and update software, because you have to test at scale

Wrapup
The scale of the web is only increasing
Software dealing with web scale has to be resilient against failure
  – Especially if you use commodity hardware, which seems to be the dominant trend
Several successful commercial and open source examples at scale
Stormy weather ahead: clouds
Dealing with the challenges