Scaling to the Modern Internet CSCI 572: Information Retrieval and Search Engines Summer 2010.

May-20-10 CS572-Summer2010 CAM-2
Outline
The paradigm shift: BigData
Search Engine Models for BigData
  –Map Reduce
  –GFS
Looking forward: what to do with the data
Upcoming technologies
Challenges

Grand Data Challenges
We’ve talked about the end-to-end search lifecycle
So, now what?
Projects are collecting huge amounts of data
  –Let’s take a few examples

The Square Kilometer Array
1 sq. km of antennas
Never-before-seen resolution looking into the sky
700 TB
  –Per second!

NASA DESDynI Mission
16 TB/day
Geographically distributed
10s of 1000s of jobs per day
Tier 1 Earth Science Decadal Mission

How do we scale?
Biggest search engines are on the order of 40B records
  –Size on disk in the 10s-100s of GB range
  –Web pages and other forms of content are fairly small
What happens when we have
  –Indexes on the order of 10x 40B? What about 100x?
  –Large data files that folks want to make available?

One solution: Commodity
Early 2000s
  –Google decides to buy up a bunch of Intel P3 computers with IDE slab disk
  –Super cheap
  –Everyone thought exotic, expensive hardware was the way to do large-scale computing
  –Problem: cheap hardware fails a lot

One solution: Commodity
Solve the reliability problem in software
  –Replicate data across the disks for resiliency
  –Queue up multiple copies of the same job to ensure at least one completes
CPU and disk are cheap and otherwise underused, so why not?
Suggests an infrastructure as the means of dealing with resiliency
  –Developers need to be able to write their code in familiar programming constructs, while leveraging the underlying commodity hardware
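The two software-level reliability ideas on this slide (replicate data across disks, queue several copies of the same job) can be sketched roughly as follows. This is a toy illustration, not how GFS or Google's scheduler actually work: the names `place_replicas`, `run_redundantly`, and `flaky_job` are hypothetical, the default of 3 copies is an assumption, and threads stand in for separate machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def place_replicas(block_id, disks, copies=3):
    """Pick `copies` distinct disks for one block (hypothetical placement rule)."""
    rng = random.Random(block_id)  # seeded by block id so placement is repeatable
    return rng.sample(disks, copies)


def run_redundantly(job, copies=3):
    """Submit the same job `copies` times; return the first result that succeeds."""
    errors = []
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(job, attempt) for attempt in range(copies)]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # a failed copy is expected on cheap hardware
                errors.append(exc)
    raise RuntimeError(f"all {copies} copies failed: {errors}")


def flaky_job(attempt):
    """Toy job: the first two copies land on 'dead' nodes, the third succeeds."""
    if attempt < 2:
        raise IOError(f"node for attempt {attempt} died")
    return "done"


disks = ["disk-1", "disk-2", "disk-3", "disk-4", "disk-5"]
replicas = place_replicas("block-7", disks)
result = run_redundantly(flaky_job)
```

The trade-off is the one the slide names: extra disk and CPU are spent on redundancy so that the developer's job code does not need to handle node failure itself.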

Google: GFS and Map Reduce
2 seminal papers published
  –Google File System: ACM SOSP, 2003 http://labs.google.com/papers/gfs.html
  –Map Reduce distributed programming model: OSDI, 2004 http://labs.google.com/papers/mapreduce.html
Taught the world how Google was able to make use of those 1000s-of-node clusters built on cheap Pentium 3s and IDE disk

Google Infrastructure Infusion
Rewrote their production crawling system on top of GFS and Map Reduce
  –Reduced time to crawl the web by orders of magnitude
  –Allowed developers to write simple map and reduce functions that could then scale out
Users wanted structured data on top of the underlying core
  –Big Table: OSDI, 2006 http://labs.google.com/papers/bigtable.html
  –Column Oriented Database
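To make "simple map and reduce functions" concrete, here is a minimal sketch that builds an inverted index the M/R way. It is a single-process stand-in, not Google's or Hadoop's API: `run_mapreduce`, `map_page`, and `reduce_postings` are hypothetical names, and the two example pages are made up. In a real framework the grouping ("shuffle") step and the distribution across nodes are what the infrastructure provides.

```python
from collections import defaultdict


def map_page(url, text):
    """Map: emit one (term, url) pair per distinct word on a page."""
    for term in set(text.lower().split()):
        yield term, url


def reduce_postings(term, urls):
    """Reduce: collapse all urls seen for a term into a sorted posting list."""
    return term, sorted(set(urls))


def run_mapreduce(pages, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for url, text in pages.items():
        for key, value in mapper(url, text):
            groups[key].append(value)  # the "shuffle": gather values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


pages = {
    "a.example": "cheap disks fail",
    "b.example": "cheap hardware",
}
index = run_mapreduce(pages, map_page, reduce_postings)
```

The developer writes only `map_page` and `reduce_postings`; because each call depends on nothing but its own arguments, the framework is free to run them on as many machines as it has.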

The Open Source World
Doug Cutting decided in 2006 that the Google papers on Map Reduce and GFS were the right guidance for taking his open source search engine project, Nutch, past its limitations in scaling to multiple computers
He and Mike Cafarella branched Nutch and implemented a version built on a GFS-like system and on M/R

The origin of scalable OSS ecosystems
Once M/R and NDFS (the Nutch Distributed File System, the GFS-like system) were implemented, many folks became interested in just the M/R and NDFS infrastructure
Branched off into the Hadoop project
Eventually Mike Cafarella and others decided to implement BigTable => HBase

Assumptions
You have a job that runs for a really long time on sets of independent, “shared nothing” infrastructure
  –Your job is mostly data independent (i.e., it doesn’t have to wait on the results of a prior job to run)
  –“Embarrassingly” parallel
You can program your algorithm or job in M/R
  –Not always the easiest mapping
  –See http://berlinbuzzwords.de/content/nutch-web-mining-platform-present-and-future for how Nutch did it
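A rough way to see what "data independent" and "shared nothing" buy you: if each record can be processed without looking at any other record, the input can be dealt out to workers in any way and the combined answer matches the serial one. The sketch below assumes a made-up four-document input and hypothetical helper names, and uses threads as a stand-in for separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def count_terms(document):
    """Work on one record only; reads and writes nothing shared."""
    return len(document.split())


def process_partition(partition):
    """One 'node' handles its own partition without talking to the others."""
    return [count_terms(doc) for doc in partition]


def split(records, n_parts):
    """Deal records round-robin into n_parts partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


documents = ["map reduce", "google file system", "big table", "nutch"]
partitions = split(documents, 2)

with ThreadPoolExecutor(max_workers=2) as pool:
    per_partition = list(pool.map(process_partition, partitions))

total_parallel = sum(sum(counts) for counts in per_partition)
total_serial = sum(count_terms(doc) for doc in documents)
```

A job where `count_terms` needed the result for the previous document would break this property, which is the case the slide warns about.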

Science Data Systems
Need search
  –Have web-scale knowledge bases that need to be made available to scientists
  –Job processing is traditionally not embarrassingly parallel
How to leverage Hadoop and Nutch and all of the scalable search technologies?

Build out Reusable SDS Infra

Dump the data
Scale out and treat SDS as gold source
Make Search available as a “service” back to the SDS jobs
Leverage commodity hardware and open source infrastructures

Example: NASA PDS

Where it’s going
Amazon
  –Elastic Compute Cloud (EC2)
  –Simple Storage Service (S3)
  –…and many others
Rackspace
Microsoft Azure
Public versus Private cloud

Clouds vs. Grids: Clouds
Lowest common denominator services (compute/store) that are broadly applicable, independent of application domain
Scalability and performance improvements come at an economic cost, amortized
Must provide externally accessible APIs or service interfaces to the internal workings of the cloud so that your application can leverage “cloud”
  –I.e., you aren’t “cloud” if you are doing computation and storage locally using UNIX pipes and filters...
Does not explicitly deal with virtual organizations
Constructing clouds is hard and should not be attempted by those without experience in the domain

Clouds vs. Grids: Grids
Focused on creation of virtual organizations
Focused on scientific applications
  –At least the successful attempts
Goal is to provide all software to enable creation of virtual organizations
  –Very few grid solutions provide services in all 5 of the grid’s architectural layers
Grid systems/applications are not built with extensibility in mind
  –More exploratory
  –Focused on the creation of entire “systems” rather than low-level “services”

Challenges
Overcoming the complexity of new programming models
  –It’s not terribly easy to program in M/R, or even in newer constructs like cloud services
Testing things at scale is difficult
  –Do you have a 2000-node cluster lying around?
  –Do you have the $$$ to pay for it on EC2?
  –Makes it hard to integrate patches and update software, because you have to test at scale

Wrapup
The scale of the web is only increasing
Software to deal with web scale has to be resilient against failure
  –If you use commodity hardware, which seems to be a great trend
Several successful commercial and open source examples at scale
Stormy weather ahead: clouds
Dealing with the challenges
