Presentation is loading. Please wait.

Presentation is loading. Please wait.

大规模数据处理 / 云计算 Lecture 1 - Introduction to MapReduce 彭波 北京大学信息科学技术学院 4/21/2011 This work is licensed under a Creative.

Similar presentations


Presentation on theme: "大规模数据处理 / 云计算 Lecture 1 - Introduction to MapReduce 彭波 北京大学信息科学技术学院 4/21/2011 This work is licensed under a Creative."— Presentation transcript:

1 大规模数据处理 / 云计算 Lecture 1 - Introduction to MapReduce 彭波 北京大学信息科学技术学院 4/21/2011 http://net.pku.edu.cn/~course/cs402/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details Jimmy Lin University of Maryland 课程建设 SEWMGroup

2 What is this course about? Data-intensive information processing Large-data (“web-scale”) problems Focus on MapReduce programming An entry-level course~ 2

3 What is MapReduce? Programming model for expressing distributed computations at a massive scale Execution framework for organizing and performing such computations Open-source implementation called Hadoop 3

4 Why Large Data?

5 How much data? Google processes 20 PB a day (2008) Wayback Machine has 3 PB + 100 TB/month (3/2009) Facebook has 2.5 PB of user data + 15 TB/day (4/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) CERN’s LHC will generate 15 PB a year (??) 640K ought to be enough for anybody. 5

6 6

7 7 Happening everywhere! Molecular biology (cancer) microarray chips Particle events (LHC) particle colliders microprocessors Simulations (Millennium) Network traffic (spam) fiber optics 300M/day 1B 1M/sec

8 8 Maximilien Brice, © CERN

9 9

10 10 Maximilien Brice, © CERN

11 11 Maximilien Brice, © CERN

12 No data like more data! (Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007) s/knowledge/data/g; How do we get here if we’re not Google? 12

13 Example: information extraction Answering factoid questions Pattern matching on the Web Works amazingly well Learning relations Start with seed instances Search for patterns on the Web Using patterns to find more instances Who shot Abraham Lincoln?  X shot Abraham Lincoln Birthday-of(Mozart, 1756) Birthday-of(Einstein, 1879) Wolfgang Amadeus Mozart (1756 - 1791) Einstein was born in 1879 PERSON (DATE – PERSON was born in DATE (Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … ) 13

14 14 Example: Scene Completion Image Database Grouped by Semantic Content 30 different Flickr.com groups 2.3 M images total (396 GB). Select Candidate Images Most Suitable for Filling Hole Classify images with gist scene detector [Torralba] Color similarity Local context matching Computation Index images offline 50 min. scene matching, 20 min. local matching, 4 min. compositing Reduces to 5 minutes total by using 5 machines Extension Flickr.com has over 500 million images … Hays, Efros (CMU), “Scene Completion Using Millions of Photographs” SIGGRAPH, 2007

15 More Data More Gains? CNNIC 中国互联网络发展状况统计 截至 2010 年 6 月底,我国网民规模达 4.2 亿人,互 联网普及率持续上升增至 31.8% 。手机网民成为 拉动中国总体网民规模攀升的主要动力,半年内 新增 4334 万,达到 2.77 亿人,增幅为 18.6% 。值 得关注的是,互联网商务化程度迅速提高,全国 网络购物用户达到 1.4 亿,网上支付、网络购物和 网上银 行半年用户增长率均在 30% 左右,远远超 过其他类网络应用。 15

16 2009 年全国新闻出版业基本情况 2009 年:出版书籍 238868 种(初版 145475 种, 重版、重印 93393 种),总印数 37. 88 亿册(张), 总印张 312.46 亿印张,折合用纸量 73.4 万吨(包 括附录用纸 1.41 亿印张,折合用纸量 0.33 万吨), 定价总金额 567.27 亿 元(包括附录定价总金额 4.73 亿元)。与上年相比种数增长 8.86% (初版 增长 11.24% ,重版、重印增长 5.36% ),总印数 增长 4.53% ,总印 张增长 4.61% ,定价总金额增 长 8.94% 。 16

17 Did you know? 17

18 Did you know? “We are currently preparing our students for jobs that don’t yet exist …” “It is estimated that a week’s worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century” “The amount of new technical information is doubling every 2 years” “So what does IT ALL MEAN?”“So what does IT ALL MEAN?” 18

19 “We are living in exponential times “ 19

20 20 Two Different Views a “thrower-awayer”MyLifeBits “ 丢弃,必要时再找回来的代价 要比维护它们要小得多 ” “trying to live an efficient life so that one has time to work and be with one’s family. “ Jennifer Widom Gordon Bell

21 Information Overloading 不能学以致用的原因之一: 信息超载 – 对于那些只接触过一次的信息, 我们通常只能记住其中一小部 分。 – 我们应该少而精而非多而浅地 去学习。 – 要想掌握某件事,关键在于间 隔性重复。 – 一旦真正透彻地掌握了自己的 工作,人们就会变得更有创造 性,甚至能够创造奇迹。 21

22 What is Cloud Computing?

23 The best thing since sliced bread? Before clouds… Grids Vector supercomputers … Cloud computing means many different things: Large-data processing Rebranding of web 2.0 Utility computing Everything as a service 23

24 Rebranding of web 2.0 Rich, interactive web applications Clouds refer to the servers that run them AJAX as the de facto standard (for better or worse) Examples: Facebook, YouTube, Gmail, … “The network is the computer”: take two User data is stored “in the clouds” Rise of the netbook, smartphones, etc. Browser is the OS 24

25 Source: Wikipedia (Electricity meter)

26 Utility Computing What? Computing resources as a metered service (“pay as you go”) Ability to dynamically provision virtual machines Why? Cost: capital vs. operating expenses Scalability: “infinite” capacity Elasticity: scale up or down on demand Does it make sense? Benefits to cloud users Business case for cloud providers I think there is a world market for about five computers. 26

27 Everything as a Service Utility computing = Infrastructure as a Service (IaaS) Why buy machines when you can rent cycles? Examples: Amazon’s EC2, Rackspace Platform as a Service (PaaS) Give me nice API and take care of the maintenance, upgrades, … Example: Google App Engine Software as a Service (SaaS) Just run it for me! Example: Gmail, Salesforce 27

28 Utility Computing “pay-as-you-go” 好比让用户把电源插头插在墙上,你得到 的电压和 Microsoft 得到的一样,只是你用得少, pay less ; utility computing 的目标就是让计算资源也具有这样的服务 能力,用户可以使用 500 强公司所拥有的计算资源,只是 use less pay less 。这是 cloud computing 的一个重要方面 28

29 Platform as a Service (PaaS) 对于开发 Web Application 和 Services , PaaS 提供了一整套 基于 Internet 的,从开发,测试,部署,运营到维护的全方 位的集成环境。特别它从一开始就具备了 Multi-tenant architecture ,用户不需要考虑多用户并发的问题,而由 platform 来解决,包括并发管理,扩展性,失效恢复,安全 。 29

30 Software as a Service (SaaS) a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.software deployment 30

31 Who cares? Ready-made large-data problems Lots of user-generated content Even more user behavior data Examples: Facebook friend suggestions, Google ad placement Business intelligence: gather everything in a data warehouse and run analytics to generate insight Utility computing Provision Hadoop clusters on-demand in the cloud Lower barrier to entry for tackling large-data problem Commoditization and democratization of large-data capabilities 31

32 Story around Hadoop

33 Google-IBM Cloud Computing Initiative 2007 年 10 月初, Google 和 IBM 联 合与 6 所大学签署协议,提供在大 型分布式计算系统上开发软件的 课程和支持服务,帮助学生和研 究人员获得开发网络级应用软件 的经验。这个项目的主要内容是 传授 MapReduce 算法和 Hadoop 文件系统。两家公司将各自出资 2000 ~ 2500 万美元,为从事计算 机科学研究的教授和学生提供所 需的电脑软硬件和相关服务。开发软件 33

34 Cloud Computing Initiative Google and IBM team on cloud computing initiative for universities(2007-1008) –provide several hundred computers –access through the Internet to test parallel programming projects The idea for the program from Google senior software engineer Christophe Bisciglia –Google Code UniversityGoogle Code University 34

35 35 The Information Factories Googleplex(pre-2008) –servers number 450,000, according to the lowest estimate –200 petabytes of hard disk storage –four petabytes of RAM –To handle the current load of 100 million queries a day, –input-output bandwidth must be in the neighborhood of 3 petabits per second

36 Google Infrastructure 2003 "The Google file system," in sosps. Bolton Landing, NY, USA: ACM Press, 2003. 2004 "MapReduce: Simplified Data Processing on Large Clusters," in Osdi, 2004, 2006 "Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)," in Osdi, 2006 ……. 36

37 37 Hadoop Project Doug Cutting

38 38 History of Hadoop 2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. January 2006 - Doug Cutting joins Yahoo!Doug Cutting joins Yahoo! February 2006 - Apache Hadoop project official started to support the standalone development of Map-Reduce and HDFS. March 2006 - Formation of the Yahoo! Hadoop team May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes April 2006 - Sort benchmark run on 188 nodes in 47.9 hours May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark) October 2006 - Research cluster reaches 600 Nodes December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 January 2007 - Research cluster reaches 900 node April 2007 - Research clusters - 2 clusters of 1000 nodes Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!

39 Google Code University 2008, Seminar: Mass Data Processing Technology on Large Scale Clusters, Tsinghua University Aaron Kimball 39

40 Startup: Cloudera Cloudera is pushing a commercial distribution for Hadoop Hadoop Mike Olson Aaron Kimball Doug Cutting Christophe Bisciglia Tom White 40

41 Course Administrivia

42 TextBooks [Tom] Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009. [Chinese Edition][Chinese Edition] [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2010. [EBook][EBook] 42

43 IDTopicsContentsReading 1Background 1.CloudComputing 概念出现及演变, 现状, 2. 本课程主要内容、要求 [Lin]Ch1:Introduction [Tom]Ch1:Meet Hadoop 2System1.MapReduce & runtime system 2.HDFS [Lin]Ch2:Mapreduce Basic [Tom]Ch6:How mapreduce works [paper] 3Environment 1. 熟悉 Hadoop 环境 (*)Hadoop environment setup (*)WordCount [Tom]Ch9:Setting up hadoop cluster [Tom]Appendix A:Installing apache hadoop [Tom]Ch2:Mapreduce 4Algorithm Design 1.MapReduce Program Develop 2.Basic MapReduce algorithm design and design patterns (*)WordCount (Mapreduce library class) [Tom]Ch5:Developing a MapReduce Application [Lin]Ch3:MapReduce algorithm design 5Text retrieval(*)Inverted Index[Lin]Ch4:Inverted Indexing for Text Retrieval 6Graph Algorithm (*)PageRank (Mapreduce features: side data distribution) [Lin]Ch5:Graph Algorithms 43

44 Recap Why large data? Cloud computing and Hadoop What is MapReduce? 44

45 Q&A? Thanks you!


Download ppt "大规模数据处理 / 云计算 Lecture 1 - Introduction to MapReduce 彭波 北京大学信息科学技术学院 4/21/2011 This work is licensed under a Creative."

Similar presentations


Ads by Google