MapReduce Theory and Practice. Peng Bo (彭波), School of Information Science and Technology, Peking University, 7/15/2010.


1 MapReduce Theory and Practice. http://net.pku.edu.cn/~course/cs402/2010/ Peng Bo (彭波), pb@net.pku.edu.cn, School of Information Science and Technology, Peking University, 7/15/2010

2 Last Course Review

3 Quiz: What are they? 1. Data: (a) bit, (b) byte. 2. Data types. 3. Information.

4 Data. The term data refers to groups of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data (plural of "datum", which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived. Raw data refers to a collection of unprocessed numbers, characters, images, or other outputs from devices that collect information to convert physical quantities into symbols.

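5 Bit. A bit, also called a binary digit, is a single digit in the binary system and the smallest unit of information. "Bit" is short for "binary digit". Suppose an event occurs as either A or B, and A and B are equally likely (probability 0.5 each); then one bit can represent which of the two occurred. For example, a bit can represent: a simple positive or negative sign; a switch with two states (such as a light switch); whether a transistor is on or off; the presence or absence of voltage on a wire; an abstract logical yes or no.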

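6 Byte. The name is sometimes expanded as "binary term", although the word was coined by Werner Buchholz in 1956 as a deliberate respelling of "bite". One byte represents eight bits. It is commonly used as the unit for measuring computer information, regardless of the type of data being stored.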

7 History of “Information”. Latin origin: a representation implanted in the mind, hence an idea. Language and coding: hiding information in messages and then decoding them, for example Morse code. Mathematics: in his work on channel transmission, Shannon defined the amount of information carried by a message as the negative base-2 logarithm of the probability with which it occurs at the source, measured in bits. Logic and linguistics: a communication-oriented sense of information that involves semantic meaning and knowledge. Society: information as something that is contained in the message used to inform. “Information is the tennis ball of communication.”
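
Shannon's measure can be written as I(m) = -log2 p(m), where p(m) is the probability of message m at the source. The following is a minimal Python sketch of that formula; the function name `self_information` is chosen here for illustration and does not come from the slides. It also covers the equally likely A/B event from the Bit slide, which carries exactly one bit.

```python
import math


def self_information(p: float) -> float:
    """Information content, in bits, of a message that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1/p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)


# An event with two equally likely outcomes (A or B) carries one bit.
print(self_information(0.5))    # 1.0
# One outcome out of eight equally likely ones carries three bits.
print(self_information(0.125))  # 3.0
```

Rarer messages carry more bits, and a message that is certain (p = 1) carries none.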

8

9 How much data? Google processes 20 PB a day (2008). The Wayback Machine has 3 PB + 100 TB/month (3/2009). Facebook has 2.5 PB of user data + 15 TB/day (4/2009). eBay has 6.5 PB of user data + 50 TB/day (5/2009). CERN’s LHC will generate 15 PB a year (??). “640K ought to be enough for anybody.”

10 “We are living in exponential times.”

11 Information Overload. Political theorist Neil Postman spoke to the German Informatics Society in 1990, claiming that we are informing ourselves to death. He argued that the development of computer technology is not as positive as it has been heralded to be. With our focus on technology, we are forfeiting our humanity. We are drowning in information that contains empty promises of improving our lives. (Postman 1990).

12 How do we cope with information overload?

13 What does this have to do with ME?! What would you do with 1,000 PCs, or even 100,000 PCs?

14 Cloud is coming… Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law. “Data Center is a Computer”: parallelism everywhere; massive, scalable, reliable; resource management; data management; programming model and tools.

15 What’s MapReduce? A programming model for parallel/distributed computing. Diagram labels: input, split, shuffle, output.

16 Word Frequencies in Web Pages. Input: one document per record. The user implements a map function whose input is key = document URL, value = document contents. The map function outputs (potentially many) key/value pairs: one record for every occurrence of a word in the document.

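17 Example continued: The MapReduce runtime system (library) gathers all records with the same key together (shuffle/sort). The user implements a reduce function that computes over the values belonging to one key, here a sum. Reduce then emits the output.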
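
The word-frequency example on the two slides above can be sketched in a single Python process. This is only an illustration of the programming model, not the Google or Hadoop API: the names `map_fn`, `reduce_fn`, and `run_mapreduce`, the whitespace tokenization, the lowercasing, and the example.com URLs are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(url: str, contents: str) -> Iterator[Tuple[str, int]]:
    """User-written map: emit one (word, 1) record per word occurrence."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User-written reduce: sum all values that share one key."""
    return word, sum(counts)


def run_mapreduce(documents: Dict[str, str]) -> Dict[str, int]:
    """Toy stand-in for the runtime library: map, shuffle/sort, reduce."""
    # Map phase: one map call per input record (key = URL, value = contents).
    intermediate: List[Tuple[str, int]] = []
    for url, contents in documents.items():
        intermediate.extend(map_fn(url, contents))

    # Shuffle/sort phase: gather all records with the same key together.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reduce call per distinct key, in sorted key order.
    return dict(reduce_fn(key, groups[key]) for key in sorted(groups))


docs = {
    "http://example.com/a": "the quick fox",
    "http://example.com/b": "the lazy dog the end",
}
counts = run_mapreduce(docs)
print(counts["the"])  # 3
```

In a real deployment the map calls and the reduce calls run on many machines, and the shuffle moves records across the network; the user-written functions keep the same shape.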

18 Homework Reading

19 Checklist: What’s the title? What’s the main point of view? What had the most impact on you?

20 Introduction to Distributed System Design. How many times does “physicist” occur in this document? Tell me something about Remote Procedure Calls. Tell me something about the types of failures that can occur in a distributed system.

21 Introduction to Parallel Programming and MapReduce. The MASTER/WORKER technique; approximating pi. MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance.
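
One way to picture the master/worker technique applied to approximating pi is a Monte Carlo estimate: each worker throws random points into the unit square and counts how many land inside the quarter circle, and the master adds up the counts. The sketch below is an assumption about how such an exercise might look, not the code from the assigned reading. The names `worker` and `estimate_pi`, the sample sizes, and the use of threads in place of separate machines are all illustrative choices.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def worker(seed: int, samples: int) -> int:
    """Worker task: count random points that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator, so workers share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_workers: int = 4, samples_per_worker: int = 50_000) -> float:
    """Master: hand out one task per worker, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_hits = pool.map(
            worker,
            range(num_workers),                  # a distinct seed per worker
            [samples_per_worker] * num_workers,  # the same workload for each
        )
        total_hits = sum(partial_hits)
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * total_hits / (num_workers * samples_per_worker)


print(estimate_pi())
```

Threads in CPython do not speed up this CPU-bound loop; they only stand in for the worker machines so that the division of labor between master and workers is visible.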

22 End

