
1 Cloud Computing and Cloud Data Management (云计算与云数据管理)
Jiaheng Lu (陆嘉恒), Renmin University of China. www.jiahenglu.net

2 CLOUD COMPUTING

3 Main Contents
Overview of cloud computing; Google cloud computing techniques: GFS, Bigtable, and MapReduce; Yahoo! cloud computing techniques and Hadoop; challenges of cloud data management.

4 Cloud computing

5

6 Why do we use cloud computing?

7 Why do we use cloud computing?
Case 1: Writing a file. Save it on your own computer: if the computer goes down, the file is lost. Store it in the cloud: files are always kept in the cloud and never lost.

8 Why do we use cloud computing?
Case 2: Use IE: download, install, use. Use QQ: download, install, use. Use C...: download, install, use. And so on. With cloud computing, you get the service from the cloud instead.

9 What is cloud and cloud computing?
On-demand resources or services delivered over the Internet, with the scale and reliability of a data center.

10 What is cloud and cloud computing?
Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.

11 Characteristics of cloud computing
Virtual: software, databases, Web servers, operating systems, storage, and networking are delivered as virtual servers. On demand: add and subtract processors, memory, network bandwidth, and storage as needed.

12 Types of cloud service
SaaS: Software as a Service. PaaS: Platform as a Service. IaaS: Infrastructure as a Service.

13 SaaS: Software delivery model
No hardware or software to manage; the service is delivered through a browser; customers use the service on demand; instant scalability.

14 SaaS Examples
Your current CRM package is not managing the load, or you simply don't want to host it in-house: use a SaaS provider such as Salesforce.com. Your email is hosted on an Exchange server in your office and it is very slow: outsource it using Hosted Exchange.

15 PaaS: Platform delivery model
Platforms are built upon infrastructure, which is expensive; estimating demand is not an exact science; platform management is not fun!

16 PaaS Examples
You need to host a large file (5 MB) on your website and make it available to 35,000 users for only two months: use CloudFront from Amazon. You want to start storage services on your network for a large number of files and you do not have the storage capacity: use Amazon S3.

17 IaaS: Computer infrastructure delivery model
A platform-virtualization environment delivering computing resources such as storage and processing capacity; virtualization taken a step further; sometimes called utility computing.

18 IaaS Examples You want to run a batch job but you don’t have the infrastructure necessary to run it in a timely manner. Use Amazon EC2. You want to host a website, but only for a few days. Use Flexiscale.

19 Cloud computing and other computing techniques

20 CLOUD COMPUTING

21 The 21st Century Vision Of Computing
Leonard Kleinrock, one of the chief scientists of the original Advanced Research Projects Agency Network (ARPANET) project which seeded the Internet, said: "As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of 'computer utilities' which, like present electric and telephone utilities, will service individual homes and offices across the country." Characteristics: computing is delivered to users as a service, hiding the complexity of the technical details (hence the "cloud"); services are mostly accessed over the network as Web Services or in similar forms, to achieve the best compatibility, scalability, and location independence; providers charge per unit of usage, making IT services resemble traditional public utilities such as water and electricity ("Computing is being transformed to a model consisting of services that are commoditized and delivered in a manner similar to traditional utilities such as water, electricity, gas, and telephony."); users do not own the server hardware or other IT infrastructure, but order computing power or storage on demand as needed, which can greatly reduce costs. Technically, cloud computing is usually a natural extension of grid technology and makes heavy use of virtualization.

22 The 21st Century Vision Of Computing
Sun Microsystems co-founder Bill Joy.

23 The 21st Century Vision Of Computing

24 Definitions: Cloud, Grid, Utility, Cluster
1. Vendors, as ever, blur the real definitions of new terms. In my view (shared by others), cloud computing differs from utility computing, which in turn differs from grid computing: "Grid computing usually refers to pooled-resource environments for running compute tasks (such as image processing) rather than long-running processes (such as a Web site or an email server)." "Utility computing usually refers to pooled-resource environments that support long-running processes; it generally focuses on meeting service levels by supplying the optimal amount of resources for the task at hand." "Cloud computing (for many people) refers to the various services delivered over the Internet that provide computing capability on the service provider's infrastructure (for example, Google Apps, Amazon EC2, or Salesforce.com). A cloud computing environment may actually sit inside a grid or a utility computing environment, but that does not matter to the service's users."
2. Cloud computing versus grid, software as a service, and platform as a service. Cloud computing = grid computing: workloads are handed to an IT infrastructure made up of a master node that dispatches tasks and slave nodes that do the work; the master controls the resources allocated to a workload (how many slave nodes run the parallelized job). All of this is transparent to the client, which only sees the workload being handed to the cloud/grid and the results coming back. The slave nodes may or may not be virtual hosts. Cloud computing = software as a service: this is Google's application model, in which applications live "in the cloud", that is, somewhere on the Web. Cloud computing = platform as a service: this is the model of Amazon EC2 and the like, in which an external entity maintains the IT infrastructure (master/slave nodes) and customers buy time and resources on it. It is this "in the cloud" aspect that spreads cloud computing across the Web, outside the organization that rents time on it.
3. The cloud simply means migrating services from local machines onto the Web: from saving files locally to storing them in a secure, scalable environment; from applications whose storage is capped at gigabytes to applications with no storage ceiling; from Microsoft Office to a Web-based office suite. At some point between 2005 and 2008, online storage became cheaper and safer than storing data locally or on your own servers; that is the cloud. It covers grid computing, bigger databases such as Bigtable, caching, always-on access, failover, redundancy, scalability, and much more. Think of it as moving deeper into the Internet. It also bears heavily on battles such as static versus dynamic, and RDBMS versus BigTable and flat data views. Entire business architectures that depend on IT infrastructure will change; programmers will drive the cloud, and in the end there will be many rich programmers. It is like the migration from mainframes to personal computers: now you have personal space in the cloud. "It is a gimmick, like Web 2.0, but there is real change underneath these things. Marketing has always formed around technical progress."
4. Grid and cloud are not mutually exclusive. As some comments put it: "Cloud is pay-per-use (that is, you do not necessarily own the resources)." "Grid is how work is scheduled, regardless of where you run it." "You can use a cloud without a grid, or a grid without a cloud. Or you can run a grid on a cloud."
5. I generally divide the cloud computing field into three camps: "Enablers: the companies that make the basic infrastructure or basic building blocks possible, typically focused on data-center automation and/or server virtualization (VMware/EMC, Citrix, BladeLogic, RedHat, Intel, Sun, IBM, Enomalism, and so on)." "Providers (Amazon Web Services, Rackspace, Google, Microsoft): the companies with the budget and the know-how to build global computing environments worth millions or even billions of dollars. Cloud providers typically offer infrastructure or a platform, and these as-a-service products are usually metered and billed like utilities." "Cloud consumers: potentially a very large group, including any application delivered through Web-based services such as webmail, blogs, and social networks. From the consumer's point of view, cloud computing is becoming the only way to build, manage, and deploy scalable Web applications."
These five takes on the definition clarify the picture at least to some degree.

25 Definitions: Utility
Utility computing is the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility.

26 Definitions: Cluster
A computer cluster is a group of linked computers working together so closely that in many respects they form a single computer.

27 Definitions: Grid
Grid computing is the application of several computers to a single problem at the same time, usually a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.

28 Definitions: Cloud
Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.

29 Grid Computing & Cloud Computing
They share a lot in common: intention, architecture, and technology. They differ in programming model, business model, compute model, applications, and virtualization.

30 Grid Computing & Cloud Computing
The problems are mostly the same: managing large facilities; defining methods by which consumers discover, request, and use the resources provided by the central facilities; and implementing the often highly parallel computations that execute on those resources.

31 Grid Computing & Cloud Computing
Virtualization: grids do not rely on virtualization as much as clouds do, since each participating organization maintains full control of its own resources; in clouds, virtualization is an indispensable ingredient of almost every deployment.

32

33 Any questions or comments?

34 Main Contents
Overview of cloud computing; Google cloud computing techniques: GFS, Bigtable, and MapReduce; Yahoo! cloud computing techniques and Hadoop; challenges of cloud data management.

35 Google Cloud computing techniques

36 Cloud Systems
BigTable-like: BigTable (OSDI'06), HBase, HyperTable. MapReduce-based: Hive (VLDB'09), HadoopDB (VLDB'09). DBMS-based: GreenPlum, SQL Azure. Others: CouchDB, Voldemort, PNUTS (VLDB'08). (OSDI: Operating Systems Design and Implementation.)

37 The Google File System

38 The Google File System (GFS)
A scalable distributed file system for large, distributed, data-intensive applications. Multiple GFS clusters are currently deployed; the largest have 1000+ storage nodes and 300+ terabytes of disk storage, heavily accessed by hundreds of clients on distinct machines.

39 Introduction
Shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc. The GFS design has been driven by four key observations of Google's application workloads and technological environment.

40 Intro: Observations 1
1. Component failures are the norm: constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system. (GFS is built from hundreds or thousands of machines assembled from inexpensive commodity parts, accessed by a comparable number of client machines.) 2. Huge files (by traditional standards): multi-GB files are common, so I/O operations and block sizes must be revisited.

41 Intro: Observations 2
3. Most files are mutated by appending new data rather than overwriting old data: random writes are virtually nonexistent, and reads are almost all sequential. This is the focus of the performance optimizations and atomicity guarantees. 4. Co-designing the applications and the file-system API benefits the overall system by increasing flexibility.

42 The Design
A cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. Files are broken into chunks.

43 The Master
Maintains all file system metadata: namespace, access-control information, file-to-chunk mappings, chunk (and replica) locations, etc. Periodically communicates with chunkservers via HeartBeat messages to give instructions and check state. Controls system-wide activities (more later).

44 The Master
Helps make sophisticated chunk placement and replication decisions using global knowledge. For reading and writing, a client contacts the master to get chunk locations and then deals directly with chunkservers, so the master is not a bottleneck for reads and writes. A client usually asks for the locations of more than one chunk in a single request.

45 Chunkservers
Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle, assigned by the master at chunk creation. Chunk size is 64 MB. Each chunk is replicated on three servers by default.

46 Clients
Linked into applications using the file-system API. Communicate with the master and chunkservers for reading and writing: master interactions for metadata only; chunkserver interactions for data. Clients cache only metadata; the data is too large to cache.

47 Chunk Locations
The master does not keep a persistent record of the locations of chunks and replicas. It polls chunkservers for this at startup and when chunkservers join or leave, and stays up to date by controlling the placement of new chunks and through HeartBeat messages while monitoring chunkservers.

48 Operation Log
A record of all critical metadata changes, stored on the master and replicated on other machines. It defines the order of concurrent operations, is used to recover the file system state, and is central to GFS. Recovering FS state: the master checkpoints its state when the log grows past a certain size, then recovers by loading the last checkpoint and replaying the log records after it. The master starts a new log file and creates the checkpoint in a separate thread; once created (a few minutes), the checkpoint is stored locally and remotely, and older checkpoints and log files can be deleted. A toy sketch of this checkpoint-plus-replay mechanism follows.
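A minimal Python sketch of checkpoint-plus-replay recovery as just described: the "file system state" here is just a dict of metadata, and the log-size threshold is an invented constant, so this only illustrates the mechanism, not GFS itself.

import copy

CHECKPOINT_EVERY = 3  # invented threshold; GFS checkpoints when the log grows large

log, checkpoint, state = [], {}, {}

def apply_record(record, st):
    op, name, value = record
    if op == "set":
        st[name] = value
    elif op == "delete":
        st.pop(name, None)

def mutate(record):
    global log, checkpoint
    log.append(record)            # the critical metadata change is logged first
    apply_record(record, state)
    if len(log) >= CHECKPOINT_EVERY:
        checkpoint = copy.deepcopy(state)  # checkpoint the current state
        log = []                           # and start a new log file

def recover():
    st = copy.deepcopy(checkpoint)  # load the last checkpoint...
    for record in log:              # ...and replay records logged after it
        apply_record(record, st)
    return st

mutate(("set", "/a", 1)); mutate(("set", "/b", 2)); mutate(("set", "/a", 3))
mutate(("delete", "/b", None))
print(recover())  # {'/a': 3}, identical to the live state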

49 System Interactions: Leases and Mutation Order
Leases maintain a consistent mutation order across all chunk replicas. The master grants a lease to one replica, called the primary; the primary chooses the serial mutation order, and all replicas follow that order. This minimizes management overhead for the master. A mutation is an operation that changes the contents or metadata of a chunk, and it is performed on all replicas. For concurrent writes: writes may be interleaved with and overwritten by concurrent operations from other clients, so the shared region may end up containing fragments from different clients; the replicas will nevertheless be identical, because the individual operations are applied in the same order everywhere.

50 Atomic Record Append
The client specifies the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns the chosen offset to the client. Heavily used by Google's distributed applications: there is no need for a distributed lock manager. GFS chooses the offset, not the client. Traditionally, the client specifies the offset to write to, and concurrent writes to the same region are then not serializable: the section may contain data from different writers.

51 Atomic Record Append: How?
Follows a control flow similar to ordinary mutations. The primary tells the secondary replicas to append at the same offset as the primary. If the append fails at any replica, the client retries it; as a result, replicas of the same chunk may contain different data, including duplicates, whole or in part, of the same record.

52 Atomic Record Append: How?
GFS does not guarantee that all replicas are bitwise identical; it only guarantees that the data is written at least once as an atomic unit. For success to be reported, the data must have been written at the same offset on all chunk replicas. The sketch below illustrates these semantics.
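The at-least-once guarantee is easy to misread, so here is a small Python sketch with an invented single-replica failure: after the client retries, the replicas are not bitwise identical (one holds a duplicate), yet each contains the record at least once, exactly as the slides describe.

replicas = {"r1": [], "r2": [], "r3": []}

def record_append(record, fail_replica=None):
    # Append the record to every replica in turn; an injected failure
    # at one replica makes the whole append fail, as seen by the client.
    for name, chunk in replicas.items():
        if name == fail_replica:
            return False
        chunk.append(record)
    return True

if not record_append("rec-A", fail_replica="r2"):  # first attempt fails at r2
    record_append("rec-A")                         # the client retries and succeeds

print(replicas)
# r1 holds rec-A twice (a duplicate); r2 and r3 hold it once.
# The replicas differ byte-for-byte, but every one has the record at least once.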

53 Detecting Stale Replicas
The master keeps a chunk version number to distinguish up-to-date from stale replicas, and increases the version whenever it grants a lease. If a replica is unavailable at that point, its version is not increased, so the master detects stale replicas when chunkservers report their chunks and versions; stale replicas are removed during garbage collection. The client is given the version number when requesting a chunk's location, so it can verify that it is using the most up-to-date replica.

54 Garbage collection
When a client deletes a file, the master logs the deletion like any other change and renames the file to a hidden name. While scanning the file-system namespace, the master removes files that have been hidden for longer than three days, and their metadata is erased at that point. During HeartBeat messages, each chunkserver reports a subset of its chunks, the master replies with which of them no longer have metadata, and the chunkserver removes those chunks on its own.

55 Fault Tolerance: High Availability
Fast recovery: the master and chunkservers can restart in seconds. Chunk replication. Master replication: "shadow" masters provide read-only access when the primary master is down, and mutations are not considered done until recorded on all master replicas.

56 Fault Tolerance: Data Integrity
Chunkservers use checksums to detect corrupt data; since replicas are not bitwise identical, each chunkserver maintains its own checksums. For reads, the chunkserver verifies the checksum before sending the chunk and returns an error if the checksum does not match. Checksums are updated during writes; the scheme is optimized for append writes, since those dominate.

57 Introduction to MapReduce

58 MapReduce: Insight
"Consider the problem of counting the number of occurrences of each word in a large collection of documents." How would you do it in parallel?

59 MapReduce Programming Model
Inspired by the map and reduce operations commonly used in functional programming languages like Lisp. Users implement an interface of two primary methods: 1. Map: (key1, val1) → (key2, val2); 2. Reduce: (key2, [val2]) → [val3].

60 Map operation
Map, a pure function written by the user, takes an input key/value pair, e.g. (doc-id, doc-content), and produces a set of intermediate key/value pairs. To draw an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.

61 Reduce operation
On completion of the map phase, all the intermediate values for a given key are combined into a list and given to a reducer. Reduce can be visualized as the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

62 Pseudo-code
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
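To make the pseudo-code concrete, here is a minimal single-process Python sketch of the same word count. The distributed parts (partitioning, shuffling across machines) are simulated with a dictionary; all names are illustrative, not part of any real MapReduce API.

from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Map: emit an intermediate (word, "1") pair for every word.
    for word in doc_contents.split():
        yield (word, "1")

def reduce_fn(word, counts):
    # Reduce: sum all partial counts for one word.
    return str(sum(int(c) for c in counts))

def map_reduce(documents):
    groups = defaultdict(list)  # the "shuffle": group values by key
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}

docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
print(map_reduce(docs))  # e.g. {'the': '3', 'fox': '2', ...}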

63 MapReduce: Execution overview

64 MapReduce: Example

65 MapReduce in Parallel: Example

66 MapReduce: Fault Tolerance
Handled via re-execution of tasks; task completion is committed through the master. What happens if a mapper fails? Re-execute completed and in-progress map tasks. What happens if a reducer fails? Re-execute in-progress reduce tasks. What happens if the master fails? Potential trouble!

67 MapReduce: Walk-through of One More Application

68

69 MapReduce: PageRank
PageRank models the behavior of a "random surfer": PR(x) = (1-d) + d * Σ_{t→x} PR(t)/C(t), where C(t) is the out-degree of t and (1-d) is the probability of a random jump. The random surfer keeps clicking on successive links at random, not taking content into consideration, and each page distributes its rank equally among all pages it links to. The damping factor captures the surfer "getting bored" and typing an arbitrary URL.

70 PageRank: Key Insights
The effect of each iteration is local: the (i+1)-th iteration depends only on the i-th. Within iteration i, the PageRank of individual nodes can be computed independently.

71 PageRank using MapReduce
Use a sparse matrix representation (M). Map each row of M to a list of PageRank "credit" to assign to out-link neighbors. These prestige scores are then reduced to a single PageRank value for each page by aggregating over them.

72 PageRank using MapReduce
Map: distribute PageRank "credit" to link targets. Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value. Iterate until convergence. (Source of image: Lin 2008.)

73 Phase 1: Process HTML
The map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls)), where PRinit is the "seed" PageRank for the URL and list-of-urls contains all pages pointed to by the URL. The reduce task is just the identity function.

74 Phase 2: PageRank Distribution
The reduce task gets (URL, url_list) and many (URL, val) values; it sums the vals and fixes them up with d to get the new PageRank, then emits (URL, (new_rank, url_list)). Convergence is checked by a non-parallel component. A toy sketch of one such iteration follows.
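A toy Python sketch of one PageRank iteration in the map/reduce style just described. The three-page graph, the damping factor value, and the helper names are invented for illustration, and "convergence" is reduced to a fixed iteration count.

from collections import defaultdict

D = 0.85  # assumed damping factor value (the slides call it d)

def pagerank_iteration(ranks, links):
    # Map: each page distributes its rank equally among its out-links.
    credit = defaultdict(float)
    for page, out_links in links.items():
        share = ranks[page] / len(out_links)
        for target in out_links:
            credit[target] += share
    # Reduce: gather the credit per page and fix it up with d.
    return {page: (1 - D) + D * credit[page] for page in links}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}
for _ in range(20):  # iterate until (approximately) converged
    ranks = pagerank_iteration(ranks, links)
print(ranks)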

75 MapReduce: Some More Apps
Distributed grep; count of URL access frequency; clustering (k-means); graph algorithms; indexing systems. (Chart: MapReduce programs in the Google source tree.)

76 MapReduce: Extensions and similar apps
Pig (Yahoo!), Hadoop (Apache), DryadLINQ (Microsoft).

77 Large-Scale Systems Architecture using MapReduce
User application, on top of MapReduce, on top of a distributed file system (GFS).

78 BigTable: A Distributed Storage System for Structured Data

79 Introduction
BigTable is a distributed storage system for managing structured data, designed to scale to a very large size: petabytes of data across thousands of servers. Used by many Google projects: web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, and more. A flexible, high-performance solution for all of Google's products.

80 Motivation
Lots of (semi-)structured data at Google: URLs (contents, crawl metadata, links, anchors, pagerank, ...); per-user data (user preference settings, recent queries/search results, ...); geographic locations (physical entities such as shops and restaurants, roads, satellite image data, user annotations, ...). The scale is large: billions of URLs with many versions per page (~20K/version); hundreds of millions of users and thousands of queries/sec; 100 TB+ of satellite image data.

81 Why not just use a commercial DB?
The scale is too large for most commercial databases, and even if it weren't, the cost would be very high. Building internally means the system can be applied across many projects for a low incremental cost. Low-level storage optimizations help performance significantly, and they are much harder to do when running on top of a database layer.

82 Goals
Want asynchronous processes to be continuously updating different pieces of data, with access to the most current data at any time. Need to support: very high read/write rates (millions of ops per second); efficient scans over all or interesting subsets of the data; efficient joins of large one-to-one and one-to-many datasets. Often want to examine data changes over time, e.g., the contents of a web page over multiple crawls.

83 BigTable
A distributed multi-level map; fault-tolerant and persistent. Scalable: thousands of servers; terabytes of in-memory data; petabytes of disk-based data; millions of reads/writes per second with efficient scans. Self-managing: servers can be added and removed dynamically, and servers adjust to load imbalance.

84 Building Blocks
Building blocks: Google File System (GFS): raw storage. Scheduler: schedules jobs onto machines. Lock service: distributed lock manager. MapReduce: simplified large-scale data processing. How BigTable uses them: GFS stores persistent data (the SSTable file format); the scheduler schedules jobs involved in BigTable serving; the lock service handles master election and location bootstrapping; MapReduce is often used to read/write BigTable data.

85 Basic Data Model
A BigTable is a sparse, distributed, persistent, multi-dimensional sorted map: (row, column, timestamp) -> cell contents. A good match for most Google applications. The short Python sketch below illustrates the shape of this map.
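A few lines of Python can model the data layout (the sample cells echo the WebTable example that follows); this models only the shape of the map, not Bigtable's storage engine.

# Cell contents keyed by (row, column, timestamp); rows sort lexicographically.
table = {
    ("com.cnn.www", "contents:", 3): "<html>v3...</html>",
    ("com.cnn.www", "contents:", 2): "<html>v2...</html>",
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
}

def lookup(table, row, column):
    # "Return the most recent K values" with K = 1: highest timestamp wins.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

print(lookup(table, "com.cnn.www", "contents:"))  # "<html>v3...</html>"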

86 WebTable Example
Want to keep a copy of a large collection of web pages and related information. Use URLs as row keys and various aspects of the web page as column names. Store the contents of web pages in the contents: column, under the timestamps when they were fetched.

87 Rows
A row name is an arbitrary string. Access to data in a row is atomic, and row creation is implicit upon storing data. Rows are ordered lexicographically, so rows close together lexicographically usually sit on one or a small number of machines.

88 Rows (cont.)
Reads of short row ranges are efficient and typically require communication with only a small number of machines. One can exploit this property by selecting row keys that give good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu vs. edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys (a quick check in Python follows).
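The locality claim is easy to verify with two lines of Python; the URLs come from the slide, and the ordering is a plain lexicographic sort, the same ordering Bigtable uses for rows.

urls = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
reversed_keys = [".".join(reversed(u.split("."))) for u in urls]
print(sorted(urls))           # pages from one university are scattered
print(sorted(reversed_keys))  # edu.gatech.* and edu.uga.* now cluster together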

89 Columns
Columns have a two-level name structure: family:optional_qualifier. The column family is the unit of access control and has associated type information; the qualifier gives unbounded columns and additional levels of indexing, if desired.

90 Timestamps
Used to store different versions of data in a cell. New writes default to the current time, but timestamps for writes can also be set explicitly by clients. Lookup options: "return the most recent K values"; "return all values in a timestamp range (or all values)". Column families can be marked with attributes: "only retain the most recent K values in a cell"; "keep values until they are older than K seconds".

91 Implementation: Three Major Components
A library linked into every client. One master server, responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection. Many tablet servers, which handle read and write requests to their tablets and split tablets that have grown too large.

92 Implementation (cont.)
Client data does not move through the master server: clients communicate directly with tablet servers for reads and writes. Most clients never communicate with the master at all, leaving it lightly loaded in practice.

93 Tablets
Large tables are broken into tablets at row boundaries; a tablet holds a contiguous range of rows, and clients can often choose row keys to achieve locality. Aim for ~100-200 MB of data per tablet, with each serving machine responsible for ~100 tablets. Fast recovery: 100 machines each pick up one tablet from a failed machine. Fine-grained load balancing: migrate tablets away from an overloaded machine; the master makes the load-balancing decisions.

94 Tablet Location
Since tablets move around from server to server, how does a client find the right machine for a given row? It needs to find the tablet whose row range covers the target row.

95 Tablet Assignment
Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers, the current assignment of tablets to servers, and the unassigned tablets. When a tablet is unassigned, the master assigns it to a tablet server with sufficient room.

96 API
Metadata operations: create/delete tables and column families, change metadata. Writes (atomic per row): Set() writes cells in a row; DeleteCells() deletes cells in a row; DeleteRow() deletes all cells in a row. Reads: a Scanner reads arbitrary cells in a bigtable; each row read is atomic; the returned rows can be restricted to a particular range; one can ask for the data from one row, all rows, etc., and for all columns, certain column families, or specific columns. A hypothetical client sketch follows.
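A hypothetical client sketch of the write and read operations listed above. The class and method names mirror the slide (Set, DeleteCells, DeleteRow, Scanner) but are invented for illustration; this is not the real Bigtable client API.

class ToyTable:
    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def set_cells(self, row, cells):
        # Set(): write cells in a row (atomic per row in the real system).
        self.rows.setdefault(row, {}).update(cells)

    def delete_cells(self, row, columns):
        # DeleteCells(): delete specific cells in a row.
        for c in columns:
            self.rows.get(row, {}).pop(c, None)

    def delete_row(self, row):
        # DeleteRow(): delete all cells in a row.
        self.rows.pop(row, None)

    def scan(self, start, end):
        # Scanner: iterate rows within a lexicographic key range.
        for key in sorted(self.rows):
            if start <= key < end:
                yield key, self.rows[key]

t = ToyTable()
t.set_cells("com.cnn.www", {"contents:": "<html>...</html>"})
for row, cells in t.scan("com.a", "com.z"):
    print(row, cells)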

97 Refinements: Compression
Many opportunities for compression: similar values in the same row/column at different timestamps; similar values in different columns; similar values across adjacent rows. A two-pass custom compression scheme: the first pass compresses long common strings across a large window; the second pass looks for repetitions in a small window. Speed is emphasized, but the space reduction is still good (10-to-1).

98 Refinements: Bloom Filters
A read operation has to go to disk when the desired SSTable is not in memory. The number of accesses can be reduced by specifying a Bloom filter, which lets us ask whether an SSTable might contain data for a specified row/column pair. A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations; in particular, most lookups for non-existent rows or columns never touch disk. A minimal sketch follows.

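A minimal Bloom filter sketch in Python showing why a little memory can answer "might this SSTable contain this row/column?". The bit-array size and the use of salted SHA-256 as the k hash functions are illustrative choices, not Bigtable's actual parameters.

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely absent": no disk seek is needed.
        # True means "maybe present": only then read the SSTable.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))    # True
print(bf.might_contain("com.nosuchrow/contents:"))  # almost certainly False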

100 Main Contents
Overview of cloud computing; Google cloud computing techniques: GFS, Bigtable, and MapReduce; Yahoo! cloud computing techniques and Hadoop; challenges of cloud data management.

101 Yahoo! Cloud computing

102 Search Results of the Future
(Logos: yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, WebMD.)

103 What's in the Horizontal Cloud?
Simple Web service APIs over horizontal cloud services: provisioning and virtualization (e.g., EC2); batch storage and processing (e.g., Hadoop and Pig); operational storage (e.g., S3, MObStor, Sherpa); edge content services (e.g., YCS, YCPI); other services such as messaging, workflow, virtual DBs, and web serving; ID and account management; metering, billing, and accounting; shared infrastructure for security, monitoring, and QoS; and common approaches to QA, production engineering, performance engineering, datacenter management, and optimization.

104 Yahoo! Cloud Stack
Layers of horizontal cloud services: EDGE (YCS, YCPI, Brooklyn); WEB (VM/OS, yApache, PHP, App Engine); APP (VM/OS, Serving Grid); Data Highway; STORAGE (Sherpa, MObStor); BATCH (Hadoop). Cross-cutting services: provisioning (self-serve) and monitoring/metering/security.

105 Web Data Management
Three tiers of storage: structured record storage (PNUTS/Sherpa) for CRUD, point lookups, and short scans, with index-organized tables and random I/O, paying per unit of latency; large data analysis (Hadoop) for scan-oriented workloads focused on sequential disk I/O, paying per CPU cycle; and blob storage (SAN/NAS) for object retrieval and streaming, with scalable file storage, paying per GB.

106 The World Has Changed
Web-serving applications need: scalability (preferably elastic), flexible schemas, geographic distribution, high availability, and reliable storage. They can do without complicated queries and strong transactions.

107 PNUTS / SHERPA: To Help You Scale Your Mountains of Data
A project in Yahoo! Research focused on a long-range problem, with origins in earlier work at Wisconsin. It was the basis for the Goldrush hack, which won the recent Local hack competition, and could contribute to the creation and refinement of Y! Local content and next-generation search.

108 Yahoo! Serving Storage Problem
Small records, 100 KB or less. Structured records: lots of fields, evolving over time. Extreme data scale: tens of TB. Extreme request scale: tens of thousands of requests/sec. Low latency globally, from datacenters worldwide. High availability: outages cost millions of dollars. Variable usage patterns as applications and users change.

109 The PNUTS/Sherpa Solution
The next-generation global-scale record store. Record orientation: routing and data storage optimized for low-latency record access. Scale out: add machines to scale throughput (while keeping latency low). Asynchrony: pub-sub replication to far-flung datacenters masks propagation delay. Consistency model: reduces the complexity of asynchrony for the application programmer. Cloud deployment model: a hosted, managed service to reduce app time-to-market and enable on-demand scale and elasticity.

110 What is PNUTS/Sherpa?
CREATE TABLE Parts (ID VARCHAR, StockNumber INT, Status VARCHAR): a parallel database with a structured, flexible schema, geographic replication, and hosted, managed infrastructure. (Diagram: table records replicated across regions, each record mastered in region E or W.)

111 What Will It Become?
The same geographically replicated, parallel, hosted database with a structured, flexible schema, plus indexes and views.

112 What Will It Become?
Indexes and views (diagram).

113 Design Goals
Scalability: thousands of machines; easy to add capacity; restrict the query language to avoid costly queries. Geographic replication: asynchronous replication around the globe; low-latency local access. High availability and fault tolerance: automatically recover from failures; serve reads and writes despite failures. Consistency: per-record guarantees; timeline model; option to relax if needed. Multiple access paths: hash table and ordered table; primary and secondary access. Hosted service: applications plug and play; shared operational cost.

114 Technology Elements
Applications call the PNUTS API (tabular). PNUTS provides query planning and execution and index maintenance on top of a distributed infrastructure for tabular data: data partitioning, update consistency, and replication. Supporting components: YDOT FS (ordered tables), YDHT FS (hash tables), Tribble (pub/sub messaging), Zookeeper (consistency service), and YCA (authorization).

115 Data Manipulation
Per-record operations: Get, Set, Delete. Multi-record operations: Multiget, Scan, Getrange.

116 Tablets—Hash Table
(Columns: Name, Description, Price.)
Hash range 0x0000-0x2AF3: Grape (Grapes are good to eat, $12); Lime (Limes are green, $9); Apple (Apple is wisdom, $1); Strawberry (Strawberry shortcake, $900).
Hash range 0x2AF3-0x911F: Orange (Arrgh! Don't get scurvy!, $2); Avocado (But at what price?, $3); Lemon (How much did you pay for this lemon?, $1); Tomato (Is this a vegetable?, $14).
Hash range 0x911F-0xFFFF: Banana (The perfect fruit, $2); Kiwi (New Zealand, $8).

117 Tablets—Ordered Table
(Columns: Name, Description, Price.)
Key range A-H: Apple (Apple is wisdom, $1); Avocado (But at what price?, $3); Banana (The perfect fruit, $2); Grape (Grapes are good to eat, $12).
Key range H-Q: Kiwi (New Zealand, $8); Lemon (How much did you pay for this lemon?, $1); Lime (Limes are green, $9); Orange (Arrgh! Don't get scurvy!, $2).
Key range Q-Z: Strawberry (Strawberry shortcake, $900); Tomato (Is this a vegetable?, $14).
(A sketch of how a router maps keys to these tablets follows.)
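A small Python sketch of how a router might map a key to a tablet for both table types. The boundaries come from the two slides above, while the 16-bit hash merely stands in for whatever hash function PNUTS actually uses.

import bisect, hashlib

ordered_bounds = ["A", "H", "Q", "Z"]  # ordered-table boundaries from the slide

def ordered_tablet(key):
    return bisect.bisect_right(ordered_bounds, key) - 1  # tablet index 0..2

hash_bounds = [0x0000, 0x2AF3, 0x911F, 0xFFFF]  # hash-table boundaries from the slide

def hash_tablet(key):
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:2], "big")
    return min(bisect.bisect_right(hash_bounds, h) - 1, len(hash_bounds) - 2)

print(ordered_tablet("Lemon"))  # 1: falls in the H-Q tablet
print(hash_tablet("Lemon"))     # whichever range the 16-bit hash lands in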

118 Flexible Schema
Posted date | Listing id | Item | Price: 6/1/07 | 424252 | Couch | $570; 6/1/07 | 763245 | Bike | $86; 6/3/07 | 211242 | Car | $1123; 6/5/07 | 421133 | Lamp | $15. Some rows also carry extra fields such as Condition (Good, Fair) and Color (Red), illustrating the flexible schema.

119 Detailed Architecture
In the local region, clients issue requests through a REST API to routers, a tablet controller tracks tablet placement, and storage units hold the data; Tribble carries updates to the remote regions.

120 Tablet Splitting and Balancing
Each storage unit holds many tablets (horizontal partitions of the table). A storage unit may become a hotspot; load is shed by moving tablets to other servers. Tablets may grow over time, and overfull tablets split.

121 QUERY PROCESSING

122 Accessing Data
(Diagram, steps 1-4: the client sends "get key k" to a router; the router forwards the request to the storage unit (SU) holding the key; the record for key k is returned along the same path.)

123 Bulk Read
(Diagram, steps 1-2: a multiget for {k1, k2, ... kn} goes to a scatter/gather server, which issues the individual gets (Get k1, Get k2, Get k3, ...) to the storage units in parallel.)

124 Range Queries in YDOT
Clustered, ordered retrieval of records. The router maps ordered key ranges to storage units (for example, Apple through Canteloupe on storage unit 1, Canteloupe through Lime on storage unit 3, Lime through Strawberry on storage unit 2), so a range query such as Grapefruit...Pear is split into sub-ranges (Grapefruit...Lime, Lime...Pear) served by the units holding those ranges.

125 Updates
(Diagram, steps 1-8: the client sends "write key k" to a router; the router forwards the write to the record's master storage unit; the update is committed to the message brokers, which assign a sequence number for key k; SUCCESS is returned to the client, and the sequenced update is delivered asynchronously to the other storage units.)

126 ASYNCHRONOUS REPLICATION AND CONSISTENCY

127 Asynchronous Replication

128 Consistency Model
Goal: make it easier for applications to reason about updates and to cope with asynchrony. What happens to a record with primary key "Alice"? Over time, the record is inserted (v. 1), updated repeatedly (v. 2 through v. 7), and finally deleted (v. 8), all within Generation 1. As the record is updated, copies may get out of sync.
129 Example: Social Alice
(Diagram: record timeline for Alice's status across West and East copies. West writes Status = Busy, and both copies read Busy. West then writes Status = Free; until the update propagates, the other copy may still return a stale or unknown value (???), and eventually both copies read Free.)

130 Consistency Model
Read: in general, reads are served from a local copy, which may return a stale version rather than the current one.

131 Consistency Model
Read up-to-date: but the application can request, and get, the current version (on the timeline v. 1 through v. 8).

132 Consistency Model
Read ≥ v.6: or variations such as "read forward": while copies may lag the master record, every copy goes through the same sequence of changes.

133 Consistency Model
Write: achieved via a per-record primary-copy protocol (to maximize availability, record masterships are automatically transferred if a site fails). Can be selectively weakened to eventual consistency (local writes that are reconciled using version vectors).

134 Consistency Model
Write if = v.7: a test-and-set write succeeds only if the record is still at the expected version, and returns ERROR otherwise; test-and-set writes facilitate per-record transactions. A small sketch of these semantics follows.
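A sketch of the per-record timeline with a test-and-set write, in Python. The version numbers and the error behavior follow the slides; the class itself is invented and stands in for the master copy of one record.

class Record:
    def __init__(self):
        self.version, self.value = 0, None

    def write(self, value):
        # An ordinary write goes to the master copy and bumps the version.
        self.version += 1
        self.value = value
        return self.version

    def test_and_set(self, expected_version, value):
        # Write only if the record is still at the version the caller saw.
        if self.version != expected_version:
            raise ValueError("ERROR: record is at v.%d" % self.version)
        return self.write(value)

alice = Record()
for v in ["Busy", "Free", "Busy"]:
    alice.write(v)                 # versions 1, 2, 3
alice.test_and_set(3, "Offline")   # succeeds: the record was still at v.3
try:
    alice.test_and_set(3, "Away")  # fails: the record has moved on to v.4
except ValueError as e:
    print(e)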

135 Consistency Techniques
Per-record mastering: each record is assigned a "master region", which may differ between records; updates to the record are forwarded to the master region, ensuring a consistent ordering of updates. Tablet-level mastering: each tablet is assigned a "master region"; inserts and deletes of records are forwarded to the master region, and the master region decides tablet splits. These details are hidden from the application, except for the latency impact!

136 Mastering
Each tablet has a tablet master. (Diagram: each record is replicated across three regions, with the master copy of each record marked E or W.)

137 Bulk Insert/Update/Replace
The client feeds records to the bulk manager; the bulk loader transfers records to storage units in batches, bypassing the routers and message brokers for efficient import into the storage units.

138 Bulk Load in YDOT YDOT bulk inserts can cause performance hotspots
Solution: preallocate tablets

139 Index Maintenance
How do you have lots of interesting indexes and views without killing performance? Solution: asynchrony! Indexes and views are updated asynchronously when the base table is updated.

140 SHERPA IN CONTEXT

141 Types of Record Stores: Query Expressiveness
A spectrum from simple to feature-rich: S3 (object retrieval), PNUTS (retrieval from a single table of objects/records), Oracle (SQL).

142 Types of Record Stores: Consistency Model
A spectrum from best effort to strong guarantees: S3 (eventual consistency), PNUTS (timeline, object-centric consistency), Oracle (ACID, program-centric consistency).

143 Types of Record Stores: Data Model
A spectrum from flexibility and schema evolution (PNUTS, CouchDB) to systems optimized for fixed schemas (Oracle); correspondingly, from object-centric consistency to consistency that spans objects.

144 Types of Record Stores: Elasticity
Elasticity is the ability to add resources on demand. A spectrum from inelastic to elastic: Oracle is limited (via data distribution); PNUTS and S3 are elastic, VLSD systems (very large scale distribution/replication).

145 Data Stores Comparison (versus PNUTS)
User-partitioned SQL stores (Microsoft Azure SDS, Amazon SimpleDB): more expressive queries, but users must control partitioning and elasticity is limited. Multi-tenant application databases (Salesforce.com, Oracle on Demand): highly optimized for complex workloads, but limited flexibility for evolving applications, and they inherit the limitations of the underlying data management system. Mutable object stores (Amazon S3): object storage versus record management.

146 Application Design Space
Two axes: records versus files, and "get a few things" versus "scan everything". Sherpa, YMDB, MySQL, Oracle, and BigTable serve record gets; MObStor and the Filer serve file retrieval; Everest sits on the scan side for records; Hadoop scans everything over files.

147 Alternatives Matrix
(Matrix comparing Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, and Cassandra along the dimensions: global low latency, structured access, consistency model, SQL/ACID, operability, availability, updates, and elasticity.)

148 QUESTIONS?

149 Hadoop

150 Problem
How do you scale up applications? Jobs process hundreds of terabytes of data, which takes 11 days to read on one computer. You need lots of cheap computers: this fixes the speed problem (15 minutes on 1,000 computers), but brings reliability problems, since in large clusters computers fail every day and the cluster size is not fixed. You therefore need a common infrastructure that is efficient and reliable.

151 Solution
An open-source Apache project. Hadoop Core includes a distributed file system, which distributes data, and Map/Reduce, which distributes the application. Written in Java; runs on Linux, Mac OS X, Windows, and Solaris on commodity hardware.

152 Hadoop Hardware Cluster
Typically a two-level architecture: nodes are commodity PCs, 40 nodes per rack; the uplink from each rack is 8 gigabit; rack-internal bandwidth is 1 gigabit.

153 Distributed File System
A single namespace for the entire cluster, managed by a single namenode. Files are single-writer and append-only, optimized for streaming reads of large files. Files are broken into large blocks, typically 128 MB, each replicated to several datanodes for reliability. Access from Java, C, or the command line.

154 Block Placement
The default is 3 replicas, but this is settable. Blocks are placed (and writes pipelined): the first replica on the same node as the writer, the second on a different rack, the third on that other rack. Clients read from the closest replica. If the replication for a block drops below target, it is automatically re-replicated. A short sketch of this rule follows.
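A sketch of the default placement rule as the slide states it (first replica on the writer's node, second on a different rack, third on that same remote rack). The cluster topology is invented, and the real HDFS policy has more detail (for example, choosing among candidates at random).

def place_replicas(writer_node, topology, n=3):
    # topology maps node -> rack
    placements = [writer_node]                       # 1st replica: same node
    writer_rack = topology[writer_node]
    remote = [m for m, r in topology.items() if r != writer_rack]
    placements.append(remote[0])                     # 2nd replica: a different rack
    other_rack = topology[remote[0]]
    same_other = [m for m, r in topology.items()
                  if r == other_rack and m not in placements]
    placements.append(same_other[0])                 # 3rd replica: that other rack
    return placements[:n]

topology = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
print(place_replicas("n1", topology))  # ['n1', 'n3', 'n4']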

155 How is Yahoo using Hadoop?
It started with building better applications: scale up web-scale batch applications (search, ads, ...); factor out common code from existing systems so that new applications are easier to write; manage the many clusters.

156 Running the Production WebMap
Search needs a graph of the "known" web: invert edges, compute link text, apply whole-graph heuristics. It is a periodic batch job using Map/Reduce, with a chain of ~100 map/reduce jobs. Scale: 1 trillion edges in the graph; the largest shuffle is 450 TB; the final output is 300 TB compressed; it runs on 10,000 cores and uses 5 PB of raw disk.

157 Terabyte Sort Benchmark
Started by Jim Gray at Microsoft in 1998: sorting 10 billion 100-byte records. Hadoop won the general category in 209 seconds, on 910 nodes: 2 quad-core 2.0 GHz CPUs per node, 4 SATA disks per node, 8 GB RAM per node, 1 gigabit Ethernet per node; 40 nodes per rack with an 8 gigabit Ethernet uplink per rack. The previous record was 297 seconds.

158 Hadoop clusters
We have ~20,000 machines running Hadoop; our largest clusters are currently 2,000 nodes. Several petabytes of user data (compressed, unreplicated). We run hundreds of thousands of jobs every month.

159 Research Cluster Usage

160 Who Uses Hadoop?
Amazon/A9, AOL, Facebook, Fox Interactive Media, Google/IBM, New York Times, Powerset (now Microsoft), Quantcast, Rackspace/Mailtrust, Veoh, Yahoo!, and more.

161 Q&A For more information: Website: http://hadoop.apache.org/core
Mailing lists: 161

162 Main Contents
Overview of cloud computing; Google cloud computing techniques: GFS, Bigtable, and MapReduce; Yahoo! cloud computing techniques and Hadoop; challenges of cloud data management.

163 Summary of Applications
(Diagram mapping systems to application classes: BigTable, HBase, HyperTable, Hive, HadoopDB, ... for data analysis, Internet services, and private clouds; PNUTS for Web applications with some operations that can tolerate relaxed consistency.)

164 Architecture
Three architectural camps. MapReduce-based (BigTable, HBase, Hypertable, Hive): scalability, fault tolerance, the ability to run in a heterogeneous environment, and data replication in the file system, but a lot of work remains to support SQL. DBMS-based (SQL Azure, PNUTS, Voldemort): easy to support SQL and to use indexes and optimization methods, with data replication on top of the DBMS, but data storage can become a bottleneck. Hybrid of MapReduce and DBMS (HadoopDB): sounds good, but what about performance?

165 Consistency
Two kinds of consistency: strong consistency, ACID (Atomicity, Consistency, Isolation, Durability), and weak consistency, BASE (Basically Available, Soft state, Eventual consistency). (CAP-triangle diagram placing BigTable, HBase, Hive, Hypertable, HadoopDB, and PNUTS among the A/C/P vertices; SQL Azure is marked "?".) For PNUTS, at least one replica in the cluster is up to date, and the other replicas eventually become consistent.

166 A tailored RDBMS
(Word cloud: LOCK, ACID, SAFETY, TRANSACTION, 3NF.)

167 Further Reading
Efficient Bulk Insertion into a Distributed Ordered Table. Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan. SIGMOD 2008.
PNUTS: Yahoo!'s Hosted Data Serving Platform. Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. VLDB 2008.
Asynchronous View Maintenance for VLSD Databases. Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Raghu Ramakrishnan. SIGMOD 2009.
Cloud Storage Design in a PNUTShell. Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava. In Beautiful Data, O'Reilly Media, 2009.

168 Further Reading
F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP, 2003.
D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422-469, 2000.

169 A forthcoming cloud computing textbook
Tsinghua University Press, June 2010: Distributed Systems and Cloud Computing. Three major parts: distributed systems; an overview of cloud computing techniques; cloud computing platforms and programming guidance.

170 Full contents of Introduction to Distributed Systems and Cloud Computing
Chapter 1, Introduction: 1.1 Overview of distributed systems; 1.2 The rise of distributed cloud computing; 1.3 Main services and applications of distributed cloud computing; 1.4 Summary. Part I, Overview of distributed systems. Chapter 2, Getting started with distributed systems: 2.1 Definition of a distributed system; 2.2 Hardware and software in distributed systems; 2.3 Main characteristics of distributed systems (e.g., security, fault tolerance, and so on); 2.4 Summary. Chapter 3, Client-server architecture: 3.1 Client-server architecture and structure; 3.2 Client-server communication protocols; 3.3 Variants of the client-server model; 3.4 Summary.

171 Full contents of Introduction to Distributed Systems and Cloud Computing
Chapter 4, Distributed objects: 4.1 The basic model of distributed objects; 4.2 Remote procedure call; 4.3 Remote method invocation; 4.4 Summary. Chapter 5, The Common Object Request Broker Architecture (CORBA): 5.1 CORBA overview; 5.2 Basic CORBA services; 5.3 Fault tolerance and security; 5.4 The Java IDL language; 5.5 Summary. Part II, Distributed cloud computing techniques. Chapter 6, Overview of distributed cloud computing: 6.1 Getting started with cloud computing; 6.2 Cloud services; 6.3 Cloud computing compared with other technologies; 6.4 Summary. Chapter 7, The three core technologies of the Google cloud platform: 7.1 The Google File System; 7.2 Bigtable; 7.3 MapReduce; 7.4 Summary.

172 Full contents of Introduction to Distributed Systems and Cloud Computing
Chapter 8, Yahoo cloud platform technologies: 8.1 PNUTS, a flexible, general-purpose table storage platform; 8.2 Pig, a platform for analyzing large data sets; 8.3 ZooKeeper, a centralized coordination service; 8.4 Summary. Chapter 9, Aneka cloud platform technologies: 9.1 The Aneka cloud platform; 9.2 Market-oriented cloud architecture; 9.3 Aneka, from enterprise grids to market-oriented cloud computing; 9.4 Summary. Chapter 10, Greenplum cloud platform technologies: 10.1 Overview of the GreenPlum system; 10.2 The GreenPlum analytic database; 10.3 Architecture and characteristics of the GreenPlum database; 10.4 Key features and advantages of GreenPlum; 10.5 Summary. Chapter 11, Amazon Dynamo cloud platform technologies: 11.1 Overview of Amazon Dynamo; 11.2 The background of Amazon Dynamo's development; 11.3 The Amazon Dynamo system architecture; 11.4 Summary. Chapter 12, IBM technologies: 12.1 Overview of IBM cloud computing; 12.2 IBM CloudBurst (云风暴); 12.3 IBM smart business services; 12.4 The IBM Smarter Planet initiative; 12.5 IBM System z; 12.6 IBM's dynamic infrastructure with virtualization; 12.7 Summary.

173 Full contents of Introduction to Distributed Systems and Cloud Computing
Part III, Programming for distributed cloud computing. Chapter 13, Development on the Hadoop system: 13.1 Overview of the Hadoop system; 13.2 The Map/Reduce user interfaces; 13.3 Task execution and the execution environment; 13.4 Practical programming examples; 13.5 Summary. Chapter 14, Development on the HBase system: 14.1 What the HBase system is; 14.2 The HBase data model; 14.3 HBase structure and functionality; 14.4 How to use HBase; 14.5 Summary.

174 Full contents of Introduction to Distributed Systems and Cloud Computing
Chapter 15, Development on the Google Apps system: 15.1 Introduction to Google App Engine; 15.4 Summary. Chapter 16, Development on the MS Azure system: 16.1 Introduction to the MS Azure system; 16.2 Using Windows Azure services; 16.3 Summary. Chapter 17, Development examples on the Amazon EC2 system: 17.1 Introduction to Amazon Elastic Compute Cloud; 17.2 How to use Amazon EC2; 17.3 Summary.

175 Q&A Thanks

