Big Data Open Source Software and Projects ABDS in Summary I: Layers 1 to 2 Data Science Curriculum March 1 2015 Geoffrey Fox


Big Data Open Source Software and Projects ABDS in Summary I: Layers 1 to 2 Data Science Curriculum March Geoffrey Fox School of Informatics and Computing Digital Science Center Indiana University Bloomington

Functionality of 21 HPC-ABDS Layers
Here are 21 functionalities (counting the subparts of 11, 14 and 15): 4 cross-cutting at the top, then 17 in the order of the layered diagram, starting at the bottom.
1) Message Protocols
2) Distributed Coordination
3) Security & Privacy
4) Monitoring
5) IaaS Management from HPC to hypervisors
6) DevOps
7) Interoperability
8) File systems
9) Cluster Resource Management
10) Data Transport
11) A) File management B) NoSQL C) SQL
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter-process communication: collectives, point-to-point, publish-subscribe, MPI
14) A) Basic Programming model and runtime, SPMD, MapReduce B) Streaming
15) A) High level Programming B) Application Hosting Frameworks
16) Application and Analytics
17) Workflow-Orchestration


Apache Thrift
Thrift is an interface definition language and binary communication protocol used to define and create services for numerous languages. It is a remote procedure call (RPC) framework, developed at Facebook for "scalable cross-language services development". It combines a software stack with a code generation engine to build services that work seamlessly, with varying degrees of efficiency, between C#, C++ (on POSIX-compliant systems), Cappuccino, Cocoa, Delphi, Erlang, Go, Haskell, Java, Node.js, OCaml, Perl, PHP, Python, Ruby and Smalltalk.
Note that this type of capability is augmented by serializers such as Kryo for Java.
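The division of labour that a code generator produces for an RPC framework can be sketched by hand. The example below is only an illustration of the client-stub and server-dispatch pattern. It does not use the Thrift library, it uses JSON rather than Thrift's binary protocol, and the Calculator service with all of its class names is hypothetical.

```python
import json


class CalculatorHandler:
    """Server-side implementation of the hypothetical Calculator service."""

    def add(self, a, b):
        return a + b


class CalculatorProcessor:
    """Stand-in for a generated server skeleton: decodes a call and dispatches it."""

    METHODS = ("add",)  # the only methods the service definition exposes

    def __init__(self, handler):
        self._handler = handler

    def process(self, request: bytes) -> bytes:
        call = json.loads(request)
        if call["method"] not in self.METHODS:
            return json.dumps({"error": "unknown method"}).encode()
        result = getattr(self._handler, call["method"])(*call["args"])
        return json.dumps({"result": result}).encode()


class CalculatorClient:
    """Stand-in for a generated client stub: looks like a local object."""

    def __init__(self, transport):
        # Any callable that maps request bytes to reply bytes.
        self._transport = transport

    def add(self, a, b):
        request = json.dumps({"method": "add", "args": [a, b]}).encode()
        reply = json.loads(self._transport(request))
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply["result"]


processor = CalculatorProcessor(CalculatorHandler())
client = CalculatorClient(processor.process)  # in-process "network"
print(client.add(2, 3))  # 5
```

In a real deployment the transport would be a socket and both sides could be written in different languages. The caller still sees only an ordinary method call.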

Google Protobuf (Protocol Buffers)
Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.
Protocol Buffers are a method of serializing structured data. As such, they are useful for programs that communicate with each other over a wire and for storing data. The method involves an interface description language that describes the structure of some data, and a program that generates, from that description, source code in various programming languages for generating or parsing a stream of bytes that represents the structured data.
Protocol Buffers are serialized into a binary wire format that is compact, forwards-compatible, and backwards-compatible, but not self-describing (that is, there is no way to tell the names, meaning, or full datatypes of fields without an external specification).
Languages: C++, Java, Python.
Protocol Buffers are very similar to the Apache Thrift protocol (used by Facebook, for example), except that the public Protocol Buffers implementation does not include a concrete RPC protocol stack to use for defined services.
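To see why the wire format is compact but not self-describing, it helps to encode one integer field by hand. The sketch below follows the publicly documented encoding for varint fields (a key of field number and wire type, followed by a base-128 varint). It does not use the protobuf library, it handles only non-negative integers with wire type 0, and the function names are made up for this example.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: key = (field_number << 3) | wire type 0."""
    key = (field_number << 3) | 0
    return encode_varint(key) + encode_varint(value)


def decode_varint_fields(data: bytes) -> dict:
    """Decode a message made only of varint fields into {field_number: value}."""
    fields = {}
    pos = 0

    def read_varint() -> int:
        nonlocal pos
        result = 0
        shift = 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result
            shift += 7

    while pos < len(data):
        key = read_varint()
        if key & 0x07 != 0:
            raise ValueError("only wire type 0 (varint) is handled here")
        fields[key >> 3] = read_varint()
    return fields


message = encode_varint_field(1, 150)
print(message.hex())                  # 089601
print(decode_varint_fields(message))  # {1: 150}
```

The decoder recovers only the field number 1 and the value 150. The field's name and its intended type live in the external description, which is what "not self-describing" means in practice.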

Apache Avro
Apache Avro relies on schemas defined with JSON. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake.
Avro differs from Thrift and Protocol Buffers in these ways:
– Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
– Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
– No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
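The idea of untagged data resolved by field name can be illustrated without the Avro library. The sketch below is a simplification: values are kept in a Python list rather than in Avro's binary encoding, the User record and its fields are invented for the example, and only two resolution cases are covered (reordered fields, and a reader field with a default that the writer did not have).

```python
import json

# Schemas are plain JSON; both are hypothetical examples.
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int"}]}
""")

READER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "age",   "type": "int"},
            {"name": "name",  "type": "string"},
            {"name": "email", "type": "string", "default": "unknown"}]}
""")


def write_untagged(schema, record):
    """Emit values in schema order, with no names or tags attached."""
    return [record[field["name"]] for field in schema["fields"]]


def read_resolved(writer_schema, reader_schema, values):
    """Rebuild a record for the reader by matching fields on their names."""
    written = {
        field["name"]: value
        for field, value in zip(writer_schema["fields"], values)
    }
    record = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            record[name] = written[name]
        elif "default" in field:
            record[name] = field["default"]  # field added after the data was written
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return record


payload = write_untagged(WRITER_SCHEMA, {"name": "Ada", "age": 36})
print(payload)  # ['Ada', 36]
print(read_resolved(WRITER_SCHEMA, READER_SCHEMA, payload))
# {'age': 36, 'name': 'Ada', 'email': 'unknown'}
```

The payload carries no field names or numeric IDs. The reader can interpret it only because the writer's schema travels with the data.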


Apache ZooKeeper & Google Chubby
An important technology for providing reliable control metadata in distributed, scalable systems.
ZooKeeper is a distributed configuration service, synchronization service, and naming registry for large distributed systems. It was a subproject of Hadoop but is now a top-level project in its own right. It is based on Google Chubby.
ZooKeeper's architecture supports high availability through redundant services: clients can ask another ZooKeeper server if the first fails to answer.
ZooKeeper nodes store their data in a hierarchical name space, much like a file system or a trie (digital tree) data structure. Clients can read from and write to the nodes and in this way have a shared configuration service. Updates are totally ordered.
ZooKeeper is used by companies including Rackspace, Yahoo and eBay, as well as by open source systems such as Solr (enterprise search) and Storm (stream processing).
See also Giraffe, presented as an improved technology.
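The data model described above (a tree of named nodes plus a single global order on updates) can be mimicked in a few lines. This is a single-process toy, not the ZooKeeper client API: there is no replication, no watches and no sessions, and the class ToyZnodeTree and its method names are hypothetical.

```python
class ToyZnodeTree:
    """In-memory tree of path -> bytes with a global, increasing update counter."""

    def __init__(self):
        self._data = {"/": b""}
        self._last_txid = 0
        self.log = []  # (txid, operation, path) in the order updates were applied

    def _record(self, operation, path):
        self._last_txid += 1  # every update gets the next number: a total order
        self.log.append((self._last_txid, operation, path))
        return self._last_txid

    def create(self, path, data=b""):
        if not path.startswith("/") or path.endswith("/"):
            raise ValueError("paths look like /a/b and cannot end with /")
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent {parent} does not exist")
        if path in self._data:
            raise KeyError(f"{path} already exists")
        self._data[path] = data
        return self._record("create", path)

    def set(self, path, data):
        if path not in self._data:
            raise KeyError(f"{path} does not exist")
        self._data[path] = data
        return self._record("set", path)

    def get(self, path):
        return self._data[path]

    def children(self, path):
        prefix = path if path.endswith("/") else path + "/"
        return sorted(
            key[len(prefix):]
            for key in self._data
            if key != path and key.startswith(prefix) and "/" not in key[len(prefix):]
        )


tree = ToyZnodeTree()
tree.create("/config")
tree.create("/config/db", b"host=a")
tree.set("/config/db", b"host=b")
print(tree.get("/config/db"))    # b'host=b'
print(tree.children("/config"))  # ['db']
print(tree.log)
# [(1, 'create', '/config'), (2, 'create', '/config/db'), (3, 'set', '/config/db')]
```

In the real service the ordered log is what the redundant servers agree on, so every client observes the same sequence of configuration changes.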

JGroups
JGroups is a reliable multicast system written in the Java language and released as open source (under the LGPL; releases from 3.4 on use the Apache License 2.0).
JGroups adds a "grouping" layer over a transport protocol, internally keeping a list of participants. This list is used to:
– Make the application aware of the listeners
– Make some or all transmissions reliable
– Allow totally ordered transmissions
JGroups is a toolkit for reliable multicast communication. It can be used to create groups of processes whose members can send messages to each other. JGroups enables developers to create reliable multipoint (multicast) applications where reliability is a deployment issue. It relieves application developers from implementing this logic themselves, which saves significant development time and allows the application to be deployed in different environments without code changes.
The most powerful feature of JGroups is its flexible protocol stack, which allows developers to adapt it to exactly match their application requirements and network characteristics. The benefit is that you only pay for what you use: by mixing and matching protocols, differing application requirements can be satisfied. JGroups comes with a number of protocols, including UDP (IP multicast), TCP and JMS, and anyone can write their own.
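The "mix and match" idea behind a configurable protocol stack can be shown with a generic sketch. The code below is not JGroups and does not reproduce its API or its protocols. The Layer, Compress, Checksum and Stack classes are hypothetical, and the two layers were chosen only because they are easy to implement with the Python standard library.

```python
import struct
import zlib


class Layer:
    """One protocol in the stack: transforms messages on the way down and up."""

    def down(self, message: bytes) -> bytes:
        return message

    def up(self, message: bytes) -> bytes:
        return message


class Compress(Layer):
    def down(self, message):
        return zlib.compress(message)

    def up(self, message):
        return zlib.decompress(message)


class Checksum(Layer):
    def down(self, message):
        # Append a 4-byte CRC32 so the receiver can detect corruption.
        return message + struct.pack(">I", zlib.crc32(message))

    def up(self, message):
        body, received = message[:-4], struct.unpack(">I", message[-4:])[0]
        if zlib.crc32(body) != received:
            raise ValueError("checksum mismatch")
        return body


class Stack:
    """Applies layers top to bottom when sending, bottom to top when receiving."""

    def __init__(self, layers):
        self._layers = list(layers)

    def send(self, message: bytes) -> bytes:
        for layer in self._layers:
            message = layer.down(message)
        return message

    def receive(self, wire: bytes) -> bytes:
        for layer in reversed(self._layers):
            wire = layer.up(wire)
        return wire


light = Stack([Checksum()])             # only integrity checking
full = Stack([Compress(), Checksum()])  # compression plus integrity checking

payload = b"hello group " * 20
print(full.receive(full.send(payload)) == payload)      # True
print(len(light.send(payload)), len(full.send(payload)))
```

The application code that calls send and receive is the same for both stacks. Only the list of layers changes, which is the sense in which such a design lets behaviour be decided at deployment time.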