CS 292 Special Topics on Big Data, Yuan Xue


It All Starts with Data
Big data: a growing torrent
Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey & Company)

What is Big Data
- Volume: size of the data
- Velocity: latency of data processing relative to the growing demand for interactivity
- Variety: diversity of sources, formats, quality, structures
- Veracity: uncertainty, imprecision of data
Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey & Company)

Put Data To Use
- Help domain scientists achieve new discoveries
- Help companies provide better services
- Help governments become more efficient
- And more
The transformative potential of big data in five domains:
- 3a. Health care (United States)
- 3b. Public sector administration (European Union)
- 3c. Retail (United States)
- 3d. Manufacturing (global)
- 3e. Personal location data (global)
Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey & Company)
Need computer scientists and engineers to help manage the data

Data Management and Analytics
- Data management (data engineering)
  - Storage, access, manipulation, integration
  - Real-time update, access
  - Ad hoc query
  - Batch processing
  - Distributed system design
- Data analytics (data science)
  - Extraction of knowledge from data
  - Automatic, semi-automatic
  - Structured, unstructured
  - Statistical estimation and prediction
  - Machine learning, data mining
  - Visualization and communication
[Diagram labels: Data, Data Management, Data Analysis, support]

This course
- Learn how to use data management systems
- Understand how to build scalable data management systems
- Learn interesting facts from data through hands-on work
[Diagram labels: Data, Data Management, Data Analysis, support]

This course
- Along multiple dimensions
  - From small to big (in scale)
    - SQL to NoSQL
  - From simple to complex (in data modeling)
    - Key-value
    - Column family
    - Document
    - Graph? (no plan to cover for now)
  - From disk to in-memory
    - Redis
    - Memcached
    - MapReduce
    - Spark
- Method: top down
  - How to use
  - How it works
  - When to use
[Diagram labels: SQL, NoSQL, NewSQL; Data Model, Operations, System Design, Performance Optimization]
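
To make the "simple to complex" data-modeling dimension more concrete, here is a minimal sketch in plain Python, with no real database involved, of how one record might be laid out under the key-value, column-family, and document models. The record, the field names, and the three helper functions are hypothetical and are not taken from the slides.

```python
import json

# Hypothetical record id, used only for illustration.
USER_ID = "u42"

# Key-value model: the store sees one opaque value per key.
kv_store = {
    f"user:{USER_ID}": json.dumps({"name": "Ada", "city": "Nashville"}),
}

# Column-family model: row key -> column family -> column -> value.
cf_store = {
    USER_ID: {
        "profile": {"name": "Ada", "city": "Nashville"},
        "activity": {"last_login": "2015-01-12"},
    },
}

# Document model: a collection of nested, self-describing documents.
doc_store = [
    {"_id": USER_ID, "name": "Ada", "address": {"city": "Nashville"}},
]


def city_from_kv(store, uid):
    """Fetch the whole value by key, then decode it in the application."""
    return json.loads(store[f"user:{uid}"])["city"]


def city_from_cf(store, uid):
    """Address a single column inside one column family of a row."""
    return store[uid]["profile"]["city"]


def city_from_doc(store, uid):
    """Scan the collection for the matching document, then follow a nested path."""
    return next(d["address"]["city"] for d in store if d["_id"] == uid)
```

In this sketch the key-value layout pushes all structure into the application, while the column-family and document layouts expose progressively more structure to the store itself.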

Tools and Systems
- Hands-on systems (items that you can put on your resume!)
  - MySQL
  - MapReduce (YARN)
  - HDFS
  - HBase
  - DynamoDB
  - Cassandra
  - Memcached
  - Redis
  - MongoDB
  - Pig
  - Hive
  - Impala
  - Mahout
  - Spark
- Design knowledge
  - BigTable
  - Dynamo
  - Dremel
  - Spanner
  - Storm
[Diagram labels: Resource management (YARN); Data Storage: File System (HDFS), Database (SQL, NoSQL, NewSQL); Data Processing and Analysis: Batch Processing/Analysis (MapReduce, Pig, Hive, Mahout), Interactive Access (Impala/Drill), Real-time stream (Storm)]

Putting This Course in the Big Data Landscape
[Diagram legend: Lecture; Lab; (Guest) Lecture; Project (defined by you)]

Background Required
- Strong programming and hands-on capability
  - Lots of time-consuming system setup, development, debugging, etc.
- Solid data structure and algorithm knowledge
  - Hash table, B-tree, etc.
- Operating system
  - Concurrency (e.g., race condition, lock, synchronization)
- Network
  - Network delay, loss, bandwidth
  - How data is transferred from one host to another
  - Basic concepts in network programming (i.e., socket programming)
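
As a quick self-check on the concurrency background listed above (race condition, lock, synchronization), here is a minimal sketch using Python's standard `threading` module. The `Counter` class, the `run` helper, and the thread and iteration counts are illustrative assumptions, not course material.

```python
import threading


class Counter:
    """Shared counter whose increments are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # `value += 1` is a read-modify-write sequence. Without the lock,
        # two threads could read the same old value and one update could
        # be lost, which is a race condition.
        with self._lock:
            self.value += 1


def run(num_threads=4, increments=10_000):
    """Start several threads that all increment one shared counter."""
    counter = Counter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the result
    return counter.value
```

With the lock in place, `run(4, 10_000)` returns exactly `num_threads * increments`; removing the `with self._lock:` line makes that result no longer guaranteed.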

Course Information
Check out our website:
- Presentation (team work)
  - Comprehensive and concise introduction
  - Demonstration based on an example application
  - Review and revision by me
- 4 labs (team work)
  - Pick an application/data set
- 2 quizzes
- Project (team work)
  - Pick your own topic
  - Start early
Start teaming ASAP!

Logistics
- Development platform
  - Local environment: your choice, but Eclipse is recommended
  - Code repository: GitHub
- Experiment platform
  - Your own machine
  - EECS Linux system
  - Amazon Web Services