Big Data Yuan Xue CS 292 Special topics on.


1 CS 292 Special Topics on Big Data
Yuan Xue (yuan.xue@vanderbilt.edu)

2 It All Starts with Data
Big data: a growing torrent
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey & Company, http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

3 What is Big Data
- Volume: size of the data
- Velocity: latency of data processing relative to the growing demand for interactivity
- Variety: diversity of sources, formats, quality, structures
- Veracity: uncertainty, imprecision of data
Sources: "Big data: The next frontier for innovation, competition, and productivity", McKinsey & Company; http://en.wikipedia.org/wiki/Big_data

4 Put Data To Use
- Help domain scientists achieve new discoveries
- Help companies provide better services
- Help governments become more efficient
- And more
The transformative potential of big data in five domains:
- 3a. Health care (United States)
- 3b. Public sector administration (European Union)
- 3c. Retail (United States)
- 3d. Manufacturing (global)
- 3e. Personal location data (global)
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey & Company, http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
Need computer scientists and engineers to help manage the data

5 Data Management and Analytics
- Data management (data engineering)
  - Storage, access, manipulation, integration
  - Real-time update, access
  - Ad hoc query
  - Batch processing
  - Distributed system design
- Data analytics (data science)
  - Extraction of knowledge from data
  - Automatic, semi-automatic
  - Structured, unstructured
  - Statistical estimation and prediction
  - Machine learning, data mining
  - Visualization and communication
[Diagram labels: Data, Data Management, Data Analysis, support]

6 This course
- Learn how to use data management systems
- Understand how to build scalable data management systems
- Learn interesting facts from data through hands-on work
[Diagram labels: Data, Data Management, Data Analysis, support]

7 This course
- Along multiple dimensions
  - From small to big (in scale)
    - SQL to NoSQL
  - From simple to complex (in data modeling)
    - Key-value
    - Column family
    - Document
    - Graph? (no plan to cover for now)
  - From disk to in-memory
    - Redis
    - Memcached
    - MapReduce
    - Spark
- Method: top down
  - How to use
  - How it works
  - When to use
[Diagram labels: SQL, NoSQL, NewSQL; Data Model, Operations, System Design, Performance Optimization]
Source: http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
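The three data models named on this slide can be contrasted with a small sketch. It uses plain Python dictionaries only and is not the API of any of the systems in the course; the record, the store names, and the helper functions (`kv_get`, `cf_get`, `doc_find`) are all hypothetical and chosen for illustration.

```python
import json

# Hypothetical record used only for illustration.
USER_ID = "u42"

# Key-value: one key maps to an opaque value (here a JSON string).
kv_store = {
    f"user:{USER_ID}": json.dumps({"name": "Ada", "city": "Nashville"}),
}

# Column family: row key -> column family -> column -> value.
cf_store = {
    USER_ID: {
        "profile": {"name": "Ada", "city": "Nashville"},
        "activity": {"last_login": "2014-01-15"},
    },
}

# Document: nested, self-describing records that can be searched by field.
doc_store = [
    {
        "_id": USER_ID,
        "name": "Ada",
        "city": "Nashville",
        "logins": [{"date": "2014-01-15"}],
    },
]


def kv_get(key):
    """Fetch by key; the caller must decode the opaque value."""
    return json.loads(kv_store[key])


def cf_get(row, family, column):
    """Fetch a single column without reading the rest of the row."""
    return cf_store[row][family][column]


def doc_find(field, value):
    """Return every document whose top-level field equals value."""
    return [doc for doc in doc_store if doc.get(field) == value]
```

Moving from key-value to document, the store knows more about the structure of the data, which is roughly the "simple to complex" axis the slide describes.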

8 Tools and System
- Hands-on systems (items that you can put on your resume!)
  - MySQL
  - MapReduce (YARN)
  - HDFS
  - HBase
  - DynamoDB
  - Cassandra
  - Memcached
  - Redis
  - MongoDB
  - Pig
  - Hive
  - Impala
  - Mahout
  - Spark
- Design knowledge
  - BigTable
  - Dynamo
  - Dremel
  - Spanner
  - Storm
[Diagram labels: Resource management (YARN); Data Storage: File System (HDFS), Database (SQL, NoSQL, NewSQL); Data Processing and Analysis: Batch Processing/Analysis (MapReduce, Pig, Hive, Mahout), Interactive Access (Impala/Drill), Real time stream (Storm)]
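As a rough illustration of the MapReduce programming model listed above, the following is a minimal single-process sketch of a word count. It does not use Hadoop or YARN; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, `word_count(["big data", "Big deal"])` counts "big" twice because the map step lowercases each word before emitting it.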

9 Putting This Course in the Big Data Landscape
- Lecture
- Lab
- (Guest) lecture
- Project (defined by you)
Source: http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/

10 Background Required
- Strong programming and hands-on capability
  - Lots of time-consuming system setup, development, debugging, etc.
- Solid data structure and algorithm knowledge
  - Hash table, B-tree, etc.
- Operating system
  - Concurrency (e.g., race condition, lock, synchronization)
- Network
  - Network delay, loss, bandwidth
  - How data is transferred from one host to another
  - Basic concepts in network programming (i.e., socket programming)
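The concurrency background listed above (race condition, lock, synchronization) can be illustrated with a short sketch using Python's standard `threading` module. The `Counter` class and `run` function are hypothetical names for this example. Without the lock, the read-modify-write in `increment` could interleave across threads and lose updates; the lock makes each increment atomic with respect to the other threads.

```python
import threading


class Counter:
    """A shared counter whose updates are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute the read-modify-write.
        with self._lock:
            self.value += 1


def run(num_threads=4, increments=10000):
    """Increment one shared counter from several threads; return the total."""
    counter = Counter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

With the lock in place, `run(4, 1000)` returns 4000 on every run, since no increment can be lost.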

11 Course Information
Check out our website: http://vanets.vuse.vanderbilt.edu/dokuwiki/doku.php?id=teaching:cs292-spring2014
- Presentation (team work)
  - Comprehensive and concise introduction
  - Demonstration based on an example application
  - Review and revision by me
- 4 labs (team work)
  - Pick an application/data set
- 2 quizzes
- Project (team work)
  - Pick your own topic
  - Start early
Start teaming ASAP!

12 Logistics
- Development platform
  - Local environment: your choice, but Eclipse is recommended
  - Code repository: GitHub
- Experiment platform
  - Your own machine
  - EECS Linux system
  - Amazon Web Services

