Data Science Background and Course Software setup Week 1.


1 Data Science Background and Course Software setup Week 1

2 Index
Installation process
Lecture 1: Introduction to big data and data science
Lecture 2: Performing data science and preparing data

3 Installation process (I)
The same development environment, built from two free software packages: VirtualBox and Vagrant
Virtual machine hardware and software prerequisites
Minimum hardware requirements:
Free disk space: 3.5 GB
RAM: 2.5 GB (4+ GB preferred)
Processor: any recent Intel or AMD multicore processor should be sufficient
Supported operating systems: Windows, Linux, Mac OS X

4 Installation process (II)
Installation of VirtualBox: virtualbox.org -> Downloads -> choose the appropriate version of VirtualBox for your OS
Installation of Vagrant: www.vagrantup.com -> Downloads -> choose the appropriate version of Vagrant for your OS
Installation of the virtual machine:
1. Create a custom directory (e.g., /home/marrval/myvagrant)
2. Download the file https://github.com/spark-mooc/mooc-setup/archive/master.zip to the custom directory and unzip it
3. Copy the Vagrantfile from the unzipped folder to the custom directory you created in step 1
4. Open a DOS prompt (Windows) or Terminal (Mac/Linux), change to the custom directory, and issue the command vagrant up (VirtualBox opens in the background)
Sparkvm is running!

5 Installation process (III)
Basic instructions for using the virtual machine
To start the VM, from a DOS prompt (Windows) or Terminal (Mac/Linux), issue the command vagrant up
To stop the VM, use the command vagrant halt. You should always stop the VM before you log off, turn off, or reboot your computer.
To erase or delete the VM, use the command vagrant destroy
Once the VM is running, open a web browser to "http://localhost:8001/" to access the notebook. The VM starts the IPython notebook on port 8001, so we have access to an IPython notebook with Spark.
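
Before opening the browser, it can help to confirm that something is actually listening on port 8001. The sketch below is only an illustration and is not part of the course material: the helper name notebook_port_open is hypothetical, and it only checks that a TCP connection can be opened, not that the notebook itself is healthy.

```python
import socket


def notebook_port_open(host="localhost", port=8001, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; the port number 8001 comes
    from the slide, everything else is an assumption.
    """
    try:
        # create_connection raises OSError if the connection is refused
        # or times out, which is what happens when the VM is not running.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if notebook_port_open():
        print("Port 8001 is reachable; open http://localhost:8001/")
    else:
        print("Nothing is listening on port 8001; try 'vagrant up' first")
```

If the check fails right after vagrant up, the VM may simply still be booting, so retrying after a short wait is a reasonable first step.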

6 Installation process (IV)
Running your first notebook
1. Start the VM
2. Open a web browser to "http://localhost:8001/"
3. Upload the file "lab0_student.ipynb", which is contained in the .zip
4. Run the cells and verify that you do not encounter any errors

7 Introduction to big data and data science (I) Correlation doesn’t imply causation Use more data Explore more types of data/factors
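
A small simulation can make the first point concrete. The sketch below uses invented data (temperature, ice cream sales, and sunburn counts are assumptions chosen for illustration, not figures from the lecture): two quantities that never influence each other still correlate strongly because a third factor drives both. Exploring that extra factor, as the slide suggests, makes the apparent link disappear.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)  # fixed seed so the run is repeatable

# Hidden common cause (made-up data): daily temperature.
temperature = [random.uniform(10, 35) for _ in range(500)]

# Both series depend on temperature plus independent noise;
# neither one depends on the other.
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
sunburn = [1.5 * t + random.gauss(0, 5) for t in temperature]

raw_r = pearson(ice_cream, sunburn)

# Remove the temperature effect. We can do this exactly only because
# we generated the data and know the coefficients.
ice_cream_resid = [v - 2.0 * t for v, t in zip(ice_cream, temperature)]
sunburn_resid = [v - 1.5 * t for v, t in zip(sunburn, temperature)]
controlled_r = pearson(ice_cream_resid, sunburn_resid)

print(f"correlation before controlling for temperature: {raw_r:.2f}")
print(f"correlation after controlling for temperature:  {controlled_r:.2f}")
```

With real data the coefficients are unknown and would have to be estimated, which is one reason collecting more types of data matters.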

8 Introduction to big data and data science (II)
Big Data: Why all this excitement?
Google looked at weekly search queries from 2003 to 2008 -> identified 45 terms relevant to people searching about flu -> built a model
Google rolled out flu stories in Google News during this period; people reading those stories skewed the results

9 Introduction to big data and data science (III)
Big Data: Why all this excitement?
Bloggers used data science to analyze the elections
The campaigns were using data science (a database that modeled the behavior of the electorate)
Pollsters try to predict the outcome by polling people -> the polls have biases (and errors) -> incorrect results
Challenge: remove biases and errors

10 Introduction to big data and data science (IV)
Cautionary tale
How did they come to this conclusion? Look at Google Trends searches for MySpace and apply the same model to Facebook
Correlation doesn’t imply causation
Identify important factors

11 Introduction to big data and data science (V)
Where Does Big Data Come From?
Online (and can be recorded). Many data are collected and few are analyzed
Users (user-generated content). Individually, each item is not very large

12 Introduction to big data and data science (VI)
Where Does Big Data Come From?
Health and scientific computing
Graphs
Log files (generated by servers around the Internet)
The Internet of Things (e.g., sensors in a forest, toll-collection transponders used for traffic reporting)

13 Performing Data Science and preparing Data (I)
What is Data Science?
“Data Science aims to derive knowledge from big data, efficiently and intelligently”
Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government
Apply algorithms at scale to large amounts of data, and understand both the algorithms and the results
Collect data, analyze them, and understand the analytical process and results
Collect knowledge, apply algorithms, but do not understand

14 Performing Data Science and preparing Data (I)
What is Data Science?
“Data Science aims to derive knowledge from big data, efficiently and intelligently”
Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government
Apply domain-specific knowledge at very large scale, and understand both the algorithms and the results

15 Performing Data Science and preparing Data (II) Contrasting Data Science: Database

16 Performing Data Science and preparing Data (III) Contrasting Data Science: Database Contrasting Data Science: Scientific computing

17 Performing Data Science and preparing Data (IV) Contrasting Data Science: Traditional Machine Learning

18 Performing Data Science and preparing Data (V)
Doing data science
Problem -> collect data -> clean the data -> build a model -> communicate the results
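
The chain on this slide can be sketched as a handful of small functions, one per stage. Everything below is illustrative: the function names, the made-up records, and the choice of a least-squares line as the "model" are assumptions, not the course's method.

```python
def collect():
    """Collect data: made-up (x, y) records, some with missing values."""
    return [(1, 2.1), (2, None), (2, 3.9), (3, 6.2), (None, 5.0), (4, 7.8)]


def clean(records):
    """Clean the data: drop records with a missing field."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def build_model(records):
    """Build a model: fit y = slope * x + intercept by least squares."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def communicate(slope, intercept):
    """Communicate the results: a one-line, human-readable summary."""
    return f"y is roughly {slope:.2f} * x + {intercept:.2f}"


records = clean(collect())
slope, intercept = build_model(records)
report = communicate(slope, intercept)
print(report)
```

Keeping each stage as its own function makes it easy to swap in a different data source or model without touching the rest of the chain.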

19 Performing Data Science and preparing Data (V)
Cloud computing: key enabler of data science
Allows data science on a massive scale
Data science practice

20 Performing Data Science and preparing Data (VI) What is hard about Data Science?

21 Performing Data Science and preparing Data (VII)
Data acquisition and preparation
1. Extract data from sources
2. Load data into the sink
3. Transform data (at the source, at the sink, or in a staging area)
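
The three steps can be sketched with Python's standard library alone. In this illustration the source is a small CSV string, the sink is an in-memory SQLite database, and the transformation runs at the sink; the table names, column names, and sample readings are all invented for the example.

```python
import csv
import io
import sqlite3

# Made-up source data; in practice this would be a file, an API, or a log.
SOURCE_CSV = """city,temp_f
Berkeley,68
Madrid,95
Oslo,50
"""


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def load(rows, conn):
    """Load: copy the raw rows into the sink unchanged."""
    conn.execute("CREATE TABLE raw_readings (city TEXT, temp_f REAL)")
    conn.executemany(
        "INSERT INTO raw_readings VALUES (?, ?)",
        [(row["city"], float(row["temp_f"])) for row in rows],
    )


def transform(conn):
    """Transform at the sink: derive a Celsius table with SQL."""
    conn.execute(
        "CREATE TABLE readings AS "
        "SELECT city, ROUND((temp_f - 32) * 5.0 / 9.0, 1) AS temp_c "
        "FROM raw_readings"
    )


conn = sqlite3.connect(":memory:")
load(extract(SOURCE_CSV), conn)
transform(conn)
result = conn.execute(
    "SELECT city, temp_c FROM readings ORDER BY city"
).fetchall()
print(result)
```

Transforming at the sink keeps the raw rows available for later reprocessing; transforming at the source or in a staging area would instead move less data into the sink.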

22 Performing Data Science and preparing Data (VIII)
Data acquisition and preparation
We create pipelines or workflows, which can be scheduled
Recording the execution of a workflow is known as capturing lineage or provenance (Spark does it automatically)
Impediments to collaboration: diversity of tools and programming languages, finding a script is hard, most analysis work is ‘thrown away’
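
To make the idea of lineage concrete, the sketch below runs a tiny workflow and records, for every step, what went in and what came out. It is a hand-rolled illustration under stated assumptions: run_workflow and the three step functions are hypothetical names, and this is not how Spark tracks lineage internally.

```python
def run_workflow(data, steps):
    """Run steps in order and record the lineage of every intermediate result."""
    lineage = []
    for step in steps:
        result = step(data)
        # One provenance record per step: which function ran,
        # what it received, and what it produced.
        lineage.append(
            {"step": step.__name__, "input": repr(data), "output": repr(result)}
        )
        data = result
    return data, lineage


# Hypothetical workflow steps for illustration.
def drop_missing(values):
    return [v for v in values if v is not None]


def square(values):
    return [v * v for v in values]


def total(values):
    return sum(values)


final, lineage = run_workflow([1, None, 2, 3], [drop_missing, square, total])

print(final)
for record in lineage:
    print(record["step"], ":", record["input"], "->", record["output"])
```

With such a record, a surprising final value can be traced back step by step to the input that caused it, which is what makes a shared analysis reproducible rather than ‘thrown away’.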

23 Performing Data Science and preparing Data (VIII) Data Science roles Individual Organizational

