Welcome to the Intermountain Big Data Conference! Data Science and Machine Learning Tools from Python to R, with Hands-On R/Shiny

2 Data Science and Machine Learning Tools from Python to R, with Hands-On R/Shiny
U Student – Math major with CS minor
Emphasis on Stats & Machine Learning
jameslohse.com – download slides and paper
Contact: jim @ supportml.com, 385 985-DATA
Changing name to Mega Learning LLC, watch for that

3 Big Data Utah / UTGE
Big Data Utah and Utah Geek Events
Nick Baguley / Pat Wright
http://www.bigdatautah.org/
On Meetup.com
November is next event @IHC
Next Big Data Utah event is January 13, look at http://www.meetup.com/BigDataUtah
UTGE: Big Mountain Data Conference and others

4 Data Mining and Machine Learning Primer
Tools and infrastructure for being a Data Scientist can be overwhelming at first
Much more to it than just programming
This is true for all development, lots of tools
So you know Java? How about Maven? Gradle? Eclipse, IntelliJ? Android Studio? Ant, SVN, Git, GitHub, Mercurial, Ivy, etc.?

5 Big Server vs. Cluster
Storing large data sets – local vs. cloud? GPU?
Hadoop / HDFS / HBase for cluster storage
Cluster of Unreliable Commodity Hardware
Hadoop is an Apache open source project
Often associated with MapReduce
They are not the same: MapReduce is a processing model that can work on a Hadoop file system

6 Hadoop
Spreads large data sets across clusters
Clusters can be very cheap hardware
Based on Google white papers on MapReduce and Google File System
HDFS – Hadoop Distributed File System
Framework mostly written in Java
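The idea of spreading a data set across cheap, unreliable machines can be sketched in a few lines of Python. This is a toy illustration only, not HDFS's actual placement policy: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for the example. The replication factor of 3 mirrors the usual HDFS default.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [b for b, holders in placement.items() if set(holders) - failed]


# Hypothetical four-node cluster and a 1000-byte "file".
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_blocks(len(blocks), nodes, replication=3)

# Losing two of the four nodes still leaves every block readable.
survivors = readable_blocks(placement, failed_nodes=["node1", "node2"])
```

Because every block lives on three different nodes, any two of them can fail without data loss, which is what makes commodity hardware acceptable.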

7 MapReduce
Part of Hadoop
Separate from HDFS, layers on top of HDFS
Was originally proprietary Google technology
Splits jobs across a cluster
Facilitates parallel processing for higher speed
Implemented in MongoDB, for example
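The map, shuffle, and reduce steps can be sketched on a single machine with the classic word-count example. This is a minimal illustration of the programming model, not Hadoop's API; the function names and sample documents are made up for the example, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for a block that a separate node would map.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data big cluster", "data science"]
counts = word_count(docs)
```

The speedup on a cluster comes from the fact that each `map_phase` call only needs its own document and each `reduce_phase` call only needs the values for its own key.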

8 Apache Spark
MapReduce replacement from UC Berkeley
In-memory primitives, not disk based
Cluster management – Spark standalone, YARN or Mesos
Distributed storage interfaces with HDFS, HBase, Cassandra, OpenStack Swift, Amazon S3
Pseudo-distributed mode for testing locally
Most active Apache project in 2014

9 Apache Spark Components
Spark Core / Resilient Distributed Datasets
RDD in Java, Python and Scala
Spark SQL – SQL over structured and semi-structured data
Spark Streaming – Kafka, Flume, Twitter, TCP sockets, ZeroMQ, Kinesis
MLlib Machine Learning Library
MLlib reported to be about 10X faster than Apache Mahout
GraphX – Graph processing library

10 R
Like Matlab, more a statistics environment than a pure programming language
Learn more about R on Coursera: www.coursera.org/course/rprog
Part of Johns Hopkins “Data Science” track
Supposedly funny: “A Data Scientist is a statistician who is a better software developer than other statisticians, and a software developer who is a better statistician than other software developers”

11 CRAN / RStudio / Rpy2
CRAN – Comprehensive R Archive Network
RStudio is the IDE for R programming
Free / open source, as a Desktop app or RStudio Server for web access
Rpy2 is a Python interface to R
Also PyPy, Rpy, RPython
Python taking over as the language for ML

12 Web Crawlers in Python & Java
Scrapy (Python) – http://scrapy.org/companies/
Tag Soup (Java) – http://home.ccil.org/~cowan/tagsoup/
Beautiful Soup (Python) – http://www.crummy.com/software/BeautifulSoup
Taggle is Tag Soup in C++
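The core job of a crawler, pulling links out of a fetched page so they can be visited next, can be sketched with only the Python standard library. This is a stand-in for illustration and does not use the Scrapy or Beautiful Soup APIs; the class name, helper name, and sample HTML are assumptions for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real crawler would download this first.
sample_html = (
    '<html><body><a href="/about">About</a> '
    '<a href="http://example.org/x">X</a></body></html>'
)
links = extract_links(sample_html, "http://example.com/index.html")
```

A full crawler would add fetching, a queue of URLs still to visit, and a set of URLs already seen, which is the bookkeeping the libraries above handle.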

13 IPython Notebook / Jupyter
Display / formatting of multiple languages and codesets in one place, for publishing
Numerous ML-based notebooks online
Interesting notebooks: http://bit.ly/1DQ8I5c
Jupyter is now separated from IPython
http://jupyter.org/ – “Language-agnostic” parts of IPython now on Jupyter.org

14 What? NO SQL?
Not Only SQL – there is SQL
Solves problems relational databases can't touch
Amazon, Facebook, Twitter, LinkedIn
“Eventually consistent”, not ACID
Many many choices!
http://db-engines.com/en/ranking
http://nosql-database.org/
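“Eventually consistent” can be illustrated with two toy replicas that accept writes independently and reconcile later. This is a simplified sketch under stated assumptions: the `Replica` class, the integer version numbers, and the last-write-wins rule are chosen for the example and are not how any particular NoSQL product resolves conflicts.

```python
class Replica:
    """One copy of the data; each key stores a (version, value) pair."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Last-write-wins: keep the value with the highest version number.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for source, target in ((a, b), (b, a)):
        for key, (version, value) in list(source.data.items()):
            target.write(key, value, version)


east = Replica("east")
west = Replica("west")

east.write("user:1", "Alice", version=1)
stale_read = west.read("user:1")      # west has not heard about the write yet

west.write("user:1", "Alicia", version=2)
sync(east, west)                      # background reconciliation
fresh_read = east.read("user:1")      # both replicas now agree
```

Between the write and the sync, a reader can see old or missing data. That window is the price paid for not requiring every replica to agree before a write returns.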

15 Key – Value store
Stores keys and values – that's it!
Not up to more complex tasks
Great for simple needs, very fast!
Redis, Memcached, Amazon DynamoDB
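A key-value store is close to a dictionary with a network in front of it. The sketch below is an in-memory toy, not the Redis or Memcached API; the class and method names are assumptions for the example. It also shows why the model is "not up to more complex tasks": finding records by value means scanning everything.

```python
class KeyValueStore:
    """Minimal in-memory key-value store: put, get, delete by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup by key is a single hash-table access, which is why it is fast.
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._data:
            del self._data[key]
            return True
        return False

    def keys_with_value(self, value):
        # No index on values: answering this requires a full scan.
        return sorted(k for k, v in self._data.items() if v == value)


store = KeyValueStore()
store.put("session:42", "jim")
store.put("session:43", "pat")
store.put("session:44", "jim")

owner = store.get("session:42")
missing = store.get("session:99", default="<none>")
jims_sessions = store.keys_with_value("jim")
removed = store.delete("session:43")
```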

16 Graph and other types
Graph DB, just that, stores data as a graph with nodes and edges, nodes not all indexed
Neo4j, FlockDB, OrientDB, IBM DB2, Stardog
Many other models for databases, each has its own benefits of speed vs. reliability/consistency
According to http://en.wikipedia.org/wiki/NoSQL: Object, Tabular, Tuple Store, Triple/quad store, Hosted, Multi-value, Correlation, Cell
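Storing data as nodes and edges makes relationship queries, such as "who is within two hops of this person", a matter of walking edges rather than joining tables. The sketch below is a plain-Python illustration, not the Neo4j or OrientDB API; the `Graph` class, its methods, and the sample people are assumptions for the example.

```python
from collections import deque


class Graph:
    """Tiny undirected property graph kept in memory."""

    def __init__(self):
        self.nodes = {}   # node id -> dict of properties
        self.edges = {}   # node id -> set of neighbouring node ids

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, set())

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def within(self, start, max_hops):
        """Breadth-first search: map each reachable node to its hop distance."""
        distance = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if distance[current] == max_hops:
                continue  # do not expand past the hop limit
            for neighbour in self.edges[current]:
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
        del distance[start]
        return distance


g = Graph()
for person in ("alice", "bob", "carol", "dave"):
    g.add_node(person, kind="person")
g.add_edge("alice", "bob")
g.add_edge("bob", "carol")
g.add_edge("carol", "dave")

nearby = g.within("alice", max_hops=2)
```

Here `dave` is three hops from `alice`, so he is left out of `nearby`.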

17 MongoDB, Cassandra, HBase
Article claims analysis of LinkedIn shows these are becoming the top three NoSQL databases to know:
http://bit.ly/1xUrV5G
http://www.infoworld.com/article/2848722/nosql/mongodb-cassandra-hbase-three-nosql-databases-to-watch.html

18 Kaggle.com / competitions
Where the money is, Big Data competition
If you are at the top of Kaggle you are going to make a lot of money (and change the world?)
Good community and starter projects
Facial Keypoints Detection in R
Big Data Utah also runs competitions
http://www.bigdatautah.org/competitions/

32 Deploying Shiny Apps
https://jimlohse.shinyapps.io/PaulsR-eality
ShinyApps.io has free limited ac
RStudio Shiny Server: https://www.rstudio.com/products/shiny/shiny-server/
Not to be confused with RStudio Server / Pro: https://www.rstudio.com/products/rstudio/download-server/

33 Thanks for attending!
Q&A if there's time...
http://supportml.com/data-science-machine-learning-tools-r-shiny-python/
http://goo.gl/eBRoir

