CSCE 4013/5013 Big Data Analytics and Management Fall 2017.

Overview
Class hours: 11:00am – 12:15pm, Tuesday & Thursday, JBHT 239
Office hours: 3:00 – 5:00pm, Tuesday, JBHT 516
Instructor: Dr. Xintao Wu
Office: JBHT 516
Webpage
Textbook: No textbook is required. Reading materials are posted on the course website.

Topic Description
Traditional DBMS/DW revisited
NoSQL
NewSQL
Hadoop, AWS
Classic (and some advanced) data mining
Big data analytics and machine learning (Spark)

Course Prerequisite
CSCE 3193 Programming Paradigms and either INEG 2313 or STAT 3103
Familiarity with programming in Java or C++ is assumed
Scripting languages (e.g., Python, Scala) are preferred
Basic concepts of probability and statistics
Knowledge of data mining or machine learning is a plus

Grading Composition
Homework & quiz: 10%
Group projects: 30%
Midterm: 20%
Final: %

Project Reports
Late policy: late reports are not accepted
Hard copy is preferred
Electronic submission (Word or PDF) accepted

Project
Develop/implement/apply big data management and analytics systems on real large data sets
Each group consists of 3-4 students and works on two projects
Project I: Big Data Management
Project II: Big Data Analytics
Individual Research Project (optional)
More information

Midterm & Final
Open books/notes/internet
No discussion
No help from any entity, e.g., by posting/uploading your questions on the Web
Cumulative
No makeup
Class attendance is not required
Bonus is expected

Textbook & Reading Materials
Textbook: None is required
Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, pdf download
Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, ISBN:
Recommended reading materials

Big Data Era
Google: every 2 days we create as much data as we did up to 2003.
Facebook: 500+ TB of new data every day, including 2.5 billion items shared, 2.7 billion Likes, and 300 million photos; 100+ PB Hadoop cluster
Twitter: 500 million tweets per day
Many applications for streaming data, e.g., sensors

Drivers of Data Computing
Reliability, Security, Privacy, Usability
6A's: Anytime, Anywhere Access to Anything by Anyone Authorized
4V's: Volume, Velocity, Variety, Veracity

Big Data Computing
Drowning in data
2.5 exabytes every day
Web data, healthcare, e-commerce, social networks
Volume, Velocity, Variety, and Veracity: data quantity, speed, types, and messiness
Volume: terabytes to exabytes of existing data to process
Velocity: streaming data, milliseconds to seconds to respond
Variety: structured, unstructured, text, multimedia
Veracity: data in doubt; uncertainty due to inconsistency and incompleteness, ambiguities, latency, deception
Advancing technology
Cheap storage/processing power
Growth in huge data centers
Data is in the “cloud”: Amazon AWS, Hadoop, Azure
Computing is in the “cloud”

AVC Denial Log Analysis
Volume and Velocity: 1 million log files per day, each with thousands of entries
S3, Hive and EMR

NoSQL http://nosql-database.org/
Non-relational
Scalability
Collection of structures
No pre-defined schema
No join operations
CAP, not ACID
CAP: Consistency, Availability, and Partition tolerance (but not all three at once!)
ACID: Atomicity, Consistency, Isolation, and Durability
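The "no pre-defined schema" and "no join operations" points can be illustrated with a minimal sketch. It assumes a document-style store modeled as a plain Python list of dictionaries; the `users` collection, its fields, and the `find` helper are hypothetical and do not correspond to any particular NoSQL product's API.

```python
# Hypothetical in-memory "collection": documents are dicts with no fixed schema.
users = [
    {"_id": 1, "name": "Ann", "emails": ["ann@example.com"]},
    # A second document may carry entirely different fields.
    {"_id": 2, "name": "Bob", "orders": [{"item": "disk", "qty": 2}]},
]


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Related data (orders) is embedded in the document itself,
# so reading it requires no join against a second table.
bob = find(users, name="Bob")[0]
total_qty = sum(order["qty"] for order in bob.get("orders", []))
```

Embedding the orders inside the user document is the usual trade-off: reads need no join, but the same data may end up duplicated across documents.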

Advantages of NoSQL
Cheap, easy to implement
Data are replicated and can be partitioned
Easy to distribute
Don't require a schema
Scale up and down
Can handle web-scale data
Quickly process large amounts of data
Relax the data consistency requirement (CAP)
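One way to picture "replicated and can be partitioned" is hash partitioning with a fixed replication factor. The sketch below is an illustration under assumptions, not how any specific system does it: the node names, the replication factor of 2, and the `owners`/`put`/`get` helpers are all made up for the example.

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical cluster
REPLICAS = 2  # assumed replication factor


def owners(key, nodes=NODES, replicas=REPLICAS):
    """Pick `replicas` consecutive nodes, starting at the key's hash slot."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


# One dictionary per node stands in for that node's local storage.
store = {node: {} for node in NODES}


def put(key, value):
    """Write the value to every replica that owns the key."""
    for node in owners(key):
        store[node][key] = value


def get(key):
    """Read from the first replica that holds the key."""
    for node in owners(key):
        if key in store[node]:
            return store[node][key]
    return None
```

A real system also has to decide what happens when replicas disagree or a node is unreachable, which is where the relaxed consistency mentioned above comes in.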

Disadvantages of NoSQL
Data is generally duplicated, with potential for inconsistency
Lack of standards: no standardized schema, no standard format for queries, no standard language
Difficult to impose complicated structures
Depend on the application layer to enforce data integrity
No guarantee of support

NewSQL
It is more of a movement than a specific product
The “New” refers to the vendors and not the SQL
Seeks to provide the same scalable performance as NoSQL for OLTP read-write workloads while maintaining ACID
Transactions are short-lived, access a small set of data, and are repetitive
H-Store, VoltDB, Amazon RDS, Microsoft SQL Azure, Google Spanner, SAP HANA
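The kind of short-lived ACID transaction described here can be sketched with Python's built-in `sqlite3` module. SQLite is not a NewSQL system and says nothing about scalability; it is used only because it ships with Python and shows atomic commit and rollback. The `account` table and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between two rows atomically; return True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
```

Each call touches two rows and finishes quickly, which matches the OLTP profile on the slide: short-lived, small data footprint, repetitive.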

The World of Big Data Tools
Models: DAG Model, MapReduce Model, Graph Model, BSP/Collective Model
Hadoop, MPI
For iterations/learning: HaLoop, Giraph, Twister, Hama, GraphLab, Spark, GraphX, Harp, Stratosphere, Dryad/DryadLINQ, Reef
For query: Pig/PigLatin, Hive, Tez, Drill, Shark, MRQL
For streaming: S4, Storm, Samza, Spark Streaming
From Bingjing Zhang
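Of the models named on this slide, MapReduce is the easiest to sketch in a few lines. The example below is a single-process illustration of the map, shuffle, and reduce phases using word count; it does not use Hadoop or any of the tools listed, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in order over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a distributed framework the map and reduce calls run on different machines and the shuffle moves data across the network; here all three phases run in one process purely to show the data flow.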

Cross Cutting Capabilities
Distributed Coordination: ZooKeeper, JGroups
Message Protocols: Thrift, Protobuf (NA)
Security & Privacy
Monitoring: Ambari, Ganglia, Nagios, Inca (NA)
Orchestration & Workflow: Oozie, ODE, Airavata and OODT (Tools); NA: Pegasus, Kepler, Swift, Taverna, Trident, ActiveBPEL, BioKepler, Galaxy
Data Analytics Libraries
Machine Learning: Mahout, MLlib, MLbase, CompLearn (NA)
Linear Algebra: ScaLAPACK, PETSc (NA)
Statistics, Bioinformatics: R, Bioconductor (NA)
Imagery: ImageJ (NA)
High Level (Integrated) Systems for Data Processing
MRQL (SQL on Hadoop, Hama, Spark)
Hive (SQL on Hadoop)
Pig (Procedural Language)
Shark (SQL on Spark, NA)
HCatalog Interfaces
Impala (NA) Cloudera (SQL on HBase)
Sawzall (Log Files, Google, NA)
Parallel Horizontally Scalable Data Processing
Giraph (~Pregel), Tez (DAG), Spark (Iterative MR), Storm, S4 (Yahoo), Samza (LinkedIn), Hama (BSP), Hadoop (Map Reduce), Pegasus on Hadoop (NA); NA: Twister, Stratosphere
Categories: Iterative MR, Graph, Batch, Stream
Pub/Sub Messaging: Netty (NA)/ZeroMQ (NA)/ActiveMQ/Qpid/Kafka
ABDS Inter-process Communication: Hadoop, Spark Communications & Reductions; Harp Collectives (NA)
HPC Inter-process Communication: MPI (NA)
From Geoffrey Fox

Extraction Tools: UIMA (Entities) (Watson), Tika (Content)
SQL: MySQL (NA), SciDB (NA) (Arrays, R, Python), Phoenix (SQL on HBase)
NoSQL: Column: Cassandra (DHT), HBase (Data on HDFS), Accumulo (Data on HDFS), Solandra (Solr + Cassandra) + Document, Azure Table
NoSQL: Document: MongoDB (NA), CouchDB, Lucene, Solr
NoSQL: Key Value (all NA): Riak ~Dynamo, Dynamo (Amazon), Voldemort ~Dynamo, Berkeley DB
NoSQL: General Graph: Neo4J (Java Gnu, NA), YarcData (Commercial, NA)
NoSQL: TripleStore (RDF, SPARQL): RYA (RDF on Accumulo), AllegroGraph (Commercial), Sesame (NA), Jena
File Management: iRODS (NA)
ORM (Object Relational Mapping): Hibernate (NA), OpenJPA, and JDBC Standard
In-memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA), Redis (NA) (key value), Hazelcast (NA), Ehcache (NA)
ABDS Cluster Resource Management: Mesos, Yarn, Helix, Llama (Cloudera)
HPC Cluster Resource Management: Condor, Moab, Slurm, Torque (NA) ...
ABDS File Systems: HDFS, Swift, Ceph (Object Stores)
User Level: FUSE (NA) (POSIX Interface)
HPC File Systems (NA): Gluster, Lustre, GPFS, GFFS (Distributed, Parallel, Federated)
Interoperability Layer: Whirr / JClouds, OCCI, CDMI (NA)
DevOps/Cloud Deployment: Puppet/Chef/Boto/CloudMesh (NA)
IaaS System Manager: Open Source: OpenStack, OpenNebula, Eucalyptus, CloudStack; Commercial Clouds: vCloud, Amazon, Azure, Google; Bare Metal
Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP)
From Geoffrey Fox
