
1 Berkeley Data Analysis Stack: Shark, Bagel

2 Previous Presentation Summary: Mesos, Spark, Spark Streaming

The BDAS stack, layer by layer:
- Infrastructure / resource management: share infrastructure across frameworks (multi-programming for datacenters)
- Storage / data management: efficient data sharing across frameworks
- Data processing: in-memory processing; trade between time, quality, and cost
- Application: new apps such as AMP-Genomics, Carat, …

3 Previous Presentation Summary: Mesos, Spark, Spark Streaming

4 Spark Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")          // base RDD
    errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()                 // cached RDD

    cachedMsgs.filter(_.contains("foo")).count    // parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    ...

[Diagram: the driver ships tasks to workers; each worker reads one HDFS block (Block 1-3), builds its cache partition (Cache 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
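For reference, a self-contained version of the same example might look like the sketch below. It assumes a modern Spark build; the object name and log path are illustrative, not from the slides:

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        // Local master for illustration; the deck's setting would use a
        // cluster manager such as Mesos instead of local[*].
        val conf = new SparkConf().setAppName("LogMining").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("hdfs://namenode/logs/app.log") // hypothetical path
        val errors = lines.filter(_.startsWith("ERROR"))
        val messages = errors.map(_.split('\t')(2)) // third tab-separated field
        val cachedMsgs = messages.cache() // kept in memory across queries

        // Each count() reuses the cached RDD instead of rereading HDFS.
        println(cachedMsgs.filter(_.contains("foo")).count())
        println(cachedMsgs.filter(_.contains("bar")).count())

        sc.stop()
      }
    }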

5 Logistic Regression Performance

Performance chart: 127 s / iteration (baseline); with caching, the first iteration takes 174 s and further iterations take 6 s.

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      ...
    }
    println("Final w: " + w)
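The loop body is elided on the slide. A sketch of the usual gradient-descent step, in the style of the classic Spark example; Point (with fields x and y), the D-dimensional Vector type with dot/*/- operators, and ITERATIONS are assumed to exist as on the slide:

    // Hypothetical completion of the elided loop body. Only the first
    // iteration pays to read and parse the input; later iterations run
    // against the cached RDD, which is why they drop to ~6 s.
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map { p =>
        p.x * ((1 / (1 + math.exp(-p.y * (w dot p.x))) - 1) * p.y)
      }.reduce(_ + _)
      w -= gradient
    }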

6 Hive: Components

[Architecture diagram:]
- Hive CLI: DDL, queries, browsing
- HiveQL parser, planner, and execution engine
- MapReduce over HDFS
- MetaStore, exposed through a Thrift API
- SerDe: Thrift, Jute, JSON, ...
- Management web UI

7 Data Model

Hive entity      | Sample metastore entity | Sample HDFS location
-----------------|-------------------------|----------------------------------------
Table            | T                       | /wh/T
Partition        | date=d1                 | /wh/T/date=d1
Bucketing column | userid                  | /wh/T/date=d1/part-0000 ... /wh/T/date=d1/part-1000 (hashed on userid)
External table   | extT                    | /wh2/existing/dir (arbitrary location)
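To see how the bucketed layout in the last row comes about: a hash of the bucketing column picks the part file. The bucket count and hash below are illustrative; Hive's real assignment uses its own hash function:

    // Illustrative only: hash-bucketing rows on userid into part files.
    val numBuckets = 1001 // matches part-0000 ... part-1000 above

    def bucketFile(userid: Long): String = {
      val bucket = (userid.hashCode & Int.MaxValue) % numBuckets
      f"/wh/T/date=d1/part-$bucket%04d"
    }

    // Rows with the same userid always land in the same file.
    println(bucketFile(42L)) // /wh/T/date=d1/part-0042 for this hash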

8 Hive/Shark flowchart (insert into table)

Two ways to do this:
1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
2. Load "buckets" directly: the user is responsible for creating the buckets.

Either way, first create the target table; this creates the table directory:

    CREATE TABLE page_view(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

9 Hive/Shark flowchart (insert into table)

Way 1: load from an "external table".

Step 1: declare the staging data as an external table over its existing HDFS location:

    CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User',
        country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
    STORED AS TEXTFILE
    LOCATION '/user/data/staging/page_view';

Step 2: copy the raw file into the staging location:

    hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

10 Hive/Shark flowchart (insert into table)

Way 1, continued.

Step 3: query the staging table and write the result into the target partition:

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
           null, null, pvs.ip
    WHERE pvs.country = 'US';

11 Hive (SerDe data flow)

[Data-flow diagram: file on HDFS -> FileFormat / Hadoop serialization -> Writable stream -> SerDe -> hierarchical object -> Hive operators in the mapper -> map output file -> Writable -> hierarchical object -> Hive operators in the reducer -> SerDe -> file on HDFS; user scripts likewise consume and produce hierarchical objects. Sample rows: '1.0 3 54', '0.2 1 33', '2.2 8 212', '0.7 2 22'.]

SerDes are user-defined and applied per row. The hierarchical objects they expose come in several flavors:
- Writable, e.g. BytesWritable(\x3F\x64\x72\x00) or Text('1.0 3 54') (UTF-8 encoded)
- Java object: an object of a Java class (e.g. a thrift_record)
- Standard object: ArrayList for struct and array, HashMap for map
- LazyObject: lazily deserialized
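To make the LazyObject flavor concrete, here is a toy sketch of lazy per-row deserialization (plain Scala, not Hive's actual Java SerDe interface; the class and delimiter are made up for illustration):

    // Keep the raw row text and parse fields only on first access.
    // Hive's real LazySimpleSerDe is far more general than this.
    class LazyRow(raw: String, delimiter: Char = ' ') {
      private lazy val fields: Array[String] = raw.split(delimiter)

      // Nothing is parsed until a caller asks for a field.
      def getString(i: Int): String = fields(i)
      def getDouble(i: Int): Double = fields(i).toDouble
      def getInt(i: Int): Int = fields(i).toInt
    }

    val row = new LazyRow("1.0 3 54") // sample row from the slide
    println(row.getDouble(0)) // 1.0 -- the split happens here, once
    println(row.getInt(1))    // 3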

12 SerDe, ObjectInspector and TypeInfo

[Diagram: SerDe.deserialize turns a Writable, e.g. BytesWritable(\x3F\x64\x72\x00) or Text('a=av:b=bv 23 1:2=4:5 abcd'), into a hierarchical object, and SerDe.getOI returns the matching ObjectInspector. Each ObjectInspector reports its TypeInfo via getType, and navigation is recursive: a struct inspector exposes getStructField / getFieldOI, a map inspector exposes getMapValue / getMapValueOI, and so on down to primitive objects such as a String.]

Example schema: a struct of (map, int, list of struct, string); in Java terms:

    class HO { HashMap a; Integer b; List c; String d; }
    class ClassC { Integer a; Integer b; }

Sample value:

    List( HashMap("a" -> "av", "b" -> "bv"),
          23,
          List(List(1, null), List(2, 4), List(5, null)),
          "abcd" )

Navigation trace from the diagram: the int field's inspector reports type int; getStructField for field a returns HashMap("a" -> "av", "b" -> "bv"); getMapValue("a") on it returns "av".
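A toy rendering of this inspector pattern in Scala (the trait and methods below are simplified stand-ins, not Hive's actual Java interfaces):

    // Type information lives in the inspector, not in the (possibly
    // lazily parsed) object itself -- the core ObjectInspector idea.
    sealed trait Inspector { def getType: String }

    case class PrimitiveOI(getType: String) extends Inspector

    case class StructOI(fields: Map[String, Inspector]) extends Inspector {
      def getType = "struct"
      def getFieldOI(name: String): Inspector = fields(name)
      def getStructField(obj: Map[String, Any], name: String): Any = obj(name)
    }

    case class MapOI(valueOI: Inspector) extends Inspector {
      def getType = "map"
      def getMapValueOI: Inspector = valueOI
      def getMapValue(obj: Map[String, Any], key: String): Any = obj(key)
    }

    // Mirrors the slide's example: field "a" is a map, field "b" an int.
    val rowOI = StructOI(Map(
      "a" -> MapOI(PrimitiveOI("string")),
      "b" -> PrimitiveOI("int")))
    val row: Map[String, Any] =
      Map("a" -> Map("a" -> "av", "b" -> "bv"), "b" -> 23)

    println(rowOI.getFieldOI("b").getType) // int, as in the slide's trace
    val mapOI = rowOI.getFieldOI("a").asInstanceOf[MapOI]
    val m = rowOI.getStructField(row, "a").asInstanceOf[Map[String, Any]]
    println(mapOI.getMapValue(m, "a")) // av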

