Code Gen of Expr Eval in Shark

hao.cheng@intel.com

Outline
o CG Examples
o Performance Comparison (CG Expr Eval vs. Hive Expr Eval)
o CG Design & Major Class Diagram
o Implemented UDFs/Generic UDFs
o Future Works

CG Examples
Set shark.expr.cg to true or false in hive-site.xml to enable or disable the feature; the default is true.

Performance Comparison (CG Expr Eval vs. Hive Expr Eval)
Data set: 747,747,840 records / 66,909,023,675 bytes / RCFile (with LzoCodec) on 4 slave machines

Performance Comparison (CG Expr Eval vs. Hive Expr Eval) (2)
Why is CG Expr Eval faster than Hive Expr Eval?
In Hive Expr Eval:
A. Common sub-expressions are re-evaluated. E.g. in the expression concat(year(date_add(visitDate,7)), '/', month(date_add(visitDate,7)), '/', day(date_add(visitDate,7))), “date_add(visitDate,7)” is evaluated 3 times.
B. Data types are checked repeatedly at runtime. The parameter types of the “evaluate” method in GenericUDFs are not known until runtime, so Hive Expr Eval has to keep checking the value types inside “evaluate”, e.g. GenericUDFOPGreaterThan.evaluate, GenericUDFPrintf.evaluate, etc.
C. Unnecessary type conversion. E.g. in the expression (duration + 1.03), the variable “duration” is first converted into a new FloatWritable object in Hive Expr Eval, which creates lots of small temporary objects (GenericUDFBridge.conversionHelper).
D. A large number of virtual function calls at runtime. Hive Expr Eval always works through base class objects, particularly the UDF objects and the field value objects.
E. Java reflection is used to call the UDF evaluate() method. Hive Expr Eval invokes UDFs (in class GenericUDFBridge) through the Java Reflection API, which causes another performance issue (http://docs.oracle.com/javase/tutorial/reflect/index.html).
CG Expr Eval generates source code with concrete objects and execution branches.
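To make point A concrete, here is a minimal sketch of the difference between re-evaluating a shared sub-expression and hoisting it into a local variable, as generated code could do. It is not taken from Shark or Hive: the class CseSketch, its methods, and the use of java.time.LocalDate in place of Hive's date handling are all assumptions made for illustration.

```java
import java.time.LocalDate;

// Hypothetical illustration only; none of these names come from Shark or Hive.
class CseSketch {
    // Counts how often the shared sub-expression date_add(visitDate, 7) runs.
    static int dateAddCalls = 0;

    static LocalDate dateAdd(LocalDate d, int days) {
        dateAddCalls++;
        return d.plusDays(days);
    }

    // Tree-walking style: every branch re-evaluates date_add(visitDate, 7).
    static String evalNaive(LocalDate visitDate) {
        return dateAdd(visitDate, 7).getYear() + "/"
             + dateAdd(visitDate, 7).getMonthValue() + "/"
             + dateAdd(visitDate, 7).getDayOfMonth();
    }

    // Generated-code style: the shared sub-expression is computed once.
    static String evalGenerated(LocalDate visitDate) {
        LocalDate t0 = dateAdd(visitDate, 7);
        return t0.getYear() + "/" + t0.getMonthValue() + "/" + t0.getDayOfMonth();
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2013, 12, 28);

        dateAddCalls = 0;
        String naive = evalNaive(d);
        int naiveCalls = dateAddCalls;

        dateAddCalls = 0;
        String generated = evalGenerated(d);
        int generatedCalls = dateAddCalls;

        System.out.println(naive + " (date_add calls: " + naiveCalls + ")");
        System.out.println(generated + " (date_add calls: " + generatedCalls + ")");
    }
}
```

Both methods return the same string; only the number of date_add evaluations differs (three versus one per row).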

CG Design & Major Class Diagram

CG Design & Major Class Diagram (2)
Why not generate the bytecode directly?
A. The generated content is quite complicated; source code is much easier to debug and troubleshoot.
B. The Java compiler can apply further optimizations when it compiles the source code.
Why generate the evaluating source code from the ExprNodeDesc tree rather than from the Hive ExprNodeEvaluator tree?
A. The ExprNodeEvaluator tree loses some information that may be helpful for further optimization (e.g. evaluating common sub-expressions).
B. Extracting information from the ExprNodeEvaluator tree is difficult, as most of the variables in ExprNodeEvaluator are protected / private.
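The idea of walking a descriptor tree and emitting Java source, with repeated sub-trees mapped to a single local variable, can be sketched as follows. This is an assumption-laden toy, not Shark's generator: Desc, ColumnDesc, ConstDesc, CallDesc and SourceGenSketch are hypothetical stand-ins (they are not Hive's ExprNodeDesc classes), the output is only a method-body fragment typed as Object, and the step of compiling that source is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical descriptor tree; NOT Hive's ExprNodeDesc hierarchy.
abstract class Desc {
    // Canonical text of the sub-tree, used to detect repeated sub-expressions.
    abstract String key();
}

class ColumnDesc extends Desc {
    final String name;
    ColumnDesc(String name) { this.name = name; }
    String key() { return name; }
}

class ConstDesc extends Desc {
    final String literal;
    ConstDesc(String literal) { this.literal = literal; }
    String key() { return literal; }
}

class CallDesc extends Desc {
    final String func;
    final List<Desc> args;
    CallDesc(String func, Desc... args) {
        this.func = func;
        this.args = Arrays.asList(args);
    }
    String key() {
        List<String> parts = new ArrayList<>();
        for (Desc a : args) parts.add(a.key());
        return func + "(" + String.join(", ", parts) + ")";
    }
}

class SourceGenSketch {
    // Maps a sub-tree's canonical text to the local variable that holds it.
    private final Map<String, String> temps = new LinkedHashMap<>();
    private final StringBuilder body = new StringBuilder();

    // Returns a Java expression for n, emitting one local per distinct call.
    private String emit(Desc n) {
        if (!(n instanceof CallDesc)) return n.key();
        String existing = temps.get(n.key());
        if (existing != null) return existing; // common sub-expression: reuse it
        CallDesc c = (CallDesc) n;
        List<String> argExprs = new ArrayList<>();
        for (Desc a : c.args) argExprs.add(emit(a));
        String var = "t" + temps.size();
        temps.put(n.key(), var);
        body.append("Object ").append(var).append(" = ").append(c.func)
            .append("(").append(String.join(", ", argExprs)).append(");\n");
        return var;
    }

    // Produces the body of an evaluation method as Java source text.
    static String generate(Desc root) {
        SourceGenSketch g = new SourceGenSketch();
        String result = g.emit(root);
        return g.body + "return " + result + ";\n";
    }

    public static void main(String[] args) {
        Desc root = new CallDesc("concat",
            new CallDesc("year",
                new CallDesc("date_add", new ColumnDesc("visitDate"), new ConstDesc("7"))),
            new CallDesc("month",
                new CallDesc("date_add", new ColumnDesc("visitDate"), new ConstDesc("7"))));
        System.out.print(generate(root));
    }
}
```

In this sketch the descriptor tree still carries the full expression text, so detecting that both date_add calls are identical is a simple map lookup; a real generator would also have to track types and null handling.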

Implemented UDFs/Generic UDFs
Supported Features:
o Relational Operators (=, !=, <, <= etc.)
o Arithmetic Operators (+, -, *, /, % etc.)
o Logical Operators (AND, OR, NOT etc.)
o Built-in Functions (UDF) and existing User-Defined Functions
o Some of the Generic UDFs:
  GenericUDFBetween
  GenericUDFPrintf
  GenericUDFInstr
  GenericUDFBridge
Unsupported Features:
o Conditional Functions (if/case/when etc.)
o Map/Array
o UDAF
o UDTF
o Misc. Functions (java_method/reflect/hash etc.)

Future Works
o Generated Java source: compile once and distribute among the cluster
o Reuse the Generated.class for the same queries
o Support more Generic UDFs (case/when/if etc.)
o Support collection types (Array/Map etc.)
o Code gen in aggregations
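One possible shape for reusing a compiled evaluator across identical queries is a cache keyed by the expression text. The sketch below is an assumption, not a description of how Shark does or will do this: EvaluatorCacheSketch is a hypothetical name, and the "compile" step is a caller-supplied function standing in for source generation plus compilation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache; the compile function is a stand-in supplied by the caller.
class EvaluatorCacheSketch<E> {
    private final Map<String, E> cache = new ConcurrentHashMap<>();
    private final Function<String, E> compiler;

    EvaluatorCacheSketch(Function<String, E> compiler) {
        this.compiler = compiler;
    }

    // Runs the compile function at most once per distinct expression text.
    E get(String exprText) {
        return cache.computeIfAbsent(exprText, compiler);
    }

    public static void main(String[] args) {
        EvaluatorCacheSketch<String> cache =
            new EvaluatorCacheSketch<>(text -> "compiled:" + text);
        System.out.println(cache.get("duration + 1.03"));
        System.out.println(cache.get("duration + 1.03")); // served from the cache
    }
}
```

Keying on the raw expression text is the simplest choice; a real implementation would likely need to include input column types in the key as well.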
