Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST

Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST rainoftime@gmail.com

Outline
Overview
– Potential applications
– Intermediate Representation
– Probabilistic model
Statistical Code Completion
Learning to Recognize Functions in Binary Code
Bias-Variance Tradeoffs in Program Analysis
Conclusion

Marriage of ML and PL
SLANG: code completion [PLDI 14]
JSNice: type prediction [POPL 15]
PL translation [Onward 14]
More applications:
– Program bug detection
– Program invariant inference
– PL design: probabilistic PL
– Binary analysis
– ...

Intermediate Representation
Sequences
Trees
Graphical models
Feature vectors
Other IRs in PL research: AST, CFG, CDG, DDG, PDG, SSA, CPS, ...
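
To make the difference between these representations concrete, here is a minimal sketch that derives two of them from the same snippet: a sequence of call names and a bag of AST node-type counts (a crude feature vector). It uses Python's standard `ast` module purely for illustration; the sample program and the helper names `call_sequence` and `node_type_counts` are assumptions, not something from SLANG or JSNice, which target Java and JavaScript.

```python
import ast

# Hypothetical sample program used only for illustration.
SOURCE = """
f = open("log.txt")
data = f.read()
f.close()
"""

class CallCollector(ast.NodeVisitor):
    """Collects function/method names in the order the calls are visited."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute):   # method call such as f.read()
            self.calls.append(func.attr)
        elif isinstance(func, ast.Name):      # plain call such as open(...)
            self.calls.append(func.id)
        self.generic_visit(node)              # keep walking nested calls

def call_sequence(source):
    """Sequence IR: the program seen as a 'sentence' of call names."""
    collector = CallCollector()
    collector.visit(ast.parse(source))
    return collector.calls

def node_type_counts(source):
    """Feature-vector IR: how often each AST node type occurs."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        name = type(node).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts

print(call_sequence(SOURCE))
print(node_type_counts(SOURCE)["Call"])
```

The sequence keeps ordering but drops structure, while the counts keep neither; which loss is acceptable depends on the application.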

Extract program representation with Program Analysis
SLANG: alias and typestate analysis
JSNice: scope and alias analysis, type analysis, ...
Other applications:
– Use type inference to obtain training labels
– Use a SAT solver to check path conditions
– ...

What's the suitable probabilistic model?
N-gram language model [PLDI 14]
Probabilistic context-free grammars [ICSE 12]
Neural networks
Support vector machines
Conditional random fields [POPL 15]
...
As with the IR, the choice depends on the application.

ML for PL [Picture from Martin Vechev's slides]

Outline
Overview
– Potential applications
– Intermediate Representation
– Probabilistic model
Statistical Code Completion
Learning to Recognize Functions in Binary Code
Bias-Variance Tradeoffs in Program Analysis
Conclusion

Statistical Code Completion [SLANG, V. Raychev et al., PLDI 14]
Key insight: regularities in code are similar to regularities in natural language

Techniques in SLANG
IR: sequences (sentences)
Program analysis: typestate analysis, alias analysis
Trained models: neural network, n-gram language model
Some smoothing techniques

N-gram language model
The probability of each word is conditioned only on the previous n-1 words
Training is done by counting n-grams
The time spent on each word encountered in training is constant, so training is usually fast
Other models used: recurrent neural network (RNN)
– An RNN can learn dependencies beyond the previous few words, but is usually slower
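
The counting-based training described above can be sketched in a few lines. This is a minimal illustration, not SLANG's implementation. The "sentences" of API calls are made up, the helper names `train_ngram`, `probability`, and `complete` are assumptions, and add-alpha smoothing stands in for whatever smoothing a real system would use.

```python
from collections import Counter, defaultdict

def train_ngram(sentences, n=3):
    """Count, for every (n-1)-word context, how often each next word follows."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])   # the previous n-1 words
            counts[context][padded[i]] += 1        # constant work per word
    return counts

def probability(counts, context, word, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | context)."""
    seen = counts.get(tuple(context), Counter())
    return (seen[word] + alpha) / (sum(seen.values()) + alpha * vocab_size)

def complete(counts, context, k=3):
    """Return the k most frequent continuations of the given context."""
    seen = counts.get(tuple(context), Counter())
    return [word for word, _ in seen.most_common(k)]

# Hypothetical training data: each sentence is one sequence of API calls.
sentences = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
model = train_ngram(sentences, n=3)
print(complete(model, ("<s>", "open"), k=1))
print(probability(model, ("<s>", "open"), "read", vocab_size=5))
```

Completion is then a lookup of the most frequent continuation for the last n-1 tokens before the hole, and smoothing keeps unseen continuations from getting probability zero.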

Outline
Overview
– Potential applications
– Intermediate Representation
– Probabilistic model
Statistical Code Completion
Learning to Recognize Functions in Binary Code
Bias-Variance Tradeoffs in Program Analysis
Conclusion

Learning to Recognize Functions in Binary Code [Tiffany Bao et al., USENIX Security 14]
Function information (such as symbols) may be stripped from a binary, and optimization (e.g., gcc -O3) makes function boundaries harder to recognize.
Can we automatically and accurately recover function information from binaries?

Example: GCC
    #include <stdio.h>
    /* Recursive factorial; assumes x >= 1. */
    int fac(int x) {
        if (x == 1)
            return 1;
        else
            return x * fac(x - 1);
    }
    int main(int argc, char **argv) {
        printf("%d", fac(10));
        return 0;
    }

Example: GCC default -O0
    08048443:
        push %ebp
        mov  %esp,%ebp
        and  $0xfffffff0,%esp
        sub  $0x10,%esp
        …
    0804841c:
        push %ebp
        mov  %esp,%ebp
        sub  $0x18,%esp
        cmpl $0x1,0x8(%ebp)
        jne  804842f
        mov  $0x1,%eax
        …

-O1 -O2
    0804841c:
        push %ebx
        sub  $0x18,%esp
        mov  0x20(%esp),%ebx
        mov  $0x1,%eax
        cmp  $0x1,%ebx
        …
    08048330:
        mov  $0x1,%edx
        mov  $0xa,%eax
        lea  0x0(%esi),%esi
        …
        push %ebp
        mov  %esp,%ebp
        and  $0xfffffff0,%esp
        sub  $0x10,%esp
        …

ByteWeight
A machine learning + program analysis approach to function identification
Training: creates a model of function start patterns using supervised learning
Usage:
– Function start identification: use the trained model to match function starts in stripped binaries
– Function identification: use program analysis to identify all bytes associated with a function
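
The training and matching steps above can be illustrated with a heavily simplified sketch. It scores each byte prefix by the fraction of its training occurrences that were real function starts, then flags offsets whose longest known prefix scores above a threshold. The flat dictionary, the raw (unnormalized) bytes, and the names `train_prefix_weights`, `score`, and `find_function_starts` are assumptions made for illustration; the actual tool is more elaborate.

```python
from collections import defaultdict

def train_prefix_weights(binaries, max_len=4):
    """binaries: list of (code_bytes, set_of_true_function_start_offsets)."""
    hits = defaultdict(int)    # times a prefix appeared at a true function start
    total = defaultdict(int)   # times a prefix appeared at any offset
    for code, starts in binaries:
        for offset in range(len(code)):
            for length in range(1, max_len + 1):
                prefix = code[offset:offset + length]
                if len(prefix) < length:       # ran past the end of the code
                    break
                total[prefix] += 1
                if offset in starts:
                    hits[prefix] += 1
    return {prefix: hits[prefix] / total[prefix] for prefix in total}

def score(weights, code, offset, max_len=4):
    """Weight of the longest known prefix starting at offset (0.0 if none)."""
    for length in range(max_len, 0, -1):
        prefix = code[offset:offset + length]
        if len(prefix) == length and prefix in weights:
            return weights[prefix]
    return 0.0

def find_function_starts(weights, code, threshold=0.5, max_len=4):
    """Offsets whose score reaches the threshold."""
    return [o for o in range(len(code))
            if score(weights, code, o, max_len) >= threshold]

# Toy training set: two functions that both begin with
# push %ebp (55); mov %esp,%ebp (89 e5); sub $imm,%esp (83 ec ..); ret (c3).
train_code = bytes.fromhex("5589e583ec18c3" "5589e583ec10c3")
weights = train_prefix_weights([(train_code, {0, 7})])

# Unseen "stripped" code: nop, then the same prologue, then ret.
test_code = bytes.fromhex("90" "5589e5" "c3")
print(find_function_starts(weights, test_code))
```

Only the offset of the `push %ebp` byte is reported, because the other prefixes were never seen at a function start in the toy training data.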

[Picture from Bao's slide]

Problems of Program Analysis
Programs have unbounded behaviors
Program analysis must
– Analyze all behaviors
– Run for a finite time
In finite time, only finitely many behaviors can be observed
Need to generalize

Generalization in Program Analysis
Abstract interpretation: widening operator
CEGAR: interpolants
Parameter tuning of tools (flow sensitivity, path sensitivity, etc.)
Lots of folk knowledge, heuristics, ...
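
As a concrete instance of generalization by widening, the sketch below analyzes the loop `i = 0; while ...: i = i + 1` over the interval domain. After observing a single growing bound, widening jumps that bound to infinity, which is exactly a generalization from finitely many observed behaviors. The tuple encoding of intervals and the function names are assumptions for illustration, not taken from any particular analyzer.

```python
import math

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    """Standard interval widening: any bound that moved jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def analyze_counter_loop(max_iterations=100):
    """Compute an invariant for i at the head of `i = 0; while ...: i = i + 1`.

    Returns the invariant and the number of iterations needed to stabilize.
    """
    initial = (0, 0)
    state = initial
    for iteration in range(1, max_iterations + 1):
        after_body = (state[0] + 1, state[1] + 1)   # effect of i = i + 1
        candidate = join(initial, after_body)       # loop entry or back edge
        next_state = widen(state, candidate)
        if next_state == state:                     # fixpoint reached
            return state, iteration
        state = next_state
    return state, max_iterations

print(analyze_counter_loop())
```

Without widening, the iteration would climb through [0, 1], [0, 2], ... and never terminate; with it, the analysis stabilizes after two iterations at the cost of precision.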

Generalization in Machine Learning
“It’s all about generalization”
A famous concept in computational learning theory
– Complexity and feasibility of learning
Learn a function from observations
Hope that the function generalizes

Bias-Variance Tradeoffs in Program Analysis [Sharma, Nori, Aiken, POPL 14]
Model the generalization process
– Probably Approximately Correct (PAC) model
– Bias: empirical error of the best available hypothesis
– Variance: O(VC dimension)
Explain known observations with this model
Use this model to obtain better tools (in ASTREE, Yogi Project, ...)
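
For reference, the bias and variance terms above have the shape of a textbook VC-style generalization bound. The form below is the standard one from learning theory, stated up to constants, and is not copied from the slides. The symbols are assumptions: $H$ is the hypothesis class, $d$ its VC dimension, $m$ the number of i.i.d. observations, and $\delta$ the failure probability.

```latex
\[
\operatorname{err}(h) \;\le\;
\underbrace{\widehat{\operatorname{err}}_m(h)}_{\text{empirical error (bias)}}
\;+\;
\underbrace{O\!\left(\sqrt{\frac{d \,\log (m/d) + \log (1/\delta)}{m}}\right)}_{\text{complexity term (variance)}}
\quad \text{for all } h \in H,\ \text{with probability at least } 1 - \delta .
\]
```

Read this way, a richer class of hypotheses (for program analysis, roughly a more expressive abstraction) can lower the first term while raising the second, so more precision does not automatically mean better generalization.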

Combine ML and PL Research
Already lots of work in POPL, PLDI, SOSP, OSDI, USENIX Security, ...
Lots of applications and theories remain to be found
Combination with other fields: systems, security, ...

Thank you!
