Machine Learning for Programming Language Research
Yao Peisen
Prism Group, HKUST

Outline
Overview
– Potential applications
– Intermediate representation
– Probabilistic model
Statistical code completion
Learning to recognize functions in binary code
Bias-variance tradeoffs in program analysis
Conclusion

Marriage of ML and PL
SLANG: code completion [PLDI 14]
PL translation [Onward! 14]
JSNice: type prediction [POPL 15]
More applications:
– Program bug detection
– Program invariant inference
– PL design: probabilistic PL
– Binary analysis

Intermediate Representation
Sequences
Trees
Graphical models
Feature vectors
Other IRs in PL research: AST, CFG, CDG, DDG, PDG, SSA, CPS, ...
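To make the first representations concrete, here is a minimal sketch that views one statement as a token sequence, as a tree, and as a feature vector. It uses Python's standard `tokenize` and `ast` modules on a made-up one-line program; the helper names `token_sequence`, `node_types`, and `feature_vector` are illustrative and do not come from the talk.

```python
import ast
import io
import tokenize
from collections import Counter

# Made-up one-line program used only for illustration.
source = "total = price * count\n"

def token_sequence(src):
    """Sequence view: the identifiers, operators and numbers in source order."""
    tokens = tokenize.generate_tokens(io.StringIO(src).readline)
    keep = (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
    return [tok.string for tok in tokens if tok.type in keep]

def node_types(src):
    """Tree view: the node type of every AST node, in breadth-first order."""
    return [type(node).__name__ for node in ast.walk(ast.parse(src))]

def feature_vector(src):
    """Feature-vector view: how often each AST node type occurs."""
    return Counter(node_types(src))

print(token_sequence(source))
print(node_types(source))
print(feature_vector(source))
```

The same program yields three different inputs for a learner; which one is appropriate depends on the model, as the later slides discuss.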

Extract the program representation with program analysis
SLANG: alias and typestate analysis
JSNice: scope, alias, and type analysis
Other applications:
– Use type inference to obtain training labels
– Use a SAT solver to check path conditions
– ...

What is a suitable probabilistic model?
N-gram language models [PLDI 14]
Probabilistic context-free grammars [ICSE 12]
Neural networks
Support vector machines
Conditional random fields [POPL 15]
As with the IR, the right choice depends on the application.

ML for PL [Picture from Martin Vechev's slides]

Statistical Code Completion [SLANG, V. Raychev et al., PLDI 14]
Key insight: regularities in code are similar to regularities in natural language.

Techniques in SLANG
IR: sequences (sentences)
Program analysis: typestate analysis, alias analysis
Trained models: neural network, n-gram language model
Some smoothing techniques

N-gram language model
The probability of each word is conditioned only on the previous n-1 words.
Training is done by counting n-grams.
The cost per word encountered in training is constant, so training is usually fast.
Other models used: recurrent neural networks (RNNs). An RNN can learn dependencies beyond the previous few words, but is usually slower.
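The counting procedure can be sketched for n = 2 as follows. This is a minimal illustration and not SLANG's implementation: the toy corpus of call sequences, the `<s>` and `</s>` boundary markers, the helper names, and the choice of add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams in one pass; each token costs O(1) work."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        contexts.update(tokens[:-1])             # how often each word is a context
        bigrams.update(zip(tokens, tokens[1:]))  # how often each (prev, word) pair occurs
        vocab.update(tokens[1:])                 # every word that can be predicted
    return contexts, bigrams, vocab

def prob(model, prev, word):
    """P(word | prev) with add-one smoothing, so unseen pairs keep some mass."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def complete(model, prev):
    """Suggest the most probable next word after `prev`."""
    _, _, vocab = model
    return max(sorted(vocab), key=lambda word: prob(model, prev, word))

# Hypothetical "sentences": sequences of calls made on one object.
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
model = train_bigrams(corpus)
print(complete(model, "open"), prob(model, "open", "read"))
```

Because the model only looks one word back, it cannot tell how many `read` calls preceded the current one; that kind of longer-range dependency is what the RNN is meant to capture.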

Learning to Recognize Functions in Binary Code [Tiffany Bao et al., USENIX Security 14]
In a stripped binary the function information is gone, and with gcc -O3 the optimized code may no longer show the usual function prologue.
Can we automatically and accurately recover function information from binaries?

Example: GCC
    #include <stdio.h>
    /* Recursive factorial; assumes x >= 1. */
    int fac(int x) {
        if (x == 1)
            return 1;
        else
            return x * fac(x - 1);
    }
    int main(int argc, char **argv) {
        printf("%d", fac(10));
        return 0;
    }

Example: GCC default (-O0)
…:
    push %ebp
    mov %esp,%ebp
    and $0xfffffff0,%esp
    sub $0x10,%esp
    …
…c:
    push %ebp
    mov %esp,%ebp
    sub $0x18,%esp
    cmpl $0x1,0x8(%ebp)
    jne …f
    mov $0x1,%eax
    …
Both fragments begin with the standard prologue push %ebp; mov %esp,%ebp.

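Example: GCC -O1, -O…
…c:
    push %ebx
    sub $0x18,%esp
    mov 0x20(%esp),%ebx
    mov $0x1,%eax
    cmp $0x1,%ebx
    …
…:
    mov $0x1,%edx
    mov $0xa,%eax
    lea 0x0(%esi),%esi
    …
…:
    push %ebp
    mov %esp,%ebp
    and $0xfffffff0,%esp
    sub $0x10,%esp
    …
With optimization, only the last fragment keeps the push %ebp; mov %esp,%ebp prologue, so a fixed prologue signature no longer finds every function start.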

ByteWeight
A machine learning + program analysis approach to function identification
Training: create a model of function start patterns using supervised learning
Usage:
– Function start identification: use the trained model to match function starts in stripped binaries
– Function identification: use program analysis to identify all bytes associated with a function
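The training and matching steps can be illustrated with a much-simplified sketch. It is not ByteWeight's implementation: the byte-prefix length of three, the scoring rule (fraction of training occurrences that were true function starts), the 0.5 threshold, the function names, and the tiny labeled training set are all assumptions made for the example.

```python
from collections import defaultdict

def train(samples, max_len=3):
    """samples: iterable of (code_bytes, is_start) pairs taken from labeled binaries."""
    counts = defaultdict(lambda: [0, 0])  # prefix -> [times at a function start, times seen]
    for code, is_start in samples:
        for n in range(1, min(max_len, len(code)) + 1):
            entry = counts[code[:n]]
            entry[0] += int(is_start)
            entry[1] += 1
    return dict(counts)

def score(model, code, max_len=3):
    """Weight of the longest prefix of `code` seen in training; 0.0 if none was seen."""
    for n in range(min(max_len, len(code)), 0, -1):
        entry = model.get(code[:n])
        if entry:
            return entry[0] / entry[1]
    return 0.0

def is_function_start(model, code, threshold=0.5, max_len=3):
    return score(model, code, max_len) >= threshold

# Made-up training data: (first bytes at some address, was it a function start?)
training = [
    (bytes.fromhex("5589e5"), True),   # push %ebp; mov %esp,%ebp
    (bytes.fromhex("5589e5"), True),
    (bytes.fromhex("5383ec"), True),   # push %ebx; sub ...,%esp
    (bytes.fromhex("5589c3"), False),  # push %ebp; mov %eax,%ebx (not a start here)
    (bytes.fromhex("83ec18"), False),  # sub $0x18,%esp in the middle of a function
]
model = train(training)
print(score(model, bytes.fromhex("5589e5")), score(model, bytes.fromhex("558900")))
```

Learning weights for several prefixes, rather than hard-coding one prologue, is what lets the matcher also accept starts such as `push %ebx` that appear in optimized code.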

[Picture from Bao's slide]

Problems of Program Analysis
Programs have unbounded behaviors
Program analysis must
– Analyze all behaviors
– Run for a finite time
In finite time, only finitely many behaviors can be observed
So the analysis needs to generalize

Generalization in Program Analysis
Abstract interpretation: widening operator
CEGAR: interpolants
Parameter tuning of tools (flow sensitivity, path sensitivity, etc.)
Lots of folk knowledge, heuristics, ...
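As an illustration of the first item, here is a minimal sketch of interval analysis with the standard interval widening operator. The loop being analyzed (`i = init; while ...: i += step`, with the loop guard ignored), the tuple encoding of intervals, and the function names are assumptions made for the example, not something taken from the talk.

```python
import math

def join(a, b):
    """Smallest interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    """Standard interval widening: any bound that moved jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def analyze_loop(init, step, use_widening, max_iters=1000):
    """Iterate the loop-head invariant for i until it stabilizes.

    Returns (interval, iterations), with iterations None if no fixpoint
    was reached within max_iters.
    """
    state = (init, init)
    for iteration in range(1, max_iters + 1):
        after_body = (state[0] + step, state[1] + step)
        new = join(state, after_body)
        if use_widening:
            new = widen(state, new)
        if new == state:
            return state, iteration
        state = new
    return state, None

print(analyze_loop(0, 1, use_widening=True))
print(analyze_loop(0, 1, use_widening=False, max_iters=50))
```

Without widening the analysis only ever sees the finitely many values it has stepped through; widening generalizes from two observations to "i can grow without bound", trading precision for termination.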

Generalization in Machine Learning
“It’s all about generalization”
A famous concept in computational learning theory
– Complexity and feasibility of learning
Learn a function from observations
Hope that the function generalizes

Bias-Variance Tradeoffs in Program Analysis [Aiken, POPL 14]
Model the generalization process
– Probably Approximately Correct (PAC) model
– Bias: empirical error of the best available hypothesis
– Variance: O(VC-d), where VC-d is the VC dimension of the hypothesis class
Explain known observations with this model
Use this model to obtain better tools (in ASTREE, the Yogi project, ...)
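For reference, the two terms can be read off a textbook VC generalization bound. The form below is Vapnik's classical bound, given as an illustration under assumed notation (m i.i.d. samples, a hypothesis class H of VC dimension d, confidence 1 - δ, true error R and empirical error R̂); it is not quoted from the POPL 14 paper, whose exact statement may differ.

```latex
% With probability at least 1 - \delta over m i.i.d. samples,
% for every hypothesis h in a class H of VC dimension d:
\[
  R(h) \;\le\; \widehat{R}_m(h)
  + \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}
\]
```

When h is the best available hypothesis, the first term corresponds to the bias and the second, which grows with d, to the variance. A richer abstraction can lower the first term while raising the second, which is the tradeoff the slide refers to.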

Combining ML and PL Research
Already lots of work in POPL, PLDI, SOSP, OSDI, USENIX Security, ...
Lots of applications and theories still to be found
Combination with other fields: systems, security, ...

Thank you!