272: Software Engineering Fall 2012 Instructor: Tevfik Bultan Lecture 17: Code Mining.

Code Mining
There is a lot of code that is available for everyone to access
–Can we learn from it?
One of the active research directions in software engineering is to mine existing code for various purposes, such as
–To discover common behaviors, which can then be used to extract specifications such as interfaces, usage patterns, etc.
–To discover anomalies, which can then be used to find bugs or problematic behaviors

We will discuss two papers that do this
Today we will discuss two papers that use code mining for different purposes:
–"Graph-based Mining of Multiple Object Usage Patterns" Tung Nguyen, Hoan Nguyen, Nam Pham, Jafar Al-Kofahi, and Tien Nguyen. 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2009).
–"Mining Specifications of Malicious Behavior" Mihai Christodorescu, Somesh Jha, Christopher Kruegel. Joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2007).

Mining for Object Usage Patterns
We discussed papers that automatically extract behavioral interfaces for classes
–The papers we discussed earlier focus on usage of a single class and try to identify the ordering of method calls on a single object
However, there may be usage patterns that involve multiple objects
–Moreover, usage patterns may involve control flow structures, such as calling a method within a loop
–They may also involve data dependencies, such as one argument in a method call being dependent on an argument in another method call

GrouMiner
GrouMiner is a tool that extracts usage patterns for objects, taking into account both
–temporal usage orders (like we have seen in the interface extraction papers we discussed already), and
–data dependencies
It defines a graph-based object usage model (groum) and extracts these models from existing code

Object Usage Model
A groum is a directed acyclic graph
–nodes are labeled, edges are not labeled
Nodes correspond to
–actions: method calls, access to object fields
–control flow structures: conditions, branches or loops such as if, while, for statements
Edges represent
–temporal ordering: If A is used before (or always generated before) B then there is an edge from A to B
–data dependency: There is an edge from A to B if there is a data dependency between A and B
A groum can represent multiple objects
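
The model on this slide can be sketched as a small data structure. This is only an illustration, not GrouMiner's implementation: the class name `Groum`, its methods, and the node labels in the example are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Groum:
    """Hypothetical minimal groum: labeled nodes, unlabeled directed edges."""
    labels: dict = field(default_factory=dict)  # node id -> label
    edges: set = field(default_factory=set)     # (source id, target id)

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, src, dst):
        self.edges.add((src, dst))

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove nodes with no incoming edges.
        indegree = {n: 0 for n in self.labels}
        for _, dst in self.edges:
            indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            n = ready.pop()
            removed += 1
            for src, dst in self.edges:
                if src == n:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        return removed == len(self.labels)


# Illustrative usage involving two objects and one control structure node.
g = Groum()
g.add_node(0, "FileReader.<init>")        # action node
g.add_node(1, "BufferedReader.<init>")    # action node
g.add_node(2, "WHILE")                    # control flow structure node
g.add_node(3, "BufferedReader.readLine")  # action node
g.add_node(4, "BufferedReader.close")     # action node
for src, dst in [(0, 1), (1, 2), (2, 3), (3, 4)]:  # temporal ordering
    g.add_edge(src, dst)
g.add_edge(1, 3)  # data dependency: readLine uses the reader created at node 1
```

Note that the loop is a single labeled node rather than a back edge, which is what lets the graph stay acyclic.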

How to Extract the Object Usage Model?
The temporal ordering of the nodes in a groum is extracted from the AST by adding edges between nodes that are sequentially ordered
The data dependency edges are extracted using an intra-procedural dependency analysis
–Identify the variables involved in each action to determine the dependencies and add edges to represent the dependencies
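
A rough sketch of the two kinds of edges, under strong simplifying assumptions: the input is straight-line code that has already been flattened into a list of actions, each paired with the variables it involves, and two actions are treated as data dependent whenever they share a variable. The function name `build_edges` and the example actions are hypothetical; the paper's analysis works on the AST and is more precise.

```python
def build_edges(actions):
    """actions: list of (label, set of variable names) in source order.

    Returns (temporal, data), two sets of (earlier index, later index) pairs.
    """
    temporal = set()
    data = set()
    for i in range(len(actions)):
        for j in range(i + 1, len(actions)):
            if j == i + 1:
                # Sequentially ordered actions get a temporal edge.
                temporal.add((i, j))
            if actions[i][1] & actions[j][1]:
                # Assumption: sharing a variable counts as a data dependency.
                data.add((i, j))
    return temporal, data


actions = [
    ("FileReader.<init>", {"fr"}),
    ("BufferedReader.<init>", {"fr", "br"}),
    ("BufferedReader.readLine", {"br", "line"}),
    ("StringBuffer.append", {"sb", "line"}),
    ("BufferedReader.close", {"br"}),
]
temporal, data = build_edges(actions)
groum_edges = temporal | data  # groum edges themselves carry no labels
```

Here `(1, 4)` appears only as a data dependency (the reader created at index 1 is closed at index 4), while `(3, 4)` appears only as a temporal edge.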

Extracting Code Skeletons
A groum model can be un-parsed and converted to a code skeleton
This code skeleton will demonstrate the usage pattern as code rather than as a directed acyclic graph
This approach can be used as a reverse engineering approach to discover different usage patterns in the code
However, there will be many groums in a given code base and not all of them should be reported as usage patterns
–They should be filtered somehow

Usage Pattern Mining
GrouMiner uses graph mining techniques to identify the common usage patterns in the code
They determine the frequency of a pattern by computing the number of independent occurrences
–If the frequency of a pattern is higher than a threshold, then it is reported
The graph mining algorithm determines the common patterns efficiently
–By identifying common graph patterns incrementally, starting with graphs with a small number of nodes and then finding other patterns based on the sub-graph relationship
–By checking equivalence of patterns approximately using a vector representation that summarizes the features of a pattern, rather than doing an exact matching
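
The idea of approximate matching plus a frequency threshold can be illustrated as follows. This is a sketch under assumptions, not the paper's algorithm: the feature vector here simply counts node labels and label-to-label edges, every fragment is counted as an independent occurrence, and the incremental growth of patterns is omitted. The names `feature_vector` and `frequent_patterns` are hypothetical.

```python
from collections import Counter


def feature_vector(labels, edges):
    """Summarize a small graph by counts of node labels and labeled edge pairs."""
    features = Counter(labels.values())
    features.update((labels[s], labels[d]) for s, d in edges)
    return frozenset(features.items())


def frequent_patterns(fragments, threshold):
    """Group fragments by feature vector; keep groups with frequency > threshold."""
    counts = Counter(feature_vector(labels, edges) for labels, edges in fragments)
    return {vec: n for vec, n in counts.items() if n > threshold}


# Two fragments with different node ids but the same shape, and one unrelated fragment.
frag_a = ({0: "Iterator.hasNext", 1: "Iterator.next"}, {(0, 1)})
frag_b = ({10: "Iterator.hasNext", 11: "Iterator.next"}, {(10, 11)})
frag_c = ({0: "File.open", 1: "File.close"}, {(0, 1)})

patterns = frequent_patterns([frag_a, frag_b, frag_c], threshold=1)
```

Comparing vectors avoids an exact graph isomorphism test, at the cost that two structurally different graphs could in principle share a vector.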

Anomaly Detection
Using the graph mining algorithm they can identify anomalous usages
–They identify an anomalous usage as a sub-graph of an identified pattern that is not extensible to that pattern
–This is considered a violation of the pattern
A violation is considered an anomaly when it is too rare
–i.e., common violations are not reported as anomalies
They discuss two types of anomaly detection:
1. anomaly detection in a given project
2. anomaly detection when a project changes
Anomaly detection can be used to identify errors
–An anomalous usage may correspond to a violation of an interface and may point to a bug
However, when anomaly detection is used as a bug finding approach it generates a lot of false positives (87.8% in one case)
–i.e., many identified anomalies do not correspond to errors
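
The "rare violation" rule can be sketched with a deliberately simplified model in which each usage is a set of action labels rather than a graph. A usage violates a pattern when it contains a sub-pattern but does not extend to the full pattern. The function name `find_anomalies`, the ratio-based rarity test, and the default cutoff of 0.2 are assumptions for illustration; the slides do not give the paper's exact criterion.

```python
def find_anomalies(usages, pattern, sub_pattern, max_violation_ratio=0.2):
    """usages: dict mapping a method name to the set of action labels it performs."""
    conforming = [name for name, labels in usages.items() if pattern <= labels]
    violating = [
        name for name, labels in usages.items()
        if sub_pattern <= labels and not pattern <= labels
    ]
    total = len(conforming) + len(violating)
    if total == 0 or len(violating) / total > max_violation_ratio:
        # Common violations are not reported as anomalies.
        return []
    return sorted(violating)


pattern = {"open", "read", "close"}
sub_pattern = {"open", "read"}

# Nine methods follow the pattern; one reads without closing.
usages = {f"m{i}": {"open", "read", "close"} for i in range(9)}
usages["m9"] = {"open", "read"}
rare = find_anomalies(usages, pattern, sub_pattern)

# When half of the usages skip "close", the violation is too common to report.
mixed = {f"m{i}": {"open", "read", "close"} for i in range(5)}
mixed.update({f"v{i}": {"open", "read"} for i in range(5)})
common = find_anomalies(mixed, pattern, sub_pattern)
```

In the first case only `m9` is flagged; in the second nothing is flagged because the violation is as frequent as the pattern itself.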

Mining Specifications of Malicious Behavior
In the second paper we are discussing, code mining is used to find specifications of malicious behavior
Computer security applications rely on manually written specifications to identify malicious code automatically
However, the manual specification task is hard and time consuming
–This paper tries to automate the specification of malicious behavior

The approach
The presented approach works in three steps
1. Collect execution traces from malware and benign programs
2. Construct the corresponding dependence graphs
3. Compute the specification of malicious behavior as the difference of the dependence graphs
Note that in this approach mining is done on the execution traces
–In the paper we discussed earlier, mining was done on the source code

How to represent behavior?
They identify some requirements for representation of behaviors:
1. A specification must not contain independent operations
2. A specification must relate the dependent operations
3. A specification should capture only security relevant operations
To meet these requirements they focus only on system calls and represent malicious behavior as a dependence graph of system calls
This representation satisfies their requirements
–Independent calls will not be connected in this representation
–Dependent calls will be connected
–Only the system calls will be tracked since they correspond to the security relevant operations

How to represent the behavior?
The behavior is represented as a special type of dependence graph
Since they are interested in system security, they decide to model execution behavior as a sequence of system calls
Each node of the dependence graph they construct corresponds to a system call
The edges of the dependence graph correspond to constraints that represent the dependences between two system calls
–For example, argument 1 of call 1 is equal to argument 2 of call 2

More on dependence graphs
The dependence graphs they construct are directed acyclic graphs
Each node corresponds to a system call
–They define a simple type system for the arguments of the system calls
Edges represent dependencies which are characterized as logic formulas
–A logic system that allows constraints with modular and bit-vector arithmetic, arrays, and existential and universal quantifiers is sufficient
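
A toy version of such a dependence graph can be built from a recorded trace. The sketch below assumes a trace format of (call name, argument tuple, return value) and only generates equality constraints between a value seen at an earlier call and an argument of a later call; the paper's constraint logic is richer. The function name `dependence_graph`, the trace format, and the call records in the example are hypothetical.

```python
def dependence_graph(trace):
    """trace: list of (call name, argument tuple, return value) in execution order.

    Returns edges (earlier index, later index, constraint text).
    """
    edges = []
    for i, (_, args_i, ret_i) in enumerate(trace):
        # Values available at call i: its return value and its arguments.
        outputs = [("ret", ret_i)]
        outputs += [(f"arg{k}", v) for k, v in enumerate(args_i, 1)]
        for j in range(i + 1, len(trace)):
            for k, value in enumerate(trace[j][1], 1):
                for slot, produced in outputs:
                    if produced == value:
                        constraint = f"{slot}(call{i}) == arg{k}(call{j})"
                        edges.append((i, j, constraint))
    return edges


# Hypothetical trace: a file is read and its contents are sent over a socket.
trace = [
    ("open", ("/etc/passwd",), 3),
    ("read", (3, 4096), 120),
    ("socket", (), 4),
    ("send", (4, 120), 120),
]
edges = dependence_graph(trace)
```

Because edges only go from earlier to later calls, the resulting graph is acyclic by construction. Plain value equality can create spurious edges when unrelated calls happen to use the same constant, which a real analysis would need to rule out.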

Comparing Benign Programs and Malware
The presented approach first constructs the dependence graphs for the execution traces of the benign programs and the malicious programs
Then they construct the minimal contrast subgraph of a malware dependence graph and the benign dependence graph
–The smallest subgraph of the first graph that does not appear in the second
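
Computing a true minimal contrast subgraph requires subgraph matching and is not attempted here. As a much simpler stand-in, the sketch below treats each dependence graph as a set of labeled edges and keeps the malware edges that occur in no benign graph. This edge-level difference is an assumption made for illustration and is not the paper's algorithm; the function name `contrast_edges` and the example graphs are hypothetical.

```python
def contrast_edges(malware_edges, benign_graphs):
    """Return the malware edges that appear in none of the benign graphs.

    Each graph is a set of (source call label, target call label) pairs.
    """
    benign = set().union(*benign_graphs)
    return malware_edges - benign


malware = {("open", "read"), ("read", "send"), ("socket", "send")}
benign_graphs = [
    {("open", "read")},      # e.g. a program that only reads files
    {("socket", "send")},    # e.g. a program that only uses the network
]
suspicious = contrast_edges(malware, benign_graphs)
```

In this example the only edge left is the flow from `read` to `send`: reading a file and sending data are each benign on their own, and the dependence between them is what distinguishes the malware.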

Empirical evaluation
The presented approach is applied to 16 well-known malware examples
For these 16 examples, the algorithm successfully discovers the same behavioral features as those independently provided by human experts