
1 Software Bertillonage Finding the Provenance of Software Entities Mike Godfrey Software Architecture Group (SWAG) University of Waterloo

2 Work with … Julius Davies**, Daniel German, Abram Hindle, Wei Wang, Ian Davis, Cory Kapser***, Lijie Zou, Qiang Tu
** Did most of the hard work this time
*** Did most of the hard work last time

3

4 “Provenance” Originally used for art / antiques, but now used in science and IT:
– Data provenance / audit trails
– Component provenance for security
– … but what about source code artifacts?
A set of documentary evidence pertaining to the origin, history, or ownership of an artifact. [From “provenir”, French for “to come from”]

5 Consider this code …
const char *err = ap_check_cmd_context(cmd, GLOBAL_ONLY);
if (err != NULL) {
    return err;
}
ap_threads_per_child = atoi(arg);
if (ap_threads_per_child > thread_limit) {
    ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                 "WARNING: ThreadsPerChild of %d exceeds ThreadLimit "
                 "value of %d", ap_threads_per_child, thread_limit);
    ....
    ap_threads_per_child = thread_limit;
} else if (ap_threads_per_child < 1) {
    ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                 "WARNING: Require ThreadsPerChild > 0, setting to 1");
    ap_threads_per_child = 1;
}
return NULL;

6 … and this code
const char *err = ap_check_cmd_context(cmd, GLOBAL_ONLY);
if (err != NULL) {
    return err;
}
ap_threads_per_child = atoi(arg);
if (ap_threads_per_child > thread_limit) {
    ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                 "WARNING: ThreadsPerChild of %d exceeds ThreadLimit "
                 "value of %d threads,", ap_threads_per_child, thread_limit);
    ....
    ap_threads_per_child = thread_limit;
} else if (ap_threads_per_child < 1) {
    ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                 "WARNING: Require ThreadsPerChild > 0, setting to 1");
    ap_threads_per_child = 1;
}
return NULL;

7 … or these two functions
gnumeric_oct2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
{
    return val_to_base (ei, argv[0], argv[1], 8, 2,
                        0, GNM_const(7777777777.0),
                        V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
}

gnumeric_hex2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
{
    return val_to_base (ei, argv[0], argv[1], 16, 2,
                        0, GNM_const(9999999999.0),
                        V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
}

8 Or this …
static PyObject *
py_new_RangeRef_object (const GnmRangeRef *range_ref)
{
    py_RangeRef_object *self;
    self = PyObject_NEW (py_RangeRef_object, &py_RangeRef_object_type);
    if (self == NULL) {
        return NULL;
    }
    self->range_ref = *range_ref;
    return (PyObject *) self;
}

9 … and this
static PyObject *
py_new_Range_object (GnmRange const *range)
{
    py_Range_object *self;
    self = PyObject_NEW (py_Range_object, &py_Range_object_type);
    if (self == NULL) {
        return NULL;
    }
    self->range = *range;
    return (PyObject *) self;
}

10 “Code clone” detection methods, in order of increasing time, complexity, and programming-language dependence:
Strings → Tokens → ASTs → PDGs

11 Provenance: Related ideas
Software clone detection
– Why?
Just “understand” where/why duplication has occurred
Possible refactoring to reduce inconsistent maintenance and binary footprint size, to improve design, …
Tracking software licensing compatibilities, esp. included libraries and cross-product entity “adoption”
– Many techniques for this

12 Provenance: Related ideas
“Origin analysis” + software genealogy
– Why?
Program comprehension
Name / location change of a software entity within a system can break longitudinal studies
– Use entity and relationship analysis to look for likely suspects [TSE-05]
[Diagram: did entity f in version V_old become entity g in V_new?]

13 Provenance: Related ideas
Software dev. “recommender systems”, mining software repositories
– Why?
Given info about similar situations, what might be helpful / informative in this situation?
– Many techniques (AI, LSI, LDA, data mining, plus ad hoc specializations + combinations)
… and so on …

14 Who are you? Alphonse Bertillon (1853–1914)

15

16 Bertillonage metrics
1. Height
2. Stretch: Length of body from left shoulder to right middle finger when arm is raised
3. Bust: Length of torso from head to seat, taken when seated
4. Length of head: Crown to forehead
5. Width of head: Temple to temple
6. Length of right ear
7. Length of left foot
8. Length of left middle finger
9. Length of left cubit: Elbow to tip of middle finger
10. Width of cheeks

17 Forensic Bertillonage
Some problems …
– Equipment was cumbersome, expensive, and required training
– Measurement error, consistency
– The metrics were not independent!
– Adoption (and later abandonment)
… but overall it was a big success!
– Quick and dirty, and a huge leap forward
– Some training and tools required, but could be performed with the technology of the late 1800s
– If done accurately, could quickly narrow down a very large pool of mugshots to only a handful

18 Software Bertillonage
We want quick & dirty ways of investigating the provenance of a function (file, library, binary, etc.)
– Who are you, really? Entity and relationship analysis
– Where did you come from? Evolutionary history
– Does your mother know you’re here? Licensing

19 Bertillonage desiderata
A good Bertillonage metric should:
– be computationally inexpensive
– be applicable to the desired level of granularity and programming language
– catch most of the bad guys (recall)
– significantly reduce the search space (precision)
Why not “fingerprinting” or “DNA analysis”?
– Often there just is not enough info (or too much noise) to make conclusive identification
– So we hope to reduce the candidate set so that manual examination is feasible

20 Bertillonage meta-techniques
1. Count based: size, LOC, fan-in, McCabe
2. Set based: contained string literals, method names
3. Relationship based: call sets, throws sets, libraries included / used
4. Sequence based: methods in order, token-based clone detection
5. Graph based: AST and PDG clone detection
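The cheaper rows of this table fit in a few lines of code. Here is an illustrative Python sketch (the helper names are invented for illustration, not from the talk's tooling) of a count-based and a set-based fingerprint of a code fragment:

```python
import re

def count_fingerprint(code):
    """Count-based: crude size metrics (non-blank LOC, token count)."""
    loc = len([ln for ln in code.splitlines() if ln.strip()])
    tokens = re.findall(r"\w+", code)
    return (loc, len(tokens))

def set_fingerprint(code):
    """Set-based: the set of double-quoted string literals in the code."""
    return set(re.findall(r'"([^"]*)"', code))

snippet = 'ap_log_error(cmd, "WARNING: ThreadsPerChild exceeds limit");'
print(count_fingerprint(snippet))  # crude, but nearly free to compute
print(set_fingerprint(snippet))
```

String literals in particular survive compilation, which is why set-based fingerprints are usable even on binaries.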

21 Software Bertillonage: A motivating example

22 A real-world problem
Software packages often bundle in third-party libraries to avoid “DLL-hell” [Di Penta 10]
– In the Java world, jars may include library source code or just byte code
– Included libs may include other libs too!
Payment Card Industry Data Security Std (PCI-DSS), Req #6:
– “All critical systems must have the most recently released, appropriate software patches to protect against exploitation and compromise of cardholder data.”
What if a financial software package doesn’t explicitly list the version IDs of its included libraries?

23 Identifying included libraries
The version ID may be embedded in the name of the component! e.g., commons-codec-1.1.jar
– … but often the version info is simply not there!
Use the fully qualified name (FQN) of each class plus a code search engine [Di Penta 10]
– Won’t work if we don’t have the library source code
Compare against all known compiled binaries
– But compilers and build-time compilation options may differ

24 Anchored class signatures
Idea: Compile / acquire all known lib versions, but extract only the signatures, then compare against the target binary
– Shouldn’t vary by compiler / build settings
For a class C with methods M1, …, Mn, we define its anchored class signature as:
θ(C) = ⟨σ(C), σ(M1), ..., σ(Mn)⟩
For an archive A composed of classes C1, …, Ck, we define its anchored class signature as:
θ(A) = {θ(C1), ..., θ(Ck)}
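These two definitions can be modelled directly in Python. This is only a sketch of the idea (the data shapes here are invented; the actual extractors worked on Java byte code and source), showing that θ keeps only the declared interface of each class, never the method bodies:

```python
def theta_class(class_sig, method_sigs):
    """theta(C): the class signature followed by its method signatures."""
    return (class_sig, *method_sigs)

def theta_archive(classes):
    """theta(A): the set of anchored signatures of the archive's classes."""
    return frozenset(theta_class(c, ms) for c, ms in classes)

# One class, shaped like the decompiled example on the next slide
jar = [("public class a.b.C extends Object implements I",
        ["public C()",
         "default synchronized static int a(String) throws E"])]
print(theta_archive(jar))
```

Because method bodies are excluded, two builds of the same library version should produce identical θ(A) even under different compilers or optimization settings.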

25 // This is **decompiled** source!!
package a.b;
public class C extends java.lang.Object implements g.h.I {
    public C() {
        // default constructor is inserted by javac
    }
    synchronized static int a (java.lang.String s) throws a.b.E {
        // decompiled byte code omitted
    }
}

σ(C) = public class a.b.C extends Object implements I
σ(M1) = public C()
σ(M2) = default synchronized static int a(String) throws E
θ(C) = ⟨σ(C), σ(M1), σ(M2)⟩

26 Archive similarity
We define the similarity index of two archives A1 and A2 as their Jaccard coefficient:
sim(A1, A2) = |θ(A1) ∩ θ(A2)| / |θ(A1) ∪ θ(A2)|
We define the inclusion index of two archives as the fraction of one archive’s class signatures that also appear in the other:
inclusion(A1, A2) = |θ(A1) ∩ θ(A2)| / |θ(A1)|
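Both indices reduce to simple set operations over the archives' signature sets. A minimal sketch (function and variable names are mine, not the tool's):

```python
def similarity_index(sa, sb):
    """Jaccard coefficient of two anchored-signature sets."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def inclusion_index(sa, sb):
    """Fraction of archive A's signatures that also appear in archive B."""
    return len(sa & sb) / len(sa) if sa else 1.0

lib = {"sig1", "sig2", "sig3"}            # e.g., a known library version
app = {"sig1", "sig2", "sig3", "other"}   # a target app bundling a library

print(similarity_index(lib, app))  # 0.75
print(inclusion_index(lib, app))   # 1.0: the library is wholly included
```

The asymmetry of the inclusion index is the point: a small library bundled inside a large app scores a low similarity index but a high inclusion index.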

27 Implementation
Created byte code (bcel5) and source code signature extractors
Used a SHA1 hash of each class signature to improve performance
– We don’t care about near misses at the method or class level!
Built corpus from the Maven2 jar repository
– Maven is unversioned + volatile!
– 150 GB of jars, zips, tarballs, etc.
– 130,000 binary jars (75,000 unique)
– 26M .class files, 4M .java source files (incl. duplicates)
– Archives contain archives: 75,000 classes are nested 4 levels deep!
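The hashing step can be sketched as follows (a sketch of the idea, not the actual extractor): each class's full signature string collapses to a fixed-size SHA1 digest, so archive comparison becomes set intersection over small keys, and any difference in the declared interface, however small, produces a different key.

```python
import hashlib

def signature_hash(class_signature):
    """Collapse a full anchored class signature to a SHA1 hex digest."""
    return hashlib.sha1(class_signature.encode("utf-8")).hexdigest()

h1 = signature_hash("public class a.b.C implements I|public C()")
h2 = signature_hash("public class a.b.C implements I|public C(int)")
assert h1 != h2  # near misses are deliberately invisible: any change, new key
print(h1)
```

This is why the slide says near misses don't matter here: hashing trades away fuzzy matching at the method level for very fast exact matching at the class level.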

28 Investigation
Target system: an industrial e-commerce app containing 84 jars
RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?
RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?

29 Investigation
RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?
For 51 / 84 binary jars (60.7%), we found a single candidate from the corpus with a similarity index of 1.0
– 48 exact, 3 correct product (target version not in Maven)
For 20 / 84, we found multiple matches with simIndex = 1.0
– 19 exact, 1 correct product
For 12 / 84, we found matches with 0 < simIndex < 1.0
– 1 exact, 9 correct product, 2 incorrect product
For 1 / 84, we found no match (product not in Maven as a binary)
More data here: http://juliusdavies.ca/uvic/jarchive/

30 Investigation
RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
For 22 / 84 binary jars (26.2%), we found a single candidate from the corpus with a similarity index of 1.0
– 13 exact, 2 correct product
For 7 / 84, we found multiple matches with simIndex = 1.0
– 6 exact, 1 correct product
For 46 / 84, we found matches with 0 < simIndex < 1.0
– 25 exact, 20 correct product, 1 incorrect product
For 16 / 84, we found no match (product not in Maven as source)
More data here: http://juliusdavies.ca/uvic/jarchive/

31 Investigation
RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?
– Found exact match or correct product 81 / 84 times (96.4%)
RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
– Found exact match or correct product 57 / 84 times (67.9%)

32 Further uses
1. Used the version info extracted from the e-commerce app to perform audits for licensing and security
– One jar changed open source licenses (GNU Affero, LGPL)
– One jar version was found to have known security bugs
2. When did Google Android developers copy-paste httpclient.jar classes into android.jar?
– And how much work would it be to include a newer version?
– We narrowed it down to two likely candidates, one of which turned out to be correct.

33 Summary: Anchored signature matching
If the master repository is rich enough, anchored signature matching can be highly accurate for both source and binary
– Though you might have to examine multiple candidates with perfect scores (median: 4, max: 30)
– Also works well if the product is present but the target version is missing
The approach is fast and simple, once the repository has been built

34 Summary: Software Provenance / Bertillonage
Who are you?
– Determining the provenance of software entities is a growing and important problem
Software Bertillonage:
– Quick and dirty techniques applied widely, then expensive techniques applied narrowly
Identifying the version IDs of included Java libraries is an example of the software provenance problem
– And our solution of anchored signature matching is an example of software Bertillonage

35 References
“Software Bertillonage: Finding the Provenance of an Entity”, Julius Davies, Daniel M. German, Michael W. Godfrey, and Abram J. Hindle. Under review.
“Copy-Paste as a Principled Engineering Tool”, Michael W. Godfrey and Cory J. Kapser. Chapter 28 of Making Software: What Really Works and Why We Believe It, Greg Wilson and Andy Oram (eds.), O'Reilly and Associates, October 2010.
“Using Origin Analysis to Detect Merging and Splitting of Source Code Entities”, Michael W. Godfrey and Lijie Zou. IEEE Transactions on Software Engineering, 31(2), Feb 2005.

36 Chapter 28 is awesome!! All author proceeds go to Amnesty International!!

37 Non-CS References
Fingerprints: The Origins of Crime Detection and the Murder Case that Launched Forensic Science, Colin Beavan, Hyperion Publishing, 2001.
http://en.wikipedia.org/wiki/Alphonse_Bertillon

38 Software Bertillonage Finding the Provenance of Software Entities Mike Godfrey Software Architecture Group (SWAG) University of Waterloo

39 Bertillonage meta-techniques
1. Count based: e.g., size, LOC, fan-in, McCabe
2. Set based: e.g., contained string literals, method names
3. Relationship based: e.g., call sets, throws sets, libraries included / used
4. Sequence based: e.g., methods in order, token-based clone detection
5. Graph based: e.g., AST and PDG clone detection

40 Why not “software DNA analysis”?
Because the original may not exist
Because the original may exist but not be in our records
Because the original may have changed
Because the search space is large, and detailed comparisons are expensive

41 Suffix tree / array
For each token stream (“string”), build a tree that represents all possible suffixes
– Compare each new string to the set of existing trees
– Comparisons are fast, but use a lot of space
A suffix array is a sorted list of all suffixes, e.g. for “banana”:
1. a
2. ana
3. anana
4. banana
5. na
6. nana
[Diagram from Wikipedia]
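The sorted-suffix list above can be reproduced in a couple of lines. An illustrative sketch only: a production suffix array would use an O(n log n) construction rather than sorting full suffix slices.

```python
def suffix_array(s):
    """Start positions of all suffixes of s, in sorted order of suffix."""
    return sorted(range(len(s)), key=lambda i: s[i:])

s = "banana"
print([s[i:] for i in suffix_array(s)])
# -> ['a', 'ana', 'anana', 'banana', 'na', 'nana']
```

Storing start positions instead of the suffixes themselves is what keeps the array small: n integers instead of n strings of average length n/2.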

42 Who are you?

43 A challenge
Do clone detection on the Maven database at the package level
– Scale
– Precision / recall

44 A question
How effective are simple, cheap techniques? Examples:
– Relationship analysis (e.g., call sets)
– Library inclusion
– Embedded strings in binaries
– Binary-oriented clone detection

45 Techniques
– Source code analysis
– Binary analysis
– Mining software repositories

46 Related work … umm, lots and lots … most of the people in this room, and then some

47 Common problems
Library inclusion in a deployed package
– Upgrade? License change? Standard conformance
Inconsistent maintenance across clones
Understanding entity evolution
License violations

48 Techniques
Lexical / sequence
– Strict or parameterized matching
– Gaps
Syntactic / structural
– Tree / graph comparisons
– Metrics
Semantic
– Relationship analysis (e.g., call sets, lib inclusions)

49 Applications
Which version of library XXX is packaged with release YYY?
– How easy to upgrade? How compatible?
– Changes to licenses?
– Conformance to security standards
Origin analysis
– Patterns of evolution / refactoring
– Want to do empirical studies
Code clone detection (many ways)
GPL violations
– Source analysis
– Static binary analysis
– Dynamic analysis
Entity vs. relationship analysis
String matching
Library inclusion patterns

