Regexes in the Wild Empirical Studies on Security and Correctness
Hi, I’m Jamie. In this talk I’ll discuss my dissertation research on regular expressions (regexes). For those of you on my PhD committee, I hope you have not yet grown bored of this material. The work I’ll describe today is largely empirical. If I had to pick just one thing for you to take away today, it would be that I have placed the problems that practitioners face with regexes on firm empirical ground: both a security issue called ReDoS and portability problems that may lead to correctness issues.
James Davis, Dongyoon Lee

Regexes! Some of you might be wondering why my research focuses on regexes. The answer is hiding here on my LinkedIn profile. Before coming to grad school, I spent a few years on a software product team at IBM. My job was to test the software, and one of my roles was to ensure that the command-line interface was still working. And guess what? *A CLI produces human-friendly output rather than machine-friendly output, and that output might change a bit from release to release, so a common way to test a CLI is to encode valid output using a regex. I probably wrote and maintained hundreds of regexes during my time at IBM, and got a really good sense of the value of this technology for practitioners.

Talk outline
1. Background
What is a regex?
What do software engineers use them for?
How are they implemented in programming languages?
How are regexes related to security (ReDoS)? (SECURITY’18)
2. Selected research
Methods: Where do I get regexes to analyze? (ASE’19)
Security: How widespread might ReDoS be? (ESEC/FSE’18)
Correctness: The promise and perils of re-using regexes (ESEC/FSE’19)
3. Advice to my past self
In the first section of the talk, I’ll answer some questions you might be wondering about at the moment, like: *… Regexes can be a security problem: because they can be expensive to evaluate, they can lead to service outages (hence “RE denial of service”). Then I’ll describe some of the research I’ve done on this topic. And since many of you are junior graduate students, I’ll close with a few reflections on my time in graduate school, in the hopes of improving all of your experiences.

Part 1: Background Let me start by introducing some background material.

Primer on Regular Expressions (Regexes)
Concept: a regex describes a language, i.e. a subset of all possible strings.
Sample notation | String language
ab | “ab”
a+ | “a”, “aaa”
a* | “”, “aaa”
[a-z] | “a”, “x”, “z”
\w | [0-9 a-z A-Z]
Supported in all PLs. Extended regexes: NP-hard.
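To make the "language" idea concrete, here is a minimal Python sketch that checks the sample notations above against strings inside and outside each language. The helper name `in_language` and the rejected strings are my own illustrations, not from the slide. Note that Python's `\w` is somewhat broader than the slide's `[0-9 a-z A-Z]` (it also accepts the underscore, for example).

```python
import re

# Each entry: (pattern, strings in the language, strings outside it).
EXAMPLES = [
    (r"ab", ["ab"], ["a", "abb"]),
    (r"a+", ["a", "aaa"], ["", "b"]),
    (r"a*", ["", "aaa"], ["b"]),
    (r"[a-z]", ["a", "x", "z"], ["A", "az"]),
    (r"\w", ["7", "q", "Q"], ["!", " "]),
]


def in_language(pattern: str, text: str) -> bool:
    """Return True if the whole string belongs to the regex's language."""
    return re.fullmatch(pattern, text) is not None


for pattern, accepted, rejected in EXAMPLES:
    assert all(in_language(pattern, s) for s in accepted)
    assert not any(in_language(pattern, s) for s in rejected)
    print(f"/{pattern}/ accepts {accepted} and rejects {rejected}")
```

`re.fullmatch` is used so that membership is decided on the whole string rather than on any substring.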

Software engineers use regexes for…
30-40% of Python and JavaScript projects [C&S ‘16] [D et al. ‘18]
Diverse purposes:
User-agent string → Server-side rendering
File names → Command-line tools
Tokenizing → Lexers
HTML → Browser plug-ins, input validation
Regexes are widely used, estimated to appear in 30-40% of Python and JavaScript projects. They are used for diverse purposes. My prior empirical work has identified many common application areas, including:
User-agent string: e.g. rendering a webpage to take advantage of features available only on Google Chrome
Source code: e.g. to enforce coding conventions like camelCase
File names: e.g. to validate input for a CLI
Tokens: e.g. for a lexer to tokenize some source code
HTML: e.g. for a browser to manipulate source code, or to do form validation

Some regex examples
/.+$/ → “Chars”
/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/ → IPv4
/^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ → Username
Super-linear “ReDoS regex”
To give you a few more examples, here are some regexes we mined from real software available on GitHub.

How most regex engines work [Spencer’94]
What are they? match(regex, string) → “Does this regex match this string?” How do they work? /^a+$/: simulate on input “aaa”. Now this worst-case exponential-time behavior is due to how the regex engines in most programming languages are implemented.

Super-linear “ReDoS Regexes”
Simple ReDoS regex: /^(a+)+$/ (NFA)
Malicious input: “aaaaaaaaaa…aa!”
Recurrence relation: T(n) = 2*T(n-1) = 2*(2*T(n-2)) = … = O(2^n)
Mismatch: backtracking, exponential paths
A ReDoS regex is a special kind of regex whose worst-case behavior is polynomial or exponential in most regex engines.
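The recurrence can be checked with a small counting model. This is a sketch of a naive backtracking search for /^(a+)+$/ on a mismatching input, not the code of any real regex engine; the function name `count_attempts` is hypothetical. It counts every point at which the search would test for end-of-input and fail.

```python
def count_attempts(text: str, pos: int = 0) -> int:
    """Count the paths a naive backtracking search tries for /^(a+)+$/.

    Models only the failing case: starting at `pos`, the inner a+ may take
    any non-empty prefix of the available 'a's; after each choice the search
    tests for end of input (one attempt) and then recurses into another
    iteration of the outer group.
    """
    attempts = 0
    end = pos
    while end < len(text) and text[end] == "a":
        end += 1
        attempts += 1                          # test "$" here; fails on "!"
        attempts += count_attempts(text, end)  # try one more (a+) iteration
    return attempts


for n in range(1, 17, 3):
    print(n, count_attempts("a" * n + "!"))    # grows as 2**n - 1
```

In this model each additional "a" doubles the work (T(n) = 2*T(n-1) + 1), which is the exponential blow-up the slide describes.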

Regular Expression Denial of Service (ReDoS)
[C&W ‘03] /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ Inputs: “Susie” (legitimate user), “aaa…aaaa!” (malicious input injected).
Suppose the server is using a ReDoS regex for usernames; we found this one in some of Microsoft’s open-source code. Killing a server: users like Susie are sad. If this is Outlook mail, then millions of customers can’t check their email. ReDoS is an attack concept that has been discussed for about 15 years in the security and theory communities, but I am the first to perform large-scale measurements and experiments on this security vulnerability. In my research I have shown that ReDoS is a real threat to many software applications, and hopefully motivated both practitioners and programming language designers to move to address it. [S&P ‘18] [D et al. ‘18]

ReDoS @ CloudFlare – July 2019
[Figure: CPU utilization (all machines), y-axis from 0% to 100%]
There have, however, been a few anecdotes of ReDoS in the wild. For example, the CDN company CloudFlare had an outage in July 2019 that was caused by a super-linear regex evaluation. This regex was used for XSS payload detection and was applied to all of the packets processed by their servers. *When they deployed the regex, it began to exhibit cubic behavior, burning up all of their CPU cores. This led to a 27-minute service outage, the first they had had in years. *Here’s the regex in all of its super-linear glory.
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

ReDoS @ JavaScript: 10% of npm vulnerabilities
CloudFlare isn’t the only company that has been affected. In a paper I published at USENIX Security with these fine gentlemen, we categorized the security vulnerabilities reported in the JavaScript module registry npm. According to Snyk.io, which maintains the largest such database, about 75% of the DoS vectors, or 10% of all reported security vulnerabilities, were caused by regular expressions. SECURITY’18 – Excerpt

Part 2: Research
Alrighty. Now you are all experts on regexes, and you have a sense of one problem that software engineers might encounter when they work with them. Next I’ll walk you through three of my research projects on regexes.

Research outline: 1. Methodology (ASE’19) 2. ReDoS (ESEC/FSE’18) 3. Re-use (ESEC/FSE’19)
First I’ll explain a key aspect of the research methodology by walking through a snippet of a methods paper I published at ASE this year. Then I’ll walk through a paper I published last year examining ReDoS regexes. Finally I’ll discuss another paper I published this year about regex re-use practices.

How to build a regex corpus
ASE’19 – Excerpt How to build a regex corpus (research excerpt) James Davis Daniel Moyer Ayaan Kazerouni Dongyoon Lee Q: Where do I get regexes to analyze? A: This project on building a regex corpus

What’s a regex corpus, and why do I want one?
Q: How to extract regexes?
A regex corpus shows: typical regex practices; popular and unpopular features; extent of super-linear regexes → tool builders, regex engine devs.
*We’d like to understand how software engineers use regexes in practice. *To make sure our results aren’t biased, we can’t just look at one project; we’ll need to look at a bunch. *Suppose we have a magic means of extracting regexes from software projects. Then we can build a collection of all of the regexes people use in the wild. *We can use this for… → “all sorts of insights that will benefit regex tool builders and regex engine developers”. *So, how should we extract regexes?

How have researchers built regex corpuses?
Programming language →
Software → Applications, modules
Extraction methodology → “Static” and “Dynamic”
Note that static extraction is fairly straightforward, while “dynamic extraction” (instrument the program or runtime, then execute the software) is more difficult to scale to many projects and multiple languages.
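To give a feel for why static extraction is the straightforward option, here is a minimal sketch for Python source only. It is an illustration under my own assumptions, not the extractor used in the study: the function name `extract_regexes`, the `RE_FUNCTIONS` set, and the `SAMPLE` program are all hypothetical. It walks the syntax tree and collects string literals passed as the first argument to functions of Python's `re` module.

```python
import ast

# Functions in Python's `re` module whose first argument is a pattern.
RE_FUNCTIONS = {"compile", "search", "match", "fullmatch",
                "split", "findall", "finditer", "sub", "subn"}


def extract_regexes(source: str) -> list[str]:
    """Statically collect string-literal patterns passed to re.<function>()."""
    patterns = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.func.attr in RE_FUNCTIONS
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            patterns.append(node.args[0].value)
    return patterns


# Hypothetical module source used only for this demonstration.
SAMPLE = r'''
import re
IPV4 = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
def is_blank(line):
    return re.match(r"\s*$", line) is not None
pattern = build_pattern()      # built at runtime: invisible to this pass
re.search(pattern, "text")
'''

for regex in extract_regexes(SAMPLE):
    print(regex)
```

The last call in `SAMPLE` shows the trade-off: a pattern built at runtime is missed by a static pass like this one, and only instrumentation while the software runs would observe it.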

Experimental design
Programming language →
Software → Important open-source modules
Extraction methodology → Let’s try both!
Research question: Does it matter which extraction methodology we follow? More formally, we want to know whether there is a measurable difference between regexes extracted using one methodology or the other.

Regex collection methodology (125K regexes)
[Flowchart: Module selection (“Important”: top 25K) → Module regex extraction (static analysis; program instrumentation)]
We collected regexes like this. For each language, we identified the largest module registry. We mapped those modules to GitHub and sorted them by the number of GitHub stars they have, which gave us a registry-independent measurement of importance. *For each of the top 25K modules, we applied both previously-used regex extraction techniques to build two sets of regexes, one per technique. *After we repeat this for 25K modules, these can BOTH rightly be said to be a JavaScript regex corpus. But is there any difference? To answer this we need to characterize a regex, preferably quantitatively. The next part of our project was collecting a set of metrics to characterize a regex across three dimensions considered in prior work.

Regex metrics
Representation: string length, features used, NFA size
Language diversity: matching strings
Complexity: DFA size, super-linear behavior (“ReDoS regexes”), use of advanced features

Representative results (all charts look like this)

Conclusions from this work

Within a Prog. Lang., the two competing regex extraction methodologies yield indistinguishable regex corpuses. (So let’s use the easier methodology!) Slide, then: Takehome message. We believe our study shows that ReDoS is a very real problem in practice, and that the practitioner community needs better tools to detect and repair ReDoS regexes.

The Impact of ReDoS in Practice
ESEC/FSE’18 – Distinguished paper The Impact of ReDoS in Practice James Davis Christy Coghlan Francisco Servant Dongyoon Lee Two of the research questions I’ll highlight: (1) How common are super-linear regexes (“ReDoS regexes”) in real software? (2) How do developers repair these security vulnerabilities?

Highlighted contributions
ReDoS regexes are prevalent in the wild: not just one or two, but thousands! Developers (try to) fix them with 3 techniques. 1. The key ingredient for ReDoS is a ReDoS regex. We found, to our surprise, not just one or two but thousands of ReDoS regexes used in practice today. Takehome message: We believe our study shows that ReDoS is a very real problem in practice, and that the practitioner community needs better tools to detect and repair ReDoS regexes.

ReDoS Regexes in the Wild

Collecting Regexes 45% 35% 350K 60K Module selection
Module regex extraction 45% 35% 565K 125K Giant list of regexes “Can clone” filter Walk through the flow-chart GitHub: Anything that was “git clone-able”, which in the vast majority of cases meant the modules were hosted on GitHub Whole-ecosystem approach. Python We also repeated this analysis for the Python module ecosystem to see if our findings would generalize to another context. 350K 60K 375K (66%) 70K (58%)

Analyzing Regexes
[Flowchart: 350K and 60K regexes → 1. ReDoS regexes → 2. Degree. Figures shown: 3.6K (1%), 704 (1%); 13K (3%), 705 (1%)] [R&T ‘14] [WMBW ‘16] [WOHD ‘17]
1. ReDoS regexes: Talk through flowchart. From this we learn that ReDoS regexes occur a lot in practice: thousands of ReDoS regexes occurring in over 10K modules. ReDoS regexes make these modules, as well as any modules and applications that depend on them, potentially vulnerable to ReDoS. *As you’ll see later in the talk, these figures are wild underestimates. Based on the outcome of improving the detectors a little bit later on, I would estimate about an order of magnitude more super-linear regexes than is shown here.
2. Degree: We wanted to know how concerned developers should be about these regexes. Exponential behavior is rather more problematic than polynomial behavior. We measured match time on varying-length inputs. Then we fitted exponential and polynomial curves to the data. We chose the curve of best fit.
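One simple way to choose between a polynomial and an exponential curve is to fit a straight line in log space and compare goodness of fit. The sketch below is an assumption about how this could be done, not the fitting procedure from the paper; the function names and the synthetic measurements are hypothetical.

```python
import math


def linear_fit_r2(xs, ys):
    """Least-squares line through (xs, ys); return (slope, R^2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1 - ss_res / ss_tot


def classify_growth(lengths, costs):
    """Label measurements as 'polynomial' or 'exponential' by best fit.

    A power law cost = c * n**b is a line in (log n, log cost).
    An exponential cost = c * b**n is a line in (n, log cost).
    """
    log_costs = [math.log(c) for c in costs]
    degree, r2_poly = linear_fit_r2([math.log(n) for n in lengths], log_costs)
    rate, r2_exp = linear_fit_r2(lengths, log_costs)
    if r2_poly >= r2_exp:
        return "polynomial", round(degree, 2)
    return "exponential", round(math.exp(rate), 2)


# Synthetic measurements (hypothetical): quadratic and doubling growth.
print(classify_growth([10, 20, 40, 80, 160], [3 * n ** 2 for n in [10, 20, 40, 80, 160]]))
print(classify_growth([5, 10, 15, 20, 25], [2 ** n for n in [5, 10, 15, 20, 25]]))
```

The second element of the result is the estimated degree for a polynomial fit, or the estimated base for an exponential fit. Real timing data is noisy, so in practice the inputs would need enough lengths and repetitions for the two fits to separate clearly.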

ReDoS Regexes are Usually Quadratic
This figure shows... x-axis: empirical curve; y-axis: percent of ReDoS regexes with that curve; orange is npm, maroon is pypi. Results: Most ReDoS regexes exhibit quadratic behavior as the input length increases. Exponential-time ReDoS regexes are rare. Trends are similar in npm and pypi.

ReDoS Regexes occur in Prominent Places
These prominent ReDoS regexes have the potential to affect millions of applications. Transition In summary, in this theme we used state-of-the-art ReDoS regex detectors to determine the incidence of ReDoS regexes in npm and pypi ecosystems. 3 regexes 2 regexes

The Repair of ReDoS Regexes
Theme 3!

(ReDoS) Regexes Are Hard to Understand
[CWS ‘17] /^(\+|-)?(\d+|(\d*\.\d*))? (E|e)?([-+])?(\d+)?$/ /([^\=\*\s]+)(\*)?\s*\=\s* (?:([^;'"\s]+\'[\w]*\’ [^;\s]+)|(?:\"([^"]*)\")| ([^;\s]*))(?:\s*(?:;\s*)|$)/ /^(.*?)([.,:;!]+)$/ /<(\/)?([^ ]+?)(?:(\s*\/)| .*?)?>/ To give you a taste of the problem faced by developers, here are some of the ReDoS regexes that we reported. They are long, and they are complicated. We wanted to know how developers go about fixing ReDoS vulnerabilities like these ones, in order to inform future research on automatic ReDoS regex amelioration. /^([\\s\\S]*?)((?:\\.{1,2}|[^\\\\\\/]+?|)(\\.[^.\\/\\\\]*|))(?:[\\\\\\/]*)$/ /^(\\/?|)([\\s\\S]*?)((?:\\.{1,2}|[^\\/]+?|(\\.[^.\\/]*|))(?:[\\/]*)$/ /\+OK.*(<[^>]+>)/ /\s*#?\s*$/ /^\s*/\*\s*(.+?)\s*\*/\s*$/

Methodology Historic Disclosures & Fixes
Study all ReDoS reports in CVE and Snyk.io DBs “What do developers prefer when they know all the fix strategies?” 284 module maintainers Vulnerability disclosure Fix strategies Part 1 CVE: General DB for security vulnerabilities in software Snyk.io: Module-specific security vulnerability DB that includes npm and pypi modules. Part 2: We selected modules based on: - Downloaded >1000 times/month - Manual inspection for seriousness (potential for ReDoS exploitability)

Fix Strategies For ReDoS Regexes
Original → Fix strategies: Trim: TRUNCATE(input, 1000); Revise; Replace* (custom parser). Exactly somewhere in the middle of the string; A ‘.’ to the right of; But not immediately to the right. Match different strings! cf. [vdM et al. ‘17]
This example is taken from a discussion we had with the Django development team. *The fact that they match different sets of strings is surprising. In fact there’s been some work from van der Merwe et al., who tried to revise ReDoS regexes but only in a language-preserving way. They were excited to hear that they can be more aggressive in their proposed revisions.
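Of the three strategies, Trim is the easiest to sketch. The snippet below is an illustration, not code from any of the studied projects: `trimmed_search` and the `ENDS_WITH_B` pattern are hypothetical, and only the limit of 1000 comes from the slide. It also shows how trimming can change which strings match.

```python
import re

MAX_INPUT_LENGTH = 1000  # bound shown on the slide: TRUNCATE(input, 1000)


def trimmed_search(pattern: re.Pattern, text: str, limit: int = MAX_INPUT_LENGTH):
    """Trim strategy: run the regex on at most `limit` leading characters."""
    return pattern.search(text[:limit])


# Hypothetical pattern, chosen only to show the behavioral change.
ENDS_WITH_B = re.compile(r"b$")
long_input = "a" * 2000 + "b"

print(ENDS_WITH_B.search(long_input) is not None)           # True
print(trimmed_search(ENDS_WITH_B, long_input) is not None)  # False
```

A length bound caps how much input the regex can see, so it limits the damage of polynomial worst-case behavior. For an exponential regex even a short input can be expensive, so a bound of this size may not be enough on its own.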

Fix strategies and correctness
We observed different trends in the strategies used by developers in the historic data vs. after our disclosures. Axes: the x-axis is the fix strategy; the y-axis is the number of times it was used. In total there are 37 historical fixes, with most in Revise but more than 5 of the others. Compare that to the 48 fixes resulting from our disclosures. Decrease in Trim: developers exposed to all 3 fix strategies opted not to Trim. Correctness: So... fixing ReDoS regexes seems to be pretty hard for developers to do correctly. Chart labels: 1 incorrect; 2 incorrect; “All correct”.

What did we learn from this work?

ReDoS regexes are a real problem in practice
Regexes are widely used in JavaScript and Python modules 1% of unique regexes are ReDoS regexes ReDoS regexes occur in 1-3% of modules ReDoS regexes are hard to fix Slide, then: Takehome message We believe our study shows that: ReDoS is a very real problem in practice, and that The practitioner community needs better tools to detect and repair ReDoS regexes.

The promise and perils of re-using regexes
ESEC/FSE’19: The promise and perils of re-using regexes
James Davis, Mischa Michael, Christy Coghlan, Francisco Servant, Dongyoon Lee
By “lingua franca”, I mean a common language shared by several parties. This is both an experimental question (in what ways aren’t regexes portable across programming languages?) and a plea to programming language designers (why AREN’T they?).

Summary of Contributions
Developer perspectives Survey of 158 devs Regex re-use Corpus of 500K regexes Portability experiments Language differences TODO Talking points Emphasize the scale of our work – prior survey is ~20, prior corpus is from only 2 languages, etc. Read through the languages in case the icons are not known. TODO: Remind the audience of possible broader implications – regexes as a class of software artifact that can be shared across PLs, similar to SQL etc. – and bring this back in the final slide?

The regex development process
583K views. Regex: ^\d+(\.\d{1,2})?$ Inputs: “1.23” and “\n1.23”.
Here’s a typical example of the regex development process, based on results from our developer survey. *Great, let’s put it into our Ruby program. *We’re good software engineers, so we write a test case, which it passes. Unlike most code snippets, regexes can flow unchanged across language boundaries. Programming languages have similar regex syntaxes, so re-used regexes may compile without modification. However, surface-level syntactic compatibility can mask more subtle portability problems. For example, if regex semantics vary, then a regex will match different sets of strings across programming languages, resulting in logical errors. *Here in Ruby, another input also passes, this time unexpectedly. *I wanted to require the whole string to match, but in Ruby the ^ anchor only anchors to the beginning of a line within the string.
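The same pitfall can be reproduced in a single language. In this sketch, Python's default mode treats ^ as start-of-string, while the `re.MULTILINE` flag switches it to line anchoring, which is the behavior the talk attributes to Ruby. Using the flag to stand in for Ruby is my own device for illustration; the helper name `matches` is hypothetical.

```python
import re

PRICE = r"^\d+(\.\d{1,2})?$"  # regex from the slide


def matches(text: str, flags: int = 0) -> bool:
    """Return True if PRICE matches somewhere in `text` under `flags`."""
    return re.search(PRICE, text, flags) is not None


print(matches("1.23"))                  # True
print(matches("\n1.23"))                # False: ^ anchors to start of string
print(matches("\n1.23", re.MULTILINE))  # True: ^ anchors to start of any line
```

A test suite containing only the first input would pass under both interpretations, which is how a copied regex can look correct and still accept strings its author meant to reject.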

Research questions Developer perspectives Do developers re-use regexes? Where do developers re-use regexes from? Do developers believe regexes are portable across languages? Measuring regex re-use How commonly are regexes re-used? (other software ; Internet) Empirical portability Different regex semantics? Different regex performance? We divided our research questions across three themes. *The first theme was qualitative: what do developers think and do during regex development? *In our second theme, we corroborated the re-use practices that developers described by measuring regex re-use in open-source software. *In the third theme, we empirically evaluated the regex portability problems that developers might encounter, considering both semantics and performance. ---- Our experiments covered two aspects: Semantic portability: When and why do regexes have different semantics across ProgLangs -- match different sets of strings in different programming languages? Performance portability: When and why do regexes have different worst-case performance in different programming languages?

Developer perspectives Developers’ Regex Re-use Practices
In the first theme, we studied developer perspectives on regexes. To do this, we designed a survey on regex practices and shared it with professional software developers.

158 Respondents: “We re-use regexes across languages”
Methods IRB-approved survey Respondents The median respondent: 3-5 years of professional experience medium-size company intermediate regex skill Key results (in this paper – see also ASE’19) 94% of developers re-use regexes When re-using, developers are rarely confident that the regex comes from the same programming language 94% of respondents reported re-using regexes at least 25% of the time. We asked them how often they knew the regex was being re-used in the same language. They were rarely confident that this was the case.

Measuring Regex Re-Use Polyglot Regex Corpus
Developers reported regex re-use practices, copy/pasting regexes from other source code (e.g. their company’s mono-repo), and from Internet sources like Stack Overflow. To corroborate this characterization of practice, we collected a polyglot regex corpus and measured the extent of regex duplication between software projects written in different programming languages.

Regex collection methodology
Module regex extraction Module selection Unique regexes We collected regexes like this. For each language, we identified the largest module registry. We mapped those modules to GitHub and sorted them by the number of GitHub stars they have, which gave us a registry-independent measurement of importance. For each of the top 25K modules, we statically extracted regexes from the master branch. This gave us a set of unique regexes for each module, which we merged into a larger corpus. Top 25K

Regex corpus Num. modules Unique regexes 25K 150K ” 20K 45K ” 45K ”
193K modules 580K unique regexes 150K ” In total, we extracted regexes from 193K modules written in 8 programming languages. After accounting for duplicates, this gave us a polyglot corpus of about half a million unique regexes. --- 193,524 modules As you can see, we found more regexes in some languages than others. 20K ” All (30K) 140K All (10K) 2K

“Complex” regex re-use by module
In building our regex corpus, we tracked the regexes extracted from each module. To understand regex re-use, we measured the extent of re-use between the modules in our registry sample. This chart plots the percent of modules that contained a complex regex that appeared in more than one module. This is what I mean by “complex”: To eliminate trivially identical regexes like /\s/, we conservatively require any matching regexes to be at least 15 characters long. We report that *In 20% of all modules, we found a complex regex that appeared in some other module, in the same or a different programming language, *About 7% of the corpus modules contain a complex regex that appears in more than one language, and *5% of the corpus modules contain an exact copy of a regex we found on Stack Overflow Most common in JS – 15% of the JavaScript modules These measurements generally corroborate the description of regex practices from our survey. It appears that, like our survey respondents, the developers of open-source software also copy-paste regexes, including across language boundaries. ---- By complex I mean it is at least 15 characters long -- and thus unlikely to be independently derived. Although some of these regexes may have been derived by other means – independent creation, whole-sale file duplication, etc. …
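The re-use measurement can be sketched as follows. This is an illustration of the idea only, with a hypothetical function name and a toy corpus; the 15-character threshold is the one stated in the talk.

```python
from collections import defaultdict

MIN_COMPLEX_LENGTH = 15  # "complex" threshold from the talk


def percent_modules_sharing_complex_regex(corpus: dict[str, set[str]]) -> float:
    """Percent of modules with a complex regex that also appears elsewhere.

    `corpus` maps a module name to the set of regex patterns extracted from it.
    """
    modules_by_regex = defaultdict(set)
    for module, regexes in corpus.items():
        for regex in regexes:
            if len(regex) >= MIN_COMPLEX_LENGTH:
                modules_by_regex[regex].add(module)
    sharing = set()
    for modules in modules_by_regex.values():
        if len(modules) > 1:          # regex seen in more than one module
            sharing |= modules
    return 100.0 * len(sharing) / len(corpus)


USERNAME = r"^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$"

# Toy corpus (hypothetical module names).
toy_corpus = {
    "js-lib": {USERNAME, r"\s"},
    "py-lib": {USERNAME},
    "go-lib": {r"\s"},                    # shared, but too short to count
    "rb-lib": {r"^\d+(\.\d{1,2})?$"},     # complex, but not shared
}
print(percent_modules_sharing_complex_regex(toy_corpus))  # 50.0
```

The length filter is what keeps trivially identical regexes like /\s/ from inflating the figure.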

(Empirical) Portability Semantic + Performance Differences
In the first set of research results, I reported that developers say they re-use regexes across language boundaries. Just now, I showed that measurements on our corpus corroborate their report. These findings made us wonder: What might go wrong with this regex re-use practice? To answer this question, we ran two “what-if” experiments to understand the possible problems empirically. We studied both semantic and performance differences. By a semantic difference, I mean that a regex matches different sets of strings in different languages. By a performance difference, I mean that a regex takes a notably different amount of time to complete a match, i.e. some change in the worst-case complexity.

What-if experiments using regex testers
Regex tester algorithm:
match regex against input
if match:
    emit matched substring
    emit capture groups
else:
    report “no match”
Example: /(\d+).\d+/ & 3.14 → match substring “3.14”, capture “3” (“Capture group”)
To perform our what-if experiments, we used a common set of regex testers. The regex testers all follow the same simple algorithm: … We implemented this algorithm in each of these languages. *As an example, suppose we have the regex … and the input “3.14”. *This regex will match the input. It will match the substring “3.14”, the whole string in this case. The regex contains one capture group, the integer prefix of the string, and our tester will also emit the contents of this group.
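Here is what such a tester might look like in Python. It is a sketch of the algorithm on the slide, not the tester from the study; the function name `run_tester`, the use of `re.search` (a partial match), and the JSON output shape are my assumptions.

```python
import json
import re


def run_tester(pattern: str, text: str) -> dict:
    """Match `pattern` against `text`; report the substring and captures."""
    match = re.search(pattern, text)
    if match is None:
        return {"matched": False}
    return {
        "matched": True,
        "substring": match.group(0),       # the matched substring
        "captures": list(match.groups()),  # contents of each capture group
    }


print(json.dumps(run_tester(r"(\d+).\d+", "3.14")))
# {"matched": true, "substring": "3.14", "captures": ["3"]}
print(json.dumps(run_tester(r"(\d+).\d+", "pi")))
# {"matched": false}
```

Emitting a language-neutral format such as JSON makes it easy to compare the output of testers written in different languages.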

Language versions used in experiments

Semantics – Methodology
[Flowchart: Input generators (Rex, MutRex, EGRET, ReScue, BRICS) → Regex testers → Language semantic analysis (pairwise): Both match? Same substring? Same captures? → “Difference witnesses”]
Here is the methodology we used to identify semantic portability problems. *To measure semantic differences, we needed a set of inputs to determine the strings a regex matches and the strings it does not match. We used five state-of-the-art regex input generators to generate input. Using our input generator ensemble, we produced about 2500 unique inputs for the median regex. *We then fed these inputs to our regex testers and collected the results. *To analyze, we compared the results for each pair of languages. Comparing the results can produce four distinct outcomes:
1. (First row) They both mismatch, or they both match and agree on everything (no disagreement).
2. They both match and agree on the matched substring, but disagree about the contents of capture groups.
3. They both match, but report that different substrings matched.
4. One matches and the other does not.
When two languages disagree on the behavior for a (regex, input) pair, we call this a “difference witness”. Difference witnesses fall into one of three categories: capture, substring, and match witnesses.
ReScue: modified to emit the randomized inputs it explores in its search for super-linear behavior.
BRICS: modified to emit random subsets of inputs instead of an exhaustive set (infinite / very large for our purposes).
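The pairwise comparison can be sketched as a small classifier. The result format here (a dict with `matched`, `substring`, and `captures` keys) and the function name `classify_witness` are assumptions made for illustration; the three witness categories are the ones named in the talk.

```python
def classify_witness(result_a: dict, result_b: dict) -> str:
    """Compare two testers' results for the same (regex, input) pair.

    Each result is a dict with key "matched" and, when matched is True,
    keys "substring" and "captures".
    """
    if result_a["matched"] != result_b["matched"]:
        return "match witness"       # one matches and the other does not
    if not result_a["matched"]:
        return "no difference"       # both mismatch
    if result_a["substring"] != result_b["substring"]:
        return "substring witness"   # both match, different substrings
    if result_a["captures"] != result_b["captures"]:
        return "capture witness"     # same substring, different captures
    return "no difference"           # both match and agree on everything


# Hypothetical results from two languages for one (regex, input) pair.
lang_1 = {"matched": True, "substring": "3.14", "captures": ["3"]}
lang_2 = {"matched": True, "substring": "3.14", "captures": [""]}
print(classify_witness(lang_1, lang_2))  # capture witness
```

The checks run from the coarsest disagreement to the finest, so each (regex, input) pair is assigned to exactly one category.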

Copy/pasting regexes leads to semantic differences
8% had a match witness 4% had a substring witness 7% had a capture witness 15% of regexes (82K) had a difference witness We identified at least one difference witness for 15% of the regexes. These are regexes that compiled in some pair of languages but disagreed in some way about the match. Among those: 8% of the regexes had a match witness 4% had a substring witness 7% had a capture witness This means that, ceteris paribus, about 15% of the time a developer copy/pastes a regex across a language boundary it will have different behavior in the new programming language. Assuming the behavior of the regex in the old language was “correct” according to application semantics, some regex porting will be required.

Some sample results of semantic differences
Witness Language 1 Behavior 1 Language 2 Behavior 2 \cC ctrl-C “cC” ^a Begin input line a++ Possessive quantifier Quantifier [ ] ] “]” Empty capture group The full details of the differences we identified are in the paper, but I’d like to show you some illustrative findings here. The first behavior is the “expected” behavior (normal across several languages), while the second behavior is the “surprising” behavior – unique to the language or shared by a few. Regex engines are complex and under-specified, and not all of these behaviors were documented. We found that there’s no substitute for actually evaluating regexes in different languages and comparing how they behave. Undocumented behaviors

Full table of semantic differences
Regex notation that describes a feature in one language and no feature in another (silent interpretation) The same regex notation describes different features The same regex notation describes the same feature, but it behaves differently Bugs Regex engine bugs

Performance methodology
[Flowchart: Regex corpus → SL regex detectors → Per-lang. complexity]
This brings us to the second part of our experiments: performance differences. Prior research has shown that regex matches can have linear, polynomial, or exponential complexity in the length of the input. This match time can vary by language for the same regex and input, based on the algorithm that a language’s regex engine uses to perform the match. However, we don’t have a good picture of the differences between the regex engines in each language, so we checked experimentally. Why does this matter? Well, suppose a regex is copied from a language in which it has linear performance to one in which it has polynomial or exponential performance. This introduces a potential security vulnerability. If this regex is used in server-side code that handles client input, then an attacker could trigger this algorithmic complexity and perform a Regular Expression Denial of Service attack. To understand the algorithmic complexity differences between the regex engines in each language, we used an ensemble of four super-linear regex detectors to identify possible super-linear behavior. Then we used our per-language regex testers to check the worst-case behavior in each language, labeling each regex as linear, polynomial, or exponential in the worst case.

Performance results
Chart labels: Spencer: Slow; Spencer: Medium (defenses); Thompson: Fast
This chart shows the percent of regexes in each language that exhibited exponential and polynomial worst-case behavior. Note first that in the previous paper, we reported ~1% of regexes as super-linear. But look at the y-axis here! We measured up to 10% of regexes as super-linear. In between the previous paper and this one, we observed several sources of false negatives in the SL regex detector ensemble, which we addressed in these experiments. This means that the extent of super-linear regexes may be up to an order of magnitude greater than I reported in our previous measurements: not 1% but up to 10%. As you can see, we identified three distinct families of performance:
Slow: JS, Java, Python, and Ruby
Medium: PHP and Perl
Fast: Go and Rust
The “Slow” and “Medium” engines use variations on Spencer’s algorithm, while the “Fast” engines use Thompson’s algorithm. Details about the algorithms used in the three classes are in the paper. The implication: Moving regexes from a fast language to a medium one, or from a medium to a slow one, may be dangerous from a security perspective.
----
We expected to see two classes of performance: “Slow” regex engines that use a backtracking-based algorithm, and “Fast” regex engines that use Thompson’s BFS algorithm. We see these two classes, as well as a “Medium” class for PHP and Perl. We studied the implementations of PHP and Perl’s regex engines, and identified the defense mechanisms that enable them to handle more regexes quickly. Details are in the paper.

Summarized in developer-friendly heatmaps
We think a good way to visualize our experimental findings is through heatmaps illustrating the pairwise semantic and performance differences between languages. The darker the cell, the riskier it is to move regexes from that source to that destination. We hope these will benefit practitioners thinking about copy/pasting regexes.

What did we learn from this work?

Regexes are not a lingua franca (somewhat to the surprise of developers)
Survey: developers say they believe and act like regexes are an LF. Empirical re-use study: corroboration, regexes appear to be re-used. Portability experiments: correctness and security concerns. *Survey. *Empirical study corroborates. But *our portability experiments demonstrate the correctness and security concerns that may arise from this practice. Takehome message: On the whole, we believe our study shows that regexes are a misunderstood programming language feature.

Limitations Methodology ASE’19 ReDoS ESEC/FSE’18 Re-use ESEC/FSE’19

Highlighting two limitations of this research
ReDoS: Regex + reachability [W et al. 2017] Generalizing modules to applications Of course, our work has some limitations. I want to highlight two of them. Reachability Did not consider reachability of ReDoS regexes, in part for reasons of scale. This is complicated by the fact that we were analyzing modules rather than applications, Even if a regex were reachable from a module API that does not tell us how an application will use that module. Our experience and measurements suggest that regexes are commonly used for input validation, but this would be worth revisiting ReDoS regex == ReDoS? This is not just about reachability; it concerns deployment. ReDoS generally requires an application containing a ReDoS regex to be deployed on a server that handles untrusted input. It would be interesting to study how modules are used and whether we can predict their usage context based on README, the APIs they use, etc. Despite these limitations, we believe that having ReDoS regexes in your module or application is a liability, whether or not they are currently reachable by user input, or deployed on a server today. This is because in our regex corpus we found many examples of regexes that appear to be copy/pasted from one module to another, or derived from StackOverflow. In one case a ReDoS regex was duplicated over 2000 times. Thus, a ReDoS regex used in a safe context can easily pollute code used in a sensitive context. Why have we accepted these limitations so far?

What have we learned?

Regexes could use some Application-level scenarios Regex engines
Developers misunderstand and misuse regexes. Tools based on real regexes and real developer pain points. Regex engines: under-optimized (real regexes are unnecessarily super-linear) and under-tested (we found bugs in several engines!).
Take-home message: On the whole, we believe our study shows that regexes are a misunderstood programming language feature.

Part 3: Advice to my younger self
OK, that was the last research work I wanted to talk about. Now, I have the impression that most of the students here are early on in their graduate careers. I’d like to spend my last few minutes sharing some of my “accumulated wisdom” to help you along your path.

Share Research artifacts Blog Open-source Podcasts
A researcher has two fundamental responsibilities: discover new things, and tell other people about what you found. In many areas of computer science, which are fairly applied, there are tons of practitioners who would care about your research, if they knew about it. If you do research, publish it in some conference proceedings, and move on, you’re doing yourself and others a disservice.
Plan for reproducibility: artifacts.
Blog about it: I post a practitioner-friendly summary of each of my papers on Medium. This is a great way to teach the practitioner community about issues they might encounter or tools they might benefit from. About 3000 people have looked at my posts, which means about 3000 people have learned more about the regex issues I’ve identified in my research. From a personal perspective, writing about your research for a lay audience really forces you to distill your findings and articulate your thoughts precisely. This is incredibly valuable for improving your writing.
Contribute to open-source: I wrote this guide on nodejs.org articulating my research findings in a practitioner-friendly way.
Podcasts: While we’re getting creative, you might even volunteer to talk about your research on a podcast.
Of course, to succeed in graduate school you will have to publish research. But you can have a much greater impact if you dream beyond publishing papers.

Collaborate
At least in my experience, research is not a solo endeavor. If I had worked alone on these projects, this talk would have been a lot shorter. I’d like to thank the collaborators on the projects I discussed today. They’re an amazing group of undergraduates, graduate students, faculty, and practicing engineers. Collaboration takes time: you might have to train someone new, and you may spend a lot of time in meetings. But when you and your collaborators combine your strengths and interests, the research you produce will be better.

Research is exciting
Thinking thoughts no one has thought before. Performing experiments no one has attempted before. Articulating ideas carefully and completely.
With that, thank you for your attention. … I’d love to answer any questions you have.

Bonus slides

Part 3: Where next?

Engine-level ReDoS soln. Rethinking regex engines
Designed without data Use data for guardrails e.g. “Nobody uses X” ESEC/FSE’19 SRC ICSOC’19 Query Languages Other engineering tools

DynoRegexes

Why modules? Modules are critical infrastructure
Modules are comparable building blocks Strings Math Command-line scripting aids Graphics

Why not applications? Open-source applications may not be representative of code in industry Module ecosystems are shared by open-source and industry Modules are sometimes authored by industry as a way to give back to the open-source community

EcosystemREDOS

Theme 2 Do Developers’ Heuristics Work?
In our second theme we identified these heuristics and examined their effectiveness.

Heuristics Used in Practice
Finding heuristics: reference books, regex websites.
What they said: star height; “Watch out when different parts of the regex can match the same text.”
Our heuristics: star height ( ), Q.O.D. (a|a), Q.O.A. a. Slide labels: Exponential, Polynomial.
Finding heuristics: We reviewed ~10 reputable reference books purchased from Amazon, and the advice given on several prominent websites dedicated to regexes.
What they said: Avoid nested stars; this is the root cause of the exponential example we discussed earlier. This heuristic is widely used: an npm module called “safe-regex” uses it and gets about 20M downloads per month. The second piece of advice uses vague language, so we operationalized it into two heuristics.
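As a rough illustration of the "avoid nested stars" advice, here is a minimal Python sketch of a star height check. It is a sketch under simplifying assumptions, not the safe-regex implementation or the detector used in the study: it understands only groups, backslash escapes, and the `*` and `+` quantifiers, and it ignores character classes and `{n,}` quantifiers. The function names and the threshold parameter are hypothetical.

```python
def star_height(pattern: str) -> int:
    """Approximate the maximum nesting depth of * and + quantifiers.

    Simplified syntax only: groups "(...)", backslash escapes, and the
    quantifiers "*" and "+". Character classes are NOT handled.
    """
    stack = [0]  # stack[-1] = max height seen so far in the innermost open group
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":
            stack.append(0)
            i += 1
            continue
        if ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            height = stack.pop()  # height of the group that just closed
            i += 1
        elif ch == "\\":
            height = 0  # an escaped character is a plain atom
            i += 2
        else:
            height = 0  # any other character is treated as a plain atom
            i += 1
        # An unbounded quantifier directly after the atom or group adds one level.
        if i < len(pattern) and pattern[i] in "*+":
            height += 1
            i += 1
        stack[-1] = max(stack[-1], height)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def flagged_by_star_height(pattern: str, threshold: int = 1) -> bool:
    """Flag a regex whose star height exceeds the (assumed) threshold of 1."""
    return star_height(pattern) > threshold


if __name__ == "__main__":
    for regex in ["abc", "a*b*", "(a+)+$", "(a|a)*$"]:
        print(regex, star_height(regex), flagged_by_star_height(regex))
```

Note that a pattern like `(a|a)*$` has star height 1 and is not flagged by this check, which is one reason the "different parts can match the same text" advice needs heuristics of its own.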

Few false negatives Many false positives
We developed detectors for these heuristics. They are imprecise, modeled on the safe-regex implementation.
Axes: the X-axis shows each heuristic; the Y-axis shows the percentage.
Few false negatives: In total we capture about 81-85% of the ReDoS regexes in our dataset.
High false positives: This speaks to the power of the ReDoS regex detectors we used in the first theme. We therefore recommend that developers adopt those tools, and to this end we have published our drivers on GitHub.
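For readers who want to run this kind of evaluation on their own corpus, the following Python sketch shows one way to compute false negative and false positive rates for a heuristic. The rate definitions (false negatives over truly super-linear regexes, false positives over safe regexes) are an assumption; the slide's Y-axis may be defined differently. The data is a toy example, not the study's results, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def error_rates(results: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (false_negative_rate, false_positive_rate).

    Each item is (flagged_by_heuristic, truly_super_linear).
    """
    missed = false_alarms = vulnerable_total = safe_total = 0
    for flagged, vulnerable in results:
        if vulnerable:
            vulnerable_total += 1
            if not flagged:
                missed += 1  # heuristic missed a super-linear regex
        else:
            safe_total += 1
            if flagged:
                false_alarms += 1  # heuristic flagged a safe regex
    fn_rate = missed / vulnerable_total if vulnerable_total else 0.0
    fp_rate = false_alarms / safe_total if safe_total else 0.0
    return fn_rate, fp_rate


# Toy labels for illustration only; these are NOT measurements from the talk.
toy_results = [
    (True, True), (True, True), (False, True),     # three super-linear regexes
    (True, False), (True, False), (False, False),  # three safe regexes
]

if __name__ == "__main__":
    fn_rate, fp_rate = error_rates(toy_results)
    print(f"false negative rate: {fn_rate:.2f}, false positive rate: {fp_rate:.2f}")
```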

Bringing Research to Practice
I am now the maintainer of safe-regex (20M downloads/month). Fixed false negatives in the star height heuristic. Incorporating (improved) QOD and QOA heuristics.
Last week I published the first release in 4 years, fixing some false negatives in the star height heuristic.
Transition: So, in the first theme we studied the incidence of ReDoS regexes in the wild, and in this theme we studied how developers currently detect ReDoS regexes. We also wanted to know how developers repair ReDoS regexes when they learn about them.

Lingua Franca

Developer perspectives Developers’ Regex Re-use Practices
In the first theme, we studied developer perspectives on regexes. To do this, we designed a survey on regex practices and shared it with professional software developers.

Survey methodology and respondents
33 questions, closed and open-ended. IRB. Topics: regex process, re-use practices, and portability. Median developer.
Design and delivery. Responses: about half from direct and transitive acquaintances, about half from Reddit/HN.
This figure summarizes the demographics of our respondents: working years, size of company, and regex expertise. The median respondent has 3-5 years of professional experience, works at a medium-size company, and claims intermediate regex skill.
The respondents were a fairly diverse set on these metrics. They ranged from first-year developers to career professionals, worked at small-to-large companies, and self-reported their skill as novice in some cases and intermediate or expert in many.

Respondents re-use regexes across languages
94% of developers re-use regexes.
94% of respondents reported re-using regexes at least 25% of the time. We asked them how often they knew that the regex they were re-using came from the same language. They were rarely confident that this was the case.
When re-using, developers are rarely confident that the regex comes from the same programming language.
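To make the risk concrete, here is a minimal Python sketch of one well-known semantic difference between languages. It is an illustration chosen for this write-up, not an example taken from the survey or the portability experiments. In Python, `$` without flags also matches just before a trailing newline, whereas `\Z` matches only at the very end of the string. In JavaScript, `$` without the `m` flag matches only at the very end of the input, so a validation regex copied between the two languages can accept different strings. The helper names are hypothetical.

```python
import re


def accepts_with_dollar(text: str) -> bool:
    """Validate with '$', which in Python also matches before a trailing newline."""
    return re.search(r"^[a-z]+$", text) is not None


def accepts_with_end_anchor(text: str) -> bool:
    """Validate with '\\Z', which in Python matches only at the very end."""
    return re.search(r"^[a-z]+\Z", text) is not None


if __name__ == "__main__":
    candidate = "admin\n"  # input with a trailing newline
    print(accepts_with_dollar(candidate))      # accepted by the '$' version
    print(accepts_with_end_anchor(candidate))  # rejected by the '\Z' version
```

A developer who copies `^[a-z]+$` from a JavaScript answer into Python may therefore accept inputs with a trailing newline that the original code rejected.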
