Presentation on theme: "Deep Data: Mapping the Legal Genome"— Presentation transcript:
1 Deep Data: Mapping the Legal Genome. JUDICATA. Judicata cofounder and CTO. We make tools for litigators: legal research software. More accurately, though, we're mapping the legal genome: understanding what's going on in all of case law at a sentence-by-sentence level. A little about me: Stanford CS, big legal nerd. Met my cofounder (Itai, Google, lawyer) at the right time. No law school. 18 months later, of us, no product.
2 COMMON LAW. Chimel v. California, 395 U.S. 752 (1969); New York v. Belton, 453 U.S. (1981); Arizona v. Gant, 556 U.S. 332 (2009). Before I get into the technical details, I want to tell you why we exist. The law changes frequently. Lawyers need to follow how rules are interpreted, because a new interpretation can mean the rule itself is changing. So before I get into things, I want to go over three cases on search incident to arrest and track how that rule has changed.
3 Starting with CA employment law. Take the rule that makes it unlawful to discriminate. Three interesting things are going on here: 1. It's rule-based: we have to identify all the rules. 2. All the varying phrasings of that rule in the 200 cases. 3. The fact filters. Do this in a keyword-based system and you get 2,500 results: not how associates want to spend their weekends.
4 Three metallurgic morals. DEEP DATA. Accuracy is paramount. Why are we so obsessed with accuracy? It has to do with the nature of the problem. We want to find that one new case that applies this rule, that one case that has this combination of facts. We talk about three metals a lot in the course of generating our data.
5 Sit down and make gold data. Gold data = labeled “training” data. But it’s useful for more than training machines: you might learn about patterns you had not seen before. An example is our parsing code:
6 We need to extract information from this. (Gant) We need to know what every piece of language is doing here.
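One way to put gold data to work, beyond training, is to score a labeler's output against it and then read through the disagreements. The sketch below is only an illustration: the label names ("rule", "fact", "holding") and the sentence ids are hypothetical, not taken from the actual gold set.

```python
def precision_recall(predicted, gold):
    """Score predicted (sentence_id, label) pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical hand-labeled gold data and a labeler's output for the same sentences.
gold = {(0, "rule"), (1, "fact"), (2, "holding")}
predicted = {(0, "rule"), (1, "rule"), (2, "holding")}

precision, recall = precision_recall(predicted, gold)

# The disagreements are where new patterns tend to show up during review.
missed = sorted(gold - predicted)    # gold labels the labeler failed to produce
spurious = sorted(predicted - gold)  # labels the labeler produced in error
```

Reading `missed` and `spurious` by hand is the part that may reveal patterns nobody anticipated when writing the labeler.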
8 It may take a lot of lead bullets… It might be a lot of work to find the right solution to these subproblems, and they're not necessarily trivial subproblems. Many problems simply require a lot of lead-bullet solutions to their subproblems. You might find out that there are four different types of subproblems and that the individual subproblems are easier to solve. So how do we go about implementing these lead-bullet solutions?
9 APPROACH. Start with rule-based approaches: reviewable, not black boxes. …but don’t completely ignore AI, of course: training combinations of parameters, Stanford NLP.
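A rough sketch of what "reviewable, not black boxes" can look like in practice: every rule has a name, and each label comes back with the name of the rule that produced it, so a reviewer can trace any decision. The rule names, patterns, and labels here are hypothetical and are not Judicata's actual rules.

```python
import re

# Hypothetical rules, checked in order: (rule name, pattern, label).
RULES = [
    ("held_that", re.compile(r"\bwe (?:hold|conclude) that\b", re.I), "holding"),
    ("must_show", re.compile(r"\bmust (?:show|prove|establish)\b", re.I), "rule"),
    ("courts_have_held", re.compile(r"\bcourts have held\b", re.I), "rule"),
]


def label_sentence(sentence):
    """Return (label, rule_name) for the first rule that matches, else (None, None)."""
    for name, pattern, label in RULES:
        if pattern.search(sentence):
            return label, name
    return None, None


result = label_sentence(
    "Rather, the employee must show a concerted pattern of harassment."
)
no_match = label_sentence("The officers arrived at the scene shortly after noon.")
```

Because the rule name travels with the label, a wrong label points straight at the rule to fix, which is harder to do with an opaque model.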
10 Parsing references: way beyond regex. Essential because we build up on layers and layers of data.
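To see why reference parsing goes beyond regex, it helps to look at a regex-only baseline. The sketch below is an assumption for illustration, not the parser described in the talk: it handles only clean, full U.S. Reports citations of the form "Name v. Name, volume U.S. page (year)" and silently misses short forms such as "id. at 339".

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    case_name: str
    volume: int
    reporter: str
    page: int
    year: int


# One or more capitalized words, e.g. "New York" or "California".
PARTY = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"
FULL_CITE = re.compile(
    rf"(?P<name>{PARTY} v\. {PARTY}), "
    r"(?P<volume>\d+) (?P<reporter>U\.S\.) (?P<page>\d+) \((?P<year>\d{4})\)"
)


def parse_full_citations(text):
    """Extract full-form U.S. Reports citations; anything else is ignored."""
    return [
        Citation(
            m.group("name"),
            int(m.group("volume")),
            m.group("reporter"),
            int(m.group("page")),
            int(m.group("year")),
        )
        for m in FULL_CITE.finditer(text)
    ]


text = (
    "Chimel v. California, 395 U.S. 752 (1969); "
    "Arizona v. Gant, 556 U.S. 332 (2009); id. at 339."
)
cites = parse_full_citations(text)
```

The trailing "id. at 339" refers back to Gant, but resolving it requires tracking context across sentences, which a single pattern cannot do.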
11 …but it can pay dividends and lead to a silver bullet. These layers can pay dividends, though.
12 “With respect to the pervasiveness of harassment, courts have held an employee generally cannot recover for harassment that is occasional, isolated, sporadic, or trivial; rather, the employee must show a concerted pattern of harassment of a repeated, routine, or a generalized nature.” Support clustering comes from really good Reference parsing and strong Support code, bootstrapping our unsupervised learning of what the rules are in a corpus of case law. This is really cool because it is totally unsupervised, yet we can learn what all the rules are in a jurisdiction with basically no extra work. Finding all the cases applying these rules is another matter, but this very difficult problem was solved easily using prior layers of high-accuracy data.
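The talk does not spell out how support clustering works, so the following is only a guess at the general idea: if reference parsing reliably tells us which authorities each sentence cites for support, then sentences with heavily overlapping support sets may be restatements of the same rule. The sketch groups sentences by Jaccard overlap of their cited authorities using union-find. The sentence ids, case ids, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets of cited authorities, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_support(supports, threshold=0.5):
    """Group sentence ids whose cited-authority sets overlap at or above threshold.

    supports maps sentence id -> set of authority ids cited by that sentence.
    """
    parent = {s: s for s in supports}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(supports, 2):
        if jaccard(supports[a], supports[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for s in supports:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(clusters.values(), key=sorted)


# Hypothetical output of the earlier reference-parsing layer.
supports = {
    "s1": {"case_A", "case_B"},
    "s2": {"case_A", "case_B", "case_C"},
    "s3": {"case_D"},
    "s4": {"case_D"},
}
clusters = cluster_by_support(supports)
```

No labels are needed here, which is the sense in which the approach is unsupervised, but the grouping is only as good as the citation data underneath it.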
13 STRUCTURE. As much as possible algorithmically. Acknowledge that code can’t do it all. 2/3 Engineers, 1/3 Legal. Only ~150,000 cases make up California case law. We have to extract every last drop. We’re a weird startup.