SQL for Crime Analysts. BACIAA Session, Thursday, 22 March 2012. James G. Beldock
Today's Agenda
  Introductions
  Preliminaries
  Databases, Structured Data, and Tables
  Demo 1: Exploring Tables
  How Databases Are Structured (& Why)
  Demo 2: Lots of Tables
  Break
  A Sample CAD Database
  SQL SELECT, part 1
  Using database data in Excel
  Lunch
  Joins
  SQL SELECT, part 2
  Joining
  Saving Joins to a View
  Break
  Views
  Other SQL Commands
PRELIMINARIES
  1. Databases, Database Varieties, and SQL
  2. How Databases Are Structured (& Why)
Databases
  Store data permanently
    Sometimes called persistent storage
  Data can be
    Structured data
      A Person has: First Name; Last Name; Social Security Number; Photo.JPG
    Unstructured data
      Examples: Moby Dick; an entire website; email messages (sometimes)
  Sizes
    Databases can be small (100K, 1MB, etc.) or
    Quite Large (UK Land Registry is 23TB; that's ~1.1 Libraries of Congress) [1]
      23TB is 1.84 x 100,000,000,000,000 bits! [2]
    RIDICULOUSLY LARGE (Google's index of the web; Facebook's profiles database)
  [1] DB2 - the secret database (http://www.theregister.co.uk/2006/01/18/db2_neglected/)
  [2] Wolfram Alpha is great for this sort of thing: http://www.wolframalpha.com/input/?i=23+terabytes
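The bit count on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a decimal terabyte (10^12 bytes) and 8 bits per byte; the variable names are only illustrative.

```python
# Assumption: 1 TB = 10**12 bytes (decimal terabyte), 8 bits per byte.
TERABYTE_BYTES = 10**12
BITS_PER_BYTE = 8

# 23 TB is the size quoted on the slide for the UK Land Registry.
land_registry_bits = 23 * TERABYTE_BYTES * BITS_PER_BYTE

print(f"{land_registry_bits:.2e} bits")  # 1.84e+14 bits
```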
[silicon valley moment]
  Recently, SQL-running databases have fallen somewhat out of fashion
    SQL was never cool
    Now it's officially uncool for some purposes, like building Netflix
      Highly scalable (thousands of servers?)
      Very flexible data structures
  Today's session is all about SQL, and SQL is (usually) used with relational databases, which are, if you ask the cool people, not as cool as they used to be.
  SQL is still the world's most widely used database language, and SQL databases certainly store more structured data than any other environment ever built.
Structured Data
  SQL deals with structured data [3]
  Structured Data
    Keeps track of one or more types of things, called Entities (or TABLEs in SQL)
    Knows certain, specific, structured pieces of information about those entities, called Attributes (or COLUMNs in SQL)
  Sample Structured Data: a TABLE of Customers [4]
  Note: SQL keywords will be in blue. They are traditionally written in ALL CAPS. Names of Tables or Columns will appear in Brown or Orange, respectively. They are traditionally Capitalized (but not ALL CAPS).
  [3] Well, nearly always. But not always always: Storing Unstructured Data in SQL Server 2008 – Microsoft
  [4] Source: SqlCourse2.com, http://www.sqlcourse2.com/index.html
Database TABLES
  [Diagram of a table, labeled: Name of TABLE; Names of COLUMNs; ROWs; COLUMNs]
  Question: What's the name of a ROW?
a ROW
  5 COLUMNs (also called Fields):
    Column      Type                    Unique?
    customerid  some type of number     probably a Unique Identifier
    firstname   text (called a String)  probably not unique
    lastname    string                  probably not unique
    city        string                  probably not unique
    state       string                  probably not unique
  Unique IDs are called KEYs
  The KEY used to name a ROW is called the PRIMARY KEY
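To make the PRIMARY KEY idea concrete, here is a minimal sketch. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in database, not the SQL Server setup used in the session, and the column types are guesses. The second customer, 'Jane', is hypothetical and exists only to reuse an existing customerid.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customerid INTEGER PRIMARY KEY,  -- the KEY that names each ROW
        firstname  TEXT,
        lastname   TEXT,
        city       TEXT,
        state      TEXT
    )
""")
conn.execute(
    "INSERT INTO customers VALUES (10101, 'John', 'Gray', 'Lynden', 'Washington')"
)

# A second ROW with the same PRIMARY KEY is refused by the database.
try:
    conn.execute(
        "INSERT INTO customers VALUES (10101, 'Jane', 'Gray', 'Lynden', 'Washington')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT count(*) FROM customers").fetchone()[0]
print(duplicate_rejected, row_count)
```

Names and cities may repeat freely; only the key column has to be unique.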
Before We Go Further: SQL
  That is why you're here, right?
  Structured Query Language (SQL) is:
    A language for asking a database for information (querying)
    A language for changing information in a database
      These two are the Data Manipulation Language, DML: Create, Read, Update, Delete
    Changing the structure of a database
    Adjusting security, performance, and deployment of databases
    Destroying everything in the database…but don't worry :-)
      These three are often called: DANGEROUS (seriously, called admin functionality, or Data Definition Language, DDL)
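The four DML operations can be sketched in a few statements. This example assumes SQLite via Python's sqlite3 module as a stand-in database; the cut-down three-column customers table and the new city value are illustrative only. The CREATE TABLE line is the one piece of DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: defines structure (the "dangerous" side of SQL).
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, firstname TEXT, city TEXT)"
)

# DML: Create
conn.execute("INSERT INTO customers VALUES (10101, 'John', 'Lynden')")
# DML: Read
after_insert = conn.execute("SELECT firstname, city FROM customers").fetchall()

# DML: Update
conn.execute("UPDATE customers SET city = 'Pinetop' WHERE customerid = 10101")
after_update = conn.execute("SELECT firstname, city FROM customers").fetchall()

# DML: Delete
conn.execute("DELETE FROM customers WHERE customerid = 10101")
after_delete = conn.execute("SELECT firstname, city FROM customers").fetchall()

print(after_insert, after_update, after_delete)
```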
SQL's SELECT Statement
  The single most important SQL statement. Period.
  Selects data out of a database, or performs a calculation on a column, value, table, etc.
  Really simple examples:
    SELECT 'hello'    returns: hello
    SELECT 1 + 3      returns: 4
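The two simple examples can be run without any table at all. A minimal sketch, assuming SQLite via Python's sqlite3 module as the database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SELECT can return a literal value or the result of a calculation.
greeting = conn.execute("SELECT 'hello'").fetchone()[0]
total = conn.execute("SELECT 1 + 3").fetchone()[0]

print(greeting, total)  # hello 4
```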
SELECT Statement, continued
  More commonly, the basic SELECT statement returns ROWs from a TABLE:
    SELECT firstname FROM customers
      John
      Leroy
      Elroy
      Lisa
    SELECT firstname, city FROM customers
      John   Lynden
      Leroy  Pinetop
      Elroy  Snoqualmie
      Lisa   Oshkosh
    SELECT * FROM customers
      10101  John   Gray    Lynden      Washington
      10298  Leroy  Brown   Pinetop     Arizona
      10299  Elroy  Keller  Snoqualmie  Washington
      10315  Lisa   Jones   Oshkosh     Washington
  A special COLUMN name: * means all COLUMNs
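The three queries on this slide can be tried against a small copy of the customers table. This sketch assumes SQLite via Python's sqlite3 module; the rows are copied from the slide, and the column types are guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, "
    "firstname TEXT, lastname TEXT, city TEXT, state TEXT)"
)
# Sample rows as they appear on the slide.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (10101, "John", "Gray", "Lynden", "Washington"),
        (10298, "Leroy", "Brown", "Pinetop", "Arizona"),
        (10299, "Elroy", "Keller", "Snoqualmie", "Washington"),
        (10315, "Lisa", "Jones", "Oshkosh", "Washington"),
    ],
)

# One column, two columns, then * for all columns.
first_names = conn.execute("SELECT firstname FROM customers").fetchall()
names_and_cities = conn.execute("SELECT firstname, city FROM customers").fetchall()
all_columns = conn.execute("SELECT * FROM customers").fetchall()

for row in all_columns:
    print(row)
```

Each result row comes back as a tuple with one entry per selected column.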
SELECT Statement: the Important Options [5] (for one table)
  SELECT    list of columns, functions on columns, or *
  FROM      name of table
  WHERE     list of conditions to include (called predicates)
  GROUP BY  list of columns
  ORDER BY  list of columns and direction of sort (ascending/descending)
  [5] The full definition of the SQL SELECT statement syntax is much longer and, to some extent, specific to the database software. See the definition of Microsoft SQL Server 2008 R2's SELECT statement at http://msdn.microsoft.com/en-us/library/ms189499.aspx
SELECT … ORDER BY
  Use ORDER BY to sort by one or more columns, in ascending or descending order
  [Screenshot: Effect of ORDER BY clause]
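Since the slide's screenshot is not in the transcript, here is a stand-in sketch of ORDER BY on one column and on two columns with mixed directions. It assumes SQLite via Python's sqlite3 module and reuses the sample customers rows; the specific queries are illustrative, not the ones from the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, "
    "firstname TEXT, lastname TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (10101, "John", "Gray", "Lynden", "Washington"),
        (10298, "Leroy", "Brown", "Pinetop", "Arizona"),
        (10299, "Elroy", "Keller", "Snoqualmie", "Washington"),
        (10315, "Lisa", "Jones", "Oshkosh", "Washington"),
    ],
)

# Sort by one column, ascending (ASC is the default direction).
by_lastname = conn.execute(
    "SELECT firstname, lastname FROM customers ORDER BY lastname ASC"
).fetchall()

# Sort by state ascending, then by firstname descending within each state.
by_state_then_name = conn.execute(
    "SELECT state, firstname FROM customers ORDER BY state ASC, firstname DESC"
).fetchall()

print(by_lastname)
print(by_state_then_name)
```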
SELECT … WHERE
  Use WHERE to filter based on one criterion:
  or more than one:
  Why the [square brackets]? The word state is a SQL keyword. When it is used as a column name, [bracketing] it avoids confusion.
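The slide's own WHERE examples are not in the transcript, so the queries below are illustrative substitutes. The sketch assumes SQLite via Python's sqlite3 module, which accepts [bracketed] column names in the same way SQL Server does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, "
    "firstname TEXT, lastname TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (10101, "John", "Gray", "Lynden", "Washington"),
        (10298, "Leroy", "Brown", "Pinetop", "Arizona"),
        (10299, "Elroy", "Keller", "Snoqualmie", "Washington"),
        (10315, "Lisa", "Jones", "Oshkosh", "Washington"),
    ],
)

# One criterion: only customers whose state is Washington.
one_criterion = conn.execute(
    "SELECT firstname FROM customers "
    "WHERE [state] = 'Washington' ORDER BY customerid"
).fetchall()

# More than one criterion, combined with AND.
two_criteria = conn.execute(
    "SELECT firstname FROM customers "
    "WHERE [state] = 'Washington' AND city = 'Lynden'"
).fetchall()

print(one_criterion)
print(two_criteria)
```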
FUNCTIONS
  You can add functions to a SQL SELECT statement to perform various analyses. The most common [6] are
    Aggregate functions
      count(), which returns the number of somethings, and
      sum(), which adds up the somethings
      Also: min(), max(), avg(), stdev(), var()
    Math, Date and String (text) Manipulation functions
      Math: abs(), ceiling(), power(), sqrt(), others
      String: len(), substring(), replace(), upper(), lower(), left(), right(), others
      Date: dateadd(), datediff(), datepart(), getdate(), day(), month(), year(), others
  [6] The full list is quite long. For SQL Server, see http://msdn.microsoft.com/en-us/library/aa258899(v=sql.80).aspx
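A few of these functions in action, as a minimal sketch. It assumes SQLite via Python's sqlite3 module, where some function names differ from the SQL Server names on the slide (for example, SQLite spells len() as length()), so treat the exact names as specific to this stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, "
    "firstname TEXT, lastname TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (10101, "John", "Gray", "Lynden", "Washington"),
        (10298, "Leroy", "Brown", "Pinetop", "Arizona"),
        (10299, "Elroy", "Keller", "Snoqualmie", "Washington"),
        (10315, "Lisa", "Jones", "Oshkosh", "Washington"),
    ],
)

# Aggregate functions collapse many rows into a single result row.
aggregates = conn.execute(
    "SELECT count(*), min(customerid), max(customerid) FROM customers"
).fetchone()

# String functions work on each row's value (length() is SQLite's len()).
string_result = conn.execute(
    "SELECT upper(lastname), length(lastname) FROM customers WHERE customerid = 10101"
).fetchone()

print(aggregates)
print(string_result)
```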
[DEMO] Using Functions, WHERE, and ORDER BY
  Summary:
    count(*) gives you the count of rows resulting from your query
    You can SELECT any combination of columns
      Unless you GROUP BY, in which case you are limited to the GROUPed BY columns and aggregate functions applied to other columns
  Gotchas
    sum(*) doesn't make sense, but sum(columnname) does, for columns of numbers
    GROUP BY is finicky: the list of columns you select is limited
    Some things aren't easy: for example, finding the percent of total
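The GROUP BY rule and the percent-of-total gotcha can be sketched together. This example assumes SQLite via Python's sqlite3 module and the sample customers rows. The subquery used for the percentage is one possible approach, not necessarily the one shown in the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, "
    "firstname TEXT, lastname TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (10101, "John", "Gray", "Lynden", "Washington"),
        (10298, "Leroy", "Brown", "Pinetop", "Arizona"),
        (10299, "Elroy", "Keller", "Snoqualmie", "Washington"),
        (10315, "Lisa", "Jones", "Oshkosh", "Washington"),
    ],
)

# Only the GROUPed BY column and aggregates appear in the SELECT list.
customers_per_state = conn.execute(
    "SELECT [state], count(*) FROM customers GROUP BY [state] ORDER BY [state]"
).fetchall()

# Percent of total: divide each group's count by a subquery that counts
# every row. Multiplying by 100.0 forces decimal (not integer) division.
percent_per_state = conn.execute(
    "SELECT [state], 100.0 * count(*) / (SELECT count(*) FROM customers) "
    "FROM customers GROUP BY [state] ORDER BY [state]"
).fetchall()

print(customers_per_state)
print(percent_per_state)
```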
Terminology: DBMS
  Database is a generic term; it can refer to:
    A specific set of data hosted on a Database Server
    A Database Server itself (not really the right term)
    A large body of information kept by a human being ("my recipe database")
  Databases generally run on a Database Server
    A computer running a Database Management System (DBMS)
    Accepts connections (queries) from many client computers
    Returns a response (result set) to each client in response to each query
    Can be distributed onto lots of servers (Facebook: 1,800+ MySQL servers)
  A DBMS handles multiple databases
    Each Database is stored in one or more database files
    Database Files can sometimes be loaded/viewed/edited by other software
Names You Might Encounter (in the Database World)
  SQL Server, from Microsoft (also Microsoft SQL Server)
  Oracle
  DB2, from IBM
  Less common: Microsoft Access, dBase, Sybase
Database Structures
  Most databases have many TABLEs
    10 would be few; 50 would be normal; 150 would be many
  There is a method to this madness
    Different TABLEs contain different categories of information
    Example:
      Customers: contains lots of customers
      Products: contains lots of products
      Orders: combines customers and products (and quantities, etc.)
Why So Many Tables?
  Imagine a world with just 1 table
  The problem of duplicate data:
    OrderID  CustomerName  CustomerAddress                       Quantity  ProductName
    1000     James         123 Main Street, Arcadia, CA 95000    3         Orange
    1001     James         123 Main Street, Arcadia, CA 95000    4         Apple
    1002     George        444 1st Avenue, Sacramento, CA 97000  1         Fork
  Adding a new order is easy:
    1003     James         123 Main Street, Arcadia, CA 95000    6         Pear
  But what happens when James changes his address?
  Answer: need to update every ROW where 'James' is the CustomerName (ugh!)
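The cost of the one-table design shows up as soon as the address changes. This sketch assumes SQLite via Python's sqlite3 module; the table name orders_flat and the new address are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; every order repeats the customer's address.
conn.execute(
    "CREATE TABLE orders_flat (orderid INTEGER PRIMARY KEY, customername TEXT, "
    "customeraddress TEXT, quantity INTEGER, productname TEXT)"
)
conn.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?, ?)",
    [
        (1000, "James", "123 Main Street, Arcadia, CA 95000", 3, "Orange"),
        (1001, "James", "123 Main Street, Arcadia, CA 95000", 4, "Apple"),
        (1002, "George", "444 1st Avenue, Sacramento, CA 97000", 1, "Fork"),
        (1003, "James", "123 Main Street, Arcadia, CA 95000", 6, "Pear"),
    ],
)

# James moves (the new address is made up): every one of his rows changes.
new_address = "9 Elm Street, Arcadia, CA 95000"
cursor = conn.execute(
    "UPDATE orders_flat SET customeraddress = ? WHERE customername = 'James'",
    (new_address,),
)
rows_touched = cursor.rowcount

print(rows_touched)  # one update per order James ever placed
```

The number of rows to rewrite grows with every order the customer places, and any row that is missed leaves the data inconsistent.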
Solution: Divide and Conquer
  Divide data into Entities (TABLEs), specific to a given purpose:
    CustomerID  CustomerName  CustomerAddress
    1           James         123 Main Street, Arcadia, CA 95000
    2           George        444 1st Avenue, Sacramento, CA 97000

    OrderID  CustomerID  Quantity  ProductName
    1000     1           3         Orange
    1001     1           4         Apple
    1002     2           1         Fork
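With the data divided into two tables, the same address change touches a single row. This sketch assumes SQLite via Python's sqlite3 module; the new address is hypothetical, and the JOIN at the end is only a preview of how the two tables are recombined when reading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customerid INTEGER PRIMARY KEY, "
    "customername TEXT, customeraddress TEXT)"
)
conn.execute(
    "CREATE TABLE orders (orderid INTEGER PRIMARY KEY, customerid INTEGER, "
    "quantity INTEGER, productname TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "James", "123 Main Street, Arcadia, CA 95000"),
        (2, "George", "444 1st Avenue, Sacramento, CA 97000"),
    ],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1000, 1, 3, "Orange"), (1001, 1, 4, "Apple"), (1002, 2, 1, "Fork")],
)

# The address lives in exactly one place, so one row changes.
new_address = "9 Elm Street, Arcadia, CA 95000"  # made-up address
cursor = conn.execute(
    "UPDATE customers SET customeraddress = ? WHERE customerid = 1",
    (new_address,),
)
rows_touched = cursor.rowcount

# Reading orders with addresses: match rows on the shared customerid.
orders_with_address = conn.execute(
    "SELECT o.orderid, c.customername, c.customeraddress "
    "FROM orders AS o JOIN customers AS c ON c.customerid = o.customerid "
    "ORDER BY o.orderid"
).fetchall()

print(rows_touched)
for row in orders_with_address:
    print(row)
```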