PostgreSQL Workshop: Working with Network Data PGCon 2014 Workshop James Hanson 21-May-14.


1 PostgreSQL Workshop: Working with Network Data PGCon 2014 Workshop James Hanson 21-May-14

2 About the presenter … James Hanson, reformed Oracle RDBMS developer / administrator. I am interested in network data, and I see PostgreSQL as a powerful, underutilized tool for storing and analyzing it. 21-May-14 PGCon14 Working with Network Data Types

3 About the presenter … James Hanson, Parsons, Inc.

4 What are we doing in this workshop? Using PostgreSQL to explore the question "What can I find out about the requester_IP addresses in my Apache log?" … plus some side trips … – Parsing & loading an Apache access log and associating the requesting IPs with geolocated, known-malicious and other IP lists. NOTE: All workshop data is free and publicly available.

5 What will we get from the workshop?
– Hands-on experience using PG to load, relate and analyze moderate-sized network data sets
– Simple and useful techniques for loading network data
– Simple and useful techniques for exploring network data
– Experience exploring some public data sets
– Get jazzed to use the techniques and code samples on your own network data
– Meet other folks with complementary interests and expertise

6 What's not in this workshop?
– GUI displays or application integration. We focus on the data
– PG optimization, tuning or extensions other than PostGIS
– RDBMS normalizing, surrogate keys or orthodoxy*
– Automated workflow
– Handling of exceptions, errors and boundary conditions
* We'll revisit this later

7 Workshop setup and logistics
Using EnterpriseDB PostgreSQL with PostGIS and no additional extensions. Hosted on a CentOS 6.5 x64 virtual machine with 3 GB RAM and 2 CPU cores.
Using three workshop downloads …
– The data and DDL queries are zipped and available on Google Docs
– The PowerPoint is on SlideShare
– The SQL and Python files are on GitHub
Everything is linked off the PGCon14 site.

8 Workshop setup errata
PGAdmin Query Tool: Show NULL values as
CentOS yum packages: policycoreutils-gui, setroubleshoot, lzop, samba, dos2unix, system-config-lvm.noarch
PostgreSQL non-default parameters (workshop value / EnterpriseDB default):
  max_connections
  effective_cache_size    262,144 / 16,384
  maintenance_work_mem    524,288 / 16,384
  max_stack_depth         2,
  shared_buffers          98,304 / 1,024
  temp_buffers            131,072 / 1,024
  work_mem                1,048,576 / 1,024
  fsync                   off / on
  synchronous_commit      off / on
  encoding                UTF-8

9 PostgreSQL network data types
  Name      Storage Size     Description
  cidr      7 or 19 bytes    IPv4 and IPv6 networks
  inet      7 or 19 bytes    IPv4 and IPv6 hosts and networks
  macaddr   6 bytes          MAC addresses
An inetrange type is a natural extension that complements subnets:
CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet );
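As a quick illustration (the literal values are mine, not from the workshop files), the three built-in types parse like this, and the custom inetrange works like any other range type once created:

```sql
-- inet holds a host (optionally with a netmask); cidr holds a network.
SELECT '192.168.1.5'::inet          AS a_host,
       '192.168.1.0/24'::cidr       AS a_network,
       '08:00:2b:01:02:03'::macaddr AS a_mac;

-- The custom range type from the slide; PostgreSQL auto-creates an
-- inetrange(lower, upper, bounds) constructor for it.
CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet );
SELECT inetrange('10.0.0.1', '10.0.0.100', '[]') @> inet '10.0.0.42';  -- true
```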

10 PostgreSQL network operators
  Operator   Description              Example
  <=         is less than or equal    inet '192.168.1.5' <= inet '192.168.1.5'
  =          equals                   inet '192.168.1.5' = inet '192.168.1.5'
  <>         is not equal             inet '192.168.1.5' <> inet '192.168.1.4'
  >>         contains                 inet '192.168.1/24' >> inet '192.168.1.5'
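The containment operators are the big win; a couple of sketch queries in the style of the PostgreSQL documentation examples:

```sql
-- Does this network contain that host?
SELECT inet '192.168.1.0/24' >> inet '192.168.1.5';  -- true ("contains")
SELECT inet '192.168.1.5' << inet '192.168.1.0/24';  -- true ("is contained by")
-- Ordinary comparisons sort in network byte order:
SELECT inet '192.168.1.4' < inet '192.168.1.5';      -- true
```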

11 PostgreSQL ranges and arrays
Ranges
– 6 built-in types. Easy to create custom types
– Bounds can be inclusive, [, or exclusive, (. We'll use inclusive lower and upper bounds, […]
– Use the "contains" and "element is contained by" operators
Arrays … just the tip of the iceberg
– Use for parsing raw data and alternative storage models
– Search arrays with the ANY operator
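A minimal sketch of both ideas, using only built-in types:

```sql
-- ANY searches an array much like IN searches a value list:
SELECT 22 = ANY (ARRAY[21, 22, 80, 443]);   -- true
-- A range with inclusive [] bounds, and the "contains element" operator:
SELECT int4range(1, 10, '[]') @> 10;        -- true (upper bound is included)
```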

12 Q: Why are we doing this with PostgreSQL? I was told that "the cloud" was the only place to work with network data!

13 A1: Because RDBMSs have some advantages
Cloud technologies have a difficult time with semantically similar but syntactically different data such as:
– Subnets: a /25 prefix = netmask 255.255.255.128
– Date / time: 21-May-14 = 05/21/14 = 2014-05-21
RDBMSs are (still) much faster with GUI responses on small / moderate data sets.
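The normalization point can be checked directly in psql (my own illustrative values):

```sql
-- Two spellings of the same date compare equal once parsed:
SELECT '21-May-14'::date = '2014-05-21'::date;   -- true
-- A /25 prefix and its dotted-quad netmask are one value to PostgreSQL:
SELECT masklen(inet '192.168.1.0/25'),           -- 25
       netmask(inet '192.168.1.0/25');           -- 255.255.255.128
```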

14 A2: Because PostgreSQL has some useful and unique capabilities
PostgreSQL network data types and operators enable:
– Much faster Layer-3-type operations … because you can think like a router
– Data validation (NOTE: custom solutions are nearly impossible for IPv6)
PG arrays can simplify SQL syntax and ERDs. PG's regexp_split_to_array simplifies ETL … but COPY still lacks important functionality.

15 Our workshop approach
Our approach: "What useful and tough problems can I solve with PostgreSQL's unique capabilities and low cost?"
Not our approach: "Given that you are using a relational database management system, here are the reasons why PostgreSQL is the best brand to choose."

16 Workshop logistics
Preamble / instruction. Brief overview of our datasets. Transform and load the data together. -- Intermezzo -- Query the data together. Questions, follow-up & next steps.
NOTE: Code snippets are at the end of the PPT. All files are linked from the PGCon14 site.

17 Our buffet of data … All free and publicly available …
– OurAirports: Airport GeoLoc
– CSC InfoChimps: US IPGeo
– Dragon Research Group: Malicious IPs
– National Vulnerabilities Database: IAVAs
– GeoLite: IPv6 Geo

18 … Our buffet of data
– MyIP: Malicious IPs
– Cowgar: Apache log
– The Internet Traffic Archive (ftp://ita.ee.lbl.gov/): tracert
– Cooperative Association for Internet Data Analysis (CAIDA): DNS
– NIL: ARP table

19 Dive into Airport data … ../020_Airports/ 4 CSV files: – airports, runways, countries and regions. AirportDDL.sql

20 INTERMEZZO FOR 10 MINUTES

21 Let's query our data … ../Queries/ Where else have the Apache_log requesting_ip addresses appeared? USIPGeo_Queries.sql

22 REVISIT NORMALIZATION, QUESTIONS AND WRAP UP

23 *Normalizing and surrogate keys NOTE: This is a revisited topic from the workshop preamble. Normalization is very useful for handling ACID transactions and transaction-entered data (vs. bulk-loaded). Surrogate keys are loved by Hibernate, may help in some situations and are part of many ORM methodologies. But that does not generally apply to immutable network data intended for analysis … like we have.

24 *Normalizing and surrogate keys (2) And normalization has a steep cost in …
– Increased complexity to load and query data
– Hard-coded assumptions in a complex ERD
– Reduced performance **
– Intellectual distance from the data
The reasons to normalize and use surrogate keys don't apply here but the costs do, so I don't do it. (Feel free to explore this topic off-line with me.)
** This is not true in all cases. I refer to index-based joins and searches.

25 Q: Why did you use arrays and custom data types? Doesn't that violate RDBMS principles plus lock you in to custom PostgreSQL syntax? A: Yes. Here's why I suggest you consider that: Drs. Boyce and Codd did not include arrays in their RDBMS theoretical work, and arrays are not in the ANSI SQL standard … but that work is focused on ACID transactions, which do not apply here. PostgreSQL network types are unique … so you are already committed to custom code. Why not take full advantage of PG's capabilities?

26 Wrap up and follow-on … IP and IP/subnet data types enable you to think of network data as a Layer-3 device does – This will always be faster and simpler than treating IPs as strings or numbers. Ex. "Which of these IPs is contained in that large set of IP subnets and ranges?" – 1-line WHERE clause in PostgreSQL – … still working on it in other environments
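That one-line WHERE clause might look like the following sketch (the subnet list is illustrative; apache_log and requesting_ip are the workshop's Apache-log table and column):

```sql
-- "Which requesters fall inside any of this set of subnets?"
SELECT DISTINCT requesting_ip
FROM   apache_log
WHERE  requesting_ip <<= ANY (ARRAY['10.0.0.0/8', '192.168.0.0/16']::cidr[]);
```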

27 Wrap up and follow-on … An inetrange type is a natural extension that complements subnets: CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet ); If you are using PG network data types with read-only data for analysis … consider using custom data types and arrays – You have already committed to custom code, so you may as well reap the (substantial) benefits.

28 XML and XPath notes The idealized ETL process for XML (and JSON) is ETL … don't Extract or Transform, just Load. PostgreSQL has good, but not great, XML support. Whatever you do, don't write custom code to shred XML into the RDBMS and then build XML again for delivery to applications. Please. A very good XPath resource is the Stack Overflow question "…data-from-an-xml-column-in-postgres".
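As a taste of the XPath support, PostgreSQL's built-in xpath() function returns an array of matching nodes from an xml value (a made-up document here):

```sql
SELECT xpath('/host/ip/text()',
             '<host><ip>10.0.0.1</ip><ip>10.0.0.2</ip></host>'::xml);
-- {10.0.0.1,10.0.0.2}
```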

29 PostgreSQL: Working with Network Data James Hanson Please provide feedback. This presentation is new and I'd like to make it better. (… and please let me know if you see / have any cool data sets too!)

30 QUESTIONS AND COMMENTS … James Hanson /

31 Code snippets and techniques from AirportsDDL.sql

-- Countries
CREATE TABLE countries (
    id             BIGINT PRIMARY KEY,
    code           CHAR(10) NOT NULL UNIQUE,
    name           VARCHAR(100),
    continent      CHAR(4),
    wikipedia_link VARCHAR(500),
    keywords       VARCHAR(300));

COMMENT ON TABLE countries IS 'Data from

COPY countries FROM '/Pgcon14/Scripts/020_Airports/countries.csv'
    WITH DELIMITER ',' CSV HEADER;

32 Code snippets and techniques from AirportsDDL.sql

-- Update the GEOMETRY column from the long / lat
UPDATE airports
SET location = ST_GeomFromText('POINT('||longitude_deg||' '||latitude_deg||')', 4326);

33 Code snippets and techniques from USIPGeoDDL.sql

-- COPY into temporary table for processing
-- Create temporary table
CREATE TABLE us_ipgeo_temp (
    id   SERIAL,
    col1 TEXT);

-- Load temporary table
COPY us_ipgeo_temp(col1)
FROM '/Pgcon14/Scripts/030_USIPGeo/infochimps_dataset_13358_download_16680/ip_blocks_us_geo.tsv'
WITH DELIMITER '|' -- use a character that is not in the dataset
CSV;

34 Code snippets and techniques from USIPGeoDDL.sql

-- Use CTE (Common Table Expression) syntax and
-- regexp_split_to_array to look at the data in our temporary table.
WITH us_ipgeo_array AS (
    SELECT regexp_split_to_array(col1, '\t') AS col1
    FROM us_ipgeo_temp
    -- LIMIT 10
)
SELECT col1[1] AS col1_1, col1[2] AS col1_2, col1[3] AS col1_3,
       col1[4] AS col1_4, col1[5] AS col1_5, col1[6] AS col1_6,
       col1[7] AS col1_7, col1[8] AS col1_8, col1[9] AS col1_9,
       col1[10] AS col1_10, col1[11] AS col1_11, col1[12] AS col1_12,
       col1[13] AS col1_13, col1[14] AS col1_14
FROM us_ipgeo_array;

35 Code snippets and techniques from USIPGeoDDL.sql

-- Manipulate and CAST the fields as needed
WITH us_ipgeo_array AS (
    SELECT regexp_split_to_array(col1, '\t') AS col1
    FROM us_ipgeo_temp
    LIMIT 1000 -- check more records too
)
SELECT
    -- col1[1] AS col1_1,
    -- col1[2] AS col1_2,
    -- col1[3] AS col1_3,
    -- col1[4] AS col1_4,
    col1[5]::inet AS ip,
    REPLACE(col1[6], '"', '')::CHAR(2) AS country,
    REPLACE(col1[7], '"', '')::CHAR(2) AS state,
    REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
    REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code,
    col1[10]::real AS latitude,
    col1[11]::real AS longitude,
    ST_GeomFromText('POINT('||col1[11]::real||' '||col1[10]::real||')', 4326) AS location,
    NULLIF(col1[12], '')::int AS metro_code, -- a few values were the empty string, which IS NOT NULL but does not CAST to INT
    col1[13]::text AS area_code
    -- col1[14] AS col1_14
FROM us_ipgeo_array
ORDER BY metro_code NULLS LAST;

36 Code snippets and techniques from USIPGeoDDL.sql

-- Create and populate us_ipgeo from us_ipgeo_temp
-- Try it once with a LIMIT statement
SELECT * INTO us_ipgeo FROM (
    WITH us_ipgeo_array AS (
        SELECT regexp_split_to_array(col1, '\t') AS col1
        FROM us_ipgeo_temp
        -- LIMIT 1000 -- check more records too
    )
    SELECT
        -- col1[1] AS col1_1,
        -- col1[2] AS col1_2,
        -- col1[3] AS col1_3,
        -- col1[4] AS col1_4,
        col1[5]::inet AS ip,
        REPLACE(col1[6], '"', '')::CHAR(2) AS country,
        REPLACE(col1[7], '"', '')::CHAR(2) AS state,
        REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
        REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code,
        col1[10]::real AS latitude,
        col1[11]::real AS longitude,
        ST_GeomFromText('POINT('||col1[11]::real||' '||col1[10]::real||')', 4326) AS location,
        NULLIF(col1[12], '')::int AS metro_code, -- a few values were the empty string, which IS NOT NULL but does not CAST to INT
        col1[13]::text AS area_code
        -- col1[14] AS col1_14
    FROM us_ipgeo_array
) AS tmp_table;

37 Code snippets and techniques from ApacheAccessLogs_DDL.sql

-- COPY into temporary table for processing
-- Create temporary table
CREATE TABLE apache_log_temp (
    id   SERIAL,
    col1 TEXT);

-- Set the client_encoding for the same reason as before
SET client_encoding TO 'latin1';

-- Load temporary table
COPY apache_log_temp(col1)
FROM '/Pgcon14/Scripts/080_ApacheLogs/access.log'
WITH DELIMITER '^' -- use a character that is not in the dataset
-- CSV
;

38 Code snippets and techniques from ApacheAccessLogs_DDL.sql

-- Try it once with a LIMIT statement
SELECT * INTO apache_log FROM (
    WITH apache_log_array AS (
        SELECT regexp_split_to_array(col1, ' ') AS col1
        FROM apache_log_temp
        -- LIMIT 1000
    )
    SELECT
        col1[1]::inet AS requesting_ip,
        to_timestamp(RIGHT(col1[4], -1) || LEFT(col1[5], -6),
                     'DD/Mon/YYYY:HH24:MI:SS') AT TIME ZONE INTERVAL '-04:00' AS time,
        RIGHT(col1[6], -1)::VARCHAR(10) AS action,
        col1[7]::VARCHAR(100) AS resource,
        LEFT(col1[8], -1)::VARCHAR(10) AS protocol,
        NULLIF(col1[9], '-')::int AS status_to_client,
        NULLIF(col1[10], '-')::int AS object_size_bytes,
        REPLACE(NULLIF(col1[11], '"-"'), '"', '')::VARCHAR(100) AS referrer,
        REPLACE(COALESCE(col1[12], '') || COALESCE(col1[13], '') || COALESCE(col1[14], '') ||
                COALESCE(col1[15], '') || COALESCE(col1[16], '') || COALESCE(col1[17], '') ||
                COALESCE(col1[18], '') || COALESCE(col1[19], '') || COALESCE(col1[20], ''),
                '"', '')::VARCHAR(200) AS user_agent
    FROM apache_log_array
) AS tmp_table;

39 Code snippets and techniques from tracert_DDL.sql

-- Let's play a bit with an array structure for the tracert table
-- DROP TYPE tracert_trip;
CREATE TYPE tracert_trip AS (
    hop     INTEGER,
    from_ip INET,
    to_ip   INET);

SELECT source_file, dest_name, dest_ip,
       array_agg((hop, from_ip, to_ip)::tracert_trip ORDER BY hop) AS tracert_trip
FROM tracert
GROUP BY source_file, dest_name, dest_ip
LIMIT 10000;
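Going the other way, the aggregated array can be expanded back into rows with unnest(); a sketch against a hypothetical saved result of a query like the one above (tracert_agg and its trips column are assumed names, not from the workshop files):

```sql
-- tracert_agg is an assumed table holding the array_agg output as column trips
SELECT dest_ip, (t).hop, (t).from_ip, (t).to_ip
FROM  (SELECT dest_ip, unnest(trips) AS t
       FROM tracert_agg) AS s;
```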

40 Code snippets and techniques from 03_MyIP_500TopHosts_Array_DDL.sql

-- Want to use a range for the IPs
-- Create custom type
CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet );

-- Check out our data
-- SELECT an inclusive, [], range for our inetrange
SELECT website, ww_site_rank, hosting_country, hosting_company,
       owner_country, parent_hosting_company, parent_country,
       inetrange(owner_ip_from, owner_ip_to, '[]') AS owner_range, -- [] means lower and upper inclusive
       parent_ip_range
FROM tophosts
ORDER BY owner_range;
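With the inclusive ranges built, the range "contains element" operator answers lookups in one predicate (the address here is illustrative, not from the dataset):

```sql
-- Which owner range contains this address?
SELECT website, ww_site_rank
FROM   tophosts
WHERE  inetrange(owner_ip_from, owner_ip_to, '[]') @> inet '8.8.8.8';
```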

