
1 © 2006 KDnuggets
Module 4b: Perl for Web Log Analysis

    152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;.NET CLR 1.1.4322)"
    252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400 740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
    252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
    252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"

2 Perl - introduction
- A full-featured, fast, and easy-to-use scripting language
- Very powerful pattern-matching facilities
- More powerful than gawk; very popular for web programming and CGI files
- Many Perl tutorials, e.g.
  learn.perl.org/
  www.perl.com/pub/a/2000/10/begperl1.html
  www.perlmonks.org/index.pl?node=Tutorials

3 Perl – historical note
- The name Perl is commonly expanded as Practical Extraction and Report Language
- Developed by Larry Wall
- Perl 1.0 was released to Usenet's alt.comp.sources in 1987
- Perl is one of the most popular web programming languages, due to powerful text manipulation and quick development
- Perl is widely known as "the duct tape of the Internet"

4 Perl - running
First Perl script (on Unix), file1.pl:

    #!/usr/local/bin/perl -w
    print "Hi there!\n";

Note: On Windows, the first line is usually #!c:/Perl/bin/perl.exe -w

Running it:

    % file1.pl

Result:

    Hi there!

5 Perl for Windows
- ActivePerl: a ready-to-install Perl distribution
- Runs on Windows, Linux, Mac OS, and other operating systems
- Free download: www.activestate.com/Products/ActivePerl/

6 Perl basics
- Two data types: numbers and strings
- Perl uses many special characters ($, @, %) as part of its syntax
- Perl variables:
  - Scalars (simple variables, things) start with $, e.g. $count
  - Arrays (lists) start with @, e.g. @array1
  - Hashes (associative arrays) start with %
- Usual control structures
- A full introduction to Perl is beyond the scope of this module

7 What does this code do?

    @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
    @p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
    ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
    close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print

Answer: We do NOT want to know!

8 The Tao of Coding
- Human time is MUCH more precious than computer time
- It is much better (and faster) to develop programs using methods that AVOID mistakes than to try to find bugs in badly written programs

9 Perl style: understandability first
- Perl allows you to write tricky programs to save a few lines of text
- AVOID this approach
- Use careful, step-by-step development
- Test after every step
- A good program should be easy to understand
- Improve efficiency only after you have an understandable program, and only if you need to

10 Perl coding
- Variables can be declared implicitly by their first use, e.g. $oldvar = $nevar + 27;
- If $nevar was not assigned before, its value is undefined (undef), which Perl treats as zero in arithmetic
- Danger! This can lead to hard-to-find errors (what if the variable was misspelled and was supposed to be $newvar?)
- Much better to declare variables explicitly, e.g. my $newvar = 0;
- Explicit declaration is enforced by the pragma use strict
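
A minimal sketch of how use strict catches the misspelling described above. This is an illustration, not code from the original module: the snippet is compiled with a string eval only so that the script can keep running and print the compile-time error; in a normal script the error would simply stop compilation. The variable names $snippet, $ok and $error are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A small program fragment with a deliberately misspelled variable ($nevar).
my $snippet = q{
    use strict;
    my $newvar = 0;
    my $oldvar = $nevar + 27;   # typo: should be $newvar
    1;
};

# String eval compiles the fragment; on a compile error it returns undef
# and leaves the error message in $@.
my $ok    = eval $snippet;
my $error = $@;

print "strict rejected the snippet:\n$error" unless $ok;
```

Without use strict, the same fragment would compile and quietly compute 27.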

11 Sample log file
- We will again use the file d100.log: the first 100 lines from the Nov 16, 2005 KDnuggets log file
- We will give useful code examples
- You are encouraged to try the code examples in this lecture on this file
- You should get the same answers!

12 Perl for parsing a web log file
Program 0: logparse0.pl - read and print log file

    #!c:/Perl/bin/perl.exe -w
    use strict;
    while (<>) {
        my $line = $_;   # current line
        print $line;
    }

13 Perl regular expressions, 1
- Usage: $var =~ /regex/ where regex is a regular expression
- E.g. $line =~ /google/ will match all lines containing "google"
- Note: the / characters delimit the regular expression, so / can't be used inside it (unless escaped like this: \/ )

14 Perl log parsing, 1
Check how many lines refer to google:

    #!c:/Perl/bin/perl.exe -w
    use strict;
    my $cnt = 0;
    while (<>) {
        my $line = $_;
        if ($line =~ /google/) { $cnt++; }
    }
    print " $cnt lines matched google";

Applying this code to d100.log, you get:

    2 lines matched google

15 Perl regular expressions, 2
Special characters:
- .  : matches any one character (except a newline)
- a* : matches zero or more repeats of "a"
- a+ : matches 1 or more repeats of "a"
- \S : matches any non-white space character
- ^  : anchor, matches beginning of string
- $  : anchor, matches end of string

16 Log parse 2: IP address
- The IP address is the first item on the log line
- In almost all log files it is followed by " - - ", representing missing "ident_user" and "auth_user" fields
- Regular expression for matching these 3 fields: $line =~ /^(\S+) - - /;

17 Perl regex: parentheses capture match variables
- Perl regex items enclosed in parentheses () correspond to special match variables
- Variable $1 contains the value matched by the regular expression in the first parentheses, $2 the value from the second, etc.

18 Perl regex: match variables
This program shows how to assign the IP address to variable $ip; it also shows error processing if the match is not successful.

    #!c:/Perl/bin/perl.exe -w
    use strict;
    my $cnt = 0;
    while (<>) {
        my $line = $_;
        if ($line =~ /^(\S+) - - /) {
            my $ip = $1;
            print "ip $ip\n";
            $cnt++;
        }
        else {
            print "bad line $line\n";
        }
    }
    print " processed $cnt log lines\n";

Note: The first line, with the path to Perl, is probably different on your machine.

19 Perl regular expression 4: brackets
- Brackets [ ] allow you to match any one of the characters inside
- Example: [cmt]an will match can, man or tan, but will not match ban or dan

20 Perl regular expression 4b: brackets [^ ]
- [^x] will match any character except x (note: here ^ is not the beginning-of-text anchor)
- Example: [^:]* will match any string that does not include a colon
- Example: if $date is 16/Nov/2005:031415, then after $date =~ /([^:]*):.*/ the [^:]* part will have matched 16/Nov/2005
- Because it was enclosed in (), the match result is stored in $1
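
The date example above can be tried directly. A minimal sketch, using the same $date value as the slide; the variable $day and the ^ anchor are additions for this illustration.

```perl
use strict;
use warnings;

my $date = "16/Nov/2005:031415";   # value from the example above
my $day  = "";

# [^:]* consumes everything up to (not including) the first colon;
# the parentheses store that part in $1.
if ($date =~ /^([^:]*):/) {
    $day = $1;
}
print "day: $day\n";
```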

21 Parsing log: Date, Time
- Date and time are specified in the log as [DD/Mon/YYYY:HH:MM:SS timezone]
- Matching regular expression:

    \[([^:]+):(..):(..):(..) -0500\]

22 Parsing log: Date, Time
Matching regular expression in detail:

    \[([^:]+):(..):(..):(..) -0500\]

- \[ and \] match the literal brackets [ and ]
- [^:] matches any single character except : , so [^:]+ matches a run of one or more such characters
- ([^:]+) will match DD/Mon/YYYY; value in $1
- first (..) will match HH (hours); value in $2
- second (..) will match MM (minutes); value in $3
- third (..) will match SS (seconds); value in $4

23 Parsing log: Time Zone
- The time zone is given as an offset relative to GMT
- The time zone in the log file is for the SERVER, not for the visitor, so it is nearly always the same throughout the log
- but it changes during daylight saving time
- In our test log file the time zone is -0500, US Eastern Standard Time
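
Because the offset changes with daylight saving time, a pattern with a hard-coded -0500 will reject lines logged during the other half of the year. One possible variation, shown here as a sketch and not as part of the original module, captures the offset instead: ([+-]\d{4}) is an assumed replacement for the literal -0500, and \d\d is used in place of .. for the time fields. The log line is the first sample line, shortened.

```perl
use strict;
use warnings;

# Shortened sample line (referrer and user agent omitted).
my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';

my ($date, $hh, $mm, $ss, $tz);

# ([+-]\d{4}) accepts any four-digit offset with a sign, e.g. -0500 or -0400.
if ($line =~ /\[([^:]+):(\d\d):(\d\d):(\d\d) ([+-]\d{4})\]/) {
    ($date, $hh, $mm, $ss, $tz) = ($1, $2, $3, $4, $5);
    print "date $date, time $hh:$mm:$ss, zone $tz\n";
}
else {
    print "bad line $line\n";
}
```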

24 Parsing log: Request
Regular expression for parsing the Request field:

    "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"

- " at each end : opening and closing quotes
- (GET|HEAD|POST|OPTIONS) : method
- first (\S+) : URL, captures any string of 1 or more non-blanks
- HTTP(\S+) : HTTP version, usually ignored
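
A short sketch applying the Request pattern on its own. The request string is taken from the first sample log line; the variable names are chosen for this illustration. Note that the last capture keeps the slash, so it holds /1.1 and not 1.1.

```perl
use strict;
use warnings;

my $request = '"GET /jobs/ HTTP/1.1"';   # Request field, including its quotes
my ($method, $url, $version) = ("", "", "");

if ($request =~ /"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"/) {
    ($method, $url, $version) = ($1, $2, $3);
}
print "method=$method url=$url version=$version\n";
```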

25 Parsing log: Status code and Object size
- Status (Response) code is always a 3-digit number, followed by a space, so it can be matched with (\d\d\d)
- Object size is either a number or "-", followed by a space. The simplest regex to match it is (\S+)

26 Parsing log: Referrer
- The Referrer is a string enclosed in double quotes "…"
- It can have anything inside except for a double quote
- It can also be "-" in case of a direct request
- Not documented, but it can be "" (nothing between the quotes)
- The Referrer can be matched by:

    "([^"]*)"

- " at each end : opening and closing quotes
- [^"]* : anything except a double quote, appearing zero or more times

27 Parsing log: User agent
- The User agent is also a string enclosed in double quotes "…" that can have anything inside except for a double quote
- It can also be "-"
- The User agent can be matched by:

    "([^"]+)"

- " at each end : opening and closing quotes
- [^"]+ : anything except a double quote, appearing one or more times

28 Parsing a web log line: putting all together
The matching is done by the following (should be all on one line):

    if ($line =~ /^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/ ) { … }

Full code is in program weblog_parse.pl
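
The sketch below is not weblog_parse.pl, which is not reproduced in this transcript. It is one hedged way to apply the same pattern to the first sample log line. The pattern is assembled from the pieces described on the earlier slides and joined with single spaces, which yields the same one-line regular expression; the field names in @names and the %rec hash are assumptions made for this illustration.

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;.NET CLR 1.1.4322)"';

# The pieces are single-quoted so backslashes reach the regex engine unchanged.
my $re = join ' ',
    '^(\S+) - -',                                 # IP, ident_user, auth_user
    '\[([^:]+):(..):(..):(..) -0500\]',           # date, HH, MM, SS, time zone
    '"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"',  # request
    '(\d\d\d)',                                   # status code
    '(\S+)',                                      # object size
    '"([^"]*)"',                                  # referrer
    '"([^"]+)"';                                  # user agent

my @names = qw(ip date hh mm ss method url version status size referrer agent);
my %rec;

# In list context a successful match returns the captured values $1..$12.
if (my @fields = $line =~ /$re/) {
    @rec{@names} = @fields;
    print "$rec{ip} requested $rec{url} at $rec{hh}:$rec{mm}, status $rec{status}, $rec{size} bytes\n";
}
else {
    print "bad line $line\n";
}
```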

29 Perl arrays
- A Perl array is an ordered list of items
- Array names begin with @
- Array initialization: @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

30 Perl arrays, num of items
- When referring to a single array item, the name begins with "$". E.g. we print the first array item (index 0) using print $days[0];
- The index of the last item in an array is $#array; $#days is 6
- The number of items in an array is scalar(@array), which equals $#array + 1; scalar(@days) is 7

31 Perl array iteration
- Iterating over the entire array

    foreach $day (@days) { print $day, "\n"; }

- is the same as

    for ($n = 0; $n < 7; $n++) { print $days[$n], "\n"; }
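
A small sketch, written under use strict, that ties the array slides together: it shows the item count, the last index, and an index loop that does not hard-code the array length. The variable names $count, $last_index and @printed are made up for this example.

```perl
use strict;
use warnings;

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

my $count      = scalar(@days);   # number of items
my $last_index = $#days;          # index of the last item, i.e. $count - 1

print "count: $count, last index: $last_index\n";

# Index-based loop that follows the array length instead of a literal 7.
my @printed;
for (my $n = 0; $n <= $last_index; $n++) {
    print $days[$n], "\n";
    push @printed, $days[$n];
}
```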

32 Perl hash
- A hash is an unordered list of key, value pairs
- Hash names begin with %
- Hash initialization: %capitals = ("USA", "Washington D.C.", "France", "Paris", "China", "Beijing");

33 Perl hash reference
- When referring to a single hash item, the name begins with "$"
- To get the capital of China from %capitals we use $capitals{"China"}
- To add the capital of UK, we use $capitals{"UK"} = "London";

34 Perl hash iteration
Iteration over the entire hash:

    foreach $country (keys %capitals) {
        print "$country capital $capitals{$country}\n";
    }
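
Hashes are a natural fit for log analysis. As an illustration only (this program is not part of the original module), the sketch below counts requests per IP address using the IP pattern from the earlier slides. The three log lines are shortened versions of the sample lines, with referrer and user agent omitted; %hits is a name chosen for this example. The keys are sorted so that the output order is predictable, since a hash itself is unordered.

```perl
use strict;
use warnings;

# Shortened sample lines (referrer and user agent omitted).
my @lines = (
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453',
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145',
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140',
);

my %hits;   # key: IP address, value: number of requests seen
foreach my $line (@lines) {
    if ($line =~ /^(\S+) - - /) {
        $hits{$1}++;
    }
}

foreach my $ip (sort keys %hits) {
    print "$ip $hits{$ip}\n";
}
```

To run the same count over a real file such as d100.log, the foreach over @lines could be replaced by the while (<>) loop used in the earlier programs.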

35 Additional tools for Web log analysis
- Perl for web log analysis: www.oreilly.com/catalog/perlwsmng/chapter/ch08.html
- Some web log analysis tools:
  - Analog: www.analog.cx/
  - AWstats: awstats.sourceforge.net/
  - Webalizer: www.mrunix.net/webalizer/
  - FTPweblog: www.nihongo.org/snowhare/utilities/ftpweblog/

