
1 Chapter 9

2 HTTP Protocol: HTTP is a text-based protocol, unlike TCP and many other binary protocols. It has basically a couple of common requests (GET and POST) and a much larger variety of replies (200 OK and more, over 40 in total). Replies are divided into 5 groups:
- Informational (1xx)
- Success (2xx)
- Redirection (3xx)
- Client Error (4xx)
- Server Error (5xx)
HTTP messages begin with a plain-text header whose individual entries are each terminated by \r\n. The header as a whole is terminated by a blank line (\r\n\r\n) and is followed by an optional body.
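To see that format on the wire, here is a minimal sketch in the chapter's Python 2 that speaks raw HTTP over a plain socket, fetching the same RFC used on the next slides:

import socket

# Open a TCP connection and send a hand-built HTTP/1.1 request.
# Every header line ends in \r\n; a blank line ends the header.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('www.ietf.org', 80))
sock.sendall('GET /rfc/rfc2616.txt HTTP/1.1\r\n'
             'Host: www.ietf.org\r\n'
             'Connection: close\r\n'
             '\r\n')
chunks = []
while True:
    data = sock.recv(4096)
    if not data:
        break
    chunks.append(data)
sock.close()
reply = ''.join(chunks)
# The status line and header fields come before the first blank line.
print reply.split('\r\n\r\n')[0]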

3 HTTP GET: Getting the RFC that defines HTTP. Exercise: find out the meaning of each of the header fields and values.
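For the exercise, the request that urllib2 sends for this URL (it appears again in the session on slide 17) looks like this on the wire:

GET /rfc/rfc2616.txt HTTP/1.1
Accept-Encoding: identity
Host: www.ietf.org
Connection: close
User-Agent: Python-urllib/2.7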

4 HTTP 200 OK: This is the response to the previous GET, with the body following the header.

5 Reading the Header in Python: The Python code that generates the previous traffic.

[pletcha@archimedes Spr14Homework]$ python
Python 2.7.5 (default, Nov 12 2013, 16:18:42)
[GCC 4.8.2 20131017 (Red Hat 4.8.2-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib, urllib2
>>> url = "http://www.ietf.org/rfc/rfc2616.txt"
>>> rawreply = urllib2.urlopen(url).read()
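Note that urlopen() hands back only the body through read(). A short sketch of how the same session could inspect the parsed headers instead, via the response object's info() method:

>>> reply = urllib2.urlopen(url)
>>> reply.info().gettype()
'text/plain'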

6 URL vs URI: URIs identify a resource. URLs both identify a resource and indicate how to fetch it. Characters that are not alphanumerics or in the set "$-_.+!*'()," must be expressed as %-escapes. For example:

http://example.com:8080/Nord%2FLB/logo?shape=square&dpi=96

host: example.com | port: 8080 | path elements: /Nord%2FLB/logo | script parameters: shape=square&dpi=96
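A sketch of producing and undoing that escaping with Python 2's urllib (passing safe='' so that the slash is escaped too):

>>> import urllib
>>> urllib.quote('Nord/LB', safe='')
'Nord%2FLB'
>>> urllib.unquote('Nord%2FLB')
'Nord/LB'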

7 Parsing URLs: The urlparse module's parsing methods break a complex URL into its component parts. The general form of a URL is:

scheme://netloc/path;parameters?query#fragment

>>> from urlparse import urlparse, urldefrag, parse_qs, parse_qsl
>>> p = urlparse('http://example.com:8080/Nord%2FLB/logo?shape=square&dpi=96')
>>> p
ParseResult(scheme='http', netloc='example.com:8080', path='/Nord%2FLB/logo', params='', query='shape=square&dpi=96', fragment='')

NOTE: The result is a tuple (a named tuple), not a dictionary.
Example: 'path' could point to a servlet and the 'parameters' could tell us which version of the servlet to use – 1.1, 1.2, etc.

8 Parsing Continued: Suppose you want key-value pairs for everything. parse_qs vs parse_qsl: the former returns a dictionary mapping each key to a list of values; the latter (the trailing 'l' is for 'list') preserves order by returning a list of (key, value) pairs instead.

>>> p.query
'shape=square&dpi=96'
>>> parse_qs(p.query)
{'shape': ['square'], 'dpi': ['96']}

Anchors (fragments) are also part of a URL but are not sent to the server; the client uses them to locate a tag in the reply document. You can split them off a URL using urldefrag():

>>> u = 'http://docs.python.org/library/urlparse.html#urlparse.urldefrag'
>>> urldefrag(u)
('http://docs.python.org/library/urlparse.html', 'urlparse.urldefrag')
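The import on the previous slide also pulls in parse_qsl, which is not demonstrated above; on the same query it would return the pairs in the order they appear in the URL:

>>> parse_qsl(p.query)
[('shape', 'square'), ('dpi', '96')]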

9 Rebuilding a URL from a ParseResult: It is always nice to be able to go backwards. Since the query string is the only complicated part, we can build a query from scratch with urllib.urlencode() and then create a new ParseResult object around it:

>>> p.geturl()
'http://example.com:8080/Nord%2FLB/logo?shape=square&dpi=96'
>>> import urllib
>>> import urlparse
>>> query = urllib.urlencode({'company': 'Nord/LB', 'report': 'sales'})
>>> query
'report=sales&company=Nord%2FLB'
>>> p = urlparse.ParseResult('https', 'example.com:8080', 'data/path', None, query, None)
>>> p.geturl()
'https://example.com:8080/data/path?report=sales&company=Nord%2FLB'
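An equivalent sketch that skips building a ParseResult by hand: urlparse.urlunparse() accepts a plain 6-tuple of the same fields:

>>> urlparse.urlunparse(('https', 'example.com:8080', 'data/path', '', query, ''))
'https://example.com:8080/data/path?report=sales&company=Nord%2FLB'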

10 Relative URLs: If a URL is "relative" then it is interpreted relative to some base, and the base comes from the URL that was used to fetch the page containing the relative link. Relative URLs are particularly useful if an entire sub-tree of a website is moved to a new parent directory: all the relative links continue to function without any editing.

11 urlparse.urljoin(): urljoin() combines a relative URL with a base to give a single absolute URL.

>>> import urlparse
>>> # a relative joined to a path that ends in / gets added on
>>> urlparse.urljoin('http://www.python.org/psf/', 'grants')
'http://www.python.org/psf/grants'
>>> urlparse.urljoin('http://www.python.org/psf/', 'mission')
'http://www.python.org/psf/mission'
>>> # a relative joined to a path that does not end in / replaces
>>> # the last part of the path
>>> urlparse.urljoin('http://www.python.org/psf', 'grants')
'http://www.python.org/grants'
>>> # relatives can represent "current location" as '.'
>>> urlparse.urljoin('http://www.python.org/psf/', './mission')
'http://www.python.org/psf/mission'
>>> urlparse.urljoin('http://www.python.org/psf', './mission')
'http://www.python.org/mission'
>>> # relatives can represent "parent of current" as '..'
>>> urlparse.urljoin('http://www.python.org/psf/', '../news')
'http://www.python.org/news'

12 urlparse.urljoin() (cont):

>>> # relatives won't take you above the top level of the
>>> # current website
>>> urlparse.urljoin('http://www.python.org/psf', '../news')
'http://www.python.org/../news'
>>> # relatives that start with / start at the top level of this website
>>> urlparse.urljoin('http://www.python.org/psf/', '/dev/')
'http://www.python.org/dev/'
>>> urlparse.urljoin('http://www.python.org/psf/mission/', '/dev/')
'http://www.python.org/dev/'
>>> # if the relative is itself an absolute URL then urljoin() simply
>>> # replaces one absolute with another
>>> urlparse.urljoin('http://www.python.org/psf/', 'http://yelp.com/')
'http://yelp.com/'

This last behavior is useful: it implies we can always apply urljoin() to the current page's URL plus the next link, whether that link is relative or absolute.

13 Easy Access to HTTP Headers: Every time we send off a GET or POST and get a reply, we'd like to be able to see the headers. urllib2 lets you build an opener whose open() method can be customized with handlers of your own devising. You accomplish this by subclassing a couple of urllib2 and httplib classes and then overriding methods in these classes so that your methods are invoked instead of the default ones. So that the normal functionality still occurs, it is important that your methods also invoke the superclass methods, as the code on the next two slides shows.

14 Opening a link with your own opener:

>>> from verbose_http import VerboseHTTPHandler
>>> import urllib, urllib2
>>> opener = urllib2.build_opener(VerboseHTTPHandler)
>>> opener.open('http://serendipomatic.org')
--------------------------------------------------
GET / HTTP/1.1
Accept-Encoding: identity
Host: serendipomatic.org
Connection: close
User-Agent: Python-urllib/2.7
-------------------- Response --------------------
HTTP/1.1 200 OK
Date: Mon, 24 Mar 2014 20:08:00 GMT
Server: Apache/2.2.25 (Amazon)
Vary: Cookie
Set-Cookie: csrftoken=HybJc4Gx5GgYsucjhF3f34aF7rMtzmcU; \
    expires=Mon, 23-Mar-2015 20:08:00 GMT; Max-Age=31449600; Path=/
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
<addinfourl at 140151715103304 whose fp = \
 >

15 verbose_http.py (part 1): Replace the standard opener with one of your own, specialized to use your versions of the following object methods:

#!/usr/bin/env python
# Foundations of Python Network Programming - Chapter 9 - verbose_http.py
# HTTP request handler for urllib2 that prints requests and responses.
import StringIO, httplib, urllib2

# your VerboseHTTPHandler extends urllib2.HTTPHandler
class VerboseHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        # overrides HTTPHandler.http_open to use our connection class
        return self.do_open(VerboseHTTPConnection, req)

16 verbose_http.py (part 2):

class VerboseHTTPResponse(httplib.HTTPResponse):
    def _read_status(self):
        # read the whole reply, print its header, then hand a fresh
        # file-like copy back to the normal machinery
        s = self.fp.read()
        print '-' * 20, 'Response', '-' * 20
        print s.split('\r\n\r\n')[0]
        self.fp = StringIO.StringIO(s)
        return httplib.HTTPResponse._read_status(self)

class VerboseHTTPConnection(httplib.HTTPConnection):
    response_class = VerboseHTTPResponse
    def send(self, s):
        # print everything we transmit before sending it for real
        print '-' * 50
        print s.strip()
        httplib.HTTPConnection.send(self, s)

17 GET: In the beginning there was only GET. info is an object with many fields.

>>> from verbose_http import VerboseHTTPHandler
>>> import urllib, urllib2
>>> opener = urllib2.build_opener(VerboseHTTPHandler)
>>> info = opener.open('http://www.ietf.org/rfc/rfc2616.txt')
--------------------------------------------------
GET /rfc/rfc2616.txt HTTP/1.1
Accept-Encoding: identity
Host: www.ietf.org
Connection: close
User-Agent: Python-urllib/2.7
>>> info.code, info.msg
(200, 'OK')
>>> sorted(info.headers.keys())
['accept-ranges', 'connection', 'content-length', 'content-type', 'date', 'etag', 'last-modified', 'server', 'vary']
>>> info.headers['content-type']
'text/plain'
>>>

18 Why the Host header? Originally each IP address supported only one web server, at port 80. Over time we needed to put multiple sites at the same (IP, port) pair, so the httpd further demultiplexes each GET request using the Host header field to determine which website the request should be delivered to. This is called virtual hosting.
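A sketch of the idea using httplib (the host names here are hypothetical): both requests target the same machine, and so the same (IP, port) pair; only the Host header tells the httpd which site is wanted.

import httplib

# Hypothetical machine that virtual-hosts two sites at one address.
conn = httplib.HTTPConnection('shared-server.example.com')
conn.request('GET', '/', headers={'Host': 'site-one.example.com'})
print conn.getresponse().status

conn = httplib.HTTPConnection('shared-server.example.com')
conn.request('GET', '/', headers={'Host': 'site-two.example.com'})
print conn.getresponse().status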

19 GET (cont): info is also set up to be read like a file, with reading starting right after the end of the header:

>>> info.read(200)
'\n\n\n\n\n\n Network Working Group R. Fielding\n Request for Comments: 2616 UC Irvine\n Obsoletes: 2068 '
>>>

20 Return Codes:
- 200 range – things are OK
- 300 range – some kind of redirection
- 400 range – client problem (invalid URL, for example)
- 500 range – server problem

21 300s:
- 301 – Moved Permanently: the 'Location' header field has the new, permanent address.
- 303 – See Other: in response to an original POST or PUT, go to the 'Location' URL and issue a GET. The original request is considered to have succeeded.
- 304 – Not Modified: the request came with a page timestamp, and the website hasn't changed since that time, so the requester already has the most recent copy.
- 307 – Temporary Redirect: like 303, except the client should repeat the original request (a POST stays a POST) at the 'Location' URL.

22 301: Here urllib2 follows the 301 automatically: the 'Location' header points at www.google.com, and a second GET goes out without any work on our part.

>>> info = opener.open('http://google.com')
--------------------------------------------------
GET / HTTP/1.1
...
-------------------- Response --------------------
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
...
--------------------------------------------------
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
...
-------------------- Response --------------------
HTTP/1.1 200 OK
Date: Mon, 07 Apr 2014 15:18:06 GMT
...

23 301 (cont):

>>> info = opener.open('http://www.twitter.com')
--------------------------------------------------
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.twitter.com
Connection: close
User-Agent: Python-urllib/2.7
-------------------- Response --------------------
HTTP/1.1 301 Moved Permanently
content-length: 0
date: Mon, 07 Apr 2014 15:27:27 UTC
location: https://twitter.com/
server: tfe
set-cookie: guest_id=v1%3A139688444736565383; Domain=.twitter.com; Path=/; Expires=Wed, 06-Apr-2016 15:27:27 UTC
Connection: close
>>>

NOTE: This time urllib2 doesn't follow the redirect, perhaps because it leads to a secure (https) site.
NOTE: Obviously Google and Twitter have different opinions of 'www': Google redirects toward it, Twitter away from it.

24 Handle Your Own Redirect:

>>> class NoRedirectHandler(urllib2.HTTPRedirectHandler):
...     def http_error_302(self, req, fp, code, msg, headers):
...         return
...     http_error_301 = http_error_303 = http_error_307 = http_error_302
...
>>> no_redirect_opener = urllib2.build_opener(NoRedirectHandler)
>>> no_redirect_opener.open('http://google.com')
...
urllib2.HTTPError: HTTP Error 301: Moved Permanently
>>>
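Because the handler returns None, the redirect surfaces as an HTTPError that the caller can catch and inspect; a sketch, assuming the reply carries a 'Location' header as in the Google session on slide 22:

>>> try:
...     no_redirect_opener.open('http://google.com')
... except urllib2.HTTPError, e:
...     print e.code, e.hdrs['Location']
...
301 http://www.google.com/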


