Building a Simple Web Proxy

Slides:



Advertisements
Similar presentations
HTTP and Apache Roy T. Fielding eBuilt, Inc. The Apache Software Foundation
Advertisements

HTTP Reading: Section and COS 461: Computer Networks Spring
Hypertext Transfer PROTOCOL ----HTTP Sen Wang CSE5232 Network Programming.
Presenter: James Huang Date: Sept. 29,  HTTP and WWW  Bottle Web Framework  Request Routing  Sending Static Files  Handling HTML  HTTP Errors.
An Introduction to the Internet and the Web Frank McCown COMP 250 – Internet Development Harding University.
TCP/IP Protocol Suite 1 Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 22 World Wide Web and HTTP.
HTTP – HyperText Transfer Protocol
Hypertext Transfer Protocol Kyle Roth Mark Hoover.
TCP/IP Protocol Suite 1 Chapter 22 Upon completion you will be able to: World Wide Web: HTTP Know how HTTP accesses data on the WWW Objectives.
Lecture 7 TELNET Protocol & HyperText Transfer Protocol CPE 401 / 601 Computer Network Systems slides are modified from Dave Hollinger.
Chapter 2 Application Layer Computer Networking: A Top Down Approach Featuring the Internet, 3 rd edition. Jim Kurose, Keith Ross Addison-Wesley, July.
Hypertext Transfer Protocol Information Systems 337 Prof. Harry Plantinga.
HTTP Overview Vijayan Sugumaran School of Business Administration Oakland University.
Hypertext Transport Protocol CS Dick Steflik.
 What is it ? What is it ?  URI,URN,URL URI,URN,URL  HTTP – methods HTTP – methods  HTTP Request Packets HTTP Request Packets  HTTP Request Headers.
Rensselaer Polytechnic Institute CSC-432 – Operating Systems David Goldschmidt, Ph.D.
Simple Web Services. Internet Basics The Internet is based on a communication protocol named TCP (Transmission Control Protocol) TCP allows programs running.
Web Server Design Week 5 Old Dominion University Department of Computer Science CS 495/595 Spring 2010 Martin Klein 2/10/10.
FTP (File Transfer Protocol) & Telnet
Simple Web Services. Internet Basics The Internet is based on a communication protocol named TCP (Transmission Control Protocol) TCP allows programs running.
HTTP Reading: Section and COS 461: Computer Networks Spring
HyperText Transfer Protocol (HTTP).  HTTP is the protocol that supports communication between web browsers and web servers.  A “Web Server” is a HTTP.
CP476 Internet Computing Lecture 5 : HTTP, WWW and URL 1 Lecture 5. WWW, HTTP and URL Objective: to review the concepts of WWW to understand how HTTP works.
TCP/IP Protocol Suite 1 Chapter 22 Upon completion you will be able to: World Wide Web: HTTP Understand the components of a browser and a server Understand.
Application Layer 2 Figures from Kurose and Ross
Maryam Elahi University of Calgary – CPSC 441.  HTTP stands for Hypertext Transfer Protocol.  Used to deliver virtually all files and other data (collectively.
Copyright (c) 2010, Dr. Kuanchin Chen1 The Client-Server Architecture of the WWW Dr. Kuanchin Chen.
HTTP Hypertext Transfer Protocol
HTTP Hypertext Transfer Protocol RFC 1945 (HTTP 1.0) RFC 2616 (HTTP 1.1)
Web Client-Server Server Client Hypertext link TCP port 80.
1-1 HTTP request message GET /somedir/page.html HTTP/1.1 Host: User-agent: Mozilla/4.0 Connection: close Accept-language:fr request.
Appendix E: Overview of HTTP ©SoftMoore ConsultingSlide 1.
1 WWW. 2 World Wide Web Major application protocol used on the Internet Simple interface Two concepts –Point –Click.
Operating Systems Lesson 12. HTTP vs HTML HTML: hypertext markup language ◦ Definitions of tags that are added to Web documents to control their appearance.
2: Application Layer 1 Chapter 2: Application layer r 2.1 Principles of network applications  app architectures  app requirements r 2.2 Web and HTTP.
Web Technologies Lecture 1 The Internet and HTTP.
Web Services. 2 Internet Collection of physically interconnected computers. Messages decomposed into packets. Packets transmitted from source to destination.
27.1 Chapter 27 WWW and HTTP Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
CS 6401 The World Wide Web Outline Background Structure Protocols.
Overview of Servlets and JSP
Web Server Design Week 6 Old Dominion University Department of Computer Science CS 495/595 Spring 2010 Martin Klein 2/17/10.
Week 11: Application Layer 1 Web and HTTP r Web page consists of objects r Object can be HTML file, JPEG image, Java applet, audio file,… r Web page consists.
Simple Web Services. Internet Basics The Internet is based on a communication protocol named TCP (Transmission Control Protocol) TCP allows programs running.
Web Server Design Week 6 Old Dominion University Department of Computer Science CS 495/595 Spring 2006 Michael L. Nelson 2/13/06.
Lecture # 1 By: Aftab Alam Department Of Computer Science University Of Peshawar Internet Programming.
What’s Really Happening
Tiny http client and server
Block 5: An application layer protocol: HTTP
How HTTP Works Made by Manish Kushwaha.
Hypertext Transfer Protocol
HTTP – An overview.
The Hypertext Transfer Protocol
HTTP Protocol Specification
Hypertext Transfer Protocol
Hypertext Transport Protocol
HTTP Protocol.
HTTP Hypertext Transfer Protocol
Web Server Design Week 5 Old Dominion University
CS 5565 Network Architecture and Protocols
EE 122: HyperText Transfer Protocol (HTTP)
William Stallings Data and Computer Communications
Web Server Design Week 6 Old Dominion University
Web Server Design Week 5 Old Dominion University
HTTP Hypertext Transfer Protocol
Web Server Design Week 6 Old Dominion University
CSCI-351 Data communication and Networks
Web Server Design Assignment #5 Extra Credit
Web Server Design Week 7 Old Dominion University
Web Server Design Week 7 Old Dominion University
Presentation transcript:

Building a Simple Web Proxy COS 461 Assignment 1

A Brief History of HTTP Mar 1989 - "Information Management: A Proposal" Oct 1990 - "WorldWideWeb" coined Oct 1994 - W3C founded May 1996 - RFC 1945 (HTTP 1.0) June 1999 - RFC 2616 (HTTP 1.1)

Anatomy of HTTP 1.0 Web Client Connect: Request Web Server GET / HTTP/1.0 Host: www.yahoo.com CRLF The Hypertext Transfer Protocol or (HTTP) is the protocol used for communication on this web. That is, it is the protocol which defines how your web browser requests resources from a web server and how the server responds. For simplicity, in this assignment we will be dealing only with version 1.0 of the HTTP protocol, defined in detail in RFC 1945. You should read through this RFC and refer back to it when deciding on the behavior of your proxy. HTTP communications happen in the form of transactions, a transaction consists of a client sending a request to a server and then reading the response. As a protocol, it is essentially stateless in that each transaction is, at the protocol level, idempotent and independent of the rest. HTTP does not remember, track, or ensure the ordering of a series of transactions. Each transaction is byte-level serialized since HTTP rides over reliable TCP streams, ensuring the proper deliver or request and response data, but the interaction among transactions must be resolved and defined by higher level mechanisms/applications. The protocol itself is defined completely in cleartext as an ASCII based protocol. However, the message body can be any arbitrary byte stream as specified by the Content-Type: header. HTTP is basically a line-delimited protocol (CRLF) Request and response messages share a common basic format:An initial line (a request or response line, as defined below)Zero or more header linesA blank line (CRLF)An optional message body. For most common HTTP transactions, the protocol boils down to a relatively simple series of steps (important sections of RFC 1945 are in parenthesis):A client creates a connection to the server.The client issues a request by sending a line of text to the server. This request line consists of a HTTP method (most often GET, but POST, PUT, and others are possible), a request URI (like a URL), and the protocol version that the client wants to use (HTTP/1.0). The message body of the initial request is typically empty. (5.1-5.2, 8.1-8.3, 10, D.1)The server sends a response message, with its initial line consisting of a status line, indicating if the request was successful. The status line consists of the HTTP version (HTTP/1.0), a response status code (a numerical value that indicates whether or not the request was completed successfully), and a reason phrase, an English-language message providing description of the status code. Just as with the the request message, there can be as many or as few header fields in the response as the server wants to return. Following the CRLF field separator, the message body contains the data requested by the client in the event of a successful request. (6.1-6.2, 9.1-9.5, 10)Once the server has returned the response to the client, it closes the connection. Response: Close HTTP/1.0 200 OK Date: Tue, 16 Feb 2010 19:21:24 GMT Content-Type: text/html; CRLF <html><head><title>Yahoo!</title>

Anatomy of HTTP 1.0 Web Client Connect: Request Web Server Request Line Request Header Request Delimiter GET / HTTP/1.0 Host: www.yahoo.com CRLF The Hypertext Transfer Protocol or (HTTP) is the protocol used for communication on this web. That is, it is the protocol which defines how your web browser requests resources from a web server and how the server responds. For simplicity, in this assignment we will be dealing only with version 1.0 of the HTTP protocol, defined in detail in RFC 1945. You should read through this RFC and refer back to it when deciding on the behavior of your proxy. HTTP communications happen in the form of transactions, a transaction consists of a client sending a request to a server and then reading the response. As a protocol, it is essentially stateless in that each transaction is, at the protocol level, idempotent and independent of the rest. HTTP does not remember, track, or ensure the ordering of a series of transactions. Each transaction is byte-level serialized since HTTP rides over reliable TCP streams, ensuring the proper deliver or request and response data, but the interaction among transactions must be resolved and defined by higher level mechanisms/applications. The protocol itself is defined completely in cleartext as an ASCII based protocol. However, the message body can be any arbitrary byte stream as specified by the Content-Type: header. HTTP is basically a line-delimited protocol (CRLF) Request and response messages share a common basic format:An initial line (a request or response line, as defined below)Zero or more header linesA blank line (CRLF)An optional message body. For most common HTTP transactions, the protocol boils down to a relatively simple series of steps (important sections of RFC 1945 are in parenthesis):A client creates a connection to the server.The client issues a request by sending a line of text to the server. This request line consists of a HTTP method (most often GET, but POST, PUT, and others are possible), a request URI (like a URL), and the protocol version that the client wants to use (HTTP/1.0). The message body of the initial request is typically empty. (5.1-5.2, 8.1-8.3, 10, D.1)The server sends a response message, with its initial line consisting of a status line, indicating if the request was successful. The status line consists of the HTTP version (HTTP/1.0), a response status code (a numerical value that indicates whether or not the request was completed successfully), and a reason phrase, an English-language message providing description of the status code. Just as with the the request message, there can be as many or as few header fields in the response as the server wants to return. Following the CRLF field separator, the message body contains the data requested by the client in the event of a successful request. (6.1-6.2, 9.1-9.5, 10)Once the server has returned the response to the client, it closes the connection. Response: Close Response Status Response Header Response Delimiter Response Body HTTP/1.0 200 OK Date: Tue, 16 Feb 2010 19:21:24 GMT Content-Type: text/html; CRLF <html><head><title>Yahoo!</title>

HTTP 1.1 vs 1.0 Additional Methods (PUT, DELETE, TRACE, CONNECT + GET, HEAD, POST) Additional Headers Transfer Coding (chunk encoding) Persistent Connections (content-length matters) Request Pipelining

Why Use a Proxy? Caching Content Filtering Privacy Ordinarily, HTTP is a client-server protocol. The client (usually your web browser) communicates directly with the server (the web server software). However, in some circumstances it may be useful to introduce an intermediate entity called a proxy. Conceptually, the proxy sits between the client and the server. In the simplest case, instead of sending requests directly to the server the client sends all its requests to the proxy. The proxy then opens a connection to the server, and passes on the client's request. The proxy receives the reply from the server, and then sends that reply back to the client. Notice that the proxy is essentially acting like both a HTTP client (to the remote server) and a HTTP server (to the initial client). Why use a proxy? There are a few possible reasons: Performance: By saving a copy of the pages that it fetches, a proxy can reduce the need to create connections to remote servers. This can reduce the overall delay involved in retrieving a page, particularly if a server is remote or under heavy load - See HashCache for developing worlds, and Codeen (and other CDN’s). Content Filtering and Transformation: While in the simplest case the proxy merely fetches a resource without inspecting it, there is nothing that says that a proxy is limited to blindly fetching and serving files. The proxy can inspect the requested URL and selectively block access to certain domains, reformat web pages (for instances, by stripping out images to make a page easier to display on a handheld or other limited-resource client), or perform other transformations and filtering. Privacy: Normally, web servers log all incoming requests for resources. This information typically includes at least the IP address of the client, the browser or other client program that they are using (called the User-Agent), the date and time, and the requested file. If a client does not wish to have this personally identifiable information recorded, routing HTTP requests through a proxy is one solution. All requests coming from clients using the same proxy appear to come from the IP address and User-Agent of the proxy itself, rather than the individual clients. If a number of clients use the same proxy (say, an entire business or university), it becomes much harder to link a particular HTTP transaction to a single computer or individual. Content Filtering Privacy

Building a Simple Web Proxy Forward client requests to the remote server and return response to the client Handle HTTP 1.0 (GET) Multi-process, non-caching web proxy ./proxy <port> The BasicsYour first task is to build a basic web proxy capable of accepting HTTP requests, forwarding requests to remote (origin) servers, and returning response data to a client. The proxy does NOT need to handle concurrent requests, i.e. no need for threaded, forked, or event-based non-blocking operation. Rather, the proxy should handle requests sequentially. You will only be responsible for implementing the GET method. All other request methods received by the proxy should elicit a "Not Implemented" (501) error (see RFC 1945 section 9.5 - Server Error).This assignment can be completed in either C or C++. It should compile and run (using g++) without errors or warnings from the FC 010 cluster, producing a binary called proxy that takes as its first argument a port to listen from. Don't use a hard-coded port number.You shouldn't assume that your server will be running on a particular IP address, or that clients will be coming from a pre-determined IP. When your proxy starts, the first thing that it will need to do is establish a socket connection that it can use to listen for incoming connections. Your proxy should listen on the port specified from the command line and wait for incoming client connections.

Handling Requests What you need from a client request: host, port, and URI path GET http://www.princeton.edu:80/ HTTP/1.0 What you send to a remote server: GET / HTTP/1.0 Host: www.princeton.edu:80 Connection: close Check request line and header format Once a client has connected, the proxy should read data from the client and then check for a properly-formatted HTTP request. Specifically, the proxy should ensure that the request contains a valid request line:<METHOD> <URL or PATH> <HTTP VERSION>And a Host header, if the specified resource is a PATH:Host: <HOSTNAME>All other headers just need to be properly fowrmatted:<HEADER NAME>: <HEADER VALUE>An invalid request from the client should be answered with an appropriate error code, i.e. "Bad Request" (400) or "Not Implemented" (501) for valid HTTP methods other than GET. See the note on network programming for more guidelines on how to handle real world clients and semi-valid requests. Once the proxy sees a valid HTTP request, it will need to parse the requested URL. The proxy needs at most three pieces of information: the requested host and port, and the requested path. See the URL (7) manual page for more info. You will need to parse the URL (full or absolute path) specified in the request line. Note that since an absolute path request, i.e. (/) does not include a hostname it must include the Host header in addition to the standard request line. Otherwise the proxy will not know where to retrieve the original resource. If the hostname, indicated in either the full URL or in the Host header, does not have a port specified, use the default HTTP port 80. Once the proxy has parsed the URL, it can make a connection to the requested host (using the appropriate remote port, or the default of 80 if none is specified) and send the HTTP request for the appropriate resource. The proxy should always send the request in the absolute path + Host header format regardless of how the request was received from the client: Accept from client:GET http://www.princeton.edu/ HTTP/1.0orGET / HTTP/1.0host: www.princeton.eduSend to remote server:GET / HTTP/1.0host: www.princeton.edu(Additional client specified headers, if any...)

Handling Responses Web Client Web Server Simple Proxy Parse Request: Host, Port, Path Simple Proxy Returning Data to the ClientAfter the response from the remote server is received, the proxy should send the response message (as-is) to the client via the appropriate socket. Once the transaction is complete, the proxy should close the connection to the client. Note: the proxy should terminate the connection to the remote server once the response has been fully received. For HTTP 1.0, the remote server will terminate the connection once the transaction is complete. Forward Response to Client Including Errors

Handling Errors Method != GET: Not Implemented (501) Unparseable request: Bad Request (400) Parse using the Parsing library Postel’s law: Be liberal in what you accept, and conservative in what you send convert HTTP 1.1 request to HTTP 1.0 convert \r to \r\n etc...

Testing Your Proxy Telnet to your proxy and issue a request > ./proxy 5000 > telnet localhost 5000 Trying 127.0.0.1... Connected to localhost.localdomain (127.0.0.1). Escape character is '^]'. GET http://www.google.com/ HTTP/1.0 (HTTP response...) Direct your browser to use your proxy Use the supplied proxy_tester.py and proxy_tester_conc.py Testing Your ProxyRun your client with the following command:./proxy <port>, where port is the port number that the proxy should listen on. As a basic test of functionality, try requesting a page using telnet:telnet localhost <port>Trying 127.0.0.1...Connected to localhost.localdomain (127.0.0.1).Escape character is '^]'.GET http://www.google.com/ HTTP/1.0If your proxy is working correctly, the headers and HTML of the Google homepage should be displayed on your terminal screen. Notice here that we request the full URL (http://www.google.com/) instead of just the absolute path (/). Your proxy should support both of these formats, with the latter type including a Host header. A good sanity check would be to compare the HTTP response (headers and body) obtained via your proxy with the response from a direct telnet connection to the remote server.

Proxy Guidance Assignment page Assignment FAQ RFC 1945 (HTTP 1.0) Google, wikipedia, man pages Must build on Friend 010 machines