Introduction to Web Science Michael L. Nelson CS 495/595 Old Dominion University This work is licensed under a Creative Commons Attribution-NonCommercial-

1 Introduction to Web Science Michael L. Nelson CS 495/595 Old Dominion University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This course is based on…

2 Introduction to Web Science Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License

3 Attribution Based on Frank McCown's Spring 2013 offering of CS 475: – http://www.harding.edu/fmccown/classes/comp475-s13/ Additional slides by me, with at least two guest lectures with their own material – Sep 4: Python – Sep 11: R

4 all that you need to know about me

5 First Job: NASA Langley Research Center

6 WS-DL Research Group http://ws-dl.cs.odu.edu/

7

8 Class Administrivia Course home page: – http://www.cs.odu.edu/~mln/teaching/cs595-f14/ Class email list – http://groups.google.com/group/cs595-f14 – please add yourself Class github – https://github.com/phonedude/cs595-f14 Readings, syllabus, schedule, etc. are all available from the class home page

9 Recommended Books

10 Grading Ten written assignments based on exploration of topics presented in class – 10 points for each assignment Possible extra credit at the end of the semester

11 Written Assignments Written like a report… – Grad: necessary to use LaTeX and graphs must be done in R or Gnuplot – Ugrad: MS Word, MS Excel graphs are acceptable – where necessary: references, screen shots, copy-n-paste of code output, etc. Feel free to use existing code, consult with others, etc. – but you MUST document / reference where you adapted the code, advice, etc. of others. Give credit where credit is due! – use without attribution is plagiarism! Due at beginning of class: – upload code, report, data files, etc. to fork of: https://github.com/phonedude/cs595-f14

12 See Also… Python, recommendations, clustering, filtering – http://www.cs.odu.edu/~hany/teaching/cs495-f12/ All things HTTP – http://www.cs.odu.edu/~mln/teaching/cs595-s12/ Information retrieval, metadata – http://www.cs.odu.edu/~mln/teaching/cs895-s14/ – http://www.cs.odu.edu/~mln/teaching/cs751-s11/ Visualization, Analytics – http://www.cs.odu.edu/~mweigle/CS725-F13/Home – http://www.cs.odu.edu/~mweigle/CS795-S13/Home Web programming, LAMP – http://www.cs.odu.edu/~mweigle/CS418-F12/Home

13 we now begin our feature presentation

14 What is Web Science? http://mags.acm.org/communications/200807/#pg1 Web Science is the interdisciplinary study of the Web as an entity and phenomenon. It includes studies of the Web’s properties, protocols, algorithms, and societal effects.

15 Background Web Science initiative launched in Nov 2006 by University of Southampton and MIT. Pictured: Nigel Shadbolt, Tim Berners-Lee, Wendy Hall, James Hendler, Daniel Weitzner. Images from http://webscience.org/people.html

16 Degrees in Web Science Undergrad BS in Information Technology and Web Science at Rensselaer Polytechnic Institute MS degree in Web Science at – University of Southampton – University of San Francisco – Aristotle University of Thessaloniki – RPI (Web Science emphasis) Not many, but remember, it’s a new science!

17 Communications of the ACM July 2008

18 “Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology, and economics. We invite computer scientists to expand the discipline by addressing the challenges following from the widespread adoption of the Web and its profound influence on social structures, political systems, commercial organizations, and educational institutions.”

19 Web Science is Interdisciplinary O'Hara and Hall, Web Science, ALT Online Newsletter, May 6, 2008

20 Some questions of study: How is the Web structured? What is its size? How can unstructured data mined from the Web be combined in meaningful ways? How does information/misinformation spread on the Web? How can we discover its origin? How can the Web be used to effectively harness the collective intelligence of its users? How can trust be measured on the Web? How can privacy be maintained on the Web? What do events gathered from online social networks tell us about the human condition? Has the Web changed how humans think? Why is this important? Huge implications for web search!

21 How is the Web structured? Graph Theory: Pages are nodes & links are directed edges
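The pages-as-nodes, links-as-directed-edges view can be sketched in a few lines of Python. This is only an illustration: the page names and the `web_graph` dictionary are made up, and a real web graph would be far too large for a plain dict.

```python
# Toy web graph: each page maps to the list of pages it links to.
# All page names here are hypothetical.
web_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}


def in_link_counts(graph):
    """Count how many incoming links each page has."""
    counts = {page: 0 for page in graph}
    for source, targets in graph.items():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


counts = in_link_counts(web_graph)
print(counts)
```

Counting in-links this way is the starting point for the in-link distributions discussed on the following slides.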

22 Web Graph [Figure: two plots of total Web pages vs. number of in-links. Random graph: normal/Gaussian distribution. Typical Web graph: power-law distribution.]

23 Small World Network Six degrees of separation Most pages are not neighbors but most pages can be reached from others by a small number of hops Many hubs- pages with many inlinks Robust for random node deletions Other examples: road maps, networks of brain neurons, voter networks, and social networks
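The "small number of hops" idea can be made concrete with a breadth-first search over a directed link graph. The sketch below uses a made-up `toy_graph` with one hub page; the function name `hops` is an assumption for illustration, not something from the course material.

```python
from collections import deque


def hops(graph, start, goal):
    """Return the fewest link clicks from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical graph: "hub" links to many pages, and most pages link back to it.
toy_graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "d"],
    "d": [],
}

print(hops(toy_graph, "a", "d"))  # a -> hub -> c -> d
print(hops(toy_graph, "d", "a"))  # d has no out-links
```

Because edges are directed, a path from one page to another does not imply a path back, which is why the second call finds nothing.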

24 Six Degrees of Kevin Bacon

25 I have a Bacon Number of 3!

26 Bow-Tie Structure of the Web Broder et al. (Graph Structure of the Web, 2000) Examined a large web graph (200M pages, 1.5B links) 17 Million nodes

27 Bow-Tie Structure For 75% of page pairs, no directed path exists from one page to the other. Average distance is 16 clicks when a directed path exists and 7 clicks when an undirected path exists. Diameter of the SCC (maximum shortest distance between any two nodes) is at least 28. Diameter of the entire Web is at least 500 (most distant node in IN to OUT). Broder et al., Graph Structure of the Web, 2000

28 Web Structure’s Implications If we want to discover every web page on the Web, it’s impossible since there are many pages that aren’t linked to Finding popular pages is easy, but finding pages with few in-links (the long tail) is more difficult How do we know when new pages are added to the Web or removed? Incoming links could tell us something about the “importance” of a page when searching the Web for information (e.g., PageRank) Link structure of the Web can be artificially manipulated

29 How large is the Web? 1 trillion unique URLs

30 How did Google discover all these URLs? By crawling the web

31 Web Crawler Web crawlers are used to fetch a page, place all the page’s links in a queue, and continue the process for each URL in the queue Figure: McCown, Lazy Preservation: Reconstructing Websites from the Web Infrastructure, Dissertation, 2007
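The fetch, extract links, enqueue loop described above can be sketched as follows. To keep the example runnable offline, `FAKE_WEB` is a hypothetical in-memory stand-in for the network and `fetch` simply looks pages up in it; a real crawler would issue HTTP requests and apply a politeness policy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical pages standing in for real HTTP responses.
FAKE_WEB = {
    "http://example.org/": '<a href="/about.html">About</a> <a href="news.html">News</a>',
    "http://example.org/about.html": '<a href="/">Home</a>',
    "http://example.org/news.html": '<a href="/about.html">About</a> <a href="/missing.html">Gone</a>',
}


def fetch(url):
    """Return the page body, or None if the page does not exist."""
    return FAKE_WEB.get(url)


def crawl(seed, max_pages=100):
    """Breadth-first crawl from seed; returns URLs in the order fetched."""
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


print(crawl("http://example.org/"))
```

The `seen` set prevents the same URL from being queued twice, and `max_pages` is one simple way to bound a crawl over a potentially unbounded space.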

32 Problems with Web Crawling Slow, because crawlers limit how frequently they make requests to the same server (politeness policy). Many pages are disconnected from the SCC, password-protected, or protected by robots.txt. There are an infinite number of pages (e.g., calendars), so crawlers limit how deeply they crawl. Web pages are continually being added and removed. Deep web: many pages are only accessible behind a web form (e.g., US patent database). The deep web is orders of magnitude larger than the surface web, and a 2006 study¹ shows only 1/3 is indexed by the big three search engines. ¹ He et al., Accessing the deep web, CACM 2007
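As a small illustration of robots.txt protection and politeness, Python's standard `urllib.robotparser` can evaluate a robots.txt file. The rules below, the `ExampleBot` user agent, and the example.org URLs are all made up for this sketch; the file is parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

allowed = rules.can_fetch("ExampleBot", "http://example.org/index.html")
blocked = rules.can_fetch("ExampleBot", "http://example.org/private/x.html")
delay = rules.crawl_delay("ExampleBot")  # seconds to wait between requests

print(allowed, blocked, delay)
```

A polite crawler would check `can_fetch` before every request and sleep for at least the advertised delay between requests to the same server.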

33 Deep Web != Dynamic, Queries, or Personalized not deep web: http://oracleofbacon.org/movielinks.php?game=0&a=Kevin+Bacon&b=Seamus+O%27Regan&use_using=1&u0=on&u1=on&use_genres=1&g0=on&g4=on&g8=on&g16=on&g20=on&g24=on&g1=on&g5=on&g9=on&g13=on&g17=on&g21=on&g25=on&g2=on&g6=on&g10=on&g14=on&g18=on&g22=on&g26=on&g3=on&g7=on&g11=on&g15=on&g23=on&g27=on deep web: http://awoiaf.westeros.org/index.php/Special:Export (or more accurately, the 1000s of XML files available from this same URI are in the deep web)

34 What Counts? Many duplicate pages (30% of web pages are duplicates or near-duplicates¹) – How do we efficiently compare across a large corpus? Some pages change every time they are requested – How can we automatically determine what is an insignificant difference? Many spammy pages (14% of web pages²) – How can we detect these? ¹ Fetterly et al., On the evolution of clusters of near-duplicate web pages, J of Web Eng, 2004 ² Ntoulas et al., Detecting spam web pages through content analysis, WWW 2006
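One common way to compare pages for near-duplication is to break each page into overlapping word sequences (shingles) and measure set overlap with the Jaccard coefficient. The sketch below is a generic illustration of that idea, not a claim about the exact method used in the cited studies; the function names and sample sentences are invented.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

similarity = jaccard(shingles(doc1), shingles(doc2))
print(similarity)
```

Comparing every pair of pages this way does not scale to a large corpus, which is the efficiency question the slide raises; production systems typically compare compact sketches of the shingle sets instead.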

35 Duplicates & Near Duplicates

36 Some Observations Crawling a significant amount of the Web is hard Different search engines have different pages indexed, but they don’t share these differences with each other (company secret) So if we wanted to estimate the Web’s size but don’t want to try to crawl the Web ourselves, could we use the search engines themselves to estimate the Web’s size? (note: working with the web == working with stats)

37 Capture-Recapture Method Statistical method used to estimate population size (originally fish and wildlife populations) Example: How many fish are in the lake? – Catch $S_1$ fish from the lake, tag them, and return them to the lake – Then catch and put back $S_2$ fish, noting which are tagged ($S_{1,2}$) – $S_1 / N = S_{1,2} / S_2$, so population $N = S_1 \times S_2 / S_{1,2}$ [Figure labels: $N$, $S_1$, $S_2$, $S_{1,2}$]

38 Estimate Web Population Lawrence and Giles¹ used capture-recapture method to estimate web page population – Submitted 575 queries to sets of 2 search engines – $S_1$ = All pages returned by SE1 – $S_2$ = All pages returned by SE2 – $S_{1,2}$ = All pages returned by both SE1 and SE2 – Size of indexable Web $N = S_1 \times S_2 / S_{1,2}$ Estimated size of indexable Web in 1998 = 320 million pages Recent measurements using similar methods find lower bound of 21 billion pages² ¹ Lawrence & Giles, Searching the World Wide Web, Science, 1998 ² http://www.worldwidewebsize.com/
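The capture-recapture estimate is a one-line formula, sketched here in Python. The fish counts and the two URL sets are invented numbers for illustration only; they are not data from Lawrence and Giles.

```python
def estimate_population(s1, s2, overlap):
    """Capture-recapture estimate: N = S1 * S2 / S1,2."""
    if overlap <= 0:
        raise ValueError("overlap must be positive to estimate N")
    return s1 * s2 / overlap


def estimate_from_sets(sample1, sample2):
    """Apply the same estimate to two sets of captured items (e.g., URLs)."""
    return estimate_population(len(sample1), len(sample2), len(sample1 & sample2))


# Made-up lake: tag 100 fish, then catch 80, of which 20 carry tags.
fish_estimate = estimate_population(100, 80, 20)

# Made-up search engine results for the same query.
se1 = {"u1", "u2", "u3", "u4"}
se2 = {"u3", "u4", "u5"}
url_estimate = estimate_from_sets(se1, se2)

print(fish_estimate, url_estimate)
```

The estimate assumes the two samples are drawn independently; search engine indexes tend to overlap on popular pages, so estimates obtained this way are usually treated as rough lower bounds.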

39 This is just a sample of Web Science that we will be examining from a computing perspective.

40 More About Web Science Video: Nigel Shadbolt on Web Science (2008) http://webscience.org/professor-nigel-shadbolt-explains-web-science/ Slides: “What is Web Science?” by Carr, Pope, Hall, Shadbolt (2008) http://www.slideshare.net/lescarr/what-is-web-science

41 Web Architecture

42 Internet != Web [Figure: the Internet carries the Web, VoIP, email, IM, streaming video, and file transfer]

43 “The Internet is a global system of interconnected computer networks that use the standard Internet Protocol Suite (TCP/IP) to serve billions of users worldwide.” http://en.wikipedia.org/wiki/Internet [Figure: Computer 1 (255.254.253.252) and Computer 2 (1.2.3.4) connected through the Internet]

44 Internet Protocol Suite Internet Protocol (IP): directs packets to a specific computer using an IP address Transmission Control Protocol (TCP): directs packets to a specific application on a computer using a port number. – Common port numbers: 22 – ssh, 23 – telnet, 25 – email, 80 – Web

45 Overview of the Web [Figure: Client – Web Browser (255.254.253.252) connected over the Internet to a Web Server (1.2.3.4)] World Wide Web: The system of interlinked hypertext documents accessed over the Internet using the HTTP protocol.

46 Hypertext Transfer Protocol (HTTP) HTTP is the set of rules that govern communication between web browsers and web servers. Client Request GET /comp/ HTTP/1.1 Host: www.harding.edu Server Response HTTP/1.1 200 OK Content-Length: 6018 Content-Type: text/html Content-Location: http://www.harding.edu/comp/ Last-Modified: Mon, 05 Jul 2010 18:49:40 GMT Server: Microsoft-IIS/6.0 Harding University - Computer Science Example request for http://www.harding.edu/comp/
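To make the request and response format concrete, here is a minimal sketch that builds the bytes of an HTTP/1.1 request and parses the head of a response. The helper names are hypothetical, the sample response is abbreviated from the one above, and the parser assumes a well-formed status line with a reason phrase; it is not a full HTTP implementation.

```python
def build_request(method, path, host):
    """Return the raw bytes of a simple HTTP/1.1 request with no body."""
    lines = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw):
    """Split a raw response into (status code, reason phrase, headers dict)."""
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers


request = build_request("GET", "/comp/", "www.harding.edu")

sample_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 6018\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>...</html>"
)
status, reason, headers = parse_response_head(sample_response)
print(request)
print(status, reason, headers)
```

Header names are lower-cased in the dictionary because HTTP header field names are case-insensitive.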

47 Domain Name System (DNS) DNS is a hierarchical look-up service that converts a given hostname into its equivalent IP address. DNS server table: www.google.com → 1.4.5.8, www.cnn.com → 4.6.2.8, www.hulu.com → 6.7.8.9, etc. Example lookup: www.harding.edu → 128.82.4.20. DNS servers contact parent servers for missing entries. Authoritative name servers are responsible for specific domains. Warning: DNS cache poisoning

48 Example: Web Page Request for http://foo.org/bar.html, between Client (Web Browser), DNS, and Web Server: (1) Enter URL (2) Client asks DNS for foo.org (3) DNS returns 1.2.3.4 (4) HTTP GET bar.html (5) Server locates the resource (6) HTTP Response (7) Client parses HTML & displays (8) HTTP GET image1 … (N) HTTP GET imageX. Potentially many requests & responses

49 More Formal Definitions HTTP defined by Request for Comments (RFCs) 1945, 2068, and 2616 Other RFCs for defining URLs (1736, 1738), URIs (1630, 2396), etc. Web architecture defined in W3C’s The Architecture of the World Wide Web, Volume One – http://www.w3.org/TR/webarch/

50 Web Definition “The World Wide Web (WWW, or simply Web) is an information space in which the items of interest, referred to as resources, are identified by global identifiers called Uniform Resource Identifiers (URI).” http://www.w3.org/TR/webarch/

51 URIs, Resources, and Representations http://www.w3.org/TR/webarch/ URIs identify Resources Representations represent Resources When URIs are dereferenced, they return representations (not resources) Different representations may be returned for the same URI (e.g., English vs. French version)

52 Remember Three Things [Figure: URIs (e.g., http://www.cs.odu.edu/~mln/) identify Resources; Representations (e.g., the page “Home:: Michael L. Nelson, Old Dominion University …”) represent Resources]

53 What’s a URx? Figure: http://en.wikipedia.org/wiki/File:URI_Euler_Diagram_no_lone_URIs.svg URI - String of chars used to identify a name or resource on the Internet URL - Where to find a resource URN - Name of a resource

54 URI Components
foo://username:password@example.com:8042/over/there/index.dtb;type=animal?name=ferret#nose
– scheme: foo
– authority: username:password@example.com:8042 (userinfo: username:password; hostname: example.com; port: 8042)
– path: /over/there/index.dtb;type=animal (index interpretable as filename; dtb interpretable as extension; parameter: type=animal)
– query: name=ferret
– fragment: nose
urn:example:animal:ferret:nose
– scheme: urn
– path: example:animal:ferret:nose
Figure: http://en.wikipedia.org/wiki/URI_scheme
Other examples: http://example.org/absolute/path/to/resource.txt, ftp://example.org/resource.txt, urn:issn:1535-3613
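The same decomposition can be reproduced with Python's standard `urllib.parse.urlsplit`, shown here as a quick sketch on the example URI from this slide. Note that `urlsplit` leaves the `;type=animal` parameter attached to the path rather than splitting it out.

```python
from urllib.parse import urlsplit

uri = ("foo://username:password@example.com:8042"
       "/over/there/index.dtb;type=animal?name=ferret#nose")
parts = urlsplit(uri)

print(parts.scheme)    # scheme
print(parts.netloc)    # authority
print(parts.username, parts.password, parts.hostname, parts.port)
print(parts.path)      # path, including the ;type=animal parameter
print(parts.query)     # query
print(parts.fragment)  # fragment

# A URN has a scheme and a path but no authority.
urn = urlsplit("urn:example:animal:ferret:nose")
print(urn.scheme, urn.netloc, urn.path)
```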

55 "The URI Is The Thing" http://blog.whatfettle.com/2008/10/06/the-uri-is-the-thing/

56 Talking to HTTP servers… % curl --head http://www.cs.odu.edu/~mln/ HTTP/1.1 200 OK Date: Mon, 12 Jan 2009 15:44:19 GMT Server: Apache/2.2.0 Last-Modified: Fri, 09 Jan 2009 17:18:37 GMT ETag: "88849-1c71-f28dd540" Accept-Ranges: bytes Content-Length: 7281 Content-Type: text/html % curl -I http://www.google.com/ HTTP/1.1 200 OK Cache-Control: private, max-age=0 Date: Mon, 12 Jan 2009 15:45:57 GMT Expires: -1 Content-Type: text/html; charset=ISO-8859-1 Set-Cookie: PREF=ID=9a80d3f602b685f3:TM=1231775157:LM=1231775157:S=imGxRyNsTD0Zczm5; expires=Wed, 12-Jan-2011 15:45:57 GMT; path=/; domain=.google.com Server: gws Content-Length: 0 “curl” is convenient, but speaking raw HTTP is more fun…

57 GET AIHT:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. GET /~mln/index.html HTTP/1.1 Connection: close Host: www.cs.odu.edu HTTP/1.1 200 OK Date: Mon, 09 Jan 2006 17:07:04 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Last-Modified: Sun, 29 May 2005 02:46:53 GMT ETag: "1c52-14ed-42992d1d" Accept-Ranges: bytes Content-Length: 5357 Connection: close Content-Type: text/html Home Page for Michael L. Nelson <!-- [lots of html deleted] Connection closed by foreign host. Request (ends w/ CRLF) Response

58 HEAD AIHT:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/index.html HTTP/1.1 Connection: close Host: www.cs.odu.edu HTTP/1.1 200 OK Date: Mon, 09 Jan 2006 17:14:39 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Last-Modified: Sun, 29 May 2005 02:46:53 GMT ETag: "1c52-14ed-42992d1d" Accept-Ranges: bytes Content-Length: 5357 Connection: close Content-Type: text/html Connection closed by foreign host.

59 OPTIONS (many methods) AIHT:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. OPTIONS /~mln/index.html HTTP/1.1 Connection: close Host: www.cs.odu.edu HTTP/1.1 200 OK Date: Mon, 09 Jan 2006 17:16:46 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Length: 0 Allow: GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, PATCH, PROPFIND, PROPPATCH, MKCOL, COPY, MOVE, LOCK, UNLOCK, TRACE Connection: close Connection closed by foreign host.

60 OPTIONS (fewer methods) % telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. OPTIONS /~mln/index.html HTTP/1.1 Connection: close Host: www.cs.odu.edu HTTP/1.1 200 OK Date: Tue, 10 Jan 2012 17:26:44 GMT Server: Apache/2.2.17 (Unix) PHP/5.3.5 mod_ssl/2.2.17 OpenSSL/0.9.8q Allow: GET,HEAD,POST,OPTIONS Content-Length: 0 Connection: close Content-Type: text/html Connection closed by foreign host.

61 POST $ telnet awoiaf.westeros.org 80 Trying 108.162.197.188... Connected to awoiaf.westeros.org. Escape character is '^]'. POST /index.php/Special:Export HTTP/1.1 Host: awoiaf.westeros.org Content-type: text/plain Content-length: 10 123456789 HTTP/1.1 200 OK Server: cloudflare-nginx Date: Wed, 28 Aug 2013 15:25:53 GMT Content-Type: text/html; charset=utf-8 Content-language: en X-Frame-Options: DENY Vary: Accept-Encoding, Cookie [lots of headers deleted] [lot of html deleted]

62 Response Codes - 1xx: Informational - Request received, continuing process - 2xx: Success - The action was successfully received, understood, and accepted - 3xx: Redirection - Further action must be taken in order to complete the request - 4xx: Client Error - The request contains bad syntax or cannot be fulfilled - 5xx: Server Error - The server failed to fulfill an apparently valid request from section 6.1.1 of RFC 2616 not “error” codes!
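Since the class of a response code is determined by its first digit, a client can classify any code with integer division. The sketch below uses a hypothetical helper, `status_class`, with the class names listed above.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Map a three-digit HTTP status code to its class name."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```

This is also how a client can handle a code it has never seen before: treat it like the generic x00 code of its class.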

63 501 - Method Not Implemented AIHT:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. NOTAREALMETHOD /index.html HTTP/1.1 Connection: close Host: www.cs.odu.edu HTTP/1.1 501 Method Not Implemented Date: Mon, 09 Jan 2006 17:22:40 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Allow: GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, PATCH, PROPFIND, PROPPATCH, MKCOL, COPY, MOVE, LOCK, UNLOCK, TRACE Connection: close Transfer-Encoding: chunked Content-Type: text/html; charset=iso-8859-1 15f 501 Method Not Implemented Method Not Implemented NOTAREALMETHOD to /index.html not supported. Invalid method in request NOTAREALMETHOD /index.html HTTP/1.1 Apache/1.3.26 Server at www.cs.odu.edu Port 80 0 Connection closed by foreign host.

64 301 - Moved Permanently AIHT:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. GET /~mln HTTP/1.1 Connection: close Host: www.cs.odu.edu HTTP/1.1 301 Moved Permanently Date: Mon, 09 Jan 2006 19:32:24 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Location: http://www.cs.odu.edu/~mln/ Connection: close Transfer-Encoding: chunked Content-Type: text/html; charset=iso-8859-1 12e 301 Moved Permanently Moved Permanently The document has moved here. Apache/1.3.26 Server at www.cs.odu.edu Port 80 0 Connection closed by foreign host.

65 301- Moved Permanently % telnet bit.ly 80 Trying 69.58.188.39... Connected to bit.ly. Escape character is '^]'. HEAD http://bit.ly/s2FPFa HTTP/1.1 Host: bit.ly Connection: close HTTP/1.1 301 Moved Server: nginx Date: Tue, 10 Jan 2012 17:34:29 GMT Content-Type: text/html; charset=utf-8 Connection: close Set-Cookie: _bit=4f0c76a5-002b9-048b1-331cf10a;domain=.bit.ly; expires=Sun Jul 8 17:34:29 2012;path=/; HttpOnly Cache-control: private; max-age=90 Location: http://bit.ly/bundles/phonedude/e MIME-Version: 1.0 Content-Length: 125 the response code is REQUIRED; phrase is RECOMMENDED

66 302 - Found % telnet doi.acm.org 80 Trying 64.238.147.57... Connected to doi.acm.org. Escape character is '^]'. HEAD http://doi.acm.org/10.1145/1998076.1998100 HTTP/1.1 Host: doi.acm.org Connection: close HTTP/1.1 302 Found Date: Tue, 10 Jan 2012 17:53:36 GMT Server: Apache/2.2.3 (Red Hat) Location: http://dl.acm.org/citation.cfm?doid=1998076.1998100 Connection: close Content-Type: text/html; charset=iso-8859-1

67 303 - See Other % telnet dx.doi.org 80 Trying 38.100.138.149... Connected to dx.doi.org. Escape character is '^]'. HEAD http://dx.doi.org/10.1007/978-3-642-24469-8_16 HTTP/1.1 Host: dx.doi.org Connection: close HTTP/1.1 303 See Other Server: Apache-Coyote/1.1 Location: http://www.springerlink.com/index/10.1007/978-3-642-24469-8_16 Expires: Wed, 11 Jan 2012 12:04:29 GMT Content-Type: text/html;charset=utf-8 Content-Length: 210 Date: Tue, 10 Jan 2012 17:56:41 GMT Connection: close

68 404 - Not Found % telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /lasdkfjalsdkfjldaskfj HTTP/1.1 Host: www.cs.odu.edu Connection: close HTTP/1.1 404 Not Found Date: Tue, 10 Jan 2012 17:39:15 GMT Server: Apache/2.2.17 (Unix) PHP/5.3.5 mod_ssl/2.2.17 OpenSSL/0.9.8q Connection: close Content-Type: text/html; charset=iso-8859-1 Connection closed by foreign host.

69 401 - Unauthorized % telnet www4.cs.odu.edu 80 Trying 128.82.5.93... Connected to www4.cs.odu.edu. Escape character is '^]'. HEAD http://www4.cs.odu.edu/Conference/index.aspx HTTP/1.1 Host: www4.cs.odu.edu Connection: close HTTP/1.1 401 Unauthorized Content-Length: 1656 Content-Type: text/html Server: Microsoft-IIS/6.0 WWW-Authenticate: Basic realm="www4.cs.odu.edu" MicrosoftOfficeWebServer: 5.0_Pub X-Powered-By: ASP.NET Date: Tue, 10 Jan 2012 17:43:57 GMT Connection: close

70 400 - Bad Request % telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD http://www.cs.odu.edu/~mln/ HTTP/1.1 Connection: close HTTP/1.1 400 Bad Request Date: Tue, 10 Jan 2012 18:24:17 GMT Server: Apache/2.2.17 (Unix) PHP/5.3.5 mod_ssl/2.2.17 OpenSSL/0.9.8q Connection: close Content-Type: text/html; charset=iso-8859-1

71 503 - Service Unavailable % curl -i http://www.techsideline.com/ HTTP/1.0 503 Service Unavailable Content-Type: text/html Content-Length: 53 Expires: now Pragma: no-cache Cache-control: no-cache,no-store The service is not available. Please try again later.

72 Character Encoding HTML documents could be transmitted with 1 byte per character, but that only works for ASCII. What about everyone else? UTF-8 can represent every character in Unicode. UTF-8 is the dominant character encoding standard for the WWW. UTF-8 is variable-length – ASCII chars (values 0-127) use 1 byte; others use 2, 3, or 4 bytes depending on the code point. http://en.wikipedia.org/wiki/UTF-8
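The variable-length property is easy to check in Python by encoding a few characters and counting bytes. The four sample characters are arbitrary picks, one from each length class (UTF-8 uses between 1 and 4 bytes per code point).

```python
# One sample character per UTF-8 length class (written as escapes for clarity).
samples = {
    "A": "A",                  # U+0041, ASCII
    "e-acute": "\u00e9",       # U+00E9
    "euro sign": "\u20ac",     # U+20AC
    "emoji": "\U0001F600",     # U+1F600
}

byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(byte_lengths)

# ASCII text is byte-for-byte identical in UTF-8.
assert "Web".encode("utf-8") == "Web".encode("ascii")
```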

73 1 resource, N representations [Figure: a URI identifies a Resource; Representation 1 and Representation 2 each represent the Resource; labeled “Content Negotiation”] slide from Herbert Van de Sompel

74 Content Negotiation Server-side – server picks best representation (agent can pass in “hints” via Accept.* headers); RFC 2616, section 12.1; see Apache algorithm at: http://httpd.apache.org/docs/current/content-negotiation.html Agent-side – server sends a list to the agent and the agent picks a representation from it; RFC 2616, section 12.2 Transparent Negotiation – combination of server-side and agent-side performed by caches, proxies, etc.; mentioned in passing in RFC 2616; detailed in RFC 2295 – http://www.ietf.org/rfc/rfc2295.txt

75 Turning on Content Negotiation Tradeoff: flexibility vs. performance – “Cool URIs don’t change” http://www.w3.org/Provider/Style/URI – Turn off content negotiation for speed, e.g., http://webauthv3.stanford.edu/manual/misc/perf-tuning.html In Apache, content negotiation is turned on via: – Type-map file (*.var) – Options +MultiViews directive in httpd.conf or .htaccess file http://httpd.apache.org/docs/2.0/content-negotiation.html

76 How it Works If a direct match for the requested URI is found, then the entity is returned. If the current request would otherwise result in a 404, AND content negotiation is available for this resource, then content negotiation begins

77 Request Headers & Status Codes Request headers – Accept – Accept-Charset – Accept-Encoding – Accept-Language Response headers – Content-Location – Vary tells caches in which dimensions the response can vary – covered in RFC 2295 TCN, Alternates Status codes – 300 Multiple Choices – 406 Not Acceptable

78 Test Directory % cd ~mln/public_html/teaching/cs595-s06/a3-test % ls fairlane.gif index.html.de index.html.ja.jis type-map.example fairlane.jpeg index.html.en index.html.ko.euc-kr vt-uva.html.gz fairlane.png index.html.es index.html.ru.koi8-r vt-uva.html.Z % more .htaccess Options All +MultiViews note: no “index.html”

79 Accept dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/fairlane HTTP/1.1 Host: www.cs.odu.edu Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 04:04:22 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: fairlane.txt Vary: negotiate,accept TCN: choice Last-Modified: Mon, 13 Mar 2006 04:00:53 GMT ETag: "2288-c1-4414ee75;4414ee7a" Accept-Ranges: bytes Content-Length: 193 Connection: close Content-Type: text/plain Connection closed by foreign host. http://www.cs.odu.edu/~mln/teaching/cs595-s06/a3-test/fairlane.txt note: structured ETag

80 Accept dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/fairlane HTTP/1.1 Host: www.cs.odu.edu Accept: image/*; q=1.0 Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 04:06:45 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: fairlane.jpeg Vary: negotiate,accept TCN: choice Last-Modified: Sun, 12 Mar 2006 17:37:16 GMT ETag: "3b64bd-9639-44145c4c;4414ee7a" Accept-Ranges: bytes Content-Length: 38457 Connection: close Content-Type: image/jpeg Connection closed by foreign host.

81 Accept dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/fairlane HTTP/1.1 Host: www.cs.odu.edu Accept: image/png; q=1.0, image/gif; q=0.5, image/jpeg; q=0.1 Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 02:30:44 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: fairlane.png Vary: negotiate,accept TCN: choice Last-Modified: Sun, 12 Mar 2006 17:37:31 GMT ETag: "3b64bf-17f9b-44145c5b;4414d473" Accept-Ranges: bytes Content-Length: 98203 Connection: close Content-Type: image/png Connection closed by foreign host.

82 Accept dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/fairlane HTTP/1.1 Host: www.cs.odu.edu Accept: image/tiff; q=1.0, image/gif; q=0.001 Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 02:37:10 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: fairlane.gif Vary: negotiate,accept TCN: choice Last-Modified: Sun, 12 Mar 2006 20:46:10 GMT ETag: "3b64c0-c28b-44148892;4414d473" Accept-Ranges: bytes Content-Length: 49803 Connection: close Content-Type: image/gif Connection closed by foreign host.
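The Accept transcripts above can be approximated with a short sketch of server-side negotiation: parse the q-values, score each variant, and pick the best. This is a deliberate simplification and not Apache's actual algorithm; in particular, breaking ties by smallest content length is an assumption chosen because it happens to reproduce the responses shown. The variant table is taken from the `Alternates` headers in these slides, and all function names are hypothetical.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media range, q) pairs."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs.append((parts[0], q))
    return prefs


def quality_for(media_type, prefs):
    """Return the q-value of the most specific media range that matches."""
    major = media_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in prefs:
        if media_range == media_type:
            specificity = 2
        elif media_range == major + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def negotiate(accept_header, variants):
    """Pick a variant name, or None when nothing is acceptable (a 406)."""
    prefs = parse_accept(accept_header) if accept_header else [("*/*", 1.0)]
    scored = []
    for name, (media_type, length) in variants.items():
        q = quality_for(media_type, prefs)
        if q > 0:
            scored.append((-q, length, name))  # highest q first, then smallest
    return min(scored)[2] if scored else None


# Variants and lengths as listed in the Alternates headers in these slides.
VARIANTS = {
    "fairlane.gif": ("image/gif", 49803),
    "fairlane.jpeg": ("image/jpeg", 38457),
    "fairlane.png": ("image/png", 98203),
    "fairlane.txt": ("text/plain", 193),
}

print(negotiate(None, VARIANTS))
print(negotiate("image/*; q=1.0", VARIANTS))
print(negotiate("image/tiff; q=1.0, image/gif; q=0.001", VARIANTS))
print(negotiate("image/tiff; q=1.0, */*;q=0.0", VARIANTS))
```

A real server also weighs language, charset, and encoding, and may apply server-side quality values, so this sketch should be read as an illustration of the q-value mechanics only.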

83 Accept-Encoding dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/vt-uva HTTP/1.1 Host: www.cs.odu.edu Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 03:26:35 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: vt-uva.html.gz Vary: negotiate,accept-encoding TCN: choice Last-Modified: Sun, 12 Mar 2006 20:52:54 GMT ETag: "3b64c1-2c54-44148a26;4414d473" Accept-Ranges: bytes Content-Length: 11348 Connection: close Content-Type: text/html Content-Encoding: x-gzip Connection closed by foreign host.

84 Accept-Encoding dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/vt-uva HTTP/1.1 Accept-Encoding: compress; q=0.1, gzip; q=0.0 Host: www.cs.odu.edu Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 03:29:54 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: vt-uva.html.Z Vary: negotiate,accept-encoding TCN: choice Last-Modified: Sun, 12 Mar 2006 20:52:41 GMT ETag: "3b64c3-5c0b-44148a19;4414d473" Accept-Ranges: bytes Content-Length: 23563 Connection: close Content-Type: text/html Content-Encoding: compress Connection closed by foreign host.

85 Accept-Language dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/ HTTP/1.1 Accept-Language: de; q=1.0 Host: www.cs.odu.edu Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 03:32:36 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: index.html.de Vary: negotiate,accept-language,accept-charset TCN: choice Last-Modified: Sun, 12 Mar 2006 16:39:23 GMT ETag: "3b64b7-1c94-44144ebb;4414d473" Accept-Ranges: bytes Content-Length: 7316 Connection: close Content-Type: text/html Content-Language: de Connection closed by foreign host.

86 Accept-Charset dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/index.html HTTP/1.1 Accept-Language: ja; q=1.0 Accept-Charset: iso-2022-jp; q=1.0 Host: www.cs.odu.edu Connection: close HTTP/1.1 200 OK Date: Mon, 13 Mar 2006 03:35:17 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Content-Location: index.html.ja.jis Vary: negotiate,accept-language,accept-charset TCN: choice Last-Modified: Sun, 12 Mar 2006 16:39:23 GMT ETag: "3b64ba-1dd3-44144ebb;4414d473" Accept-Ranges: bytes Content-Length: 7635 Connection: close Content-Type: text/html; charset=iso-2022-jp Content-Language: ja Connection closed by foreign host.

87 Accept-Charset / Status 406 dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/index.html HTTP/1.1 Accept-Language: ja; q=1.0 Accept-Charset: euc-jp; q=1.0 Host: www.cs.odu.edu Connection: close HTTP/1.1 406 Not Acceptable Date: Mon, 13 Mar 2006 03:39:29 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Alternates: {"index.html.de" 1 {type text/html} {language de} {length 7316}}, {"index.html.en" 1 {type text/html} {language en} {length 7233}}, {"index.html.es" 1 {type text/html} {language es} {length 7643}}, {"index.html.ja.jis" 1 {type text/html} {charset iso-2022-jp} {language ja} {length 7635}}, {"index.html.ru.koi8-r" 1 {type text/html} {charset koi8-r} {language ru} {length 7277}} Vary: negotiate,accept-language,accept-charset TCN: list Connection: close Content-Type: text/html; charset=iso-8859-1 Connection closed by foreign host.

88 406 Not Acceptable dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. HEAD /~mln/teaching/cs595-s06/a3-test/fairlane HTTP/1.1 Host: www.cs.odu.edu Accept: image/tiff; q=1.0, */*;q=0.0 Connection: close HTTP/1.1 406 Not Acceptable Date: Mon, 13 Mar 2006 04:03:01 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Alternates: {"fairlane.gif" 1 {type image/gif} {length 49803}}, {"fairlane.jpeg" 1 {type image/jpeg} {length 38457}}, {"fairlane.png" 1 {type image/png} {length 98203}}, {"fairlane.txt" 1 {type text/plain} {length 193}} Vary: negotiate,accept TCN: list Connection: close Content-Type: text/html; charset=iso-8859-1 Connection closed by foreign host.

89 406 Not Acceptable dhcp65-74-196-93:~/Desktop/cs595-s06 mln$ telnet www.cs.odu.edu 80 Trying 128.82.4.2... Connected to xenon.cs.odu.edu. Escape character is '^]'. GET /~mln/teaching/cs595-s06/a3-test/fairlane HTTP/1.1 Host: www.cs.odu.edu Accept: image/tiff; q=1.0, *;q=0.0 Connection: close HTTP/1.1 406 Not Acceptable Date: Tue, 14 Mar 2006 03:30:23 GMT Server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 Alternates: {"fairlane.gif" 1 {type image/gif} {length 49803}}, {"fairlane.jpeg" 1 {type image/jpeg} {length 38457}}, {"fairlane.png" 1 {type image/png} {length 98203}}, {"fairlane.txt" 1 {type text/plain} {length 193}} Vary: negotiate,accept TCN: list Connection: close Transfer-Encoding: chunked Content-Type: text/html; charset=iso-8859-1 27d 406 Not Acceptable Not Acceptable An appropriate representation of the requested resource /~mln/teaching/cs595-s06/a3-test/fairlane could not be found on this server. Available variants: fairlane.gif, type image/gif fairlane.jpeg, type image/jpeg fairlane.png, type image/png fairlane.txt, type text/plain Apache/1.3.26 Server at www.cs.odu.edu Port 80 0

90 Solipsism & Content Negotiation (i.e., just because you don't see it doesn't mean it isn't happening) $ curl -I en.wikipedia.org HTTP/1.0 301 Moved Permanently Date: Wed, 28 Aug 2013 15:57:59 GMT Server: Apache X-Content-Type-Options: nosniff Cache-Control: s-maxage=1200, must-revalidate, max-age=0 Vary: Accept-Encoding,X-Forwarded-Proto,Cookie Last-Modified: Wed, 28 Aug 2013 15:57:59 GMT Location: http://en.wikipedia.org/wiki/Main_Page Content-Length: 0 Content-Type: text/html; charset=utf-8 X-Cache: HIT from cp1017.eqiad.wmnet X-Cache-Lookup: HIT from cp1017.eqiad.wmnet:3128 Age: 76 X-Cache: HIT from cp1019.eqiad.wmnet X-Cache-Lookup: HIT from cp1019.eqiad.wmnet:80 Connection: close

