Web Spiders Dan Reeves Bill Walsh HDIW EECS 547 16 February 2000.

1 Web Spiders
Dan Reeves, Bill Walsh
HDIW, EECS 547
16 February 2000

2 What is a Web Spider?
A program that uses HTTP to automatically
  –download documents from a web server
  –analyze documents retrieved from a web server
  –send data back to a web server

3 Spider Usage
Search engines
  –Lycos analyzes 10,000,000 Web pages a day
Comparison shopping
  –ShopBot
Data analysis
  –bidding behavior at online auctions
Automated Web interactions
  –daily comics delivery
  –stock trading agent
Other (Mirroring, HTML/link validation, …)

4 How Humans Typically Access The Web
Web browser
  –human friendly interface
  –hides details of HTTP
Web browser is just a program written in some language
  –Whatever it does, you (a programmer) can do too!

5 What Components We Need to Use the Web
socket connection + HTTP + page knowledge

6 Setting Up a Socket Connection
Programmatically (C, Perl, Java, Lisp, etc.)
Unix command prompt:
  > telnet address port_number
  –address is the web site's host name (not a full URL)
  –default port_number for most web sites: 80
Example: telnet www.netscape.com 80
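
The programmatic route can be sketched roughly as follows. This is a minimal illustration in Python rather than any of the languages listed on the slide; the function names and the User-Agent string are hypothetical, and error handling is omitted.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request. Every line ends in CRLF."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "User-Agent: example-spider/0.1",  # hypothetical name, pick your own
        "",  # the empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host: str, path: str = "/", port: int = 80,
              timeout: float = 10.0) -> bytes:
    """Open a TCP socket, send the request, and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch_raw("www.example.com", "/") does the same work as typing the request into telnet by hand: the returned bytes contain the status line, the headers, and the page body.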

7 HTTP
A well-defined specification for message formats
Orthogonal to:
  –TCP/IP
  –HTML
  –XML
W3C – World Wide Web Consortium
  –www.w3.org

8 Page Knowledge
Markup language: HTML, XML, free text
Data formatting
  –regular expressions
  –domain-specific conventions
  –freeform text
How to get the knowledge:
  –coded by humans
  –learning
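
As one illustration of human-coded page knowledge using regular expressions, the sketch below pulls link targets out of an HTML string. It is written in Python; the pattern and function name are hypothetical, and the pattern only handles quoted href attributes, so it is a rough approximation rather than a real HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html: str, base_url: str) -> list:
    """Return absolute URLs for every quoted href found in anchor tags."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]
```

A spider would typically feed each downloaded page through a function like this and queue the resulting URLs for later visits.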

9 telnet www.netscape.com 80
  % telnet www.netscape.com 80
  Trying 207.200.75.204...
  Connected to www-ld2.netscape.com.
  Escape character is '^]'.
  GET /index.html HTTP/1.0
  User-Agent: An Evil Spider
  Accept: image/gif, */*
  Accept-Language: en, de
  Purpose-of-Request: Denial of Service Attack

  HTTP/1.1 200 OK
  Server: Netscape-Enterprise/3.6
  Date: Thu, 10 Feb 2000 21:22:39 GMT
  Set-Cookie: UIDC=141.213.12.186:0950217760:031129;domain=.netscape.com;path=/; expires=31-Dec-2010 23:59:59 GMT
  Content-type: text/html
  Connection: close

  <!-- Hide from old browsers
  if (parseFloat(navigator.appVersion) '); location.href= "http://home.netscape.com/computing/download/upgrade_index.html";}
  // Stop Hiding From Old Browsers -->
  Netcenter
  [[[ about 40k snipped ]]]
  window.pup=Pup();
  Connection closed by foreign host.
  %
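
The response in the session above has three parts: a status line, header lines, and a body separated from the headers by an empty line. A minimal Python sketch of splitting them apart, assuming the whole response is already in memory as bytes (the function name is hypothetical):

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (status, reason, headers, body)."""
    # Headers end at the first empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line looks like: HTTP/1.1 200 OK
    _version, status, reason = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Simplification: a repeated header (e.g. several Set-Cookie lines)
        # overwrites the earlier value.
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers, body
```

Header names are lowercased because HTTP treats them as case-insensitive; the server above sends "Content-type" where others send "Content-Type".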

10 Setting Up a Proxy in Netscape

11 Example of GET
  GET / HTTP/1.0^M
  Proxy-Connection: Keep-Alive^M
  User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)^M
  Host: www.netscape.com^M
  Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*^M
  Accept-Encoding: gzip^M
  Accept-Language: de, en^M
  Accept-Charset: iso-8859-1,*,utf-8^M
  Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1^M

12 Example of GET-Based Form
  GET /lookup/Lookup.tibco?search=sunw&st_symbol=on HTTP/1.0
  Referer: http://www.netscape.com/
  Proxy-Connection: Keep-Alive
  User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
  Host: lookup.netscape.com
  Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
  Accept-Encoding: gzip
  Accept-Language: de, en
  Accept-Charset: iso-8859-1,*,utf-8
  Cookie: UIDC=141.213.12.157:0937196127:933529; HITO_VISITS=A151853E2+1368C1*E00B2*1; NSPOP=|myn12
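
In a GET-based form, the field names and values travel in the query string after the "?" in the request line. A small Python sketch of building such a request line from a dictionary of fields (the function name is hypothetical, and only the Host header is included for brevity):

```python
from urllib.parse import urlencode


def build_form_get(path: str, host: str, fields: dict) -> str:
    """Return a minimal HTTP/1.0 GET request with fields in the query string."""
    query = urlencode(fields)  # e.g. {"a": "b c"} becomes "a=b+c"
    return f"GET {path}?{query} HTTP/1.0\r\nHost: {host}\r\n\r\n"
```

With the fields search=sunw and st_symbol=on, this reproduces the request line shown on the slide.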

13 Example of POST-Based Form
  POST /~dreeves/bin/quote-submit.cgi HTTP/1.0
  Referer: http://www.eecs.umich.edu/~dreeves/add-quote.html
  Proxy-Connection: Keep-Alive
  User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
  Host: www.eecs.umich.edu
  Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
  Accept-Encoding: gzip
  Accept-Language: de, en
  Accept-Charset: iso-8859-1,*,utf-8
  Content-type: application/x-www-form-urlencoded
  Content-length: 234

  recipient=daniel&subject=QUOTE+DATABASE+SUBMISSION&name=eecs547+student
  &email=dreeves%40umich.edu
  &body=%22Beware+of+bugs+in+the+above+code
  %3B+I+have+only+proved+it+correct%2C+not+tried+it.
  %22%0D%0A++++++++++++++++--+Donald+Knuth%0D%0A
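
In a POST-based form, the same URL-encoded fields move out of the request line and into the message body, and the Content-length header must give the body's exact size in bytes. A hedged Python sketch (the function name is hypothetical, and the browser-specific headers from the slide are left out):

```python
from urllib.parse import urlencode


def build_form_post(path: str, host: str, fields: dict) -> bytes:
    """Return a minimal HTTP/1.0 POST request carrying URL-encoded fields."""
    # urlencode turns spaces into "+" and reserved characters into %XX,
    # e.g. "@" becomes "%40".
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body
```

Computing the length from the encoded body, rather than from the raw field values, avoids the common mistake of an incorrect Content-length after escaping.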

14 Basic Perl Web Library (web.pl)
getURLAsString
  –Given a URL, returns contents as string.
submitForm
  –Given a URL and a Perl hash of HTML form fields and contents, submits the form and returns response.
html2text
  –Uses lynx to parse HTML into a reasonable text approximation.
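
The source of web.pl is not shown here, so the following is only a rough Python analogue of the three functions, built on the standard library. All names are hypothetical, and html_to_text uses Python's html.parser instead of lynx, so its output will differ from what the Perl version produces.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Crude text approximation of an HTML page (whitespace collapsed)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def _read(resp) -> str:
    charset = resp.headers.get_content_charset() or "iso-8859-1"
    return resp.read().decode(charset, errors="replace")


def get_url_as_string(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its contents as a string."""
    with urlopen(url, timeout=timeout) as resp:
        return _read(resp)


def submit_form(url: str, fields: dict, timeout: float = 10.0) -> str:
    """POST a dict of form fields to a URL and return the response text."""
    data = urlencode(fields).encode("ascii")
    with urlopen(url, data=data, timeout=timeout) as resp:
        return _read(resp)
```

Passing data to urlopen makes it send a POST with an application/x-www-form-urlencoded body, which matches the POST-based form shown on the previous slide.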

15 Example: Get Today’s Dilbert and Package it for Email

16 Other Issues and Gotchas
SSL
  –Perl SSLeay library
Cookies
  –Perl libraries exist
robots.txt file
  –Sometimes for spider’s benefit
  –Politeness
Don’t get your domain blocked!
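
A polite spider checks a site's robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser to test URLs against rules that have already been downloaded as text; the helper name and the "example-spider" user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def make_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a queryable rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Example rules: every spider is asked to stay out of /private/.
rules = make_robot_rules("User-agent: *\nDisallow: /private/\n")
allowed = rules.can_fetch("example-spider", "http://www.example.com/index.html")
blocked = rules.can_fetch("example-spider", "http://www.example.com/private/x.html")
```

Here allowed is True and blocked is False. Spacing requests out over time is the other half of politeness, and is not shown in this sketch.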

17 For More Information...
All examples and links at: http://www.eecs.umich.edu/~dreeves/hdiw/main.html

