Introduction to Computing Using Python CSC 242-501 Winter 2013 Week 8: WWW and Search  World Wide Web  Python Modules for WWW  Web Crawling  Thursday:

Slides:



Advertisements
Similar presentations
Basic Internet Terms Digital Design. Arpanet The first Internet prototype created in 1965 by the Department of Defense.
Advertisements

Web Development & Design Foundations with XHTML
World Wide Web Hyperlinks Servers/Clients Browsers HTML (HyperText Markup Language)
Building a Web Crawler in Python Frank McCown Harding University Spring 2013 This work is licensed under a Creative Commons Attribution-NonCommercial-
4.01 How Web Pages Work.
Project 1 Introduction to HTML.
XP Browser and Basics1. XP Browser and Basics2 Learn about Web browser software and Web pages The Web is a collection of files that reside.
Web development  World Wide Web (web) is the Internet system for hypertext linking.  A hypertext document (web page) is an online document. It contains.
CIS101 Introduction to Computing Week 05. Agenda Your questions Exam next week - Excel Introduction to the Internet & HTML Online HTML Resources Using.
World Wide Web1 Applications World Wide Web. 2 Introduction What is hypertext model? Use of hypertext in World Wide Web (WWW) – HTML. WWW client-server.
Introduction to HTML 2006 CIS101. What is the Internet? Global network of computers that are connected and communicate via a series of Protocols Protocols.
Introduction to HTML 2006 INT197B. What is the Internet? Global network of computers that are connected and communicate via a series of Protocols Protocols.
Internet – Part II. What is the World Wide Web? The World Wide Web is a collection of host machines, which deliver documents, graphics and multi-media.
IST 221 Internet Concepts and Applications Internet, WWW and HTML 1.
Introduction to HTML 2004 CIS101. What is the Internet? Global network of computers that are connected and communicate via a series of Protocols Protocols.
Topics in this presentation: The Web and how it works Difference between Web pages and web sites Web browsers and Web servers HTML purpose and structure.
1 The World Wide Web. 2  Web Fundamentals  Pages are defined by the Hypertext Markup Language (HTML) and contain text, graphics, audio, video and software.
1st Project Introduction to HTML.
CIS101 Introduction to Computing Week 06. Agenda Your questions Excel Exam during second hour Our status after the snow day Introduction to the Internet.
The Internet & The World Wide Web Notes
HTML 1 Introduction to HTML. 2 Objectives Describe the Internet and its associated key terms Describe the World Wide Web and its associated key terms.
Chapter ONE Introduction to HTML.
UNDERSTANDING WEB AND WEB PROJECT PLANNING AND DESIGNING AND EFFECTIVE WEBSITE Garni Dadaian.
The Internet & Web Browsers Business Webpage Design Kelly Seale.
1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009.
INTRODUCTION TO WEB DATABASE PROGRAMMING
Chapter 16 The World Wide Web Chapter Goals ( ) Compare and contrast the Internet and the World Wide Web Describe general Web processing.
Chapter 16 The World Wide Web Chapter Goals Compare and contrast the Internet and the World Wide Web Describe general Web processing Describe several.
DATA COMMUNICATION DONE BY: ALVIN SAMPATH CARLVIN SAMPATH.
16-1 The World Wide Web The Web An infrastructure of distributed information combined with software that uses networks as a vehicle to exchange that information.
The Internet Writer’s Handbook 2/e Introduction to World Wide Web Terms Writing for the Web.
CP476 Internet Computing Lecture 5 : HTTP, WWW and URL 1 Lecture 5. WWW, HTTP and URL Objective: to review the concepts of WWW to understand how HTTP works.
Chapter 6 The World Wide Web. Web Pages Each page is an interactive multimedia publication It can include: text, graphics, music and videos Pages are.
XHTML Introductory1 Linking and Publishing Basic Web Pages Chapter 3.
NOTE: To change the image on this slide, select the picture and delete it. Then click the Pictures icon in the placeholder to insert your own image. WEB.
UNESCO ICTLIP Module 1. Lesson 61 Introduction to Information and Communication Technologies Lesson 6. What is the Internet?
CIS 250 Advanced Computer Applications Internet/WWW Review.
Lesson 7 – World Wide Web. What is the World Wide Web?  The content of the worldwide web is held on individual web pages gathered together to form websites.
Introduction to HTML. Today’s Discussion What is HTML ? What is HTML ? What is Web Page ? What is Web Page ? Web Server Web Server Web Browser Web Browser.
Chapter 29 World Wide Web & Browsing World Wide Web (WWW) is a distributed hypermedia (hypertext & graphics) on-line repository of information that users.
Chapter 4 Applets Cop Why Applets? WWW makes huge information available to anyone with web browser. Web server send web pages and images to your.
1 WWW. 2 World Wide Web Major application protocol used on the Internet Simple interface Two concepts –Point –Click.
Introduction to Computing Using Python WWW and Search  World Wide Web  Python WWW API  String Pattern Matching  Web Crawling.
HTML Hyper Text Markup Language 1BFCET BATHINDA. Definitions Web server: a system on the internet containing one or more web site Web site: a collection.
IS-907 Java EE World Wide Web - Overview. World Wide Web - History Tim Berners-Lee, CERN, 1990 Enable researchers to share information: Remote Access.
Web Design. What is the Internet? A worldwide collection of computer networks that links millions of computers by – Businesses (.com.net) – the government.
Module: Software Engineering of Web Applications Chapter 2: Technologies 1.
Internet Applications (Cont’d) Basic Internet Applications – World Wide Web (WWW) Browser Architecture Static Documents Dynamic Documents Active Documents.
01 - Introduction Informatics Department Parahyangan Catholic University.
HTML Concepts and Techniques Fifth Edition Chapter 1 Introduction to HTML.
Website Design, Development and Maintenance ONLY TAKE DOWN NOTES ON INDICATED SLIDES.
Chapter 1 Introduction to HTML, XHTML, and CSS HTML5 & CSS 7 th Edition.
Introduction to HTML Simple facts yet crucial to beginning of study in fundamentals of web page design!
Introduction to the World Wide Web & Internet CIS 101.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
The Internet Salihu Ibrahim Dasuki (PhD) CSC102 INTRODUCTION TO COMPUTER SCIENCE.
Python: Programming the Google Search (Crawling) Damian Gordon.
Blended HTML and CSS Fundamentals 3 rd EDITION Tutorial 2 Creating Links.
World Wide Web. The World Wide Web is a system of interlinked hypertext documents accessed via the Internet The World Wide Web is a system of interlinked.
4.01 How Web Pages Work.
4.01 How Web Pages Work.
4.01 How Web Pages Work.
Chapter 1 Introduction to HTML.
Project 1 Introduction to HTML.
CASE STUDY -HTML,URLs,HTTP
HTTP Request Method URL Protocol Version GET /index.html HTTP/1.1
Web Page Concept and Design :
Chapter 16 The World Wide Web.
4.01 How Web Pages Work.
4.01 How Web Pages Work.
Presentation transcript:

Introduction to Computing Using Python CSC Winter 2013 Week 8: WWW and Search  World Wide Web  Python Modules for WWW  Web Crawling  Thursday: String Pattern Matching

Introduction to Computing Using Python The World Wide Web (WWW) The Internet is a global network that connects computers around the world It allows two programs running on two computers to communicate The programs communicate using a client/server protocol: one program (the client) requests a resource from another (the server). For example, a browser requests a web page from a web server. The computer hosting the server program is often referred to as a server too The World Wide Web (WWW or, simply, the web) is a distributed system of resources linked through hyperlinks and hosted on servers across the Internet Resources can be web pages, documents, multimedia, etc. Web pages are a critical resource as they contain hyperlinks to other resources Servers hosting WWW resources are called web servers A program that requests a resource from a web server is called a web client

The technology required to do the above includes: Uniform Resource Locator (URL) HyperText Transfer Protocol (HTTP) HyperText Markup Language (HTML) Introduction to Computing Using Python WWW plumbing When you click a hyperlink in a browser: 1.The browser (the web client in this case) parses the hyperlink to obtain a)the name and location of the web server hosting the associated resource b)the pathname of the resource on the server 2.The browser connects to the web server and sends it a message requesting the resource 3.The web server replies back with a message that includes the requested resource (if the server has it) The technology required to do the above includes: A naming and locator scheme that uniquely identifies, and locates, resources A communication protocol that specifies precisely the order in which messages are sent as well as the message format A document formatting language for developing web pages that supports hyperlink definitions

Introduction to Computing Using Python WWW plumbing The technology required to do the above includes: A naming and locator scheme that uniquely identifies, and locates, resources Called Uniform Resource Locator (URL) A communication protocol that specifies precisely the order in which messages are sent as well as the message format Called HTTP (Hypertext Transfer Protocol) A document formatting language for developing web pages that supports hyperlink definitions Caled HyperText Markup Languge (HTML)

Introduction to Computing Using Python Naming scheme: URL schemehostpathname The Uniform Resource Locator (URL) is a naming and locator scheme that uniquely identifies, and locates, resources on the web Other examples: ftp://ftp.server.net/

Introduction to Computing Using Python Communication protocol: HTTP The HyperText Transfer Protocol (HTTP) specifies the format of the web client’s request message and the web server’s reply message Suppose the web client wants HTTP/ OK Date: Sat, 21 Apr :11:26 GMT Server: Apache/2 Last-Modified: Mon, 02 Jan :56:24 GMT... Content-Type: text/html; charset=utf-8... HTTP/ OK Date: Sat, 21 Apr :11:26 GMT Server: Apache/2 Last-Modified: Mon, 02 Jan :56:24 GMT... Content-Type: text/html; charset=utf-8... GET /Consortium/mission.html HTTP/1.1 Host: User-Agent: Mozilla/ Accept: text/html,application/xhtml+xml... Accept-Language: en-us,en;q= GET /Consortium/mission.html HTTP/1.1 Host: User-Agent: Mozilla/ Accept: text/html,application/xhtml+xml... Accept-Language: en-us,en;q= request headers request line reply headers reply line requested resource

W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything See the complete W3C Mission document. W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything See the complete W3C Mission document. Introduction to Computing Using Python HyperText Markup Language: HTML An HTML file is a text file written in HTML and is referred to as the source file The browser interprets the HTML source file and displays the web page w3c.html

In the source file, an HTML element is described using: A pair of tags, the start tag and the end tag Optional attributes within the start tag Other elements or data between the start and end tag Introduction to Computing Using Python HyperText Markup Language: HTML An HTML source file is composed of HTML elements Each element defines a component of the associated web page

Introduction to Computing Using Python HyperText Markup Language: HTML W3C Mission Summary W3C Mission Summary W3C Mission Summary W3C Mission W3C Mission Summary W3C Mission W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything See the complete. W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything See the complete. W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything See the complete W3C Mission document. W3C Mission Summary W3C Mission The W3C mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web. Principles Web for All Web on Everything See the complete W3C Mission document.

Introduction to Computing Using Python W3C Mission document. HTML anchor element ( a ) defines a hyperlink attributeattribute value The anchor start tag must have attribute href whose value is a URL Hyperlinks... <a href="/Consortium/facts.html"... <a href=" <a href="/Consortium/facts.html"... <a href=" The URL can be relative or absolute <a href="/facts.html"... <a href=" <a href="/facts.html"... <a href=" hyperlink text relative URL that corresponds to absolute URL

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 11

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 12

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 13

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 14

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 15

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 16

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 17

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 18

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 19

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 20

Murach’s Java Servlets/JSP (2 nd Ed.), C4 © 2008, Mike Murach & Associates, Inc. Slide 21

Hypertext Transfer Protocol (HTTP) A protocol is a set of rules that a client and server use to communicate with each other Hypertext Transfer Protocol is used by Web clients and Web servers

At the highest level, HTTP is very simple: The client (typically, a Web browser) sends the server An HTTP request The server sends back an HTTP response HTTP request specifies the resource the client wants A web page, a multimedia file, etc. If the resource is available, the HTTP response contains the contents of that resource Although we discussed some of the details of HTTP earlier, it is not necessary to know them for CSC 242 Hypertext Transfer Protocol (HTTP)

The urllib.request module contains a method called urlopen Example: urlopen(‘ The urlopen function generates an HTTP request that is sent to the specified Web server In the above example, the server’s name is condor.depaul.edu urlopen returns a Python HttpResponse object The HttpResponse class has a read() method, which returns the body of the response HTTP and Python

The read() method returns an object whose type is binary (not str) This is because not all resources on the Web are textual – for example, multimedia files If a resource is text (for example a Web page), then the decode() method must be called to convert from binary to str HTTP and Python

Introduction to Computing Using Python Python programs as web clients >>> from urllib.request import urlopen >>> response = urlopen(' >>> type(response) ('Content-Type', 'text/html; charset=utf-8') >>> html = response.read() >>> type(html) >>> html = html.decode() >>> type(html) >>> html '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN”... \n' >>> from urllib.request import urlopen >>> response = urlopen(' >>> type(response) ('Content-Type', 'text/html; charset=utf-8') >>> html = response.read() >>> type(html) >>> html = html.decode() >>> type(html) >>> html '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN”... \n' HTTPResponse is a class defined in Standard Library module http.client ; it encapsulates an HTTP response. HTTPResponse supports methods: geturl() getheaders() read(), etc Because a web resource is not necessarily a text file, method read() returns an object of type bytes ; bytes method decode() interprets the bytes as a string encoded in UTF-8 and returns that string

Introduction to Computing Using Python Exercise from urllib.request import urlopen def news(url, topics): '''counts in resource with URL url the frequency of each topic in list topics''' response = urlopen(url) html = response.read() content = html.decode().lower() for topic in topics: n = content.count(topic) print('{} appears {} times.'.format(topic,n)) from urllib.request import urlopen def news(url, topics): '''counts in resource with URL url the frequency of each topic in list topics''' response = urlopen(url) html = response.read() content = html.decode().lower() for topic in topics: n = content.count(topic) print('{} appears {} times.'.format(topic,n)) Write method news() that takes a URL of a news web site and a list of news topics (i.e., strings) and computes the number of occurrences of each topic in the news. >>> news(' economy appears 3 times climate appears 3 times education appears 1 times >>> news(' economy appears 3 times climate appears 3 times education appears 1 times

Introduction to Computing Using Python Parsing a web page When an HTMLParser object is fed a string containing HTML, it processes it A parser is a program that analyzes the structure of a piece of text. The Python Standard Library module html.parser provides a class, HTMLParser, for parsing HTML files. The string is broen into tokens that correspond to HTML start tags, end tags, text data, etc. The tokens are then processed in the order in which they appear in the string For each token, an appropriate handler is invoked by Python The handlers are methods of class HTMLParser >>> infile = open('w3c.html') >>> content = infile.read().decode() >>> infile.close() >>> from html.parser import HTMLParser >>> parser = HTMLParser() >>> parser.feed(content) >>> infile = open('w3c.html') >>> content = infile.read().decode() >>> infile.close() >>> from html.parser import HTMLParser >>> parser = HTMLParser() >>> parser.feed(content)

Introduction to Computing Using Python HTMLParser methods >>> from urllib.request import urlopen >>> from html.parser import HTMLParser >>> response = urlopen(' >>> content = response.read().decode() >>> parser = HTMLParser() >>> parser.feed(content) >>> from urllib.request import urlopen >>> from html.parser import HTMLParser >>> response = urlopen(' >>> content = response.read().decode() >>> parser = HTMLParser() >>> parser.feed(content) TokenHTMLParser methodExplanation handle_starttag(tag, attrs) Start tag handler handle_endtag(tag) End tag handler datahandle_data(data) Arbitrary text data handler Everything on a Web page is a start tag, an end tag, or data. Using HTMLParser, it will be handled by one of the 3 methods above.

Introduction to Computing Using Python Parsing a web page >>> response = urlopen(‘ >>> content = response.read().decode() >>> parser = HTMLParser() >>> parser.feed(content) >>> infile.close() >>> from html.parser import HTMLParser >>> parser = HTMLParser() >>> parser.feed(content) >>> response = urlopen(‘ >>> content = response.read().decode() >>> parser = HTMLParser() >>> parser.feed(content) >>> infile.close() >>> from html.parser import HTMLParser >>> parser = HTMLParser() >>> parser.feed(content) TokenHandlerExplanation handle_starttag(tag, attrs) Start tag handler handle_endtag(tag) End tag handler datahandle_data(data) Arbitrary text data handler W3C Mission Summary W3C Mission... W3C Mission Summary W3C Mission... handle_starttag('http') handle_data('\n') handle_starttag('head') handle_data('\n') handle_starttag('title') handle_data('W3CMission Summary') handle_endtag('title') handle_data('\n') handle_endtag('head') For each token, an appropriate handler is invoked The handlers are methods of class HTMLParser By default, the handlers do nothing

Introduction to Computing Using Python Parsing a web page ch11.html HTMLParser is really meant to be used as a “generic” superclass from which application specific parser subclasses can be developed Here is an example subclass: class HTMLDisplay(HTMLParser): def handle_starttag(self, tag, attrs): print(‘Start tag: {} attributes: {}’.format(tag,attrs)) def handle_endtag(self, tag): print(‘End tag: {}’.format(tag)) def handle_data(self, data): print(‘Data: {}’.format(data))

Start tag: html attributes: [] Start tag: title attributes: [] Data: CSC simple Web page End tag: title Start tag: body attributes: [] Start tag: center attributes: [] Start tag: h3 attributes: [] Data: CSC 242 Section 501 Start tag: br attributes: [] Data: A simple Web page End tag: h3 Start tag: p attributes: [] Data: Hello world! Start tag: p attributes: [] Start tag: a attributes: [('href', ' Data: CSC home page End tag: a End tag: p End tag: body End tag: html CSC simple Web page CSC 242 Section 501 A simple Web page Hello world! Web pagePrintParser output

Introduction to Computing Using Python A parser that prints hyperlink URLs We need a parser that finds the URL attribute of every anchor start tag. As with the previous example, we create a subclass of HTMLParser, and override the handle_starttag method. from html.parser import HTMLParser class LinkParser(HTMLParser): def handle_starttag(self, tag, attrs): 'print value of href attribute if any if tag == 'a': # search for href attribute and print its value for attr in attrs: # each attr is a tuple if attr[0] == 'href': print(attr[1]) from html.parser import HTMLParser class LinkParser(HTMLParser): def handle_starttag(self, tag, attrs): 'print value of href attribute if any if tag == 'a': # search for href attribute and print its value for attr in attrs: # each attr is a tuple if attr[0] == 'href': print(attr[1]) ch11.html

Introduction to Computing Using Python LinkParser’s behavior Absolute HTTP link Absolute link to Google Relative HTTP link Relative link to test.html. Absolute HTTP link Absolute link to Google Relative HTTP link Relative link to test.html. ch11.html >>> linkparser = LinkParser() >>> linkparser.feed(content) test.html >>> linkparser = LinkParser() >>> linkparser.feed(content) test.html

Introduction to Computing Using Python Collecting hyperlinks (as absolute URLs) links.html Parser LinkParser prints the URL value of the href attribute contained in every anchor start tag Suppose we would like to collect http URLS only, in their absolute version >>> url = … >>> resource = urlopen(url) >>> content = resource.read().decode() >>> collector = Collector() >>> collector.feed(content) >>> collector.getLinks()... test.html >>> url = … >>> resource = urlopen(url) >>> content = resource.read().decode() >>> collector = Collector() >>> collector.feed(content) >>> collector.getLinks()... test.html ch11.html Absolute HTTP link Absolute link to Google Relative HTTP link Relative link to test.html. Absolute HTTP link Absolute link to Google Relative HTTP link Relative link to test.html. Need to convert to absolute URL

>>> url = ' >>> resource = urlopen(url) >>> content = resource.read().decode() >>> collector = Collector() >>> collector.feed(content) >>> url = ' >>> resource = urlopen(url) >>> content = resource.read().decode() >>> collector = Collector() >>> collector.feed(content) >>> url = ' >>> resource = urlopen(url) >>> content = resource.read().decode() >>> linkparser = LinkParser() >>> linkparser.feed(content)... /Consortium/sup /Consortium/siteindex >>> url = ' >>> resource = urlopen(url) >>> content = resource.read().decode() >>> linkparser = LinkParser() >>> linkparser.feed(content)... /Consortium/sup /Consortium/siteindex Introduction to Computing Using Python Collecting hyperlinks (as absolute URLs) Need to transform a relative URL in web page … The Python Standard Library module urllib.parse defines method urljoin() for this purpose … to an absolute URL >>> from urllib.parse import urljoin >>> url = ' >>> relative = '/Consortium/siteindex' >>> urljoin(url, relative) ' >>> from urllib.parse import urljoin >>> url = ' >>> relative = '/Consortium/siteindex' >>> urljoin(url, relative) '

Introduction to Computing Using Python Collecting hyperlinks (as absolute URLs) from urllib.parse import urljoin from html.parser import HTMLParser class Collector(HTMLParser): 'collects hyperlink URLs into a list' def __init__(self, url): 'initializes parser, the url, and a list' HTMLParser.__init__(self) self.url = url self.links = [] def handle_starttag(self, tag, attrs): 'collects hyperlink URLs in their absolute format' if tag == 'a': for attr in attrs: if attr[0] == 'href': # construct absolute URL absolute = urljoin(self.url, attr[1]) if absolute[:4] == 'http': # collect HTTP URLs self.links.append(absolute) def getLinks(self): 'returns hyperlinks URLs in their absolute format' return self.links from urllib.parse import urljoin from html.parser import HTMLParser class Collector(HTMLParser): 'collects hyperlink URLs into a list' def __init__(self, url): 'initializes parser, the url, and a list' HTMLParser.__init__(self) self.url = url self.links = [] def handle_starttag(self, tag, attrs): 'collects hyperlink URLs in their absolute format' if tag == 'a': for attr in attrs: if attr[0] == 'href': # construct absolute URL absolute = urljoin(self.url, attr[1]) if absolute[:4] == 'http': # collect HTTP URLs self.links.append(absolute) def getLinks(self): 'returns hyperlinks URLs in their absolute format' return self.links >>> url = ' >>> resource = urlopen(url) >>> content = resource.read().decode() >>> collector = Collector() >>> collector.feed(content) >>> collector.getLinks() [..., ' ' ' ' >>> url = ' >>> resource = urlopen(url) >>> content = resource.read().decode() >>> collector = Collector() >>> collector.feed(content) >>> collector.getLinks() [..., ' ' ' '

Exercise: Change Collector so that it collects img URLs Collecting image links (as absolute URLs)