
Slide 1: Introduction to Web Crawling and Regular Expression
CSC4170 Web Intelligence and Social Computing, Tutorial 1
Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk

Slide 2: Outline
- Course & Tutors Information
- Introduction to Web Crawling
  - Utilities of a crawler
  - Features of a crawler
  - Architecture of a crawler
- Introduction to Regular Expression
- Appendix

Slide 3: Course and Tutors Information
- Course homepage: http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009
- Tutors:
  - Xin Xin: Email xxin@cse.cuhk.edu.hk, Venue: Room 101
  - Tom (me): Email czhou@cse.cuhk.edu.hk, Venue: Room 114A

Slide 4: Utilities of a crawler
- Also known as a Web crawler or spider.
- Definition: a Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
- Utilities:
  - Gather pages from the Web.
  - Support a search engine, perform data mining, and so on.
- Objects collected:
  - Text, video, images, and so on.
  - Link structure.

Slide 5: Features of a crawler
- Must provide:
  - Robustness: resist spider traps, e.g.:
    - Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
    - Pages filled with a large number of characters.
  - Politeness: respect which pages can be crawled and which cannot (a runnable check is sketched below).
    - Robots exclusion protocol: robots.txt, e.g. http://blog.sohu.com/robots.txt contains:
      User-agent: *
      Disallow: /manage/
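As a concrete illustration, Python's standard urllib.robotparser can apply these rules. This is a minimal sketch assuming http://blog.sohu.com/robots.txt is still served with the entries shown above; the two test paths are hypothetical examples.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://blog.sohu.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file over the network

# With "User-agent: *" and "Disallow: /manage/", the manage path is off limits:
print(rp.can_fetch("*", "http://blog.sohu.com/manage/index.html"))   # False
print(rp.can_fetch("*", "http://blog.sohu.com/2009/some-post.html")) # True
```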

Slide 6: Features of a crawler (Cont'd)
- Should provide:
  - Distributed
  - Scalable
  - Performance and efficiency
  - Quality
  - Freshness
  - Extensible

Slide 7: Architecture of a crawler
[Architecture diagram: URL Frontier -> Fetch (www, DNS) -> Parse -> Content Seen? (Doc Fingerprint) -> URL Filter (robots templates) -> Dup URL Elim (URL set) -> back to URL Frontier]

Slide 8: Architecture of a crawler (Cont'd)
- URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.
- DNS: domain name service resolution; look up the IP address for a domain name.
- Fetch: generally uses the HTTP protocol to fetch the URL.
- Parse: the page is parsed; text (images, videos, etc.) and links are extracted (see the sketch below).
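To make the Frontier/Fetch/Parse cycle concrete, here is a minimal single-page sketch using only Python's standard library. The example.com seed URL is a placeholder; a real crawler would add DNS caching, robots checks, politeness delays, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags while the page is parsed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

frontier = deque(["http://example.com/"])   # seed set (placeholder URL)
while frontier:
    url = frontier.popleft()                # take a URL from the frontier
    html = urlopen(url).read().decode("utf-8", errors="replace")  # Fetch
    parser = LinkExtractor()
    parser.feed(html)                       # Parse: extract the links
    for link in parser.links:
        frontier.append(urljoin(url, link)) # resolve relative URLs
    break  # this sketch stops after one page; a real crawler keeps going
```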

Slide 9: Architecture of a crawler (Cont'd)
- Content Seen?: test whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page.
- URL Filter: decides whether an extracted URL should be excluded from the frontier (e.g. because of robots.txt). URLs should also be normalized: a relative link such as "Disclaimers" on en.wikipedia.org/wiki/Main_Page must be resolved against the page it appears on.
- Dup URL Elim: the URL is checked against the URL set for duplicate elimination (see the sketch below).
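A minimal sketch of the fingerprint test, duplicate URL elimination, and URL normalization. It assumes exact-duplicate detection with a SHA-1 hash; production crawlers typically use near-duplicate fingerprints (e.g. shingles) instead.

```python
import hashlib
from urllib.parse import urljoin

seen_fingerprints = set()   # the "Doc Fingerprint" store
seen_urls = set()           # the "URL set" for duplicate URL elimination

def content_seen(page_text: str) -> bool:
    """Return True if a page with identical content was fetched before."""
    fp = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

def url_seen(url: str) -> bool:
    """Dup URL Elim: return True if this exact URL is already known."""
    if url in seen_urls:
        return True
    seen_urls.add(url)
    return False

# Normalization: the relative link "Disclaimers" on the Wikipedia main
# page resolves against the page it was found on.
base = "https://en.wikipedia.org/wiki/Main_Page"
print(urljoin(base, "Disclaimers"))  # https://en.wikipedia.org/wiki/Disclaimers
```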

Slide 10: Architecture of a crawler (Cont'd)
- Other issues:
  - Housekeeping tasks:
    - Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds).
    - Checkpointing: a snapshot of the crawler's state (e.g. the URL frontier) is committed to disk (every few hours).
  - Priority of URLs in the URL frontier: change rate, quality.
  - Politeness: avoid repeated fetch requests to a host within a short time span; otherwise the crawler may get blocked (see the sketch below).
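One simple way to implement the politeness rule above is to remember when each host was last contacted and wait until a minimum gap has passed. The 2-second gap here is an assumed illustrative value, not one given in the slides.

```python
import time
from urllib.parse import urlparse

MIN_GAP = 2.0      # assumed minimum seconds between requests to one host
last_access = {}   # host -> time of the previous request

def wait_for_politeness(url: str) -> None:
    """Sleep until at least MIN_GAP seconds have passed for this host."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_access.get(host, 0.0)
    if elapsed < MIN_GAP:
        time.sleep(MIN_GAP - elapsed)   # avoid hammering the same host
    last_access[host] = time.time()
```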

Slide 11: Regular Expression
- Usage: regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters.
- Today's target: introduce the basic principles.
- A tool to verify regular expressions: Regex Tester
  http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx

Slide 12: Regular Expression
- Metacharacter: similar to the wildcard in Windows, e.g. *.doc
- Target: detect the email address.
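To illustrate the analogy, the standard-library function fnmatch.translate converts a Windows-style wildcard into an equivalent regular expression; the exact translated string varies across Python versions.

```python
import fnmatch
import re

pattern = fnmatch.translate("*.doc")   # e.g. '(?s:.*\\.doc)\\Z' on Python 3.8+
print(re.match(pattern, "report.doc") is not None)  # True
print(re.match(pattern, "report.txt") is not None)  # False
```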

Slide 13: Regular Expression
- \b: stands for the beginning or end of a word. E.g. \bhi\b finds "hi" exactly.
- \w: matches a letter, digit, or underscore.
- .: matches any character except the newline.
- *: the content before * can be repeated any number of times, e.g. \bhi\b.*\bLucy\b
- +: the content before + can be repeated one or more times.
- []: matches any character listed inside it, e.g. \b[aeiou]+[a-zA-Z]*\b
- {n}: repeat exactly n times; {n,}: repeat n or more times; {n,m}: repeat n to m times.
(These patterns are demonstrated below.)
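The patterns above can be tried directly in Python's re module; the sample strings here are illustrative.

```python
import re

# \b marks a word boundary, so \bhi\b matches "hi" but not the "hi" in "him":
print(re.findall(r"\bhi\b", "hi there, him, hi!"))           # ['hi', 'hi']

# \bhi\b.*\bLucy\b: "hi", then anything, then "Lucy":
print(bool(re.search(r"\bhi\b.*\bLucy\b", "hi dear Lucy")))  # True

# \b[aeiou]+[a-zA-Z]*\b: words that start with one or more vowels:
print(re.findall(r"\b[aeiou]+[a-zA-Z]*\b", "an apple on the tree"))
# ['an', 'apple', 'on']

# {n} and {n,m} control how many repetitions are allowed:
print(bool(re.fullmatch(r"[a-z]{2,4}", "hk")))  # True: 2 to 4 letters
print(bool(re.fullmatch(r"[a-z]{3}", "hk")))    # False: exactly 3 required
```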

Slide 14: Regular Expression
- Target: detect the email address.
- Specifications: A@B
  - A: combinations of English characters a to z, digits, or the characters . _ % + -
  - B: cse.cuhk.edu.hk or cuhk.edu.hk (English characters)
- Answer: \b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b
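A quick check of the answer pattern with Python's re module. The re.IGNORECASE flag is an added assumption so capital letters also match, since the slide's character classes are lower-case only; the test addresses come from the earlier slides.

```python
import re

# The slide's answer pattern, compiled with IGNORECASE (an added assumption):
EMAIL = re.compile(r"\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b", re.IGNORECASE)

for addr in ["czhou@cse.cuhk.edu.hk", "xxin@cuhk.edu.hk", "not-an-email"]:
    print(addr, "->", bool(EMAIL.fullmatch(addr)))
# czhou@cse.cuhk.edu.hk -> True
# xxin@cuhk.edu.hk -> True
# not-an-email -> False
```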

Slide 15: Appendix
- Mercator Crawler: http://mias.uiuc.edu/files/tutorials/mercator.pdf
- Regular Expression tutorial: http://www.regular-expressions.info/tutorial.html

Slide 16: Questions?

