News and media websites harvesting

1 News and media websites harvesting

2 A daily crawl since December 2010
The selective crawl contains 92 websites:
– National daily newspapers (http://www.lemonde.fr)
– Regional daily newspapers (http://www.charentelibre.fr)
– News agencies (http://fr.reuters.com)
– Buzz websites (http://www.buzzactus.com)
– News portals (http://actu.orange.fr)

3 A specific profile “News”, based on “Page + 1 click”
The crawl is stopped after 23 hours (settings: true, 82800 seconds, Terminate job)
The scope of the crawl:
– Max-hops = 1 (for the other crawls, we use 20)
– Max-trans-hop = 2 (for the other crawls, we use 3)
Delay between queries to a server:
– max-retries = 10 (for the other crawls, we use 30)
– retry-delay-seconds = 60 (for the other crawls, we use 900)
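To make the difference between the two profiles easier to compare, here is a minimal Python sketch that only restates the values from the slide. The class and field names are hypothetical and are not Heritrix identifiers. The "worst-case wait" is an illustrative calculation, assuming every retry for a URL is used and each one waits the full delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlProfile:
    """Values quoted on the slide; field names are hypothetical."""
    max_hops: int
    max_trans_hops: int
    max_retries: int
    retry_delay_seconds: int


# "News" profile versus the values used for the other crawls.
NEWS = CrawlProfile(max_hops=1, max_trans_hops=2,
                    max_retries=10, retry_delay_seconds=60)
OTHERS = CrawlProfile(max_hops=20, max_trans_hops=3,
                      max_retries=30, retry_delay_seconds=900)

# Time limit quoted on the slide, in seconds.
RUNTIME_LIMIT_SECONDS = 82_800


def worst_case_retry_wait(profile: CrawlProfile) -> int:
    """Seconds spent waiting if every retry is used (an assumption)."""
    return profile.max_retries * profile.retry_delay_seconds


if __name__ == "__main__":
    print("Runtime limit in hours:", RUNTIME_LIMIT_SECONDS / 3600)
    print("News, worst-case retry wait (s):", worst_case_retry_wait(NEWS))
    print("Others, worst-case retry wait (s):", worst_case_retry_wait(OTHERS))
```

Under that assumption, a failing URL can hold up the "News" profile for minutes rather than hours, which fits a crawl that has to finish within a day.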

4 A few key statistics…
For the first 3 quarters:
– 81,672,059 URLs collected
– 511.86 GB (compressed)
In one year, it will represent about:
– 109,000,000 URLs collected = 18% of our annual budget
– 700 GB (compressed) = 2.7% of our annual budget
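The yearly figures can be checked with a short calculation. This sketch assumes a simple linear extrapolation from three quarters to four, which the slide does not state explicitly. The "implied budget" values are derived here from the rounded figures and percentages and do not appear in the presentation.

```python
# Figures quoted on the slide for the first three quarters.
urls_three_quarters = 81_672_059
size_three_quarters_gb = 511.86  # compressed

# Linear extrapolation to a full year (assumption: constant crawl rate).
annual_urls = urls_three_quarters * 4 / 3
annual_size_gb = size_three_quarters_gb * 4 / 3

# Annual budgets implied by the slide's rounded figures and percentages.
# These are derived values, not numbers given in the presentation.
implied_url_budget = 109_000_000 / 0.18
implied_size_budget_gb = 700 / 0.027

if __name__ == "__main__":
    print(f"Extrapolated URLs per year: {annual_urls:,.0f}")
    print(f"Extrapolated size per year: {annual_size_gb:,.2f} GB")
    print(f"Implied URL budget: {implied_url_budget:,.0f}")
    print(f"Implied size budget: {implied_size_budget_gb:,.0f} GB")
```

The extrapolated URL count agrees with the rounded 109,000,000 on the slide. The extrapolated size comes out somewhat below 700 GB, so that figure appears to be rounded up.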

5 Crawl quality
The crawl finishes in about 8 hours
The quality of the archives is quite good
But the archives have their limits:
– Some news articles are presented on 2 pages on the live website (http://fr.reuters.com)
– The architecture of the website (http://www.lemonde.fr)
– The page loading time in the Wayback Machine
– Compressed code (http://www.francesoir.fr/)

6 Regional daily newspapers
Example: Ouest-France
It is the biggest title: 47 editions
In the past, we tested the deposit of PDF files, without success
Online, the PDF edition of the newspaper is not free
– A password is required to access the publication after subscription
We added the password to the Heritrix profile, but:
– The login/password is valid for 3 months only
– The crawler often gets disconnected
A big part of the site is programmed in JavaScript
Heritrix extracts a lot of false URLs from JavaScript
Any false URL causes a disconnect and leads to the login page
But Heritrix enters the password only once per job (the login page is then marked as “already seen” and is not collected again)
– We have crawled the articles but not the full PDF versions
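The disconnect problem described on this slide can be illustrated with a toy model. This is not Heritrix code: the function, the URLs and the session logic are all hypothetical, and the sketch only assumes that each URL is scheduled at most once per job and that a false URL ends the session.

```python
from collections import deque

LOGIN_URL = "https://news.example/login"  # hypothetical URL


def simulate_crawl(seeds, false_urls, protected):
    """Toy crawl loop with an 'already seen' set (not Heritrix code)."""
    already_seen = set()
    frontier = deque()
    logged_in = False
    collected = []

    def schedule(url):
        # Each URL is queued at most once per job.
        if url not in already_seen:
            already_seen.add(url)
            frontier.append(url)

    schedule(LOGIN_URL)
    for url in seeds:
        schedule(url)

    while frontier:
        url = frontier.popleft()
        if url == LOGIN_URL:
            logged_in = True       # credentials are submitted once
        elif url in false_urls:
            logged_in = False      # a false URL ends the session
            schedule(LOGIN_URL)    # redirect to login: rejected, already seen
        elif url in protected and not logged_in:
            continue               # only the login page would be served
        else:
            collected.append(url)
    return collected


if __name__ == "__main__":
    pdfs = ["edition-1.pdf", "edition-2.pdf"]
    result = simulate_crawl(
        seeds=["edition-1.pdf", "bogus-js-path", "edition-2.pdf"],
        false_urls={"bogus-js-path"},
        protected=set(pdfs),
    )
    print(result)
```

In this run only `edition-1.pdf` is collected. Once the false URL has been fetched, the login page cannot be scheduled a second time, so the second protected file is never retrieved.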

8 Today…
Do you crawl paid newspapers?
– Do you use passwords to crawl some publications?
– Or do you use only the IP addresses?
– How do you save the passwords in NAS? What about their access?
– Is it necessary to save the passwords in WB?
– How do you communicate the passwords to the researchers?
