
1 Video and Flash harvesting

2 Dailymotion, a special crawl

Twice a year we crawl Dailymotion. But the model changes all the time…
–The seed list contains more than 13000 URLs in 2011, for example:
  http://www.dailymotion.com/20Minutes
  http://www.dailymotion.com/user/20Minutes/1
  http://www.dailymotion.com/user/20Minutes/2

3 DM – Technical Solutions

1st crawl, August 2007
–The video page embeds the file URL as an escaped Flash variable:
  so46c979db49349.addVariable("url", "http%3A%2F%2Fwww.dailymotion.com%2Fget%2F14%2F320x240%2Fflv%2F3208281.flv%3Fkey%3Df131548d430fdc0700d90ecc01a53a4512e0656");
–Decoded, this is:
  http://www.dailymotion.com/get/14/320x240/flv/3208281.flv?key=f131548d430fdc0700d90ecc01a53a4512e0656
  i.e. a video file with an access key
–Beanshell script in the “extract-processors” chain
–Good result: 919 seeds, 11860 video files collected
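The escaped flashvar above can be unpacked with standard shell tools. A minimal sketch (the page snippet is hard-coded here, and only the percent-escapes that actually occur in it are decoded):

```shell
# Sketch: recover the video file URL from the escaped "url" flashvar
# found in a 2007-era Dailymotion page (snippet hard-coded for illustration).
page='so46c979db49349.addVariable("url", "http%3A%2F%2Fwww.dailymotion.com%2Fget%2F14%2F320x240%2Fflv%2F3208281.flv%3Fkey%3Df131548d430fdc0700d90ecc01a53a4512e0656");'

# 1. Pull out the percent-encoded value of the "url" variable.
encoded=$(printf '%s\n' "$page" | sed -n 's/.*addVariable("url", "\([^"]*\)").*/\1/p')

# 2. Decode the handful of percent-escapes that occur in these URLs.
decoded=$(printf '%s\n' "$encoded" | sed -e 's|%3A|:|g' -e 's|%2F|/|g' -e 's|%3F|?|g' -e 's|%3D|=|g')

# Prints the direct .flv URL with its access key.
echo "$decoded"
```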

4 DM – Technical Solutions

dailymotion.bsh:

    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.extractor.Link;
    import org.archive.util.TextUtils;
    import java.net.*;
    import java.util.Collection;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.regex.Matcher;

    String trigger = "^(?i)http://www.dailymotion.com/.*?video/(http%3A%2F%2F.*)$";
    String build = "$1";

    process(CrawlURI curi) {
        int size = curi.getOutLinks().size();
        if (size == 0) {
            return;
        }
        // Use an array copy because implied URIs will be added to the outlinks.
        Link[] links = curi.getOutLinks().toArray(new Link[size]);
        for (Link outlink : links) {
            Matcher m = TextUtils.getMatcher(trigger, outlink.getDestination());
            if (m.matches()) {
                String implied = m.replaceFirst(build);
                TextUtils.recycleMatcher(m);
                if (implied != null) {
                    try {
                        implied = URLDecoder.decode(implied, "utf8");
                        curi.createAndAddLink(implied, Link.SPECULATIVE_MISC, Link.SPECULATIVE_HOP);
                    } catch (e) {
                        System.out.println("Dailymotion beanshell processor: ERROR: Probably bad URI " + e);
                    }
                    if (curi.getOutLinks().remove(outlink)) {
                        System.out.println("Dailymotion beanshell processor: Outward link " + outlink + " has been removed from " + outlink.getSource());
                    } else {
                        System.out.println("Dailymotion beanshell processor: ERROR: Outward link " + outlink + " has NOT been removed from " + outlink.getSource());
                    }
                }
            } else {
                TextUtils.recycleMatcher(m);
            }
        }
    }

5 DM – Technical Solutions

2nd crawl, January 2008
–Beanshell script
–Rather good result: 3811 seeds, 62127 video files collected
3rd crawl, September 2008
–Beanshell script
–Result less good: 9683 seeds, but only 47382 videos found, 30842 HTTP 403 errors
–Problem due to the limited validity of the access key (less than two hours)
4th crawl, February 2009
–Crawled in two steps:
  First step: the video pages, with a “Page + 1 click” harvest template
  Second step: the video files, with a “video” harvest template and a Bash script to generate video file URIs with valid access keys
–Rather good result: 10949 seeds, 73335 video files collected

6 DM – Technical Solutions

How the two-jobs solution works:
–Extract all video page URIs from the first job’s crawl.log
–The second job is configured with “pause-at-finish=true”
–A Bash script launched on the crawler machine:
  checks the job state via the JMX interface and waits until the job is paused,
  fetches the video page with curl,
  extracts the video file URI,
  feeds this URI to the job via JMX (importUri command)
–20 crawlers worked in parallel for the 2011 crawl
The big disadvantage: in the Wayback Machine, the video files are no longer accessible via the video pages because of the different access keys
–But they are available via their URLs
–No solution found so far
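The steps above can be sketched roughly as follows. This is not the real BnF script: the JMX client invocation, the bean name, and the input file "video-pages.txt" are assumptions (Heritrix 1.x was commonly driven with the cmdline-jmxclient jar); only the URI-extraction step is concrete.

```shell
# Rough sketch of the two-jobs helper script. The JMX command line, the
# bean name, and "video-pages.txt" are assumptions, not the real script.
JMX="java -jar cmdline-jmxclient.jar controlRole:pass localhost:8849"

wait_until_paused() {
    # Poll the second job's status over JMX until it reports PAUSED.
    until $JMX org.archive.crawler:name=Heritrix Status 2>&1 | grep -q PAUSED; do
        sleep 10
    done
}

extract_video_uri() {
    # Pull the first video-file URI (an mp4 with an auth key, as in the
    # example URLs on the Examples slide) out of a fetched video page.
    grep -o 'http://www\.dailymotion\.com/cdn/[^"]*\.mp4?auth=[^"]*' "$1" | head -n 1
}

feed_uri() {
    # Hand one URI to the paused job via the importUri JMX operation.
    $JMX org.archive.crawler:name=Heritrix importUri "$1"
}

# The real loop needs a running crawler, so it is guarded here.
if [ "${RUN_CRAWL:-0}" = "1" ]; then
    wait_until_paused
    while read -r page_uri; do
        curl -s "$page_uri" -o page.html
        uri=$(extract_video_uri page.html)
        [ -n "$uri" ] && feed_uri "$uri"
    done < video-pages.txt
fi
```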

7 DM – Technical Solutions

5th crawl, October 2009
–Two-jobs solution
–Rather good result: 5659 seeds, 145761 video files collected
6th crawl, November 2010
–Big surprise: video file URIs appear directly in the source code of the video page, so no special solution needed
–Good result: 8649 seeds, 135599 videos collected
7th crawl, July 2011
–The two-jobs solution again
–Result less good: 13406 seeds, 182538 video files collected
–But a new phenomenon appeared: only 96968 unique video files
  A number of video files are missing and we don’t know why. That’s work for our next crawl…

8 DM – Indicators

| Crawl   | Seeds  | Video files total | Video files 200 | Video files 403 | Video files 200 unique | %  | Size     | Solution  | WB  |
|---------|--------|-------------------|-----------------|-----------------|------------------------|----|----------|-----------|-----|
| 2007-08 | 919    | 11 945            | 11 860          | 85              | 11 795                 | 99 | 206.3 GB | Beanshell | Yes |
| 2008-01 | 3 811  | 62 225            | 62 127          | 98              | 42 045                 | 68 | 1.0 TB   | Beanshell | Yes |
| 2008-09 | 9 683  | 78 224            | 47 382          | 30 842          | 44 563                 | 94 | 567.7 GB | Beanshell | Yes |
| 2009-02 | 10 949 | 73 335            |                 | 0               | 60 494                 | 82 | 1.0 TB   | Two jobs  | No  |
| 2009-10 | 5 659  | 146 501           | 145 761         | 740             | 113 493                | 78 | 1.5 TB   | Two jobs  | No  |
| 2010-11 | 8 649  | 135 603           | 135 599         | 4               | 133 184                | 98 | 2.5 TB   | Direct    | No  |
| 2011-07 | 13 406 | 182 538           |                 | 0               | 96 968                 | 53 | 4.4 TB   | Two jobs  | No  |

9 Examples

http://www.dailymotion.com/user/afp/1
We crawled:
http://www.dailymotion.com/video/xk1mpz_la-transition-a-commence-a-herat-dans-l-ouest-de-l-afghanistan_news
The video file’s URL in our archives is:
http://www.dailymotion.com/cdn/H264-512x384/video/xk1mpz.mp4?auth=1315101965-79b2f0e2f64eb356828be0911dbd2058

10 We didn’t crawl on the same page…

http://www.dailymotion.com/video/xk1g8b_roms-les-associations-denoncent-une-politique-de-stigmatisation_news

11 Which harvest template do you use?

How do you manage to crawl Dailymotion?
Today we need to reduce our seed list, so we are testing other harvest templates.
Example: http://www.dailymotion.com/20Minutes
–Users’ pages: dailymotion.com/user/20Minutes/…
–Videos’ pages: dailymotion.com/video/
1st solution: path + scope “one plus”
2nd solution: path and “Page + 1”
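One quick way to compare candidate templates is to check which page families a scope admits. A toy sketch of such a check (the patterns are illustrative, not our actual Heritrix template):

```shell
# Toy scope check for the seed page and the two page families named
# above; the patterns are illustrative, not a real harvest template.
in_scope() {
    case "$1" in
        http://www.dailymotion.com/20Minutes*)       return 0 ;;  # seed page
        http://www.dailymotion.com/user/20Minutes/*) return 0 ;;  # user pages
        http://www.dailymotion.com/video/*)          return 0 ;;  # video pages
        *)                                           return 1 ;;
    esac
}
```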

12 To access the videos in the Wayback

Today it’s very complicated because the model changes each year.
The link between the video’s page and the video is broken because of the URL key.
Have you got a solution?

13 Flash videos on YouTube

We tested IA’s harvest profile but we were unable to crawl videos on YouTube.
Flash is often a problem:
–It’s difficult to identify the videos’ URLs
–Do you use some special tools?

14 The audio files

The player doesn’t work in our archives
  http://www.deezer.com/fr/
  The audio files are streamed, so we cannot archive them.
The files take a long time to load
  http://fr.myspace.com/katyperry/music
  This causes timeouts during the crawl, so we cannot (simply) archive them.
The music files are hosted on another domain
  http://www.dogmazic.net
  So we must include the other domain in the scope.

15 Social media websites harvesting

16 Facebook and Twitter

Especially for the elections, we would like to crawl Facebook and Twitter.

17 Facebook

For the last crawl, we used generic harvest templates: “Path” and “Page + 2 clicks”
–E.g.: http://www.facebook.com/LesPrimaires/
There are several problems:
–The URL contains #: http://www.facebook.com/segoleneroyal#!/segoleneroyal
–The function to read comments isn’t crawled
–Pages take a long time to load

18 A special harvest template?

In 2010, we used a special harvest template.
The main idea of our profile is to crawl only URIs from facebook.com that are directly related to a specific Facebook user or group.
We identify those URIs by the numeric user or group ID, or by the user or group name contained in the URI:
–http://www.facebook.com/francois.bayrou
–http://www.facebook.com/pages/JO-Vancouver-2010-France/287369570644
–http://www.facebook.com/group.php?v=wall&ref=search&gid=20473543113
We use a Heritrix (1.14.4) profile based on a SURT-prefixed scope:
–First in the decide-rule sequence, we REJECT anything from facebook.com
–Then we ACCEPT only URIs from facebook.com containing a user ID or a group name that we want to crawl
This makes sure that the robot stays on user- or group-related pages and does not break out to crawl the entire Facebook site.
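The reject-then-accept ordering can be illustrated with a toy decision function; the ID and name are the slide’s own examples, and the real scoping was of course done with Heritrix decide rules, not shell:

```shell
# Toy model of the decide-rule ordering: the last matching rule wins.
decide() {
    decision=ACCEPT                          # in scope by default
    case "$1" in
        *facebook.com/*) decision=REJECT ;;  # first: reject all of facebook.com
    esac
    case "$1" in                             # then: re-accept wanted IDs/names
        *facebook.com/*287369570644*|*facebook.com/*francois.bayrou*) decision=ACCEPT ;;
    esac
    printf '%s\n' "$decision"
}
```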

19 A special harvest template?

[...]
ACCEPT   true   surts-dump.txt   false   true     (SURT-prefixed scope)
REJECT   OR   ^http://.*\.facebook\.com/.*$
ACCEPT   OR   ^http://.*\.facebook\.com/.*123456789.*$
[...]
              ^http://.*\.facebook\.com/.*name.*$
[...]

20 A special harvest template?

What’s the difference from the Danish harvest template?
Do you notice any differences between a group’s Facebook page and an individual member’s?

21 Twitter

The # sends us to the homepage and we aren’t able to get past this.
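One possible workaround from that period (not something the slides confirm was used) is Google’s AJAX-crawling convention: a “#!” URL can be rewritten to its “_escaped_fragment_” equivalent, which supporting servers answer with a static HTML snapshot:

```shell
# Rewrite a hash-bang URL to the _escaped_fragment_ form of Google's
# AJAX-crawling convention (the handle below is just an example).
rewrite_hashbang() {
    printf '%s\n' "$1" | sed 's|#!|?_escaped_fragment_=|'
}

rewrite_hashbang 'http://twitter.com/#!/example'
```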

22 In 2010, we succeeded in crawling Twitter, but we met some problems:
–with the MIME type, declared in the source code, of the responses used to read more tweets

