1 Video and flash harvesting

2 Dailymotion, a special crawl
Twice a year we crawl Dailymotion. But the model changes all the time…
– The seed list contained more than … URLs in 2011, for example

3 DM – Technical Solutions
1st crawl, August 2007
– The video file URL is embedded, URL-encoded, in the player call:
so46c979db49349.addVariable("url", "http%3A%2F%2Fwww.dailymotion.com%2Fget%2F14%2F320x240%2Fflv%2F…81.flv%3Fkey%3Df131548d430fdc0700d90ecc01a53a4512e0656");
– i.e., decoded: …81.flv?key=f131548d430fdc0700d90ecc01a53a4512e0656, a video file with an access key
– Beanshell script in the "extract-processors" chain
– Good result: 919 seeds, … video files collected

4 DM – Technical Solutions
dailymotion.bsh:

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Link;
import org.archive.util.TextUtils;
import java.net.*;
import java.util.regex.Matcher;

// Regexp matching the URL-encoded video file URL in the player call;
// the exact pattern was truncated in this transcript, hence the "…".
String trigger = "^(?i)…";
// The replacement keeps only the captured video file URL.
String build = "$1";

process(CrawlURI curi) {
    int size = curi.getOutLinks().size();
    if (size == 0) {
        return;
    }
    // use array copy because implied URIs will be added to outlinks
    Link[] links = curi.getOutLinks().toArray(new Link[size]);
    for (Link outlink : links) {
        Matcher m = TextUtils.getMatcher(trigger, outlink.getDestination());
        if (m.matches()) {
            String implied = m.replaceFirst(build);
            TextUtils.recycleMatcher(m);
            if (implied != null) {
                try {
                    // Decode the %-escaped video file URL and add it as a new outlink.
                    implied = URLDecoder.decode(implied, "utf8");
                    curi.createAndAddLink(implied, Link.SPECULATIVE_MISC, Link.SPECULATIVE_HOP);
                } catch (Exception e) {
                    System.out.println("Dailymotion beanshell processor: ERROR: probably bad URI " + e);
                }
                // Drop the original (encoded) outlink so only the decoded one is queued.
                if (curi.getOutLinks().remove(outlink)) {
                    System.out.println("Dailymotion beanshell processor: outward link " + outlink + " has been removed from " + outlink.getSource());
                } else {
                    System.out.println("Dailymotion beanshell processor: ERROR: outward link " + outlink + " has NOT been removed from " + outlink.getSource());
                }
            }
        }
    }
}
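For context, a rough sketch of how such a script might sit in the "extract-processors" chain of a Heritrix 1.x order.xml. The class path and setting name are our assumption, modelled on Heritrix 1.14's BeanShellProcessor, not copied from our actual configuration:

<!-- hypothetical wiring, placed after the standard extractors -->
<newObject name="dailymotionBsh" class="org.archive.crawler.processor.BeanShellProcessor">
  <boolean name="enabled">true</boolean>
  <string name="script-file">dailymotion.bsh</string>
</newObject>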

5 DM – Technical Solutions
2nd crawl, January 2008
– Beanshell script
– Rather good result: 3811 seeds, … video files collected
3rd crawl, September 2008
– Beanshell script
– Weaker result: 9683 seeds, but only … videos found, … HTTP 403 errors
– Problem due to the limited validity of the access key (less than two hours)
4th crawl, February 2009
– Crawled in two steps:
  - first, the video pages, with a "Page + 1 click" harvest template
  - second, the video files, with a "video" harvest template and a Bash script to generate video file URIs with valid access keys
– Rather good result: … seeds, … video files collected

6 DM – Technical Solutions
How the two-jobs solution works:
– Extraction of all video page URIs from the first job's crawl.log
– The second job is configured with "pause-at-finish=true"
– A Bash script is launched on the crawler machine (sketched below) which:
  - checks the job state via the JMX interface and waits until the job is paused
  - fetches each video page with curl
  - extracts the video file URI
  - feeds this URI to the job via JMX (the importUri command)
– 20 crawlers worked in parallel for the 2011 crawl
The big disadvantage: in the Wayback Machine, the video files are no longer accessible via the video pages, because of the different access keys
– But they are available via their URLs
– No solution found so far
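A minimal sketch of such a driver script, assuming the job's JMX interface is reached through the cmdline-jmxclient jar; the credentials, port, bean name, extraction pattern and file names below are placeholders, and the importUri argument list is simplified:

#!/bin/bash
# Hypothetical two-jobs driver; every name and value below is a placeholder.
JMX="java -jar cmdline-jmxclient.jar controlRole:password localhost:8849"
JOB="org.archive.crawler:type=CrawlService.Job,name=dailymotion-videos"

# Wait until the second job has paused (it runs with pause-at-finish=true).
until $JMX "$JOB" Status 2>&1 | grep -q PAUSED; do
    sleep 30
done

# For each video page URI extracted from the first job's crawl.log...
while read -r page; do
    # ...fetch the page and pull out the video file URI (with a fresh access key)...
    video=$(curl -s "$page" | grep -oE 'http[^"]+\.(flv|mp4)\?[^"]+' | head -n 1)
    # ...and feed it to the paused job via the JMX importUri command.
    [ -n "$video" ] && $JMX "$JOB" "importUri=$video"
done < video-pages.txt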

7 DM – Technical Solutions
5th crawl, October 2009
– Two-jobs solution
– Rather good result: 5659 seeds, … video files collected
6th crawl, November 2010
– Big surprise: the video file URIs appeared directly in the source code of the video pages, so no special solution was needed
– Good result: 8649 seeds, … videos collected
7th crawl, July 2011
– The two-jobs solution again
– Weaker result: … seeds, … video files collected
– But a new phenomenon appeared: only … unique video files among them
– A number of video files are still missing, and we don't know why. That's work for our next crawl…

8 DM – Indicators

Crawl        | Seeds | Video files total | Video files 200 | Video files 403 | Video files 200 unique | % | Size | Solution  | WB
1 (Aug 2007) |   919 | …                 | …               | …               | …                      | … | … GB | Beanshell | Yes
2 (Jan 2008) |  3811 | …                 | …               | …               | …                      | … | … TB | Beanshell | Yes
3 (Sep 2008) |  9683 | …                 | …               | …               | …                      | … | … GB | Beanshell | Yes
4 (Feb 2009) |     … | …                 | …               | …               | …                      | … | … TB | Two jobs  | No
5 (Oct 2009) |  5659 | …                 | …               | …               | …                      | … | … TB | Two jobs  | No
6 (Nov 2010) |  8649 | …                 | …               | …               | …                      | … | … TB | Direct    | No
7 (Jul 2011) |     … | …                 | …               | …               | …                      | … | … TB | Two jobs  | No

(… = missing figure; "200"/"403" = counts by HTTP status code; "WB" = video files reachable via their pages in the Wayback Machine)

9 Examples
We crawled:
www.dailymotion.com/video/xk1mpz_la-transition-a-commence-a-herat-dans-l-ouest-de-l-afghanistan_news
The video file's URL in our archives is:
www.dailymotion.com/cdn/H…x384/video/xk1mpz.mp4?auth=…b2f0e2f64eb356828be0911dbd2058

10 On the same page, we didn't crawl:
…denoncent-une-politique-de-stigmatisation_news

11 Which harvest template do you use? How do you manage to crawl Dailymotion?
Today we need to reduce our seed list, so we are testing other harvest templates, e.g.:
– Users' pages: dailymotion.com/user/20Minutes/…
– Videos' pages: dailymotion.com/video/
1st solution: path + scope one plus
2nd solution: path and page + 1
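As a rough sketch only, one way such a "path plus one click" constraint might be expressed in a Heritrix 1.x DecidingScope; the rule names are illustrative and these are not BnF's actual templates:

<newObject name="acceptPath" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
  <string name="decision">ACCEPT</string>
  <!-- SURT prefixes derived from the seeds, e.g. the /user/20Minutes/ path -->
</newObject>
<newObject name="maxOneClick" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
  <!-- rejects URIs more than one link away from a seed -->
  <integer name="max-hops">1</integer>
</newObject>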

12 To access the videos in the Wayback Machine
Today it's very complicated, because the model changes each year
The link between the video's page and the video file is broken because of the URL key
Have you got a solution?

13 Flash videos on YouTube
We tested IA's harvest profile, but we were unable to crawl videos on YouTube
Flash is often a problem:
– It's difficult to identify the video URLs
– Do you use any special tools?

14 The audio files
The player doesn't work in our archives
– The audio files are streamed, so we cannot archive them
The time to load the files is long
– This causes timeouts during the crawl, so we cannot (simply) archive them
The music files are hosted on another domain
– So we must include that other domain in the scope
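For instance, if the music files lived on a made-up host such as music.example.com, including it could mean adding one more SURT prefix to the scope:

http://(com,example,music,)/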

15 Social media websites harvesting

16 Facebook and Twitter
Especially for the elections, we would like to crawl Facebook and Twitter

17 Facebook
For the last crawl, we used generic harvest templates: "Path" and "Page + 2 clicks"
– E.g.: …
There are several problems:
– The URL contains #
– The function to read the comments isn't crawled
– It takes a lot of time to load a page

18 A special harvest template?
In 2010, we used a special harvest template
The main idea in our profile is to crawl only URIs from facebook.com which are directly related to a specific Facebook user or group
We identify those URIs by the numeric user or group ID, or by the user or group name, contained in the URI:
– …
– …
– …
We use a Heritrix (1.14.4) profile which is based on a SURT-prefixed scope
– First in the decide rule sequence, we 'reject' anything from facebook.com
– Then, we 'accept' only URIs from facebook.com containing a user ID or a group name that we want to crawl
– This makes sure that the robot stays on user- or group-related pages and does not break out to crawl the entire Facebook site

19 A special harvest template?

[...]
<!-- Reconstructed sketch: the XML tags were lost in this transcript; class and
     setting names follow Heritrix 1.14 conventions and are our best guess.
     The regexps themselves are truncated, hence the "^…". -->
<newObject name="acceptSurts" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
  <string name="decision">ACCEPT</string>
  <boolean name="seeds-as-surt-prefixes">true</boolean>
  <string name="surts-dump-file">surts-dump.txt</string>
  <boolean name="also-check-via">false</boolean>
  <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>
<!-- reject anything from facebook.com -->
<newObject name="rejectFacebook" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="list-logic">OR</string>
  <stringList name="regexp-list">
    <string>^…</string>
  </stringList>
</newObject>
<!-- accept back the URIs carrying a wanted user ID or group name -->
<newObject name="acceptUsersAndGroups" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
  <string name="decision">ACCEPT</string>
  <string name="list-logic">OR</string>
  <stringList name="regexp-list">
    <string>^…</string>
    [...]
    <string>^…</string>
  </stringList>
</newObject>
[...]
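Purely as an illustration of what the two regexp lists above might contain, with an invented group name and numeric ID (the real expressions are not reproduced here):

REJECT list: ^https?://(www\.)?facebook\.com/.*
ACCEPT list: ^https?://(www\.)?facebook\.com/ExampleGroupName.*
             ^https?://(www\.)?facebook\.com/.*[?&]id=123456789.*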

20 A special harvest template?
What's the difference with the Danish harvest template?
Do you notice any differences between a group's Facebook page and an individual member's page?

21 Twitter
The # sends us to the homepage, and we aren't able to get past this

22 In 2010, we succeeded in crawling Twitter, but we met some problems:
– with the MIME type declared in the source code for reading more tweets