27 APRIL 2015 STUDYING A NATION’S WEB DOMAIN OVER TIME: ANALYTICAL AND METHODOLOGICAL CONSIDERATIONS NIELS BRÜGGER, ASSOCIATE PROFESSOR, HEAD OF CENTRE.

27 APRIL 2015
STUDYING A NATION’S WEB DOMAIN OVER TIME: ANALYTICAL AND METHODOLOGICAL CONSIDERATIONS
NIELS BRÜGGER, ASSOCIATE PROFESSOR, HEAD OF CENTRE FOR INTERNET STUDIES AND NETLAB, AARHUS UNIVERSITY
DITTE LAURSEN, SENIOR RESEARCHER AND CURATOR, THE DANISH NETARCHIVE
JANNE NIELSEN, RESEARCH ASSISTANT, NETLAB, AARHUS UNIVERSITY
(Niels)

STUDYING A NATION’S WEB DOMAIN OVER TIME
Niels Brügger, Ditte Laursen & Janne Nielsen, 27 APRIL 2015

OVERVIEW OF PRESENTATION (Niels)
1. The project
› Why study the development of a nation’s web domain?
› How to study the development of a nation’s web domain? An outline of an analytical design
2. Methodological challenges
3. Solutions
4. Results
› Registry of .dk domains
› Corpus creation
5. Next steps

THE PROJECT (Niels)
› What has the entire Danish web looked like in the past, and how has it developed?
› What are the methodological challenges in conducting such a study?
› What kind of research infrastructure do we need to conduct such a study?

WHY STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? (Niels)
› It is an important part of a nation’s cultural heritage
› It is a backdrop for all other types of web entities and activities
› It can identify some of the patterns in the development of the web and relate them to the web of today

HOW TO STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? (Niels)
An outline of an analytical design: a gross list of possible ’probes’
› Size
› Space
› Structure
› Aliveness
› Content

HOW CAN WE STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? SIZE: BYTES (Niels)
› How small/big is a nation’s web domain?
› The size of different file types and of file types in general
› How big/small are websites?

HOW CAN WE STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? SPACE: GEOLOCATION (Niels)
› Where are websites located?
› Search the text for geographic references, e.g. postcodes in footers
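
The postcode idea above can be sketched as follows. This is only an illustration, not the project's method: it assumes that a Danish postcode appears as four digits (optionally prefixed with "DK-") followed by a capitalised place name, and the function name `find_postcodes` is hypothetical. Such a pattern is a heuristic and will also match other four-digit numbers that happen to be followed by a capitalised word, such as a year in a copyright notice.

```python
import re
from collections import Counter

# Assumption: a postcode is four digits, optionally prefixed "DK-",
# followed by whitespace and a capitalised place name.
POSTCODE_RE = re.compile(r"\b(?:DK-)?(\d{4})\s+[A-ZÆØÅ]")

def find_postcodes(text):
    """Return a Counter of candidate postcodes found in a page's text."""
    return Counter(POSTCODE_RE.findall(text))

footer = "Nørregade 1, 8000 Aarhus C. Tlf. 1234 5678. Afdeling: DK-2100 København Ø"
print(find_postcodes(footer))
```

Counting candidates per website, rather than per page, would be one way to reduce the weight of footers repeated on every page.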

HOW CAN WE STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? NETWORKS (Niels)
Website internal/external hyperlinks
› Are websites closed or open towards the web?
› How flat/deep are websites?
Web domain internal/external hyperlinks
› Centrality based on in-links
› How well-linked is the national web domain to the rest of the web?
› Which other domain names are the most linked-to?
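
As a rough illustration of the hyperlink probes above, the sketch below counts internal and external links and ranks domains by in-links. It assumes that links are already available as (source URL, target URL) pairs, and it treats the last two labels of the host name as the domain, which is a simplification that fits names such as example.dk but not every registry. The names `link_stats` and `registered_domain` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def registered_domain(url):
    """Naive domain extraction: the last two labels of the host name."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def link_stats(links):
    """Count internal and external links and in-links per target domain.

    links: iterable of (source_url, target_url) pairs.
    """
    internal = external = 0
    inlinks = Counter()
    for source, target in links:
        src, dst = registered_domain(source), registered_domain(target)
        if src == dst:
            internal += 1
        else:
            external += 1
            inlinks[dst] += 1  # only links from other domains count as in-links
    return internal, external, inlinks

links = [
    ("http://a.dk/x", "http://a.dk/y"),
    ("http://a.dk/x", "http://www.b.dk/"),
    ("http://c.dk/", "http://b.dk/z"),
    ("http://c.dk/", "http://example.com/"),
]
internal, external, inlinks = link_stats(links)
print(internal, external, inlinks.most_common(2))
```

The in-link counter gives a simple centrality measure; the ratio of internal to external links indicates how closed or open a site is towards the rest of the web.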

HOW CAN WE STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? ALIVENESS: UPDATING (Niels)
› Domain names: number of new/inactive/disappeared domain names
› Updating: number of web objects that have changed since the last archiving

HOW CAN WE STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? CONTENT 1 (Niels)
Closedness
› How many websites are password protected?
File and software types
› Which file types are the most prevalent?
› Which software types are the most widespread?
Language
› Does the national language prevail, or do foreign languages?

HOW CAN WE STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? CONTENT 2 (Niels)
Textual elements on webpage
› Background color
› Most used fonts
› Length of webpages
› Placing of menu items (left-aligned and vertical, or top-aligned and horizontal)
Semantics
› Word frequencies
› Where specific issues or topics are to be found, and how they spread

METHODOLOGICAL CHALLENGES (Ditte)
The web of the past is gone
Possible solution: using (national) web archives
› DK: Legal Deposit law effective July 2005
› DK: web material within the ccTLD .dk and websites on other domains aimed at a Danish audience
› DK, 2015: approx. 1 million active domain names within the ccTLD .dk; 583 terabytes
No 1:1 relation between archive and the Danish web domain

METHODOLOGICAL CHALLENGES (Ditte)
No 1:1 relation between the Danish national archive and the Danish national web domain
› Not everything has been archived
› Unsystematic, no register, no original to compare with
› Archiving takes time, so e.g. the link structure becomes inconsistent
› Deduplication may affect the subsequent use of the archived material
› Archiving strategies may change between two archivings
› Parts of domains may be harvested more than once

NETLAB WORKSHOP ON WEB ARCHIVING, 18 MARCH 2015
PARTS OF DOMAINS MAY BE HARVESTED MORE THAN ONCE (Ditte)
[Diagram labels: start URL, URL, harvester (web crawler/spider), domain]

[Diagram labels: domain A URL, domain B URL, domain C URL, …] (Ditte)

METHODOLOGICAL CHALLENGES (Ditte)
› Main harvest (MH): objects within a domain which have been harvested in the job to which the harvest of the domain was assigned
› By-harvest (BH): objects within a domain which have been harvested in a job other than the one to which the harvest of the domain was assigned
[Diagram labels: JOB 1, JOB 2, JOB 3; main harvests (MH) of domains A, B, C, D, E and F; by-harvests (BH) B1, B2 and D1]
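
The two definitions above can be expressed as a small labelling step. The sketch below is an assumption-laden illustration: it supposes that each crawl-log entry is available as a (job ID, URL) pair and that a lookup table from domain to its assigned job exists. The names `label_entries`, `domain_of` and `assigned_job` are hypothetical, and the domain extraction is the same naive "last two labels" rule as before.

```python
from urllib.parse import urlsplit

def domain_of(url):
    """Naive domain extraction: the last two labels of the host name."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])

def label_entries(entries, assigned_job):
    """Label each (job_id, url) entry as main harvest or by-harvest.

    assigned_job maps a domain to the job its harvest was assigned to.
    An object fetched in that job is "MH"; fetched in any other job, "BH".
    """
    labelled = []
    for job_id, url in entries:
        kind = "MH" if assigned_job.get(domain_of(url)) == job_id else "BH"
        labelled.append((job_id, url, kind))
    return labelled

assigned_job = {"a.dk": 1, "b.dk": 1, "e.dk": 2}
entries = [(1, "http://a.dk/"), (2, "http://b.dk/img.png"), (2, "http://e.dk/")]
for row in label_entries(entries, assigned_job):
    print(row)
```

Here the image from b.dk is a by-harvest because it was fetched in job 2 while b.dk was assigned to job 1.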

SOLUTIONS (Ditte)
Not to use the archive after all
› Use the registry of .dk domains
Corpus creation
› Selection of harvests
› Selection of one version of each domain (consisting of the main harvest and possibly by-harvests)

REGISTRY OF .DK DOMAINS (Ditte)
Size and aliveness: 2006, 2009, 2012, 2015
› What is the total number of domain names over time?
› How many domain names have disappeared compared to the previous year? (and which ones)
› How many domain names have been created compared to the previous year? (and which ones)
› How many domain names have changed hands compared to the previous year? (and which ones)
› How does the relationship between ownership and domains develop over time? (cf. long tail)
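
The questions above amount to comparing two registry snapshots. A minimal sketch, assuming each snapshot has been loaded into a dictionary that maps a domain name to its owner name (the function name `compare_snapshots` and the sample data are hypothetical):

```python
def compare_snapshots(previous, current):
    """Compare two registry snapshots given as {domain: owner} dicts.

    Returns three sets: created, disappeared and changed-hands domains.
    """
    prev_names, curr_names = set(previous), set(current)
    created = curr_names - prev_names
    disappeared = prev_names - curr_names
    changed_hands = {
        name for name in prev_names & curr_names
        if previous[name] != current[name]
    }
    return created, disappeared, changed_hands

snapshot_2012 = {"a.dk": "Anna", "b.dk": "Bo", "c.dk": "Carl"}
snapshot_2015 = {"a.dk": "Anna", "b.dk": "Dorte", "d.dk": "Erik"}
created, disappeared, changed_hands = compare_snapshots(snapshot_2012, snapshot_2015)
print(len(snapshot_2015), created, disappeared, changed_hands)
```

Comparing owner names as plain strings only detects a change in the registered name, so a renamed company would count as a change of hands.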

RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
[Chart: Number of domain names over time]

RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
[Chart: New and disappearing domain names from 2005 to 2015]

RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
Number of domain names which have changed hands over time
› In 2015, 14% of the domains from 2012 had a different owner name
› Both in 2012 and in 2015, just under 10% of the total number of owners owned 50% of the Danish domains
› An observation: if you own more than three domains, you are part of the top 10% of domain owners
[Table columns: Year, Domains, Owners, Anonymous; the values are not preserved in the transcript]
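
A concentration figure of the kind reported above (a small share of owners holding half of the domains) could be computed roughly as follows. This is a sketch under the assumption that the registry is available as a {domain: owner} dictionary with anonymous registrants already removed; the function name `owner_share_for_half` is hypothetical and the sample data is invented purely to exercise the code.

```python
from collections import Counter

def owner_share_for_half(domain_owner):
    """Smallest share of owners who together own at least half of all domains.

    domain_owner: {domain: owner}. Owners are ranked by number of domains,
    largest first, and accumulated until they cover 50% of the domains.
    """
    counts = sorted(Counter(domain_owner.values()).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for n_owners, n_domains in enumerate(counts, start=1):
        covered += n_domains
        if covered * 2 >= total:
            return n_owners / len(counts)
    return 0.0  # empty registry

# Invented sample: one owner with five domains, five owners with one each.
sample = {f"big{i}.dk": "BigCo" for i in range(5)}
sample.update({f"small{i}.dk": f"Owner{i}" for i in range(5)})
print(owner_share_for_half(sample))
```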

RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
Relationship of ownership and domain names over time. Anonymous registrants removed. Chart shows 2012; there is no visual difference between 2012 and 2015.
[Table residue, values not fully preserved in the transcript: "Parameter Max Mean Median11"]

PRE/POST-STEPS: REGISTRY OF .DK DOMAINS (Ditte)
Pre-steps
› DK Hostmaster has shifted from ISO-8859 to UTF-8
› Earlier attempts at handling the data assumed space-separated data sets when in fact they are fixed-width fields
› Data from DK Hostmaster contains dirt, e.g. tab characters and, in one year, some sort of header
Post-steps
› Same questions for several years (all years, up to four times a year)
› Further investigation of which domains have disappeared
› New questions emerged in the process
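
The pre-steps above (legacy encoding, fixed-width fields, stray tabs, an occasional header) might be handled along these lines. The actual DK Hostmaster layout is not given in the slides, so the column widths, the `iso-8859-1` default, the header prefix and the function name `parse_registry` are all assumptions made for illustration. The sketch also assumes that a stray tab occupies exactly one column.

```python
def parse_registry(raw_bytes, widths, encoding="iso-8859-1", header_prefix=None):
    """Parse fixed-width registry lines into tuples of stripped fields.

    widths: list of column widths (hypothetical; the real layout is unknown).
    header_prefix: if given, lines starting with it are skipped as headers.
    """
    rows = []
    # Split on "\n" only: str.splitlines() would also break on other
    # characters that can occur in decoded legacy data.
    for line in raw_bytes.decode(encoding).split("\n"):
        line = line.rstrip("\r").replace("\t", " ")  # assume tab = one column
        if not line.strip():
            continue
        if header_prefix and line.startswith(header_prefix):
            continue
        fields, pos = [], 0
        for width in widths:
            fields.append(line[pos:pos + width].strip())
            pos += width
        rows.append(tuple(fields))
    return rows

raw = "DOMAIN    OWNER\nblåbær.dk Jens Hansen\nfoo.dk\t   Søren A/S \n".encode("iso-8859-1")
print(parse_registry(raw, widths=[10, 20], header_prefix="DOMAIN"))
```

For snapshots delivered after the switch to UTF-8, the same function could be called with `encoding="utf-8"`.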

CORPUS CREATION (Niels)
Collaboration between researchers, curators, developers and management at the archive
› How is a broad crawl performed, i.e. in several ”steps”?
› When were broad crawls performed?
› How to find the most complete version of a domain within a certain timespan within a broad crawl?
› What do we mean when we talk about a ”web element”, a ”web page”, a ”version” etc.?
› What could a corpus creation algorithm look like?
› How many resources are needed to test and implement the creation of a corpus?

CORPUS CREATION (Niels)
Use of broad crawls
› Internationally recognized as a suitable web harvesting strategy for national archives
› 2-4 broad crawls each year of all domains from .dk as well as Danish websites published under other extensions
› Comprehensive in nature and consistent over time

CORPUS CREATION (Niels)
Selection of broad crawls
› Four broad crawls, one from each of the years 2006, 2009, 2012 and 2015 (first crawl of the year)

CORPUS CREATION (Niels)
Selection of one harvested version of each domain
› Domain version from the ’main harvest’
› Inclusion of unique materials from the ’by-harvest’ if the material is within our selected time span
[Diagram labels: JOB 1, JOB 2, JOB 3; main harvests (MH) of domains A, B, C, D, E and F; by-harvests (BH) B1, B2 and D1]

CORPUS CREATION (Niels)
Test of the algorithm
› Tested on the first broad crawl from January 2006 (1 TB, only websites <10 MB)
› This harvest consists of 127 jobs
› Each job consists of several domains
› We produce an 18 GB crawl log enhanced with job IDs

CORPUS CREATION (Niels)
Test of the algorithm
› Using IBM BigInsights we can run the algorithm on this large spreadsheet
› The algorithm locates the objects that are not included in a main harvest (’by-harvests’)
› There might be duplicates; in these cases, the algorithm identifies and selects the objects closest to the time of the main harvest
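
The selection rule described above can be sketched in plain Python (not IBM BigInsights). The sketch assumes that each crawl-log record carries a URL, a domain, a job ID and a fetch time, and that the assigned job and a reference time for the main harvest are known per domain. The record fields and the names `select_corpus`, `main_job` and `main_time` are hypothetical, and the time-span filter mentioned in the slides is left out for brevity.

```python
from datetime import datetime

def select_corpus(records, main_job, main_time):
    """Select one object per URL: main harvest first, else closest by-harvest.

    records: dicts with keys "url", "domain", "job", "time" (hypothetical).
    main_job: {domain: job id the domain's harvest was assigned to}.
    main_time: {domain: datetime of the main harvest} (an assumption).
    """
    main, by_candidates = {}, {}
    for rec in records:
        if rec["job"] == main_job[rec["domain"]]:
            main.setdefault(rec["url"], rec)  # object from the main harvest
        else:
            by_candidates.setdefault(rec["url"], []).append(rec)

    corpus = dict(main)
    for url, candidates in by_candidates.items():
        if url in main:
            continue  # already covered by the main harvest
        reference = main_time[candidates[0]["domain"]]
        # Among duplicates, keep the one fetched closest to the main harvest.
        corpus[url] = min(candidates, key=lambda r: abs(r["time"] - reference))
    return corpus

records = [
    {"url": "http://b.dk/", "domain": "b.dk", "job": 1, "time": datetime(2006, 1, 10)},
    {"url": "http://b.dk/", "domain": "b.dk", "job": 2, "time": datetime(2006, 1, 12)},
    {"url": "http://b.dk/logo.png", "domain": "b.dk", "job": 2, "time": datetime(2006, 1, 12)},
    {"url": "http://b.dk/logo.png", "domain": "b.dk", "job": 3, "time": datetime(2006, 1, 20)},
]
corpus = select_corpus(records, {"b.dk": 1}, {"b.dk": datetime(2006, 1, 10)})
for url, rec in corpus.items():
    print(url, "from job", rec["job"])
```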

NEXT STEPS (Niels)
From test to implementation
› How to get from crawl logs to the material that the crawl logs refer to and that we want to analyze? Should WARC files be opened? Should a subset of an index be used?
› Start making some of the analyses
Dissemination and networking
› Book chapters and papers
› An open workshop in Aarhus, Denmark in 2016 for other national web archives and scholars wanting to do similar projects, aiming at establishing transnational ’best practice’ and analytical design
