
1 STUDYING A NATION’S WEB DOMAIN OVER TIME: ANALYTICAL AND METHODOLOGICAL CONSIDERATIONS (Niels)
27 APRIL 2015
NIELS BRÜGGER, ASSOCIATE PROFESSOR, HEAD OF CENTRE FOR INTERNET STUDIES AND NETLAB, AARHUS UNIVERSITY
DITTE LAURSEN, SENIOR RESEARCHER AND CURATOR, THE DANISH NETARCHIVE
JANNE NIELSEN, RESEARCH ASSISTANT, NETLAB, AARHUS UNIVERSITY

2 OVERVIEW OF PRESENTATION (Niels)
1. The project
› Why study the development of a nation’s web domain?
› How to study the development of a nation’s web domain? An outline of an analytical design
2. Methodological challenges
3. Solutions
4. Results
› Registry of .dk domains
› Corpus creation
5. Next steps

3 THE PROJECT (Niels)
What has the entire Danish web looked like in the past, and how has it developed?
What are the methodological challenges in conducting such a study?
What kind of research infrastructure do we need to conduct such a study?

4 WHY STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? (Niels)
› It is an important part of a nation’s cultural heritage
› It is a backcloth for all other types of web entities and activities
› It can identify some of the patterns in the development of the web and relate them to the web of today

5 HOW TO STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN? (Niels)
An outline of an analytical design: a gross list of possible ‘probes’
› Size
› Space
› Structure
› Aliveness
› Content

6 HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? SIZE – BYTES (Niels)
› How small/big is a nation’s web domain?
› The size of different file types and of file types in general
› How big/small are websites?

7 HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? SPACE – GEOLOCATION (Niels)
› Where are websites located?
› Search the text for geographic references, e.g. postcodes in footers
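The postcode probe could be sketched roughly as follows. This is only an illustration, assuming that a Danish postcode appears as four digits followed by a capitalised town name and that the page text has already been extracted from the archive; the function name `find_postcodes` and the sample footer are hypothetical.

```python
import re

# Assumption: a Danish postcode is four digits (1000-9999) followed by
# whitespace and a capitalised town name.
POSTCODE_RE = re.compile(r"\b([1-9]\d{3})\s+([A-ZÆØÅ][\w.\-]*)")

def find_postcodes(text):
    """Return (postcode, town) pairs found in a piece of page text."""
    return POSTCODE_RE.findall(text)

# Hypothetical footer text, not taken from the archive.
footer = "Eksempelvej 1, 8000 Aarhus C"
print(find_postcodes(footer))
```

A pattern this simple will also match years followed by a capitalised word (for instance a copyright line), so a real probe would probably need to check candidates against a list of valid postcodes.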

8 HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? NETWORKS (Niels)
Website internal/external hyperlinks
› Are websites closed or open towards the web?
› How flat/deep are websites?
Web domain internal/external hyperlinks
› Centrality based on in-links
› How well-linked is the national web domain to the rest of the web?
› Which other domain names are the most linked-to?
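As an illustration of centrality based on in-links, the sketch below counts links arriving at each domain from other domains. It assumes that (source URL, target URL) pairs have already been extracted from the archived pages, and it treats the host name as the "domain", which is a simplification; the names `domain_of` and `inlink_counts` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_of(url):
    """Return the lower-cased host part of a URL (a simplification of 'domain')."""
    return (urlsplit(url).hostname or "").lower()

def inlink_counts(links):
    """Count links arriving at each domain from a different domain."""
    counts = Counter()
    for source, target in links:
        src, dst = domain_of(source), domain_of(target)
        if src and dst and src != dst:
            counts[dst] += 1
    return counts

# Hypothetical link pairs for illustration only.
links = [
    ("http://a.example/page", "http://b.example/"),
    ("http://c.example/", "http://b.example/news"),
    ("http://a.example/", "http://a.example/about"),  # internal link, not counted
]
print(inlink_counts(links).most_common(1))
```

In-degree is only the simplest centrality measure; the same link pairs could feed a graph library if more elaborate measures were wanted.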

9 HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? ALIVENESS – UPDATING (Niels)
› Domain names: number of new/inactive/disappeared domain names
› Updating: number of web objects that have changed since the last archiving

10 HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? CONTENT 1 (Niels)
Closedness
› How many websites are password protected?
File and software types
› Which file types are the most prevalent?
› Which software types are the most widespread?
Language
› Does the national language prevail, or do foreign languages?

11 HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? CONTENT 2 (Niels)
Textual elements on webpage
› Background color
› Most used fonts
› Length of webpages
› Placing of menu items (left-aligned and vertical, or top-aligned and horizontal)
Semantics
› Word frequencies
› Where specific issues or topics are to be found, and how they spread

12 METHODOLOGICAL CHALLENGES (Ditte)
The web of the past is gone
Possible solution: using (national) web archives
› DK: Legal Deposit law effective July 2005
› DK: web material within the ccTLD .dk and websites on other domains aimed at a Danish audience
› DK: 2015: approx. 1 million active domain names within the ccTLD .dk, 583 terabytes
No 1:1 relation between archive and the Danish web domain

13 METHODOLOGICAL CHALLENGES (Ditte)
No 1:1 relation between the Danish national archive and the Danish national web domain
› Not everything has been archived
› Unsystematic, no register, no original to compare with
› Archiving takes time, e.g. the link structure becomes inconsistent
› Deduplication may affect the subsequent use of the archived material
› Archiving strategies may be changed between two archivings
› Parts of domains may be harvested more than once

14 PARTS OF DOMAINS MAY BE HARVESTED MORE THAN ONCE (Ditte)
Slide footer: NetLab workshop on web archiving, 18 March 2015
Figure: a harvester (web crawler/spider) following URLs from a start URL within a domain (labels: 0, 1, 2, 3)

15 (Ditte)
Figure: domain A, domain B, domain C, … each with their URLs

16 METHODOLOGICAL CHALLENGES (Ditte)
› Main harvest: objects within a domain which have been harvested in the job to which the harvest of the domain was assigned
› By-harvest: objects within a domain which have been harvested in a job other than the one to which the harvest of the domain was assigned
Figure: three harvest jobs (JOB 1, JOB 2, JOB 3) containing main harvests (MH) of Domains A to F and by-harvests (BH) B1, B2 and D1
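The distinction between main harvest and by-harvest can be expressed as a small labelling step. The sketch below assumes that each crawl-log entry carries a job ID and a domain, and that a mapping from each domain to its assigned job is available; the tuple layout and the name `classify` are hypothetical and not the archive's actual format.

```python
def classify(entries, assigned_job):
    """Label each crawl-log entry as main harvest ('MH') or by-harvest ('BH').

    entries: iterable of (job_id, domain, url) tuples (hypothetical layout).
    assigned_job: dict mapping a domain to the job its harvest was assigned to.
    """
    labelled = []
    for job_id, domain, url in entries:
        label = "MH" if assigned_job.get(domain) == job_id else "BH"
        labelled.append((job_id, domain, url, label))
    return labelled

# Hypothetical data: domain B is assigned to job 1 but is also touched by job 2.
assigned_job = {"a.example": 1, "b.example": 1, "e.example": 2}
entries = [
    (1, "b.example", "http://b.example/"),
    (2, "e.example", "http://e.example/"),
    (2, "b.example", "http://b.example/logo.png"),
]
for row in classify(entries, assigned_job):
    print(row)
```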

17 SOLUTIONS (Ditte)
Not to use the archive after all
› Use the registry of .dk domains
Corpus creation
› Selection of harvests
› Selection of one version of each domain (consisting of the main harvest and possibly by-harvests)

18 REGISTRY OF .DK DOMAINS (Ditte)
Size and aliveness – 2006, 2009, 2012, 2015
› What is the total number of domain names over time?
› How many domain names have disappeared compared to the previous years? (and which ones)
› How many domain names have been created compared to the previous year? (and which ones)
› How many domain names have changed hands compared to the previous years? (and which ones)
› What is the relationship between ownership and domains over time? (cf. long tail)
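The questions above amount to comparing two registry snapshots. A minimal sketch, assuming each snapshot has been loaded into a dictionary from domain name to owner (the function name `compare_snapshots` and the sample data are hypothetical):

```python
def compare_snapshots(old, new):
    """Compare two registry snapshots given as dicts of domain -> owner."""
    old_names, new_names = set(old), set(new)
    return {
        "created": new_names - old_names,
        "disappeared": old_names - new_names,
        "changed_hands": {d for d in old_names & new_names if old[d] != new[d]},
    }

# Hypothetical snapshots for illustration only.
snapshot_2012 = {"a.dk": "Owner 1", "b.dk": "Owner 2", "c.dk": "Owner 3"}
snapshot_2015 = {"a.dk": "Owner 1", "b.dk": "Owner 9", "d.dk": "Owner 4"}
print(compare_snapshots(snapshot_2012, snapshot_2015))
```

Returning the sets rather than only their sizes also answers the "and which ones" part of each question.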

19 RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
Chart: Number of domain names over time

20 RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
Chart: New and disappearing domain names from 2005 to 2015

21 RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
Number of domain names which have changed hands over time
› In 2015, 14% of the domains from 2012 had changed owner name
› Both in 2012 and in 2015, just under 10% of the total number of owners owned 50% of the Danish domains
› An observation: if you own more than three domains, you are part of the top 10% of domain owners

Year   Domains     Owners    Anonymous
2012   1,163,250   513,326   46,727
2015   1,277,035   549,978   58,710

22 RESULTS: REGISTRY OF .DK DOMAINS (Ditte)
Relationship of ownership and domain names over time. Anonymous registrants removed.
The chart shows 2012; there is no visual difference between 2012 and 2015.

Parameter   2012    2015
Max         3422    3786
Mean        2.175   2.215
Median      1       1
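The three parameters can be derived from a mapping of domain names to owners, as in the sketch below. The second function is a consistency check resting on an assumption about how the figures relate: the reported means match the number of non-anonymous domains divided by the number of owners, using the counts from the table on the previous slide. The function names and the toy registry are hypothetical.

```python
from collections import Counter
from statistics import mean, median

def ownership_stats(owner_by_domain):
    """Max, mean and median number of domains per owner (dict of domain -> owner)."""
    counts = list(Counter(owner_by_domain.values()).values())
    return {"max": max(counts), "mean": mean(counts), "median": median(counts)}

def mean_domains_per_owner(domains, owners, anonymous):
    """Assumed relation: (domains - anonymous domains) / owners, rounded to 3 decimals."""
    return round((domains - anonymous) / owners, 3)

# Toy registry for illustration only.
print(ownership_stats({"a.dk": "x", "b.dk": "x", "c.dk": "y"}))

# Counts from the table on the previous slide.
print(2012, mean_domains_per_owner(1163250, 513326, 46727))
print(2015, mean_domains_per_owner(1277035, 549978, 58710))
```

A median of 1 alongside a maximum in the thousands is what the long-tail remark refers to: most owners hold a single domain while a few hold very many.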

23 PRE/POST-STEPS: REGISTRY OF .DK DOMAINS (Ditte)
Pre-steps
› DK Hostmaster has shifted from ISO-8859 to UTF-8
› Earlier attempts at handling the data assumed space-separated data sets when in fact they are fixed-width fields
› Data from DK Hostmaster contains dirt, e.g. tab characters and, in one year, some sort of header
Post-steps
› Same questions on several years (all years, up to four times a year)
› Further investigation of which domains have disappeared
› New questions emerged in the process
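The pre-steps could look roughly like the sketch below: decode each raw record with the older single-byte encoding, neutralise stray tab characters, and slice by column position instead of splitting on spaces. The column offsets in `FIELDS` are invented for illustration, since the slides do not give the real layout, and ISO-8859-1 is assumed as the specific older encoding.

```python
# Hypothetical column layout: (field name, start offset, end offset).
# The real DK Hostmaster field widths are not given in the slides.
FIELDS = [("domain", 0, 30), ("owner", 30, 60)]

def parse_line(raw, encoding="iso-8859-1"):
    """Decode one raw record and slice it into fixed-width fields."""
    # A tab is replaced by a single space so that column offsets are preserved.
    text = raw.decode(encoding).replace("\t", " ").rstrip("\r\n")
    return {name: text[start:end].strip() for name, start, end in FIELDS}

# Hypothetical record: the owner name contains spaces, which is why
# splitting on whitespace would break it into several pieces.
raw = ("eksempel.dk".ljust(30) + "Søren Jensen A/S".ljust(30) + "\n").encode("iso-8859-1")
print(parse_line(raw))
```

Files from after the encoding shift would be read the same way with `encoding="utf-8"`.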

24 CORPUS CREATION (Niels)
Collaboration between researchers, curators, developers and management at the archive
› How is a broad crawl performed? i.e. several ”steps”
› When were broad crawls performed?
› How to find the most complete version of a domain within a certain timespan within a broad crawl?
› What do we mean when we talk about a ”web element”, a ”web page”, a ”version” etc.?
› What could a corpus creation algorithm look like?
› How many resources are needed to test and implement the creation of a corpus?

25 CORPUS CREATION (Niels)
Use of broad crawls
› Internationally recognized as a suitable web harvesting strategy for national archives
› 2-4 broad crawls each year of all domains from .dk as well as Danish websites published under other extensions
› Comprehensive in nature and consistent over time

26 CORPUS CREATION (Niels)
Selection of broad crawls
› Four broad crawls, one from each of the years 2006, 2009, 2012 and 2015 (first crawl of the year)
Figure: timeline marking 2006, 2009, 2012, 2015

27 CORPUS CREATION (Niels)
Selection of one harvested version of each domain
› Domain version from ‘main harvest’
› Inclusion of unique materials from the ‘by-harvest’ if the material is within our selected time span
Figure: three harvest jobs (JOB 1, JOB 2, JOB 3) containing main harvests (MH) of Domains A to F and by-harvests (BH) B1, B2 and D1

28 CORPUS CREATION (Niels)
Test of the algorithm
› Tested on the first broad crawl from January 2006 (1 TB, only websites <10 MB)
› This harvest consists of 127 jobs
› Each job consists of several domains
› We produce an 18 GB crawl log enhanced with job IDs

29 CORPUS CREATION (Niels)
Test of the algorithm
› Using IBM BigInsights we can run the algorithm on this large spreadsheet
› The algorithm locates the objects that are not included in a main harvest (‘by-harvests’)
› There might be duplicates; in these cases, the algorithm identifies and selects the objects closest to the time of the main harvest
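The duplicate-handling step can be illustrated in plain Python. This is a sketch of the selection rule only, not the BigInsights implementation; it assumes each by-harvest object is identified by its URL and carries a capture timestamp, and the name `select_by_harvest` is hypothetical.

```python
from datetime import datetime

def select_by_harvest(candidates, main_harvest_time):
    """For each URL, keep the by-harvest capture closest in time to the main harvest.

    candidates: iterable of (url, timestamp) pairs, possibly with duplicate URLs.
    Returns a dict mapping each URL to the timestamp of the selected capture.
    """
    best = {}
    for url, ts in candidates:
        if url not in best or abs(ts - main_harvest_time) < abs(best[url] - main_harvest_time):
            best[url] = ts
    return best

# Hypothetical timestamps for illustration only.
main_time = datetime(2006, 1, 10, 12, 0)
candidates = [
    ("http://b.example/logo.png", datetime(2006, 1, 20, 9, 0)),
    ("http://b.example/logo.png", datetime(2006, 1, 11, 9, 0)),
    ("http://b.example/style.css", datetime(2006, 1, 5, 9, 0)),
]
print(select_by_harvest(candidates, main_time))
```

When two captures are equally close, this sketch keeps the one seen first; a real implementation would need an explicit tie-breaking rule.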

30 NEXT STEPS (Niels)
From test to implementation
› How to get from crawl logs to the material that the crawl logs refer to and that we want to analyze? Should WARC files be opened? Should a subset of an index be used?
› Start making some of the analyses
Dissemination and networking
› Book chapters and papers
› An open workshop in Aarhus, Denmark in 2016 for other national web archives and scholars wanting to do similar projects, aiming at establishing transnational ‘best practice’ and analytical design


