Digital recordkeeping and preservation II

1 Digital recordkeeping and preservation II
What digital means ARK2200 Digital recordkeeping and preservation II 2017 Thomas Sødring P48-R407

2 Because I have to explain this
<?xml version="1.0" encoding="UTF-8" ?> This is difficult to explain without explaining everything we will go through

3 What does digital mean? When we say that something is digital, we mean that we are expressing an observation of a real-world phenomenon as a specific (digital) value

4 Digitalisation of the real world
In order to get something from the real world into a computer we need some kind of input device. A keyboard generates a binary number for each key that is pressed. A camera 'captures' light and converts it to binary data using an Analogue to Digital Converter (ADC). A microphone 'captures' sound and converts it to a binary number using an Analogue to Digital Converter (ADC).

5 Digitising A computer needs to be able to store these captured values in some way. It has to assign the recorded information to some kind of value system / language that the computer uses. Computers are very basic: the only values they can work with are 0 and 1.

6 Number systems The decimal number system has ten possible digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Any number expressed in the decimal system is a combination of these ten digits. The binary number system has two possible digits: 0, 1.

7 The binary number system
The binary number system works essentially in the same way as the decimal number system, but only uses two digits. Each number expressed in the binary number system is a combination of the two digits 0 and 1. The binary number system has been essential in the development of information technology. Electronic circuits typically only have two possible states, on or off. Think of a light switch where the switch gives you two choices: the light is either on or off.

8 Morse code

9 SOS · · · _ _ _ · · ·

10 So far ... Trying to understand the nature of something being digital
We have made a note of the binary number system as opposed to the decimal number system. Now we look at the idea that we only have a limited number of positions (“boxes”) in which to write a number. Let's say we have one “box” we can use for a binary number: it can hold either 0 or 1. What if we have two “boxes”?

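11 A two-bit system
With two “boxes” we can express four different messages, one for each combination of 0 and 1: Driving straight ahead. Driving right. Driving left. Parking here for two minutes! We call this a two-bit system.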
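The idea of "boxes" can be tried out in a few lines of Python. This is only a sketch: the function name `bit_patterns` is made up for illustration and is not part of the lecture.

```python
from itertools import product


def bit_patterns(n_bits):
    """Return every pattern of 0s and 1s that fits in n_bits boxes."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]


# One box gives two patterns, two boxes give four patterns.
print(bit_patterns(1))
print(bit_patterns(2))

# Every extra box doubles the number of patterns: 2 ** n_bits in total.
for n in range(1, 9):
    print(n, "boxes:", len(bit_patterns(n)), "patterns")
```

Each of the four two-box patterns could be given its own meaning, which is all a two-bit system is.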

12 How do we use the binary system
Humans do not work well with the binary system, so we often have to convert between the binary and decimal systems. This is easily done if we understand the powers of two: 2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16, 2^5 = 32, 2^6 = 64, 2^7 = 128. Place values of an 8-bit number, from left to right: 128 64 32 16 8 4 2 1

13 From binary to decimal
Place values: 2 1
0 0: (2*0) + (1*0) = 0
0 1: (2*0) + (1*1) = 1
1 0: (2*1) + (1*0) = 2
1 1: (2*1) + (1*1) = 3

14 From binary to decimal
Place values: 4 2 1
0 0 0: (4*0) + (2*0) + (1*0) = 0
0 0 1: (4*0) + (2*0) + (1*1) = 1
0 1 0: (4*0) + (2*1) + (1*0) = 2
0 1 1: (4*0) + (2*1) + (1*1) = 3
1 0 0: (4*1) + (2*0) + (1*0) = 4
1 0 1: (4*1) + (2*0) + (1*1) = 5
1 1 0: (4*1) + (2*1) + (1*0) = 6
1 1 1: (4*1) + (2*1) + (1*1) = 7
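The same place-value sums can be written as a small Python function. This is a minimal sketch; the name `binary_to_decimal` is an assumption made for this example, and Python's built-in `int(text, 2)` does the same job.

```python
def binary_to_decimal(bits):
    """Add up the place value of every position that holds a 1."""
    total = 0
    place_value = 1  # the rightmost position is worth 1
    for digit in reversed(bits):
        total += place_value * int(digit)
        place_value *= 2  # each step to the left is worth twice as much
    return total


for bits in ["000", "001", "010", "011", "100", "101", "110", "111"]:
    print(bits, "=", binary_to_decimal(bits))

# The built-in conversion gives the same answer.
print(binary_to_decimal("111") == int("111", 2))
```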

15 [Table: the values that can be counted in a 1-bit system (0 to 1), a 2-bit system (0 to 3), a 3-bit system (0 to 7) and a 4-bit system (0 to 15)]

16 There are 10 types of people in this world,
those who understand binary and those who don't

17 A byte 8 bits is called a byte
A byte is the basic unit when talking about the storage size of digital information. An English character normally requires 8 bits or 1 byte of storage. Place values: 128 64 32 16 8 4 2 1. A = 01000001bin = 65dec. B = 01000010bin = 66dec. C = 01000011bin = 67dec.
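To see the byte behind a character, Python's built-in `ord` and `format` can be used. This is a small sketch; `char_to_bits` is a made-up helper name, and it only handles characters whose code fits in one byte.

```python
def char_to_bits(character):
    """Return the character's code as an 8-digit binary string."""
    code = ord(character)  # the number the character is mapped to
    if code > 255:
        raise ValueError("code does not fit in a single byte")
    return format(code, "08b")  # pad with leading zeros to 8 bits


for letter in "ABC":
    print(letter, "=", char_to_bits(letter) + "bin", "=", str(ord(letter)) + "dec")
```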

18 Number systems We are used to the decimal system
0, 1, 2, 3, 4, 5, 6, 7, 8, 9. And now we have looked at the binary system: 0, 1. Another important number system is the hexadecimal system. It has 16 digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. The hexadecimal system is often just called HEX. We use it by converting a single 8-bit binary number into two HEX digits.

19 HEX It is inconvenient for humans to process binary numbers, e.g. 11001111bin. It is also inconvenient if we constantly have to convert back and forth between the binary and decimal systems (11001111bin or 207dec). HEX is an easy way to shorten an 8-bit binary number to a 2-character HEX number.

20 0000bin = 0dec or 0hex
0001bin = 1dec or 1hex
0010bin = 2dec or 2hex
0011bin = 3dec or 3hex
0100bin = 4dec or 4hex
0101bin = 5dec or 5hex
0110bin = 6dec or 6hex
0111bin = 7dec or 7hex
1000bin = 8dec or 8hex
1001bin = 9dec or 9hex
1010bin = 10dec or Ahex
1011bin = 11dec or Bhex
1100bin = 12dec or Chex
1101bin = 13dec or Dhex
1110bin = 14dec or Ehex
1111bin = 15dec or Fhex

21 HEX examples
00000101bin == 005dec == 05hex
10000000bin == 128dec == 80hex
10001000bin == 136dec == 88hex
11001111bin == 207dec == CFhex
11111111bin == 255dec == FFhex
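Converting a byte to HEX is a matter of splitting the 8 bits into two groups of 4 and looking up one HEX digit for each group. The sketch below illustrates this; `byte_to_hex` and `HEX_DIGITS` are names invented for the example.

```python
HEX_DIGITS = "0123456789ABCDEF"


def byte_to_hex(bits):
    """Turn an 8-digit binary string into a 2-character HEX string."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly eight 0/1 digits")
    high, low = bits[:4], bits[4:]  # two groups of four bits
    return HEX_DIGITS[int(high, 2)] + HEX_DIGITS[int(low, 2)]


for bits in ["00000101", "10001000", "11001111", "11111111"]:
    print(bits + "bin", "==", str(int(bits, 2)) + "dec", "==", byte_to_hex(bits) + "hex")
```

The largest value a single byte can hold is 11111111bin, which is 255dec or FFhex.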

22 Do I really need to know this
In many ways this is the domain of programmers. As a records manager (RM) or archivist you will most likely not work with this. But this gives you some background understanding of HEX numbers. You probably have to know what they are, and it might help you understand character sets and associated problems. These you need to know.

23 HEX file example

24 Character sets A character set is the set of characters that we can both write and read on a computer, consisting of letters, numbers, punctuation and control characters. A character set is a mapping of characters (letters and numbers) to binary numbers that a computer can interpret. We need a character set so that electronic devices can exchange and understand data. The letter 'a' is mapped to 01100001bin (97dec, 61hex). The letter 'b' is mapped to 01100010bin (98dec, 62hex).

25 ASCII Developed by ANSI (American National Standards Institute)
Defined in ANSI document X 7-bit code, 2^7 = 128 different codes (0-127). The eighth bit is unused. Two general types of characters: 95 characters we can see on a screen, and 33 control characters (control functions for screen / printer or communications). Represents the Latin alphabet, Arabic numerals and standard punctuation. Plus a small set of accents and other special European characters.
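The split into 33 control characters and 95 visible characters can be checked by counting. This sketch assumes the usual convention that codes 0-31 and 127 are control characters and that the space (code 32) counts among the 95 characters we can see; the list names are invented for the example.

```python
# Codes 0-31 and 127 are control characters, 32-126 are printable.
CONTROL_CODES = [code for code in range(128) if code < 32 or code == 127]
PRINTABLE_CODES = [code for code in range(128) if 32 <= code <= 126]

print(len(CONTROL_CODES), "control characters")
print(len(PRINTABLE_CODES), "printable characters")
print(len(CONTROL_CODES) + len(PRINTABLE_CODES), "codes in total")

# chr() turns a code back into its character.
print(repr(chr(65)), repr(chr(97)), repr(chr(48)))
```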

26 ASCII Where are: Control characters? Dec (0-31 + 127)
Alphabetic characters? Dec (65-90, 97-122) Numerical characters? Dec (48-57) Punctuation characters?

27 ASCII Control characters

28 ASCII Control characters
CR (13, ^M) carriage return. HT (9, ^I) horizontal tab. LF (10, ^J) linefeed. DEL (127, ALT-127) delete. FF (12, ^L) form feed (new page). BS (8, ^H) backspace. ESC (27, ^[) escape.

29 ASCII alphabet/numbers characters
52 letters, 10 numbers

30 ASCII Punctuation codes

31 ASCII The use of 7-bits in ASCII was extended to 8 bits
Known as Latin-1, standardised as ISO 8859-1. This allowed for a mapping of 256 characters (2^8 = 256), and most European characters including æ, ø and å are included here. There are a number of different character sets defined under ISO 8859, e.g. ISO 8859-2, Latin-2, for Central European languages. For simplicity the first 128 characters (codes 0-127) will always correspond to the old ASCII standard. But this has created some problems, especially in the programming world.
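The effect of the eighth bit can be seen with Python's built-in `latin-1` and `ascii` codecs. This is only an illustrative sketch; the variable names are invented for the example.

```python
text = "æøå"

# In Latin-1 each of these characters fits in exactly one byte.
latin1_bytes = text.encode("latin-1")
print([format(byte, "02X") for byte in latin1_bytes])

# 7-bit ASCII has no code for them, so encoding fails.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print("fits in ASCII:", fits_in_ascii)

# Plain English text gives the same bytes in both character sets.
print("Archive".encode("latin-1") == "Archive".encode("ascii"))
```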

32 From ASCII to Unicode 8 bits (256 characters) is not enough to represent all the characters of the world. Unicode was developed to solve this problem. Unicode combines many of the various character sets into a single character set, which is stored using a Unicode Transformation Format (UTF). Unicode provides a unique number for every character, independent of platform and programs. Remember ASCII: A is 0x41. In LibreOffice, use [CTRL]+[SHIFT]+u and the code.

33 Unicode As Unicode represents more than 256 characters in one character set, 8 bits is insufficient. The use of Unicode has an effect on the size of files: they will be bigger. But it solves an interoperability problem. There are various Unicode implementations: UTF-8, UTF-16, UTF-32. A common misconception is that the number in the UTF name is the number of bits used per character. UTF-8 is not an 8-bit character set. UTF-16 is not a 16-bit character set.

34 UTF-8 UTF-8 is a character set that uses between 1 and 4 bytes (8 to 32 bits) to represent a character. The original ASCII characters (0-127) are always stored in a single byte, and that byte always begins with a 0 bit. UTF-8 is therefore compatible with ASCII.
[Figure: bit layout of Byte 1, Byte 2, Byte 3 and Byte 4, each with place values 128 64 32 16 8 4 2 1, for code ranges starting at 0-127 and ending at 65536 and above]
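The bytes that UTF-8 produces can be printed bit by bit. This is a small sketch using Python's built-in `utf-8` codec; `utf8_bits` is a made-up helper name and the sample characters are chosen only for illustration.

```python
def utf8_bits(character):
    """Return the UTF-8 bytes of a character as 8-digit binary strings."""
    return [format(byte, "08b") for byte in character.encode("utf-8")]


# An ASCII character is one byte and that byte starts with 0.
print("A", utf8_bits("A"))

# A character outside ASCII needs more bytes: the first byte starts
# with 11 and every following byte starts with 10.
print("ø", utf8_bits("ø"))
```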

35 UTF-8 UTF-8 uses a variable length per character (8-32 bits)
This can lead to decreased speed in systems that process UTF-8 text, because you constantly have to calculate where one character ends and the next begins. It also has a significant effect on file size. 10,000 characters * 1 byte (10,000 bytes) with English characters. 10,000 characters * 2 bytes (20,000 bytes) with Norwegian characters. 10,000 characters * 3 bytes (30,000 bytes) with some international characters. 10,000 characters * 4 bytes (40,000 bytes) with some international characters.
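The file-size arithmetic can be reproduced by encoding 10,000 copies of one character and counting the bytes. The sample characters below are assumptions picked for this sketch (one that needs 1 byte, one that needs 2, and so on), and `utf8_size` is a made-up helper name.

```python
def utf8_size(text):
    """Return how many bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


samples = {
    "English letter": "a",
    "Norwegian letter": "ø",
    "Euro sign": "\u20ac",
    "Emoji": "\U0001F600",
}

for label, character in samples.items():
    print(label, "->", utf8_size(character * 10_000), "bytes for 10,000 characters")
```

A real Norwegian text mixes 1-byte and 2-byte characters, so its size would land somewhere between the first two figures.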

36 ISO and Unicode ISO and the Unicode consortium tried to develop their own standards at the same time, but in the end Unicode 'won'. The ISO standard was known as UCS (Universal Character Set). UTF-16: variable length (2 or 4 bytes). It uses a minimum of 2 bytes and as such is not backwards compatible with ASCII. UTF-16 is almost the same as UCS-2. UTF-32: fixed length of 4 bytes, faster processing. English files will be 4 times larger than ASCII files. UTF-32 is almost the same as UCS-4.
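The difference between the three encodings shows up when the same characters are encoded with each of them. This is a sketch using Python's built-in codecs; the little-endian variants (`utf-16-le`, `utf-32-le`) are used here only so that no byte order mark is added to the count, and `encoded_sizes` is a made-up helper name.

```python
def encoded_sizes(character):
    """Return the byte count of one character in UTF-8, UTF-16 and UTF-32."""
    return {
        "UTF-8": len(character.encode("utf-8")),
        "UTF-16": len(character.encode("utf-16-le")),
        "UTF-32": len(character.encode("utf-32-le")),
    }


print("A", encoded_sizes("A"))
print("ø", encoded_sizes("ø"))
print("emoji", encoded_sizes("\U0001F600"))
```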

37 ISO-8859-? ISO 8859-1 or Latin 1 Part 1
Covers most Western European languages: Danish*, Dutch*, English, Faroese, Finnish*, French*, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Romansh, Scottish Gaelic, Spanish and Swedish. ISO 8859-4 or Latin 1 Part 4 covers Estonian, Latvian, Lithuanian, Greenlandic and Sami. When transferring documents to an archive, use Latin 1 Part 1 for Norwegian text and Latin 1 Part 4 for Sami text. *missing some characters

38 Latin 1 : Part 1 and Part 4

39 Do I really need to know this
As a digital archivist: yes! This is extremely important to understand. The wrong use of character sets can result in corruption problems when processing data. These are very expensive to fix, and a fix may not be possible. Standardisation with Unicode fixes a lot of this. But there are many archives that have electronic data in older character sets, e.g. EBCDIC, and there may even be some electronic data without an associated character set.
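The kind of corruption meant here is easy to provoke: store text in one character set and read it back with another. The sketch below uses a Norwegian sample word chosen for illustration and Python's built-in `utf-8` and `latin-1` codecs.

```python
original = "blåbærsyltetøy"

# The text is stored as UTF-8 ...
stored_bytes = original.encode("utf-8")

# ... but later read as if it were Latin-1. Every Norwegian letter
# turns into two wrong characters.
misread = stored_bytes.decode("latin-1")
print(misread)

# If we know which mistake was made, it can be reversed.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The repair only works because the wrong guess is known. Without a record of the character set that was actually used, the original text may not be recoverable.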

40 Unicode could lead to homograph problems
The Russian word 'raural' (using Cyrillic characters) looks like paypal when displayed using Unicode. It is hard to say if this could be a problem for archives, but it is something to be aware of.

41 Unicode could lead to homograph problems
But it turns out that this is a problem after all. In 2014 digi.no published an article about this, and Google has blogged about it.
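A homograph can be exposed by asking for the Unicode name of each character. The sketch below builds a lookalike of "paypal" from Cyrillic letters; the code points are written as escapes so that the difference is visible in the source, and the variable names are invented for the example.

```python
import unicodedata

latin_word = "paypal"
# Cyrillic er, a, u, er, a followed by a Latin l.
lookalike = "\u0440\u0430\u0443\u0440\u0430l"

print(latin_word, lookalike, latin_word == lookalike)

# The names reveal which alphabet each character belongs to.
for character in lookalike:
    print(format(ord(character), "04X"), unicodedata.name(character))
```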

42 Summary Character sets can be a problem for an archive but need not be
In Norway we have used Latin 1, Part 1. National Archives rules: Latin 1, Part 1 is used for Norwegian and Latin 1, Part 4 is used for Sami. In a wider European context (Europeana) it becomes more important from an interoperability standpoint. Globally, it is very important. In short: ASCII (7-bit), ASCII (8-bit), Unicode.

43 Because I have to explain this?
<?xml version="1.0" encoding="UTF-8" ?> This is difficult to explain without explaining everything we just went through

