Introduction to File Structures

1 Introduction to File Structures
CENG 351

2 File Structures: What is it about?
So far we have talked about database tables. We now move from tables to file structures, which deal with:
- Storage of data
- Organization of data
- Access to data
- Processing of data

3 Where do File Structures fit in Computer Science?
Layers, as listed on the slide: Application, DBMS, File system, Operating System, Hardware.

4 Computer Architecture
Main memory (RAM): semiconductors; fast, expensive, volatile, small. Data is manipulated here.
Secondary storage: disks, tape; slow, cheap, stable, large. Data is stored here.
Data is transferred between main memory and secondary storage.

5 Primary vs. Secondary Storage
Primary: fast (+); small capacity (-), since many databases are too large to fit in main memory; volatile (-).
Secondary: slow (-); large capacity (+) and cheaper; non-volatile (+).

6 How fast is main memory?
Typical time for getting info from:
Main memory: ~120 nanosec = 120 x 10^-9 sec
Magnetic disks: ~30 millisec = 30 x 10^-3 sec
With these figures, a disk access is roughly 250,000 times slower than a main memory access.

7 Normal Arrangement
Secondary storage (SS) provides reliable, long-term storage for large volumes of data. At any given time, we are usually interested in only a small portion of the data. This data is loaded temporarily into main memory (MM), where it can be rapidly manipulated and processed. As our interests shift, data is transferred automatically between MM and SS, so the data we are focused on is always in MM.

8 Goal of the file structures
Minimize the number of trips to the disk needed to get the desired information. Group related information so that we are likely to get everything we need with only one trip to the disk.

9 Physical Files and Logical Files
Physical file: a collection of bytes stored on a disk or tape.
Logical file: a "channel" (like a telephone line) that connects the program to a physical file.
The program (application) sends bytes to, or receives bytes from, a file through the logical file. The program knows nothing about where the bytes go or where they come from.
The operating system is responsible for associating a logical file in a program with a physical file on disk or tape. Writing to or reading from a file in a program is done through the operating system.

10 Files
The physical file has a name, for instance myfile.txt.
The logical file has a logical name (a variable) inside the program.
In C: FILE * outfile;
In C++: fstream outfile;

11 Basic File Processing Operations
- Opening
- Closing
- Reading
- Writing
- Seeking
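As a minimal sketch of the five operations, assuming standard C++ file streams: the stream variables are the logical files and the string passed in names the physical file. The function name `write_then_read_at` and the file name used in the usage note are hypothetical, chosen only for illustration.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Writes `text` to the physical file `physical_name`, then reopens the file,
// seeks to byte `offset`, and returns the character stored there.
// Returns '\0' if the file cannot be opened or the offset is past the end.
char write_then_read_at(const std::string& physical_name,
                        const std::string& text,
                        std::streamoff offset) {
    std::ofstream outfile(physical_name, std::ios::binary);  // opening
    if (!outfile) return '\0';
    outfile << text;                                         // writing
    outfile.close();                                         // closing

    std::ifstream infile(physical_name, std::ios::binary);   // opening
    if (!infile) return '\0';
    infile.seekg(offset, std::ios::beg);                     // seeking
    char ch = '\0';
    infile.get(ch);                                          // reading
    infile.close();                                          // closing
    return ch;
}
```

For example, `write_then_read_at("ceng351_demo.txt", "hello", 1)` writes five bytes and then reads back the byte at offset 1. The program only ever touches `outfile` and `infile`; where the bytes physically land is left to the operating system.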

12 File Systems
Stored data is organized into files. Files are organized into records. Records are organized into fields.

13 Example
A student file may be a collection of student records, one record for each student. Each student record may have several fields, such as:
- Name
- Address
- Student number
- Gender
- Age
- GPA
Typically, each record in a file has the same fields.

14 Properties of Files
Persistence: Data written into a file persists after the program stops, so the data can be used later.
Sharability: Data stored in files can be shared by many programs and users simultaneously.
Size: Data files can be very large. Typically, they cannot fit into main memory.

15 Secondary Storage Devices

16 Secondary Storage Devices
Two major types of storage devices:
- Direct Access Storage Devices (DASDs)
  - Magnetic disks
    - Hard disks (high capacity, low cost per bit)
    - Floppy disks (low capacity, slow, cheap)
  - Optical disks
    - CD-ROM (compact disc, read-only memory)
    - DVD
- Serial devices
  - Magnetic tapes (very fast sequential access)

17 Magnetic Disks
Bits of data (0’s and 1’s) are stored on circular magnetic platters called disks. A disk rotates rapidly (and never stops). A disk head reads and writes bits of data as they pass under the head. Often, several platters are organized into a disk pack (or disk drive).

18 Top view of a 36 GB, 10,000 RPM, IBM SCSI server hard disk, with its top cover removed. Note the height of the drive and the 10 stacked platters. (The IBM Ultrastar 36ZX.)

20 Components of a Disk
(Figure labels: spindle, tracks, disk head, sector, platters, arm assembly, arm movement.)

21 Surface of disk showing tracks and sectors
(Figure: looking at a surface; labels: tracks, sector.)

22 Organization of Disks
A disk contains concentric tracks. Tracks are divided into sectors. A sector is the smallest addressable unit on a disk. Sectors are addressed by:
- surface #
- cylinder (track) #
- sector #

23 Accessing Data
When a program reads a byte from the disk, the operating system locates the surface, track and sector containing that byte, and reads the entire sector into a special area in main memory called a buffer. The bottleneck of a disk access is moving the read/write arm, so it makes sense to store a file in tracks that are below/above each other on different surfaces, rather than in several tracks on the same surface.

24 Cylinders
A cylinder is the set of tracks at a given radius of a disk pack, i.e. the set of tracks that can be accessed without moving the disk arm. All the information on a cylinder can be accessed without moving the read/write arm.

25 Cylinders

26 Estimating Capacities
Track capacity = # of sectors/track * bytes/sector
Cylinder capacity = # of tracks/cylinder * track capacity
Drive capacity = # of cylinders * cylinder capacity
Number of cylinders = # of tracks on a surface
Knowing these relationships allows us to compute the amount of disk space a file is likely to require.

27 Exercise
Store a file of records on a disk with the following characteristics:
- # of bytes per sector = 512
- # of sectors per track = 40
- # of tracks per cylinder = 12
- # of cylinders = 1331
Q1. How many cylinders does the file require if each data record requires 256 bytes?
Q2. What is the total capacity of the disk?
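The capacity relationships can be sketched in code as follows. This is an illustration under stated assumptions, not the course's solution: the struct and function names are hypothetical, and `cylinders_needed` assumes that records never span sectors and that the file starts at the beginning of a cylinder. The number of records in the file is not given in this transcript, so it is left as a parameter.

```cpp
#include <cstdint>

// Hypothetical description of a disk's geometry.
struct DiskGeometry {
    std::uint64_t bytes_per_sector;
    std::uint64_t sectors_per_track;
    std::uint64_t tracks_per_cylinder;
    std::uint64_t cylinders;
};

// Track capacity = sectors/track * bytes/sector
std::uint64_t track_capacity(const DiskGeometry& d) {
    return d.sectors_per_track * d.bytes_per_sector;
}

// Cylinder capacity = tracks/cylinder * track capacity
std::uint64_t cylinder_capacity(const DiskGeometry& d) {
    return d.tracks_per_cylinder * track_capacity(d);
}

// Drive capacity = cylinders * cylinder capacity
std::uint64_t drive_capacity(const DiskGeometry& d) {
    return d.cylinders * cylinder_capacity(d);
}

// Whole cylinders needed for n_records records of record_size bytes,
// assuming records do not span sectors. Returns 0 if a record does not
// fit in one sector (that case is not modeled here).
std::uint64_t cylinders_needed(const DiskGeometry& d,
                               std::uint64_t n_records,
                               std::uint64_t record_size) {
    if (record_size == 0 || record_size > d.bytes_per_sector) return 0;
    std::uint64_t records_per_sector = d.bytes_per_sector / record_size;
    std::uint64_t sectors =
        (n_records + records_per_sector - 1) / records_per_sector;
    std::uint64_t sectors_per_cylinder =
        d.sectors_per_track * d.tracks_per_cylinder;
    return (sectors + sectors_per_cylinder - 1) / sectors_per_cylinder;
}
```

With the exercise's geometry (`DiskGeometry d{512, 40, 12, 1331}`), `drive_capacity(d)` gives 327,106,560 bytes, which answers Q2. Q1 follows from `cylinders_needed(d, n, 256)` once the record count `n` is known.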

28 Clusters
Another view of sector organization is the one maintained by the O.S.’s file manager. It views the file as a series of clusters of sectors. The file manager uses a file allocation table (FAT) to map logical sectors of the file to the physical clusters.

29 Extents
If there is a lot of room on a disk, it may be possible to make a file consist entirely of contiguous clusters. Then we say that the file is one extent (very good for sequential processing). If there isn’t enough contiguous space available to contain an entire file, the file is divided into two or more noncontiguous parts. Each part is an extent.

30 Fragmentation
Internal fragmentation: loss of space within a sector or a cluster.
- Due to records not fitting exactly in a sector: e.g. sector size is 512 bytes and record size is 300 bytes. Either store one record per sector, or allow records to span sectors.
- Due to the use of clusters: if the file size is not a multiple of the cluster size, then the last cluster will be only partially used.
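Both sources of internal fragmentation reduce to a remainder computation. The sketch below uses hypothetical function names and assumes, for the first case, that records are not allowed to span sectors.

```cpp
#include <cstddef>

// Bytes left unused in each sector when records may not span sectors.
// Returns 0 if a record does not fit in a sector (not modeled here).
std::size_t wasted_per_sector(std::size_t sector_size,
                              std::size_t record_size) {
    if (record_size == 0 || record_size > sector_size) return 0;
    return sector_size % record_size;
}

// Bytes left unused in the last cluster of a file.
std::size_t wasted_in_last_cluster(std::size_t file_size,
                                   std::size_t cluster_size) {
    if (cluster_size == 0) return 0;
    std::size_t remainder = file_size % cluster_size;
    return remainder == 0 ? 0 : cluster_size - remainder;
}
```

For the slide's example, `wasted_per_sector(512, 300)` is 212 bytes, so about 41% of every sector is lost if only one record is stored per sector.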

31 The Cost of a Disk Access
The time to access a sector in a track on a surface is divided into 3 components:
Seek time: time to move the read/write arm to the correct cylinder.
Rotational delay (or latency): time it takes for the disk to rotate so that the desired sector is under the read/write head.
Transfer time: once the read/write head is positioned over the data, the time it takes to transfer the data.

32 Seek time
Seek time is the time required to move the arm to the correct cylinder. It is the largest cost component. Typically:
- 5 ms (milliseconds) to move from one track to the next (track-to-track)
- 50 ms maximum (from inside track to outside track)
- 30 ms average (from one random track to another random track)

33 Average Seek Time (s)
Since it is usually impossible to know exactly how many tracks will be traversed in every seek, we usually try to determine the average seek time (s) required for a particular file operation. If the starting and ending positions for each access are random, it turns out that the average seek traverses one third of the total number of cylinders. Manufacturers’ specifications for disk drives often list this figure as the average seek time for the drives. Most hard disks today have s of less than 10 ms, and high-performance disks have s as low as 7.5 ms.

34 Latency (rotational delay)
Latency is the time needed for the disk to rotate so that the sector we want is under the read/write head. Hard disks usually rotate at about 5000 rpm, which is one revolution per 12 msec.
Note:
Min latency = 0
Max latency = time for one disk revolution
Average latency (r) = (min + max) / 2 = max / 2 = time for ½ disk revolution
Typically 6 – 8 ms average

35 Transfer Time
Transfer time is the time for the read/write head to pass over a block. It is given by the formula:
Transfer time = (number of bytes transferred / number of bytes on a track) x rotation time
e.g. if there are 63 sectors per track, the time to transfer one sector would be 1/63 of a revolution.
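The latency and transfer-time formulas can be written as small helper functions. This is a sketch with hypothetical names; all times are in milliseconds and the spindle speed is in revolutions per minute.

```cpp
// Time for one full revolution, in ms (60000 ms per minute).
double revolution_time_ms(double rpm) {
    return 60000.0 / rpm;
}

// Average latency r = half a revolution.
double average_latency_ms(double rpm) {
    return revolution_time_ms(rpm) / 2.0;
}

// Transfer time = (bytes transferred / bytes on a track) * rotation time.
double transfer_time_ms(double bytes_transferred,
                        double bytes_per_track,
                        double rpm) {
    return bytes_transferred / bytes_per_track * revolution_time_ms(rpm);
}
```

At 5000 rpm, `revolution_time_ms` gives 12 ms and `average_latency_ms` gives 6 ms. Transferring one 512-byte sector from a track of 63 such sectors takes 12/63 of a millisecond, about 0.19 ms.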

36 Exercise
Given the following disk:
- 20 surfaces
- 800 tracks/surface
- 25 sectors/track
- 512 bytes/sector
- 3600 rpm (revolutions per minute)
- 7 ms track-to-track seek time
- 28 ms avg. seek time
- 50 ms max seek time
Find:
a) Average latency
b) Disk capacity
c) Time to read the entire disk, one cylinder at a time

37 Solution
a) Average latency: 3600 rev/min and 1 min = 60000 msec, so 1 revolution = 60000 / 3600 = 16.7 ms
Average latency = ½ * (60000 / 3600) = 16.7 / 2 = 8.3 ms
b) Disk capacity = 25 * 512 * 800 * 20 = 204.8 MB
c) Time to read the disk:
Track read time = 1 revolution time = 16.7 ms
Cylinder read time = 20 * 16.7 = 334 ms
Total read time = 800 cylinder reads + 799 cylinder switches = 800 * 334 ms + 799 * 7 ms = 267.2 sec + 5.6 sec = 272.8 sec
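Part c) can be checked with a short function. This is a sketch under the same simplifications as the slide: every track is read in exactly one revolution, each move to the next cylinder costs one track-to-track seek, and the initial seek and latency are ignored. The function name is hypothetical.

```cpp
// Time in ms to read a whole disk, one cylinder at a time.
// cylinders:            number of cylinders (= tracks per surface)
// tracks_per_cylinder:  number of surfaces
// track_to_track_ms:    time to step the arm to the adjacent cylinder
double full_disk_read_ms(int cylinders,
                         int tracks_per_cylinder,
                         double rpm,
                         double track_to_track_ms) {
    double revolution_ms = 60000.0 / rpm;              // one track read
    double cylinder_read_ms = tracks_per_cylinder * revolution_ms;
    // One read per cylinder, one arm movement between consecutive cylinders.
    return cylinders * cylinder_read_ms
         + (cylinders - 1) * track_to_track_ms;
}
```

`full_disk_read_ms(800, 20, 3600.0, 7.0)` comes to roughly 272.3 seconds. The hand calculation lands about half a second higher only because it rounds the revolution time to 16.7 ms before multiplying.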

38 Exercise
Disk characteristics:
- Average seek time = 8 msec
- Spindle speed = 10,000 rpm
- Sectors per track = 170
- Sector size = 512 bytes
Q) What is the average time to read one sector?

39 Solution
Average time to read one sector: s + r + btt
What is btt? btt = block transfer time = revolution time / # of sectors per track
Revolution time = 60000 / 10000 = 6 msec
r = 6 / 2 = 3 msec
btt = 6 / 170 = 0.035 ms
s + r + btt = 8 + 3 + 0.035 = 11.035 ms

40 Sequential Reading
Given the following disk:
s = 16 ms
r = 8.3 ms
Block transfer time = 0.84 ms
a) Calculate the time to read 10 sequential blocks
b) Calculate the time to read 100 sequential blocks

41 Solution
a) Reading 10 sequential blocks: = s + r + 10 * btt = 16 + 8.3 + 10 * 0.84 = 32.7 ms
b) 100 blocks: = 16 + 8.3 + 100 * 0.84 = 108.3 ms

42 Random Reading
Given the same disk,
a) Calculate the time to read 10 blocks randomly
b) Calculate the time to read 100 blocks randomly

43 Solution
a) Reading 10 blocks randomly: = 10 * (s + r + btt) = 10 * (16 + 8.3 + 0.84) = 251.4 ms
b) 100 blocks: = 100 * (16 + 8.3 + 0.84) = 2514 ms

44 Fast Sequential Reading
We assume that blocks are arranged so that there is no rotational delay in transferring from one track to another within the same cylinder. This is possible if consecutive track beginnings are staggered (like the starting positions in races on circular tracks). We also assume that consecutive blocks are arranged so that when the next block is on an adjacent cylinder, there is no rotational delay after the arm is moved to the new cylinder.
Fast sequential reading: no rotational delay after finding the first block.

45 Consequently …
Reading b blocks:
Sequentially: s + r + b * btt (the s + r term is insignificant for large files)
Randomly: b * (s + r + btt)
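The two cost formulas translate directly into code. This is a minimal sketch with hypothetical function names; `s`, `r` and `btt` are the average seek time, average latency and block transfer time in milliseconds, and `b` is the number of blocks.

```cpp
// Sequential read: one seek and one rotational delay, then b transfers.
double sequential_read_ms(double s, double r, double btt, int b) {
    return s + r + b * btt;
}

// Random read: every block pays its own seek and rotational delay.
double random_read_ms(double s, double r, double btt, int b) {
    return b * (s + r + btt);
}
```

With the disk from the earlier exercises (s = 16, r = 8.3, btt = 0.84), 100 blocks take about 108.3 ms sequentially and about 2514 ms randomly, a factor of more than 20.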

46 Exercise
Given a file of records, 1600 bytes each, and block size 2400 bytes, how does record placement affect sequential reading time?
i) Empty space in blocks.
ii) Records overlap block boundaries.

47 Solution
i) Empty space in blocks: b = # of blocks = n = # of records = 30000
Time = 30000 * 0.84 ms = 25.2 sec
ii) Records overlap boundaries: Bfr = blocking factor = 2400/1600 = 3/2
b = 30000 / 1.5 = 20000 blocks
Time = 20000 * 0.84 ms = 16.8 sec (1/3 faster)
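The block counts behind the two cases can be computed as below. This is a sketch with hypothetical function names, using the 30000 records and the 0.84 ms block transfer time that appear in the solution's arithmetic. Like the slide, it counts only transfer time and ignores the initial seek and latency.

```cpp
// Blocks needed when records may NOT span blocks (empty space is left
// at the end of each block). Returns -1 if a record does not fit.
long long blocks_unspanned(long long n_records,
                           long long record_size,
                           long long block_size) {
    long long records_per_block = block_size / record_size;
    if (records_per_block == 0) return -1;
    return (n_records + records_per_block - 1) / records_per_block;
}

// Blocks needed when records MAY overlap block boundaries (no space wasted
// except possibly in the last block).
long long blocks_spanned(long long n_records,
                         long long record_size,
                         long long block_size) {
    long long total_bytes = n_records * record_size;
    return (total_bytes + block_size - 1) / block_size;
}

// Sequential transfer time in ms for b blocks, transfer time only.
double transfer_only_ms(long long b, double btt_ms) {
    return b * btt_ms;
}
```

For 1600-byte records in 2400-byte blocks, `blocks_unspanned` gives 30000 blocks and `blocks_spanned` gives 20000 blocks, which at 0.84 ms per block is 25.2 sec versus 16.8 sec.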

48 Exercise
Specifications of a 300 MB disk drive:
- Min seek time = 6 ms
- Average seek time = 18 ms
- Rotational delay = 8.3 ms
- Transfer rate = 16.7 ms/track or 1229 bytes/ms
- Bytes per sector = 512
- Sectors per track = 40
- Tracks per cylinder = 12
- Tracks per surface = 1331
- Interleave factor = 1
- Cluster size = 8 sectors
- Extent size = 5 clusters
Q) How long will it take to read a 2048 KB file that is divided into 8000 records of 256 bytes each?
i) Access the file sequentially
ii) Access the file randomly

49 Solution
First find the # of extents:
1 cluster = 8 sectors = 8 * 512 = 4096 bytes, i.e. 16 records per cluster
File contains 8000/16 = 500 clusters
Extent size = 5 clusters = 40 sectors = 1 track
File contains 100 extents => 100 tracks
i) Access the file sequentially:
For 1 track = s + r + track transfer time = 18 + 8.3 + 16.7 = 43 ms
100 tracks = 4300 ms = 4.3 sec
ii) Access the file randomly (8000 records):
For each record: s + r + read 1 cluster = 18 + 8.3 + 1/5 * 16.7 = 29.6 ms
8000 records => 8000 * 29.6 ms = 236.8 sec

50 Secondary Storage Devices: Floppy Disks

51 Floppy Disks
A floppy disk is a disk storage medium composed of a disk of thin and flexible magnetic storage medium.
Developed by IBM.
3.5-inch, 5.25-inch and 8-inch forms.

52 Internal parts of a 3½-inch floppy disk.
- A hole that indicates a high-capacity disk.
- The hub that engages with the drive motor.
- A shutter that protects the surface when removed from the drive.
- The plastic housing.
- A polyester sheet reducing friction against the disk media as it rotates within the housing.
- The magnetic coated plastic disk.
- A schematic representation of one sector of data on the disk; the tracks and sectors are not visible on actual disks.

53 Floppy Disks
A spindle motor in the drive rotates the magnetic medium at a certain speed. A stepper motor-operated mechanism moves the magnetic read/write head(s) along the surface of the disk.

54 Secondary Storage Devices: Magnetic Tapes

55 Characteristics
No direct access, but very fast sequential access.
Resistant to different environmental conditions.
Easy to transport and store; cheaper than disk.
Formerly widely used to store application data; nowadays mostly used for backups or archives.

56 Magnetic tapes
A sequence of bits is stored on magnetic tape. For storage, the tape is wound on a reel. To access the data, the tape is unwound from one reel to another. As the tape passes the head, bits of data are read from or written onto the tape.

57 (Figure labels: reel 1, reel 2, tape, read/write head.)

58 Tracks
Typically data on tape is stored in 9 separate bit streams, or tracks. Each track is a sequence of bits. Recording density = # of bits per inch (bpi). Typically 800 or 1600 bpi, with higher densities on some recent devices.

59 In detail …
(Figure: a section of ½” tape; each frame across the tape holds 8 bits = 1 byte plus a parity bit.)

60 Tape Organization
(Figure labels on a 2400’ tape: BOT marker, header block, data blocks, inter block gap, EOT marker.)
BOT = beginning of tape; EOT = end of tape
Header block: describes data blocks
Inter block gap: for acceleration and deceleration of tape
Blocking factor: # records per block
Spring 2006 by Li Ma, TSU - cs344

61 Data Blocks and Records
Each data block is a sequence of contiguous records. A record is the unit of data that a user’s program deals with. The tape drive reads an entire block of records at once. Unlike a disk, a tape starts and stops. When stopped, the read/write head is over an interblock gap.

62 Example: tape capacity
Given the following tape:
- Recording density = 1600 bpi
- Tape length = 2400’ (feet)
- Interblock gap = ½” (inch)
- 512 bytes per record
- Blocking factor = 25
How many records can we write on the tape? (Ignore the BOT and EOT markers and the header block for simplicity.)

63 Solution
#bytes/block = (512 bytes/record) * (25 records/block) = 12,800 bytes/block
Block length = (#bytes/block) / (#bytes/inch) = 12,800/1600 inches = 8 inches
Block + gap = 8” + ½” = 8.5”
Tape length = 2400 ft * 12 in/ft = 28,800 in
#blocks = (tape length) / (block + gap) = 28,800/8.5 = 3388 blocks
#records = (#blocks) * (#records/block) = 3388 * 25 = 84,700 records
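The same calculation can be parameterized to show how the blocking factor affects capacity. This is a sketch with a hypothetical function name. It follows the slide in treating the recording density as bytes per inch (one byte per frame across the tracks) and in ignoring the BOT/EOT markers and the header block.

```cpp
// Number of whole records that fit on a tape.
long tape_record_capacity(double bytes_per_inch,
                          double tape_length_feet,
                          double gap_inches,
                          long bytes_per_record,
                          long blocking_factor) {
    double block_inches =
        static_cast<double>(bytes_per_record * blocking_factor) / bytes_per_inch;
    double tape_inches = tape_length_feet * 12.0;
    // Each block is followed by one inter block gap; partial blocks are dropped.
    long blocks = static_cast<long>(tape_inches / (block_inches + gap_inches));
    return blocks * blocking_factor;
}
```

`tape_record_capacity(1600, 2400, 0.5, 512, 25)` reproduces the 84,700 records above. With a blocking factor of 1 the same tape holds only 35,121 records, because each 0.32-inch block is followed by a 0.5-inch gap.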

64 Secondary Storage Devices: CD-ROM

65 Physical Organization of CD-ROM
Compact Disk – read only memory (write once)
Data is encoded and read optically with a laser.
Can store around 600 MB of data.
Digital data is represented as a series of pits and lands:
- Pit = a little depression, forming a lower level in the track
- Land = the flat part between pits, or the upper levels in the track

66 Organization of data
Reading a CD is done by shining a laser at the disc and detecting changing reflection patterns.
1 = change in height (land to pit or pit to land)
0 = a “fixed” amount of time between 1’s
(Figure: alternating LAND and PIT regions along the track.)
Note: we cannot have two 1’s in a row! => uses the Eight to Fourteen Modulation (EFM) encoding table.

67 Properties
Note that: since 0's are represented by the length of time between transitions, we must travel at constant linear velocity (CLV) on the tracks.
Sectors are organized along a spiral.
Sectors have the same linear length.
Advantage: takes advantage of all storage space available.
Disadvantage: has to change rotational speed when seeking (slower towards the outside).

68 Addressing
1 second of play time is divided up into 75 sectors. Each sector holds 2 KB.
60 min CD: 60 min * 60 sec/min * 75 sectors/sec = 270,000 sectors = 540,000 KB ~ 540 MB
A sector is addressed by Minute:Second:Sector, e.g. 16:22:34
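The Minute:Second:Sector address maps to a linear sector number by the same arithmetic as the capacity calculation. The sketch below uses hypothetical function names and assumes that address 0:0:0 is sector 0, which the slide does not state.

```cpp
// Linear sector number for a Minute:Second:Sector address,
// assuming 75 sectors per second and that 0:0:0 is sector 0.
long cd_sector_index(int minute, int second, int sector) {
    return (minute * 60L + second) * 75L + sector;
}

// Capacity in KB of a CD with the given play time, at 2 KB per sector.
long cd_capacity_kb(int minutes) {
    return minutes * 60L * 75L * 2L;
}
```

Under that assumption, the example address 16:22:34 is sector 73,684, and `cd_capacity_kb(60)` returns the 540,000 KB computed above.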

69 DVD (Digital Video Disc) Characteristics
A DVD disc has the same physical size as a CD disc, but it can store up to 17 GB of data. Like a CD disc, data is recorded on a DVD disc in a spiral trail of tiny pits separated by lands. The DVD’s larger capacity is achieved by making the pits smaller and the spiral tighter, and by recording the data on as many as four layers, two on each side of the disc. To read these tightly packed discs, lasers that produce a shorter-wavelength beam of light are required, together with a more accurate aiming and focusing mechanism. In fact, the focusing mechanism is the technology that allows data to be recorded on two layers. To read the second layer, the reader simply focuses the laser a little deeper into the disc, where the second layer of data is recorded.

70 Secondary Storage: Flash Memory

71 Flash Memory
Non-volatile computer storage chip that can be electrically erased and reprogrammed. It was developed from EEPROM (electrically erasable programmable read-only memory).
The NAND type: primarily used in memory cards and USB flash drives, for general storage and transfer of data.
The NOR type: used as a replacement for the older EPROM and as an alternative to certain kinds of ROM applications.

72 Flash Memory
Replacement for hard disks:
Adv: Flash memory does not have the mechanical limitations and latencies of hard drives.
Disadv: The cost per gigabyte of flash memory remains significantly higher than that of hard disks.

73 Buffer Management

74 Buffer Management
Buffering means working with large chunks of data in main memory so that the number of accesses to secondary storage is reduced.
System I/O buffers: these are beyond the control of application programs and are manipulated by the O.S.

75 System I/O Buffer
(Figure: data is transferred by blocks between secondary storage and the buffer, and by records between the buffer and the program. The buffer is temporary storage in MM for one block of data.)

76 Buffer Bottlenecks
Consider the following program segment:
    while (1) {
        infile >> ch;
        if (infile.fail()) break;
        outfile << ch;
    }
What happens if the O.S. used only one I/O buffer? => Buffer bottleneck.
Most O.S. have an input buffer and an output buffer.

77 Buffering Strategies
Double buffering: two buffers can be used to allow processing and I/O to overlap. Suppose that a program is only writing to a disk. The CPU wants to fill a buffer at the same time that I/O is being performed. If two buffers are used and I/O-CPU overlapping is permitted, the CPU can be filling one buffer while the other buffer is being transmitted to disk. When both tasks are finished, the roles of the buffers can be exchanged. The actual management is done by the O.S.

78 Other Buffering Strategies
Multiple buffering: instead of two buffers, any number of buffers can be used to allow processing and I/O to overlap.
Buffer pooling: there is a pool of buffers. When a request for a sector is received, the O.S. first looks to see whether that sector is already in some buffer. If it is not there, it brings the sector into some free buffer. If no free buffer exists, it must choose an occupied buffer to replace (usually an LRU, least recently used, strategy is used).
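Buffer pooling with LRU replacement can be sketched as follows. This is an illustration under stated assumptions rather than how any particular operating system implements it: the class name `BufferPool` and its methods are hypothetical, and the pool tracks only which sector numbers are buffered, not the sector contents or any write-back of modified buffers.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// Tracks which sectors currently occupy a fixed number of buffers,
// replacing the least recently used one when the pool is full.
class BufferPool {
public:
    explicit BufferPool(std::size_t capacity) : capacity_(capacity) {}

    // Returns true if the sector was already buffered (no disk access),
    // false if it had to be brought in from disk.
    bool request(long sector) {
        auto found = index_.find(sector);
        if (found != index_.end()) {
            // Hit: mark as most recently used by moving it to the front.
            lru_.splice(lru_.begin(), lru_, found->second);
            return true;
        }
        if (capacity_ == 0) return false;  // no buffers to hold anything
        if (lru_.size() == capacity_) {
            // No free buffer: evict the least recently used sector.
            index_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(sector);
        index_[sector] = lru_.begin();
        return false;
    }

    bool contains(long sector) const { return index_.count(sector) != 0; }

private:
    std::size_t capacity_;
    std::list<long> lru_;  // front = most recently used
    std::unordered_map<long, std::list<long>::iterator> index_;
};
```

With a pool of two buffers, requesting sectors 1, 2, 1, 3 in that order misses on 1 and 2, hits on the second request for 1, and then evicts sector 2 (not 1) to make room for 3. The list plus hash map gives constant-time lookup and constant-time reordering.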

