Presentation is loading. Please wait.

Presentation is loading. Please wait.


Similar presentations



2 Module Overview We will focus on the main challenges of computer architecture –Managing the I/O hierarchy Caching, multiprogramming, virtual memory –Speeding up the CPU Pipelined and multicore architectures –Protecting user computations and data Memory protection, privileged instructions


4 The memory hierarchy (I) CPU registers Main memory (RAM) Secondary storage (Disks) Mass storage (Often offline)

5 CPU registers Inside the processor itself Some can be accessed by our programs –Others no Can be read/written to in one processor cycle –If processor speed is 2 GHz 2,000,000,000 cycles per second 2 cycles per nanosecond

6 Main memory (I) Byte accessible –Each group of 8 bits has an address Dynamic random access memory (DRAM) –Slower but much cheaper than static RAM –Contents must be refreshed every 64 ms Otherwise its contents are lost: –DRAM is volatile

7 Main memory (II) Memory is organized as a sequence of 8-bit bytes –Each byte an address –Bytes can contain one character Roman alphabet with accents 0123456789101112131415

8 Main memory (III) Groups of four bytes starting at addresses that are multiple of 4 form words – Better suited to hold numbers –Also have half-words, double words, quad words 04812

9 Accessing main memory contents (I) When look for some item, our search criteria can include the location of the item – The book on the table – The student behind you, … More often our main search criterion is some attribute of the item – The color of a folder – The title or the authors of a book – The name of an individual

10 Accessing main memory contents (II) Computers always access their memory by location –The byte at address 4095 –The word at location 512 States the address of the first byte in the word Why? – Fastest way for them to accessan item 512513514515

11 An analogy (I) Some research libraries have a closed-stack policy –Only library employees can access the stacks –Patrons wanting to get an item fill a form containing a call number specifying the location of the item Could be Library of Congress classification if the stacks are organized that way.

12 An analogy (II) The procedure followed by the employee fetching the book is fairly simple –Go at location specified by the book call number –Check it the book is there –Bring it to the patron

13 An analogy (III) The memory operates in an even simpler manner – Always fetch the contents of the addressed bytes Junk or not

14 Disk drives (I) Sole part of computer architecture with moving parts: Data stored on circular tracks of a disk –Spinning speed between 5,400 and 15,000 rotations per minute –Accessed through a read/write head

15 Disk drives (II) Platter R/W head Arm Servo

16 Disk drives (III) Data can be accessed by blocks of 4KB, 8 KB, … –Depends on disk partition parameters User selectable To access a disk block –Read/write head must be over the right track Seek time –Data to be accessed must pass under the head Rotational latency

17 Estimating the rotational latency On the average half a disk rotation If disk spins at 15,000 rpm –250 rotations per second –Half a rotation corresponds to 2ms Most desktops have disks that spin at 7,200 rpm Most notebooks have disks that spin at 5,400 or 7,200 rpm

18 Accessing disk contents Each block on a disk has a unique address –Normally a single number Logical block addressing (LBA) –Older PCs used a different scheme

19 The memory hierarchy (II) LevelDeviceAccess Time 1Fastest registers (2 GHz CPU) 0.5 ns 2Main memory 10-70 ns 3Secondary storage (disk) 7 ms 4Mass storage (CD-ROM library) a few s

20 The memory hierarchy (III) To make sense of these numbers, let us consider an analogy

21 Writing a paper (I) LevelResourceAccess Time 1Open book on desk 1 s 2Book on desk 3Book in library 4Book far away

22 Writing a paper (II) LevelResourceAccess Time 1Open book on desk 1 s 2Book on desk 20-140 s 3Book in library 4Book far away

23 Writing a paper (III) LevelResourceAccess Time 1Open book on desk 1 s 2Book on desk 20-140 s 3Book in library 162 days 4Book far away

24 Writing a paper (IV) LevelResourceAccess Time 1Open book on desk 1 s 2Book on desk 20-140 s 3Book in library 162 days 4Book far away 63 years

25 The two gaps (I) Gap between CPU and main memory speeds: –Will add intermediary levels L1, L2, and L3 caches –Will store contents of most recently accessed memory addresses Most likely to be needed in the future – Purely hardware solution Software does not see it

26 Major issues Huge gaps between –CPU speeds and SDRAM access times –SDRAM access times and disk access times Both problems have very different solutions –Gap between CPU speeds and SDRAM access times handled by hardware –Gap between SDRAM access times and disk access times handled by combination of software and hardware

27 Why? Having hardware handle an issue –Complicates hardware design –Offers a very fast solution –Standard approach for very frequent actions Letting software handle an issue –Cheaper –Has a much higher overhead –Standard approach for less frequent actions

28 Will the problem go away? It will become worse –RAM access times are not improving as fast as CPU power –Disk access times are limited by rotational speed of disk drive

29 What are the solutions? To bridge the CPU/DRAM gap: –Interposing between the CPU and the DRAM smaller, faster memories that cache the data that the CPU currently needs Cache memories Managed by the hardware and invisible to the software (OS included)

30 What are the solutions? To bridge the DRAM/disk drive gap: Storing in main memory the data blocks that are currently accessed ( I/O buffer ) Managing memory space and disk space as a single resource ( Virtual memory ) I/O buffer and virtual memory are managed by the OS and invisible to the user processes

31 Why do these solutions work? Locality principle: – Spatial locality: at any time a process only accesses a small portion of its address space – Temporal locality: this subset does not change too frequently

32 The true memory hierarchy CPU registers Main memory (RAM) Secondary storage (Disks) Mass storage (Often offline) L1, L2 and L3 caches

33 Handling the CPU/DRAM speed gap

34 The technology Caches use faster static RAM (SRAM) –(D flipflops) Can have –Separate caches for instructions and data Great for pipelining –A unified cache

35 Basic principles Assume we want to store in a faster memory 2 n words that are currently accessed by the CPU –Can be instructions or data or even both When the CPU will need to fetch an instruction or load a word into a register –It will look first into the cache –Can have a hit or a miss

36 Cache hits Occur when the requested word is found in the cache –Cache avoided a memory access –CPU can proceed

37 Cache misses Occur when the requested word is not found in the cache –Will need to access the main memory –Will bring the new word into the cache Must make space for it by expelling one of the cache entries –Need to decide which one

38 Cache design challenges Cache contains a small subset of memory addresses Must find a very fast access mechanism –No linear search, no binary search –Would like to have an associative memory Can search by content all memory entries in parallel –Like human brains do

39 An associative memory Search for “ice cream” COSC 1306 program Finding a parking spot My last ice cream Other ice cream moment Found

40 An analogy (I) Let go back to our closed-stack library example –Librarians have noted that some books get asked again and again Want to put them closer to the circulation desk –Would result in much faster service –The problem is how to locate these books They will not be at the right location!

41 An analogy (II) Librarians come with a great solution –They put behind the circulation desk shelves with 100 book slots numbered from 00 to 99 –Each slot is a home for the most recently requested book that has a call number whose last two digits match the slot number 3141593 can only go in slot 93 1234567 can only go in slot 67

42 An analogy (III) The call number of the book I need is 3141593 Let me see if it's in bin 93

43 An analogy (IV) To let the librarian do her job each slot much contain either – Nothing or –A book and its reference number There are many books whose reference number ends in 93 or 67 or any two given digits

44 An analogy (V) Could I get this time the book whose call number 4444493? Sure

45 An analogy (VI) This time the librarian will –Go bin 93 –Find it contains a book with a different call number She will –Bring back that book to the stacks –Fetch the new book

46 A very basic cache Has 2 n entries Each entry contains –A word (4 bytes) –Its memory address Sole way to identify the word –A bit indicating whether the cache entry contains something useful

47 A very basic cache (I) RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord Actual caches are much bigger RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord RAM AddressWord TagContentsValid Y/N 110 111 000 001 100 101 010 011

48 Multiword cache Word 000 001 010 011 000 001 100 101 010 011 000 001 110 111 100 101 010 011 000 001 Valid Y/N 110 111 100 101 010 011 000 001 Valid Y/N 110 111 100 101 010 011 000 001 Tag Valid Y/N 110 111 100 101 010 011 000 001 100 101 010 011 Contents 110 111 000 001 100 101 010 011 Word Address * Word

49 Set-associative caches (I) Can be seen as 2, 4, 8 caches attached together Reduces collisions

50 Back to our library example What if two books whose call number have the same last two digits are often asked on the same day: –Say, 3141593 and 4444493 Best solution is –Keep the number of book slots equal to 100 –Store more than one book with same last two digits in the same slot

51 Set-associative caches (II) 000 001 010 011 100 101 110 111 Address * Block Address * Block Address * Block Address * Block Address * Block Address * Block Address * Block Address * Block TagContentsValid Y/N 000 001 010 011 100 101 110 111 Address * Block Address * Block Address * Block Address * Block Address * Block Address * Block Address * Block Address * Block TagContentsValid Y/N

52 Set-associative caches (III) Advantage: –We take care of more collisions Like a hash table with a fixed bucket size –Results in lower miss rates than direct- mapped caches Disadvantage: –Slower access –Best solution if miss penalty is very big

53 Fully associative caches The dream! A block can occupy any index position in the cache Requires an associative memory – Content-addressable –Like our brain! Remains a dream

54 A cache hierarchy Two or three levels (L1, L2, L3) L1 cache: – Highest level –Optimized for speed: Direct mapping and no associativity L2 and L3 caches: –Optimized for hit ratio Higher associativity

55 Handling the DRAM/disk speed gap

56 What can be done? Two main techniques –Making disk accesses more efficient –Doing something else while waiting for an I/O operation Not very different from what we are doing in our every day's lives

57 Optimizing read accesses (I) When we shop in a market that’s far away from our home, we plan ahead and buy food for several days The OS will read as many bytes as it can during each disk access –In practice, whole pages (4KB or more) –Pages are stored in the I/O buffer

58 Optimizing read accesses (II) Process I/O buffer Disk Drive Read operation Physical I/O Most short read operations can be completed without any disk access

59 Optimizing read accesses (III) Buffered writes work quite well –Most systems use it They have a major limitation –If we try to read too much ahead of the program, we risk to bring into main memory data that will never be used

60 Optimizing read accesses (IV) Can also keep in a buffer recently accessed blocks hoping they will be accessed again – Caching Works very well because we keep accessing again and again the data we are working with Caching is a fundamental technique of OS and database design

61 Optimizing write accesses (II) If we live far away from a library, we wait until we have several books to return before making the trip The OS will delay writes for a few seconds then write an entire block –Since most writes are sequential, most short writes will not require any disk access

62 Optimizing write accesses (II) Delayed writes work quite well –Most systems use it It has a major drawback –We will lose data if the system or the program crashes After the program issued a write but Before the data were saved to disk

63 Doing something else When we order something on the web, we do not remain idle until the goods are delivered The OS can implement multiprogramming and let the CPU run another program while a program waits for an I/O

64 Advantages (I) Multiprogramming is very important in business applications –Many of these applications use the peripherals much more than the CPU –For a long time the CPU was the most expensive component of a computer – Multiprogramming was invented to keep the CPU busy

65 Advantages (II) Multiprogramming made time-sharing possible Multiprogramming lets your PC run several applications at the same time –MS Word and MS Outlook

66 Requirements Two basic requirements –Must have a mechanism that handles I/O without CPU intervention I/O controller –Must have a way to notfy the CPU when an I/O operation is completed Interrupts

67 Analogy When we buy a book in brick and mortar store, –We wait until we get the book May have to waste time waiting in line When we order a book over the Internet, –We do other things while waiting for the book UPS takes care of it We get notified when book has arrived

68 Interrupts Normally the kernel is inactive –Users' programs are in control When OS intervention is required –Must interrupt the flow of execution of the CPU –Give CPU to the OS

69 Interrupts Detected by the CPU hardware –After it has executed the current instruction – Before it starts the next instruction

70 A very schematic view (I) A very basic CPU would execute the following loop: forever { fetch_instruction(); decode_instruction(); execute_instruction(); } Pipelining makes things more complicated

71 A very schematic view (II) We add an extra step: forever { check_for_interrupts(); fetch_instruction(); decode_instruction(); execute_instruction(); }

72 Types of interrupts I/O completion interrupts – Requested data are now in memory Timer interrupts – A process has been using the CPU for more than x ms System calls – Running process needs something from the OS …

73 Kernel System calls Program System call / system request

74 Managing the main memory

75 Monitor Multiprogramming/variable partitions Initially everything works fine –Three processes occupy most of memory –Unused part of memory is very small P0 P1 P2

76 Monitor Multiprogramming/variable partitions When P0 terminates –Replaced by P3 – P3 must be smaller than P0 Start wasting memory space P3 P1 P2

77 Monitor Multiprogramming/variable partitions When P2 terminates –Replaced by P4 – P4 must be smaller than P0 plus the free space Start wasting more memory space P3 P1 P4

78 External fragmentation Happens in all systems using multiprogramming with variable partitions Occurs because new process must fit in the hole left by terminating process –Very low probability that both process will have exactly the same size –Typically the new process will be a bit smaller than the terminating process

79 An Analogy Replacing an old book by a new book on a bookshelf New book must fit in the hole left by old book –Very low probability that both books have exactly the same width –We will end with empty shelf space between books Solution it to push books left and right

80 Virtual memory Combines two big ideas – Non-contiguous memory allocation: processes are allocated page frames scattered all over the main memory – On-demand fetch: Process pages are brought in main memory when they are accessed for the first time MMU takes care of almost everything

81 Main memory Divided into fixed-size page frames –Allocation units –Sizes are powers of 2 (512 B, 1KB, 2KB, 4KB ) –Properly aligned –Numbered 0, 1, 2,... 012345678

82 Program address space Divided into fixed-size pages –Same sizes as page frames –Properly aligned –Also numbered 0, 1, 2,... 01234567

83 The mapping Will allocate non contiguous page frames to the pages of a process 01234567012

84 Is it virtual or real? MMU translates – Virtual addresses used by the process into – Real addresses in main memory

85 Realization

86 On-demand fetch (I) Most processes terminate without having accessed their whole address space – Code handling rare error conditions,... Other processes go to multiple phases during which they access different parts of their address space – Compilers

87 On-demand fetch (II) VM systems do not fetch whole address space of a process when it is brought into memory They fetch individual pages on demand when they get accessed the first time – Page miss or page fault When memory is full, they expel from memory pages that are not currently in use

88 Advantages System does not waste time loading pages that will be never accessed. Can have very large virtual address spaces Could run programs that are too big to fit in main memory – Important during the 70's and early 80's

89 Disadvantages Slows down memory accesses –Address translation overhead –Page faults Page faults introduce unpredictable delays –Very bad for real-time system No substitute for enough physical memory –Page faults are very expensive

90 Implementation To speed up address translation –A few hundred recently accessed page table entries are cached in the Translation Lookaside Buffer (TLB) –Remainder of page table is divided between Main memory (active page table entries) Secondary storage (the other entries)


92 Main techniques Speeding up the CPU clock –A 2GHz CPU goes through to computing cycle every nanosecond Letting the CPU "pipeline" instructions –Let the CPU work as an assembly line Have a multicore architecture

93 Limitations Speeding up the CPU clock –CPU with clock rates over 2 to 3 GHz become increasingly hard to cool Letting the CPU "pipeline" instructions –Cannot have perfect pipelining Have a multicore architecture –Harder to write software

94 Pipelining

95 The big idea Making different parts of the CPU work at the same time at different steps of a different instructions –Transforming the CPU into an assembly line

96 An analogy (I) Washing your clothes –Four steps: 1.Putting in the washer 2.Putting in the dryer 3.Folding/ironing 4.Putting them away

97 An analogy (II) Most people –Start second wash load as soon as first wash load is in dryer –Put second wash load in dryer and start a third wash load while they are folding/ironing the firs wash load

98 Purely sequential approach Time6 pm6:307pm7:308pm8:309pm9:30 WashDryFoldStore WashDryFoldStore

99 Smart approach Time6 pm6:307pm7:308pm8:309pm9:30 WashDryFoldStore WashDryFoldStore WashDryFoldStore WashDryFoldStore Solution assumes that a housemate puts folded/ironed clothes way for us

100 Main advantage Can do much more in much less time

101 An example Pipelining in the MIPS architecture Why? –Architecture developed by a team lead by John Hennessy from Stanford University –Pipelining is well described in architecture textbook by Hennessy and Patterson WARNING: It is just an example

102 Multiprocessor architectures

103 The solutions Many parallel processing solutions –Multiprocessor architectures Two or more microprocessor chips Multiple architectures –Multicore architectures Several processors on a single chip – Can have both

104 A dual core architecture RAM Shared Cache Core Cache Core Cache

105 Even our your cell phone Taiwanese chip maker MediaTek has introduced what it’s calling the first “true” octa-core chip. The MT6592 can use up to 8 processor cores at once –It’s not clear what that will actually mean in terms of day-to-day performance., November 20, 2013

106 Rene Descartes Seventeenth-century French philosopher Invented –Cartesian coordinates –Methodical doubt [To] never to accept anything for true which I did not clearly know to be such Proposed a scientific method based on four precepts

107 A major challenge Keeping the caches consistent –Contents of same memory location can be cached by different processing units –What if one of the processing units modifies these contents All other caches will have the old values –Must update the memory location and invalidate the values in the other caches

108 The software side Two ways for software to exploit parallel processing capabilities of hardware – Job-level parallelism Several sequential processes run in parallel Easy to implement (OS does the job!) – Process-level parallelism A single program runs on several processors at the same time

109 Main considerations Some problems are embarrassingly parallel –Many computer graphics tasks –Brute force searches in cryptography or password guessing Much more difficult for other applications –Communication overhead among sub-tasks –Balancing the load

110 A last issue Humans likes to address issues one after the order –We have meeting agendas –We do not like to be interrupted –We write sequential programs

111 Rene Descartes Seventeenth-century French philosopher Invented –Cartesian coordinates –Methodical doubt [To] never to accept anything for true which I did not clearly know to be such Proposed a scientific method based on four precepts

112 Method's third rule The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.

113 My take Things will have to change


115 Protecting users’ data (I) Unless we have an isolated single-user system, we must prevent users from –Accessing –Deleting –Modifying without authorization other people's programs and data

116 Protecting users’ data (III) Two aspects –Protecting user's files on disk –Preventing programs from interfering with each other

117 Historical Considerations Earlier operating systems for personal computers did not have any protection –They were single-user machines –They typically ran one program at a time Windows 2000, Windows XP, Vista, Windows 7 and MacOS X are protected

118 Protecting users’ files Key idea is to prevent users’ programs from directly accessing the disk Will require I/O operations to be performed by the kernel Make them privileged instructions that only the kernel can execute

119 Privileged instructions Require a dual-mode CPU Two CPU modes – Privileged mode or executive mode that allows CPU to execute all instructions – User mode that allows CPU to execute only safe unprivileged instructions State of CPU is determined by a special bit

120 Kernel User Process X

121 Switching between states User mode will be the default mode for all programs –Only the kernel can run in supervisor mode Switching from user mode to supervisor mode is done through an interrupt –Safe because the jump address is at a well- defined location in main memory

122 Performing an I/O Program Kernel I/O request (interrupt) Physical I/O (executed by the kernel)

123 An analogy (I) Most UH libraries are open stacks –Anyone can consult books in the stacks and bring them to checkout National libraries and the Library of Congress have close stack collections –Users fill a request for a specific document –A librarian will bring the document to the circulation desk

124 An analogy (II) O pen stack collections –Let users browse the collections –Users can misplace or vandalize books C lose stack collections –Much slower and less flexible –Much safer

125 More trouble Having a dual-mode CPU is not enough to protect user’s files Must also prevent rogue users from tampering with the kernel –Same as a rogue customer bribing a librarian in order to steal books Done through memory protection

126 Memory protection (I) Prevents programs from accessing any memory location outside their own address space Requires special memory protection hardware Memory protection hardware –Checks every reference issued by program –Generates an interrupt when it detects a protection violation

127 Memory protection (II) Has additional advantages: –Prevents programs from corrupting address spaces of other programs –Prevents programs from crashing the kernel Not true for device drivers which are inside the kernel Required part of any multiprogramming system

128 Even more trouble Having both a dual-mode CPU and memory protection is not enough to protect user’s files Must also prevent rogue users from booting the system with a doctored kernel – Example: Can run Linux from a “live” CD Linux Linux will read all NTFS files ignoring all restrictions set up by Vista or Windows 7

129 Conclusion As computer architecture becomes more complex –Some old problems continue to bother us: Wide access time gaps between –CPU and main memory –Main memory and disk (or even flash) –Some solutions bring new challenges: Multicore architectures


Similar presentations

Ads by Google