1 Review EE138 – SJSU

2 Memory organization
Memory chips are organized into a number of locations within the IC. Each location can hold 1 bit, 4 bits, 8 bits, or even 16 bits, depending on the internal design. The number of bits that each location can hold is always equal to the number of data pins on the chip. How many locations exist inside a memory chip? That depends on the number of address pins: the number of locations within a memory IC always equals 2 to the power of the number of address pins. Therefore, the total number of bits that a memory chip can store equals the number of locations times the number of data bits per location. To summarize:
1. A memory chip contains 2^x locations, where x is the number of address pins.
2. Each location contains y bits, where y is the number of data pins on the chip.
3. The entire chip contains 2^x × y bits, where x is the number of address pins and y is the number of data pins on the chip.
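The rules above can be checked with a short host-side C sketch (the function names are ours, not from the slides):

```c
#include <assert.h>
#include <stdint.h>

/* Number of locations = 2^address_pins */
static uint32_t mem_locations(unsigned address_pins) {
    return (uint32_t)1 << address_pins;
}

/* Total capacity in bits = locations * data bits per location */
static uint32_t mem_capacity_bits(unsigned address_pins, unsigned data_pins) {
    return mem_locations(address_pins) * data_pins;
}
```

For example, 12 address pins and 4 data pins give 4,096 locations (a 4K × 4 organization) and 16,384 bits of capacity.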


4 EXAMPLES:
1) A given memory chip has 12 address pins and 4 data pins. Find: (a) the organization, and (b) the capacity.
Solution: (a) This memory chip has 4,096 locations (2^12 = 4,096), and each location can hold 4 bits of data. This gives an organization of 4,096 × 4, often represented as 4K × 4. (b) The capacity is 16K bits, since there is a total of 4K locations and each location holds 4 bits of data.
2) A 512K memory chip has 8 pins for data. Find: (a) the organization, and (b) the number of address pins for this memory chip.
(a) A memory chip with 8 data pins means that each location within the chip can hold 8 bits of data. To find the number of locations, divide the capacity by the number of data bits per location: 512K/8 = 64K. Therefore, the organization for this memory chip is 64K × 8. (b) The chip has 16 address pins, since 2^16 = 64K.

5 Packaging issue in DRAM
Using the conventional method of data access, a 64K-bit chip (64K × 1) must have 16 address lines and 1 data line. To reduce the number of pins needed for addresses, multiplexing/demultiplexing is used: the address is split in half, and each half is sent in through the same pins, thereby requiring fewer address pins. Internally, the DRAM structure is divided into a square of rows and columns. The first half of the address is called the row and the second half is called the column. For example, in the case of a DRAM of 64K × 1 organization, the first half of the address is sent in through the 8 pins A0–A7, and by activating RAS (row address strobe), the internal latches inside the DRAM grab the first half of the address. After that, the second half of the address is sent in through the same pins, and by activating CAS (column address strobe), the internal latches inside the DRAM latch the second half of the address. This results in using 8 pins for addresses plus RAS and CAS, for a total of 10 pins, instead of the 16 pins that would be required without multiplexing.
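The address split can be illustrated in plain C (a host-side sketch, not DRAM driver code; taking the upper byte as the "first half"/row is our assumption for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Split a 16-bit address into the half sent first (latched by RAS, the row)
   and the half sent second (latched by CAS, the column). */
static uint8_t dram_row(uint16_t addr)    { return (uint8_t)(addr >> 8); }
static uint8_t dram_column(uint16_t addr) { return (uint8_t)(addr & 0xFF); }
```

Both halves travel over the same 8 pins, one after the other, which is why only 8 address pins plus RAS and CAS are needed.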


7 EXAMPLES: Discuss the number of pins set aside for addresses in each of the following memory chips. (a) 16K × 4 DRAM (b) 16K × 4 SRAM
Solution: Since 2^14 = 16K:
(a) For DRAM we have 7 pins (A0–A6) for the address and 2 pins for RAS and CAS, for a total of 9 pins devoted to addressing.
(b) For SRAM we have 14 pins for the address and no pins for RAS and CAS, since those are associated only with DRAM.
In both cases we have 4 pins for the data bus.

8 Inside CPU
A program stored in memory provides instructions to the CPU to perform an action. The function of the CPU is to fetch these instructions from memory and execute them. To perform fetch and execute, all CPUs are equipped with resources such as the following:
Registers: The CPU uses registers to store information temporarily. Registers inside the CPU can be 8-bit, 16-bit, 32-bit, or even 64-bit, depending on the CPU.
ALU (arithmetic/logic unit): performs arithmetic functions such as add, subtract, multiply, and divide, and logic functions such as AND, OR, and NOT.
Program counter: points to the address of the next instruction to be executed. As each instruction is executed, the program counter is incremented to point to the address of the next instruction.
Instruction decoder: interprets the instruction fetched into the CPU. One can think of the instruction decoder as a kind of dictionary, storing the meaning of each instruction and the steps the CPU should take upon receiving a given instruction.


10 Harvard and von Neumann architectures
Every microprocessor must have memory space to store program (code) and data. While the code provides instructions to the CPU, the data provides the information to be processed. The CPU uses buses (wire traces) to access the code ROM and data RAM memory spaces. The von Neumann (Princeton) architecture uses the same bus for accessing both the code and data. Accesses to code and data can therefore get in each other's way and slow down the processing speed of the CPU, because each has to wait for the other to finish fetching. The Harvard architecture speeds up program execution by using separate buses for the code and data memory:
- a set of data buses for carrying data into and out of the CPU,
- a set of address buses for accessing the data,
- a set of data buses for carrying code into the CPU, and
- an address bus for accessing the code.
This is easy to implement inside an IC chip such as a microcontroller, where both the code ROM and data RAM are internal (on-chip) and distances are on the micron-to-millimeter scale.


12 Mega AVR (ATmegaxxxx) Family
These are powerful microcontrollers with more than 120 instructions and lots of different peripheral capabilities, which can be used in different designs. See Table 1-3. Some of their characteristics are as follows: • Program memory: 4K to 256K bytes • Package: 28 to 100 pins • Extensive peripheral set • Extended instruction set: They have rich instruction sets.


14 THE AVR DATA MEMORY In AVR microcontrollers there are two kinds of memory space: code memory space and data memory space. Our program is stored in code memory space, whereas the data memory stores data. The data memory is composed of three parts: GPRs (general purpose registers), I/O memory, and internal data SRAM.


16 C Programming Example
Write an AVR C program to send values 00–FF to Port B.
Solution:
#include <avr/io.h>      // standard AVR header
int main(void)
{
    unsigned char z;
    DDRB = 0xFF;         // Port B is output
    for(z = 0; z <= 255; z++)
        PORTB = z;
    return 0;
}
// Notice that the program never exits the for loop: incrementing an
// unsigned char variable that holds 0xFF wraps it back to zero, so the
// condition z <= 255 is always true.


18 Write an AVR C program to get a byte of data from Port B, and then send it to Port C.
Solution:
#include <avr/io.h>      // standard AVR header
int main(void)
{
    unsigned char temp;
    DDRB = 0x00;         // Port B is input
    DDRC = 0xFF;         // Port C is output
    while(1)
    {
        temp = PINB;
        PORTC = temp;
    }
    return 0;
}


20 Write an AVR C program to toggle only bit 4 of Port B continuously without disturbing the rest of the pins of Port B.
Solution:
#include <avr/io.h>      // standard AVR header
int main(void)
{
    DDRB = 0xFF;         // Port B is output
    while(1)
    {
        PORTB = PORTB | 0b00010000;  // set bit 4 (5th bit) of PORTB
        PORTB = PORTB & 0b11101111;  // clear bit 4 (5th bit) of PORTB
    }
    return 0;
}

21 Write an AVR C program to monitor bit 5 of port C
Write an AVR C program to monitor bit 5 of port C. If it is HIGH, send 55H to Port B; otherwise, send AAH to Port B.
Solution:
#include <avr/io.h>      // standard AVR header
int main(void)
{
    DDRB = 0xFF;         // Port B is output
    DDRC = 0x00;         // Port C is input
    while(1)
    {
        if (PINC & 0b00100000)   // check bit 5 (6th bit) of PINC
            PORTB = 0x55;
        else
            PORTB = 0xAA;
    }
    return 0;
}


23 Find the contents of PORTC after execution of the following code:
PORTC = PORTC | 0x99;
PORTC = ~PORTC;
Solution: 66H (with PORTC initially 0x00, the OR gives 0x99 and the complement gives 0x66)

PORTC = ~(0<<3);
Solution: FFH (0<<3 is 0x00, and ~0x00 = 0xFF)

24 TIMER IN AVR Normal mode
In this mode, the content of the timer/counter increments with each clock. It counts up until it reaches its max of 0xFF. When it rolls over from 0xFF to 0x00, it sets high a flag bit called TOV0 (Timer Overflow). This timer flag can be monitored.

25 CTC Mode The OCR0 register is used with CTC mode. As with the Normal mode, in the CTC mode, the timer is incremented with a clock. But it counts up until the content of the TCNT0 register becomes equal to the content of OCR0 (compare match occurs); then, the timer will be cleared and the OCF0 flag will be set when the next clock occurs. The OCF0 flag is located in the TIFR register.

26 EXAMPLES:
1) In Normal mode, when the counter rolls over it goes from ____ to ____.
2) In CTC mode, the counter rolls over when the counter reaches ____.
3) To get a 5-ms delay, what numbers should be loaded into TCNT1H and TCNT1L using Normal mode and the TOV1 flag? Assume that XTAL = 8 MHz.
4) To get a 20-μs delay, what number should be loaded into the TCNT0 register using Normal mode and the TOV0 flag? Assume that XTAL = 1 MHz.
Solutions:
1) Max ($FFFF for 16-bit timers and $FF for 8-bit timers), 0000
2) OCR1A
3) 5 ms at 8 MHz is 5,000 × 8 = 40,000 ticks. $10000 − 40,000 = 65,536 − 40,000 = 25,536 = $63C0, so TCNT1H = 0x63 and TCNT1L = 0xC0
4) XTAL = 1 MHz ⇒ T(machine cycle) = 1/1 MHz = 1 μs ⇒ 20 μs / 1 μs = 20 ticks ⇒ $100 − 20 = 256 − 20 = 236 = 0xEC
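The reload-value arithmetic in examples 3) and 4) can be sketched in host-side C (the function names are ours; we assume the timer clock is given directly in ticks per microsecond, i.e. no prescaler):

```c
#include <assert.h>
#include <stdint.h>

/* Reload value for a 16-bit timer in Normal mode: the timer counts up
   from this value and overflows after (65536 - value) ticks. */
static uint16_t timer16_reload(uint32_t delay_us, uint32_t ticks_per_us) {
    return (uint16_t)(65536UL - delay_us * ticks_per_us);
}

/* Same idea for an 8-bit timer: overflow after (256 - value) ticks. */
static uint8_t timer8_reload(uint32_t delay_us, uint32_t ticks_per_us) {
    return (uint8_t)(256UL - delay_us * ticks_per_us);
}
```

For 5 ms at 8 ticks/μs this gives 0x63C0 (high byte 0x63, low byte 0xC0), and for 20 μs at 1 tick/μs it gives 0xEC, matching the worked answers above.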


Serial data communication uses two methods, asynchronous and synchronous. The synchronous method transfers a block of data (characters) at a time, whereas the asynchronous method transfers a single byte at a time.

29 Asynchronous Serial Communication
In the asynchronous method, each data character is placed between start and stop bits. This is called framing. The start bit is always one bit and is a 0 (low); the stop bit(s) are 1 (high) and can be one or two bits. When there is no transfer, the signal is 1 (high), which is referred to as mark or idle. The data bit D0 (LSB) goes out first, followed by the remaining bits up to the MSB (D7). Example: the ASCII character “A” (8-bit binary 0100 0001) is framed between the start bit and a single stop bit.

30 Example:
a) Find the overhead due to framing when transmitting the ASCII letter “A” (0100 0001).
b) Calculate the time it takes to transfer 10,000 characters as in question a) if we use 9600 bps. What percentage of time is wasted due to overhead?
Solutions:
a) 2 bits (one for the start bit and one for the stop bit). Therefore, for each 8-bit character, a total of 10 bits is transferred.
b) 10,000 × 10 = 100,000 total bits transmitted. 100,000 / 9600 ≈ 10.4 seconds; 2 / 10 = 20% of the time is overhead.
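The framing arithmetic can be captured in a small host-side C sketch (the function names are ours):

```c
#include <assert.h>

/* Bits on the wire per framed character: 1 start bit + data bits + stop bits. */
static int framed_bits(int data_bits, int stop_bits) {
    return 1 + data_bits + stop_bits;
}

/* Seconds needed to send n framed characters at a given baud rate. */
static double transfer_seconds(int n_chars, int data_bits, int stop_bits,
                               int baud) {
    return (double)n_chars * framed_bits(data_bits, stop_bits) / baud;
}
```

With 8 data bits and 1 stop bit, each character costs 10 bits on the wire, so 10,000 characters at 9600 bps take about 10.4 seconds.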

31 Baud Rate in the AVR
In the AVR microcontroller, five registers are associated with the USART: UDR (USART Data Register), UCSRA, UCSRB, and UCSRC (USART Control and Status Registers A, B, and C), and UBRR (USART Baud Rate Register).
Desired Baud Rate = Fosc/(16(X + 1)), where X is the value we load into the UBRR register. To get the X value for different baud rates, we can solve the equation as follows:
X = (Fosc/(16 × Desired Baud Rate)) – 1
Assuming that Fosc = 8 MHz, we have the following:
Desired Baud Rate = Fosc/(16(X + 1)) = 8 MHz/(16(X + 1)) = 500 kHz/(X + 1)
X = (500 kHz/Desired Baud Rate) – 1

32 Examples: 1) Find Baud Rate if UBRR = 67H = 103 Solution:
Desired Baud Rate = Fosc/(16(X + 1)) = 8 MHz/(16(103 + 1)) = 4807 bps
2) Find the UBRR value needed to have the following baud rates: (a) 9600 (b) 1200, for Fosc = 8 MHz.
Fosc = 8 MHz ⇒ X = (8 MHz/(16 × Desired Baud Rate)) – 1 = (500 kHz/Desired Baud Rate) – 1
(a) (500 kHz/9600) – 1 = 52.08 – 1 = 51.08 ≈ 51 = 33 (hex) is loaded into UBRR
(b) (500 kHz/1200) – 1 = 416.67 – 1 = 415.67 ≈ 415 = 19F (hex) is loaded into UBRR
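The UBRR calculation for U2X = 0 can be sketched in host-side C (the function name is ours; like the worked examples above, the sketch truncates the fractional part, since the UBRR register only holds integers):

```c
#include <assert.h>

/* UBRR value for a desired baud rate with U2X = 0:
   X = Fosc/(16 * baud) - 1, truncated to an integer. */
static unsigned ubrr_value(unsigned long fosc, unsigned long baud) {
    return (unsigned)(fosc / (16UL * baud) - 1);
}
```

For Fosc = 8 MHz this gives 51 (0x33) at 9600 baud and 415 (0x19F) at 1200 baud, matching the answers above.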

33 Doubling the baud rate in the AVR
Baud Rate Generation block diagram (figure)
There are two ways to increase the baud rate of data transfer in the AVR:
1. Use a higher-frequency crystal (not feasible in many cases).
2. Change a bit in the UCSRA register (U2X = 1).
Desired Baud Rate = Fosc/(8(X + 1)) when U2X = 1

34 Baud Rate Error Calculation
In calculating the baud rate, we have used integer values for the UBRR register because AVR microcontrollers can only use integer values. By dropping the decimal portion of the calculated value, we run the risk of introducing error into the baud rate. One way to calculate this error:
Error = (Calculated value for the UBRR – Integer part) / Integer part
For example, with XTAL = 8 MHz and U2X = 0, we have the following for the 9600 baud rate:
UBRR value = (500,000/9600) – 1 = 52.08 – 1 = 51.08 ≈ 51 ⇒ Error = (51.08 – 51)/51 = 0.16%

35 Examples: Given: XTAL = 7.3728 MHz.
a) What value should be loaded into UBRR to have a 9600 baud rate for U2X = 0, 1? Give the answers in both decimal and hex.
b) What are the baud rate errors in a)?
Solutions:
a) U2X = 0: (Fosc/(16 × baud rate)) – 1 = (7,372,800/(16 × 9600)) – 1 = 48 – 1 = 47, or 2FH
U2X = 1: (Fosc/(8 × baud rate)) – 1 = (7,372,800/(8 × 9600)) – 1 = 96 – 1 = 95, or 5FH
b) 0% in both cases, since the divisions are exact.

36 Memory and I/O Systems Computer system performance depends on the memory system as well as the processor microarchitecture. Early processors were relatively slow, so memory was able to keep up. But processor speed has increased at a faster rate than memory speeds. DRAM memories are currently 10 to 100 times slower than processors. The increasing gap between processor and DRAM memory speeds demands increasingly ingenious memory systems to try to approximate a memory that is as fast as the processor.

37 Diverging processor and memory performance
Adapted with permission from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2012.

38 Cache Hit and Cache Miss
Cache Memory To counteract this trend, computers store the most commonly used instructions and data in a faster but smaller memory, called a cache. The cache is usually built out of SRAM on the same chip as the processor. The cache speed is comparable to the processor speed, because SRAM is inherently faster than DRAM, and because the on-chip memory eliminates lengthy delays caused by traveling to and from a separate chip. Cache Hit and Cache Miss If the processor requests data that is available in the cache, it is returned quickly. This is called a cache hit. Otherwise, the processor retrieves the data from main memory (DRAM). This is called a cache miss. If the cache hits most of the time, then the processor seldom has to wait for the slow main memory, and the average access time is low.

39 Memory Hierarchy Computer System
The processor first seeks data in a small but fast cache that is usually located on the same chip. If the data is not available in the cache, the processor then looks in main memory. If the data is not there either, the processor fetches the data from virtual memory on the large but slow hard disk.

40 Memory Hierarchy Components with typical characteristics in 2012

41 MEMORY SYSTEM PERFORMANCE ANALYSIS
Memory system performance metrics are miss rate or hit rate and average memory access time.
Miss and hit rate calculation:
Miss Rate = (number of misses)/(total number of memory accesses); Hit Rate = (number of hits)/(total number of memory accesses) = 1 − Miss Rate
Average memory access time (AMAT) is the average time a processor must wait for memory per load or store instruction.
AMAT calculation (for a one-level cache backed by main memory):
AMAT = t_cache + MR_cache × t_MM
Note: In the typical computer system, the processor first looks for the data in the cache. If the cache misses, the processor then looks in main memory. If the main memory misses, the processor accesses virtual memory on the hard disk.

1) Suppose a program has 2000 data access instructions (loads or stores), and 1250 of these requested data values are found in the cache. The other 750 data values are supplied to the processor by main memory or disk memory. What are the miss and hit rates for the cache?
Solution: The miss rate is 750/2000 = 0.375 = 37.5%. The hit rate is 1250/2000 = 0.625 = 1 − 0.375 = 62.5%.
CALCULATING AVERAGE MEMORY ACCESS TIME
2) Suppose a computer system has a memory organization with only two levels of hierarchy, a cache and main memory. What is the average memory access time, given the access times and miss rates below?
Memory Level | Access Time (Cycles) | Miss Rate
Cache | 1 | 10%
Main Memory | 100 | 0%
The average memory access time is AMAT = 1 + 0.1(100) = 11 cycles.
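The one-level AMAT formula used above can be written as a small C helper (the name is ours):

```c
#include <assert.h>

/* One-level AMAT: time to check the cache, plus the main-memory access
   time weighted by the cache miss rate. */
static double amat_one_level(double t_cache, double miss_rate, double t_mm) {
    return t_cache + miss_rate * t_mm;
}
```

With a 1-cycle cache, a 10% miss rate, and a 100-cycle main memory, this evaluates to 11 cycles, as in the example.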

43 Data Held in the Cache In particular, the cache exploits temporal and spatial locality to achieve a low miss rate. Temporal locality means that the processor is likely to access a piece of data again soon if it has accessed that data recently. Therefore, when the processor loads or stores data that is not in the cache, the data is copied from main memory into the cache. Subsequent requests for that data hit in the cache. Spatial locality means that, when the processor accesses a piece of data, it is also likely to access data in nearby memory locations. Therefore, when the cache fetches one word from memory, it may also fetch several adjacent words. This group of words is called a cache block or cache line. The number of words in the cache block, b, is called the block size. A cache of capacity C contains B = C/b blocks.

44 The principles of temporal and spatial locality have been experimentally verified in real programs.
If a variable is used in a program, the same variable is likely to be used again, creating temporal locality. If an element in an array is used, other elements in the same array are also likely to be used, creating spatial locality.

45 Multiple-Level Caches
Advanced Cache Design
Modern systems use multiple levels of caches to decrease memory access time, which improves system performance. Large caches are beneficial because they are more likely to hold data of interest and therefore have lower miss rates. However, large caches tend to be slower than small ones. Modern systems often use at least two levels of caches. The first-level (L1) cache is small enough to provide a one- or two-cycle access time. The second-level (L2) cache is also built from SRAM but is larger, and therefore slower, than the L1 cache. The processor first looks for the data in the L1 cache. If the L1 cache misses, the processor looks in the L2 cache. If the L2 cache misses, the processor fetches the data from main memory. Many modern systems add even more levels of cache to the memory hierarchy, because accessing main memory is so slow.

46 Memory Hierarchy with Two Levels of Cache

47 SYSTEM WITH AN L2 CACHE
Given a system using two levels of cache, what is the average memory access time (AMAT) for the access times and miss rates below?
Memory Level | Access Time (Cycles) | Miss Rate
Cache L1 | 1 | 5%
Cache L2 | 10 | 20%
Main Memory | 100 | 0%
Solution: Each memory access checks the L1 cache. When the L1 cache misses (5% of the time), the processor checks the L2 cache. When the L2 cache misses (20% of the time), the processor fetches the data from main memory.
AMAT = t_cache + MR_cache(t_L2cache + MR_L2cache × t_MM)
AMAT = 1 cycle + 0.05[10 cycles + 0.2(100 cycles)] = 2.5 cycles
The L2 miss rate is high because it receives only the “hard” memory accesses, those that miss in the L1 cache. If all accesses went directly to the L2 cache, the L2 miss rate would be about 1%.
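The two-level formula can likewise be sketched in C (the name is ours):

```c
#include <assert.h>

/* Two-level AMAT: L1 access time, plus the L1-miss fraction of the time
   spent in L2 (and, on an L2 miss, in main memory). */
static double amat_two_level(double t_l1, double mr_l1,
                             double t_l2, double mr_l2, double t_mm) {
    return t_l1 + mr_l1 * (t_l2 + mr_l2 * t_mm);
}
```

Plugging in the table values (1 cycle/5% for L1, 10 cycles/20% for L2, 100 cycles for main memory) reproduces the 2.5-cycle answer.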

48 Endianness
Big-endian: the most significant byte of the word is stored at the smallest address given, and the least significant byte at the largest. Little-endian: the least significant byte is stored at the smallest address. Today, big-endian byte order is generally used in computer networks, and little-endian in microprocessors. Example: Intel processors use the little-endian system, while computer networks (and IBM mainframes) use the big-endian system.
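A small host-side C sketch makes the two byte orders concrete (the function names are ours; offset 0 is the smallest address):

```c
#include <assert.h>
#include <stdint.h>

/* Byte stored at address offset i (0..3) of a 32-bit word, big-endian:
   the most significant byte sits at the smallest address. */
static uint8_t big_endian_byte(uint32_t w, int i) {
    return (uint8_t)(w >> (8 * (3 - i)));
}

/* Little-endian: the least significant byte sits at the smallest address. */
static uint8_t little_endian_byte(uint32_t w, int i) {
    return (uint8_t)(w >> (8 * i));
}
```

For the word 0x12345678, big-endian storage places 0x12 first, while little-endian storage places 0x78 first.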

49 There are several possible methods for determining where memory blocks are placed in the cache. Data is usually stored in the cache under one of three schemes: direct mapped, associative, or set associative. The corresponding address field breakdowns are:
Direct mapped: Tag | Block/Line | Offset
Fully associative: Tag | Offset
Set associative: Tag | Set | Offset

50 Cache and Memory Working Together (http://www.cs.nmsu)
Let's try to put together some examples of simultaneous TLB and L1 cache lookups. For example, let's look at the simplest case: we'll make both the TLB and the L1 cache direct-mapped. Let's assume the following specifications:
Virtual memory: 32-bit address width, 1 K byte page size, single-level page table
Physical memory: 32-bit physical address space
Cache: 16-byte block size, 1 K byte cache size, direct mapped
Translation lookaside buffer: 64 translations

51 Cache and Memory Working Together (http://www.cs.nmsu)
The block size is 16 bytes, so the byte offset field is 4 bits. The total size of the cache is 1K, so there are 1K/16 = 64 blocks. Since it's direct mapped, we've got a six-bit index field. We've used up 10 bits; since the physical address is 32 bits, that tells us we've got a 22-bit tag. So the physical address looks like the following:
Tag (31-10) | Cache Index (9-4) | Byte Offset (3-0)
Virtual memory: The page size is 1K, so the byte offset field is 10 bits. That leaves us a 22-bit virtual page number:
Virtual Page Number (31-10) | Byte Offset (9-0)
TLB: We get the field breakdown for the TLB by further dividing the VPN. Since we've got 64 translations and a direct-mapped organization, the 22-bit VPN gets divided into:
TLB Tag (31-16) | TLB Index (15-10)

52 Cache and Memory Working Together (http://www.cs.nmsu)
TLB Tag (31-16) | TLB Index (15-10) | Cache Index (9-4) | Byte Offset (3-0)
Let's put specific numbers on this: we'll try to read one byte from virtual address 0x1234abcd. The byte offset field contains d (bits 3-0 of the address). The cache index field contains 3c (bits 9-4 of the address). The TLB index field contains 2a (bits 15-10 of the address). The TLB tag field contains 1234 (bits 31-16 of the address). So now we go through the following steps:
1. We look up translation 2a in the TLB and cache line 3c in the cache.
2. We obtain the TLB tag from the TLB and the cache tag from the cache.
3. We ask whether:
   - The TLB entry is valid.
   - The TLB tag is 1234 (that's the TLB tag from our virtual address).
   - We have permission to perform the requested access.
   - The cache entry is valid.
   - The cache tag from the cache entry matches the cache tag from the TLB entry.
If the answer to all of the questions in step 3 was "yes", we've got both a valid translation and a cache hit. We can either obtain our data from the cache or write our value to the cache.
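The field extraction for this walkthrough can be written as host-side C helpers (the names are ours; the bit ranges are the ones derived above):

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction for the direct-mapped cache/TLB example:
   byte offset = bits 3-0, cache index = bits 9-4,
   TLB index = bits 15-10, TLB tag = bits 31-16. */
static uint32_t byte_offset(uint32_t va) { return va & 0xF; }
static uint32_t cache_index(uint32_t va) { return (va >> 4) & 0x3F; }
static uint32_t tlb_index(uint32_t va)   { return (va >> 10) & 0x3F; }
static uint32_t tlb_tag(uint32_t va)     { return va >> 16; }
```

Applied to virtual address 0x1234abcd, these yield d, 3c, 2a, and 1234, matching the values traced through the example.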

53 Transfers Between Cache and Memory (http://www.cs.nmsu)
We have two competing requirements: we'd like to bring an entire cache line in from memory in one transfer (for bandwidth), but we want to have as few data lines as possible (for cost). There are three feasible solutions here:
1. The fastest (but most expensive) approach is to use a memory bus that's as wide as a cache line. Then, any time you have a miss, you can do a single memory transfer.
2. The cheapest (but slowest) approach is to use a memory bus that's narrower than a cache line; then, on a miss, it takes several memory transfers to bring the whole line in.
3. The third approach is a compromise between the first two: use the narrower bus from the second approach, but find a way to overlap the memory accesses. The traditional way to implement this was to have several distinct memory modules: you'd start a read from each of them in turn, and the data would arrive from them on consecutive cycles. The current solution is to use fast page DRAM or synchronous DRAM. With both of these technologies, data is transferred from the internal DRAM cells (comparatively slow) into some substantially faster static memory on the memory chip, and then moved out of that static memory much more quickly than it could be from the DRAM array. PC100 and PC133 SDRAM use four transfers of 64 bits each to fill a cache line on a system with a 32-byte cache line.
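The transfers-per-line arithmetic in the narrow-bus case can be checked with a tiny C helper (the name is ours; it assumes the line size is a multiple of the bus width):

```c
#include <assert.h>

/* Number of bus transfers needed to fill one cache line:
   line size in bytes divided by bus width in bytes. */
static int transfers_per_line(int line_bytes, int bus_bits) {
    return line_bytes / (bus_bits / 8);
}
```

A 32-byte line over a 64-bit bus takes four transfers, consistent with the PC100/PC133 SDRAM figure above.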

54 References:
The AVR Microcontroller and Embedded Systems: Using Assembly and C — Muhammad Ali Mazidi, Sarmad Naimi, Sepehr Naimi
Digital Design and Computer Architecture, 2nd ed. — David Harris, Sarah Harris
Computer Organization and Embedded Systems, 6th ed. — Carl Hamacher, Zvonko Vranesic, Safwat Zaky, Naraig Manjikian
