EC6703 EMBEDDED AND REAL TIME SYSTEMS
UNIT I INTRODUCTION TO EMBEDDED COMPUTING AND ARM PROCESSORS Complex systems and micro processors Embedded system design process Design example: Model train controller Instruction sets preliminaries ARM Processor CPU: programming input and output supervisor mode, exceptions and traps Co-processors Memory system mechanisms CPU performance CPU power consumption.
What is an Embedded System (ES)? "Embedded" reflects the fact that these systems are an integral part of a larger system. “It is a computer system that is built to control one or a few dedicated functions, and is not designed to be programmed by the end user in the same way that a desktop computer is.” It contains processing cores that are either microcontrollers or DSPs. “An embedded system is some combination of computer hardware and software, either fixed in capability or programmable, that is specifically designed for a particular function.”
What is an Embedded System (ES)? Contd… “Computing systems performing specific tasks within a framework of real-world constraints” “An Embedded System is a microprocessor based system that is embedded as a subsystem, in a larger system (which may or may not be a computer system).” “An ES is designed to run on its own without human intervention, and may be required to respond to events in real time”
Application of ES Automotive: ECS, ABS; Aircraft Network Appliances: Routers, Modems Cell Phones, PDA, Mouse, E-Star Power Printers, Hand Mixers, Toasters!
What is a real time system? A real-time system is one that must process information and produce a response within a specified time, else risk severe consequences, including failure.
Modern Embedded Systems Application Specific Gates Processor Cores Analog I/O Memory DSP Code Embedded systems employ a combination of application-specific h/w (boards, ASICs, FPGAs etc.) performance, low power s/w on prog. processors: DSPs, controllers etc. flexibility, complexity mechanical transducers and actuators
INTRODUCTION TO EMBEDDED COMPUTING AND ARM PROCESSORS We first need to understand how and why microprocessors are used for control, user interface, signal processing, and many other tasks. The microprocessor has become so common that it is easy to forget how hard some things are to do without it.
COMPLEX SYSTEMS AND MICROPROCESSORS Embedded Computer System: “It is any device that includes a programmable computer but is not itself intended to be a general-purpose computer”. Thus a PC is not itself an embedded computing system, although PCs are often used to build embedded computing systems. But a fax machine or a clock built from a microprocessor is an embedded computing system. EX: Automobiles, cell phones, and even household appliances
COMPLEX SYSTEMS AND MICROPROCESSORS contd… Designers in many fields must be able to identify where microprocessors can be used, design a hardware platform with I/O devices that can support the required tasks, and implement software that performs the required processing. Embedding Computers CPU mem input output analog embedded computer Characteristics of Embedded Computing Applications: Complex algorithms User interface (Sophisticated functionality) Real time Multirate Manufacturing cost Power and energy Finally, most embedded computing systems are designed by small teams on tight deadlines.
COMPLEX SYSTEMS AND MICROPROCESSORS contd… Why Use Microprocessors? ■ Microprocessors are a very efficient way to implement digital systems. ■ Microprocessors make it easier to design families of products that can be built to provide various feature sets at different price points and can be extended to provide new features to keep up with rapidly changing markets. ■ Microprocessors execute programs very efficiently. Modern RISC processors can execute one instruction per clock cycle most of the time, and high performance processors can execute several instructions per cycle (MIPS). ■ Microprocessor manufacturers spend a great deal of money to make their CPUs run very fast. ■ Microprocessors generally dominate new fabrication lines because they can be manufactured in large volume and are guaranteed to command high prices. ■ Microprocessors are very efficient utilizers of logic. The generality of a microprocessor and the need for a separate memory may suggest that microprocessor-based designs are inherently much larger than custom logic designs
COMPLEX SYSTEMS AND MICROPROCESSORS contd… Why not use PCs for all embedded computing? PCs are widely used and provide a very flexible programming environment. Components of PCs are, in fact, used in many embedded computing systems. 1. Real-time performance requirements often drive us to different architectures. 2. Low power and low cost also drive us away from PC architectures and toward multiprocessors. Personal computers are designed to satisfy a broad mix of computing requirements and to be very flexible. Those features increase the complexity and price of the components. They also cause the processor and other components to use more energy to perform a given function.
COMPLEX SYSTEMS AND MICROPROCESSORS contd… Challenges in Embedded Computing System Design How much hardware do we need? How do we meet deadlines? How do we minimize power consumption? How do we design for upgradability? Does it really work? (Reliability) ■ Complex testing ■ Limited observability and controllability ■ Restricted development environments
COMPLEX SYSTEMS AND MICROPROCESSORS contd… Performance in Embedded Computing In order to understand the real-time behavior of an embedded computing system, we have to analyze the system at several different levels of abstraction Those layers include: ■ CPU: The CPU clearly influences the behavior of the program, particularly when the CPU is a pipelined processor with a cache. ■ Platform: The platform includes the bus and I/O devices. The platform components that surround the CPU are responsible for feeding the CPU and can dramatically affect its performance. ■ Program: Programs are very large and the CPU sees only a small window of the program at a time. We must consider the structure of the entire program to determine its overall behavior.
Contd…. ■ Task: We generally run several programs simultaneously on a CPU, creating a multitasking system. The tasks interact with each other in ways that have profound implications for performance. ■ Multiprocessor: Many embedded systems have more than one processor— they may include multiple programmable CPUs as well as accelerators. Once again, the interaction between these processors adds yet more complexity to the analysis of overall system performance.
THE EMBEDDED SYSTEM DESIGN PROCESS
THE EMBEDDED SYSTEM DESIGN PROCESS This overview of the embedded system design process is aimed at two objectives: 1. It will give us an introduction to the various steps in embedded system design before we delve into them in more detail. 2. It will allow us to consider the design methodology itself. A design methodology is important for three reasons. First, it allows us to keep a scorecard on a design to ensure that we have done everything we need to do, such as optimizing performance or performing functional tests. Second, it allows us to develop computer-aided design (CAD) tools. Developing a single program that takes in a concept for an embedded system and emits a completed design would be a daunting task, but by first breaking the process into manageable steps, we can work on automating (or at least semi-automating) the steps one at a time. Third, a design methodology makes it much easier for members of a design team to communicate.
THE EMBEDDED SYSTEM DESIGN PROCESS Contd..
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. The major goals of the design: ■ manufacturing cost ■ performance (both overall speed and deadlines); and ■ power consumption. We must also consider the tasks we need to perform at every step in the design process. At each step in the design, we add detail: ■ We must analyze the design at each step to determine how we can meet the specifications. ■ We must then refine the design to add detail. ■ And we must verify the design to ensure that it still meets all system goals, such as cost, speed, and so on.
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. 1. Requirements -Creating the architecture and components. -First, we gather an informal description from the customers known as requirements, and we refine the requirements into a specification that contains enough information to begin designing the system architecture. Requirements may be functional or nonfunctional Typical nonfunctional requirements include: ■ Performance ■ Cost (manufacturing cost ; Nonrecurring engineering(NRE) costs) ■ Physical size and weight ■ Power consumption
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. Sample requirements form.
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. EX: GPS Moving Map What requirements might we have for our GPS moving map?
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. Here is an initial list: ■ Functionality: This system is designed for highway driving and similar uses, not nautical or aviation uses that require more specialized databases and functions. The system should show major roads and other landmarks available in standard topographic databases. ■ User interface: The screen should have at least 400 × 600 pixel resolution. The device should be controlled by no more than three buttons. A menu system should pop up on the screen when buttons are pressed to allow the user to make selections to control the system. ■ Performance: The map should scroll smoothly. Upon power-up, a display should take no more than one second to appear, and the system should be able to verify its position and display the current map within 15 s. ■ Cost: The selling cost (street price) of the unit should be no more than $100. ■ Physical size and weight: The device should fit comfortably in the palm of the hand. ■ Power consumption: The device should run for at least eight hours on four AA batteries.
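Worked example: the power consumption requirement can be turned into a rough average-current budget. The sketch below assumes a hypothetical 2000 mAh usable capacity per AA cell and four cells in series (so the pack capacity equals the single-cell capacity); neither figure comes from the requirements, and the function name is illustrative.

```python
# Rough average-current budget for the GPS moving map power requirement.
# ASSUMPTION: 2000 mAh per AA cell, cells wired in series, so the pack
# capacity equals the single-cell capacity. Both values are illustrative.
CELL_CAPACITY_MAH = 2000.0
REQUIRED_HOURS = 8.0  # from the requirement: at least eight hours


def max_average_current_ma(capacity_mah: float, hours: float) -> float:
    """Largest average current draw that still meets the runtime target."""
    return capacity_mah / hours


budget = max_average_current_ma(CELL_CAPACITY_MAH, REQUIRED_HOURS)
print(f"Average current budget: {budget:.0f} mA")
```

Under these assumptions the whole device (CPU, display, GPS receiver) has to average no more than the printed budget, which is the kind of number the later architecture step has to respect.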
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. Finally the Requirement form,
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. 2.Specification Specification serves as the contract between the customer and the architects. Specification is probably the least familiar phase of this methodology for neophyte designers, but it is essential to creating working systems with a minimum of designer effort. The specification should be understandable enough so that someone can verify that it meets system requirements and overall expectations of the customer.
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. A specification of the GPS system would include several components: ■ Data received from the GPS satellite constellation. ■ Map data. ■ User interface. ■ Operations that must be performed to satisfy customer requests. ■ Background actions required to keep the system running, such as operating the GPS receiver.
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. 3. Architecture Design The specification does not say how the system does things, only what the system does. Describing how the system implements those functions is the purpose of the architecture. The architecture is a plan for the overall structure of the system that will be used later to design the components that make up the architecture. The creation of the architecture is the first phase of what many designers think of as design.
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. 3. Architecture Design Figure below shows a sample system architecture in the form of a block diagram that shows major operations and data flows among them. Architectural descriptions must be designed to satisfy both functional and nonfunctional requirements. Not only must all the required functions be present, but we must meet cost, speed, power,and other nonfunctional constraints. Starting out with a system architecture and refining that to hardware and software architectures,
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. 3. Architecture Design Hardware Software
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. 4. Designing Hardware and Software Components The components will in general include both hardware—FPGAs, boards, and so on and software modules. Some of the components will be ready-made. You will have to design some components yourself.
THE EMBEDDED SYSTEM DESIGN PROCESS Contd.. 5. System Integration Bugs are typically found during system integration, and good planning can help us find the bugs quickly. By building up the system in phases and running properly chosen tests, we can often find bugs more easily. If we debug only a few modules at a time, we are more likely to uncover the simple bugs and to recognize them easily. System integration is difficult because it usually uncovers problems. It is often hard to observe the system in sufficient detail to determine exactly what is wrong; the debugging facilities for embedded systems are usually much more limited than what you would find on desktop systems. As a result, determining why things do not work correctly and how they can be fixed is a challenge in itself. Careful attention to inserting appropriate debugging facilities during design can help ease system integration problems, but the nature of embedded computing means that this phase will always be a challenge.
FORMALISMS FOR SYSTEM DESIGN (STRUCTURAL DESCRIPTION & BEHAVIORAL DESCRIPTION) It is often helpful to conceptualize these tasks (Top-Down Design process) in diagrams. Luckily, there is a visual language that can be used to capture all these design tasks: the Unified Modeling Language (UML). UML is an object-oriented modeling language. UML was designed to be useful at many levels of abstraction in the design process. UML is useful because it encourages design by successive refinement and progressively adding detail to the design, rather than rethinking the design at each new level of abstraction. Object-Oriented design emphasizes two concepts of importance: ■ It encourages the design to be described as a number of interacting objects, rather than a few large monolithic blocks of code. ■ At least some of those objects will correspond to real pieces of software or hardware in the system. We can also use UML to model the outside world that interacts with our system, in which case the objects may correspond to people or other machines. It is sometimes important to implement something we think of at a high level as a single object using several distinct pieces of code or to otherwise break up the object correspondence in the implementation.
The principal component of an object-oriented design is, naturally enough, the object. An object includes a set of attributes that define its internal state. When implemented in a programming language, these attributes usually become variables or constants held in a data structure. An object describing a display (such as a CRT screen) is shown in UML notation in Figure The text in the folded-corner page icon is a note; it does not correspond to an object in the system and only serves as a comment. The name is underlined to show that this is a description of an object and not of a class. A class is a form of type definition—all objects derived from the same class have the same characteristics, although their attributes may have different values. A class defines the attributes that an object may have. It also defines the operations that determine how the object interacts with the rest of the world.
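Example: the class/object distinction can be sketched in code. The Display class below is a hypothetical stand-in for the display in the figure; its attribute and operation names are assumptions chosen for illustration, not taken from the UML diagram.

```python
class Display:
    """Class: a type definition shared by every display object."""

    def __init__(self, width: int, height: int):
        # Attributes: the internal state of one object.
        self.width = width
        self.height = height
        self.pixels = [[0] * width for _ in range(height)]

    def draw_pixel(self, x: int, y: int, value: int) -> None:
        """Operation: how the rest of the system interacts with the object."""
        self.pixels[y][x] = value


# Two objects derived from the same class: same characteristics,
# different attribute values.
d1 = Display(width=4, height=2)
d2 = Display(width=8, height=8)
d1.draw_pixel(1, 0, 255)
```

Here d1 and d2 play the role of the underlined object names in UML, while Display plays the role of the class box.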
There are several types of relationships that can exist between objects and classes: ■ Association occurs between objects that communicate with each other but have no ownership relationship between them. ■ Aggregation describes a complex object made of smaller objects. ■ Composition is a type of aggregation in which the owner does not allow access to the component objects. ■ Generalization allows us to define one class in terms of another.
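Example: the four relationships above map loosely onto familiar code patterns. This is a minimal sketch with hypothetical class names; UML itself does not prescribe any particular implementation.

```python
class Motor:
    def __init__(self):
        self.speed = 0


class FastMotor(Motor):
    """Generalization: FastMotor is defined in terms of Motor."""


class Train:
    """Aggregation: a complex object made of smaller objects.
    The motor is supplied from outside and stays reachable there."""

    def __init__(self, motor: Motor):
        self.motor = motor


class Console:
    """Association: the console communicates with a train but does not own it."""

    def send_speed(self, train: Train, speed: int) -> None:
        train.motor.speed = speed


class _Decoder:
    def decode(self, bits: str) -> int:
        return int(bits, 2)


class Receiver:
    """Composition: the owner builds its component itself and does not
    expose it (the leading underscore marks it as private by convention)."""

    def __init__(self):
        self._decoder = _Decoder()

    def receive(self, bits: str) -> int:
        return self._decoder.decode(bits)
```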
So far we have seen the STRUCTURAL DESCRIPTION in UML. Next is the BEHAVIORAL DESCRIPTION. One way to specify the behavior of an operation is a state machine. These state machines will not rely on the operation of a clock, as in hardware; rather, changes from one state to another are triggered by the occurrence of events. 3 types of events defined by UML: 1. A signal is an asynchronous occurrence. It is defined in UML by an object that is labeled as a <<signal>>. 2. A call event follows the model of a procedure call in a programming language. 3. A time-out event causes the machine to leave a state after a certain amount of time. The label tm(time-value) on the edge gives the amount of time after which the transition occurs.
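Example: an event-driven state machine with one transition per event type. The states, the "start" signal, and the tm(5) time-out value are all hypothetical; time is passed in explicitly so the sketch does not depend on a real clock.

```python
class Machine:
    """State machine whose transitions fire on events, not on a clock edge."""

    TIMEOUT = 5  # hypothetical tm(5) label on the waiting -> idle edge

    def __init__(self):
        self.state = "idle"
        self.entered_at = 0

    def _go(self, state: str, now: int) -> None:
        self.state = state
        self.entered_at = now

    def signal(self, name: str, now: int) -> None:
        # Signal: an asynchronous occurrence delivered to the machine.
        if self.state == "idle" and name == "start":
            self._go("waiting", now)

    def call_done(self, now: int) -> None:
        # Call event: modeled as an ordinary procedure (method) call.
        if self.state == "waiting":
            self._go("done", now)

    def tick(self, now: int) -> None:
        # Time-out event: leave the state after TIMEOUT time units in it.
        if self.state == "waiting" and now - self.entered_at >= self.TIMEOUT:
            self._go("idle", now)
```

Usage: a machine that receives signal("start", 0) and no call event falls back to idle once tick() is called with a time of 5 or later.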
Design example: Model train controller Learning of UML through Model Train Controller
Model train setup rcvr motor power supply console ECC address header command
Console
Model train controller - The user sends messages to the train with a control box attached to the tracks. The control box may have familiar controls such as a throttle, emergency stop button, and so on. Since the train receives its electrical power from the two rails of the track, the control box can send signals to the train over the tracks by modulating the power supply voltage. The control panel sends packets over the tracks to the receiver on the train. The train includes analog electronics to sense the bits being transmitted and a control system to set the train motor’s speed and direction based on those commands. Each packet includes an address so that the console can control several trains on the same track; the packet also includes an error correction code (ECC) to guard against transmission errors. This is a one-way communication system—the model train cannot send commands back to the user. For design, we start with requirement first
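Example: a packet with an address, a command, and a check field can be sketched as below. The byte layout, the command code, and the use of a single XOR check byte are assumptions made for illustration; an XOR byte only detects errors and stands in here for whatever code the real system uses.

```python
SET_SPEED = 0x01  # hypothetical command code


def make_packet(address: int, command: int, value: int) -> bytes:
    """Build [address, command, value, check]; check is the XOR of the rest."""
    body = bytes([address & 0xFF, command & 0xFF, value & 0xFF])
    check = body[0] ^ body[1] ^ body[2]
    return body + bytes([check])


def packet_ok(packet: bytes) -> bool:
    """An intact packet XORs to zero when the check byte is included."""
    acc = 0
    for b in packet:
        acc ^= b
    return acc == 0


pkt = make_packet(address=3, command=SET_SPEED, value=42)
# Flip one bit of the address to imitate a transmission error.
corrupted = bytes([pkt[0] ^ 0x04]) + pkt[1:]
```

Because each packet carries its own address, a receiver on one train can discard packets meant for another train before it even looks at the command.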
1.Requirements Here is a basic set of requirements for the system: ■ The console shall be able to control up to eight trains on a single track. ■ The speed of each train shall be controllable by a throttle to at least 63 different levels in each direction (forward and reverse). ■There shall be an inertia control that shall allow the user to adjust the responsiveness of the train to commanded changes in speed. ■ There shall be an emergency stop button. ■ An error detection scheme will be used to transmit messages. We can put the requirements into our chart format:
2. Specifications The Digital Command Control (DCC) standard was created by the National Model Railroad Association to support interoperable digitally-controlled model trains. DCC was created to provide a standard that could be built by any manufacturer so that hobbyists could mix and match components from multiple vendors. The DCC standard is given in two documents: ■ Standard S-9.1, the DCC Electrical Standard, defines how bits are encoded on the rails for transmission. ■ Standard S-9.2, the DCC Communication Standard, defines the packets that carry information. Any DCC-conforming device must meet these specifications. The DCC standard does not specify many aspects of a DCC train system. It doesn’t define the control panel, the type of microprocessor used, the programming language to be used, or many other aspects of a real model train system. The standard concentrates on those aspects of system design that are necessary for interoperability.
Basic system commands
Typical control sequence :console :train_rcvr set-inertia set-speed set-speed estop set-speed
Conceptual Specification Digital Command Control specifies some important aspects of the system, particularly those that allow equipment to interoperate. But DCC deliberately does not specify everything about a model train control system. There are clearly two major subsystems: The command unit and the train-board component as shown in Fig1. Fig1: Class diagram for the train controller messages. Fig2: UML collaboration diagram for major subsystems of the train controller system
Fig: A UML class diagram for the train controller showing the composition of the subsystems
Console physical object classes knobs* pulser* train-knob: integer speed-knob: integer inertia-knob: unsigned- integer emergency-stop: boolean pulse-width: unsigned- integer direction: boolean sender* detector* send-bit() read-bit() : integer
Panel and motor interface classes speed: integer train-number() : integer speed() : integer inertia() : integer estop() : boolean new-settings()
Transmitter and receiver classes current: command new: boolean send-speed(adrs: integer, speed: integer) send-inertia(adrs: integer, val: integer) set-estop(adrs: integer) read-cmd() new-cmd() : boolean rcv-type(msg-type: command) rcv-speed(val: integer) rcv-inertia(val:integer)
Formatter class Formatter class holds state for each train, setting for current train. The operate() operation performs the basic formatting task. formatter current-train: integer current-speed[ntrains]: integer current-inertia[ntrains]: unsigned-integer current-estop[ntrains]: boolean send-command() panel-active() : boolean operate()
Control input sequence diagram :knobs :panel :formatter :transmitter change in control settings read panel panel-active change in speed/ inertia/estop panel settings send-command read panel send-speed, send-inertia. send-estop panel settings read panel change in train number train number change in panel settings new-settings set-knobs
Formatter operate behavior update-panel() panel-active() new train number idle send-command() other
Panel-active behavior current-train = train-knob update-screen changed = true panel*:read-train() F T current-speed = throttle changed = true panel*:read-speed() F ... ...
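Example: the formatter's panel-active and operate behaviors can be sketched as follows. This is a simplified reading of the diagrams: the Panel class and the list standing in for the transmitter are hypothetical, and the separate update-panel() path for a new train number is folded into the same change check.

```python
NTRAINS = 8  # from the requirements: up to eight trains on a track


class Panel:
    """Stand-in for the panel class; fields mirror the knob settings."""

    def __init__(self):
        self.train = 0
        self.speed = 0
        self.inertia = 0
        self.estop = False


class Formatter:
    def __init__(self, panel: Panel, sent: list):
        self.panel = panel
        self.sent = sent  # list standing in for the transmitter
        self.current_train = 0
        self.current_speed = [0] * NTRAINS
        self.current_inertia = [0] * NTRAINS
        self.current_estop = [False] * NTRAINS

    def panel_active(self) -> bool:
        """Compare panel settings with stored state and record any change."""
        changed = False
        if self.panel.train != self.current_train:
            self.current_train = self.panel.train
            changed = True
        t = self.current_train
        if self.panel.speed != self.current_speed[t]:
            self.current_speed[t] = self.panel.speed
            changed = True
        if self.panel.inertia != self.current_inertia[t]:
            self.current_inertia[t] = self.panel.inertia
            changed = True
        if self.panel.estop != self.current_estop[t]:
            self.current_estop[t] = self.panel.estop
            changed = True
        return changed

    def send_command(self) -> None:
        t = self.current_train
        self.sent.append((t, self.current_speed[t],
                          self.current_inertia[t], self.current_estop[t]))

    def operate(self) -> None:
        """One pass of the idle loop: send only when the panel changed."""
        if self.panel_active():
            self.send_command()
```

Keeping per-train arrays is what lets the user switch the train knob without losing the last speed and inertia commanded for the other trains.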
Instruction sets preliminaries In this topic, we begin our study of microprocessors by studying instruction sets—”The programmer’s interface to the hardware” The instruction set is the key to analyzing the performance of programs. By understanding the types of instructions that the CPU provides, we gain insight into alternative ways to implement a particular function. Computer Architecture Taxonomy A Harvard architecture. A von Neumann architecture computer.
Which Architecture is Best Suited for µp and DSP? Harvard Architecture Von Neumann Architecture Stored program concept (store program code along with data)
Computer Architecture Contd… The CPU has several internal registers that store values used internally. One of those registers is the program counter (PC),which holds the address in memory of an instruction. The CPU fetches the instruction from memory, decodes the instruction, and executes it. Processing signals in real-time places great strains on the data access system in two ways: First, large amounts of data flow through the CPU; and second, that data must be processed at precise intervals, not just when the CPU gets around to it. Data sets that arrive continuously and periodically are called streaming data.
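Example: the fetch, decode, execute cycle driven by the program counter, on a toy machine. The three-instruction set (mov, add, halt) and the register names are hypothetical and are not any real CPU's instruction set.

```python
def run(program, max_steps):
    """Toy CPU: pc holds the address in memory of the next instruction."""
    regs = {"r0": 0, "r1": 0}
    pc = 0
    for _ in range(max_steps):
        instr = program[pc]      # fetch the instruction at address pc
        op, *args = instr        # decode into operation and operands
        pc += 1                  # pc now points at the following instruction
        if op == "mov":          # execute
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "halt":
            break
    return regs, pc


program = [
    ("mov", "r0", 2),
    ("mov", "r1", 3),
    ("add", "r0", "r0", "r1"),
    ("halt",),
]
regs, pc = run(program, max_steps=10)
```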
Computer Architecture in terms of Instruction Set Complex instruction set computer (CISC): many addressing modes; many operations. Different instruction formats of varying lengths. Reduced instruction set computer (RISC): - load/store; (data operands must first be loaded into the CPU and then stored back to main memory to save the results.) - Fewer and simpler instructions pipelinable instructions.
Complex Instruction Set Computers(CISC) Single instruction procedure entries and exits Variable length instruction sets with many formats Complex sequence of operations over many clock cycles Processors based on CISC were sold on the sophistication and number of their addressing modes, data types, etc Developed in the 1970’s when computers had slow main memory so processors were controlled by faster ROMs Frequently used operations are drawn from ROM as microcode sequences rather than having instructions pulled from main memory Reduced Instruction Set Computers(RISC) Pipeline execution Starting a second instruction before the first one has finished A fixed (32 bit) instruction size with few formats. A load-store architecture where instructions that process data operate only on registers and are separate from instructions that access memory A large register bank of 32-bit registers, all of which can be used for any purpose, to allow the load-store architecture to operate efficiently Hard-wired instruction decode logic Single-cycle execution
RISC Architecture Advantages/Disadvantages A smaller die size A simpler processor requires fewer transistors and less silicon area. A shorter development time Less design effort and therefore a lower cost A higher performance Simpler instructions are executed faster. Disadvantages Poor code density compared with CISC’s Doesn’t execute x86 code
Instruction set characteristics Fixed vs variable length. Addressing modes. Number of operands. Types of operations supported.
Programming model Programming model: Registers visible to the programmer. Some registers are not visible (IR).
ARM – What is it? ARM stands for Advanced RISC Machines. An ARM processor is basically any 16/32-bit microprocessor designed and licensed by ARM Ltd, a microprocessor design company headquartered in England, founded in 1990 by Hermann Hauser. A characteristic feature of ARM processors is their low electric power consumption, which makes them particularly suitable for use in portable devices. It is one of the most used processors currently on the market
Examples of ARM Based Products The Toshiba 46HM94 46-inch Television The Nano IPod Samsung S3FJ9SK Smartcard IC
History of ARM Acorn Computers: a British computer company founded in Cambridge, England, in 1978, by Hermann Hauser and Chris Curry. The company produced a number of computers which were especially popular in the UK. These included the Acorn Electron, the BBC Micro and the Acorn Archimedes. Acorn's BBC Micro computer dominated the UK educational computer market during the 1980s and early 1990s. VLSI Technology, Inc. produced the first ARM processor based on Acorn designs. ARM-based PCs did not sell well, and Acorn was acquired by Olivetti in 1985. ARM was contracted to develop for Apple for the Apple Newton handheld, built by VLSI. The company was broken up into several independent operations in 2000, one of which, notably, was ARM Holdings. ARM Holdings' primary business model is to license its RISC-based designs to other manufacturers.
ARM Features The ARM7 is a low-power, general purpose 32-bit RISC microprocessor macrocell (32-bit data & address bus) for use in application or customer-specific integrated circuits (ASICs or CSICs). Its simple, elegant and fully static design is particularly suitable for cost and power-sensitive applications. The ARM7’s small die size makes it ideal for integrating into a larger custom chip that could also contain RAM, ROM, logic, DSP and other cells.
ARM Features contd… Big and Little Endian (with the lowest-order byte residing in the low-order bits of the word) operating modes High performance RISC 17 MIPS (million instructions per second) sustained @ 25 MHz (25 MIPS peak) @ 3V Low power consumption 0.6 mA/MHz @ 3V fabricated in 0.8 µm CMOS Fast interrupt response for real-time applications Virtual Memory System Support Excellent high-level language support Simple but powerful instruction set
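Worked example: the headline figures above imply an instructions-per-cycle rate and a power draw, and the two byte orders can be shown on a sample word. The sample value 0x11223344 is arbitrary; the arithmetic uses only the numbers quoted on this slide and treats the current figure as scaling linearly with clock rate.

```python
CLOCK_MHZ = 25
SUSTAINED_MIPS = 17
MA_PER_MHZ = 0.6
SUPPLY_V = 3.0

instr_per_cycle = SUSTAINED_MIPS / CLOCK_MHZ   # sustained instructions per clock
current_ma = MA_PER_MHZ * CLOCK_MHZ            # supply current at 25 MHz, in mA
power_mw = current_ma * SUPPLY_V               # P = I * V, in mW

word = 0x11223344
little = word.to_bytes(4, "little")  # lowest-order byte at the lowest address
big = word.to_bytes(4, "big")        # highest-order byte at the lowest address

print(f"{instr_per_cycle:.2f} instr/cycle, {current_ma:.0f} mA, {power_mw:.0f} mW")
print(little.hex(), big.hex())
```

The sustained rate sits below one instruction per cycle because some instructions, such as loads and branches, take more than one cycle.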
ARM Architecture RISC features incorporated by ARM A load-store Architecture Fixed-length 32-bit instructions 3-address instruction formats RISC features not incorporated into ARM Register windows Delayed branches Single-cycle execution of all instructions ARM7 is a von Neumann architecture machine, while ARM9 uses a Harvard architecture.
INSTRUCTION FORMAT
The Registers of ARM (Programmers model) ARM has 37 registers all of which are 32-bits long. 1 dedicated program counter 1 dedicated current program status register 5 dedicated saved program status registers 30 general purpose registers The current processor mode governs which of several banks is accessible. Each mode can access a particular set of r0-r12 registers a particular r13 (the stack pointer, sp) and r14 (the link register) the program counter, r15 (pc) the current program status register, cpsr Privileged modes (except System) can also access a particular spsr (saved program status register) The ARM architecture provides a total of 37 registers, all of which are 32-bits long. However these are arranged into several banks, with the accessible bank being governed by the current processor mode. We will see this in more detail in a couple of slides. In summary though, in each mode, the core can access: a particular set of 13 general purpose registers (r0 - r12). a particular r13 - which is typically used as a stack pointer. This will be a different r13 for each mode, so allowing each exception type to have its own stack. a particular r14 - which is used as a link (or return address) register. Again this will be a different r14 for each mode. r15 - whose only use is as the Program counter. The CPSR (Current Program Status Register) - this stores additional information about the state of the processor: And finally in privileged modes, a particular SPSR (Saved Program Status Register). This stores a copy of the previous CPSR value when an exception occurs. This combined with the link register allows exceptions to return without corrupting processor state.
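Worked example: the total of 37 can be checked by counting banked copies. The per-mode breakdown below (FIQ banks r8 to r14; IRQ, Supervisor, Abort and Undefined each bank r13 and r14) is the usual scheme for ARM7-class cores and is an addition here, since the slide gives only the totals.

```python
# r0-r15 as seen from User/System mode; r15 is the program counter.
USER_BANK = 16

# Extra banked copies per exception mode (ARM7-class core assumed).
BANKED = {
    "fiq": 7,         # r8-r14
    "irq": 2,         # r13, r14
    "supervisor": 2,  # r13, r14
    "abort": 2,       # r13, r14
    "undefined": 2,   # r13, r14
}

PC_COUNT = 1
CPSR_COUNT = 1
SPSR_COUNT = len(BANKED)  # one saved status register per exception mode

general_purpose = (USER_BANK - PC_COUNT) + sum(BANKED.values())
total = general_purpose + PC_COUNT + CPSR_COUNT + SPSR_COUNT
print(general_purpose, total)
```

The count reproduces the slide's 30 general purpose registers plus one program counter, one CPSR and five SPSRs.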
Visible Registers User Addressable System Addressable
ARM Architecture Instruction Set Foundation Current Program Status Register Used in user-level programs to store the condition code bits. N: Negative; the last ALU operation which changed the flags produced a negative result Z: Zero; the last ALU operation which changed the flags produced a zero result C: Carry; the last ALU operation which changed the flags generated a carry-out. V: Overflow; the last arithmetic ALU operation which changed the flags generated an overflow into the sign bit.
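Example: how the four condition code bits follow from a 32-bit addition. This is a software model for illustration, not ARM code; the function name is an assumption.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def add_with_flags(a: int, b: int):
    """32-bit add returning (result, flags) with flags N, Z, C and V."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "N": bool(result & SIGN_BIT),  # result is negative (bit 31 set)
        "Z": result == 0,              # result is zero
        "C": full > MASK32,            # carry out of bit 31
        # Overflow: operands have the same sign, result has the other sign.
        "V": (~(a ^ b) & (a ^ result) & SIGN_BIT) != 0,
    }
    return result, flags


r1, f1 = add_with_flags(0x7FFFFFFF, 1)   # largest positive value plus one
r2, f2 = add_with_flags(0xFFFFFFFF, 1)   # -1 + 1 in two's complement
```

The two calls show that C and V are independent: the first overflows the signed range without a carry-out, the second carries out without signed overflow.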
ARM Organization and Implementation 3-stage pipeline organization Principal components The register bank The barrel shifter Can shift or rotate one operand by any number of bits The ALU The address register and incrementer Select and hold all memory addresses and generate sequential addresses The data registers( holds data passing to and from memory) The instruction decoder and associated control logic (refer next slide)
Process Instruction Flow In a single-cycle data processing instruction, two register operands are accessed, the value on the B bus is shifted and combined with the value on the A bus in the ALU, then the result is written back into the register bank. The program counter value is in the address register, from where it is fed into the incrementer, then the incremented value is copied back into r15(PC) in the register bank and also into the address register to be used as the address for the next instruction fetch
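Example: a software model of the data path just described, with the barrel shifter on the B bus feeding an ALU that only adds. The register file is a plain dictionary and the function names are hypothetical; the usage line corresponds to the ARM instruction ADD r0, r1, r2, LSL #2.

```python
MASK32 = 0xFFFFFFFF


def lsl(value: int, amount: int) -> int:
    """Logical shift left within 32 bits."""
    return (value << amount) & MASK32


def ror(value: int, amount: int) -> int:
    """Rotate right within 32 bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def data_processing(regs, rd, rn, rm, shift=lsl, amount=0):
    a_bus = regs[rn]                       # first operand, read onto the A bus
    b_bus = shift(regs[rm], amount)        # second operand through the shifter
    regs[rd] = (a_bus + b_bus) & MASK32    # ALU result written back
    return regs


regs = {0: 0, 1: 10, 2: 3}
data_processing(regs, rd=0, rn=1, rm=2, shift=lsl, amount=2)  # r0 = r1 + (r2 << 2)
```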
ARM Organization and Implementation ARM processors employ a simple 3-stage pipeline with the following pipeline stages Fetch The instruction is fetched from memory and placed in the instruction pipeline Decode The instruction is decoded and the data path control signals prepared for the next cycle. In this stage the instruction ‘owns’ the decode logic but not the data path Execute The instruction ‘owns’ the data path; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
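Example: which instruction occupies each stage on each cycle of an ideal 3-stage pipeline. The sketch assumes every instruction spends exactly one cycle per stage, with no stalls or branches; the function name is illustrative.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(n_instructions: int):
    """Return one row per cycle mapping stage name to instruction index."""
    cycles = n_instructions + len(STAGES) - 1
    table = []
    for cycle in range(cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            i = cycle - depth  # instruction i entered fetch at cycle i
            row[stage] = i if 0 <= i < n_instructions else None
        table.append(row)
    return table


for cycle, row in enumerate(pipeline_schedule(4)):
    print(cycle, row)
```

Once the pipeline is full, one instruction completes per cycle even though each instruction takes three cycles from fetch to the end of execute.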
Summary The ARM processor has a rich history both in academia and in the commercial space. It uses innovative architectural design to achieve high performance with low power consumption. It is widely used in mobile and embedded devices due to its power characteristics and is one of the most widely used processors today. It utilizes the RISC instruction set to achieve this performance. It also uses a variety of organizational designs such as pipelining, in addition to the instruction set. The ARM processor is a robust development platform that will be in use for many years to come.
ARM Processor Core Current low-end ARM core for applications like digital mobile phones TDMI T: Thumb, 16-bit instruction set D: on-chip Debug support, enabling the processor to halt in response to a debug request M: enhanced Multiplier, yield a full 64-bit result, high performance I: EmbeddedICE hardware Von Neumann architecture 3-stage pipeline
DIFFERENT STATES
When the processor is executing in ARM state: All instructions are 32 bits wide. All instructions must be word aligned.
When the processor is executing in Thumb state: All instructions are 16 bits wide. All instructions must be halfword aligned.
When the processor is executing in Jazelle state: All instructions are 8 bits wide. The processor performs a word access to read 4 instructions at once.
ARM is designed to access memory efficiently using a single memory access cycle, so word accesses must be on a word address boundary and halfword accesses must be on a halfword address boundary. This includes instruction fetches. Strictly, the bottom bits of the PC do not exist within the ARM core, hence they are 'undefined'; the memory system must ignore them for instruction fetches. In Jazelle state the processor does not perform 8-bit fetches from memory. Instead it does aligned 32-bit fetches (4-byte prefetching), which is more efficient. The 'Jazelle PC' is actually stored in r14, but this detail is completely hidden by the Jazelle support code.
CPUs Input and output. Supervisor mode, exceptions, traps. Co-processors.
Input and Output Devices Input and output devices usually have some analog or non-electronic component. For instance, a disk drive has a rotating disk and analog read/write electronics. But the digital logic in the device that is most closely connected to the CPU very strongly resembles the logic you would expect in any computer system. Devices typically have several registers: ■ Data registers hold values that are treated as data by the device, such as the data read or written by a disk. ■ Status registers provide information about the device’s operation, such as whether the current transaction has completed. Figure: the CPU connected to the device's status register and data register, which sit in front of the device mechanism.
I/O Application: 8251 UART Universal asynchronous receiver transmitter (UART) : provides serial communication. 8251 functions are integrated into standard PC interface chip. Allows many communication parameters to be programmed.
Serial communication Characters are transmitted separately. Figure: waveform against time showing the idle line (no char), the start bit, bit 0, bit 1, ..., bit n-1, and the stop bit.
Serial communication parameters Baud (bit) rate. Number of bits per character (5 to 8). Parity/no parity. Even/odd parity. Length of stop bit (1, 1.5, 2 bits).
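The parameters above fix how long one character occupies the line. A minimal sketch in C, assuming one start bit per character (as in the waveform slide) and hypothetical helper names `frame_half_bits` and `char_time_us`; the stop length is passed in half-bit units so that 1.5 stop bits stays an integer:

```c
/* Hypothetical helper: length of one character frame in half-bit units.
   stop_half_bits: 2 = 1 stop bit, 3 = 1.5 stop bits, 4 = 2 stop bits. */
unsigned frame_half_bits(unsigned data_bits, int parity_enabled,
                         unsigned stop_half_bits)
{
    /* one start bit + data bits + optional parity bit */
    unsigned whole_bits = 1u + data_bits + (parity_enabled ? 1u : 0u);
    return 2u * whole_bits + stop_half_bits;
}

/* Time on the wire for one character, in microseconds. */
double char_time_us(unsigned data_bits, int parity_enabled,
                    unsigned stop_half_bits, unsigned baud)
{
    unsigned half_bits = frame_half_bits(data_bits, parity_enabled,
                                         stop_half_bits);
    return half_bits * 1e6 / (2.0 * baud);
}
```

For example, 8 data bits, no parity and 1 stop bit give a 10-bit frame, so `char_time_us(8, 0, 2, 9600)` is a little over one millisecond.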
8251 CPU interface The UART includes one 8-bit register that buffers characters between the UART and the CPU bus. The Transmitter Ready output indicates that the transmitter is ready to accept a data character; the Transmitter Empty signal goes high when the UART has no characters to send. On the receiver side, the Receiver Ready pin goes high when the UART has a character ready to be read by the CPU. Figure: the CPU connected to the 8251 through the status (8 bit) and data registers; the 8251 drives the serial port transmit/receive lines.
Programming I/O devices Two types of instructions can support I/O: special-purpose I/O instructions; memory-mapped load/store instructions. Intel x86 provides in, out instructions. Most other CPUs use memory-mapped I/O. What about ARM?
Programming I/O devices contd…
1. ARM memory-mapped I/O (programs use normal read/write instructions to communicate with the devices)
Example
Define location for device:
    DEV1 EQU 0x1000
Read/write code:
    LDR r1,=DEV1   ; set up device address
    LDR r0,[r1]    ; read DEV1
    MOV r0,#8      ; set up value to write
    STR r0,[r1]    ; write value to device
Programming I/O devices contd…
2. Peek and poke (a matched read/write pair, much as push and pop are)
Used to read and write I/O devices from a high-level language. Done through pointers, since the C compiler otherwise hides variable addresses from us.
Traditional HLL interfaces:
    /* read the value at a device location */
    int peek(char *location) {
        return *location;
    }
    /* write a new value to a device location */
    void poke(char *location, char newval) {
        (*location) = newval;
    }
Programming I/O devices contd…
3. Busy/wait output
Simplest way to program a device. Use instructions to test when the device is ready.
    current_char = mystring;
    while (*current_char != '\0') {
        poke(OUT_CHAR, *current_char);     /* send the character */
        while (peek(OUT_STATUS) != 0);     /* wait until the device is done */
        current_char++;
    }
INTERRUPT
Interrupts in ARM
ARM7 supports two types of interrupts: 1. Fast interrupt requests (FIQs) and 2. Interrupt requests (IRQs). An FIQ takes priority over an IRQ. The interrupt table is always kept in the bottom memory addresses, starting at location 0. The entries in the table typically contain subroutine calls to the appropriate handler.
The ARM7 performs the following steps when responding to an interrupt:
■ saves the appropriate value of the PC to be used to return,
■ copies the CPSR into a saved program status register (SPSR),
■ forces bits in the CPSR to note the interrupt, and
■ forces the PC to the appropriate interrupt vector.
When leaving the interrupt handler, the handler should:
■ restore the proper PC value,
■ restore the CPSR from the SPSR, and
■ clear interrupt disable flags.
ARM interrupt latency Worst-case latency to respond to interrupt is 27 cycles: Two cycles to synchronize external request. Up to 20 cycles to complete current instruction. Three cycles for data abort. Two cycles to enter interrupt handling state.
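Generic interrupt mechanism Flowchart (assume priority selection is handled before this point): continue execution until an interrupt is pending (intr?). If the interrupt priority is not greater than the current priority, ignore it; otherwise acknowledge it (ack). Then wait for the vector: if a timeout occurs first, signal a bus error; when the vector arrives, call table[vector].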
Supervisor mode Complex systems are often implemented as several programs that communicate with each other. These programs may run under the command of an operating system. It may be desirable to provide hardware checks to ensure that the programs do not interfere with each other, for example by erroneously writing into a segment of memory used by another program. In such cases it is often useful to have a supervisor mode provided by the CPU. Normal programs run in user mode. Supervisor mode has privileges that user mode does not. For example, memory management unit (MMU) systems allow the addresses of memory locations to be changed dynamically. Control of the MMU is typically reserved for supervisor mode to avoid the obvious problems that could occur when program bugs cause inadvertent changes in the memory management registers. The ARM instruction that puts the CPU in supervisor mode is called SWI, e.g. SWI CODE_1
Supervisor mode Contd…. SWI causes the CPU to go into supervisor mode and sets the PC to 0x08. The argument to SWI is a 24-bit immediate value that is passed on to the supervisor mode code; it allows the program to request various services from the supervisor mode. In supervisor mode, the bottom 5 bits of the CPSR (the mode bits M[4:0]) are set to 10011 to indicate that the CPU is in supervisor mode. The old value of the CPSR just before the SWI is stored in a register called the saved program status register (SPSR). There are in fact several SPSRs for different modes; the supervisor mode SPSR is referred to as SPSR_svc. To return from supervisor mode, the supervisor restores the PC from register r14 and restores the CPSR from SPSR_svc.
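The 24-bit argument described above lives in the low bits of the SWI instruction word, so a handler can recover it by masking. A minimal sketch in C, assuming the standard ARM-state encoding in which bits 27..24 of a SWI instruction are all ones; the function names are hypothetical:

```c
#include <stdint.h>

/* Returns nonzero if the 32-bit ARM-state instruction word is a SWI.
   Bits 31..28 hold the condition code and are ignored here. */
int is_swi(uint32_t instr)
{
    return (instr & 0x0F000000u) == 0x0F000000u;
}

/* Extracts the 24-bit immediate that the program passed to the supervisor. */
uint32_t swi_argument(uint32_t instr)
{
    return instr & 0x00FFFFFFu;
}
```

A real handler would typically load the instruction word from the address just before the return address held in r14; that step is omitted here because it depends on the processor state (ARM or Thumb) at the time of the SWI.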
Exceptions An exception is an internally detected error. A simple example is division by zero. One way to handle this problem would be to check every divisor before division to be sure it is not zero, but this would both substantially increase the size of numerical programs and cost a great deal of CPU time evaluating the divisor’s value. The CPU can more efficiently check the divisor’s value during execution. Since the time at which a zero divisor will be found is not known in advance, this event is similar to an interrupt except that it is generated inside the CPU. The exception mechanism provides a way for the program to react to such unexpected events. Just as interrupts can be seen as an extension of the subroutine mechanism, exceptions are generally implemented as a variation of an interrupt.
Exceptions Contd…. Since both deal with changes in the flow of control of a program, it makes sense to use similar mechanisms. However, exceptions are generated internally. Exceptions in general require both prioritization and vectoring. Exceptions must be prioritized because a single operation may generate more than one exception, for example an illegal operand and an illegal memory access. The priority of exceptions is usually fixed by the CPU architecture. Vectoring provides a way for the user to specify the handler for the exception condition. The vector number for an exception is usually predefined by the architecture; it is used to index into a table of exception handlers.
ARM’s Exceptions (1/6) Exceptions arise whenever the normal flow of a program has to be halted temporarily, for example to service an interrupt from a peripheral. ARM supports 7 types of exception, and each type is handled in a privileged processor mode (Supervisor, Undefined, Abort, IRQ or FIQ). The ARM exception vectors are listed with each exception on slides 4/6 to 6/6.
ARM’s Exceptions (2/6) When handling an exception, the ARM7TDMI: Preserves the address of the next instruction in the appropriate Link Register Copies the CPSR into the appropriate SPSR Forces the CPSR mode bits to a value which depends on the exception Forces the PC to fetch the next instruction from the relevant exception vector It may also set the interrupt disable flags to prevent otherwise unmanageable nestings of exceptions. If the processor is in THUMB state when an exception occurs, it will automatically switch into ARM state when the PC is loaded with the exception vector address.
ARM’s Exceptions (3/6) On completion, the exception handler: Moves the Link Register, minus an offset where appropriate, to the PC. (The offset will vary depending on the type of exception.) Copies the SPSR back to the CPSR Clears the interrupt disable flags, if they were set on entry
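The entry and exit steps on the last two slides can be modelled in a few lines of C. This is only a sketch for an IRQ, not processor documentation: the struct and function names are hypothetical, the bit positions (I = bit 7, T = bit 5, mode in bits 4..0, IRQ mode = 10010) are the standard ARM7TDMI CPSR layout, and the link value is supplied by the caller rather than derived from a pipeline model.

```c
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu       /* M[4:0] */
#define CPSR_MODE_IRQ  0x12u       /* 10010 */
#define CPSR_T_BIT     (1u << 5)   /* Thumb state */
#define CPSR_I_BIT     (1u << 7)   /* IRQ disable */
#define IRQ_VECTOR     0x18u

/* Hypothetical model of the registers involved in taking an IRQ. */
struct cpu_model {
    uint32_t pc;
    uint32_t cpsr;
    uint32_t lr_irq;
    uint32_t spsr_irq;
};

/* Entry: save the link value and CPSR, switch mode, jump to the vector. */
void enter_irq(struct cpu_model *c, uint32_t link_value)
{
    c->lr_irq = link_value;                       /* preserve return info */
    c->spsr_irq = c->cpsr;                        /* copy CPSR into SPSR */
    c->cpsr &= ~(CPSR_MODE_MASK | CPSR_T_BIT);    /* ARM state, clear mode */
    c->cpsr |= CPSR_MODE_IRQ | CPSR_I_BIT;        /* IRQ mode, IRQs masked */
    c->pc = IRQ_VECTOR;                           /* fetch from the vector */
}

/* Exit: the effect of SUBS pc, lr, #4 executed in IRQ mode. */
void return_from_irq(struct cpu_model *c)
{
    c->pc = c->lr_irq - 4u;                       /* link minus the offset */
    c->cpsr = c->spsr_irq;                        /* restores mode and flags */
}
```

Restoring the CPSR from the SPSR also brings back the original interrupt disable flags and Thumb bit, which is why the model needs no separate step for clearing them.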
ARM’s Exceptions (4/6): Reset, Undefined Instruction, Prefetch Abort
Reset
    Occurs when the processor’s Reset input is asserted.
    CPSR ← Supervisor + I + F
    PC ← 0x00000000
Undefined Instruction
    Occurs if an attempt is made to execute an instruction that is undefined.
    LR_undef ← Undefined Instruction Address + #4
    PC ← 0x00000004, CPSR ← Undefined + I
    Return with: MOVS pc, lr
Prefetch Abort
    Instruction fetch memory abort, invalid fetched instruction.
    LR_abt ← Aborted Instruction Address + #4, SPSR_abt ← CPSR
    PC ← 0x0000000C, CPSR ← Abort + I
    Return with: SUBS pc, lr, #4
Notes:
When the nRESET signal goes LOW, the ARM7TDMI abandons the executing instruction and then continues to fetch instructions from incrementing word addresses. When nRESET goes HIGH again, the ARM7TDMI forces M[4:0] to 10011 (Supervisor mode), sets the I and F bits in the CPSR, and clears the CPSR’s T bit. It then forces the PC to fetch the next instruction from address 0x00.
When the ARM7TDMI comes across an instruction which it cannot handle, it takes the undefined instruction trap. After emulating the failed instruction, the trap handler should execute the following irrespective of the state (ARM or Thumb): MOVS PC,R14_und. This restores the CPSR and returns to the instruction following the undefined instruction.
Prefetch abort occurs during an instruction prefetch. If a prefetch abort occurs, the prefetched instruction is marked as invalid, but the exception will not be taken until the instruction reaches the head of the pipeline.
ARM’s Exceptions (5/6): Data Abort, Software Interrupt
Data Abort
    Data access memory abort, invalid data.
    LR_abt ← Aborted Instruction + #8, SPSR_abt ← CPSR
    PC ← 0x00000010, CPSR ← Abort + I
    Return with: SUBS pc, lr, #4 or SUBS pc, lr, #8
Software Interrupt
    Enters Supervisor mode.
    LR_svc ← SWI Address + #4, SPSR_svc ← CPSR
    PC ← 0x00000008, CPSR ← Supervisor + I
    Return with: MOVS pc, lr
Notes:
Data abort occurs during a data access. If a data abort occurs, the action taken depends on the instruction type.
The software interrupt instruction (SWI) is used for entering Supervisor mode, usually to request a particular supervisor function. A SWI handler should return by executing the following irrespective of the state (ARM or Thumb): MOVS PC, R14_svc. The S suffix makes the instruction restore the CPSR from SPSR_svc as well as the PC, so execution returns to the instruction following the SWI.
ARM’s Exceptions (6/6): Interrupt Request, Fast Interrupt Request
Interrupt Request
    Externally generated by asserting the processor’s IRQ input.
    LR_irq ← PC - #4, SPSR_irq ← CPSR
    PC ← 0x00000018, CPSR ← Interrupt + I
    Return with: SUBS pc, lr, #4
Fast Interrupt Request
    Externally generated by asserting the processor’s FIQ input.
    LR_fiq ← PC - #4, SPSR_fiq ← CPSR
    PC ← 0x0000001C, CPSR ← Fast Interrupt + I + F
    Handler @0x1C speeds up the response time.
Notes:
The IRQ (Interrupt Request) exception is a normal interrupt caused by a LOW level on the nIRQ input. It may be disabled at any time by setting the I bit in the CPSR, though this can only be done from a privileged (non-User) mode. Irrespective of whether the exception was entered from ARM or Thumb state, an IRQ handler should return from the interrupt by executing SUBS PC,R14_irq,#4.
The FIQ exception is designed to support a data transfer or channel process. FIQ is externally generated by taking the nFIQ input LOW. Irrespective of whether the exception was entered from ARM or Thumb state, a FIQ handler should leave the interrupt by executing SUBS PC,R14_fiq,#4. FIQ may be disabled by setting the CPSR’s F flag.
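The vector addresses and return instructions given on slides 4/6 to 6/6 can be collected into one lookup table. The sketch below is only a summary aid in C with hypothetical names; the return offset for Data Abort is taken as 8, one of the two options the slide lists, and Reset has no return, so its offset is a placeholder.

```c
#include <stdint.h>

/* The seven exception types, in vector-address order. */
enum arm_exception {
    EXC_RESET, EXC_UNDEFINED, EXC_SWI, EXC_PREFETCH_ABORT,
    EXC_DATA_ABORT, EXC_IRQ, EXC_FIQ, EXC_COUNT
};

struct exception_info {
    uint32_t vector;         /* address forced into the PC on entry */
    uint32_t return_offset;  /* subtracted from LR when returning */
};

static const struct exception_info exception_table[EXC_COUNT] = {
    [EXC_RESET]          = { 0x00u, 0u },  /* no return; placeholder */
    [EXC_UNDEFINED]      = { 0x04u, 0u },  /* MOVS pc, lr */
    [EXC_SWI]            = { 0x08u, 0u },  /* MOVS pc, lr */
    [EXC_PREFETCH_ABORT] = { 0x0Cu, 4u },  /* SUBS pc, lr, #4 */
    [EXC_DATA_ABORT]     = { 0x10u, 8u },  /* SUBS pc, lr, #8 (assumed) */
    [EXC_IRQ]            = { 0x18u, 4u },  /* SUBS pc, lr, #4 */
    [EXC_FIQ]            = { 0x1Cu, 4u },  /* SUBS pc, lr, #4 */
};

uint32_t exception_vector(enum arm_exception e)
{
    return exception_table[e].vector;
}

/* PC value produced by the return instruction for this exception type. */
uint32_t exception_return_pc(enum arm_exception e, uint32_t lr)
{
    return lr - exception_table[e].return_offset;
}
```

Address 0x14 is skipped between Data Abort and IRQ, and FIQ sits last at 0x1C, which is what allows its handler to start at the vector itself.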
Traps A trap, also known as a software interrupt, is an instruction that explicitly generates an exception condition. The most common use of a trap is to enter supervisor mode. The entry into supervisor mode must be controlled to maintain security: if the interface between user and supervisor mode is improperly designed, a user program may be able to sneak code into the supervisor mode that could be executed to perform harmful operations. The ARM provides the SWI instruction for software interrupts. This instruction causes the CPU to enter supervisor mode. An opcode is embedded in the instruction that can be read by the handler.
Co-processor CPU architects often want to provide flexibility in what features are implemented in the CPU. One way to provide such flexibility at the instruction set level is to allow co-processors, which are attached to the CPU and implement some of the instructions. Example: floating-point units are often structured as co-processors. ARM allows up to 16 designer-selected co-processors. The floating-point unit occupies two co-processor units in the ARM architecture, numbered 1 and 2, but it appears as a single unit to the programmer.
Co-processor contd…. To support co-processors, certain opcodes must be reserved in the instruction set for co-processor operations. Co-processor instructions can load and store co-processor registers or can perform internal operations. A CPU may, of course, receive co-processor instructions even when there is no coprocessor attached. Most architectures use illegal instruction traps to handle these situations.
CPUs Caches. Memory management.
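Caches and CPUs Figure: the CPU sends addresses to the cache controller, which sits between the CPU, the cache and main memory; data returns to the CPU from the cache or from main memory.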
Cache operation Many main memory locations are mapped onto one cache entry. May have caches for: instructions; data; data + instructions (unified). Memory access time is no longer deterministic.
Terms Cache hit: required location is in cache. Cache miss: required location is not in cache. Working set: set of locations used by program in a time interval.
Types of misses Compulsory (cold): location has never been accessed. Capacity: working set is too large. Conflict: multiple locations in working set map to same cache entry.
Memory system performance h = cache hit rate. $t_{cache}$ = cache access time, $t_{main}$ = main memory access time. Average memory access time: $t_{av} = h\,t_{cache} + (1-h)\,t_{main}$
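The average access time formula above translates directly into code. A minimal sketch in C; the function name is hypothetical and the time unit is whatever the caller uses consistently:

```c
/* Average memory access time for a single-level cache.
   h is the hit rate in the range 0.0 to 1.0. */
double avg_access_time(double h, double t_cache, double t_main)
{
    return h * t_cache + (1.0 - h) * t_main;
}
```

With a hit rate of 0.5, a cache access time of 2 and a main memory access time of 10, the average is 6, halfway between the two, as expected.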
Multiple levels of cache L2 cache CPU L1 cache
Multi-level cache access time $h_1$ = hit rate on L1. $h_2$ = fraction of accesses that hit in either L1 or L2. Average memory access time: $t_{av} = h_1 t_{L1} + (h_2 - h_1)\,t_{L2} + (1 - h_2)\,t_{main}$
Replacement policies Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location. Two popular strategies: Random. Least-recently used (LRU).
Cache organizations Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented). Direct-mapped: each memory location maps onto exactly one cache entry. N-way set-associative: each memory location maps onto exactly one set and can go into any of the n entries in that set.
Cache performance benefits Keep frequently-accessed locations in fast cache. Cache retrieves more than one word at a time. Sequential accesses are faster after first access.
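Direct-mapped cache Figure: each cache block holds a valid bit, a tag and data bytes (for example valid = 1, tag = 0xabcd, data = byte byte byte ... byte). The address is split into tag, index and offset: the index selects the cache block, the stored tag is compared (=) with the address tag to produce hit, and the offset selects the value within the block.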
Write operations Write-through: immediately copy write to main memory. Write-back: write to main memory only when location is removed from cache.
Direct-mapped cache locations Many locations map onto the same cache block. Conflict misses are easy to generate: Array a[] uses locations 0, 1, 2, … Array b[] uses locations 1024, 1025, 1026, … Operation a[i] + b[i] generates conflict misses.
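Why a[i] and b[i] collide can be seen by computing the cache entry each address maps to. The slide does not state the cache size, so the sketch below assumes a direct-mapped cache of 1024 one-location blocks; the function name is hypothetical.

```c
/* Cache entry selected by an address in a direct-mapped cache with
   num_blocks entries of one location each (an assumed organization). */
unsigned dm_index(unsigned addr, unsigned num_blocks)
{
    return addr % num_blocks;
}
```

Under that assumption a[0] (location 0) and b[0] (location 1024) both select entry 0, a[5] and b[5] both select entry 5, and so on, so each access to one array evicts the element of the other array that was just loaded.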
Set-associative cache A set of direct-mapped caches. Figure: Set 1, Set 2, ..., Set n are looked up in parallel and together produce hit and data.
Example: direct-mapped vs. set-associative
Direct-mapped cache behavior
After 001 access:
    block  tag  data
    00     -    -
    01     0    1111
    10     -    -
    11     -    -
After 010 access:
    block  tag  data
    00     -    -
    01     0    1111
    10     0    0000
    11     -    -
Direct-mapped cache behavior, cont’d.
After 011 access:
    block  tag  data
    00     -    -
    01     0    1111
    10     0    0000
    11     0    0110
After 100 access:
    block  tag  data
    00     1    1000
    01     0    1111
    10     0    0000
    11     0    0110
Direct-mapped cache behavior, cont’d.
After 101 access:
    block  tag  data
    00     1    1000
    01     1    0001
    10     0    0000
    11     0    0110
After 111 access:
    block  tag  data
    00     1    1000
    01     1    0001
    10     0    0000
    11     1    0100
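The access sequence above can be replayed with a small simulation. This is a sketch in C with hypothetical names: addresses are 3 bits, the low 2 bits select the block and the top bit is the tag, and the memory contents are the data values shown in the tables. Addresses 000 and 110 are never accessed in the example, so their contents here are placeholders.

```c
#define NUM_BLOCKS 4

struct cache_line {
    int valid;
    unsigned tag;
    unsigned data;
};

/* Memory contents indexed by 3-bit address. Entries 0 and 6 are
   placeholders; the others are taken from the example tables. */
static const unsigned example_memory[8] = {
    0x0, 0xF, 0x0, 0x6, 0x8, 0x1, 0x0, 0x4
};

/* Accesses one address; returns 1 on a hit, 0 on a miss. On a miss the
   selected block is overwritten with the new tag and data. */
int dm_access(struct cache_line cache[NUM_BLOCKS], unsigned addr)
{
    unsigned index = addr & 0x3u;         /* low 2 bits: block number */
    unsigned tag = (addr >> 2) & 0x1u;    /* top bit: tag */
    struct cache_line *line = &cache[index];

    if (line->valid && line->tag == tag)
        return 1;

    line->valid = 1;
    line->tag = tag;
    line->data = example_memory[addr & 0x7u];
    return 0;
}
```

Starting from an empty cache and accessing 001, 010, 011, 100, 101, 111 in order, every access misses; 101 and 111 evict the blocks loaded by 001 and 011, which gives the final table on the last slide.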
2-way set-associative cache behavior
Final state of cache (twice as big as direct-mapped):
    set  blk 0 tag  blk 0 data  blk 1 tag  blk 1 data
    00   1          1000        -          -
    01   0          1111        1          0001
    10   0          0000        -          -
    11   0          0110        1          0100
2-way set-associative cache behavior
Final state of cache (same size as direct-mapped):
    set  blk 0 tag  blk 0 data  blk 1 tag  blk 1 data
    0    01         0000        10         1000
    1    10         0001        11         0100
Example caches StrongARM: 16 Kbyte, 32-way, 32-byte block instruction cache. 16 Kbyte, 32-way, 32-byte block data cache (write-back). C55x: Various models have 16KB, 24KB cache. Can be used as scratch pad memory.
Scratch pad memories Alternative to cache: Software determines what is stored in scratch pad. Provides predictable behavior at the cost of software control. C55x cache can be configured as scratch pad.
Memory management units Memory management unit (MMU) translates addresses. Figure: the CPU issues a logical address to the memory management unit, which sends the physical address to main memory.
Memory management tasks Allows programs to move in physical memory during execution. Allows virtual memory: memory images kept in secondary storage; images returned to main memory on demand during execution. Page fault: request for location not resident in memory.
Address translation Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses. Two basic schemes: segmented; paged. Segmentation and paging can be combined (x86).
Segments and pages Figure: memory layout showing page 1, page 2, segment 1 and segment 2.
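Segment address translation Figure: the segment base address is added (+) to the logical address to form the physical address; a range check compares the result against the segment lower bound and the segment upper bound and signals a range error if it falls outside them.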
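The base-plus-offset translation with a range check can be sketched as follows. The struct layout, field names and return convention are assumptions made for illustration; a real MMU performs this in hardware.

```c
#include <stdint.h>

/* Hypothetical segment descriptor: base plus the allowed physical range. */
struct segment {
    uint32_t base;    /* segment base address */
    uint32_t lower;   /* segment lower bound (inclusive) */
    uint32_t upper;   /* segment upper bound (inclusive) */
};

/* Translates a logical address. Returns 0 and stores the physical address
   on success; returns -1 (range error) if the result is out of bounds. */
int segment_translate(const struct segment *s, uint32_t logical,
                      uint32_t *physical)
{
    uint32_t addr = s->base + logical;

    if (addr < s->lower || addr > s->upper)
        return -1;

    *physical = addr;
    return 0;
}
```

The range check is what stops one program from reaching into memory that belongs to another segment.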
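Page address translation Figure: the logical address is split into a page number and an offset; the page number selects the page i base, which is concatenated with the offset to form the physical address.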
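Paged translation replaces the addition used for segments with a concatenation. A minimal sketch in C, assuming 4 Kbyte pages and a flat table of page base addresses; the names and the table layout are hypothetical, and no bounds or validity checks are modelled.

```c
#include <stdint.h>

#define PAGE_BITS 12u                    /* assumed 4 Kbyte pages */
#define PAGE_SIZE (1u << PAGE_BITS)

/* page_table[i] holds the physical base address of page i and must be
   aligned to PAGE_SIZE so that the OR below acts as a concatenation. */
uint32_t page_translate(const uint32_t *page_table, uint32_t logical)
{
    uint32_t page = logical >> PAGE_BITS;           /* page number */
    uint32_t offset = logical & (PAGE_SIZE - 1u);   /* offset within page */

    return page_table[page] | offset;
}
```

Because every page has the same size, no adder or range check is needed: the offset can never run past the end of the page.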
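Page table organizations Figure: two organizations of page descriptors, flat (a single array of page descriptors) and tree (descriptors reached through intermediate descriptors).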
Caching address translations Large translation tables require main memory access. TLB: cache for address translation. Typically small.
ARM memory management Memory region types: section: 1 Mbyte block; large page: 64 kbytes; small page: 4 kbytes. An address is marked as section-mapped or page-mapped. Two-level translation scheme.
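ARM address translation Figure: the translation table base register is concatenated with the 1st index to select a 1st level table descriptor; that descriptor is concatenated with the 2nd index to select a 2nd level table descriptor; that descriptor together with the offset gives the physical address.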
CPUs CPU performance CPU power consumption.
Elements of CPU performance Cycle time. CPU pipeline. Memory system.
Pipelining Several instructions are executed simultaneously at different stages of completion. Various conditions can cause pipeline bubbles that reduce utilization: branches; memory system delays; etc.
Performance measures Latency: time it takes for an instruction to get through the pipeline. Throughput: number of instructions executed per time period. Pipelining increases throughput without reducing latency.
ARM7 pipeline ARM 7 has 3-stage pipe: fetch instruction from memory; decode opcode and operands; execute.
ARM pipeline execution
    time            1      2       3        4        5
    add r0,r1,#5    fetch  decode  execute
    sub r2,r3,r6           fetch   decode   execute
    cmp r2,#3                      fetch    decode   execute
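The diagram shows three instructions finishing in five cycles rather than nine. A minimal sketch in C of that cycle count, assuming an ideal pipeline with no stalls and one cycle per stage; the function names are hypothetical.

```c
/* Cycles to complete n instructions on an ideal pipeline: the first
   instruction needs one cycle per stage, each later one adds one cycle. */
unsigned long pipelined_cycles(unsigned stages, unsigned long n)
{
    if (n == 0)
        return 0;
    return stages + n - 1;
}

/* Cycles if each instruction had to finish before the next one started. */
unsigned long unpipelined_cycles(unsigned stages, unsigned long n)
{
    return (unsigned long)stages * n;
}
```

Each instruction still takes three cycles from fetch to the end of execute, which is the sense in which pipelining raises throughput without reducing latency.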
Pipeline stalls If every step cannot be completed in the same amount of time, pipeline stalls. Bubbles introduced by stall increase latency, reduce throughput.
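ARM multi-cycle LDMIA instruction
    ldmia r0,{r2,r3}   fetch  decode  ex ld r2  ex ld r3
    sub r2,r3,r6              fetch   decode    (wait)    ex sub
    cmp r2,#3                         fetch     (wait)    decode  ex cmp
    time →
The load multiple occupies the execute stage for two cycles, so the instructions behind it each wait one cycle.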
Control stalls Branches often introduce stalls (branch penalty). Stall time may depend on whether branch is taken. May have to squash instructions that already started executing. Don’t know what to fetch until condition is evaluated.
ARM pipelined branch
    bne foo             fetch  decode  ex bne  ex bne  ex bne
    sub r2,r3,r6               fetch   decode
    foo add r0,r1,r2                           fetch   decode  ex add
    time →
The sub is fetched and decoded but not executed when the branch is taken.
Delayed branch To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch are always executed, whether the branch is taken or not.
Memory system performance Caches introduce indeterminacy in execution time. Depends on order of execution. Cache miss penalty: added time due to a cache miss.
Types of cache misses Compulsory miss: location has not been referenced before. Conflict miss: two locations are fighting for the same block. Capacity miss: working set is too large.
CPU power consumption Most modern CPUs are designed with power consumption in mind to some degree. Power vs. energy: heat depends on power consumption; battery life depends on energy consumption.
CMOS power consumption Voltage drops: power consumption proportional to $V^2$. Toggling: more activity means more power. Leakage: basic circuit characteristics; can be eliminated by disconnecting power.
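The quadratic dependence on voltage, together with the power versus energy distinction from the previous slide, can be put into numbers. A small sketch in C with hypothetical function names; it models only the voltage term and ignores toggling activity and leakage.

```c
/* Factor by which voltage-dependent power changes when the supply
   moves from v_old to v_new (power proportional to V squared). */
double voltage_power_scale(double v_old, double v_new)
{
    double ratio = v_new / v_old;
    return ratio * ratio;
}

/* Energy in millijoules consumed at a constant power (in milliwatts)
   for a given time in seconds. */
double energy_mj(double power_mw, double seconds)
{
    return power_mw * seconds;
}
```

Halving the supply voltage cuts this component of power to one quarter, which is why reducing the supply voltage leads the list of power-saving strategies.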
CPU power-saving strategies Reduce power supply voltage. Run at lower clock frequency. Disable function units with control signals when not in use. Disconnect parts from power supply when not in use.
Power management styles Static power management: does not depend on CPU activity. Example: user-activated power-down mode. Dynamic power management: based on CPU activity. Example: turning off function units that are not in use.
Application: PowerPC 603 energy features Provides doze, nap, sleep modes. Dynamic power management features: Uses static logic. Can shut down unused execution units. Cache organized into subarrays to minimize amount of active circuitry.
PowerPC 603 activity
Percentage of time units are idle for SPEC integer/floating-point:
    unit             Specint92  Specfp92
    D cache          29%        28%
    I cache          29%        17%
    load/store       35%        17%
    fixed-point      38%        76%
    floating-point   99%        30%
    system register  89%        97%
Power-down costs Going into a power-down mode costs: time; energy. Must determine if going into mode is worthwhile. Can model CPU power states with power state machine.
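Whether a power-down mode is worthwhile can be estimated by comparing energies over the expected idle interval. The sketch below is a simple model in C, not a description of any particular CPU: every parameter is supplied by the caller, the function name is hypothetical, and it assumes the transitions into and out of the mode take a fixed total time and energy.

```c
/* Returns 1 if entering the low-power mode uses less energy than staying
   in the current state for the whole interval, 0 otherwise.
   Units must be consistent, e.g. milliwatts, seconds and millijoules. */
int power_down_is_worthwhile(double p_stay, double p_low,
                             double t_transition, double e_transition,
                             double t_interval)
{
    double e_stay, e_low;

    /* Not enough time to enter and leave the mode at all. */
    if (t_interval <= t_transition)
        return 0;

    e_stay = p_stay * t_interval;
    e_low = e_transition + p_low * (t_interval - t_transition);
    return e_low < e_stay;
}
```

The longer the interval, the more the fixed transition cost is amortized, so a power manager following this model only powers down when it predicts a long enough idle period.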
Application: StrongARM SA-1100 power saving Processor takes two supplies: VDD is main 3.3V supply. VDDX is 1.5V. Three power modes: Run: normal operation. Idle: stops CPU clock, with logic still powered. Sleep: shuts off most of chip activity; 3 steps, each about 30 ms; wakeup takes > 10 ms.
SA-1100 power state machine Figure: three states, run (Prun = 400 mW), idle (Pidle = 50 mW) and sleep (Psleep = 0.16 mW), connected by transitions labelled with their times: 10 ms, 160 ms, 90 ms, 10 ms, 90 ms.