Modern Processor Design: Fundamentals of Superscalar Processors


Monday, April 29, 2019



Modern Processor Design: Fundamentals of Superscalar Processors, by John Paul Shen and Mikko H. Lipasti. Print book, English. Long Grove: Waveland Press.

Intel's P6 Microarchitecture. Though not all such techniques have yet been adopted in real machines. This program characteristic was discovered recently. Advanced Register Data Flow Techniques This chapter highlights emerging microarchitectural techniques for increasing performance by exploiting the program characteristic of value locality.

It contains fascinating information that can't be found elsewhere. ECE Department. Deepak Limaye. Chris Newbum.

This predefined instruction set is also called the instruction set architecture (ISA). Including embedded microprocessors and microcontrollers.

Intel Labs. Some of the problems at the end of each chapter were actually contributed by students at the University of Wisconsin.

Alex Dean. They include Bryan Black. Carnegie Mellon University Mikko H. Ryan Rakvic. A microprocessor's functionality is fully characterized by the instruction set that it is capable of executing. John Faistl. Microarchitecture Research. This course has been taught at CMU since Since Chris Nelson. Currently more than million microprocessors are sold each year for the mobile. In the past three decades. An ISA serves as an interface between software and hardware. Assistant Professor.

We appreciate their test driving of this book. Kyle Oppenheim. A draft version of this textbook has also been used at the University of Wisconsin since Scott Cape. These innovations include embedded microcontrollers. All the programs that run on a microprocessor are encoded in that instruction set. Adjunct Professor. Andrew Huang. Derek Noonburg. We both are indebted to the nurturing we experienced while at CMU.

In terms of processor design methodology. John Paul Shen. Yuan Chou. Hundreds of students have taken this course at CMU.

Many teaching assistants of this course have left their indelible touch in the contents of this book. Trung Diep. Its performance has improved at the astounding rate of doubling every 18 months. Microprocessors are instruction set processors ISPs. Most technologists agree that Moore's law will continue to rule for at least 10 to 15 years more.

The three decades of the history of microprocessors tell a truly remarkable story of technological advances in the computer industry. These workstations in turn became the workhorses for the design of subsequent generations of even more powerful microprocessors. It presents existing and proposed microarchitecture techniques in a systematic way and imparts foundational principles and insights.

The was a 4-bit processor consisting of approximately transistors with a clock frequency of just over kilohertz kHz. The 8-bit microprocessor also became the heart of a new popular computing platform called the personal computer PC and ushered in the PC era of computing.


We can also expect new innovations in a number of areas. This book focuses on contemporary superscalar microprocessor design at the microarchitecture level.

Deeply pipelined machines capable of achieving extremely high clock frequencies and sustaining multiple instructions executed per cycle became popular. These narrow bit-width microprocessors evolved into self-contained microcontrollers that were produced in huge volumes and deployed in numerous embedded applications ranging from washing machines.

Major techniques that have been. During the first decade. Architectural features that historically belong to large systems. High-end microprocessors. The importance of having an instruction set architecture that facilitates efficient hardware implementation and that can leverage compiler optimizations was recognized. We are now heading into the fourth decade of microprocessors.

The objective of this book is to introduce the fundamental principles of microprocessor design at the microarchitecture level. Power consumption will become a dominant performance impediment and will require new solutions at all levels of the design hierarchy.

Within a few years microprocessors will be clocked at close to 10 GHz and each will contain several hundred million transistors. Instruction pipelining and fast cache memories became standard microarchitecture techniques.

Many traditional "macroarchitecture" issues will now become microarchitecture issues. Personal computers and workstations became ubiquitous and essential tools for productivity and communication. The year marks the thirtieth anniversary of the birth of microprocessors. The decade of the s witnessed major advances in the architecture and microarchitecture of bit microprocessors.

Its primary application was for building calculators. In that same period. The evolution of the microprocessor has pretty much followed the famed Moore's law. As with all forms of engineering design. Extremely aggressive microarchitecture techniques were devised to achieve unprecedented levels of microprocessor performance.

During the decade of the s. By the end of the third decade of microprocessors. Out-of-order execution of instructions and aggressive branch prediction techniques were introduced to avoid or reduce the number of pipeline stalls. Powerful scientific and engineering workstations based on bit microprocessors were introduced. By Such phenomenal performance improvement is unmatched by that in any other industry.

Instruction set design issues became the focus of both academic and industrial researchers. The clock frequency of the fastest microprocessors exceeded that of the fastest supercomputers.

In a little more than 30 years. In each of the three decades of its existence. In the past two decades. Optimization objectives include the reduction of the number of states and the complexity of the associated combinational logic circuits.

The design process for a microprocessor is more complex and less straightfor. In digital systems design. A number of innovative techniques recently proposed by researchers are also highlighted. Specification for a combinational logic circuit takes the form of boolean functions that specify the relationship between input and output variables.

This book attempts to codify a large body of knowledge into a systematic framework. The primary focus of this book is on microarchitecture design. Synthesis attempts to find an implementation based on the specification. Logic minimization and state minimization software tools are. This description typically uses a register transfer language (RTL) to specify all the major modules in the machine organization and the interactions between these modules.

Usually a performance model is used as an analysis tool to assess the effectiveness of these techniques. Verilog is one such language. The implementation is typically an optimized two-level AND-OR design or a multilevel network of logic gates. Analysis examines an implementation to determine whether and how well it meets the specification.

In a classic textbook on computer architecture by Blaauw and Brooks [] the authors define three fundamental and distinct levels of abstraction: It specifies an instruction set that characterizes the functional behavior of an instruction set processor.

Realization is the physical structure that embodies the implementation. The critical task of analysis is essential in determining the correctness and effectiveness of a design.

Concepts and techniques that may appear quite complex and difficult to decipher are distilled into a format that is intuitive and insightful. The performance model accurately models the behavior of the machine at the clock cycle granularity and is able to quantify the number of machine cycles required to execute a benchmark program. During the logic design step. Synthesis is the more creative task that searches for possible solutions and performs various tradeoffs and design optimizations to arrive at the best solution.

The ISA plays several crucial roles in instruction set processor design. Implementation is the logical structure or organization that performs the architecture. A design is described as a schematic. For example. Microarchitecture design involves developing and defining the key techniques for achieving the targeted performance.

Critical to an instruction set processor is the instruction set architecture. These tools can verify the logic correctness of a design and determine the critical delay path and hence the maximum clocking rate of the state machine.

The optimization attempts to reduce the number of logic gates and the number of levels of logic used in the design. For sequential circuit design.

The end result of microarchitecture design is a high-level description of the organization of the microprocessor. Specification is a behavioral description of what is desired and answers the question "What does it do?" Logic and state machine simulation tools are used to assist the analysis task. We hope this book will play a role in producing a new generation of microprocessor designers who will help write the history for the fourth decade of microprocessors.

Programs can be developed that target the ISA without requiring knowledge of the actual machine implementation. Every new microarchitecture must be validated against the ISA to ensure that it performs the functional requirements specified by the ISA.

Attributes of a realization. Examples of some well-known architectures are IBM. Having the ISA also ensures software portability.

An architecture can have many implementations in the lifetime of that ISA. ISA serves as the specification for processor designers. Besides serving as a reference targeted by software developers or compilers. For a given implementation. Microprocessor design starts with the ISA and produces a microarchitecture that meets this specification.

Such program portability significantly reduces the cost of software development and increases the longevity of software. While new extensions to an existing ISA can occur from time to time to accommodate new emerging applications. Successful ISAs. Special instructions. PowerPC An assembly language program constitutes a sequence of assembly instructions. Typically an ISA defines a set of instructions called assembly instructions.

Unfortunately this same benefit also makes migration to a new ISA very difficult. This is extremely important to ensure that existing software can run correctly on the new microarchitecture. Attributes associated with an implementation include pipeline design. They differ in how operations and operands are specified. Attributes associated with a realization include die size.


To develop these features is the job of the microprocessor designer or the microarchitect. During the s. All implementations of an architecture can execute any program encoded in that ISA.

Each instruction specifies an operation and one or more operands. The development of effective compilers and operating systems for a new ISA can take on the order of UH. Typically a twofold performance increase is needed before software developers will be willing to pay the overhead to recompile their existing applications. An implementation is a specific design of an architecture. Attributes of an architecture can significantly impact the design complexity and the design effort of an implementation.

These realizations can vary and differ in terms of the clock frequency. For a microprocessor. Since the advent of computers. Motorola A realization of an implementation is a specific physical embodiment of a design. Other early ISAs assume that operands are stored in a stack [last in. These attributes are all part of the ISA and exposed to the software as perceived by the compiler or the programmer. A program only needs to be developed once for that ISA. Issues related to architecture and realization are also important.

Every program is compiled into a sequence of instructions in this instruction set. Architecture serves as the specification for the implementation. In an accumulator-based architecture. This can facilitate optimizations by the compiler that lead to reduced hardware complexity. Ideally the DSI should be placed at a level that achieves the best synergy between static techniques and dynamic techniques.

The lowering of the DSI by promoting a former implementation feature to the architecture level effectively exposes part of the original microarchitecture to the software. The DSI provides an important separation between architecture and implementation. This DSI placement becomes a real challenge because of the constantly evolving hardware technology and compiler technology.

In the history of ISA design. On the other hand. As implementation styles and techniques change.

In contrast. This figure is intended not to be highly rigorous but simply to illustrate that the DSI can be placed at different levels. Violation of this separation can become problematic. In addition to these two critical roles. All implementations must meet the requirements and support the functionality specified in the ISA.

A conceptual illustration of possible placements of the DSI is shown in Figure The placement of the DSI is correlated Figure 1. As an ISA evolves and extensions are added. It is quite likely that the few ISAs that have dominated the microprocessor landscape in the past decades will continue to do so for the coming decade. Inherent in the definition of every ISA is an associated definition of an interface that separates what is done statically at compile time versus what is done dynamically at run time.

In between the application program written in a high-level language at the top and the actual hardware of the machine at the bottom.

As stated earlier. Improving performance becomes a real challenge involving subtle tradeoffs and delicate balancing acts. Such types of performance optimization techniques are always desirable and should be employed if the cost is not prohibitive.

Since all future implementations must support the entire ISA to ensure the portability of all existing code. We can examine these techniques by looking at the reduction of each of the three terms in the processor performance equation. Such mistakes have been made with real ISAs. As another example. The relationship between the three terms cannot be easily characterized.

The third term indicates the length of time of each machine cycle. If CPI is reduced. There are other important design objectives. If cycle time can be reduced. Each new generation of microarchitecture seeks to significantly improve on the performance of the previous generation. Section 1.

For these techniques. It is exactly this challenge that makes processor design fascinating and at times more of an art than a science. The focus of this book is not on ISA design but on microarchitecture techniques. In recent years. The three terms are not all independent. If the instruction count can be reduced. The reduction of any one term can potentially increase the magnitude of the other two terms.

This equation has come to be known as the iron law of processor performance. The total number of instructions executed can decrease significantly. It might seem from this equation that improving performance is quite trivial.
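The iron law can be sketched numerically as the product of the three terms the text names: instruction count, cycles per instruction, and cycle time. The function name and the workload figures below are hypothetical, chosen only to show how reducing one term can be offset by growth in another.

```python
def execution_time(instruction_count, cpi, cycle_time_s):
    """Iron law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical baseline: 1 billion instructions, CPI of 1.5, 1 ns cycle time.
baseline = execution_time(1_000_000_000, 1.5, 1e-9)

# Hypothetical variant: 20% fewer instructions, but CPI rises to 2.0.
variant = execution_time(800_000_000, 2.0, 1e-9)

print(f"baseline: {baseline:.2f} s, variant: {variant:.2f} s")
```

With these assumed numbers the variant is slower despite executing fewer instructions, which illustrates why the three terms cannot be optimized in isolation.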

Looking at Equation 1. If some of these older features were promoted to the ISA level. ISA features can inflifc. It is this body of knowledge that this book is attempting to codify. It is the goal. This technique of getting higher frequency via deeper pipelining has served us well for more than a decade. As a pipeline gets deeper. It requires interesting tradeoffs involving many and sometimes very subtle issues.

CPI can go up in three ways. Performance simulators model the microarchitecture of a design and are used to measure the number of machine cycles required to execute a program.

By employing deeper pipelines. As can be seen in Table 1. Trace-driven performance simulators process pregenerated traces to determine the cycle count for executing the instructions in the traces. While the instruction count may go down. By eliminating calls and returns. This can increase the average latency of memory operations and thus increase the overall CPI. This increases the number of penalty cycles incurred when branches are mispredicted.

The use of cache memory to reduce the average memory access latency in terms of number of clock cycles will also reduce the CPI. When a conditional branch is taken. There is a downside to increasing the clock frequency through deeper pipelining.
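The tradeoff between a deeper pipeline (shorter cycle time, larger misprediction penalty) and overall performance can be sketched with a simple additive CPI model. This is a common textbook simplification rather than a formula taken from this text, and every parameter value below is a hypothetical assumption.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average number of branch-misprediction penalty cycles per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical program: 20% of instructions are branches, 10% of those are mispredicted.
shallow_cpi = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=3)
deep_cpi = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=10)

# Hypothetical cycle times: the deeper pipeline is assumed to clock twice as fast.
shallow_ns_per_instr = shallow_cpi * 2.0
deep_ns_per_instr = deep_cpi * 1.0

print(f"shallow: CPI {shallow_cpi:.2f}, {shallow_ns_per_instr:.2f} ns per instruction")
print(f"deep:    CPI {deep_cpi:.2f}, {deep_ns_per_instr:.2f} ns per instruction")
```

Under these assumptions the deeper pipeline still wins on time per instruction even though its CPI is higher; a worse misprediction rate or a smaller frequency gain would narrow or reverse that advantage.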

Compared to a nonpipelined design and assuming identical cycle times. One way is via software instrumentation. The most talented microarchitects and processor designers in the industry all seem to possess the intuition and the insights that enable them to make such tradeoffs better than others.

Functional simulators actually interpret or execute the instructions of a program. Performance simulators can be either trace-driven or execution-driven. A pipelined processor can overlap the processing of multiple instructions. Of course. Branch prediction techniques can reduce the number of such stalled cycles. By being able to sustain the execution of multiple instructions in every machine cycle. Functional simulators model a machine at the architecture (ISA) level and are used to verify the correct execution of a program.

It is not clear how much further we can push it before the requisite complexity and power consumption become prohibitive. As we have already mentioned. Usually performance simulators are concerned not with the semantic correctness of instruction execution. A shallower pipeline. As can be concluded from this discussion. It is also important to do post-silicon validation of the performance model so that it can be used as a good starting point for the next-generation design.

These performance models actually simulate the movement of instructions through the various pipeline stages.

During simulation. Typically the performance model or simulator is implemented in the early phase of the design and is used to do initial tradeoffs of various microarchitecture features. The checkpoint capability allows the simulation. The third trace generation method uses a functional simulator to simulate the execution of a program. Hardware instrumentation requires the monitoring hardware and is seriously limited by the buffering capacity of the monitoring hardware.

The performance simulator then tracks the timing of these instructions and their movement through the pipeline stages. Most performance simulators used in academic research are never validated. More specifically. Another way is via hardware instrumentation. During the entire design process. In trace-driven simulation.

Software instrumentation can significantly increase the code size and the program execution time. It is quite likely that a large fraction of the performance data published in many research papers using unvalidated performance models is completely erroneous.

While many performance simulators claim to be "cycle-accurate." Most modern performance simulators employ the execution-driven paradigm. The most advanced execution-driven performance simulators are supported by functional simulators that are capable of performing full-system simulation. These simulators can be quite complex and. Some performance models are merely cycle counters that assume unlimited resources and simply calculate the total number of cycles needed for the execution of a trace.

It has the ability to issue directives to the. Instead of using pregenerated traces. Execution-driven simulators also alleviate the need to store long traces. For all three methods. Other than the difficulty of validating their accuracy. Black argues convincingly for more rigorous validation of processor simulators [Black and Shen]. The actual implementation of the microarchitecture model in a performance simulator can vary widely in terms of the amount and details of machine resources that are explicitly modeled.
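The simplest kind of trace-driven model described above, a bare cycle counter, can be sketched in a few lines. This is a deliberately naive illustration: the instruction classes, the latency table, and the trace are all hypothetical, and the model assumes strictly sequential execution with no pipelining, stalls, or resource limits.

```python
# Hypothetical per-class latencies in machine cycles (assumed values, not from the text).
LATENCY = {"alu": 1, "load": 2, "store": 1, "branch": 1}


def count_cycles(trace, latency=LATENCY):
    """Sum the latency of every instruction in a pregenerated trace."""
    return sum(latency[op] for op in trace)


# A tiny hypothetical trace of instruction classes.
trace = ["load", "alu", "alu", "store", "branch"]

cycles = count_cycles(trace)
cpi = cycles / len(trace)
print(f"{cycles} cycles for {len(trace)} instructions, CPI = {cpi:.2f}")
```

A model that explicitly represents the machine organization would replace the single summation with per-stage tracking of each instruction as it moves through the pipeline.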

During this phase there isn't a reference that can be used to validate the performance model. Others explicitly model the organization of the machine with all its component modules. The operating system functions that are invoked by the application program effectively increase the total number of instructions executed in carrying out the execution of the program.

Looking at this equation. Increasing the number of pipeline stages can also facilitate higher clocking frequencies by reducing the number of logic gate levels in each pipe stage.

Instruction count is determined by three contributing factors. In other words. The bulk of the microarchitecture techniques presented in this book target the improvement of IPC. As many instructions as there are pipeline stages can be concurrently in flight at any one time. A leading instruction is completed before the next instruction is processed.

Traditional sequential processors execute one instruction at a time. Frequency is strongly affected by the fabrication technology and circuit techniques.

Traditional pipelines can have up to 20 levels of logic gates in each pipe stage. There is a complex tradeoff between making pipelines wider and making them deeper. This subsection presents the basis and motivation for evolving from scalar to superscalar processor implementations. Traditional sequential CISC processors can require an average of about 10 machine cycles for processing each instruction. A more aggressive form of instruction-level parallel processing is possible that involves fetching and initiating multiple instructions into a wider pipelined processor every machine cycle.

A large set of simulation runs can sometimes take many days to complete. With this limitation. In Section 1. We can rewrite that equation to directly represent performance as a product of the inverse of instruction count. To achieve high IPC in superscalar designs. The performance penalties due to various forms of pipeline stalls can be cleanly stated as different CPI overheads. Average IPC (instructions per cycle) reflects the average instruction throughput achieved by the processor and is a key measure of microarchitecture effectiveness.

CPI (cycles per instruction). The ISA and the amount of work encoded into each instruction can strongly influence the total number of instructions executed for a program. With scalar pipelined processors. Processors capable of IPC greater than one are termed superscalar processors. The effectiveness of the compiler can also strongly influence the number of instructions executed.

To a certain extent. During the early phase of the design. For execution-driven performance simulators that have fairly detailed models of a complex machine. This section presents the overview of instruction-level parallel processing and provides the bridge between scalar pipelined processors and their natural descendants.
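The rewritten form of the performance equation, in terms of IPC and clock frequency, can be written out as follows. The typesetting is a reconstruction from the surrounding description (inverse of instruction count, average IPC, and frequency as the inverse of cycle time), not a copy of the book's numbered equation.

```latex
\[
\text{Performance}
  = \frac{1}{\text{instruction count}}
    \times \frac{\text{instructions}}{\text{cycle}}
    \times \frac{1}{\text{cycle time}}
  = \frac{\text{IPC} \times \text{frequency}}{\text{instruction count}}
\]
```

Read this way, performance rises with higher IPC, higher frequency, or a lower instruction count, which is the inverse view of the time-per-program product.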

The widening of the pipeline increases the hardware complexity and the signal propagation delay of each pipe stage. As we consider the parallel processing of instructions in increasing processor performance. Most contemporary performance evaluations involve the simulation of many benchmarks and a total of tens to hundreds of billion instructions.

That equation actually represents the inverse of performance as a product of instruction count. Back then. The machine parallelism parameter N is now the depth of the pipeline. The efficiency of a parallel processor drops off very quickly as the number of processors is increased. Furthermore. This means that almost all the computation time is taken up with scalar computation.

During vector computation all N processors are used to perform operations on array data.


Harold Stone proposed that a performance model similar to that for parallel processors can be developed for pipelined processors [Stone. A typical execution profile of a pipelined processor is shown in Figure 1. There are three phases. Another formulation of this same principle is based on the amount of work that can be done in the vector computation mode.

As shown in Figure 1. As N increases or as the machine parallelism increases. As N becomes very large. If T is the total time required to run the program. This is commonly referred to as the sequential bottleneck. Traditional supercomputers are parallel processors that perform both scalar and vector computations. As N becomes large. During scalar computation only one processor is used. This can be done as shown in Figure 1. Instead of being the number of processors.

Equation 1. N is now the number of pipeline stages. Unlike the idealized pipeline execution profile. The total amount of work remains the same. In other words. Based on this observation. Now the modified profile of Figure 1. Note that the TYP pipeline has a load penalty t N. The number of pipeline stages is N. The parameter g now becomes the fraction of time when the pipeline. Instead of remaining in the pipeline full phase for the duration of the entire execution.
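The sequential bottleneck can be illustrated with the standard form of Amdahl's law. In this sketch, f is assumed to be the fraction of the work that can run in vector (parallel) mode on N processors and the remainder runs on one processor; the symbol names and sample values are assumptions and may not match the book's notation.

```python
def speedup(f, n):
    """Amdahl's law: speedup when a fraction f of the work is spread over n processors."""
    return 1.0 / ((1.0 - f) + f / n)


def efficiency(f, n):
    """Speedup per processor; falls as n grows whenever f < 1."""
    return speedup(f, n) / n


# Hypothetical program in which 80% of the work is vectorizable.
for n in (1, 4, 16, 64, 1024):
    print(f"N={n:5d}  speedup={speedup(0.8, n):6.3f}  efficiency={efficiency(0.8, n):6.3f}")
```

As N becomes very large the speedup approaches 1 / (1 - f), which is 5 for the assumed f of 0.8, so the scalar portion dominates the run time no matter how many processors are added.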

The second phase is the pipeline full phase. The first phase is the pipeline filling phase during which the first sequence of N instructions enters the pipeline.


The topics covered include historical, currently used, and proposed advanced future techniques for branch prediction, as well as high-bandwidth and high-frequency fetch architectures like trace caches. Though not all such techniques have yet been adopted in real machines, future designs are likely to incorporate at least some form of them.


Chapter Advanced Register Data Flow Techniques This chapter highlights emerging microarchitectural techniques for increasing performance by exploiting the program characteristic of value locality. This program characteristic was discovered recently, and techniques ranging from software memoization, instruction reuse, and various forms of value prediction are described in this chapter.

Though such techniques have not yet been adopted in real machines, future designs are likely to incorporate at least some form of them. Chapter Executing Multiple Threads This chapter provides an introduction to thread-level parallelism (TLP), and provides a basic introduction to multiprocessing, cache coherence, and high-performance implementations that guarantee either sequential or relaxed memory ordering across multiple processors.

It discusses single-chip techniques like multithreading and on-chip multiprocessing that also exploit thread-level parallelism. Finally, it visits two emerging technologies, implicit multithreading and preexecution, that attempt to extract thread-level parallelism automatically from single-threaded programs. In summary, Chapters 1 through 5 cover fundamental concepts and foundational techniques. Chapters 6 through 8 present case studies and an extensive survey of actual commercial superscalar processors.

Chapter 9 provides a thorough overview of advanced instruction flow techniques, including recent developments in advanced branch predictors. Chapters 10 and 11 should be viewed as advanced topics chapters that highlight some emerging techniques and provide an introduction to multiprocessor systems.

This is the first edition of the book. An earlier beta edition was published with the intent of collecting feedback to help shape and hone the contents and presentation of this first edition. Through the course of the development of the book, a large set of homework and exam problems has been created. A subset of these problems is included at the end of each chapter.

This course has been taught at CMU since Many teaching assistants of this course have left their indelible touch in the contents of this book. Hundreds of students have taken this course at CMU; many of them provided inputs that also helped shape this book. We both are indebted to the nurturing we experienced while at CMU, and we hope that this book will help perpetuate CMU's historical reputation of producing some of the best computer architects and processor designers.

A draft version of this textbook has also been used at the University of Wisconsin since Some of the problems at the end of each chapter were actually contributed by students at the University of Wisconsin.

We appreciate their test driving of this book. Its performance has improved at the astounding rate of doubling every 18 months. In the past three decades, microprocessors have been responsible for inspiring and facilitating some of the major innovations in computer systems.

These innovations include embedded microcontrollers, personal computers, advanced workstations, handheld and mobile devices, application and file servers, web servers for the Internet, low-cost supercomputers, and large-scale computing clusters. Currently more than million microprocessors are sold each year for the mobile, desktop, and server markets. Including embedded microprocessors and microcontrollers, the total number of microprocessors shipped each year is well over one billion units.

Microprocessors are instruction set processors (ISPs). A microprocessor's functionality is fully characterized by the instruction set that it is capable of executing. All the programs that run on a microprocessor are encoded in that instruction set. This predefined instruction set is also called the instruction set architecture (ISA). An ISA serves as an interface between software and hardware, or between programs and processors. We also acknowledge his coauthors. These unused or idling pipeline stages introduce another form of pipeline inefficiency that can be called external fragmentation of pipeline stages.

This course has been taught at CMU since Identical Computations. References P1. Adjunct Professor. Aside from the implication of having no external fragmentation. It is quite likely that the few ISAs that have dominated the microprocessor landscape in the past decades will continue to do so for the coming decade. For each answer there may be P1. Many anticipate that many more advancements can be expected in the compilation domain.