The 3D Rendering Pipeline
GPUs are special-purpose processors that implement a specific 3D rendering algorithm. The algorithm takes as input a stream of vertices that define the geometry of the scene. The input vertex stream passes through a computation stage that transforms the vertices and computes some of their attributes, generating a stream of transformed vertices. The stream of transformed vertices is assembled into a stream of triangles, each triangle keeping the attributes of its three vertices. The stream of triangles may pass through a stage that performs a clipping test. Each triangle then passes through a rasterizer that generates a stream of fragments: discrete portions of the triangle surface that correspond to the pixels of the rendered image. Fragment attributes are derived from the triangle vertex attributes.
This stream of fragments may pass through a number of stages that perform visibility tests (stencil, depth, alpha and scissor) to remove non-visible fragments, and then passes through a second computation stage. The fragment computation stage may modify the fragment attributes using additional information from n-dimensional arrays stored in memory (textures). Unlike vertices and fragments, textures are not accessed as a stream. Finally, the stream of shaded fragments updates the framebuffer. Figure 1 shows a high-level abstraction of the rendering pipeline for the described rendering algorithm.
Modern GPUs implement the two computation stages described above as programmable stages named vertex shading and fragment shading. The programmability of these stages and the streaming nature of the rendering algorithm allow other stream-based algorithms to be implemented on modern GPUs [1, 2]. However, those implementations may not be optimal. The non-programmable stages are configurable through a limited, predefined set of parameters.
The shading stages are programmed using a shader, or shader program, a relatively small program written in assembly-like (legacy) or high level C-like languages for graphics that describes how the input attributes of a processing element (a vertex or a fragment) are used to compute its output attributes.
Graphics applications use software APIs (OpenGL or Direct3D) that present an interface for the described rendering algorithm and map the algorithm to the modern GPU hardware capabilities.
The 3D rendering algorithm is embarrassingly parallel and shows parallelism at multiple levels. The largest source of parallelism comes from the data and control independence of the processing elements: vertices are independent of each other, triangles are mostly independent (except for transparent surfaces) and fragments from the same triangle are independent.
GPUs exploit three forms of parallelism: the pipeline is divided into hundreds of single cycle stages to increase the throughput and the GPU clock frequency (pipeline parallelism); the pipeline stages are replicated to process in parallel multiple vertices, triangles and fragments (data parallelism); and independent instructions in a shader program may be executed in parallel (instruction level parallelism).
We will now briefly describe the ATTILA implementation of the 3D rendering pipeline. We have blended techniques and ideas from different vendors and publications, and we have made educated guesses in those areas where information was especially scarce. Our implementation correlates in most aspects with current real GPUs.
Attila Architecture (Unified Shader Model)
The ATTILA architecture supports both hard partitioning of vertex and fragment shaders (the norm in current GPUs) and a unified shader model. Figure 2 shows the ATTILA GPU graphics pipeline for the unified shader model. The input and output processing elements, the bandwidth and the latency of the different ATTILA stages can be found in Table 1. Table 2 shows the sizes of some of the input queues in those stages and the number of threads supported in the vertex and fragment/unified shader units. The diagram and the table data correspond to a reference architecture implementing 4 vertex shaders (non-unified), 2 shader units (fragment or unified), 2 ROPs and 4 64-bit DDR channels.
Two GPU units are not shown in Figure 2: the Command Processor, which controls the whole pipeline by processing the commands received from the system main processor, and the DAC unit, which consumes bandwidth for screen refreshes and outputs the rendered frames to a file. The Streamer unit reads streams of vertex input attributes from GPU or system memory and feeds them to a pool of vertex or unified shader units (Figure 2). The Streamer also supports an indexed mode that allows vertices that have already been shaded and stored in a small post-shading cache to be reused. After shading, the Primitive Assembly stage converts the shaded vertices into triangles and the Clipper stage performs a trivial triangle rejection test.
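The effect of the post-shading cache in indexed mode can be illustrated with a minimal sketch. The cache size (16 entries), the FIFO replacement policy and the function name `shade_indexed` are assumptions made for illustration only; the text does not specify how the ATTILA Streamer manages its cache.

```python
from collections import OrderedDict

def shade_indexed(indices, vertices, shade, cache_size=16):
    """Shade an indexed vertex stream, reusing results held in a small cache.

    cache_size and the FIFO eviction policy are illustrative assumptions.
    Returns the shaded vertex stream and the number of shader invocations.
    """
    cache = OrderedDict()          # vertex index -> shaded vertex, oldest first
    shaded_stream = []
    shader_runs = 0
    for idx in indices:
        if idx not in cache:
            if len(cache) >= cache_size:
                cache.popitem(last=False)   # evict the oldest entry
            cache[idx] = shade(vertices[idx])
            shader_runs += 1
        shaded_stream.append(cache[idx])
    return shaded_stream, shader_runs

# Two triangles sharing an edge: six indices, but only four distinct vertices.
quad_vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
quad_indices = [0, 1, 2, 2, 1, 3]
stream, runs = shade_indexed(quad_indices, quad_vertices,
                             lambda v: (2.0 * v[0], 2.0 * v[1]))
```

In this example the two shared vertices are shaded once and then served from the cache, so the shader runs four times for six output vertices.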
The rasterizer stages generate fragments from the input triangles. The rasterization algorithm is based on the 2D Homogeneous rasterization algorithm which allows for unclipped triangles to be rasterized. The Triangle Setup stage calculates the triangle edge equations and a depth interpolation equation while the Fragment Generator stage traverses the whole triangle generating tiles of fragments. ATTILA supports two fragment generation algorithms: a tile based fragment scanner and a recursive algorithm.
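As a rough illustration of how edge equations drive fragment generation, the sketch below tests pixel centers against three edge functions. It is a simplified screen-space version, not the 2D homogeneous algorithm used by ATTILA, and it scans every pixel rather than traversing tiles; the names `edge` and `rasterize` are hypothetical.

```python
def edge(a, b, p):
    """Edge function: non-negative when p lies on or to the left of the edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, width, height):
    """Return the pixels whose centers are covered by a counter-clockwise triangle.

    Simplification: pixels exactly on an edge are always included, so no
    fill rule is applied to edges shared between triangles.
    """
    fragments = []
    for y in range(height):
        for x in range(width):
            center = (x + 0.5, y + 0.5)
            if (edge(v0, v1, center) >= 0 and
                    edge(v1, v2, center) >= 0 and
                    edge(v2, v0, center) >= 0):
                fragments.append((x, y))
    return fragments

covered = rasterize((0, 0), (4, 0), (0, 4), width=4, height=4)
```

A tile-based scanner, as described above, would evaluate the same edge functions but step through the triangle one tile of fragments at a time.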
After fragment generation, a Hierarchical Z (HZ) buffer is used to remove non-visible fragment tiles at a fast rate without accessing GPU memory. The HZ buffer is stored in on-chip memory and supports resolutions up to 4096x4096 (256 KB). The processing element for the next stages is the fragment quad, a tile of 2x2 fragments. Most modern GPUs use this working unit for memory locality and for computing the texture level of detail (LOD) in the Texture Unit.
The Z and Stencil test stage removes non-visible fragments as early as possible, thereby reducing the computational load on the fragment shaders. Figure 2 shows the datapath for early fragment rejection. However, another path exists to perform the tests after fragment shading. ATTILA supports a single depth and stencil buffer mode: 8 bits for stencil and 24 bits for depth. The Z and Stencil test unit implements a 16 KB, 64-line, 4-way set associative cache. The cache supports fast depth/stencil buffer clear and depth compression. The architecture is derived from the methods described for ATI GPUs.
The Interpolator unit uses perspective-corrected linear interpolation to generate the fragment attributes from the triangle attributes. However, other implementations may interpolate the fragment attributes in the Fragment Shader. The interpolated fragment quads are fed into the fragment or unified shader pool. The Texture Unit attached to each fragment or unified shader supports n-dimensional and cubemap textures, mipmapping, and bilinear, trilinear and anisotropic filtering. The Texture Cache is configured as a 16 KB, 64-line, 4-way set associative cache. Relatively small texture caches are known to work well. Compressed textures are also supported.
The basic architecture of the Color Write stage is similar to that of the Z and Stencil test stage, although color compression may not be supported.
The Memory Controller interfaces with the ATTILA memory and the main computer memory system. The ATTILA memory interface simulates a simplified (G)DDR memory; banks are not simulated. The memory access unit is a 64-byte transaction: an 8-cycle transfer over a 64-bit channel. The number of channels and the channel interleaving are configurable. Read-to-write and write-to-read penalties are implemented. A number of queues and dedicated buses form a complex crossbar that services the memory requests from the different GPU stages.
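The 256 KB figure can be checked with a short calculation. The tile size (8x8 fragments) and the entry width (8 bits per tile) used below are assumptions: they are one combination that reproduces the stated size, and the text does not say which values ATTILA actually uses. The rejection test is likewise a generic conservative sketch for a "less than" depth function.

```python
def hz_buffer_bytes(width, height, tile_size=8, bits_per_tile=8):
    """On-chip storage needed for one depth entry per tile_size x tile_size tile.

    tile_size and bits_per_tile are illustrative assumptions.
    """
    tiles = (width // tile_size) * (height // tile_size)
    return tiles * bits_per_tile // 8

def hz_reject(tile_min_depth, stored_max_depth):
    """Conservative tile rejection for a 'less than' depth test.

    If the nearest fragment of the incoming tile is still farther than the
    farthest depth already stored for that tile, no fragment can pass.
    """
    return tile_min_depth > stored_max_depth

size = hz_buffer_bytes(4096, 4096)   # 512 x 512 tiles, one byte each
```

Under these assumptions a 4096x4096 framebuffer needs 262144 one-byte entries, which matches the 256 KB quoted above.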
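The cache parameters quoted above imply a line size and a number of sets, which the following sketch derives. The helper name `cache_geometry` is hypothetical, and the observation about tile shape is an inference, not something the text states.

```python
def cache_geometry(size_bytes, lines, ways):
    """Derive the line size and number of sets of a set associative cache."""
    line_bytes = size_bytes // lines
    sets = lines // ways
    return line_bytes, sets

line_bytes, sets = cache_geometry(16 * 1024, lines=64, ways=4)

# A 24-bit depth value plus an 8-bit stencil value packs into 4 bytes.
BYTES_PER_FRAGMENT = (24 + 8) // 8
fragments_per_line = line_bytes // BYTES_PER_FRAGMENT
```

Each line therefore holds 256 bytes, or 64 depth/stencil values, spread over 16 sets. A 64-fragment line would fit an 8x8 fragment tile, although the actual tile shape is an assumption here.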
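Perspective-corrected interpolation can be sketched as follows, using the standard formulation in which each attribute is divided by the vertex w coordinate before linear interpolation in screen space. The function name and argument layout are illustrative and do not describe the ATTILA Interpolator hardware.

```python
def interpolate(bary, attrs, ws):
    """Perspective-correct interpolation of one scalar attribute.

    bary  -- screen-space barycentric weights of the fragment (sum to 1)
    attrs -- attribute value at each of the three vertices
    ws    -- clip-space w coordinate of each vertex
    """
    numerator = sum(b * a / w for b, a, w in zip(bary, attrs, ws))
    denominator = sum(b / w for b, w in zip(bary, ws))
    return numerator / denominator

# With equal w the result is plain linear interpolation.
flat = interpolate((0.5, 0.5, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0))
# When the second vertex is three times farther away, it contributes less.
tilted = interpolate((0.5, 0.5, 0.0), (0.0, 1.0, 0.0), (1.0, 3.0, 1.0))
```

Halfway across the screen-space edge, the flat case yields 0.5 while the tilted case yields about 0.25, because the midpoint on screen lies closer to the near vertex in object space.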
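The transaction size follows directly from the channel width and the transfer length, and a simple interleaving function shows how addresses might be spread over channels. The interleaving granularity of 256 bytes and the modulo mapping are assumptions chosen for illustration; in ATTILA both the number of channels and the interleaving are configurable.

```python
def transaction_bytes(channel_bits=64, cycles=8):
    """Bytes moved by one transaction: channel width in bytes times cycles."""
    return channel_bits // 8 * cycles

def channel_for(address, channels=4, interleave_bytes=256):
    """Select a memory channel for an address.

    interleave_bytes and the modulo mapping are illustrative assumptions.
    """
    return (address // interleave_bytes) % channels

size = transaction_bytes()                              # 8 bytes x 8 cycles
sequence = [channel_for(a) for a in range(0, 1280, 256)]
```

With four channels, consecutive 256-byte blocks rotate through channels 0, 1, 2 and 3 before wrapping back to channel 0.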
Our shader architecture uses as a base the OpenGL ARB specifications for vertex and fragment shader programs.
The ARB vertex and fragment program specifications define assembly-like instructions that describe how the vertex and fragment output registers are calculated from per-vertex and per-fragment input registers and a set of per-batch constant parameters. There are four register banks (as shown in Figure 3): the input register bank, which is read only, stores the vertex and fragment input attributes; the output register bank, which is write only, stores the vertex and fragment output attributes; the temporary register bank, which supports both reading and writing, stores intermediate values; and the constant parameter bank stores parameters that are constant for a whole batch. A shader register is a 4-component vector of 32-bit floating-point values, which limits the ARB shader program models to floating-point data. The programming model does not support any kind of execution flow control. The ARB shader program models are quite limited, and our architecture will only be able to go beyond them once our OpenGL library implements support for a glSlang (a high level shader language, or HLSL) compiler.
The glSlang programming language virtualizes all the hardware resources available to the shader, tasking the compiler and optimizer with fitting the required resources into those available in the target architecture. Current glSlang implementations for modern GPUs, such as those from ATI and NVidia, are allowed to fail when programs require resources beyond the available hardware resources. The glSlang shader language is loosely based on C syntax, with additional data types and operations (for example SIMD data types and operations) that are better suited to shader processing. Loops, subroutine calls and conditional statements are supported, as expected, but architectural support may be missing in current GPUs. This is the case for our current GPU architecture, which only supports ‘static’ (constant based) branching and code replication for constant loops. We plan to add support for true branching in the next iteration of our shader architecture.
The ARB instructions are defined as an opcode, a destination operand and up to three source operands. The source operands support full swizzling of their 4 components, plus negation and absolute value modifiers. The destination operand supports full swizzling and masking of the instruction result. There are two main types of operations, performing scalar or vectorial (SIMD) computations. The vectorial operations supported are: addition (ADD), compare (CMP), dot product (DP3, DP4, DPH), distance vector (DST), floor (FLR), fraction (FRC), compute light coefficients (LIT), linear interpolation (LRP), multiply and add (MAD), maximum (MAX), minimum (MIN), move (MOV), multiplication (MUL), set on greater or equal (SGE), set on less than (SLT) and subtract (SUB). The scalar operations supported are: cosine (COS), exponential base 2 (EX2), logarithm base 2 (LG2), exponentiate (POW), reciprocal (RCP), reciprocal square root (RSQ) and sine (SIN). All of them can be implemented with a 4-component SIMD ALU plus a special ALU for some of the scalar operations, such as the RCP and RSQ instructions.
There are a few differences between the vertex and fragment program specifications. Fragment programs can access texture data with the TEX, TXB and TXP instructions, while vertex programs cannot. In our architecture, texture instructions use the SIMD ALU for the texture address computation; the texture request is then issued to the Texture Unit, which accesses the Texture Cache and memory and filters the sampled texels (as described in section 2). Fragment programs also define a KIL instruction, which ‘stops’ the processing of a fragment by marking it to be culled. Texture and KIL instructions use vectorial operands. An additional instruction modifier, _SAT, is defined only for fragment programs to inexpensively implement the required clamping (to the [0, 1] range) of color result values.
Our unified shader architecture implements a superset of both the vertex and fragment program models; however, our current OpenGL library limits us to the ARB vertex and fragment program capabilities. Future support for glSlang programs will allow all our additional shader capabilities (for example vertex texturing) to be used. The unification of the vertex and fragment programming models is a target for future APIs (for example Shader Model 4.0 in Direct3D and OpenGL glSlang) and GPU architectures. Our current legacy support for a non-unified shader pipeline is implemented by restricting a unified shader unit to work like the vertex shader unit of a current GPU. Shader unification not only creates a coherent programming model for both fragment and vertex processing but also simplifies the architecture design and allows better use of the shading hardware, as more shader units can be allocated to process vertices or fragments as the workload balance changes from batch to batch.
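The operand modifiers can be made concrete with a small software model of a 4-component register. This is only a sketch: the helper names are hypothetical, the order in which the absolute value and negation modifiers are applied (absolute value first) is an assumption, and only the destination write mask is modeled, not destination swizzling.

```python
COMPONENTS = "xyzw"

def read(reg, swizzle="xyzw", negate=False, absolute=False):
    """Read a source operand, applying swizzle, then absolute value, then negation."""
    values = [reg[COMPONENTS.index(c)] for c in swizzle]
    if absolute:
        values = [abs(v) for v in values]
    if negate:
        values = [-v for v in values]
    return values

def write(dest, result, mask="xyzw"):
    """Write only the components named in the mask; the others keep their old value."""
    return [result[i] if c in mask else dest[i] for i, c in enumerate(COMPONENTS)]

def mad(a, b, c):
    """MAD: component-wise a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]

def dp3(a, b):
    """DP3: 3-component dot product, replicated to all four result components."""
    dot = sum(x * y for x, y in zip(a[:3], b[:3]))
    return [dot] * 4

r0 = [1.0, 2.0, 3.0, 4.0]
r1 = write([0.0, 0.0, 0.0, 0.0], mad(read(r0, "wzyx"), [2.0] * 4, [1.0] * 4), mask="xz")
```

Here `r1` receives the MAD result only in its x and z components; the y and w components keep their previous values because of the write mask.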
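The two fragment-only features can be modeled in a few lines. The sketch follows the ARB fragment program semantics, where the kill instruction (mnemonic KIL) discards the fragment if any component of its operand is negative and _SAT clamps each result component to [0, 1]; the function names are hypothetical.

```python
def saturate(vector):
    """_SAT modifier: clamp every component of a result to the [0, 1] range."""
    return [min(max(c, 0.0), 1.0) for c in vector]

def kil(vector):
    """KIL: the fragment is marked as culled if any operand component is negative."""
    return any(c < 0.0 for c in vector)

color = saturate([1.5, 0.25, -0.1, 1.0])
culled = kil([0.3, -0.01, 0.0, 1.0])
```

In this example the out-of-range color components are clamped to 1.0 and 0.0, and the fragment is culled because one operand component is below zero.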
The Shader unit works on groups of four threads (each thread corresponding with a vertex or fragment input) because of a requirement of fragment processing (texture lod derivative computation). The same instruction for the 4 threads in a group is fetched and sent to the decode stage. A group of threads may be ready (fetch allowed), blocked (no fetch allowed) or finished (waiting for the thread results to be sent to the next rendering pipeline stage).
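The grouping of threads and the three group states can be sketched as below. The round-robin selection policy and all names are assumptions for illustration; the text does not describe how the fetch stage chooses among ready groups.

```python
from enum import Enum

class GroupState(Enum):
    READY = "ready"          # fetch allowed
    BLOCKED = "blocked"      # no fetch allowed
    FINISHED = "finished"    # waiting for results to be sent downstream

GROUP_SIZE = 4

def make_groups(num_inputs):
    """Pack shader inputs into groups of four threads (a 2x2 quad for fragments)."""
    return [list(range(start, min(start + GROUP_SIZE, num_inputs)))
            for start in range(0, num_inputs, GROUP_SIZE)]

def next_group_to_fetch(states, last):
    """Round-robin search for the next READY group after index `last`.

    Returns None when no group is allowed to fetch. The policy is an assumption.
    """
    count = len(states)
    for step in range(1, count + 1):
        candidate = (last + step) % count
        if states[candidate] is GroupState.READY:
            return candidate
    return None

states = [GroupState.READY, GroupState.BLOCKED, GroupState.READY, GroupState.FINISHED]
chosen = next_group_to_fetch(states, last=0)
```

Starting after group 0, the blocked group 1 is skipped and group 2 is selected for the next fetch.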
Our shader architecture supports the fetch and execution of a configurable number of instructions (the instruction way) per cycle and shader execution thread. The current implementation does not discriminate between the SIMD and special operation ALUs, and both are considered replicated for an n-way configuration. Texture instructions are limited to one per execution thread and cycle, and the shader thread is always blocked after the texture request is issued to the Texture Unit, in anticipation of a large memory latency. The shader instruction decoder detects dependencies and conflicts when accessing the register bank ports and requests the shader fetch unit to refetch instructions that stall the pipeline.
The shader execution pipeline consists of the following single-cycle stages: a fetch stage; a decode stage; a register read stage; a variable, instruction-dependent number of execution stages, ranging from 1 to 9; and a register write stage. Instructions are always fetched in order. Separate hardware pipelines receive the shader inputs (vertex and fragment input attributes) and send the shader results (vertex and fragment output attributes) to the next rendering stages. The instructions are fetched from a small instruction memory (no more than 512 instructions) into which shader programs are explicitly loaded before batch rendering starts. Shader program length limitations will be removed in the future by implementing an instruction cache that transparently reads the instructions from memory.
The high level of parallelism inherent in shader processing (all processing elements are always independent) is exploited by using multithreading to hide texture (memory) access latency, with up to 256 threads currently configured in our architecture (non-unified vertex shaders implement only a few threads, to hide instruction execution latency, as they do not support texture access). In the future we will implement per-batch (static) or even dynamic (register renaming) allocation of temporary registers from a single physical register file to each shader thread. The number of threads in execution will then change as the shader program requirements for live temporary registers change. However, in the experiments presented in the next section, most fragment shaders do not require more than four live registers, rather than the full 12+ temporary registers required by the ARB specification, which keeps the hardware requirements (2048 registers) in line with what could be implemented.
The architectures of the shader units in current GPUs vary widely, even though they implement a similar set of instructions. Fragment and vertex shaders can be quite different in the number and arrangement of ALUs, the number of supported threads, the support for branching or loops and the access to memory (textures). Shaders from different companies are also quite different, and their true architectures and limitations are never fully disclosed. One characteristic that they share, and that our architecture does not fully support yet, is the capability of launching multiple instructions per cycle and execution thread to different ALUs, similar to how a VLIW processor would. Available information puts the number of different ARB-like instructions that can be launched at a time at as many as 5 or 6. That is possible thanks to ALUs arranged in cascade, multiple paths for texture and scalar instructions, and special SIMD ALUs that support split vector inputs for two different operations.
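The stage list above implies a range of instruction latencies, which the following sketch computes. It assumes an instruction passes through every stage exactly once with no stalls or refetches; the function name is hypothetical.

```python
def instruction_latency(execution_stages):
    """Cycles from fetch to register write for one instruction, without stalls.

    One cycle each for fetch, decode and register read, then the
    instruction-dependent execution stages, then one register write cycle.
    """
    if not 1 <= execution_stages <= 9:
        raise ValueError("execution stages must be between 1 and 9")
    front_end = 3      # fetch, decode, register read
    write_back = 1     # register write
    return front_end + execution_stages + write_back

shortest = instruction_latency(1)
longest = instruction_latency(9)
```

Under this assumption an instruction takes between 5 and 13 cycles from fetch to write-back, which gives an idea of the latency that a vertex shader's few threads need to cover.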
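The relationship between live registers and the number of threads that can be kept in flight can be sketched with a simple division. The sketch assumes that the 2048 registers form a single physical register file shared by the threads of one shader unit, and that the 256 thread limit is a hard cap; the text states neither explicitly, so both are assumptions.

```python
def threads_supported(physical_registers, live_registers, max_threads=256):
    """Threads that fit in a shared register file, capped by the thread limit.

    Assumes every thread needs the same number of live temporary registers.
    """
    return min(max_threads, physical_registers // live_registers)

light_shader = threads_supported(2048, live_registers=4)    # hits the thread cap
heavy_shader = threads_supported(2048, live_registers=12)   # limited by registers
```

Under these assumptions a shader with four live registers can run the full 256 threads, while one that needs all 12 temporary registers would be limited to 170 threads.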
[1] General-Purpose Computation Using Graphics Hardware. http://www.gpgpu.org/
[2] K. Fatahalian, J. Sugerman, P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, August 29-30, 2004, Grenoble, France.
[3] Kurt Akeley, Pat Hanrahan. Real-Time Graphics Architecture. Stanford University CS448A, Fall 2001.
[4] Microsoft Meltdown 2003, DirectX Next slides.
Attila architecture parameters
The described architecture has been modeled with the highly configurable Attila simulator, which allows each of a hundred architectural parameters to be edited. See the following document: