ATTILA configuration parameters
The ATTILA configuration file is named bGPU.ini and includes configuration parameters to control the simulation process, the gathering of statistics, the generation of images or other outputs, and the configuration of the simulated GPU architecture. The same configuration file is used for both the non-unified and unified simulator binaries. The configuration file must be present in the working directory where the simulator binary is started.
The ATTILA configuration file is divided into sections. Each section starts with the section name in brackets '['/']' and is followed by a list of parameter names and their associated values. The parameters can be of one of three types: natural numbers (0 to N), boolean values (using the TRUE and FALSE keywords) and strings (between quotes '"'). The character '#' can be used to include comments in the configuration file. All the characters after a '#' character are ignored by the configuration parser.
Example:
[SECTION NAME]
parameter1 = 1235
parameter2 = TRUE
parameter3 = "output.txt"
Due to the primitive parameter reading capabilities of the ConfigurationLoader class, each section can only appear once in the file, and the parameters for a section can only appear between the start of that section and the start of the next section (or the end of the configuration file).
There are no predefined values for most of the parameters in the configuration file, so a parameter that is not present takes as its value whatever the associated memory contains when the parameters are first read at start up (likely 0).
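As an illustration of the file format described above, the following Python sketch parses sections, the three value types and '#' comments. It is not the ConfigurationLoader implementation: the function name parse_attila_config and the choice to reject repeated sections and unparseable values are assumptions made for this example.

```python
import re


def parse_attila_config(text):
    """Parse bGPU.ini-style text into {section: {parameter: value}}.

    Hypothetical helper for illustration; the real ConfigurationLoader
    class may behave differently.
    """
    config = {}
    section = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Every character after '#' is ignored (even inside quotes).
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        header = re.fullmatch(r"\[\s*([^\]]+?)\s*\]", line)
        if header:
            section = header.group(1)
            if section in config:
                # Assumption: report repeated sections instead of merging them.
                raise ValueError(f"line {lineno}: section {section!r} repeated")
            config[section] = {}
            continue
        if section is None or "=" not in line:
            raise ValueError(f"line {lineno}: unexpected text {line!r}")
        name, value = (part.strip() for part in line.split("=", 1))
        if value in ("TRUE", "FALSE"):
            config[section][name] = value == "TRUE"      # boolean
        elif value.isdecimal():
            config[section][name] = int(value)           # natural number
        elif len(value) >= 2 and value[0] == value[-1] == '"':
            config[section][name] = value[1:-1]          # quoted string
        else:
            raise ValueError(f"line {lineno}: cannot parse value {value!r}")
    return config


SAMPLE = """
[SECTION NAME]
parameter1 = 1235          # natural number
parameter2 = TRUE          # boolean
parameter3 = "output.txt"  # string
"""

if __name__ == "__main__":
    print(parse_attila_config(SAMPLE))
```

Because missing parameters have no predefined value in the simulator, a wrapper around such a parser could also check that every expected parameter is present before a run is started.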
The current version (ATTILA rei) of the simulator supports the following sections in the configuration file:
- SIMULATOR : parameters related to the simulator functionality and output.
- GPU : parameters related to the global configuration of the ATTILA GPU architecture.
- COMMANDPROCESSOR : parameters related to the configuration of the Command Processor unit/stage in the ATTILA GPU architecture.
- MEMORYCONTROLLER : parameters related to the configuration of the Memory Controller unit/stage in the ATTILA GPU architecture.
- STREAMER : parameters related to the configuration of the Streamer unit/stage in the ATTILA GPU architecture.
- VERTEXSHADER : parameters related to the configuration of the Vertex Shader unit/stage in the ATTILA GPU architecture.
- PRIMITIVEASSEMBLY : parameters related to the configuration of the Primitive Assembly unit/stage in the ATTILA GPU architecture.
- CLIPPER : parameters related to the configuration of the Clipper unit/stage in the ATTILA GPU architecture.
- RASTERIZER : parameters related to the configuration of the Rasterizer units/stages in the ATTILA GPU architecture.
- FRAGMENTSHADER : parameters related to the configuration of the Fragment Shader unit/stage in the ATTILA GPU architecture.
- ZSTENCILTEST : parameters related to the configuration of the Z and Stencil Test unit/stage in the ATTILA GPU architecture.
- COLORWRITE : parameters related to the configuration of the Color Write unit/stage in the ATTILA GPU architecture.
- DAC : parameters related to the configuration of the DAC unit/stage in the ATTILA GPU architecture.
Additionally, a sample of a reference baseline configuration for the ATTILA Architecture is included.
SIMULATOR Section
The SIMULATOR section is used to configure the simulation process and the different outputs of the simulator, for example the generation of statistics or the signal traffic dump trace.
The parameters that can be used in the SIMULATOR section are:
- InputFile
- Type : string
- Description : Name and path of the input OpenGL trace file. The equivalent command line parameter overrides the value in the configuration file.
- Note : In the current version of the simulator the path is not used to search the MemoryRegions.dat and BufferDescriptors.dat files that may be associated with the input tracefile so they must reside in the simulation working directory.
- SimCycles
- Type : number
- Description : Number of cycles to simulate. The equivalent command line parameter overrides the value in the configuration file.
- SimFrames
- Type : number
- Description : Number of frames to simulate. The equivalent command line parameter overrides the value in the configuration file.
- Note: When either of the SimCycles and SimFrames are non zero the simulation will end only when both conditions are true.
- SignalDumpFile
- Type : string
- Description : Name of the signal traffic dump trace file.
- StatsFile
- Type : string
- Description : Name of the statistics file (per cycle rate).
- StatsFilePerFrame
- Type : string
- Description : Name of the statistics file (per frame).
- StatsFilePerBatch
- Type : string
- Description : Name of the statistics file (per batch).
- StartFrame
- Type : number
- Description : Frame of the input OpenGL trace file at which simulation (rendering) will start. For all preceding frames, data that is meant to be resident in GPU memory will be transferred to the GPU memory but no rendering operations will be performed. The equivalent command line parameter overrides the value in the configuration file.
- StartSignalDump
- Type : number
- Description : Cycle at which the signal traffic dump starts.
- SignalDumpCycles
- Type : number
- Description : Number of cycles that the signal traffic will be dumped counting from the start cycle.
- StatisticsRate
- Type : number
- Description : Rate at which statistics are generated. This parameter represents the number of simulated cycles between outputs to the statistics file.
- Note: Statistics are 'averaged' or aggregated over the number of cycles determined by this parameter.
- DumpSignalTrace
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the output of the signal traffic dump file.
- Statistics
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the output of the per cycle period statistics file.
- StatisticsPerFrame
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the output of the per frame statistics file.
- StatisticsPerBatch
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the output of the per batch statistics file.
- GenerateFragmentMap
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the output of the fragment/quad map.
- Note : The fragment/quad map is a 2D graphic frame in PPM format that outputs a given property (color, depth complexity, latency) for each quad (2x2 fragment tile) of the final framebuffer. This map is generated for each frame and is dumped at the same time as the color buffer frame file.
- FragmentMapMode
- Type : number
- Description : The property or information that is generated for the fragment/quad map.
- Codes :
- 0 => Color (color of the first fragment of the last fragment quad written over the color buffer pixel quad).
- 1 => Overdraw (number of fragment quads that have been written over a color buffer pixel quad).
- 2 => Latency of the fragment/quad from when it is generated at rasterization until it is written or blended with the Color Buffer. Information for the last fragment/quad that updates a given framebuffer position.
- 3 => Latency of the fragment/quad from when it enters the shader unit until it exits the shader unit. Information for the last fragment/quad that updates a given framebuffer position.
- DoubleBuffer
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) a different color buffer for the backbuffer.
- ForceMSAA
- Type : boolean
- Description : Forces the ATTILA Driver to use multisampling antialiasing.
- MSAASamples
- Type : number
- Description : Sets the number of multisampling samples to use when MSAA is forced by the driver.
- ForceFP16ColorBuffer
- Type : boolean
- Description : Forces the ATTILA Driver to set the color buffer format to the 16-bit floating point format (GPU_RGBA16F). The default is the 8-bit normalized format (GPU_RGBA8888).
- ObjectSize0
- Type : number
- Description : Size in bytes of an element in the first bucket of the simulator memory manager.
- Note : To speed up the allocation of dynamic objects (such as the ones that are sent through signals from one pipeline stage to another) the ATTILA simulator uses a fast allocation and deallocation memory manager. The manager is quite simple and the allocated objects must fit in a given bucket size. To reduce the amount of wasted memory up to three buckets are supported (only two with the fastest implementation). The object sizes of the memory manager bucket elements must be tailored to the sizes of the objects that are dynamically created and destroyed in the simulator code.
- BucketSize0
- Type : number
- Description : Number of elements in the first bucket of the simulator memory manager.
- Note : The simulator fast memory manager is quite simple and the total memory space for a given bucket is statically allocated at start time. Dynamic allocation of the bucket memory is not implemented. The number of elements in a bucket is determined by the maximum number of dynamic objects that can be active for a given bucket element size at any time in the simulated ATTILA GPU pipeline. When the number of GPU units and stages, queue sizes or pipeline stages grows this number must also grow.
- ObjectSize1
- Type : number
- Description : Size in bytes of an element in the second bucket of the simulator memory manager.
- Note : For the fastest memory manager the element size of the first bucket must be smaller than the element size of the second bucket.
- BucketSize1
- Type : number
- Description : Number of elements in the second bucket of the simulator memory manager.
- ObjectSize2
- Type : number
- Description : Size in bytes of an element in the third bucket of the simulator memory manager.
- Note : The third bucket is only used when the fastest memory manager is disabled. The fastest memory manager is enabled by default in the simulator code. To disable it, comment out the definition of the 'FAST_NEW_DELETE' constant in the 'support/OptimizedDynamicMemory.cpp' file.
- BucketSize2
- Type : number
- Description : Number of elements in the third bucket of the simulator memory manager.
- UseACD
- Type : boolean
- Description : When set to TRUE the new (experimental) ATTILA OpenGL library is used to translate OpenGL traces to ATTILA GPU commands.
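The role of the ObjectSizeN and BucketSizeN parameters can be sketched with a small Python model of a bucket allocator. This is only an illustration of the idea described in the notes above, not the code in 'support/OptimizedDynamicMemory.cpp': the class name, the example sizes, and the decision to fail (rather than fall through to a larger bucket) when the best fitting bucket is full are all assumptions.

```python
class BucketAllocatorSketch:
    """Toy model of a statically sized, bucket based memory manager."""

    def __init__(self, buckets):
        # buckets: list of (ObjectSizeN, BucketSizeN) pairs, smallest first.
        self.buckets = sorted(buckets)
        # All slots exist from start up; nothing is grown dynamically.
        self.free = [list(range(count)) for _, count in self.buckets]

    def static_bytes(self):
        """Memory reserved at start time for all buckets."""
        return sum(size * count for size, count in self.buckets)

    def allocate(self, nbytes):
        """Return a (bucket index, slot) handle for an object of nbytes."""
        for index, (size, _) in enumerate(self.buckets):
            if nbytes <= size:
                if not self.free[index]:
                    # BucketSizeN was too small for the simulated pipeline.
                    raise MemoryError(f"bucket {index} exhausted")
                return index, self.free[index].pop()
        raise ValueError(f"no bucket can hold {nbytes} bytes")

    def release(self, handle):
        index, slot = handle
        self.free[index].append(slot)


if __name__ == "__main__":
    # Hypothetical values: ObjectSize0=64, BucketSize0=4, ObjectSize1=512, BucketSize1=2.
    pool = BucketAllocatorSketch([(64, 4), (512, 2)])
    print(pool.static_bytes())
    handle = pool.allocate(48)
    print(handle[0])
    pool.release(handle)
```

The model shows why both values matter: an object larger than every ObjectSizeN cannot be allocated at all, and a BucketSizeN smaller than the number of simultaneously live objects exhausts the bucket.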
GPU Section
The GPU section is used to configure global parameters for the simulated ATTILA GPU architecture.
The parameters that can be used in the GPU section are:
- NumVertexShaders
- Type : number
- Description : Number of instances of the vertex shader unit/stage in the simulated ATTILA GPU architecture.
- Note : For the unified ATTILA architecture this number only affects the number of vertices that can be received per cycle by the pre shading vertex queue in the Fragment FIFO stage of the ATTILA GPU architecture.
- NumFragmentShaders
- Type : number
- Description : Number of instances of the fragment or unified shader unit/stage in the simulated ATTILA GPU architecture.
- NumStampPipes
- Type : number
- Description : Number of instances of the Z and Stencil Test and Color Write unit/stages in the simulated ATTILA GPU architecture.
COMMANDPROCESSOR Section
The COMMANDPROCESSOR section is used to configure the Command Processor stage of the simulated ATTILA GPU architecture.
The parameter that can be used in the COMMANDPROCESSOR section is:
- PipelinedBatchRendering
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the pipelining of up to two rendering batches (draw commands) in the ATTILA GPU pipeline with the update of register state in the ATTILA GPU stages and memory transfers from system memory to GPU memory.
- Note : The current implementation supports two rendering batches working concurrently. The first batch must have completely finished processing in the geometry stages of the pipeline (up to the Clipper stage) before the second can start processing in the geometry stages, and the second batch cannot start processing in the triangle and fragment stages of the pipeline until the first rendering batch finishes. Updates to the GPU units/stages can be stored while both batches are being processed, and the updates will be correctly issued later to the different stages before the next rendering batch requires processing in the corresponding stages. The number of register updates is limited to the value assigned to the MAX_REGISTER_UPDATES constant in the 'CommandProcessor.h' file. Memory transfers between system and GPU memory (which in the current implementation are simulated by the Command Processor) can also happen in parallel with batch rendering as long as the transfer is not dependent on previous batches or operations (dependent transactions must be marked as 'locked' using the 'locked' parameter/attribute in the AGP Transaction class).
MEMORYCONTROLLER Section
The MEMORYCONTROLLER section configures the Memory Controller unit/stage of the ATTILA GPU architecture.
The MEMORYCONTROLLER section also allows selecting a new memory controller called Memory Controller V2 (set MemoryControllerV2 = TRUE to select it). Most of the parameters described here are ignored when using Memory Controller V2, which uses its own specific parameters; their description can be found here: Memory Controller V2 parameters description. The only parameters shared by both memory controllers are those that define the bus widths between the memory controller and the GPU units.
The parameters that can be used in the MEMORYCONTROLLER section are:
- MemorySize
- Type : number
- Description : Size of the GPU local GDDR memory in bytes.
- MemoryClockMultiplier
- Type : number
- Description : Not implemented.
- MemoryFrequency
- Type : number
- Description : Not implemented.
- MemoryBusWidth
- Type : number
- Description : Ignored/not implemented.
- MemoryBuses
- Type : number
- Description : Number of 64-bit (two chips working in parallel) channels to the simulated GDDR memory.
- Note : In the current implementation this number is the only configurable parameter that can affect the maximum memory bandwidth provided to the simulated ATTILA GPU.
- Future Changes : This parameter should be renamed to 'MemoryChannels'.
- SharedBanks
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) an idealized model of the GDDR memory simulation in which all the memory can be accessed from all the channels.
- Note : This parameter corresponds with a legacy implementation that was not removed at the time but has not been tested in ages. No sane experiment would require this feature to be enabled.
- BankGranurality
- Type : number
- Description : Interleaving in bytes used to assign GPU physical 32-bit memory addresses to the different simulated GDDR channels. A value of 64 means, for example, that bits 0 to 5 of the physical address are offsets inside a 64 byte aligned line in a memory channel and bits 6 and up (determined by the log2 of the number of configured GDDR channels) determine the channel that is being accessed.
- Future Changes : This parameter should be renamed to 'Channel Interleaving'.
- BurstLength
- Type : number
- Description : Ignored/not implemented.
- ReadLatency
- Type : number
- Description : Number of GDDR command cycles of latency since a read command (burst) is sent (initiated) to the GDDR chip until the first data word is present in the data pins of the GDDR chip.
- Note : The simulated GDDR and GPU clock frequencies are the same in the current implementation. The value of this parameter is related with the model and frequency of the GDDR memory that is being simulated.
- WriteLatency
- Type : number
- Description : Number of GDDR command cycles of latency since a write command (burst) is sent (issued) to the GDDR chip until the data to write can change in the data pins of the GDDR chip.
- Note : The simulated GDDR and GPU clock frequencies are the same in the current implementation. The value of this parameter is related with the model and frequency of the GDDR memory that is being simulated.
- WriteToReadLatency
- Type : number
- Description : Minimum distance in GDDR command cycles between a write command and the next read command to the same channel. The value of this parameter is related with the model and frequency of the GDDR memory that is being simulated.
- Note : The simulated GDDR and GPU clock frequencies are the same in the current implementation. The value of this parameter is related with the model and frequency of the GDDR memory that is being simulated.
- MemoryPageSize
- Type : number
- Description : Size in bytes of a page in the simulated GDDR memory. In the current implementation this is twice the page size of a single chip, as it has to account for two GDDR chips working in lock step.
- Note : The value of this parameter is related with the model of the GDDR memory that is being simulated.
- OpenPages
- Type : number
- Description : Number of banks in the simulated GDDR memory. This value is applicable per memory channel as the two GDDR chips associated with each channel work in lock step and have the same pages open in their corresponding banks.
- Note : The value of this parameter is related with the model of the GDDR memory that is being simulated.
- PageOpenLatency
- Type : number
- Description : Minimum distance in GDDR command cycles since an open page command (active) has been issued to the GDDR chip until the next read or write command can be issued.
- Note : The simulated GDDR and GPU clock frequencies are the same in the current implementation. The value of this parameter is related with the model and frequency of the GDDR memory that is being simulated.
- MaxConsecutiveReads
- Type : number
- Description : Maximum number of consecutive read transactions that can be processed in a memory channel before starting the processing of a pending write transaction.
- Note : This parameter and the MaxConsecutiveWrites parameter control how much priority is given to read or write transactions in a memory channel. The current implementation of the Memory Controller has two transaction queues that serve the requests for the GDDR chips. One queue is for read transactions and the other is for write transactions.
- MaxConsecutiveWrites
- Type : number
- Description : Maximum number of consecutive write transactions that can be processed in a memory channel before starting the processing of a pending read transaction.
- Note : This parameter and the MaxConsecutiveReads parameter control how much priority is given to read or write transactions in a memory channel. The current implementation of the Memory Controller has two transaction queues that serve the requests for the GDDR chips. One queue is for read transactions and the other is for write transactions.
- CommandProcessorBusWidth
- Type : number
- Description : Bandwidth in bytes per cycle of the data bus between the Command Processor and the Memory Controller.
- Note : This bus is shared by read and write transfers, so they cannot happen concurrently. In the current implementation the Command Processor simulates the transfers (DMA?) between system memory and the GPU GDDR memory, so this data bus width parameter must be set to the bandwidth available through the AGP or PCIE connection to the CPU and system memory.
- StreamerFetchBusWidth
- Type : number
- Description : Bandwidth in bytes per cycle of the data bus between the Streamer Fetch unit/stage of the ATTILA GPU architecture and the Memory Controller.
- Note : This data bus is used to read vertex indices from system or GPU GDDR memory.
- StreamerLoaderBusWidth
- Type : number
- Description : Bandwidth in bytes per cycle of the data bus between the Streamer Loader unit/stage of the ATTILA GPU architecture and the Memory Controller.
- Note : This data bus is used to read vertex attribute data from system or GPU GDDR memory.
- ZStencilBusWidth
- Type : number
- Description : Bandwidth in bytes per cycle of the data bus between the Z and Stencil Test unit/stage of the ATTILA GPU architecture and the Memory Controller.
- Note : This data bus is used to transfer compressed and uncompressed z and stencil data from and to the system or GPU GDDR memory. As there is a single data bus, read and write transfers cannot happen in parallel. This data bus is dedicated per instance of the Z and Stencil Test unit/stage, so the actual number of data buses is the value assigned to the NumStampPipes parameter in the GPU section.
- ColorWriteBusWidth
- Type : number
- Description : Bandwidth in bytes per cycle of the data bus between the Color Write unit/stage of the ATTILA GPU architecture and the Memory Controller.
- Note : This data bus is used to transfer compressed and uncompressed color data from and to the system or GPU GDDR memory. As there is a single data bus, read and write transfers cannot happen in parallel. This data bus is dedicated per instance of the Color Write unit/stage, so the actual number of data buses is the value assigned to the NumStampPipes parameter in the GPU section.
- DACBusWidth
- Type : number
- Description : Bandwidth in bytes per cycle of the data bus between the DAC unit/stage of the ATTILA GPU architecture and the Memory Controller.
- Note : This bus is used to read color data from the system or GPU GDDR memory.
- TextureUnitBusWidth
- Type : number
- Description : Bandwidth in bytes per cycle of the data bus between the Texture Unit unit/stage of the ATTILA GPU architecture and the Memory Controller.
- Note : This bus is used to read compressed and uncompressed texture data from system or GPU GDDR memory. This data bus is dedicated per instance of the Texture Unit unit/stage, so the actual number of data buses is the product of the NumFragmentShaders parameter of the GPU section and the TextureUnits parameter of the FRAGMENTSHADER section.
- MappedMemorySize
- Type : number
- Description : Size in bytes of the system memory that can be accessed by the GPU.
- Note : This is the size of the system memory region allocated for the 'AGP aperture' (or related parameter for PCIE).
- ReadBufferLines
- Type : number
- Description : Number of buffer lines for read transactions pending to be serviced to the requesting unit/stage of the ATTILA GPU pipeline.
- Note : This number limits the number of pending read transactions in the Memory Controller. The size of a buffer line corresponds to the maximum transaction size (defined by the MAX_TRANSACTION_SIZE constant in MemoryController.h).
- WriteBufferLines
- Type : number
- Description : Number of buffer lines for pending write transactions received from the units/stages of the ATTILA GPU pipeline.
- Note : This number limits the number of pending write transactions in the Memory Controller. The size of a buffer line corresponds to the maximum transaction size (defined by the MAX_TRANSACTION_SIZE constant in MemoryController.h).
- RequestQueueSize
- Type : number
- Description : Size of the transaction request queue that stores the information for transactions received from the different units/stages of the ATTILA GPU pipeline.
- Note : This number corresponds to a 'global' transaction queue whose space is shared by all the memory channels, so a single channel can use all the entries in the queue.
- ServiceQueueSize
- Type : number
- Description : Size of the service queue that stores pending read transactions to the requesting units/stages of the ATTILA GPU architecture.
- Note : This number corresponds to the size of each queue in a set of distributed queues, one per unit/stage of the ATTILA GPU architecture. There are separate queues for the Command Processor, Streamer Fetch, Streamer Loader, Z and Stencil Test, Color Write, DAC and Texture Unit units/stages of the GPU pipeline. Instances of the same unit/stage share the queue.
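Two of the mechanisms described in the list above can be sketched in Python: the mapping of physical addresses to channels controlled by BankGranurality and MemoryBuses, and the read/write priority controlled by MaxConsecutiveReads and MaxConsecutiveWrites. Both functions are illustrative models, not the simulator code. The function names are hypothetical, and the arbitration sketch assumes that a channel starts with reads and switches only when the other queue has pending work or the current queue is empty.

```python
from collections import deque


def channel_for_address(address, bank_granurality, memory_buses):
    """Channel selected by a physical address under simple interleaving.

    With bank_granurality=64 the low 6 bits are the offset inside a line
    and the next log2(memory_buses) bits select the channel.
    """
    return (address // bank_granurality) % memory_buses


def drain_channel(reads, writes, max_consecutive_reads, max_consecutive_writes):
    """Order in which queued transactions of one channel would be served."""
    if max_consecutive_reads < 1 or max_consecutive_writes < 1:
        raise ValueError("limits must be at least 1 in this sketch")
    reads, writes = deque(reads), deque(writes)
    order = []
    serving_reads, streak = True, 0  # assumption: reads are served first
    while reads or writes:
        current, other = (reads, writes) if serving_reads else (writes, reads)
        limit = max_consecutive_reads if serving_reads else max_consecutive_writes
        # Switch direction when nothing is left in the current queue, or
        # when the streak limit is reached and the other queue is waiting.
        if not current or (streak >= limit and other):
            serving_reads, streak = not serving_reads, 0
            continue
        order.append(current.popleft())
        streak += 1
    return order


if __name__ == "__main__":
    # Hypothetical configuration: 4 channels, 64 byte interleaving.
    for address in (0, 63, 64, 128, 256):
        print(hex(address), channel_for_address(address, 64, 4))
    print(drain_channel(["r0", "r1", "r2"], ["w0", "w1"], 2, 1))
```

Larger values for one of the two limits bias the channel towards that kind of transaction, at the cost of longer waits for the other kind.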
The following parameters are only supported by the new Memory Controller model written by Carlos (not public release):
- MemoryControllerV2
- Type : boolean
- Description : When set to TRUE the new and more accurate Memory Controller model is used.
- V2MemoryChannels
- Type : number
- Description : Number of memory channels to GDDR memory. Each memory channel has an associated memory channel controller that can issue independent memory operations.
- V2BanksPerMemoryChannel
- Type : number
- Description : Number of memory banks per memory channel. This number is actually the number of banks in a given GDDR memory specification.
- V2MemoryRowSize
- Type : number
- Description : Size in bytes of a memory row (a GDDR memory page that is mapped to one of the GDDR banks).
- V2BurstElementsPerCycle
- Type : number
- Description : Number of burst elements that are transmitted through a memory channel in a GPU cycle. The size of a burst element is given by the GDDR specification (in GDDR3 it's 16 bytes).
- V2ChannelInterleaving
- Type : number
- Description : Defines the interleaving in bytes used to map physical memory addresses to memory channels.
- V2BankInterleaving
- Type : number
- Description : Defines the interleaving in bytes used to map physical memory addresses between bank pages of the same memory channel.
- V2SecondInterleaving
- Type : boolean
- Description : The new Memory Controller allows two different interleaving configurations to be defined for mapping physical memory addresses to memory channels and memory bank pages. Only the first interleaving configuration is used when this parameter is set to FALSE. When the parameter is set to TRUE the first interleaving configuration is used for vertex, color and depth data memory and the second interleaving configuration is used for texture data memory.
- V2SecondChannelInterleaving
- Type : number
- Description : Defines the interleaving in bytes used to map physical memory addresses to memory channels. Parameter for the second interleaving configuration.
- V2SecondBankInterleaving
- Type : number
- Description : Defines the interleaving in bytes used to map physical addresses between bank pages of the same memory channel. Parameter for the second interleaving configuration.
- V2MaxChannelTransactions
- Type : number
- Description :
- V2ChannelScheduler
- Type : number
- Description : Defines the scheduler to be used by the memory channel controllers. The new Memory Controller implements four schedulers:
- 0 => FIFO (single command queue)
- 1 => Read Write FIFO (one command queue for read operations and another for write operations)
- 2 => Per Bank Queue
- 3 => Per Bank Read and Write Queues
- V2PagePolicy
- Type : number
- Description : Defines the memory bank policy to use. A 0 means that the page in a bank is closed after the last access to the page is performed (close page policy). A 1 means that the page remains open until a different page forces to change the page (open page policy).
- V2PerfectMemory
- Type : boolean
- Description : When this parameter is set to TRUE the Memory Controller models a perfect memory (no latency and no restrictions) that provides the maximum theoretical bandwidth for the defined configuration.
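The four V2ChannelScheduler codes differ in how the commands of a memory channel are split into queues. The Python sketch below only illustrates that organisation as the list above describes it; the constant names, the function names and the assumption of exactly one queue (or one read/write queue pair) per bank are hypothetical and not taken from the Memory Controller V2 sources.

```python
# Scheduler codes as listed for the V2ChannelScheduler parameter.
FIFO, RW_FIFO, BANK_QUEUE, BANK_RW_QUEUE = range(4)


def queue_key(scheduler, bank, is_read):
    """Identify the queue that would receive a command in this model."""
    kind = "read" if is_read else "write"
    if scheduler == FIFO:
        return ("all",)            # single command queue
    if scheduler == RW_FIFO:
        return (kind,)             # one queue for reads, one for writes
    if scheduler == BANK_QUEUE:
        return (bank,)             # one queue per bank
    if scheduler == BANK_RW_QUEUE:
        return (bank, kind)        # read and write queues per bank
    raise ValueError(f"unknown scheduler code {scheduler}")


def queue_count(scheduler, banks_per_channel):
    """Number of queues per memory channel under the same assumptions."""
    return {
        FIFO: 1,
        RW_FIFO: 2,
        BANK_QUEUE: banks_per_channel,
        BANK_RW_QUEUE: 2 * banks_per_channel,
    }[scheduler]


if __name__ == "__main__":
    # Hypothetical value: V2BanksPerMemoryChannel = 8.
    for code in (FIFO, RW_FIFO, BANK_QUEUE, BANK_RW_QUEUE):
        print(code, queue_count(code, 8), queue_key(code, bank=3, is_read=True))
```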
STREAMER Section
The STREAMER section configures the Streamer Fetch, Streamer Loader, Streamer Output Cache and Streamer Commit stages of the ATTILA GPU architecture.
The parameters that can be used in the STREAMER section are:
- IndicesCycle
- Type : number
- Description : Number of vertex indices that can be read and processed per cycle by the Streamer pipeline of the ATTILA GPU architecture.
- IndexBufferSize
- Type : number
- Description : Size in bytes of the buffer that stores vertex index data for the Streamer Fetch unit/stage of the ATTILA GPU architecture.
- InputRequestQueueSize
- Type : number
- Description : Size of the queue that stores information about pending data transferences for vertex indices in the Streamer Loader unit/stage of the ATTILA GPU architecture.
- Note : Each entry corresponds with a single index. Each index is associated with up to 16 vertex attribute buffers or streams.
- AttributesCycle
- Type : number
- Description : Number of vertex attributes (up to 4x32 bit FP values per attribute) that can be processed and read per cycle in the Streamer Loader unit/stage of the ATTILA GPU architecture.
- Note : This parameter affects the number of read ports in the Input Cache of the Streamer Loader unit/stage.
- InputCacheLines
- Type : number
- Description : Number of cache lines in the Input Cache associated with the Streamer Loader unit/stage of the ATTILA GPU architecture.
- Note : The Input Cache is a fully associative cache.
- InputCacheLineSize
- Type : number
- Description : Size in bytes of a cache line in the Input Cache associated with the Streamer Loader unit/stage of the ATTILA GPU architecture.
- InputCachePortWidth
- Type : number
- Description : Bandwidth in bytes per cycle of each of the read and write ports of Input Cache associated with the Streamer Loader unit/stage of the ATTILA GPU architecture.
- Note : The actual bandwidth from Input Cache to Streamer Loader is this parameter multiplied by the AttributesCycle parameter. There is a single write port for the transferences from the Memory Controller.
- InputCacheRequestQueueSize
- Type : number
- Description : Number of entries in the pending cache miss queue in the Fetch Cache associated with the Input Cache of the Streamer Loader unit/stage of the ATTILA GPU architecture.
- Note : Each entry corresponds with a single Input Cache line.
- InputCacheInputQueueSize
- Type : number
- Description : Number of entries in the pending cache line fill queue of the Input Cache associated with the Streamer Loader unit/stage of the ATTILA GPU architecture.
- Note : Each entry corresponds to a single Input Cache line that is waiting to be filled with data from the Memory Controller. The actual number of Input Cache lines that can have pending misses is the sum of this parameter and the InputCacheRequestQueueSize parameter.
- OutputFIFOSize
- Type : number
- Description : Number of entries in the Output FIFO in the Streamer Commit unit/stage of the ATTILA GPU architecture.
- Note : Each entry of the Output FIFO corresponds to an index in the current index stream. The Output FIFO is a reorder queue used to keep the order of the vertices issued to the Primitive Assembly unit/stage and keeps information about all the indexes that are being processed in any stage of the Streamer pipeline. The value assigned to this parameter limits how many vertices are being processed at any given time in the Streamer and Vertex Shader (or unified shader processors) unit/stages of the ATTILA GPU architecture.
- OutputMemorySize
- Type : number
- Description : Number of entries in the Output Memory in the Streamer Commit unit/stage of the ATTILA GPU architecture.
- Note : Each entry of the Output Memory corresponds with the full set of attributes of a vertex associated with a specific vertex index number. The number of entries in the Output Memory can be smaller than the number of entries in the Output FIFO to take into account the fact that some index numbers may be repeated. This Output Memory acts as the storage for the Vertex Post-Shading cache in the ATTILA GPU architecture.
- VerticesCycle
- Type : number
- Description : Number of vertices that can be committed and issued per cycle to Primitive Assembly from the Streamer Commit unit/stage of the ATTILA GPU architecture.
- AttributesSentCycle
- Type : number
- Description : Number of vertex attributes (up to 4x32 bits of data) per vertex that can be transferred to Primitive Assembly from the Streamer Commit unit/stage of the ATTILA GPU architecture.
- Note : The total bandwidth of the data buses from Streamer to Primitive Assembly is the product of this parameter, the VerticesCycle parameter and the largest vertex attribute size (16 bytes).
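The derived quantities mentioned in the notes of this section (Input Cache read bandwidth, pending Input Cache lines, and Streamer to Primitive Assembly bandwidth) are simple products and sums of the configured values. The Python sketch below just writes that arithmetic down; the function names and the example values are hypothetical.

```python
MAX_ATTRIBUTE_BYTES = 16  # largest vertex attribute: 4 x 32-bit values


def input_cache_read_bandwidth(input_cache_port_width, attributes_cycle):
    """Bytes per cycle from the Input Cache to the Streamer Loader."""
    return input_cache_port_width * attributes_cycle


def max_pending_input_cache_lines(request_queue_size, input_queue_size):
    """Input Cache lines that can have pending misses at the same time."""
    return request_queue_size + input_queue_size


def streamer_to_primitive_assembly_bandwidth(vertices_cycle, attributes_sent_cycle):
    """Bytes per cycle from Streamer Commit to Primitive Assembly."""
    return vertices_cycle * attributes_sent_cycle * MAX_ATTRIBUTE_BYTES


if __name__ == "__main__":
    # Hypothetical configuration values, chosen only for the example.
    print(input_cache_read_bandwidth(input_cache_port_width=16, attributes_cycle=4))
    print(max_pending_input_cache_lines(request_queue_size=8, input_queue_size=8))
    print(streamer_to_primitive_assembly_bandwidth(vertices_cycle=1, attributes_sent_cycle=4))
```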
VERTEXSHADER Section
The VERTEXSHADER section configures the Vertex Shader stage of the non unified version of the ATTILA GPU architecture. This section is ignored by the unified version of the ATTILA GPU architecture.
The parameters that can be used in the VERTEXSHADER section are:
- ExecutableThreads
- Type : number
- Description : Number of threads supported by the vertex shader processor in the ATTILA GPU architecture.
- InputBuffers
- Type : number
- Description : Number of temporary buffers used to store vertex attribute data from the Streamer Loader unit/stage.
- Note : These are implemented as extra 'threads' that are not executable. When a real shader thread finishes and one of these extra threads has valid data, the data is transferred to a real thread, which can then start execution.
- ThreadResources
- Type : number
- Description : Number of 'resources' shared by all the threads (including the buffering 'threads' defined by the InputBuffers parameter) in the Vertex Shader processor of the ATTILA GPU architecture. In the current implementation this corresponds with the number of SIMD 4x32-bit floating-point registers that can be used to store vertex attribute data, temporary values and vertex output data for the executable and buffer threads in the Vertex Shader processor.
- Note : The allocation of those registers is performed statically. Each vertex that enters the processor reserves for itself a fixed number of registers, which is determined by the maximum of these three values: the number of vertex input attributes, the maximum number of live registers at any point of the execution of the current vertex shader program and the number of vertex output attributes.
- ThreadRate
- Type : number
- Description : Number of threads that are being 'executed' in parallel. This parameter corresponds to the number of executable threads from which instructions can be fetched, decoded and executed per cycle.
- FetchRate
- Type : number
- Description : Number of instructions that are being 'executed' in parallel. This parameter corresponds with the number of instructions that can be fetched, decoded and executed per thread and cycle.
- Note : In a configuration where ThreadRate is 2 and FetchRate is 1 the Vertex Shader processors fetch up to two instructions per cycle, one from each of two different threads, while a configuration with ThreadRate set to 1 and FetchRate set to 2 fetches up to two consecutive instructions from a single thread each cycle.
- ThreadGroup
- Type : number
- Description : Number of threads that form a group. The threads in a group share the thread state (same PC, thread state flags).
- Note : This parameter only makes sense with the LockedExecutionMode parameter set to TRUE.
- LockedExecutionMode
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) locked execution of threads in a group. In effect, it enables shader threads to be processed in groups.
- Note : For a single threaded Vertex Shader processor (equivalent to NVidia MIMD Vertex Shaders) this value should be set to FALSE. For a SIMD Fragment Shader processing fragment quads this value should be TRUE and the ThreadGroup parameter set to 4.
- ScalarALU
- Type : boolean
- Description : When disabled (FALSE) the base ALU of the Vertex Shader processor is a SIMD 4x32-bit floating-point ALU with support for scalar and special instructions. When enabled (TRUE) an additional scalar ALU is enabled (for example like the 4+1 configuration in ATI Vertex Shaders) to execute scalar and special instructions. When enabled the SIMD ALU can still execute scalar and special instructions. When enabled the FetchRate parameter must be set to at least 2. Any additional ALU required for the value assigned to the FetchRate parameter will be a SIMD ALU.
- ThreadWindow
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) out of order execution of the threads (groups when LockedExecutionMode is set to TRUE). When this mode is enabled the next thread (or group) from which to fetch an instruction is selected from a thread window with as many entries as the value of the ExecutableThreads parameter (or ExecutableThreads divided by ThreadGroup when LockedExecutionMode is enabled). The selection logic implements a Round Robin first ready policy. When the parameter is set to FALSE the threads are processed in order: the next thread (or group) in the queue is the one selected for fetching, and if this thread (or group) is not ready the Vertex Shader processor stalls until it becomes ready.
- FetchDelay
- Type : number
- Description : Number of cycles from the moment instructions are fetched from a thread until the thread can become ready again.
- Note : This parameter can be used to force threads to remain inactive for a number of cycles. If set to the maximum Vertex Shader processor ALU latency, it eliminates the risk of dependencies between the execution of instructions of the same thread. The latencies for the different shader instructions are defined in the latencyTable array in the ShaderDecodeExecute.h file.
- SwapOnBlock
- Type : boolean
- Description : When this parameter is set to FALSE a new ready thread (or group) is selected for fetching every cycle. When the parameter is set to TRUE a new thread (or group) is only selected when the thread (or group) becomes blocked.
- Note : The only instructions that can block a thread are the TEX/TXP/TXB instructions and the END instruction and end program flag.
- InputsPerCycle
- Type : number
- Description : Number of vertices with the associated attribute data (shader inputs) that can be received and processed per cycle by the Vertex Shader processor.
- Note : This rate is not limited by the number of attributes associated with the vertex, so the potential bandwidth is the maximum number of input vertex attributes multiplied by 16 bytes.
- OutputsPerCycle
- Type : number
- Description : Number of vertices and associated shaded attribute data (shader outputs) that can be processed and transferred to the Streamer Commit unit/stage from the Vertex Shader processor per cycle.
- Note : This rate is not limited by the number of attributes associated with the vertex, so the potential bandwidth is the maximum number of output vertex attributes multiplied by 16 bytes.
- OutputLatency
- Type : number
- Description : Latency in cycles of the data bus or path from the Vertex Shader processor to the Streamer Commit unit/stage in the ATTILA GPU architecture.
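Several of the VERTEXSHADER parameters constrain each other: the static register allocation described for ThreadResources, the thread window size described for ThreadWindow, and the FetchRate requirement stated for ScalarALU. The following Python sketch restates those three rules. The function names and example values are hypothetical, and the resident thread bound is a derived upper limit under the assumption that every vertex reserves the same number of registers; none of this is taken from the simulator source.

```python
def registers_per_thread(input_attrs, max_live_regs, output_attrs):
    """Registers statically reserved by each vertex (maximum of the three values)."""
    return max(input_attrs, max_live_regs, output_attrs)


def max_resident_threads(thread_resources, input_attrs, max_live_regs, output_attrs):
    """Upper bound on threads (executable plus buffer) the register pool can hold.

    Assumes at least one of the three counts is non-zero.
    """
    per_thread = registers_per_thread(input_attrs, max_live_regs, output_attrs)
    return thread_resources // per_thread


def thread_window_entries(executable_threads, thread_group, locked_execution_mode):
    """Entries in the thread window used when ThreadWindow is TRUE."""
    if locked_execution_mode:
        # Threads are scheduled as groups, so the window holds groups.
        return executable_threads // thread_group
    return executable_threads


def scalar_alu_is_valid(scalar_alu, fetch_rate):
    """ScalarALU = TRUE requires FetchRate to be at least 2."""
    return (not scalar_alu) or fetch_rate >= 2


# Illustrative values only: a shader with 4 inputs, 6 live registers, 3 outputs.
print(registers_per_thread(4, 6, 3))        # 6
print(max_resident_threads(96, 4, 6, 3))    # 96 // 6 = 16
print(thread_window_entries(16, 4, True))   # 16 // 4 = 4
print(scalar_alu_is_valid(True, 1))         # False
```

The same rules apply to the FRAGMENTSHADER section, which uses the same parameter names.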
PRIMITIVEASSEMBLY Section
The PRIMITIVEASSEMBLY section configures the Primitive Assembly unit/stage of the ATTILA GPU architecture.
The parameters that can be set in the PRIMITIVEASSEMBLY section are:
- VerticesCycle
- Type : number
- Description : Number of vertices (with the associated shaded attributes) that can be received from the Streamer Commit unit/stage and processed per cycle in the Primitive Assembly unit/stage of the ATTILA GPU architecture.
- Note : The value of this parameter must be the same as the value of the VerticesCycle parameter in the STREAMER section.
- Future Changes : Use a single parameter.
- TrianglesCycle
- Type : number
- Description : Number of triangles that can be issued to the Clipper unit/stage per cycle from the Primitive Assembly unit/stage of the ATTILA GPU architecture.
- InputBusLatency
- Type : number
- Description : Latency in cycles of the data bus that is used to transfer vertex data from the Streamer Commit unit/stage to the Primitive Assembly unit/stage of the ATTILA GPU architecture.
- AssemblyQueueSize
- Type : number
- Description : Number of vertices (with the associated attribute data) that can be stored in the assembly queue of the Primitive Assembly unit/stage of the ATTILA GPU architecture.
- Note : This number must be at least the sum of VerticesCycle and 4 (the 4 vertices that are required to form a quad primitive).
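The two constraints stated in the notes of this section (matching VerticesCycle values and the minimum AssemblyQueueSize) can be checked before starting a simulation. The sketch below assumes the configuration has already been parsed into plain Python dictionaries keyed by parameter name; that representation, the function name and the example values are hypothetical.

```python
def check_primitive_assembly(streamer, primitive_assembly):
    """Return a list of problems found in the PRIMITIVEASSEMBLY parameters."""
    problems = []
    vertices_cycle = primitive_assembly["VerticesCycle"]

    # VerticesCycle must match the STREAMER section.
    if vertices_cycle != streamer["VerticesCycle"]:
        problems.append("VerticesCycle differs between STREAMER and PRIMITIVEASSEMBLY")

    # The assembly queue must hold VerticesCycle vertices plus the 4 of a quad.
    if primitive_assembly["AssemblyQueueSize"] < vertices_cycle + 4:
        problems.append("AssemblyQueueSize is smaller than VerticesCycle + 4")

    return problems


# Illustrative values only.
streamer = {"VerticesCycle": 2}
good = {"VerticesCycle": 2, "AssemblyQueueSize": 8}
bad = {"VerticesCycle": 1, "AssemblyQueueSize": 4}

print(check_primitive_assembly(streamer, good))       # []
print(len(check_primitive_assembly(streamer, bad)))   # 2
```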
CLIPPER Section
The CLIPPER section configures the Clipper unit/stage of the ATTILA GPU architecture.
The parameters that can be set in the CLIPPER section are:
- TrianglesCycle
- Type : number
- Description : Number of triangles that can be received from Primitive Assembly and sent to the Rasterizer per cycle in the Clipper unit/stage of the ATTILA GPU architecture.
- Note : The value assigned to this parameter must match the value assigned to the TrianglesCycle parameter in the PRIMITIVEASSEMBLY section.
- Future Changes : Use a single parameter.
- ClipperUnits
- Type : number
- Description : Number of triangles that can start the clipping operation per cycle.
- Note : The current implementation of the Triangle Clipper only implements a simple rejection test for triangles completely outside the view frustum, so no new triangles are generated.
- StartLatency
- Type : number
- Description : Minimum number of cycles between the start of the clipping operation of two consecutive triangles (this parameter applies per Clipper unit).
- ExecLatency
- Type : number
- Description : Latency in cycles of the Clipper unit pipeline (this parameter applies per Clipper unit). Corresponds with the time in cycles from the moment a triangle starts the clipping test until the result is obtained and the triangle is discarded or stored in the queue for clipped triangles.
- ClipBufferSize
- Type : number
- Description : Number of entries in the queue that stores clipped (non rejected) triangles in the Clipper unit/stage of the ATTILA GPU architecture.
- Note : Each entry corresponds with a single triangle and all the associated data (three vertices and associated attribute data).
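As a rough aid for choosing CLIPPER values, the interaction between TrianglesCycle, ClipperUnits, StartLatency and ExecLatency can be approximated with a simple pipelined unit model. This Python sketch is an assumption built from the parameter descriptions above, not a description of how the simulator computes anything: it assumes StartLatency is at least 1 and that each Clipper unit can start a new triangle every StartLatency cycles. The function names are hypothetical.

```python
def clipper_start_rate(triangles_cycle, clipper_units, start_latency):
    """Approximate sustained triangles per cycle entering the clipping test.

    Assumed model: input is capped by TrianglesCycle, and the units
    together start ClipperUnits triangles every StartLatency cycles.
    """
    if start_latency < 1:
        raise ValueError("this sketch assumes StartLatency >= 1")
    return min(float(triangles_cycle), clipper_units / start_latency)


def cycles_for_batch(n_triangles, start_latency, exec_latency):
    """Cycles for one Clipper unit to finish n back-to-back triangles.

    Assumed model: the first triangle takes ExecLatency cycles and each
    later triangle starts StartLatency cycles after the previous one.
    """
    if n_triangles <= 0:
        return 0
    return (n_triangles - 1) * start_latency + exec_latency


# Illustrative values only.
print(clipper_start_rate(triangles_cycle=2, clipper_units=2, start_latency=4))  # 0.5
print(cycles_for_batch(n_triangles=3, start_latency=2, exec_latency=6))         # 10
```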
RASTERIZER Section
The RASTERIZER section configures the Triangle Setup, Fragment Generation (Triangle Traversal box in the simulator), Hierarchical Z and Fragment FIFO stages of the ATTILA GPU architecture.
The parameters that can be set in the RASTERIZER section are:
- TrianglesCycle
- Type : number
- Description : Number of triangles that can be received from the Clipper unit/stage per cycle in the Triangle Setup unit/stage of the ATTILA GPU architecture. It also corresponds with the number of triangles that can be issued to the Fragment Generator unit/stage per cycle.
- Note : The value assigned to this parameter must match the value assigned to the TrianglesCycle parameter in the CLIPPER and PRIMITIVEASSEMBLY sections.
- Future Changes : Use a single parameter.
- SetupFIFOSize
- Type : number
- Description : Number of entries in the queue that stores triangles that have been processed by the Triangle Setup unit/stage of the ATTILA GPU architecture.
- Note : Each entry in the queue corresponds with a setup triangle and all its associated data (attributes for three vertices, equation coefficients for the edge and z interpolation equations).
- SetupUnits
- Type : number
- Description : Number of triangles that can start processing in the Triangle Setup unit/stage per cycle.
- SetupLatency
- Type : number
- Description : Latency in cycles of the hardware pipeline that performs the setup operation for a single triangle.
- SetupStartLatency
- Type : number
- Description : Minimum number of cycles that must pass between the start of the setup operation for a triangle and the start of the setup operation for the next triangle (per Triangle Setup unit).
- TriangleInputLatency
- Type : number
- Description : Latency in cycles of the data bus used to transfer triangle data from the Clipper unit/stage to the Triangle Setup unit/stage of the ATTILA GPU architecture.
- TriangleOutputLatency
- Type : number
- Description : Latency in cycles of the data bus used to transfer triangle data from the Triangle Setup unit/stage to the Fragment Generation unit/stage of the ATTILA GPU architecture.
- TriangleSetupOnShader
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the usage of a shader program that performs part of the Triangle Setup operations in the shader processors. In effect, this removes part of the Triangle Setup logic and uses the shader processors to perform the operations.
- Note : This parameter only applies for the unified shader version of the ATTILA GPU architecture. This feature has not been tested in a while and may be broken in the current implementation of the simulator.
- TriangleShaderQueueSize
- Type : number
- Description : Number of entries in the reorder queue that keeps the state for triangles that are pending to finish the Triangle Setup shader program in the shader processors.
- Note : This parameter only applies for the unified shader version of the ATTILA GPU architecture and when the TriangleSetupOnShader parameter is set to TRUE. Each entry corresponds with a full triangle and associated data. The value of this parameter limits how many triangles the shader processors can be processing at a given time.
- StampsPerCycle
- Type : number
- Description : Number of 'stamps' or quads (2x2 tile of fragments) that are generated per cycle by the Fragment Generation unit/stage of the ATTILA GPU architecture.
- Note : The value assigned to this parameter should be a multiple of the NumStampPipes parameter in the GPU section.
- MSAASamplesCycle
- Type : number
- Description : Number of multisampling samples that can be generated and processed per fragment.
- OverScanWidth
- Type : number
- Description : Size in scan tiles of an 'over scan' tile in the horizontal axis.
- Note : The over scan tile can be used to increase locality when accessing GDDR pages containing frame buffer data.
- OverScanHeight
- Type : number
- Description : Size in scan tiles of an 'over scan' tile in the vertical axis.
- ScanWidth
- Type : number
- Description : Size in fragments of a scan tile in the horizontal axis.
- Note : The scan tile is used as the workload distribution unit to the Fragment Shader processors and Z and Stencil Test and Color Write unit/stages of the ATTILA GPU architecture. It's similar to how ATI GPUs tile the whole framebuffer with a checkerboard pattern and assign tiles to their Fragment Shader and ROP pipelines.
- ScanHeight
- Type : number
- Description : Size in fragments of a scan tile in the vertical axis.
- GenWidth
- Type : number
- Description : Size in fragments of a generation tile in the horizontal axis.
- Note : The generation tile corresponds with the size of a Hierarchical Z block, a Z and Stencil Cache line and a Color Cache line. It is also the work unit that is sent from Fragment Generation unit/stage to the Hierarchical Z unit/stage.
- GenHeight
- Type : number
- Description : Size in fragments of a generation tile in the vertical axis.
- RasterizationBatchSize
- Type : number
- Description : Number of triangles that are rasterized in parallel in the Fragment Generation unit/stage of the ATTILA GPU architecture.
- Note : This parameter only applies for the Recursive rasterization algorithm which can traverse groups of triangles in parallel.
- BatchQueueSize
- Type : number
- Description : Number of triangles that can be stored in the queue of triangles pending to be traversed in the Fragment Generation unit/stage of the ATTILA GPU architecture.
- RecursiveMode
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the Recursive rasterization algorithm. When enabled the Fragment Generation unit/stage implements a recursive algorithm somewhat similar to the one described in "Incremental and Hierarchical Hilbert Order Edge Equation Polygon Rasterization" (McCool et al). When disabled a tiled triangle scan algorithm is implemented, similar to the one described for the Compaq Neon graphic processor.
- DisableHZ
- Type : boolean
- Description : Disables (TRUE) or enables (FALSE) the early z test and rejection of fragment quads in the Hierarchical Z unit/stage of the ATTILA GPU architecture.
- Note : The value of this parameter must be TRUE if z compression is disabled in the Z and Stencil Cache or the cache line does not correspond with a generation tile.
- StampsPerHZBlock
- Type : number
- Description : Number of fragment quads per Hierarchical Z block.
- Note : The value of this number must correspond with the number of fragment quads in a generation tile.
- HierarchicalZBufferSize
- Type : number
- Description : Size of the on-die Hierarchical Z memory.
- Note : Each element corresponds with a single representative Z for a given Z buffer block. In the current implementation each element corresponds with a 24-bit Z value.
- HZCacheLines
- Type : number
- Description : Number of lines in the fast access representative Z cache in the Hierarchical Z unit/stage of the ATTILA GPU architecture.
- Note : The current implementation of the on-die Hierarchical Z memory has limited bandwidth and multicycle access latency, so a fast access cache is required to sustain the early test and rejection rate of multiple HZ blocks per cycle.
- HZCacheLineSize
- Type : number
- Description : Number of representative Z values stored per line in the fast access cache in the Hierarchical Z unit/stage of the ATTILA GPU architecture.
- EarlyZQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads waiting for the early z test and rejection in the Hierarchical Z unit/stage of the ATTILA GPU architecture.
- Note : Each entry stores the data associated with four preshading fragments in a quad (fragment cull flags, inside triangle flags, interpolated z value, barycentric coordinates).
- HZAccessLatency
- Type : number
- Description : Latency in cycles of an access to the on-die Hierarchical Z memory.
- HZUpdateLatency
- Type : number
- Description : Latency in cycles of the data buses (one per Z and Stencil Test unit instance) that carry representative Z values for blocks that have been evicted from the Z Cache.
- HZBlocksClearedPerCycle
- Type : number
- Description : Number of elements of the on-die Hierarchical Z memory that can be written per cycle.
- Note : This value is used only to determine how much time is required to reset the Hierarchical Z memory to the clear Z value.
- NumInterpolators
- Type : number
- Description : Number of interpolator units in the Interpolator unit/stage of the ATTILA GPU architecture. This number limits how many fragment attributes (4x32-bit floating-point values) can be interpolated from fragment barycentric coordinates and triangle vertex attributes per cycle.
- Note : This value applies to the fragment quad rate of the Fragment Generation pipeline defined by the NumStampPipes parameter in the GPU section. The total number of fragment attributes that can be interpolated per cycle is the value assigned to NumStampPipes multiplied by the value assigned to this parameter.
- ShaderInputQueueSize
- Type : number
- Description : Number of entries in the shader input queues in the Fragment FIFO unit/stage of the ATTILA GPU architecture.
- Note : The Fragment FIFO distributes the shader processing workload between the Fragment Shader processors (non-unified version of the ATTILA GPU architecture) or the Unified Shader processors (unified version of the ATTILA GPU architecture). A queue per Fragment or Unified Shader processor stores the shader inputs that have been distributed but are waiting to be transferred to the assigned shader processor. This parameter defines the size of those queues. An entry of a shader input queue must store all the data associated with a shader input (up to 16 attributes, each attribute being a 4x32-bit floating-point value).
- ShaderOutputQueueSize
- Type : number
- Description : Number of entries in the shader output queues in the Fragment FIFO unit/stage of the ATTILA GPU architecture.
- Note : The Fragment FIFO distributes the shader processing workload between the Fragment Shader processors (non-unified version of the ATTILA GPU architecture) or the Unified Shader processors (unified version of the ATTILA GPU architecture). A queue per Fragment or Unified Shader processor stores the shader inputs that have been fully processed by the assigned shader processor and are waiting to be sent to the next processing stage (Streamer Commit for vertices, Triangle Setup for triangles or Color Write for fragments). This parameter defines the size of those queues. An entry of a shader output queue must store all the data associated with a shader output (up to 16 attributes, each attribute being a 4x32-bit floating-point value).
- ShaderInputBatchSize
- Type : number
- Description : Size of the workload unit that is used to distribute fragments between the different Fragment Shader processors in the Fragment FIFO unit/stage of the ATTILA GPU architecture.
- Note : This parameter only applies when the TiledShaderDistribution parameter is set to FALSE. This feature has not been tested for a while and may not work. The value assigned to this parameter must be a multiple of the ThreadGroup parameter in the FRAGMENTSHADER section.
- TiledShaderDistribution
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) the tile based distribution of fragment workload between the Fragment or Unified Shader processors.
- Note : The scan tile is the workload distribution unit and the assignment to the different Fragment or Unified Shader processors follows a Morton (also called Z) order. When disabled, batches of consecutive fragments (the size of a batch is defined by the ShaderInputBatchSize parameter) from any tile or location are issued to the different Fragment or Unified Shader processors.
- VertexInputQueueSize
- Type : number
- Description : Size of the queue that stores vertices waiting to be shaded in the Fragment FIFO unit/stage of the ATTILA GPU pipeline.
- Note : This parameter only applies for the unified version of the ATTILA GPU architecture. The vertices received from the Streamer Loader unit/stage are stored in this queue before being assigned and issued to a Unified Shader processor. This parameter must be a multiple of the Unified Shader ThreadGroup parameter in the FRAGMENTSHADER section. Each entry stores all the data required for the input attributes (up to 16, each being 4x32-bit floating-point values) of a single vertex.
- ShadedVertexQueueSize
- Type : number
- Description : Size of the queue that stores vertices that have been shaded in the Fragment FIFO unit/stage of the ATTILA GPU pipeline.
- Note : This parameter only applies for the unified version of the ATTILA GPU architecture. The vertices that have been fully processed in the Unified Shader processors are received and stored in this queue. Later they are transferred to the Streamer Commit unit/stage. This parameter must be a multiple of the Unified Shader ThreadGroup parameter in the FRAGMENTSHADER section. Each entry stores all the data required for the output attributes (up to 16, each being 4x32-bit floating-point values) of a single shaded vertex. The size of this queue limits the number of vertices that are being processed at any given time in the Unified Shader processors.
- TriangleInputQueueSize
- Type : number
- Description : Size of the queue that stores triangles waiting to be shaded in the Fragment FIFO unit/stage of the ATTILA GPU pipeline.
- Note : This parameter only applies for the unified version of the ATTILA GPU architecture and when the TriangleSetupOnShader parameter is set to TRUE. The triangles are received from the Triangle Setup unit/stage and stored in this queue before being assigned and issued to a Unified Shader processor. This parameter must be a multiple of the Unified Shader ThreadGroup parameter in the FRAGMENTSHADER section. Each entry stores all the data required for the input attributes (three 4x32-bit floating-point values) of a single triangle.
- TriangleOutputQueueSize
- Type : number
- Description : Size of the queue that stores triangles that have been shaded in the Fragment FIFO unit/stage of the ATTILA GPU pipeline.
- Note : This parameter only applies for the unified version of the ATTILA GPU architecture and when the TriangleSetupOnShader parameter is set to TRUE. The triangles that have been fully processed in the Unified Shader processors are received and stored in this queue. Later they are transferred back to the Triangle Setup unit/stage. This parameter must be a multiple of the Unified Shader ThreadGroup parameter in the FRAGMENTSHADER section. Each entry stores all the data required for the output attributes (three 4x32-bit floating-point values and a 32-bit floating-point value) of a single triangle. The size of this queue limits the number of triangles that are being processed at any given time in the Unified Shader processors.
- GeneratedStampQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads that have been received from the Hierarchical Z unit/stage in the Fragment FIFO unit/stage of the ATTILA GPU architecture.
- Note : Each fragment quad pipeline has a separate queue. The number of quad pipelines is defined by the NumStampPipes parameter in the GPU section. Each entry of the queues stores all the data associated with a pre-shading fragment quad (cull and inside triangle flags, framebuffer coordinates, triangle identifier and barycentric coordinates for four fragments).
- EarlyZTestedStampQueueSize
- Type : number
- Description : Size of the queue that stores fragments that have been received from the Z and Stencil Test unit/stage in the Fragment FIFO unit/stage of the ATTILA GPU architecture.
- Note : Each fragment quad pipeline has a separate queue. The number of quad pipelines is defined by the NumStampPipes parameter in the GPU section. Each entry of the queues stores all the data associated with a pre-shading fragment quad (cull and inside triangle flags, framebuffer coordinates, triangle identifier and barycentric coordinates for four fragments).
- InterpolatedStampQueueSize
- Type : number
- Description : Size of the queue that stores fragments that have been received from the Interpolator unit/stage in the Fragment FIFO unit/stage of the ATTILA GPU architecture.
- Note : Each fragment quad pipeline has a separate queue. The number of quad pipelines is defined by the NumStampPipes parameter in the GPU section. Each entry of the queues stores all the data associated with an interpolated fragment quad (cull and inside triangle flags, framebuffer coordinates, triangle identifier and input attributes, up to 16, for four fragments).
- ShadedStampQueueSize
- Type : number
- Description : Size of the reorder queue for fragment quads that have been shaded in the Fragment or Unified Shader processors.
- Note : Each fragment quad pipeline has a separate queue. The number of quad pipelines is defined by the NumStampPipes parameter in the GPU section. Each entry of the queues stores all the data associated with a shaded fragment quad (cull and inside triangle flags, framebuffer coordinates, triangle identifier and output attributes, one 4x32-bit floating-point value and an extra 32-bit floating-point Z value, for four fragments). The size of these queues limits the number of fragment quads that are being processed in the Fragment or Unified Shader processors at any given time.
- EmulatorStoredTriangles
- Type : number
- Description : Maximum number of setup triangles that can be active in the Rasterizer Emulator (and the ATTILA GPU pipeline) at any given time.
- Note : The value of this number is related to the number of setup triangles that are active at any pipeline stage of the Rasterizer and Fragment pipelines at a given time. The Rasterizer Emulator must keep the data for a setup triangle (edge and z interpolation equations, attributes for the three triangle vertices) until the attributes for the last fragment in the triangle have been interpolated.
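Many RASTERIZER parameters are tied to each other or to parameters in the GPU and FRAGMENTSHADER sections. The Python sketch below collects a few of the relations stated in the notes above: StampsPerCycle as a multiple of NumStampPipes, StampsPerHZBlock matching the number of quads in a generation tile, the unified vertex queue sizes as multiples of ThreadGroup, and the total interpolation rate. It assumes the configuration is available as a plain dictionary keyed by parameter name; the function names and example values are hypothetical and the list of checks is not exhaustive.

```python
FRAGMENTS_PER_STAMP = 4  # a stamp (quad) is a 2x2 tile of fragments


def check_rasterizer(rasterizer, num_stamp_pipes, thread_group):
    """Return a list of problems found in a subset of the RASTERIZER parameters."""
    problems = []

    # StampsPerCycle should be a multiple of NumStampPipes (GPU section).
    if rasterizer["StampsPerCycle"] % num_stamp_pipes != 0:
        problems.append("StampsPerCycle is not a multiple of NumStampPipes")

    # StampsPerHZBlock must equal the number of quads in a generation tile.
    quads_per_tile = rasterizer["GenWidth"] * rasterizer["GenHeight"] // FRAGMENTS_PER_STAMP
    if rasterizer["StampsPerHZBlock"] != quads_per_tile:
        problems.append("StampsPerHZBlock does not match the generation tile size")

    # Unified version only: vertex queues must be multiples of ThreadGroup.
    for name in ("VertexInputQueueSize", "ShadedVertexQueueSize"):
        if rasterizer[name] % thread_group != 0:
            problems.append(name + " is not a multiple of ThreadGroup")

    return problems


def interpolated_attributes_per_cycle(num_stamp_pipes, num_interpolators):
    """Total fragment attributes interpolated per cycle across all quad pipelines."""
    return num_stamp_pipes * num_interpolators


# Illustrative values only (not defaults of the simulator).
example = {
    "StampsPerCycle": 4,
    "GenWidth": 8,
    "GenHeight": 8,
    "StampsPerHZBlock": 16,
    "VertexInputQueueSize": 32,
    "ShadedVertexQueueSize": 512,
}
print(check_rasterizer(example, num_stamp_pipes=4, thread_group=16))  # []
print(interpolated_attributes_per_cycle(4, 4))                        # 16
```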
FRAGMENTSHADER Section
The FRAGMENTSHADER section configures the Fragment Shader processors in the non-unified version of the ATTILA GPU architecture or the Unified Shader processors in the unified version of the ATTILA GPU architecture.
The parameters that can be used in the FRAGMENTSHADER section are:
- ExecutableThreads
- Type : number
- Description : Number of threads supported by the fragment or unified shader processor in the ATTILA GPU architecture.
- InputBuffers
- Type : number
- Description : Number of temporary buffers used to store shader input attribute data.
- Note : These are implemented as extra 'threads' that are not executable. When a real shader thread finishes and one of these extra threads has valid data, the data is transferred to a real thread, which can then start execution.
- ThreadResources
- Type : number
- Description : Number of 'resources' shared by all the threads (including the buffering 'threads' defined by the InputBuffers parameter) in the Fragment or Unified Shader processor of the ATTILA GPU architecture. In the current implementation this corresponds with the number of SIMD 4x32-bit floating-point registers that can be used to store shader input attribute data, temporary values and shader output attribute data for the executable and buffer threads in the Fragment or Unified Shader processor.
- Note : The allocation of those registers is static. Each shader input that enters the processor reserves for itself a fixed number of registers, which is determined by the maximum of these three values: the number of shader input attributes, the maximum number of live registers at any point of the execution of the current shader program and the number of shader output attributes.
- ThreadRate
- Type : number
- Description : Number of threads that are being 'executed' in parallel. This parameter corresponds to the number of executable threads from which instructions can be fetched, decoded and executed per cycle. For example, when configuring a Fragment Shader processor similar to the ones implemented in the ATI R520 GPU this parameter would be set to 4 and for an ATI R580 this parameter would be set to 12.
- FetchRate
- Type : number
- Description : Number of instructions that are being 'executed' in parallel. This parameter corresponds with the number of instructions that can be fetched, decoded and executed per thread and cycle.
- Note : In a configuration where ThreadRate is 2 and FetchRate is 1 the Fragment or Unified Shader processors fetch up to two instructions per cycle, one from each of two different threads, while a configuration with ThreadRate set to 1 and FetchRate set to 2 fetches up to two consecutive instructions from a single thread each cycle. For a configuration similar to the NVidia G70, which implements two consecutive SIMD ALUs per shader processor, this parameter would be set to 2.
- ThreadGroup
- Type : number
- Description : Number of threads that form a group. The threads in a group share the thread state (same PC, thread state flags).
- Note : This parameter only makes sense with the LockedExecutionMode parameter set to TRUE. For shader processors that process fragments this parameter must be set to a multiple of 4 (the number of fragments in a fragment quad). For example, when configuring a Fragment Shader processor similar to the one implemented in the ATI R580 or RV530 GPU this parameter would be set to 48, for a Fragment Shader processor similar to the R520 or RV515 it would be set to 16 and for a Fragment Shader processor similar to the G70 it would be set to 1024.
- LockedExecutionMode
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) locked execution of threads in a group. In effect, it enables shader threads to be processed in groups.
- Note : For a single threaded Vertex Shader processor (equivalent to NVidia MIMD Vertex Shaders) this value should be set to FALSE. For a SIMD Fragment Shader processing fragment quads this value should be TRUE and the ThreadGroup parameter set to a multiple of 4.
- ScalarALU
- Type : boolean
- Description : When disabled (FALSE) the base ALU of the Fragment or Unified Shader processor is a SIMD 4x32-bit floating-point ALU with support for scalar and special instructions. When enabled (TRUE) an additional scalar ALU is enabled (for example like the 4+1 configuration in the ATI Xenos GPU) to execute scalar and special instructions. When enabled the SIMD ALU can still execute scalar and special instructions. When enabled the FetchRate parameter must be set to at least 2. Any additional ALU required for the value assigned to the FetchRate parameter will be a SIMD ALU.
- ThreadWindow
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) out of order execution of the threads (groups when LockedExecutionMode is set to TRUE). When this mode is enabled the next thread (or group) from which to fetch an instruction is selected from a thread window with as many entries as the value of the ExecutableThreads parameter (or ExecutableThreads divided by ThreadGroup when LockedExecutionMode is enabled). The selection logic implements a Round Robin first ready policy. When the parameter is set to FALSE the threads are processed in order: the next thread (or group) in the queue is the one selected for fetching, and if this thread (or group) is not ready the Fragment or Unified Shader processor stalls until it becomes ready.
- FetchDelay
- Type : number
- Description : Number of cycles that must pass after instructions are fetched from a thread before the thread can become ready again.
- Note : This parameter can be used to force threads to remain inactive for a number of cycles and eliminates the risk of dependencies between the execution of instructions of the same thread if set to the value of the maximum Fragment or Unified Shader processor ALU latency. The latencies for the different shader instructions are defined in the latencyTable array defined in the ShaderDecodeExecute.h file.
- SwapOnBlock
- Type : boolean
- Description : When this parameter is set to FALSE a new ready thread (or group) is selected for fetching every cycle. When the parameter is set to TRUE a new thread (or group) is only selected when the thread (or group) becomes blocked.
- Note : The only instructions that can block a thread are the TEX/TXP/TXB instructions, the END instruction and any instruction carrying the end program flag.
- InputsPerCycle
- Type : number
- Description : Number of shader inputs with the associated attribute data that can be received and processed per cycle by the Fragment or Unified Shader processor.
- Note : This rate is not limited by the number of attributes associated with the shader input, so the potential bandwidth is the maximum number of shader input attributes multiplied by 16 bytes.
- OutputsPerCycle
- Type : number
- Description : Number of shader outputs and associated shaded attribute data that can be processed and transferred to the Fragment FIFO unit/stage from the Fragment or Unified Shader processor per cycle.
- Note : This rate is not limited by the number of attributes associated with the shader output, so the potential bandwidth is the maximum number of shader output attributes multiplied by 16 bytes.
- OutputLatency
- Type : number
- Description : Latency in cycles of the data bus or path from the Fragment or Unified Shader processor to the Fragment FIFO unit/stage in the ATTILA GPU architecture.
- TextureUnits
- Type : number
- Description : Number of instances of the Texture Unit unit/stage attached to a Fragment or Unified Shader processor.
- Note : Each Texture Unit has the throughput to process a single texture request for a fragment quad per cycle. If you want to keep a 1:1 ratio between Fragment Shader ALU processing and texture processing this parameter must be set to the value of the ThreadRate parameter divided by 4.
- TextureRequestRate
- Type : number
- Description : Number of texture requests for fragment quads that can be issued per cycle from the Fragment or Unified Shader processor to the attached Texture Unit unit/stages.
- TextureRequestGroup
- Type : number
- Description : Number of consecutive texture requests for fragment quads that are batched, assigned and sent to a single Texture Unit before assigning work to the next Texture Unit attached to the Fragment or Unified Shader processor.
- Note : This parameter is similar to the ShaderInputBatchSize parameter in the RASTERIZER section.
- AddressALULatency
- Type : number
- Description : Pipeline depth in cycles (latency) of the data path block that computes texel addresses from the texture coordinates of a texture request for a fragment quad.
- Note : The address data path can compute the texel addresses for a single bilinear sample for a fragment quad per cycle. Texture requests that require multiple bilinear samples are sequenced in up to 32 consecutive cycles.
- FilterALULatency
- Type : number
- Description : Pipeline depth in cycles (latency) of the data path that filters the texels read for a texture request for a fragment quad.
- Note : The filter data path can filter texels for a single bilinear sample for a fragment quad per cycle. Texture requests that require multiple bilinear samples are sequenced in up to 32 consecutive cycles.
- AnisotropyAlgorithm
- Type : number
- Description : Selects the anisotropy algorithm to be used by the simulator. Current implemented algorithms are:
- 0 => Two Axis Algorithm (algorithm used in Intel chipset GPUs, ATI Rx2xx GPUs and defined in the OpenGL extension for anisotropic filtering)
- 1 => Four Axis Algorithm (used in ATI Rx3xx+ and NVidia NV3x+ GPUs)
- 2 => Rectangular Algorithm (our own relatively cheap non-angle dependent anisotropy algorithm)
- 3 => EWA Algorithm (EWA anisotropy algorithm as defined by Heckbert's thesis)
- 4 => Experimental Algorithm (another non-angle dependent anisotropy algorithm)
- TextureBlockDimension
- Type : number
- Description : Defines the size in 32-bit texels of a texture tile or block. The block size is 2^n x 2^n where n is the value assigned to the parameter.
- Note : The value of this parameter should be related with the size of a Texture Cache line.
- TextureSuperBlockDimension
- Type : number
- Description : Defines the size in 32-bit texels of a texture super tile or super block. The super block size is 2^n x 2^n where n is the value assigned to the parameter.
- Note : The value of this parameter should be related either to the size of the Texture Cache or the GDDR page size.
- TextureRequestQueueSize
- Type : number
- Description : Size of the queue that stores texture requests for fragment quads that have been received from the associated Fragment or Unified Shader processor and are waiting to be processed in the Texture Unit unit/stage of the ATTILA GPU architecture.
- Note : Each entry of this queue stores all the required data for a texture request for a fragment quad (4x32-bit floating point coordinates per fragment in the quad).
- TextureAccessQueue
- Type : number
- Description : Size of queues that store texture requests for fragment quads that are being processed by the Texture Unit unit/stage of the ATTILA GPU architecture.
- Note : The actual implementation of these queues is too complex and intricate to be explained here. In any case this parameter limits the number of texture requests for which addresses are being calculated and texels are being fetched, read and filtered. Each entry of these queues may contain different data for a bilinear sample for a fragment quad (texture coordinates, texel addresses, texel data or filtered data).
- TextureResultQueue
- Type : number
- Description : Size of the queue that stores texture requests for fragment quads that have been fully processed in the Texture Unit unit/stage and are to be sent to the Fragment or Unified Shader processor that generated the request.
- Note : Each entry of this queue stores the required data for the result of a texture request (a 4x32-bit floating point sampled value for each fragment in the quad).
- TextureWaitReadWindow
- Type : number
- Description : Number of entries in the out-of-order window for texel accesses that miss the Texture Cache.
- Note : The current implementation of the Texture Unit has limited capabilities to reorder the accesses to memory and the processing of texture requests. The actual implementation details won't be explained here. This parameter limits the amount of reordering in the memory accesses and texture request processing.
- TwoLevelTextureCache
- Type : boolean
- Description : Enables (TRUE) or disables (FALSE) a two level Texture Cache architecture. The zero level cache stores uncompressed texture data and the first level stores compressed texture data.
- Note : When enabled each instance of the Texture Unit unit/stage has a separate first level (L1) Texture Cache. There is no communication between the Texture Caches of different instances of the Texture Unit unit/stage.
- TextureCacheLineSize
- Type : number
- Description : Bytes per Texture Cache line.
- Note : When TwoLevelTextureCache is set to TRUE this parameter is associated with the L1 Texture Cache; when set to FALSE this parameter is associated with the single level (L0) Texture Cache. Each entry stores all the information required to serve a cache line fill.
- TextureCacheWays
- Type : number
- Description : Number of ways in the Texture Cache.
- Note : When TwoLevelTextureCache is set to TRUE this parameter is associated with the L1 Texture Cache; when set to FALSE this parameter is associated with the single level (L0) Texture Cache. Each entry stores all the information required to serve a cache line fill.
- TextureCacheLines
- Type : number
- Description : Number of lines per Texture Cache way.
- Note : When TwoLevelTextureCache is set to TRUE this parameter is associated with the L1 Texture Cache; when set to FALSE this parameter is associated with the single level (L0) Texture Cache. Each entry stores all the information required to serve a cache line fill.
- TextureCachePortWidth
- Type : number
- Description : Bandwidth in bytes per cycle of a port of the first level Texture Cache.
- Note : The actual total bandwidth of the zero level (L0) Texture Cache is five times this number as the first level Texture Cache implements four read ports and a write port.
- TextureCacheRequestQueueSize
- Type : number
- Description : Size of the queue that stores pending misses in the Fetch Cache associated with the Texture Cache.
- Note : When TwoLevelTextureCache is set to TRUE this parameter is associated with the L1 Texture Cache; when set to FALSE this parameter is associated with the single level (L0) Texture Cache. Each entry stores all the information required to serve a cache line fill.
- TextureCacheInputQueue
- Type : number
- Description : Size of the queue that stores pending cache line fills in the Texture Cache.
- Note : When TwoLevelTextureCache is set to TRUE this parameter is associated with the L1 Texture Cache; when set to FALSE this parameter is associated with the single level (L0) Texture Cache. Each entry stores all the information required to serve a cache line fill.
- TextureCacheMissesPerCycle
- Type : number
- Description : Number of misses per cycle that can be accepted without stalling in the zero level Texture Cache.
- TextureCacheDecompressLatency
- Type : number
- Description : Latency (pipeline depth) in cycles of the data path that decompresses compressed texture data read from the first level Texture Cache or memory when filling the zero level Texture Cache (which only contains uncompressed data).
- TextureCacheLineSizeL1
- Type : number
- Description : Size in bytes of a cache line of the first level Texture Cache.
- Note : This parameter only applies when TwoLevelTextureCache is set to TRUE.
- TextureCacheWaysL1
- Type : number
- Description : Number of ways in the first level Texture Cache.
- Note : This parameter only applies when TwoLevelTextureCache is set to TRUE.
- TextureCacheLinesL1
- Type : number
- Description : Number of lines per way in the first level Texture Cache.
- Note : This parameter only applies when TwoLevelTextureCache is set to TRUE.
- TextureCacheInputQueueL1
- Type : number
- Description : Size of the queue that stores the pending cache line fills for the first level Texture Cache.
- Note : In the current implementation the same value is used as the size of the pending miss queue in the Fetch Cache associated with the first level Texture Cache. Each entry of either queue stores enough data to serve a miss or cache line fill.
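The sizing relations stated in the notes of this section can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the simulator sources; all function names are hypothetical, and it assumes 4 bytes per texel (the 32-bit texels mentioned for TextureBlockDimension). The example values are the ones used in the baseline [FRAGMENTSHADER] configuration.

```python
# Hypothetical helpers (not part of the ATTILA sources) that restate the
# sizing rules given in the FRAGMENTSHADER parameter notes.

BYTES_PER_TEXEL = 4  # assumption: 32-bit texels, as in TextureBlockDimension


def texture_block_bytes(block_dimension: int) -> int:
    """Bytes in a 2^n x 2^n block of 32-bit texels (n = TextureBlockDimension)."""
    side = 1 << block_dimension
    return side * side * BYTES_PER_TEXEL


def texture_cache_bytes(line_size: int, ways: int, lines_per_way: int) -> int:
    """Capacity in bytes of a cache with the given line size, ways and lines per way."""
    return line_size * ways * lines_per_way


def thread_window_entries(executable_threads: int, thread_group: int,
                          locked_execution_mode: bool) -> int:
    """Entries in the thread window that is used when ThreadWindow is TRUE."""
    if locked_execution_mode:
        return executable_threads // thread_group
    return executable_threads


def texture_units_for_1_to_1(thread_rate: int) -> int:
    """TextureUnits value that keeps a 1:1 ALU to texture ratio (ThreadRate / 4)."""
    return thread_rate // 4


# Values taken from the baseline configuration.
print(texture_block_bytes(3))               # TextureBlockDimension = 3
print(texture_cache_bytes(256, 4, 16))      # line size, ways, lines per way
print(thread_window_entries(240, 4, True))  # ExecutableThreads, ThreadGroup
print(texture_units_for_1_to_1(4))          # ThreadRate = 4
```

With the baseline values a texture block occupies 256 bytes, which equals TextureCacheLineSize; this is one way of satisfying the note that the block size should be related to the Texture Cache line size.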
ZSTENCILTEST Section
The ZSTENCILTEST section configures the Z and Stencil Test unit/stage of the ATTILA GPU architecture.
The parameters that can be set in the ZSTENCILTEST section are:
- StampsPerCycle
- Type : number
- Description : Number of fragment quads that can be received and processed per cycle in the Z and Stencil Test unit/stage of the ATTILA GPU architecture.
- BytesPerPixel
- Type : number
- Description : Number of bytes stored in the Z and Stencil buffer of the framebuffer per pixel.
- Note : The current implementation only supports 4 bytes per pixel (8-bit stencil and 24-bit depth).
- DisableCompression
- Type : boolean
- Description : Disables (TRUE) or enables (FALSE) compression of depth values when evicting lines out of the Z cache.
- Note : This parameter must be set to TRUE for the early z rejection test in the Hierarchical Z unit/stage to work. Compression of Z cache lines only works when the generation tile size (configured in the RASTERIZER section) is set to 8x8 fragments and the Z cache line size is 256 bytes.
- ZCacheWays
- Type : number
- Description : Number of ways in the Z cache.
- ZCacheLines
- Type : number
- Description : Number of lines per Z cache way.
- ZCacheStampsPerLine
- Type : number
- Description : Number of pixel quad Z and Stencil values that can be stored in a Z cache line.
- Note : The actual size of the cache line is the value assigned to this parameter multiplied by 4 and by the value of the BytesPerPixel parameter.
- ZCachePortWidth
- Type : number
- Description : Bandwidth in bytes per cycle of a Z Cache port.
- ZCacheExtraReadPort
- Type : boolean
- Description : When set to TRUE the Z cache implements two read ports, when set to FALSE only one.
- ZCacheExtraWritePort
- Type : boolean
- Description : When set to TRUE the Z cache implements two write ports, when set to FALSE only one.
- ZCacheRequestQueueSize
- Type : number
- Description : Size of the queue that stores pending misses in the Fetch Cache associated with the Z cache.
- Note : Each entry of the queue stores all the information required to process a miss.
- ZCacheInputQueueSize
- Type : number
- Description : Size of the queue that stores pending Z cache line fills.
- Note : Each entry of the queue stores all the information required to process a cache line fill operation.
- ZCacheOutputQueueSize
- Type : number
- Description : Size of the queue that stores pending Z cache line spills.
- Note : Each entry of the queue stores all the information required to process a cache line spill operation.
- BlockStateMemorySize
- Type : number
- Description : Size of the block state memory used by the Fast Z and Stencil Clear and Z Compression algorithms to keep track of the state of a given block (equivalent to a generation tile) of pixels in the Z and Stencil buffer of the framebuffer.
- Note : Each element of the block state memory requires 2 bits. For a framebuffer of 1024x1024 and a generation tile sized as 8x8 fragments the size of this block state memory must be 16K elements. The maximum framebuffer resolution supported by the current implementation of the simulator is defined in the MAX_DISPLAY_RES_X and MAX_DISPLAY_RES_Y constants in the 'sim/GPU.h' file of the simulator source code. The value assigned to this parameter must be large enough to support the maximum framebuffer resolution.
- BlocksClearedPerCycle
- Type : number
- Description : Number of elements in the block state memory that can be written per cycle.
- Note : The value of this parameter is only used to determine how fast a clear operation of the Z and Stencil buffer is performed.
- CompressionUnitLatency
- Type : number
- Description : Pipeline depth (in cycles) of the datapath that compresses Z cache lines before transferring the data to the Memory Controller.
- Note : The current implementation of the Z Compression algorithm has three compression ratios: no compression, 1/2 and 1/4.
- DecompressionUnitLatency
- Type : number
- Description : Pipeline depth (in cycles) of the datapath that decompresses compressed Z cache lines received from the Memory Controller.
- Note : The current implementation of the Z Compression algorithm has three compression ratios: no compression, 1/2 and 1/4.
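The notes for ZCacheStampsPerLine and BlockStateMemorySize both describe simple products. The sketch below restates them in Python; it is illustrative only, the function names are hypothetical, and the 8x8 generation tile is passed in as a default argument rather than read from the RASTERIZER section.

```python
# Hypothetical helpers restating the notes on ZCacheStampsPerLine and
# BlockStateMemorySize; the names are illustrative, not simulator code.

FRAGMENTS_PER_STAMP = 4    # a stamp is a quad of four fragments
BITS_PER_BLOCK_STATE = 2   # from the BlockStateMemorySize note


def z_cache_line_bytes(stamps_per_line: int, bytes_per_pixel: int) -> int:
    """Z cache line size: stamps per line x 4 fragments x bytes per pixel."""
    return stamps_per_line * FRAGMENTS_PER_STAMP * bytes_per_pixel


def block_state_elements(res_x: int, res_y: int,
                         gen_width: int = 8, gen_height: int = 8) -> int:
    """Number of generation tiles (blocks) needed to cover the framebuffer."""
    blocks_x = -(-res_x // gen_width)    # ceiling division
    blocks_y = -(-res_y // gen_height)
    return blocks_x * blocks_y


def block_state_bits(elements: int) -> int:
    """Storage in bits for the given number of block state elements."""
    return elements * BITS_PER_BLOCK_STATE


print(z_cache_line_bytes(16, 4))         # 256-byte line, as required for compression
print(block_state_elements(1024, 1024))  # 16384 elements (16K)
print(block_state_bits(16384))           # 32768 bits
```

By the same arithmetic, an element count of 262144 corresponds to a 4096x4096 framebuffer divided into 8x8 tiles.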
Queue parameter for the public release version of the simulator:
- ZQueueSize
- Type : number
- Description : Size of a queue that stores all the fragment quads that are being processed in the Z and Stencil Test unit/stage of the ATTILA GPU architecture.
- Note : Each entry of this queue stores all the information associated with a fragment quad that must perform a Z and/or Stencil Test (coordinates, cull flags, inside triangle flags, interpolated z and final color for four fragments).
Queue parameters for the internal version of the simulator:
- InputQueueSize
- Type : number
- Description : Size of the queue that stores incoming fragment quads to be processed in the Z Stencil Test unit. The fragment quads wait for the Z Cache fetch unit to be available.
- FetchQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads that are waiting to read data from the Z Cache. The fragment quads enter the queue after all the Z Cache lines that contain data for the quad have been fetched and reserved.
- ReadQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads that have read data from the Z Cache. After reading the Z Cache data the fragment quads wait for the Z and Stencil Test unit to perform the comparisons set in the current render state that evaluate if the fragments in the quad can pass down the pipeline or not.
- OpQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads that are waiting to write data into the Z Cache. The fragment quads enter the queue after passing the evaluation in the Z and Stencil Test unit.
- WriteQueueSize
- Type : number
- Description : Size of the queue that holds fragment quads until they can be issued to the next stage of the ATTILA GPU pipeline. The fragment quads enter the queue after writing the resulting Z and Stencil data into the Z Cache.
- ZALUTestRate
- Type : number
- Description : Rate at which fragment quads can be tested in the Z and Stencil Test unit/stage of the ATTILA GPU architecture. This parameter is the minimum number of cycles that must pass after a fragment quad enters the datapath that performs the Z and Stencil comparisons before the next fragment quad can enter.
- Note : The Z cache is not affected by this parameter.
- ZALULatency
- Type : number
- Description : Pipeline depth (in cycles) of the datapath that performs the comparison for the Z and Stencil test.
- Note : The pipeline stages that implement the access to the Z cache and memory are not accounted for by this parameter.
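The interaction between ZALUTestRate (issue interval) and ZALULatency (pipeline depth) can be illustrated with a simple timing estimate. This is a sketch under strong assumptions: an ideal pipelined datapath with no Z cache misses, no queue back-pressure and no memory stalls. The function name is hypothetical and the simulator's real timing depends on many more parameters.

```python
# Idealised timing model (an assumption, not simulator code): quads enter the
# test datapath every `test_rate` cycles and each takes `latency` cycles.

def cycles_to_test(quads: int, test_rate: int, latency: int) -> int:
    """Cycles until the last of `quads` fragment quads leaves the test datapath."""
    if quads <= 0:
        return 0
    # The last quad enters (quads - 1) * test_rate cycles after the first one
    # and then needs `latency` cycles to traverse the pipeline.
    return (quads - 1) * test_rate + latency


print(cycles_to_test(64, 1, 2))  # ZALUTestRate = 1, ZALULatency = 2
print(cycles_to_test(64, 2, 2))  # halving the issue rate roughly doubles the time
```

The estimate shows why the rate dominates for long runs of quads while the latency only adds a fixed offset.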
COLORWRITE Section
The COLORWRITE section configures the Color Write unit/stage of the ATTILA GPU architecture.
The parameters that can be set in the COLORWRITE section are:
- StampsPerCycle
- Type : number
- Description : Number of fragment quads that can be received and processed per cycle in the Color Write unit/stage of the ATTILA GPU architecture.
- BytesPerPixel
- Type : number
- Description : Number of bytes stored in the Color buffer of the framebuffer per pixel.
- Note : The current implementation only supports 4 bytes per pixel (8-bit per component RGBA color).
- DisableCompression
- Type : boolean
- Description : Disables (TRUE) or enables (FALSE) compression of color values when evicting lines out of the Color cache.
- Note : Compression of Color cache lines only works when the generation tile size (configured in the RASTERIZER section) is set to 8x8 fragments and the Color cache line size is 256 bytes. The supported compression ratios are: no compression, 1/2 and 1/4. The Color Compression algorithm will only achieve good compression ratios for plain colored blocks.
- ColorCacheWays
- Type : number
- Description : Number of ways in the Color cache.
- ColorCacheLines
- Type : number
- Description : Number of lines per Color cache way.
- ColorCacheStampsPerLine
- Type : number
- Description : Number of pixel quad color values that can be stored in a Color cache line.
- Note : The actual size of the cache line is the value assigned to this parameter multiplied by 4 and by the value of the BytesPerPixel parameter.
- ColorCachePortWidth
- Type : number
- Description : Bandwidth in bytes per cycle of a Color Cache port.
- ColorCacheExtraReadPort
- Type : boolean
- Description : When set to TRUE the Color cache implements two read ports, when set to FALSE only one.
- ColorCacheExtraWritePort
- Type : boolean
- Description : When set to TRUE the Color cache implements two write ports, when set to FALSE only one.
- ColorCacheRequestQueueSize
- Type : number
- Description : Size of the queue that stores pending misses in the Fetch Cache associated with the Color cache.
- Note : Each entry of the queue stores all the information required to process a miss.
- ColorCacheInputQueueSize
- Type : number
- Description : Size of the queue that stores pending Color cache line fills.
- Note : Each entry of the queue stores all the information required to process a cache line fill operation.
- ColorCacheOutputQueueSize
- Type : number
- Description : Size of the queue that stores pending Color cache line spills.
- Note : Each entry of the queue stores all the information required to process a cache line spill operation.
- BlockStateMemorySize
- Type : number
- Description : Size of the block state memory used by the Fast Color Clear and Color Compression algorithms to keep track of the state of a given block (equivalent to a generation tile) of pixels in the Color buffer of the framebuffer.
- Note : Each element of the block state memory requires 2 bits. For a framebuffer of 1024x1024 and a generation tile sized as 8x8 fragments the size of this block state memory must be 16K elements. The maximum framebuffer resolution supported by the current implementation of the simulator is defined in the MAX_DISPLAY_RES_X and MAX_DISPLAY_RES_Y constants in the 'sim/GPU.h' file of the simulator source code. The value assigned to this parameter must be large enough to support the maximum framebuffer resolution.
- BlocksClearedPerCycle
- Type : number
- Description : Number of elements in the block state memory that can be written per cycle.
- Note : The value of this parameter is only used to determine how fast a clear operation of the Color buffer is performed.
- CompressionUnitLatency
- Type : number
- Description : Pipeline depth (in cycles) of the datapath that compresses Color cache lines before transferring the data to the Memory Controller.
- Note : The current implementation of the Color Compression algorithm has three compression ratios: no compression, 1/2 and 1/4.
- DecompressionUnitLatency
- Type : number
- Description : Pipeline depth (in cycles) of the datapath that decompresses compressed Color cache lines received from the Memory Controller.
- Note : The current implementation of the Color Compression algorithm has three compression ratios: no compression, 1/2 and 1/4.
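The compression constraints described for DisableCompression can be written down as a small check. The following Python sketch is illustrative only: the function names are hypothetical, and it simply encodes the stated conditions (8x8 generation tile, 256-byte cache line) and the three stated ratios.

```python
# Hypothetical helpers encoding the Color compression constraints from the
# DisableCompression note; not taken from the simulator sources.

COMPRESSION_DIVISORS = (1, 2, 4)  # no compression, 1/2 and 1/4


def color_cache_line_bytes(stamps_per_line: int, bytes_per_pixel: int) -> int:
    """Color cache line size: stamps per line x 4 fragments x bytes per pixel."""
    return stamps_per_line * 4 * bytes_per_pixel


def compression_usable(gen_width: int, gen_height: int, line_bytes: int) -> bool:
    """True when the documented preconditions for line compression hold."""
    return gen_width == 8 and gen_height == 8 and line_bytes == 256


def possible_spill_sizes(line_bytes: int) -> list:
    """Bytes written to memory for a line at each supported compression ratio."""
    return [line_bytes // divisor for divisor in COMPRESSION_DIVISORS]


line = color_cache_line_bytes(16, 4)   # ColorCacheStampsPerLine = 16, BytesPerPixel = 4
print(line)                            # 256
print(compression_usable(8, 8, line))  # True
print(possible_spill_sizes(line))      # [256, 128, 64]
```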
Queue parameter for the public version of the simulator:
- ColorQueueSize
- Type : number
- Description : Size of a queue that stores all the fragment quads that are being processed in the Color Write unit/stage of the ATTILA GPU architecture.
- Note : Each entry of this queue stores all the information associated with a fragment quad that must update the Color Buffer (coordinates, cull flags, inside triangle flags and final color for four fragments).
Queue parameters for the internal version of the simulator:
- InputQueueSize
- Type : number
- Description : Size of the queue that stores incoming fragment quads to be processed in the Color Write/Blend unit. The fragment quads wait for the Color Cache fetch unit to be available.
- FetchQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads that are waiting to read data from the Color Cache. The fragment quads enter the queue after all the Color Cache lines that contain data for the quad have been fetched and reserved.
- ReadQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads that have read data from the Color Cache. After reading the Color Cache data the fragment quads wait for the Color Blend unit to perform the blending operations set in the current render state.
- OpQueueSize
- Type : number
- Description : Size of the queue that stores fragment quads that are waiting to write data into the Color Cache. The fragment quads enter the queue after performing the blending operation in the Color Blend unit.
- WriteQueueSize
- Type : number
- Description : Size of the queue that holds fragment quads until they can be issued to the next stage of the ATTILA GPU pipeline. The fragment quads enter the queue after writing the resulting color data into the Color Cache.
- Note : There are no other stages after Color Write/Blend so fragment quads actually are destroyed when they arrive at this queue.
- BlendALUTestRate
- Type : number
- Description : Rate at which fragment quads can be blended or update the Color buffer in the Color Write unit/stage of the ATTILA GPU architecture. This parameter is the minimum number of cycles that must pass after a fragment quad enters the datapath that performs the blending operation before the next fragment quad can enter.
- Note : The Color cache is not affected by this parameter.
- BlendALULatency
- Type : number
- Description : Pipeline depth (in cycles) of the datapath that performs the blending operations in the Color Write unit/stage of the ATTILA GPU architecture.
- Note : The pipeline stages that implement the access to the Color cache and memory are not accounted for by this parameter.
DAC Section
The DAC section configures the DAC unit/stage of the ATTILA GPU architecture.
The parameters that can be set in the DAC section are:
- BytesPerPixel
- Type : number
- Description : Number of bytes stored in the Color buffer of the framebuffer per pixel.
- Note : The current implementation only supports 4 bytes per pixel (8-bit per component RGBA color).
- BlockSize
- Type : number
- Description : Size in bytes of a Color buffer block.
- Note : This number must correspond with the size of a Color cache line and the generation tile size multiplied by the BytesPerPixel parameter.
- BlockUpdateLatency
- Type : number
- Description : Latency of the data buses that carry color block state data from the different Color Write unit/stage instances to the DAC unit/stage in the ATTILA GPU architecture.
- BlocksUpdatedPerCycle
- Type : number
- Description : Bandwidth of the data buses that carry color block state data from the different Color Write unit/stage instances to the DAC unit/stage in the ATTILA GPU architecture.
- Note : The value assigned to this parameter is the number of block state elements that can be received per cycle. Each color block state element takes 2 bits.
- BlockRequestQueueSize
- Type : number
- Description : Size of the queue that stores pending Color buffer block requests to the Memory Controller.
- Note : Each entry of this queue stores all the information required for requesting a Color block from the Memory Controller.
- DecompressionUnitLatency
- Type : number
- Description : Pipeline depth (in cycles) of the datapath that decompresses compressed Color buffer blocks received from the Memory Controller.
- RefreshRate
- Type : number
- Description : This parameter sets the screen refresh frequency in cycles.
- Note : This parameter only applies when the SynchedRefresh parameter is set to FALSE and the RefreshFrame parameter is set to TRUE.
- SynchedRefresh
- Type : boolean
- Description : When this parameter is set to TRUE the DAC dumps the content of the color front buffer as a PPM file after each SWAP command (end of frame) received by the ATTILA GPU. When this parameter is set to FALSE the color front buffer is dumped to a PPM file with the frequency configured in the RefreshRate parameter.
- Note : This parameter is usually set to TRUE to verify the correctness of the frames rendered by the simulator. In this mode, while the frame is being dumped, the other units/stages of the ATTILA GPU architecture remain stalled. When set to FALSE the simulator may not work as expected as this feature has not been tested in a while. When set to FALSE the DAC operation does not stall the other units/stages of the ATTILA GPU architecture. The behaviour of the DAC when this parameter is set to FALSE is not the correct behaviour of a DAC unit in a real GPU, as the bandwidth consumption of the simulated DAC unit is not limited by the output bandwidth of an RGB or DVI port to a screen display. The DACBusWidth parameter in the MEMORYCONTROLLER section may be used to limit this bandwidth.
- RefreshFrame
- Type : boolean
- Description : When this parameter is set to TRUE the DAC unit simulates a screen refresh and dumps the current contents of the color frontbuffer to a PPM file based on the refresh mode specified by the SynchedRefresh parameter (dump after end of frame when set to TRUE, dump at a fixed frequency when set to FALSE).
- Note : Setting this parameter to FALSE when the SynchedRefresh parameter is set to TRUE saves the few simulation cycles that the DAC takes to read the color front buffer between the rendering of two frames. When this parameter is set to FALSE and the SynchedRefresh parameter is set to FALSE only the extra bandwidth consumed by the DAC unit at the given screen refresh frequency is saved.
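The combined effect of the RefreshFrame and SynchedRefresh parameters, and the BlockSize constraint, can be summarised in code. This is a reading of the descriptions above expressed as a Python sketch, not the simulator's implementation; the function names and the returned strings are hypothetical, and treating RefreshFrame = FALSE as "no dump" is an inference from the parameter notes.

```python
# Hypothetical summary of the DAC refresh parameters; illustrative only.

def dac_dump_mode(refresh_frame: bool, synched_refresh: bool) -> str:
    """Describe when the colour front buffer is dumped to a PPM file."""
    if not refresh_frame:
        # Inferred from the RefreshFrame note: no screen refresh is simulated.
        return "no dump"
    if synched_refresh:
        return "dump after each SWAP command"
    return "dump every RefreshRate cycles"


def dac_block_size(gen_width: int, gen_height: int, bytes_per_pixel: int) -> int:
    """BlockSize implied by the generation tile size and BytesPerPixel."""
    return gen_width * gen_height * bytes_per_pixel


print(dac_dump_mode(True, True))    # the usual setting for frame verification
print(dac_dump_mode(True, False))
print(dac_dump_mode(False, True))
print(dac_block_size(8, 8, 4))      # 256, matching a 256-byte Color cache line
```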
ATTILA baseline configuration
The sample bGPU.ini configuration file below corresponds to the baseline configuration of the ATTILA architecture described here:
Overall Pipeline Configuration
Parameter | value |
---|---|
Vertex Shader Units (non-unified version only) | 4 |
Fragment Shader Units (vertex shader units also for unified version) | 2 |
ROP Pipelines | 2 |
Texture Rate per Fragment Shader | 4 |
GPU Units configuration
GPU Unit | Input BW | Output BW | Input Queue Size | Input Queue Element Width | Latency |
---|---|---|---|---|---|
Streamer | 1 index | 1 vertex | 48 | 16x4x32 bits | Mem cycles |
Primitive Assembly | 1 vertex | 1 triangle | 8 | 3x16x4x32 bits | 1 cycle |
Clipping | 1 triangle | 1 triangle | 4 | 3x4x32 bits | 6 cycles |
Triangle Setup | 1 triangle | 1 triangle | 12 | 3x4x32 bits | 10 cycles |
Fragment Generation | 1 triangle | 2x64 fragments | 16 | 3x4x32 bits | 1 cycle |
Hierarchical Z | 2x64 fragments | 2x64 fragments | 64 | (2x16+4x32)x4 bits | 1 cycle |
Z Test | 4 fragments | 4 fragments | 64 | (2x16+4x32)x4 bits | 2 + Mem cycles |
Interpolator | 2x4 fragments | 2x4 fragments | - | - | 2 to 8 cycles |
Color Write | 4 fragments | - | 64 | (2x16+4x32)x4 bits | 2 + Mem cycles |
Shader (vertex) | 1 vertex | 1 vertex | 12+4 | 16x4x32 bits | variable |
Shader (fragment/unified) | 4 fragments | 4 fragments | 112+16 | 10x4x32 bits | variable |
Memory Configuration
Parameter | value |
---|---|
Memory Size | 64 MB |
Memory Bus Width (Per Channel) | 64 bits |
Memory Channels | 4 |
System memory region size | 16 MB |
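The sizes in the Memory Configuration table are given in MB, while the MEMORYCONTROLLER section of bGPU.ini takes plain byte counts. The short Python sketch below performs the conversion; it assumes 1 MB = 2^20 bytes (which matches the byte values used in the baseline file), and the helper names are hypothetical.

```python
# Unit conversions for the Memory Configuration table; helper names are
# illustrative and 1 MB is assumed to be 2**20 bytes.

MB = 1024 * 1024


def megabytes_to_bytes(megabytes: int) -> int:
    """Byte count to write in bGPU.ini for a size given in MB."""
    return megabytes * MB


def aggregate_bus_bytes(channels: int, bus_width_bits: int) -> int:
    """Combined width in bytes of all memory channels."""
    return channels * bus_width_bits // 8


print(megabytes_to_bytes(64))      # MemorySize       -> 67108864
print(megabytes_to_bytes(16))      # MappedMemorySize -> 16777216
print(aggregate_bus_bytes(4, 64))  # 4 channels x 64 bits -> 32 bytes
```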
# # bGPU Simulator # # Configuration File # # 30/11/2004 # [SIMULATOR] InputFile = "gltrace-sphere" SimCycles = 10000 SimFrames = 1 SignalDumpFile = "signaltrace.txt" StatsFile = "stats.csv" StatsFilePerFrame = "stats.frame.csv" StatsFilePerBatch = "stats.batch.csv" StartFrame = 0 StartSignalDump = 0 SignalDumpCycles = 10000 StatisticsRate = 1000 DumpSignalTrace = FALSE Statistics = FALSE PerFrameStatistics = FALSE PerBatchStatistics = FALSE GenerateFragmentMap = FALSE # # Latency map modes # # 0 : latency of the fragment since it was generated until it was written into the # color buffer. # FragmentMapMode = 0 DoubleBuffer = FALSE ForceMSAA = FALSE MSAASamples = 4 ForceFP16ColorBuffer = FALSE ObjectSize0 = 512 BucketSize0 = 32768 ObjectSize1 = 4096 BucketSize1 = 2048 ObjectSize2 = 64 BucketSize2 = 32768 [GPU] NumVertexShaders = 4 NumFragmentShaders = 2 NumStampPipes = 2 [COMMANDPROCESSOR] PipelinedBatchRendering = TRUE [MEMORYCONTROLLER] MemorySize = 67108864 MemoryClockMultiplier = 1 MemoryFrequency = 1 MemoryBusWidth = 64 MemoryBuses = 4 SharedBanks = FALSE BankGranurality = 1024 BurstLength = 8 ReadLatency = 10 WriteLatency = 6 WriteToReadLatency = 6 MemoryPageSize = 4096 OpenPages = 1 PageOpenLatency = 20 MaxConsecutiveReads = 16 MaxConsecutiveWrites = 16 CommandProcessorBusWidth = 8 StreamerFetchBusWidth = 64 StreamerLoaderBusWidth = 64 ZStencilBusWidth = 64 ColorWriteBusWidth = 64 DACBusWidth = 64 TextureUnitBusWidth = 64 MappedMemorySize = 16777216 ReadBufferLines = 32 WriteBufferLines = 32 RequestQueueSize = 64 ServiceQueueSize = 32 MemoryControllerV2 = FALSE # Parameters only for Memory Controller V2 V2MemoryChannels = 4 V2BanksPerMemoryChannel = 8 V2MemoryRowSize = 4096 V2BurstElementsPerCycle = 4 V2ChannelInterleaving = 1024 V2BankInterleaving = 1024 # 0 = fifo V2ChannelScheduler = 0 # 0 = close, 1 = open V2PagePolicy = 1 # flag that allows to use a memory model without timing constraints (only signal latency overhead) V2PerfectMemory = FALSE [STREAMER] 
IndicesCycle = 1
IndexBufferSize = 1024
InputRequestQueueSize = 8
AttributesCycle = 4
InputCacheLines = 32
InputCacheLineSize = 64
InputCachePortWidth = 16
InputCacheRequestQueueSize = 4
InputCacheInputQueueSize = 4
OutputFIFOSize = 64
OutputMemorySize = 48
VerticesCycle = 1
AttributesSentCycle = 4

[VERTEXSHADER]
ExecutableThreads = 12
InputBuffers = 4
ThreadResources = 128
ThreadRate = 1
FetchRate = 1
ThreadGroup = 1
LockedExecutionMode = FALSE
#
# Enabling the scalar ALU requires FetchRate to be 2.
#
ScalarALU = FALSE
ThreadWindow = TRUE
FetchDelay = 0
SwapOnBlock = FALSE
InputsPerCycle = 1
OutputsPerCycle = 1
OutputLatency = 11

[PRIMITIVEASSEMBLY]
VerticesCycle = 1
TrianglesCycle = 1
InputBusLatency = 10
AssemblyQueueSize = 8

[CLIPPER]
TrianglesCycle = 1
ClipperUnits = 1
StartLatency = 1
ExecLatency = 6
ClipBufferSize = 4

[RASTERIZER]
TrianglesCycle = 1
SetupFIFOSize = 12
SetupUnits = 1
SetupLatency = 10
SetupStartLatency = 4
TriangleInputLatency = 2
TriangleOutputLatency = 2
TriangleSetupOnShader = FALSE
TriangleShaderQueueSize = 8
StampsPerCycle = 2
MSAASamplesCycle = 2
OverScanWidth = 4
OverScanHeight = 4
ScanWidth = 16
ScanHeight = 16
GenWidth = 8
GenHeight = 8
RasterizationBatchSize = 4
BatchQueueSize = 16
RecursiveMode = TRUE
DisableHZ = FALSE
StampsPerHZBlock = 16
HierarchicalZBufferSize = 262144
HZCacheLines = 8
HZCacheLineSize = 16
EarlyZQueueSize = 128
HZAccessLatency = 5
HZUpdateLatency = 4
HZBlocksClearedPerCycle = 256
NumInterpolators = 4
ShaderInputQueueSize = 16
ShaderOutputQueueSize = 16
ShaderInputBatchSize = 256
TiledShaderDistribution = TRUE
#
# These two parameters are only for the unified shader version.
#
VertexInputQueueSize = 16
ShadedVertexQueueSize = 48
TriangleInputQueueSize = 8
TriangleOutputQueueSize = 8
GeneratedStampQueueSize = 128
EarlyZTestedStampQueueSize = 32
InterpolatedStampQueueSize = 16
ShadedStampQueueSize = 128
EmulatorStoredTriangles = 32

[FRAGMENTSHADER]
ExecutableThreads = 240
InputBuffers = 16
ThreadResources = 240
ThreadRate = 4
FetchRate = 1
ThreadGroup = 4
LockedExecutionMode = TRUE
#
# Enabling the scalar ALU requires FetchRate to be 2.
#
ScalarALU = FALSE
ThreadWindow = TRUE
FetchDelay = 0
SwapOnBlock = FALSE
InputsPerCycle = 4
OutputsPerCycle = 4
OutputLatency = 11
TextureUnits = 1
TextureRequestRate = 1
TextureRequestGroup = 64
AddressALULatency = 6
FilterALULatency = 4
AnisotropyAlgorithm = 1
TextureBlockDimension = 3
TextureSuperBlockDimension = 3
TextureRequestQueueSize = 4
TextureAccessQueue = 64
TextureResultQueue = 4
TextureWaitReadWindow = 32
TwoLevelTextureCache = FALSE
TextureCacheLineSize = 256
TextureCacheWays = 4
TextureCacheLines = 16
TextureCachePortWidth = 4
TextureCacheRequestQueueSize = 4
TextureCacheInputQueue = 4
TextureCacheLineSizeL1 = 256
TextureCacheWaysL1 = 4
TextureCacheLinesL1 = 16
TextureCacheInputQueueL1 = 4
TextureCacheMissesPerCycle = 1
TextureCacheDecompressLatency = 4

[ZSTENCILTEST]
StampsPerCycle = 1
BytesPerPixel = 4
DisableCompression = FALSE
ZCacheWays = 4
ZCacheLines = 16
ZCacheStampsPerLine = 16
ZCachePortWidth = 32
ZCacheExtraReadPort = TRUE
ZCacheExtraWritePort = TRUE
ZCacheRequestQueueSize = 8
ZCacheInputQueueSize = 8
ZCacheOutputQueueSize = 8
BlockStateMemorySize = 262144
BlocksClearedPerCycle = 1024
CompressionUnitLatency = 8
DecompressionUnitLatency = 8
#ZQueueSize = 64
InputQueueSize = 8
FetchQueueSize = 64
ReadQueueSize = 16
OpQueueSize = 4
WriteQueueSize = 8
ZALUTestRate = 1
ZALULatency = 2

[COLORWRITE]
StampsPerCycle = 1
BytesPerPixel = 4
DisableCompression = FALSE
ColorCacheWays = 4
ColorCacheLines = 16
ColorCacheStampsPerLine = 16
ColorCachePortWidth = 32
ColorCacheExtraReadPort = TRUE
ColorCacheExtraWritePort = TRUE
ColorCacheRequestQueueSize = 8
ColorCacheInputQueueSize = 8
ColorCacheOutputQueueSize = 8
BlockStateMemorySize = 262144
BlocksClearedPerCycle = 1024
CompressionUnitLatency = 8
DecompressionUnitLatency = 8
#ColorQueueSize = 64
InputQueueSize = 8
FetchQueueSize = 64
ReadQueueSize = 16
OpQueueSize = 4
WriteQueueSize = 8
BlendALURate = 1
BlendALULatency = 2

[DAC]
BytesPerPixel = 4
BlockSize = 256
BlockUpdateLatency = 1
BlocksUpdatedPerCycle = 1024
BlockRequestQueueSize = 32
#
# While we use the DAC just to dump the frame after each swap
# we can dismiss the real decompression latency to speed up the
# dumping.
#
#DecompressionUnitLatency = 8
DecompressionUnitLatency = 1
RefreshRate = 5000000
SynchedRefresh = TRUE
RefreshFrame = TRUE
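
For scripts that need to inspect or compare configuration files outside the simulator, the file format described at the top of this page can be read with a few lines of Python. The following is only a minimal sketch and is not the simulator's ConfigurationLoader class: the function names parse_attila_config and _convert are hypothetical, and the sketch assumes one parameter per line and that a '#' never appears inside a quoted string.

```python
def _convert(token):
    """Map a raw value to one of the three documented parameter types."""
    if token == "TRUE":
        return True
    if token == "FALSE":
        return False
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]                      # string between quotes
    if token.isascii() and token.isdigit():
        return int(token)                       # natural number
    raise ValueError(f"unsupported value: {token!r}")


def parse_attila_config(text):
    """Parse bGPU.ini-style text into {section: {parameter: value}}.

    Hypothetical helper, not part of the ATTILA sources.
    """
    config = {}
    section = None
    for raw in text.splitlines():
        # Everything after '#' is a comment (assumes no '#' inside strings).
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            if section in config:
                # The page states that a section can only appear once.
                raise ValueError(f"duplicate section: {section}")
            config[section] = {}
            continue
        name, sep, value = line.partition("=")
        if not sep or section is None:
            raise ValueError(f"cannot parse line: {raw!r}")
        config[section][name.strip()] = _convert(value.strip())
    return config


# Usage with a short excerpt of the listing above.
sample = """
[SIMULATOR]
InputFile = "gltrace-sphere"
SimCycles = 10000
Statistics = FALSE

[DAC]
#DecompressionUnitLatency = 8
DecompressionUnitLatency = 1
SynchedRefresh = TRUE
"""
cfg = parse_attila_config(sample)
print(cfg["DAC"])
```

Because the sketch only stores the parameters that are present, a missing parameter shows up as a missing dictionary key, which makes it easy to spot parameters that the simulator would otherwise read without a predefined value.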