ATTILA Configuration File Public

SIMULATOR section

Trace Parameters

InputFile

Description

Name of the OGL trace (GLInterceptor API call trace file only) or D3D9 PIX trace file to simulate. The parameter can be overridden by passing the trace file name as an argument to the simulator binary.

Format

String


SimCycles

Description

Number of cycles to simulate. When multiple clock domains are supported the GPU or main clock domain is used. The parameter can be overridden by passing the number of cycles (>10K) or frames (<10K) to simulate as an argument to the simulator binary. The parameter is overridden by the SimFrames parameter.

Format

Integer


SimFrames

Description

Number of frames to simulate. This parameter overrides the SimCycles parameter. The parameter can be overridden by passing the number of cycles (>10K) or frames (<10K) to simulate as an argument to the simulator binary.

Format

Integer


StartFrame

Description

First frame to simulate from the input OGL or D3D9 trace. The simulator will skip (GPU state and memory state are updated but no rendering is performed) as many frames as defined by the parameter. The parameter can be overridden by passing the start frame as an argument to the simulator binary.

Format

Integer
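
As an illustration of how the trace parameters interact, the sketch below skips the first 100 frames and then simulates 5 frames; SimFrames takes precedence over SimCycles, and the trace file name is a made-up placeholder:

InputFile = frames.trace
SimCycles = 10000000
SimFrames = 5
StartFrame = 100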

Signal Trace Parameters

DumpSignalTrace

Description

Enables dumping a trace of the traffic circulating between boxes through signals.

Format

Boolean (TRUE/FALSE)


StartSignalDump

Description

Defines the simulation cycle at which point the dumping of the signal trace will start.

Format

Integer


SignalDumpCycles

Description

Defines the number of cycles, starting at the defined simulation cycle, for which the signal trace will be dumped. Be aware that the signal trace generates large uncompressed text files and 100K cycles are likely to require more than 1 GB.

Format

Integer


SignalDumpFile

Description

Name of the file that will be generated for the signal trace.

Format

String
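
For example, a signal trace setup limited to a short window (the cycle values and file name below are just illustrative; keep SignalDumpCycles small given the file sizes mentioned above):

DumpSignalTrace = TRUE
StartSignalDump = 1000000
SignalDumpCycles = 10000
SignalDumpFile = signaltrace.txt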

Statistics Parameters

Statistics

Description

Enables the generation of GPU statistics.

Format

Boolean (TRUE/FALSE)


PerCycleStatistics

Description

Enables the generation of GPU statistics sampled at a fixed rate defined in a number of cycles. The rate is defined by the StatisticsRate parameter. Statistics generation must be enabled using the Statistics parameter.

Format

Boolean (TRUE/FALSE)


PerFrameStatistics

Description

Enables the generation of GPU statistics sampled per frame. Statistics generation must be enabled using the Statistics parameter.

Format

Boolean (TRUE/FALSE)


PerBatchStatistics

Description

Enables the generation of GPU statistics sampled per drawcall/batch. Statistics generation must be enabled using the Statistics parameter.

Format

Boolean (TRUE/FALSE)


StatisticsRate

Description

Defines the rate, in cycles, at which the per-cycle statistics are sampled.

Format

Integer


StatsFile

Description

Name of the file that will be generated for statistics sampled at a fixed cycle rate.

Format

String


StatsFilePerFrame

Description

Name of the file that will be generated for statistics sampled per frame.

Format

String


StatsFilePerBatch

Description

Name of the file that will be generated for statistics sampled per drawcall/batch.

Format

String
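
A possible statistics setup enabling the three sampling modes at once (the rate and file names are illustrative placeholders):

Statistics = TRUE
PerCycleStatistics = TRUE
PerFrameStatistics = TRUE
PerBatchStatistics = TRUE
StatisticsRate = 10000
StatsFile = stats.cycle.out
StatsFilePerFrame = stats.frame.out
StatsFilePerBatch = stats.batch.out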

Stall Detection Parameters

DetectStalls

Description

Enables the simulator stall detection logic implemented in some of the GPU boxes. When the logic detects a stall (no progress) the simulator will stop and generate a stall report (written to StallReport.txt). The current implementation of the stall detection logic is still a prototype and may detect false stalls.

Format

Boolean (TRUE/FALSE)

Fragment Map Parameters

GenerateFragmentMap

Description

Enables the generation of a fragment map for each frame. A fragment map stores per-quad (2x2 fragments) information in PPM format (3 channels, 8 bits per channel). Information related to a single fragment will always correspond to the last fragment written into the corresponding frame position.

Format

Boolean (TRUE/FALSE)


FragmentMapMode

Description

Defines the kind of information that the fragment map will store. Currently supported values:

Format

Integer (see above for valid values)

API/Driver Parameters

ForceMSAA

Description

Forces the driver to enable multisample antialiasing (MSAA).

Format

Boolean (TRUE/FALSE)


MSAASamples

Description

Defines the number of samples that will be used when the driver forces multisample antialiasing. Valid values in the current implementation are 2, 4 and 8.

Format

Integer (for valid values see above)


ForceFP16ColorBuffer

Description

Forces the driver to use a 16-bit floating-point color buffer.

Format

Boolean (TRUE/FALSE)


DoubleBuffer

Description

Forces the driver to create separate buffers for the front and back color buffers.

Format

Boolean (TRUE/FALSE)


EnableDriverShaderTranslation

Description

Enables shader program translation and transformations by the driver. Must be enabled to support a number of new features in the Vector Shader model: LDA for attribute load, SOA ALU architecture, wait points.

Format

Boolean (TRUE/FALSE)


UseACD

Description

When enabled, the OpenGL API implemented on top of the ACD is used to translate OpenGL traces into Attila commands.

Format

Boolean (TRUE/FALSE)
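
As an example, a driver setup forcing 4x MSAA and an FP16 color buffer could look like the sketch below (whether these overrides make sense depends on the trace being simulated):

ForceMSAA = TRUE
MSAASamples = 4
ForceFP16ColorBuffer = TRUE
DoubleBuffer = TRUE
EnableDriverShaderTranslation = TRUE
UseACD = TRUE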

Dynamic Memory Parameters

ObjectSize0

Description

Defines the size in bytes of allocation blocks in the first bucket of the OptimizedDynamicMemory manager. The manager can only allocate one block per object, so this bucket is used for objects whose size is smaller than the defined size.

Format

Integer


BucketSize0

Description

Defines the number of blocks in the first bucket of the OptimizedDynamicMemory manager.

Format

Integer


ObjectSize1

Description

Defines the size in bytes of allocation blocks in the second bucket of the OptimizedDynamicMemory manager. The size of blocks in the second bucket must be larger than the size of blocks in the first bucket. The manager can only allocate one block per object, so this bucket is used for objects whose size is larger than the first bucket blocks and smaller than the defined size.

Format

Integer


BucketSize1

Description

Defines the number of blocks in the second bucket of the OptimizedDynamicMemory manager.

Format

Integer


ObjectSize2

Description

Defines the size in bytes of allocation blocks in the third bucket of the OptimizedDynamicMemory manager. The size of blocks in the third bucket must be larger than the size of blocks in the second bucket. The manager can only allocate one block per object, so this bucket is used for objects whose size is larger than the second bucket blocks and smaller than the defined size. The third bucket is only available if OptimizedDynamicMemory is compiled without FAST_NEW_DELETE defined.

Format

Integer


BucketSize2

Description

Defines the number of blocks in the third bucket of the OptimizedDynamicMemory manager.

Format

Integer
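
The buckets must define strictly increasing block sizes. A hedged example (sizes and block counts are illustrative, not measured requirements):

ObjectSize0 = 64
BucketSize0 = 100000
ObjectSize1 = 256
BucketSize1 = 50000
ObjectSize2 = 1024
BucketSize2 = 10000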


GPU section

NumVertexShaders

Description

For the legacy non-unified shader model the parameter defines the number of vertex shaders implemented in the simulated architecture. For the unified shader model the parameter defines the number of 'signals' or 'channels' between the Streamer and the Shader Work Distributor (FragmentFIFO) limiting the maximum number of vertices that can start or finish shading in a cycle. Other parameters may reduce that maximum limit though.

Any value is allowed but it has not been tested beyond 8. The legacy non-unified shader model has not been tested for years.

Format

Integer

NumFragmentShaders

Description

For the legacy non-unified shader model the parameter defines the number of fragment shaders implemented in the simulated architecture. For the unified shader model the parameter defines the number of shader processors implemented in the simulated architecture.

When tile-based fragment distribution is enabled (see RASTERIZATION section) the only values permitted for this parameter are 1, 2, 4 and 8. When 'batch' based distribution is enabled any value should be valid. The parameter has not been tested with a value larger than 4.

Format

Integer

NumStampPipes

Description

Defines the number of ROPs in the simulated architecture. A ROP for Z and Stencil Test (ZSTENCILTEST) is always paired with a ROP for Color Write and Blending (COLORWRITE).

Due to how fragment distribution is currently implemented the only allowed values are 1, 2, 4 and 8. A value of 8 has never been tested.

Format

Integer

GPUClock

Description

Frequency in MHz of the GPU clock domain (also called main clock domain). The GPU clock domain is used for all units except the shader processor and memory (channel schedulers, interface and GDDR) if the frequencies specified for those two are different than the GPU clock frequency.

Values up to 1 THz (1M MHz) are allowed as the frequency is internally converted to picoseconds by the simulator. Due to the conversion to picoseconds the actual simulated frequency (and the ratio with the other clock domain frequencies) may not be exact.

Format

Integer

ShaderClock

Description

Defines the frequency of the shader clock domain. The frequency only applies to the Shader Processor in the unified shader model (excluding interface with Shader Work Distributor/FragmentFIFO and Texture Unit).

Shader clock domain is only implemented for the Vector Shader model.

Values up to 1 THz (1M MHz) are allowed as the frequency is internally converted to picoseconds by the simulator. Due to the conversion to picoseconds the actual simulated frequency (and the ratio with the other clock domain frequencies) may not be exact.

Format

Integer

MemoryClock

Description

Defines the frequency of the memory clock domain. The frequency applies to the channel schedulers, interface and GDDR modules but not to the interconnect network with the different GPU units, request and service queues and the transaction splitter.

Values up to 1 THz (1M MHz) are allowed as the frequency is internally converted to picoseconds by the simulator. Due to the conversion to picoseconds the actual simulated frequency (and the ratio with the other clock domain frequencies) may not be exact.

Format

Integer
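
A sketch of the GPU section with separate clock domains (unit counts and frequencies are illustrative; remember the shader clock is only honored by the Vector Shader model):

NumVertexShaders = 4
NumFragmentShaders = 4
NumStampPipes = 4
GPUClock = 500
ShaderClock = 1000
MemoryClock = 750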

COMMANDPROCESSOR section

PipelinedBatchRendering

Description

Enables pipelining the start of a draw call with the end of the previous draw call. Draw call rendering/processing is overlapped if the first draw call has finished all geometry processing (up to the Clipper stage) and the second draw call can start. Updates to the GPU memory from the CPU (AGP_WRITE) are also overlapped when possible to advance work. Register updates that may affect the previous draw call are stored and applied when the next draw call starts or reaches the fragment processing stages.

Format

Boolean (TRUE/FALSE)

DumpShaderPrograms

Description

When enabled the Command Processor will dump to files the binary code of the shader programs being loaded. The files are named vprogramXXXX.out and fprogramXXXX.out for vertex and fragment programs, where XXXX is the order in which the shader programs were loaded, starting at 0000. A loaded shader program may never be executed, but I think the current implementation of the libraries links the program load with an actual draw call.

Format

Boolean (TRUE/FALSE)

MEMORYCONTROLLER

Common parameters

MemorySize

Description

Defines the size in MB of the GPU memory.

For the MemoryControllerV2 model there may be limitations on this size related to the page size (row size), number of channels, banks, etc. defined for the model.

In the current implementation the GPU address space is limited to 32-bit and only the lower 2 GBs can be used for GPU memory.

Format

Integer

MappedMemorySize

Description

Defines the size of the CPU (system) memory mapped to the GPU address space.

In the current implementation the GPU address space is limited to 32-bit and only the upper 2 GBs can be used for mapped memory.

Format

Integer
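
For instance (illustrative sizes, each within the 2 GB limit of its half of the 32-bit address space):

MemorySize = 256
MappedMemorySize = 64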

BurstLength

Description

The purpose of the parameter is to define the number of data cycles per burst request to GDDR memory (for GDDR there are 2 data cycles per source/command clock).

The parameter is passed to the legacy Memory Controller but not used there; a constant is used to define the GDDR data burst.

It is also passed to Memory Controller V2; its actual use there still has to be checked.

Valid values for GDDR2 are 4 or 8. Later GDDR3 and GDDR4 specs only allow 8.

Format

Integer

MaxConsecutiveReads

Description

Defines the maximum number of consecutive read requests that are issued to GDDR before write requests are forced to issue (if any are queued).

Applies to both the legacy Memory Controller and V2. In Memory Controller V2 it may only apply to configurations with a scheduler implementing separated queues for read and write requests.

Recommended value for performance is still under discussion. This includes discussion about implementing a dynamic algorithm in Memory Controller V2. The usual value we have been using is 16.

Format

Integer

MaxConsecutiveWrites

Description

Defines the maximum number of consecutive write requests that are issued to GDDR before read requests are forced to issue (if any are queued).

Applies to both the legacy Memory Controller and V2. In Memory Controller V2 it may only apply to configurations with a scheduler implementing separated queues for read and write requests.

Recommended value for performance is still under discussion. This includes discussion about implementing a dynamic algorithm in Memory Controller V2. The usual value we have been using is 16.

Format

Integer
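
The values we have typically been using:

MaxConsecutiveReads = 16
MaxConsecutiveWrites = 16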

CommandProcessorBusWidth

Description

Defines the bandwidth in bytes per cycle (GPU/main clock) of the connection between the Memory Controller and the Command Processor.

The bandwidth defined for this bus actually also limits the bandwidth from system to GPU (AGP or PCIe bus) for data writes or reads (reads from GPU to system are not currently implemented). In this respect the Command Processor could be considered to be acting as a simple DMA controller. By itself the Command Processor only consumes bandwidth when reading shader programs from memory and loading them into the shader instruction memory at the shader processors.

Normal values for this parameter are 8 or 16 bytes per cycle to set the limit to something resembling the AGP/PCIe bus. Take into account that the bandwidth is at the GPU or main clock frequency. The value may need to be revised to take into account changes in the PCIe spec and different GPU/main clock frequencies.

Format

Integer

StreamerFetchBusWidth

Description

Defines the bandwidth in bytes per cycle (GPU/main clock) between the Streamer Fetch unit (tasked with reading vertex indices from memory) and the Memory Controller.

The usual value for the parameter is the maximum available bandwidth per cycle (64 until recently) as we aren't interested in simulating interconnection network limitations. As this unit usually consumes very little bandwidth, smaller values may not affect performance.

Format

Integer

StreamerLoaderBusWidth

Description

Defines the bandwidth in bytes per cycle (GPU/main clock) between a Streamer Loader and the Memory Controller. Notice that the current implementation guarantees a dedicated bus between each Streamer Loader instance and the Memory Controller so the aggregated bandwidth may be much larger.

The usual value for the parameter is the maximum available bandwidth per cycle (64 until recently) as we aren't interested in simulating interconnection network limitations. When a single Streamer Loader is defined, maxing the bandwidth is advisable to prevent performance degradation in vertex fetch limited cases.

Format

Integer

ZStencilBusWidth

Description

Defines the bandwidth in bytes per cycle (GPU/main clock) between a Z Stencil Test unit (ROPZ) and the Memory Controller. Notice that the current implementation guarantees a dedicated bus between each ROPZ instance and the Memory Controller so the aggregated bandwidth may be much larger.

The usual value for the parameter is the maximum available bandwidth per cycle (64 until recently) as we aren't interested in simulating interconnection network limitations. The current recommendation is to always use the maximum available bandwidth as ROPZ is a large bandwidth consumer.

Format

Integer

ColorWriteBusWidth

Description

Defines the bandwidth in bytes per cycle (GPU/main clock) between a Color Write unit (ROPC) and the Memory Controller. Notice that the current implementation guarantees a dedicated bus between each ROPC instance and the Memory Controller so the aggregated bandwidth may be much larger.

The usual value for the parameter is the maximum available bandwidth per cycle (64 until recently) as we aren't interested in simulating interconnection network limitations. The current recommendation is to always use the maximum available bandwidth as ROPC is a large bandwidth consumer.

Format

Integer

DACBusWidth

Description

Defines the bandwidth in bytes per cycle (GPU/main clock) between the DAC and the Memory Controller.

The usual value for the parameter is the maximum available bandwidth per cycle (64 until recently) as we aren't interested in simulating interconnection network limitations. As we mostly ignore real screen refresh bandwidth consumption and the DAC is only used to dump the framebuffer for verification purposes, we want to use the maximum bandwidth to reduce the number of cycles spent on the process.


Format

Integer

TextureUnitBusWidth

Description

Defines the bandwidth in bytes per cycle (GPU/main clock) between a Texture Unit and the Memory Controller. Notice that the current implementation guarantees a dedicated bus between each Texture Unit instance and the Memory Controller so the aggregated bandwidth may be much larger.

The usual value for the parameter is the maximum available bandwidth per cycle (64 until recently) as we aren't interested in simulating interconnection network limitations. The current recommendation is to always use the maximum available bandwidth as the Texture Unit is a large bandwidth consumer.


Format

Integer
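
Putting the bus width parameters together, a typical setup caps the Command Processor bus at AGP/PCIe-like rates and leaves the rest at the maximum transaction size, following the recommendations above:

CommandProcessorBusWidth = 16
StreamerFetchBusWidth = 64
StreamerLoaderBusWidth = 64
ZStencilBusWidth = 64
ColorWriteBusWidth = 64
DACBusWidth = 64
TextureUnitBusWidth = 64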

ReadBufferLines

Description

Defines the number of lines in a buffer used to hold data already read from memory and pending to be issued (served) to the requesting GPU unit. A line corresponds with the maximum transaction size (currently defined as a constant with value 64 bytes).

The recommendation for this parameter is 128 or 256 lines as it limits the number of read requests that can be issued to memory (lines are reserved before issuing the request to memory since there is no other back pressure mechanism). The parameter affects the backpressure mechanism between the GPU units and the Memory Controller. Due to the backpressure implementation at least 8 to 16 lines (depending on the number of GPU units attached to the Memory Controller) may be required, and it must be taken into account that those entries will remain unused.


Format

Integer

WriteBufferLines

Description

Defines the number of lines in a buffer used to hold data that is pending to be written to memory. A line corresponds with the maximum transaction size (currently defined as a constant with value 64 bytes).

The recommendation for this parameter is 128 or 256 lines as it limits the number of write transactions that can be pending in the Memory Controller. The parameter affects the backpressure mechanism between the GPU units and the Memory Controller. Due to the backpressure implementation at least 8 to 16 lines (depending on the number of GPU units attached to the Memory Controller) may be required, and it must be taken into account that those entries will remain unused.


Format

Integer

RequestQueueSize

Description

Defines the number of requests (transactions) that can be stored in the queue for memory transactions pending to be processed.

(To be checked: whether in Memory Controller V2 this parameter affects transactions before or after splitting.)

The recommendation for this parameter is 256 or 512 transactions. It must be noted that due to the backpressure mechanism between the Memory Controller and the GPU units at least 8 to 32 entries (depending on the number of units attached to the Memory Controller) are required for the backpressure mechanism to actually work; those 8-32 entries will never be filled because of the mechanism. This parameter limits the number of pending transactions and small values have been shown to have a noticeable effect on performance.


Format

Integer

ServiceQueueSize

Description

Defines the number of transactions that can be pending to be serviced to a GPU unit. The Memory Controller implements a queue per type of GPU unit (Streamer Fetch, Streamer Loader, ROPZ, ROPC, Texture Unit, Command Processor, DAC).

Recommendation for this parameter is a value of 32 or 64 entries.

Format

Integer

Legacy Memory Controller parameters

MemoryClockMultiplier

Description

Unused. Will be eventually removed.

Format

Integer

MemoryFrequency

Description

Unused. Will be eventually removed.

Format

Integer

MemoryBusWidth

Description

Unused. May be used to replace some constants in the legacy Memory Controller that define the burst size and cycles per burst.


Format

Integer

MemoryBuses

Description

Defines the number of buses or channels to GPU memory devices in the legacy Memory Controller.

The usual values for this parameter are 1, 2, 4 or 8 buses/channels (64-bit, 128-bit, 256-bit or 512-bit memory interfaces). At some point other values were tested; non-power-of-two values are allowed due to the loose rules implemented in the legacy Memory Controller.

Format

Integer

SharedBanks

Description

In the legacy Memory Controller this parameter allows any memory page to be accessed through any of the memory buses/channels, as if there were a single memory device with multiple ports.

The parameter exists due to a legacy implementation of the legacy Memory Controller. It could be useful to simulate some kind of ideal memory.

The parameter has not been set to TRUE in years and there is not much point in doing so. The legacy Memory Controller is currently only useful for the simulator debug mode.


Format

Boolean (TRUE/FALSE)

BankGranurality

Description

The parameter actually defines how the memory devices (buses/channels, banks and pages) are mapped to linear GPU memory addresses. The parameter defines the bank interleaving, and implicitly the bus/channel interleaving, in the legacy Memory Controller. The defined value is the interleaving in bytes for banks of the different bus channels. The bus/channel interleaving is applied after the bank interleaving:

aaaa aaaa ccbb bbbb

Where aaaaaaaa would be extra address bits, cc would define the bus/channel being accessed and bbbbbb would address data inside a bank in the defined bus/channel.

The usual value for this parameter is 1024 bytes. Only power of two values are likely to work correctly. The value must be smaller than the value defined for MemoryPageSize.


Format

Integer
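
A worked example of the mapping under the diagram above (the values are just illustrative; MemoryPageSize is included only to show the required ordering with BankGranurality):

MemoryBuses = 4
BankGranurality = 1024
MemoryPageSize = 4096

With these values, the low 10 address bits (1024 bytes) address data inside a bank on the selected bus/channel, the next 2 bits select one of the 4 buses/channels, and the remaining bits correspond to the extra address bits ('aaaa aaaa') in the diagram.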

ReadLatency

Description

Defines the latency in cycles of a read request (request to data available) to the memory device (aka CAS latency) in the legacy memory controller.

Use values from GDDR2/GDDR3/GDDR4 device specifications.

Format

Integer

WriteLatency

Description

Defines the latency in cycles of a write request to a memory device, from request to first data in (aka write latency, WL), in the legacy Memory Controller.

Use values from GDDR2/GDDR3/GDDR4 device specifications.

Format

Integer

WriteToReadLatency

Description

Defines the penalty in cycles between issuing a write and a subsequent read request to a memory device (aka tWTR) in the legacy Memory Controller. For the read-to-write penalty the read latency value is used.

Use values from GDDR2/GDDR3/GDDR4 device specifications.

Format

Integer

MemoryPageSize

Description

Defines the size in bytes of a page in a memory device (aka row size) in the legacy Memory Controller.

The usual values are 4096 or 8192 bytes. Non power of two values shouldn't work.

Format

Integer

OpenPages

Description

The parameter actually defines the number of banks per memory device (bus/channel) in the legacy Memory Controller.

Normal values are 4 or 8 banks. Use values from GDDR2/GDDR3/GDDR4 device specifications.

Format

Integer

PageOpenLatency

Description

Defines the latency in cycles of 'opening' a page in one of the banks of the memory device (bus/channel) in the legacy Memory Controller. This parameter is the number of cycles between the request to open the new page and the moment the first request can be issued to that page. In a real GDDR model (Memory Controller V2) this corresponds to precharge, ACT and RAS and their associated latencies.

Use values derived from GDDR2/GDDR3/GDDR4 device specifications.

Format

Integer

New Memory Controller parameters (aka MCV2)

MemoryControllerV2

Description

The parameter is used to select between the legacy Memory Controller (FALSE) and the detailed GDDR based Memory Controller V2.

Always use MemoryControllerV2 when simulating to obtain realistic performance numbers. The legacy Memory Controller is only required for some debug mode features.

Format

Boolean (TRUE/FALSE)

V2MemoryChannels

Description

Defines the number of memory channels in the GPU memory controller V2. Each channel has an independent memory scheduler and GDDRX chip. The interface of the GDDRX is fixed to 32-bit.

The usual values for this parameter are 1, 2, 4, 8 or 16 channels (32 and more channels are supported also). It is possible to use non power of two values for this parameter but with some limitations (see V2SplitterType).

Although the data-pin interface of the GDDRX chips is fixed to 32-bit, it is easy to simulate configurations with 64-bit per channel, 128-bit and so on. For example, 64-bit per channel interfaces are implemented in real hardware by attaching two 32-bit chips to a single channel (the two chips receive exactly the same stream of DDR commands). In our model this can be achieved by doubling the data rate of the chip (see V2BurstBytesPerCycle) and doubling the row capacity (see V2MemoryRowSize).

Format

Integer

V2BanksPerMemoryChannel

Description

Defines the number of banks contained in a single GDDRX chip. The number of banks per chip determines the maximum number of open pages per chip (each bank can have only one page open).

The usual value for this parameter is 8 (a typical value in GDDR3/4 memories).

Format

Integer

V2MemoryRowSize

Description

Defines the size in bytes of the rows (aka pages) in the GDDR3/4 memory banks.

The usual value is 2048 (the usual value in GDDR3/4 memories). This value can be set to a multiple of 2048 if we are interested in simulating interfaces wider than 32-bit per channel (see also V2BurstBytesPerCycle).

Format

Integer (multiples of 2048)

V2BurstBytesPerCycle

Description

Defines the number of bytes transmitted per cycle by each memory channel.

The usual value is 8 (for DDR memories).

Format

Integer
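
Following the recipe above for wider channels, a sketch of a 4-channel configuration where each channel behaves as a 64-bit interface (two 32-bit chips per channel) doubles both the burst bytes per cycle and the row size; the channel and bank counts are otherwise illustrative:

MemoryControllerV2 = TRUE
V2MemoryChannels = 4
V2BanksPerMemoryChannel = 8
V2MemoryRowSize = 4096
V2BurstBytesPerCycle = 16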

V2SplitterType

This parameter allows selecting between the two memory transaction splitters/distributors available in the Memory Controller.

Both splitters are compatible with the parameter V2SecondInterleaving and depend on the parameters V2MemoryChannels and V2BanksPerMemoryChannel.

Both splitters/distributors initially split each received memory transaction into what we call channel transactions; channel transactions are the work unit processed by the channel schedulers. After this common splitting/distributing process, each splitter distributes channel transactions using its own scheme, described below.

V2SplitterType = 0

This memory transaction splitter/distributor selects the destination channel and bank using the parameters V2ChannelInterleaving and V2BankInterleaving (and V2SecondChannelInterleaving and V2SecondBankInterleaving if V2SecondInterleaving is set to TRUE). The process to obtain the target channel and bank is the following:

The parameter V2ChannelInterleaving is used to extract the target channel; once the channel bits have been extracted, the bank bits are extracted using the parameter V2BankInterleaving. Note that the bits are truly extracted, so the displacement of the bank bits has to be taken into account after extracting the channel bits (this situation occurs when V2BankInterleaving multiplied by V2BanksPerMemoryChannel is equal to or greater than V2ChannelInterleaving).

Example:
V2SecondInterleaving = FALSE
V2MemoryChannels = 8
V2BanksPerMemoryChannel = 8
V2ChannelInterleaving = 512
V2BankInterleaving = 128
The binary address is decoded as follows:
  • Address in binary format = X.X.X.X.X.X.X.X.X.X.X.X.X.X.X.B2.C2.C1.C0.B1.B0.X.X.X.X.X.X.X
  • Channel bits are extracted, channel selected using: C2.C1.C0
  • Remaining bits: X.X.X.X.X.X.X.X.X.X.X.X.X.X.X.B2.B1.B0.X.X.X.X.X.X.X
  • Bank bits are extracted, bank selected using: B2.B1.B0
  • The remaining bits are used to select the row and the start column.
V2SplitterType = 1

This memory transaction splitter selects the destination channel and bank using the parameters V2ChannelInterleavingMask and V2BankInterleavingMask (and V2SecondChannelInterleavingMask and V2SecondBankInterleavingMask if V2SecondInterleaving is set to TRUE). The process to obtain the destination channel and bank is based on a string mask specifying which bits compose the channel and the bank (note that the bank bit displacement is now avoided); with this splitter it is also possible to select arbitrary (not consecutive) bits to compose the channel and bank bits.

The string mask format is a list of integers representing bit positions. Examples of valid strings are: "10 9 8", "8 9 10", "12 14 6", etc. Note that the first and the second examples are not equivalent, since the order is taken into account. If bits 10, 9 and 8 are respectively 110 the first mask string will produce the value 6 (110) and the second mask string will produce 3 (011).

Example:
V2SecondInterleaving = FALSE
V2MemoryChannels = 8
V2BanksPerMemoryChannel = 8
V2ChannelInterleavingMask = "10 9 8"
V2BankInterleavingMask = "14 12 11"
The binary address is decoded as follows:
  • Address in binary format = X.X.X.X.X.X.X.X.X.X.X.X.X.B0.X.B1.B2.C2.C1.C0.X.X.X.X.X.X.X.X
  • Channel bits used to select the channel: C2.C1.C0
  • Bank bits used to select the bank: B2.B1.B0
  • Extract channel and bank bits, the remaining bits are used to select the row and the start column.

V2ChannelInterleaving

Description

Defines how the linear memory is assigned/interleaved among the available physical memory channels. This assignment/interleaving is expressed in bytes. This parameter is only used when V2SplitterType=0.

Usual values are 256, 512, 1024 and 2048

Format

Integer

V2BankInterleaving

Description

Defines how the memory handled by each channel is arranged among its banks. Note that channel interleaving is first applied and then the bank interleaving is applied to select the corresponding bank within the channel. This parameter is only used when V2SplitterType=0.

Usual values are 256, 512, 1024, 2048 and 4096

Format

Integer


V2ChannelInterleavingMask

Description

Defines how the linear memory is assigned/interleaved among the available physical memory channels. This assignment/interleaving is expressed using a bit mask. This parameter is only used when V2SplitterType=1.

Usual values are "10 9 8", "11 10 9", "12 11 10" and "13 12 11" (these values are equivalent to V2ChannelInterleaving = 256, 512, 1024 and 2048 using V2SplitterType=0)

Format

String

V2BankInterleavingMask

Description

Defines how the memory handled by each channel is arranged among its banks. This parameter is only used when V2SplitterType=1.

Usual values are "12 11 10", "14 10 9", "14 13 12" and so on

It is mandatory that the channel and bank masks are disjoint:

V2ChannelInterleavingMask="12 11 10" and V2BankInterleavingMask="14 13 9" is CORRECT
V2ChannelInterleavingMask="12 11 10" and V2BankInterleavingMask="14 13 12" is NOT CORRECT (bit 12 is used in both masks)

Format

String


V2SecondInterleaving

Description

This parameter enables a second interleaving space. The linear memory is thus split into two disjoint segments: the first segment, from address 0 to N-1, uses the first interleaving (V2ChannelInterleaving/V2ChannelInterleavingMask and V2BankInterleaving/V2BankInterleavingMask) and the second segment, from address N to MAX_MEMORY_ADDRESS, uses the second interleaving defined by V2SecondChannelInterleaving/V2SecondChannelInterleavingMask and V2SecondBankInterleaving/V2SecondBankInterleavingMask.

The value of N is contained in the Memory Controller register MCV2_2ND_INTERLEAVING_START_ADDR. The current driver implementation sets this value to map the Color/Z buffers into the first address segment and the rest (texture data, vertex data and so on) into the second segment.

Format

Boolean (TRUE/FALSE)
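
A sketch combining both interleaving spaces with splitter type 0 (the interleaving values are illustrative choices from the usual ranges listed above; the address N where the second space starts is set by the driver through MCV2_2ND_INTERLEAVING_START_ADDR):

V2SplitterType = 0
V2SecondInterleaving = TRUE
V2ChannelInterleaving = 512
V2BankInterleaving = 1024
V2SecondChannelInterleaving = 2048
V2SecondBankInterleaving = 4096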

V2SecondChannelInterleaving

Analogous to 'V2ChannelInterleaving' for the second interleaving

V2SecondBankInterleaving

Analogous to 'V2BankInterleaving' for the second interleaving

V2SecondChannelInterleavingMask

Analogous to 'V2ChannelInterleavingMask' for the second interleaving

V2SecondBankInterleavingMask

Analogous to 'V2BankInterleavingMask' for the second interleaving

V2BankSelectionPolicy

V2ChannelScheduler

V2PagePolicy

V2MaxChannelTransactions

V2MemoryTrace

V2DisableActiveManager

V2DisablePrechargeManager

V2ManagerSelectionAlgorithm

V2MemoryType

V2GDDR_Profile

V2GDDR_tRRD

V2GDDR_tRCD

V2GDDR_tWTR

V2GDDR_tRTW

V2GDDR_tWR

V2GDDR_tRP

V2GDDR_CAS

V2GDDR_WL

STREAMER section

Vertex Cache and Streamer Commit parameters

IndicesCycle

Description

Defines the number of vertex indices that are read/generated and processed per cycle in the Streamer.

The parameter limits the maximum vertex and triangle throughput of the geometry pipeline (even if the draw call is not indexed!).


Format

Integer

IndexBufferSize

Description

Defines the size in bytes of the buffer in Streamer Fetch to store index data read from memory.

The parameter should be set to a multiple of the memory transaction size. The size doesn't need to be very large, just enough to hide some of the latency of the memory requests in the rare case that indices have to be generated at a fast rate. A normal tested value would be 2 or 4 KBytes.

Format

Integer

OutputFIFOSize

Description

Defines the size of the vertex reorder queue in the Streamer Commit, used to keep the order of shaded vertices when the shader processors may shade vertices out of order. The parameter defines one of the limits of the post-shading vertex cache.

The parameter limits the number of vertices that can be processed at the same time in the shader processors. It is associated with the OutputMemorySize parameter.

In the unified shader model it may be useful to have hundreds of vertices in the shader processors at the same time, so the recommendation is to set this parameter to a high number. The value assigned to the parameter may be higher than the value assigned to OutputMemorySize as reuse of the shaded vertices (multiple instances) is very likely, at rates as high as 2:1 or 3:1. Some tested values have been 512 or 768.

Format

Integer

OutputMemorySize

Description

Defines the size in vertices (a vertex can be associated with up to 16 128-bit attributes!) of the post-shading vertex cache. In the current implementation this corresponds to the storage memory for vertices pending shading, and each position is linked to the reorder queue (whose size is defined by the OutputFIFOSize parameter). As no backpressure mechanism is implemented with the shader processors or the Shader Work Distributor (FragmentFIFO), entries in this memory are reserved when the vertex is sent to the shader processors or the Shader Work Distributor. Therefore the size of the memory limits the number of vertices that can be in execution on the shader processors at any time.

As it limits the number of vertices being shaded, a large value is interesting for the unified shader model. The value can be smaller than the one defined for the reorder queue (OutputFIFOSize parameter) as some index/vertex reuse is to be expected (different instances of the same index/vertex that hit the post-shading vertex cache will share the same entry in the memory). Tested values range from 512 to 768.

Format

Integer

VerticesCycle

Description

Defines the number of vertices that are processed in the Streamer per cycle. In the current implementation this parameter only affects the throughput from Streamer Commit to Primitive Assembly.

This parameter limits the triangle throughput of the geometry pipeline.

Format

Integer

AttributesSentCycle

Description

Defines the number of vertex attributes per vertex per cycle that can be transmitted from Streamer Commit to Primitive Assembly. Keep in mind that this value is per vertex processed/issued, so the actual bandwidth between the Streamer and Primitive Assembly is the maximum attribute size (128 bits) multiplied by the value of this parameter and by the value of the VerticesCycle parameter.

If you don't want to be limited by the communication between the Streamer and Primitive Assembly set this to some large number. The usual value is 4 attributes per cycle.

Format

Integer
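
A sketch of the Streamer front end; OutputFIFOSize, OutputMemorySize, IndexBufferSize and AttributesSentCycle follow the tested values quoted above, while IndicesCycle and VerticesCycle are illustrative assumptions:

IndicesCycle = 2
IndexBufferSize = 4096
OutputFIFOSize = 768
OutputMemorySize = 512
VerticesCycle = 2
AttributesSentCycle = 4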

Streamer Loader parameters

StreamerLoaderUnits

Description

The current implementation of the Streamer accepts multiple instances of the Streamer Loader unit to maximize vertex/attribute output to the shader processors. This parameter defines the number of instances of the Streamer Loader unit.

Unless the modeled architecture targets heavy triangle throughput, the usual values would be just 1 or 2 instances. Even if the architecture tries to maximize triangle throughput, a relatively small number of instances (4 to 8) will likely saturate the available bandwidth from GPU memory, so setting this parameter to a high number is not useful.

Format

Integer

SLIndicesCycle

Description

Defines the number of indices/vertices per cycle that can be processed by an instance of the Streamer Loader unit.

When StreamerLoaderUnits is set to 1 the value of this parameter should match the value of the IndicesCycle parameter.

(To be checked: what the requirements are when StreamerLoaderUnits is higher than 1.)

Format

Integer

SLInputRequestQueueSize

Description

Defines the size of the structure (FIFO) that stores information for requests to memory. Each entry in the structure is associated with an index/vertex and holds information for each of the attributes defined for the vertex. So the actual number of memory transactions that can be tracked is as high as the defined value multiplied by the attributes per vertex, times two (for accesses split across two cache lines).

Set this parameter to a relatively high number to hide memory access latency. A usual value is 128 entries.

Format

Integer

SLAttributesCycle

Description

Defines the number of vertex attributes that are processed per cycle by a Streamer Loader unit. Processing includes default value generation, address generation and cache access.

The parameter affects the bandwidth between the Streamer Loader unit and its associated Input Cache. A 'port' with the Input Cache is defined for each attribute that can be processed per cycle. Each 'port' allows an independent access (set, line, line offset) to the Input Cache.

Realistically the value of this parameter shouldn't be very high, but if the purpose is to saturate the memory subsystem as much as possible with just one or two Streamer Loader units it can be set to a high value. Usual values are 4 or 8 attributes per cycle.

Format

Integer

SLInputCacheLines

Description

Defines the number of cache lines in the Input Cache associated with a Streamer Loader unit.

The Input Cache is fully associative so the parameter defines the actual number of lines in the cache.

The Input Cache doesn't need to be very large (4 KB - 8 KB). A usual value is 32 lines of 256 bytes each.

Format

Integer

SLInputCacheLineSize

Description

Defines the bytes per cache line in the Input Cache associated with a Streamer Loader unit.

Together with the SLInputCacheLines parameter it defines the actual size of the Input Cache.

Any power-of-two value is allowed for this cache. However, sizes smaller than a memory transaction (defined as 64 bytes in the current implementation) are not recommended for obvious reasons. The usual value we are using right now is 256 bytes. However, it's likely we would want to reduce this to a more reasonable 64 or 128 bytes per line.

Format

Integer

SLInputCachePortWidth

Description

Defines the width in bytes of a read 'port' between the Input Cache and the associated Streamer Loader unit. The actual bandwidth between the Input Cache and the Streamer Loader unit can be obtained by multiplying the value of the SLAttributesCycle parameter (which defines the number of 'ports' to the cache) by the value of this parameter.

A usual value is 16 bytes as that's the maximum size of a vertex attribute.

Format

Integer

SLInputCacheRequestQueueSize

Description

The Input Cache uses the Fetch Cache, and the Fetch Cache implements a structure (FIFO) that stores the information about pending cache line fill and spill requests (there are no spills in this case, as the Input Cache is read only).

The value of this parameter limits the number of pending memory transactions that the Input Cache can support; the number of pending memory transactions per request is obtained by dividing the cache line size by the memory transaction size. Small values will limit the latency hiding capabilities of the Input Cache and Streamer Loader unit. Currently we are using a value of 32 requests.

Format

Integer

SLInputCacheInputQueueSize

Description

Due to the specific implementation there is an additional structure (FIFO) inside the Input Cache that tracks the state of the cache line fill/spill requests. This parameter defines the size of this structure.

Like the SLInputCacheRequestQueueSize parameter, this parameter limits the number of pending memory transactions that the Input Cache can generate and thus how much memory latency the Input Cache and Streamer Loader unit can hide. The actual number of possible pending memory transactions is the minimum of both parameters so they are usually set to the same value. The current value we are using is 32 entries/requests.

Format

Integer
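
A Streamer Loader sketch with a single unit, following the usual values given above (SLIndicesCycle is set equal to the IndicesCycle value assumed earlier, as required for a single unit):

StreamerLoaderUnits = 1
SLIndicesCycle = 2
SLInputRequestQueueSize = 128
SLAttributesCycle = 4
SLInputCacheLines = 32
SLInputCacheLineSize = 256
SLInputCachePortWidth = 16
SLInputCacheRequestQueueSize = 32
SLInputCacheInputQueueSize = 32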

VERTEXSHADER section

The parameters in this section are only used by the legacy non-unified shader model.

The legacy non-unified shader model has not been tested in years so it may not even work.


ExecutableThreads (VSH)

Description

Number of executable 'threads' in the vertex shader processor. A thread in this case corresponds with a vertex element.

The number assigned to this parameter must be a multiple of the assigned to the ThreadGroup parameter.

As this parameter is for the old-style special-purpose vertex shader, which didn't even implement texture loads in most cases, the value doesn't need to be very large: just enough to hide the ALU pipeline, data dependencies and vertex in/out traffic with the Streamer. For example, as a vestige of the old days, the value in our current configuration files is 12. Given that the maximum latency is around 9, that would cover even the worst dependencies and a couple of threads being frozen for input/output from/to the Streamer.

Format

Integer

InputBuffers (VSH)

Description

For some reason, in the first implementation of the shader processor model there was a distinction between storage for shader elements (vertices or fragments) being received from the producer unit (Streamer or FragmentFIFO) and storage and state for actual runnable threads for shader elements. This parameter represents how much storage is used for those elements that are being loaded but cannot execute until a free runnable thread is available.

The number assigned to this parameter must be a multiple of the defined for the ThreadGroup parameter.

Large values are not required. For the vertex shaders any value is fine. The value in our current configuration files is 4.

Format

Integer

ThreadResources (VSH)

Description

In the current implementation, resources mean registers used to store temporary data for the shader elements. At some point a resource represented two registers allocated as a pair, but I think that was changed to represent a single register (128 bits). The actual meaning is mainly controlled by the API/Driver implementation, which is the one that decides how many resources a shader program requires to execute. The shader processor logic also takes into account the number of input or output attributes defined with the shader program to compute the resources to reserve per shader element.

The value represents the total number of resources for each instance of the shader processor. Each shader thread (actually an element) has to reserve the required amount of resources before it can start executing.

The value limits how many shader threads/elements can be in execution at any point and depends on the loaded shader program.

The value assigned to this parameter must be equal or greater than the value assigned to the ExecutableThreads parameter. At least a resource per thread is required or the actual number of executable threads would never be reached.

In this case, as vertex programs are relatively large and require a large number of input and output attributes and temporary registers, a high number of resources relative to the small number of threads is desired so that all the threads will have available resources. In our current configuration files the value is 128.

Format

Integer

ThreadRate (VSH)

Description

Defines how many shader threads (really elements) are executed per cycle. In this case executed means: fetching an instruction for the shader thread, decoding, executing and committing the instruction. All the shader threads/elements may execute the same instruction in lock step or execute the program independently depending on the value of the LockedExecutionMode parameter.

The value assigned to this parameter must be equal or greater than 1.

For the vertex shader the normal implementation is that a single thread/element can be executed. That's the value in our current configuration files.

Format

Integer

FetchRate (VSH)

Description

Defines how many instructions are fetched and then decoded, executed and committed per thread/element per cycle. It actually defines the ALU architecture: SIMD4, 2xSIMD4, 3xSIMD4, ... or, as a special case, SIMD4+scalar.

A value greater than 1 represents a superscalar implementation. In the current implementation, and for the vertex shader using the legacy shader model, the instruction and associated ALU operate on SIMD4 data. An alternative architecture that combines a SIMD4 ALU/instruction with a scalar ALU/instruction per cycle is enabled using the ScalarALU parameter.

The value assigned to this parameter must be equal or greater than 1.

For the vertex shader the usual value would be 2, with ScalarALU set to TRUE to implement SIMD4+scalar.

Format

Integer

ThreadGroup (VSH)

Description

Defines how many threads/elements are processed as a single group, ganged or working as a team. Thread groups only have a meaning when the LockedExecutionMode parameter is set to TRUE and all the threads/elements in the group execute the same instructions in lock-step mode, which basically simulates a kind of vector architecture with a single thread state (PC) associated with a group of elements/threads.

The value of this parameter must be equal or greater than 1.

For the vertex shader lock-step execution mode is not desired and therefore the value of this parameter is usually 1.

Format

Integer

LockedExecutionMode (VSH)

Description

In the old shader model this parameter defines if shader threads/elements in a group are executed in lock step. All the threads/elements in the group execute the same instruction(s) using a SIMD execution model and share the same thread information (PC, state).

However, for the legacy vertex shader the usual configuration is a single thread, and even when multiple threads are supported a MIMD execution model would be reasonable. In our current configuration files the value of this parameter is FALSE.

Format

Boolean (TRUE/FALSE)

ScalarALU (VSH)

Description

Defines, in the old shader model, whether the ALU configuration is SIMD4+scalar. The scalar ALU can be used for scalar instructions and for vector instructions (not dot products) with a single result component.

The FetchRate must be configured to 2 to enable this option.


Format

Boolean (TRUE/FALSE)

ThreadWindow (VSH)

Description

Defines the method implemented to select the next thread to execute. If the parameter is enabled, a Thread Window selects the next ready thread from the pool of currently executing threads (round-robin priority if multiple threads are ready). If the parameter is disabled, a thread queue is implemented and only the head of the queue is selectable for execution. If the thread at the head is not ready, no instructions are fetched that cycle.

The usual configuration for the vertex shader enables the Thread Window. The number of threads implemented in the vertex shader is small so the cost is reduced.

Format

Boolean (TRUE/FALSE)

FetchDelay (VSH)

Description

Defines the minimum number of cycles between instruction fetches for a group of threads/elements. This parameter is meaningful when the thread group is large and requires multiple cycles to fully execute in the shader ALUs (vector length > number of vector ALUs).

In the legacy vertex shader this parameter should always be 0 as the usual configuration has no thread group.

Format

Integer

SwapOnBlock (VSH)

Description

Defines the event that triggers the switch from the currently executing thread group to the next thread group to execute. If the parameter is set to FALSE, each fetch cycle (fetch cycles may happen every cycle or every N cycles) a new thread group is selected for execution in round-robin order. Only if there is a single ready thread group will the same thread group fetch instructions in consecutive fetch cycles. When the parameter is set to TRUE a new thread group will be selected only when the executing thread group is blocked, either because of a texture operation or because the end of the shader program was reached.

For the legacy vertex shader this parameter is not really useful as the legacy vertex shader doesn't support texture instructions.

Format

Boolean (TRUE/FALSE)

InputsPerCycle (VSH)

Description

Defines how many shader elements (vertices) can be received from the producer unit per cycle. In the case of the Vertex Shader the producers are the Streamer Loader units.

The value assigned to this parameter must be equal or greater than 1.

The value of this parameter depends on the capacity of the Vertex Shaders and the Streamer Loader to produce and process vertices. As the configured number of vertex shaders is usually higher than the capacity of the Streamer Loader, the normal value is one input per cycle.

Format

Integer

OutputsPerCycle (VSH)

Description

Defines how many shader elements (vertices) can be sent per cycle to the consumer units (Streamer Commit for the legacy vertex shader).

The value assigned to this parameter must be equal or greater than 1.

The value of this parameter is affected by the processing capacity of the configured vertex shaders and the consuming capacity of the Streamer Commit unit. The combined capacity of the configured vertex shaders is usually larger than the capacity of the Streamer Commit unit, so the normal value for a single vertex shader is one output per cycle.

Format

Integer

OutputLatency (VSH)

Description

Defines the number of cycles required for a shader element (vertex) to reach the consumer unit from the shader processor. It simulates a delay due to the location of the shader processor and of the geometry pipeline on the die.

For a legacy vertex shader, given that the number of cycles a shader element can spend executing can be quite large, the performance effect of this parameter is unclear.

The actual implementation is a bit more complex. The parameter defines the maximum latency of the output signal from the vertex shader to the Streamer Commit unit. The actual latency of a shader element sent through the signal depends on the number of output attributes for the element and a couple of constants. The maximum latency is only reached when all 16 vertex output attributes are enabled.

The value assigned to this parameter must be equal or greater than 1.

In our current configurations the value used is 11 cycles. It's not advisable to change this number in the current implementation.

Format

Integer
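
As an illustration, a legacy vertex shader configuration using the values quoted above would look roughly as follows (remember this model has not been tested in years; the SwapOnBlock value is an assumption, as no value is given above):

ExecutableThreads = 12
InputBuffers = 4
ThreadResources = 128
ThreadRate = 1
FetchRate = 2
ThreadGroup = 1
LockedExecutionMode = FALSE
ScalarALU = TRUE
ThreadWindow = TRUE
FetchDelay = 0
SwapOnBlock = FALSE
InputsPerCycle = 1
OutputsPerCycle = 1
OutputLatency = 11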

PRIMITIVEASSEMBLY section

VerticesCycle

Description

Defines the number of shaded vertices that the Primitive Assembly stage can receive per cycle from Streamer Commit.

This parameter limits the throughput in vertices (and triangles) of the geometry pipeline.

A usual value for this parameter is 2 unless the architecture to simulate requires a higher triangle throughput.


Format

Integer

TrianglesCycle

Description

Defines the triangle output per cycle from the Primitive Assembly stage to the Clipper stage.

This parameter limits the triangle throughput of the geometry pipeline.

A normal value is 2 triangles per cycle but architectures with higher triangle throughput requirements may use higher values.


Format

Integer

InputBusLatency

Description

Defines the latency in cycles for vertices sent from Streamer Commit to the Primitive Assembly stage.

The purpose of the parameter was to define a delay due to bandwidth limitations between Streamer Commit and Primitive Assembly and the location of Primitive Assembly and Streamer Commit on the die. The actual implementation uses this value just as an on-die distance delay. The number of attributes per cycle defined for the Streamer unit increases the basic latency depending on the number of vertex output attributes defined.

In our current configuration files the value is set to 10 cycles.


Format

Integer

AssemblyQueueSize

Description

Defines the number of vertices that can be stored in the Primitive Assembly queue.

The number of entries in the queue must be at least 4 to support quad strips. The number of entries has to be larger than the vertex/triangle rate.

Our current configuration defines a value of 32 entries.

Format

Integer
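
The Primitive Assembly values from our current configuration files, as quoted above:

VerticesCycle = 2
TrianglesCycle = 2
InputBusLatency = 10
AssemblyQueueSize = 32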

CLIPPER section

TrianglesCycle

Description

Defines the number of triangles that the Clipper stage can receive, process and send per cycle.

The value of the parameter limits the throughput of the geometry pipeline.

The usual value is 2 triangles per cycle but higher values can be used for architectures with a requirement for high triangle throughput.


Format

Integer

ClipperUnits

Description

Defines the number of Triangle Clipping units implemented in the Clipper stage. Each unit can process one triangle.

In the current implementation the value for this parameter must match the value of the TrianglesCycle parameter.

The usual value is 2 units, but higher values are possible for architectures with higher triangle throughputs.

Format

Integer

StartLatency

Description

Defines the startup latency of the Clipping units in the Clipper stage, that is, the number of cycles between consecutive triangles issued to the same unit.

This parameter limits the throughput of the geometry pipeline.

A usual value for this parameter is 1 cycle.

Format

Integer

ExecLatency

Description

Defines the number of cycles a triangle must spend on a Clipping unit.

Our current configuration files set a value of 6 cycles.


Format

Integer

ClipBufferSize

Description

Defines the size of the buffer that holds triangles received from the Primitive Assembly stage and that are waiting to be processed in a Triangle Clipping unit.

In the current implementation the number of entries must be at least 3 times the triangle processing rate defined by the TrianglesCycle parameter.

In our current configuration files the value of this parameter is 32 entries.

Format

Integer
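
The Clipper values from our current configuration files (ClipperUnits must match TrianglesCycle, and ClipBufferSize must be at least 3 times TrianglesCycle):

TrianglesCycle = 2
ClipperUnits = 2
StartLatency = 1
ExecLatency = 6
ClipBufferSize = 32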

RASTERIZER Section

Triangle Setup parameters

TrianglesCycle

Description

Defines how many triangles per cycle can be received, processed and sent by the Triangle Setup unit.

This parameter limits the triangle throughput of the geometry pipeline.

A usual value for this parameter is 2 triangles per cycle. Higher values can be defined if the simulated architecture has a high triangle throughput requirement.

Format

Integer

SetupFIFOSize

Description

Defines the size of the buffer (FIFO) for triangles received from the Clipper stage and waiting to be processed by the Triangle Setup unit.

The value assigned to this parameter must be at least TrianglesCycle * (TriangleInputLatency + 1).

Our current configuration files use a value of 32 triangles.
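
The rule above can be checked with a quick back-of-the-envelope calculation; the sketch below is only illustrative (not simulator code) and assumes the usual values quoted in this page: TrianglesCycle = 2 and TriangleInputLatency = 2.

 # Illustrative minimum for SetupFIFOSize (values quoted in this page).
 triangles_cycle = 2          # TrianglesCycle (Triangle Setup throughput)
 triangle_input_latency = 2   # TriangleInputLatency (cycles from the Clipper)
 setup_fifo_size = 32         # value used in our current configuration files
 minimum = triangles_cycle * (triangle_input_latency + 1)   # 2 * 3 = 6
 assert setup_fifo_size >= minimum
 print(minimum)               # -> 6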

Format

Integer

SetupUnits

Description

Defines how many Triangle Setup units have been implemented. The parameter defines how many triangles can start processing each cycle.

This parameter limits the triangle throughput of the geometry pipeline.

The value of this parameter should be at least equal to the value defined for TrianglesCycle parameter.

A usual value is 2 units but higher numbers can be defined for simulated architectures with a high triangle throughput.


Format

Integer

SetupLatency

Description

Defines the number of cycles required to setup a triangle in one of the Triangle Setup units.

The usual value is 10 cycles.

Format

Integer

SetupStartLatency

Description

Defines the number of cycles between consecutive triangles issued to the same Triangle Setup unit.

This parameter limits the triangle throughput of the geometry pipeline. Higher values in this parameter can be compensated by increasing the number of Triangle Setup units.

Our current configuration files use a value of 4 cycles.


Format

Integer

TriangleInputLatency

Description

Defines the latency in cycles for triangles received from the Clipper stage.

In the actual implementation this is the latency in cycles of the triangle signal from the Clipper box.

The usual value for this parameter is 2 cycles.

Format

Integer

TriangleOutputLatency

Description

Defines the latency in cycles of triangles sent to the Triangle Traversal/Fragment Generation stage.

In the actual implementation this is the latency of the signal from the Triangle Setup box to the Triangle Traversal box.

The usual value for this parameter is 2 cycles.

Format

Integer

TriangleSetupOnShader

Description

Enables a hack for executing part of the triangle setup computations (setup matrix, matrix determinant, edge and z equations) on the shader processor. The shader program is executed as a special vertex program that receives the positions of three vertices as the input for a triangle.

This option can only be enabled when using the unified shader model.

We have not tested this hack for years so it may no longer work.

The usual value for this parameter is FALSE.

Format

Boolean (TRUE/FALSE)

TriangleShaderQueueSize

Description

Defines the size of the reorder buffer (FIFO) used to store triangles pending to be processed by the triangle setup shader program or that are currently being processed in the shaders.

The value of this parameter can not be zero if triangle setup on the shader is enabled.

The value assigned to this parameter in the current implementation limits the number of triangles that can be on the shader processors at any time.

Triangle setup on the shader has not been used in years so it may not work.

Our current configuration files use a value of 32 triangles.

Format

Integer

EmulatorStoredTriangles

Description

Defines the maximum number of triangles that the Rasterizer Emulator class can track as being processed through the Triangle Setup, Triangle Traversal and Interpolator boxes.

This parameter may limit the number of triangles being processed in the Triangle Setup, Triangle Traversal and Interpolator stages. The current implementation may just trigger a panic if the limit is reached.

Our current configuration files use a value of 64 triangles. As far as I know the limit has never been reached.

Format

Integer

Rasterization parameters

StampsPerCycle

Description

Defines how many quads (2x2 fragments) are generated, received, processed and sent per cycle at different stages of the fragment pipeline.

For Triangle Traversal and Hierarchical Z the parameter represents how many generation tiles (size of the generation tile defined by the GenWidth and GenHeight parameters) are generated, received and processed per cycle.

From Hierarchical Z to Fragment FIFO and the rest of fragment processing units (Interpolator, Z Stencil Test and Color Write) the parameter defines the total number of fragment quads that can be received/processed/sent per cycle. The total fragment quad throughput is evenly distributed between the ROP instances (number defined by the NumStampPipes parameter in the GPU section).

The value assigned to this parameter must be a multiple of the number of ROPs as defined by the NumStampPipes parameter in the GPU section.

This parameter limits the maximum fragment quad throughput through the pipeline.

The usual value for this parameter is 4, the same as the usual value for the NumStampPipes parameter. Normal configurations should match both parameters.

In future implementations this parameter may be removed or split into multiple parameters for different stages. Z Stencil Test and Color Write already have their own parameter, but right now it must match (taking into account the number of instances) the value assigned to this parameter.

Format

Integer

MSAASamplesCycle

Description

Defines how many MSAA samples per fragment can be generated or processed per cycle along the fragment pipeline. For example, if a value of 2 is assigned, each stage of the fragment pipeline can process 2 samples per cycle for every fragment it can process per cycle. If the number of samples associated with a fragment is higher than the value assigned to the parameter, the stage will loop over the same fragment and take as many cycles as required to fully process the fragment at the configured sample processing rate.

This parameter limits the fragment quad throughput of the fragment pipeline.

In the current implementation the sample processing limitation is only implemented in the Triangle Traversal stage. Z Stencil Test and Color Write are limited to process one sample per cycle.

Usual values for this parameter are 2 or 4 samples per fragment.
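
As a rough illustration of the looping behaviour described above (this is not simulator code), the number of cycles a stage would spend on a single fragment can be estimated as:

 import math
 # Illustrative: cycles per fragment when the per-cycle sample rate
 # (MSAASamplesCycle) is lower than the samples attached to the fragment.
 def cycles_per_fragment(samples_per_fragment, msaa_samples_cycle):
     return math.ceil(samples_per_fragment / msaa_samples_cycle)
 print(cycles_per_fragment(4, 2))   # 4x MSAA at 2 samples/cycle -> 2 cycles
 print(cycles_per_fragment(8, 4))   # 8x MSAA at 4 samples/cycle -> 2 cycles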

Format

Integer

OverScanWidth

Description

Defines the width of an over scan tile as a number of scan tiles.

The fragments in a frame are organized in a hierarchy of tiles:

fragment -> quad (2x2) -> gen tile -> scan tile -> over scan tile

In the current implementation the over scan tile is not associated with any architectural parameter (like memory row size) or used for work distribution.

In the current implementation only square tiles are likely to work (same width and height).

The usual value for this parameter is 4. Other values have not been tested so they may not work.

Format

Integer

OverScanHeight

Description

Defines the height of an over scan tile as a number of scan tiles.

The fragments in a frame are organized in a hierarchy of tiles:

fragment -> quad (2x2) -> gen tile -> scan tile -> over scan tile

In the current implementation the over scan tile is not associated with any architectural parameter (like memory row size) or used for work distribution.

In the current implementation only square tiles are likely to work (same width and height).

The usual value for this parameter is 4. Other values have not been tested so they may not work.

Format

Integer

ScanWidth

Description

Defines the width of a scan tile as a number of fragments.

The fragments in a frame are organized in a hierarchy of tiles:

fragment -> quad (2x2) -> gen tile -> scan tile -> over scan tile

The scan tile defines the size of the scan step for the scan based rasterization algorithm. The width of a scan tile must be a multiple of the width of a gen tile (defined by the GenWidth parameter).

The scan tile is the work unit used to distribute fragment quads between the ROP instances and the shader processors. Some implementations would also try to associate the memory footprint of a scan tile with a single memory channel (or group of memory channels) to improve memory access locality. Note that in the current implementation the footprint will increase when multiple samples are supported per fragment, and the memory mapping function doesn't help to distribute this increased footprint across the memory channels.

The current implementation is likely to require square tiles (same width and height).

The usual value for this parameter is 16. Other values may not work.

Format

Integer

ScanHeight

Description

Defines the height of a scan tile as a number of fragments.

The fragments in a frame are organized in a hierarchy of tiles:

fragment -> quad (2x2) -> gen tile -> scan tile -> over scan tile

The scan tile defines the size of the scan step for the scan based rasterization algorithm. The height of a scan tile must be a multiple of the height of a gen tile (defined by the GenHeight parameter).

The scan tile is the work unit used to distribute fragment quads between the ROP instances and the shader processors. Some implementations would also try to associate the memory footprint of a scan tile with a single memory channel (or group of memory channels) to improve memory access locality. Note that in the current implementation the footprint will increase when multiple samples are supported per fragment, and the memory mapping function doesn't help to distribute this increased footprint across the memory channels.

The current implementation is likely to require square tiles (same width and height).

The usual value for this parameter is 16. Other values may not work.

Format

Integer

GenWidth

Description

Defines the width in fragments of a gen (generation) tile.

The fragments in a frame are organized in a hierarchy of tiles:

fragment -> quad (2x2) -> gen tile -> scan tile -> over scan tile

The gen tile defines the unit output of the Triangle Traversal (Fragment Generation) stage. It's also used as the work unit for the Hierarchical Z stage.

In the current implementation the memory footprint of a gen tile should match the cache line size for the Z and Color caches (or be a multiple of that size when multiple samples per fragment are present). It should also correspond with the granularity of the Hierarchical Z buffer (one element in the HZ Buffer corresponds to a gen tile).

In the current implementation only square tiles (same width and height) are likely to work.

The usual value for this parameter is 8. Other values won't work.

Format

Integer

GenHeight

Description

Defines the height in fragments of a gen (generation) tile.

The fragments in a frame are organized in a hierarchy of tiles:

fragment -> quad (2x2) -> gen tile -> scan tile -> over scan tile

The gen tile defines the unit output of the Triangle Traversal (Fragment Generation) stage. It's also used as the work unit for the Hierarchical Z stage.

In the current implementation the memory footprint of a gen tile should match the cache line size for the Z and Color caches (or be a multiple of that size when multiple samples per fragment are present). It should also correspond with the granularity of the Hierarchical Z buffer (one element in the HZ Buffer corresponds to a gen tile).

In the current implementation only square tiles (same width and height) are likely to work.

The usual value for this parameter is 8. Other values won't work.
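
The tile hierarchy can be summarized numerically; the sketch below only restates the usual values quoted in this page (GenWidth = GenHeight = 8, ScanWidth = ScanHeight = 16) and, purely as an assumption for illustration, uses 4 bytes per fragment to estimate a gen tile footprint.

 # Illustrative tile arithmetic (values from this page; 4 bytes/fragment is
 # only an assumption used for the footprint estimate).
 gen_w, gen_h = 8, 8        # GenWidth, GenHeight (fragments)
 scan_w, scan_h = 16, 16    # ScanWidth, ScanHeight (fragments)
 quads_per_gen_tile = (gen_w * gen_h) // 4                        # 16 quads
 gen_tiles_per_scan_tile = (scan_w * scan_h) // (gen_w * gen_h)   # 4 gen tiles
 gen_tile_footprint = gen_w * gen_h * 4                           # 256 bytes (assumed)
 print(quads_per_gen_tile, gen_tiles_per_scan_tile, gen_tile_footprint)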

Format

Integer

RasterizationBatchSize

Description

Defines the number of triangles that are processed in parallel when using the recursive rasterization algorithm at the Triangle Traversal (Fragment Generation) stage. The equations for the triangles forming a batch of the size defined by this parameter are evaluated recursively at different tile sizes in parallel to generate fragments. When the lower tile level is reached (scan tile) gen tiles worth of fragments are generated iteratively for each of the triangles being processed.

The value for this parameter shouldn't be large due to the large number of ALUs required to evaluate the triangle edge equations at each level of the tile hierarchy.

Usual values for this parameter are 2 or 4 triangles.

Format

Integer

BatchQueueSize

Description

Defines the size of a buffer (FIFO) in Triangle Traversal (Fragment Generation) where triangles obtained from Triangle Setup are stored until they can be processed. When the recursive rasterization algorithm is enabled groups of triangles (number defined by the RasterizationBatchSize parameter) are fetched from this buffer and initiate the traversal stage.

The value assigned to this parameter must be a multiple of the value assigned to the TrianglesCycle parameter (triangle throughput).

The value assigned to this parameter should be a multiple of the value assigned to the RasterizationBatchSize parameter.

In our current configuration files the value of this parameter is 16 triangles.

Format

Integer

RecursiveMode

Description

Enables the recursive rasterization algorithm in the Triangle Traversal (Fragment Generation) stage. When the parameter is enabled recursive rasterization is used and the RasterizationBatchSize must be set to a number equal or greater than 1. When the parameter is disabled a scan based traversal rasterization algorithm is used.

In our current simulation files the recursive rasterization algorithm is used. However the scan based rasterization algorithm should also work correctly. Differences in performance may be due to overheads in the recursive algorithm, the different order in which fragments are generated (in the recursive algorithm multiple triangles may be traversed at the same time), etc.

Format

Boolean (TRUE/FALSE)

Micropolygon rasterization parameters

UseMicroPolygonRasterizer

TriangleBoundOutputLatency

TriangleBoundOpLatency

LargeTriangleFIFOSize

MicroTriangleFIFOSize

BypassStampFIFOSize

MicroTriangleBypass

BypassMode

DumpTriangleBurstSizeHistogram

Hierarchical Z parameters

DisableHZ

Description

When the value of this parameter is TRUE the parameter disables the test at the Hierarchical Z stage of the fragment pipeline.

For performance reasons this parameter should always be set to FALSE.

The Hierarchical Z test requires compression to be enabled at the Z caches, so if compression is disabled the value of this parameter should be TRUE. That is the most common use of the parameter: disabling the test at the Hierarchical Z stage in architectures defined with no Z compression.

The usual value for this parameter is FALSE (test at Hierarchical Z stage enabled).

Format

Boolean (TRUE/FALSE)

StampsPerHZBlock

Description

Defines how many fragment quads (2x2) correspond with a Hierarchical Z buffer block. The number in this parameter should be the number of fragment quads in a gen tile (as defined by the GenWidth and GenHeight parameters).

The value for this parameter should be 16. Other values won't work in the current implementation.


Format

Integer

HierarchicalZBufferSize

Description

Defines the size of the Hierarchical Z buffer as the number of elements, each element corresponding to an HZ block (each block corresponds to a gen tile or Z cache line), that can be stored in the HZ buffer.

To compute the actual size of the Hierarchical Z buffer in bits, multiply the number of elements by the number of bits used to represent an HZ block/element.

This parameter limits the size of the frame buffer. If a framebuffer larger than the limit defined by this parameter is used the current implementation will generate a panic or fail to work properly as it will try to access elements beyond the actual size of the buffer. Notice that when multisampling is enabled a HZ block corresponds with the size of a gen tile in samples, not fragments, so the actual limit to the framebuffer size is in samples not fragments.

The usual value is 262144 which allows for framebuffers up to 4096x4096 fragments/samples.
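
Since each HZ element corresponds to a gen tile, the framebuffer limit follows directly from the gen tile size; the sketch below only reproduces that arithmetic with the usual values quoted in this page (GenWidth = GenHeight = 8, HierarchicalZBufferSize = 262144).

 # Illustrative: maximum square framebuffer covered by the HZ buffer.
 gen_w, gen_h = 8, 8           # GenWidth, GenHeight
 hz_buffer_elements = 262144   # HierarchicalZBufferSize
 covered = hz_buffer_elements * gen_w * gen_h   # 16,777,216 fragments/samples
 side = int(covered ** 0.5)
 print(side)                   # -> 4096 (4096x4096 fragments/samples)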

Format

Integer

HZCacheLines

Description

Defines how many lines the cache used to access the Hierarchical Z buffer has.

As the Hierarchical Z buffer is a very large on-die structure (> 128 KB), a small cache is implemented at the Hierarchical Z test stage to reduce the access latency by holding values read from the large Hierarchical Z buffer.

The performance effect of this parameter has not been evaluated.

The usual value for this parameter is 8 lines.

Format

Integer

HZCacheLineSize

Description

Defines how many elements (HZ block reference values) are stored per HZ cache line. Together with the HZCacheLines parameter and the number of bits used to represent an HZ block/element, it defines the size of the HZ cache.

The HZ cache is used to reduce the latency when accessing the large on-die Hierarchical Z buffer.

The usual value for this parameter is 16 elements/blocks.

Format

Integer

EarlyZQueueSize

Description

Defines the size of the queue (FIFO) which holds fragment quads received from the Triangle Traversal (Fragment Generation) stage and that are being tested against the Hierarchical Z buffer.

The current implementation uses a queue that holds fragment quads at all points of the HZ stage, so it can be considered that the queue has multiple pointers and multiple read ports (that is, it is incorrectly implemented/simulated).

The size of the queue should at least match two times the bandwidth from the Triangle Traversal stage in terms of fragment quads (note that the bandwidth is defined in terms of gen tiles in the configuration file).

The performance effect of this queue has not been evaluated.

The usual value for this parameter is 256 quads.
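
A minimal sketch of the sizing rule above, converting the Triangle Traversal bandwidth from gen tiles to fragment quads with the usual values quoted in this page (StampsPerCycle = 4, GenWidth = GenHeight = 8); illustrative only, not simulator code.

 # Illustrative minimum for EarlyZQueueSize.
 stamps_per_cycle = 4                 # gen tiles per cycle at Triangle Traversal
 quads_per_gen_tile = (8 * 8) // 4    # 16 quads in an 8x8 gen tile
 minimum = 2 * stamps_per_cycle * quads_per_gen_tile   # 128 quads
 assert 256 >= minimum                # the usual value of 256 satisfies the rule
 print(minimum)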


Format

Integer

HZAccessLatency

Description

Defines the latency in cycles for reading or writing a value from/to the Hierarchical Z buffer. The access to the buffer is considered to be fully pipelined so an operation can be issued per cycle but will take the defined number of cycles to complete.

The performance impact of this parameter has not been evaluated.

In our current configuration files the value of this parameter is 5 cycles.

Format

Integer

HZUpdateLatency

Description

Defines the latency in cycles for HZ block updates received from the Z caches.

In the current implementation this parameter defines the latency of the signal between the Z Stencil Test boxes and the Hierarchical Z box that is used to send updates (on Z cache line eviction) for HZ blocks in the Hierarchical Z buffer.

The performance impact of this parameter has not been evaluated.

In our current configuration files the value of this parameter is 4 cycles.

Format

Integer

HZBlocksClearedPerCycle

Description

Defines how many HZ elements/blocks in the Hierarchical Z buffer can be cleared (assigned to the default value: farthest Z) per cycle.

When using fast z/stencil clear commands to clear and initialize the z and stencil buffer this parameter defines how fast the Hierarchical Z buffer can be cleared in parallel with the block state memory in the Z caches.

The performance impact of this parameter has not been evaluated.

In our current configuration files the value of the parameter is 256 blocks/elements.

Format

Integer

Interpolator parameters

NumInterpolators

Description

Defines how many attributes (128 bits) can be interpolated per fragment processed per cycle by the attribute Interpolator stage. As this is per fragment, the actual number of attributes interpolated per cycle (and thus the number of interpolator ALUs required) is the value of this parameter multiplied by the value of the StampsPerCycle parameter and by 4 (the number of fragments in a quad).

The performance impact of this parameter has not been evaluated.

In our current configuration files the value of the parameter is 4 attributes. For GPU architectures that, like ATTILA, implement attribute interpolation before shading (for example the RV7xx family), the actual value is 1.
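
The total interpolation rate, and hence the number of interpolator ALUs implied, follows from the description above; the sketch assumes the quoted values (NumInterpolators = 4, StampsPerCycle = 4) and is purely illustrative.

 # Illustrative: attribute interpolations (interpolator ALUs) per cycle.
 num_interpolators = 4     # attributes per fragment per cycle
 stamps_per_cycle = 4      # fragment quads per cycle (StampsPerCycle)
 fragments_per_quad = 4
 print(num_interpolators * stamps_per_cycle * fragments_per_quad)   # -> 64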


Format

Integer

Work Distributor (FragmentFIFO) parameters

ShaderInputQueueSize

Description

Defines the size of the queues (FIFO) that store shader elements to be issued to a shader processor. There is one queue per shader processor and each queue has the size defined by the value assigned to this parameter. Each entry in the queue represents a shader input. The shader input queue receives shader inputs from other queues in Fragment FIFO (Shader Work Distributor) for all kinds of tasks: vertices, triangles, fragments.

The size of the queue must be at least 2 * numStampPipes multiplied by the size of the thread group in the legacy unified shader model or vector length in the vector shader model.

The performance impact of this parameter has not been tested.

In the current configuration files the size of this queue is 512 inputs.
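
The minimum size rule matches the value used in our configuration files; the check below assumes the usual values quoted in this page (NumStampPipes = 4, ThreadGroup or VectorLength = 64) and is only illustrative.

 # Illustrative minimum for ShaderInputQueueSize.
 num_stamp_pipes = 4    # NumStampPipes (GPU section)
 group_size = 64        # ThreadGroup (legacy model) or VectorLength (vector model)
 minimum = 2 * num_stamp_pipes * group_size   # 512
 assert 512 >= minimum                        # the configured value is exactly 512
 print(minimum)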

Format

Integer

ShaderOutputQueueSize

Description

Defines the size of the queues (FIFO) used to receive shader outputs from the shader processors. There is a queue per shader processor and the size of each queue is the value assigned to this parameter. Each entry in the queue stores the data associated with a shader output. From this queue the shader output is distributed in Fragment FIFO (Shader Work Distributor) to different units in the GPU depending on the shader output type: vertices, triangles, fragments.

The size of the queue must be at least 2 * numStampPipes multiplied by the size of a thread group in the legacy shader model or the vector length in the vector shader model.

The performance impact of this parameter has not been evaluated.

In the current configuration files the value assigned to this parameter is 512 shader outputs.


Format

Integer

ShaderInputBatchSize

Description

Defines the number of consecutive fragments that are batched together to be sent to the same shader processor. This batch size is used by the batch based distribution algorithm that distributes fragments between the different shader processors. The batch based distribution algorithm is used when the TiledShaderDistribution parameter is set to FALSE. When using the batch based distribution algorithm the fragments are distributed between the shader processors based on the order in which they are generated by the Triangle Traversal (Fragment Generation) stage rather than on the fragment position in the framebuffer used by the tiled distribution algorithm.

The value assigned to this parameter must be a multiple of the value assigned to the ThreadGroup parameter (FRAGMENTSHADER section) for the legacy shader model or the VectorLength parameter for the Vector Shader model.

The batch based distribution algorithm for fragment to be processed in the shader processors has not been tested in years so it may not work.

The value assigned to this parameter in our current configuration files is 64 fragments.

Format

Integer

TiledShaderDistribution

Description

Defines the algorithm used to distribute the fragments generated by the Triangle Traversal (Fragment Generation) stage between the shader processors.

When the parameter is set to TRUE the fragments are distributed based on their position in the framebuffer. In the current implementation all the fragments inside a scan tile (size defined by the ScanWidth and ScanHeight parameters) are issued to the same shader processor. Each scan tile is in turn assigned to a shader processor using a suitable distribution algorithm (Morton order, checkerboard, interleaved) based on its position in the framebuffer. The distribution should try to avoid a single shader processor receiving most of the generated fragments and becoming the bottleneck of the GPU.

When the parameter is set to FALSE the fragments are distributed based on the order in which they were generated by the Triangle Traversal (Fragment Generation) stage. A number of consecutive fragments (defined by the ShaderInputBatchSize parameter) are sent as a group/batch to the same shader processor, and the next batch will be sent to the next available shader processor. Batching the fragments is required to keep access locality for the textures as each shader processor is assigned its own Texture Unit.

The usual value for this parameter is TRUE. The batch based distribution algorithm has not been tested in years so it may not work.


Format

Boolean (TRUE/FALSE)

VertexInputQueueSize

Description

Defines the size of the buffer (FIFO) that stores vertices received from the Streamer Loader unit which are waiting to be processed by the shader processors. From this queue groups of vertices (size defined by the ThreadGroup or VectorLength parameters in the FRAGMENTSHADER section) are issued to the shader input queue assigned to a shader processor.

This parameter is not used by the legacy non-unified shader model.

The value assigned to this parameter must be at least the number of elements in a ThreadGroup or VectorLength (FRAGMENTSHADER section), for legacy shader model or Vector Shader model, plus two times the vertex issue rate defined by the NumVertexShaders parameter (GPU section).

The effect on performance of this parameter has not been evaluated.

In our current configuration files the value of this parameter is 128 vertices.


Format

Integer

ShadedVertexQueueSize

Description

Defines the size of the reorder buffer (FIFO) that stores vertices that are being processed by the shader processors or have finished processing and are waiting to be sent to Streamer Commit. This buffer receives the shaded vertices from the shader output queues.

This parameter is not used by the legacy non unified shader model architecture.

The value assigned to this parameter must be at least the value assigned to the ThreadGroup, for legacy shader model, or Vector Length, Vector Shader model, parameters (FRAGMENTSHADER section).

In the current implementation, as the parameter defines a reorder buffer for all vertices being processed and there is no backpressure mechanism with the shader processors, the value assigned limits the actual number of vertices that can be on the shader processors at any time. Note that in the current implementation this parameter and the OutputFIFOSize and OutputMemorySize parameters (STREAMER section) represent redundant structures. The parameter with the smallest value will be the one actually limiting the level of vertex parallelism in the shader processors. For this reason it is advisable to assign the same value to all three parameters, though it is reasonable to assign a slightly higher value to OutputFIFOSize to account for vertices pending to be sent to the Primitive Assembly stage.

The usual value assigned to this parameter is 512 vertices.


Format

Integer

TriangleInputQueueSize

Description

Defines the size of the buffer (FIFO) used to hold triangles that have been received from the Triangle Setup stage and are waiting to be processed in the shader processors. The triangles are stored in this buffer before being assigned to one of the shader input queues. This buffer is only used when triangle setup on the shader is enabled by setting the TriangleSetupOnShader parameter to TRUE.

The value assigned to this parameter must be equal or greater than the following sum:

ThreadGroup/VectorLength + (1 + SetupLatency) * TrianglesCycle

Where ThreadGroup or VectorLength (FRAGMENTSHADER section) define the minimum work that can be issued to a shader processor and the SetupLatency and TrianglesCycle define the latency and throughput from/of the Triangle Setup stage.

This parameter is not used by the legacy non unified shader architecture.

Triangle setup on the shader has not been used in years so it may not work.

In our current configuration files the value assigned to this parameter is 32 triangles.


Format

Integer

TriangleOutputQueueSize

Description

Defines the size of the reorder buffer (FIFO) used for triangles that are being processed in the shader processors or that have finished processing and are waiting to be returned to the Triangle Setup stage. This buffer receives triangles from the shader output queues in Fragment FIFO (Shader Work Distributor).

In the current implementation there is no backpressure mechanism with the shader processors, so the value assigned to this parameter limits the number of triangles that can be on the shader processors at any time. This parameter and the TriangleShaderQueueSize parameter represent redundant structures, though TriangleShaderQueueSize could be set to a slightly higher value to account for triangles already fully processed at the Triangle Setup stage and waiting to be issued to the Triangle Traversal (Fragment Generation) stage.

The value assigned to this parameter must be at least the value assigned to ThreadGroup, legacy shader model, or VectorLength, Vector Shader model (FRAGMENTSHADER section).

The value assigned to this parameter must be equal or greater than:

(1 + SetupLatency) * TrianglesCycle

Triangle setup on the shader has not been used in years so it may not work.

In our current configuration files the value of this parameter is 32 triangles.

Format

Integer

GeneratedStampQueueSize

Description

Defines the size of the buffers (FIFO) that store fragment quads (2x2 fragments) received from the Hierarchical Z. There is one queue per ROP pipeline so the actual number of fragments stored in the buffers has to be multiplied by the number of ROP pipelines as defined by the NumStampPipes parameter (GPU section). Quads stored in these buffers are sent to either the Interpolator stage that precedes shader processing (late Z) or to the Z Stencil Test ROPs (early Z).

The effect on performance of this parameter has not been evaluated.

A usual value for this parameter is 256 fragment quads.


Format

Integer

EarlyZTestedStampQueueSize

Description

Defines the size of the buffers (FIFO) that store fragment quads (2x2 fragments) that have been processed in the Z Stencil Test stage. There is a buffer per ROP pipeline so the total number of fragments held in these buffers is obtained by multiplying by the number of ROP pipelines as defined by the NumStampPipes parameter (GPU section). From these buffers fragments are issued to the Interpolator stage preceding shading (early Z) or to the Color Write stage (late Z).

The performance effect of this parameter has not been evaluated.

A usual value for this parameter is 32 fragment quads.


Format

Integer

InterpolatedStampQueueSize

Description

Defines the size of the buffers (FIFO) that store fragment quads that have been processed in the Interpolator stage and therefore have their fragment input attributes (128 bits) computed from the corresponding triangle vertex output attributes. There is a buffer per ROP pipeline so the total number of fragments (with attributes) stored in these buffers is obtained by multiplying by the number of ROP pipelines defined by the NumStampPipes parameter (GPU section). The quads in the buffers wait until they can be issued to the shader input queues for shader processing.

In the current implementation the number of fragments that the buffer can store is not affected by the actual number of input attributes defined for the fragments (up to 16).

The value assigned to this parameter must be at least equal to the thread group or vector length defined in the FRAGMENTSHADER section.

The performance effect of this parameter has not been evaluated.

A usual value for the parameter is 32 quads.

Format

Integer

ShadedStampQueueSize

Description

Defines the size of the reorder buffer (FIFO) for fragment quads that are being processed in the shaders or have finished processing and are waiting to be issued to the Color Write stage (early Z) or the Z Stencil Test stage (late Z). There is a buffer per ROP pipeline, so the total number of fragments that can be on the shader processors, combined with those that have already finished shading and are waiting to be issued to the next stage, is obtained by multiplying by the number of ROP pipelines defined by the NumStampPipes parameter (GPU section).

In the current implementation, as this is a shader output reorder queue and no backpressure mechanism is implemented with the shader processors, the value assigned to this parameter limits the actual number of fragments that can be on a shader processor at any time. For this reason the combined size of all the shaded quad buffers should be slightly larger than the combined number of shader elements supported per shader processor as defined by the ExecutableThreads parameter, for the legacy shader model, or the VectorThreads * VectorLength parameters, for the Vector Shader model (FRAGMENTSHADER section).

A usual value for this parameter is 2048 fragment quads (when combined with ExecutableThreads set to 8192 or VectorThreads set to 128 and VectorLength set to 64).
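
A rough sketch of why 2048 quads pairs with 8192 shader elements; it deliberately ignores the per-ROP versus per-shader distribution and just shows the arithmetic with the quoted values.

 # Illustrative: fragments held by one shaded-quad buffer vs. shader capacity.
 shaded_stamp_queue_size = 2048            # quads per ROP pipeline buffer
 fragments_per_quad = 4
 fragments_in_buffer = shaded_stamp_queue_size * fragments_per_quad   # 8192
 vector_capacity = 128 * 64                # VectorThreads * VectorLength = 8192
 print(fragments_in_buffer, vector_capacity)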


Format

Integer

FRAGMENTSHADER section

Common Shader parameters

Parameters used by the legacy shader model and the Vector Shader model.


VertexAttributeLoadFromShader

Description

When this parameter is set to TRUE vertex attribute load is performed explicitly in the shader program using the LDA (load attribute) shader instruction and the Streamer Loader is configured in bypass mode so that only the index associated with the vertex is passed down to the shader processors. Shader program translation in the driver must be enabled using the EnableDriverShaderTranslation parameter (SIMULATOR section).

Enabling vertex attribute load from the shader will likely reduce performance as the number of instructions per vertex shader program increases and parallelism between the Streamer Loader stage and the shader processors is prevented. For special workloads that require a high vertex or triangle throughput it would be interesting to evaluate the performance of this configuration and compare it with implementing multiple instances of the Streamer Loader. For now this option has received limited testing, just enough to prove its correct functionality.

Vertex attribute load from the shader is not possible with the legacy non unified shader architecture as a vertex shader processor is not connected to a Texture Unit.

The usual value for this parameter is FALSE.

Format

Boolean (TRUE/FALSE)


SwapOnBlock

Description

When this parameter is set to TRUE the thread group or vector thread that is currently active in the fetch stage will only be replaced (thread swap/switch) with another ready thread group or vector thread when the current thread group or vector thread is blocked, either due to a texture operation, an explicit wait point or the end of the thread.

When this parameter is set to FALSE a thread group or vector thread only remains active in the fetch stage for a single fetch operation (the number of instructions fetched depends on the value of the FetchRate parameter). The next cycle a new thread group or vector thread will become active in the fetch stage. The mechanism used to select the next thread group or vector thread depends on the value of the ThreadWindow parameter.

Evaluation of the performance effect of this parameter is planned as a future research topic.

In our current configuration files this parameter is set to FALSE.

Format

Boolean (TRUE/FALSE)

FixedLatencyALU

Description

When this parameter is set to TRUE all the shader instructions take the same fixed number of cycles to execute through the ALUs. In the current implementation this fixed execution latency is set to 4 but may change or be configurable in future implementations.

When this parameter is set to FALSE the execution latency of a shader instruction will depend on the opcode of the instruction. Some instructions may require just one cycle through the ALUs while others may require up to 9 cycles (for example reciprocal or reciprocal square root operations).

The actual execution latency and repeat rate for shader instructions are currently implemented as per-opcode tables in the ShaderArchitectureParameters class. This parameter is used to select between two sets of tables: fixed latency (FixLat) and variable latency (VarLat). For the vector shader architecture there are also fixed latency and variable latency table sets for the SOA (scalar) ALU architecture. Future implementations may replace the current parameter with one that explicitly selects a table set using a predefined name, for example "FixedLatAOS".

Performance effect of this parameter may be evaluated in the future.

In our current configuration this parameter is set to FALSE.


Format

Boolean (TRUE/FALSE)

InputsPerCycle

Description

Defines how many shader elements can be received per cycle from the Fragment FIFO (Shader Work Distributor) stage. The shader elements or shader inputs can be of any of the currently supported types: vertices, triangles, fragments.

The value assigned to this parameter shouldn't be a bottleneck, so the combined bandwidth from FragmentFIFO to the shader processors should at least match the maximum throughput of the rest of the GPU pipeline, which is usually defined by the fragment pipeline and tends to be limited by the ROP throughput.

The current implementation doesn't take into account the number of active input attributes of the shader element to reduce the actual bandwidth.

A usual value for this parameter is 4 shader elements per cycle.

Format

Integer

OutputsPerCycle

Description

Defines the number of shader elements that can be sent back to the Fragment FIFO (Shader Work Distributor) stage per cycle. The value assigned to this parameter is actually the maximum throughput of the shader processor.

The current implementation decreases the bandwidth with Fragment FIFO based on the number of active output attributes associated with the shader elements/outputs. The bandwidth decrease is based on constants defined in the shader simulation classes. The current implementation for example may provide enough bandwidth for 2 attributes (2x128 bits) per shader element.

The value assigned to this parameter shouldn't, usually, become the bottleneck of the GPU. For this reason the combined throughput from the shader processors to the Fragment FIFO (Shader Work Distributor) stage should at least match the throughput of the ROP pipelines, which is usually the maximum rate at which shader outputs can be consumed by the GPU pipeline.

A usual value for this parameter is 4 shader elements/outputs.


Format

Integer

OutputLatency

Description

Defines the maximum latency in cycles for sending shader elements back to the Fragment FIFO (Shader Work Distributor) stage.

In the current implementation the actual latency depends on the number of active output attributes associated with a shader element. The actual latency is computed using defined constants in the shader simulation classes.

Performance effect of this parameter has not been evaluated.

In our current configuration files the value for this parameter is 11 cycles, which corresponds to a minimum of 3 cycles plus up to 8 additional cycles, one cycle for every two active attributes, up to the maximum of 16 output attributes.
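
The 11 cycle figure decomposes exactly as described; a tiny illustrative sketch of that arithmetic (3 base cycles plus one cycle per two active output attributes, up to 16 attributes):

 # Illustrative decomposition of the 11 cycle OutputLatency value.
 base_cycles = 3
 max_output_attributes = 16
 print(base_cycles + max_output_attributes // 2)   # -> 11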


Format

Integer

TextureUnits

Description

Defines the number of Texture Units that are attached to a shader processor.

In the current implementation a Texture Unit can only be attached to a single shader processor but a shader processor may have multiple Texture Units attached.

The number of ALUs configured for the shader processor (the ThreadRate parameter for the legacy shader model or VectorALUWidth for the vector shader model) together with the value of this parameter defines the ALU to texture ratio of the simulated architecture. The ALU to texture ratio is a key element that affects the performance of the simulated architecture. GPU architectures with a high ALU to texture ratio may become texture limited while GPU architectures with a low ALU to texture ratio may suffer from underutilization of the Texture Units in modern shading dominated games.

The usual value for this parameter is one Texture Unit per shader processor but higher values are possible when ThreadRate or VectorALUWidth are set to values larger than 4 and the simulated architecture requires a low ALU to texture ratio.

Format

Integer

TextureRequestRate

Description

Defines how many texture requests (one texture request per 4 shader elements) can be issued to and received from the Texture Units attached to the shader processor per cycle. Note that this value is the aggregate throughput across all the attached Texture Units.

The value assigned to this parameter limits the throughput of the Texture Units.

The usual value for this parameter is 1 or, if more than one Texture Unit is attached to the shader processor, the number of Texture Units attached.


Format

Integer

TextureRequestGroup

Description

Defines the number of consecutive texture requests (one texture request per 4 shader elements) that are issued to the same Texture Unit. After the defined number of texture requests has been issued the next batch of consecutive texture requests will be issued to the next available Texture Unit.

This parameter is only useful when more than one Texture Unit is attached to the shader processor. The purpose of this parameter is to batch together a relatively large number of requests to improve the locality of texture accesses.

In our current configuration files this parameter is set to 64 texture requests.

Format

Integer

Legacy Shader parameters

Parameters that are only used by the legacy shader model.


ExecutableThreads

Description

Defines, for the legacy shader model, how many threads or shader elements can be executable at any time in the shader processors.

The value assigned to this parameter must be a multiple of the value assigned to ThreadGroup.

The value of this parameter divided by the value of the ThreadRate defines the maximum latency for texture operations that the shader processor can hide. However this maximum value is reduced by resource limitations and threads that may not be in an executable state.

The value of this parameter affects the optimum value for the ShadedStampQueueSize parameter (RASTERIZER section).

In our current configuration files the value assigned to this parameter is 8192 shader threads/elements.


Format

Integer

InputBuffers

Description

For some reason, in the first implementation of the shader processor model there was a difference between the storage for shader elements (vertices or fragments) that were being received from the producer unit (Streamer or FragmentFIFO) and the storage and state for actual runnable threads for shader elements. This parameter defines how much storage is used for those elements that are being loaded but cannot execute until a free runnable thread is available.

The value assigned to this parameter must be a multiple of the value defined for the ThreadGroup parameter. Large values are not required.

The value in our current configuration files is 128 (2x64 for ThreadGroup set to 64).

Format

Integer

ThreadResources

Description

In the current implementation resources mean registers used to store temporary data for the shader elements. At some point a resource represented two registers allocated as a pair but I think it was later changed to represent a single register (128 bits). The actual meaning is mainly controlled by the API/driver implementation, which is the one that decides how many resources a shader program requires to execute. The shader processor logic also takes into account the number of input or output attributes defined for the shader program to compute the resources to reserve per shader element.

The value represents the total number of resources for each instance of the shader processor. Each shader thread (actually an element) has to reserve the required amount of resources before it can start executing.

The value limits how many shader threads/elements can be in execution at any point and depends on the loaded shader program.

The value assigned to this parameter must be equal or greater than the value assigned to the ExecutableThreads parameter. At least a resource per thread is required or the actual number of executable threads would never be reached.

In our current configuration files the value for this parameter is 16384, or two times the number of executable shader threads/elements as defined by the ExecutableThreads parameter. Two registers per fragment shader element is a reasonable minimum to keep enough executable threads/elements for texture operation latency hiding after resource allocation. And 16384 registers is more than enough for vertex processing.
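
The 16384 value is simply two registers (resources) per executable thread/element; an illustrative sketch with the quoted ExecutableThreads value:

 # Illustrative: resources (registers) per shader thread/element.
 thread_resources = 16384    # ThreadResources
 executable_threads = 8192   # ExecutableThreads
 print(thread_resources // executable_threads)   # -> 2 registers per element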


Format

Integer

ThreadRate

ThreadWindow

Description

Defines how the next group of shader elements (a thread group as defined by the ThreadGroup parameter) is selected as the next fetch group to start fetching and executing instructions for the group.

When the parameter is set to TRUE a thread group window tracks the state of the different thread groups and selects, out of order (using a round robin priority), a ready thread group for fetch and execution. Groups that are not ready because they are waiting for memory (Texture Unit) are never selected and the fetch stage will only stall if there are no ready groups in the thread window.

When the parameter is set to FALSE all groups are kept in a FIFO and the group at the head of the FIFO is selected, in order, for fetch and execution. If the group is not ready the fetch stage stalls until the group becomes ready.

The usual value for this parameter is TRUE. FIFO based group selection has not been tested in a while so it may not work correctly.


Format

Boolean (TRUE/FALSE)


FetchDelay

Description

Defines the minimum number of cycles between instructions fetches for thread groups in the legacy shader model. The fetch stage will wait the configured number of cycles before attempting to fetch an instruction from the current or next (depending on the thread scheduling configuration) thread group.

If the defined size of a thread group (ThreadGroup parameter) is higher than the number of available ALUs the fetch logic will already delay the next fetch for as many cycles as iterations are required to fully run the thread group through the available ALUs. The limit defined by this parameter is applied as a minimum on top of the previous limitation so in cases that the thread group would require a single iteration the fetch logic would still stall for the defined number of cycles before fetching instructions for the current or next thread group.

When the groups to fetch are selected from a FIFO (ThreadWindow parameter set to FALSE) this parameter should be set to a value greater than 0 and equal or greater than the number of iterations required to fully run a thread group through the ALUs. This is necessary to avoid desynchronization of the PC for the groups in the FIFO due to the decode stage issuing 'repeat last instruction' commands.

The usual value for this parameter in our configuration files is 4 cycles (thread groups require four iterations through the fetch stage and ALUs).

Format

Integer


FetchRate

Description

Defines the number of SIMD4 operations that each ALU in the shader processor can execute per cycle. The operations are for the same shader thread/element so the parameter also defines how many consecutive instructions are fetched for each fetch operation (in a way similar to a superscalar processor or a VLIW processor). The decode stage will prevent coupled instructions from the fetch stage from executing if it detects dependencies (most VLIW processors defer dependency checking to the compiler).

Combined with the ScalarALU parameter when this parameter has a value of 2 it is used to specify a shader ALU configuration that supports one SIMD4 operation coupled with one scalar operation. When the ScalarALU parameter is set to FALSE the ALU configuration will support as many SIMD4 in parallel as the value defined for this parameter.

The usual value for this parameter is 2 and coupled with the ScalarALU parameter set to TRUE for a SIMD4+scalar ALU configuration.


Format

Integer

ScalarALU

Description

When combined with the FetchRate parameter set to 2 operations this parameter defines an ALU configuration that supports a SIMD4 operation and a scalar operation in the same cycle.

When the value of this parameter is TRUE the FetchRate parameter must be set to 2.

In our current configuration files the value of this parameter is TRUE.


Format

Boolean (TRUE/FALSE)

ThreadGroup

Description

Defines how many shader threads/elements are ganged together as a single thread group that shares the thread state (ready, blocked, finished) and the program counter (PC). When the LockedExecutionMode parameter is set to TRUE all the threads/elements in the group execute the same instructions in lock step similar to how a vector architecture would execute instructions over the vector elements.

In our current configuration files the value of this parameter is 64 shader threads/elements (similar to the ATI/AMD R5xx to RV7xx architectures).


Format

Integer

LockedExecutionMode

Description

In the old shader model this parameter defines if shader threads/elements in a group are executed in lock step. All the threads/elements in the group execute the same instruction(s) using a SIMD execution model and share the same thread information (PC, state).

For fragment shaders in the legacy non-unified shader architecture and for shader processors in the unified shader architecture this parameter should always be set to TRUE. MIMD execution is not suited for fragment shading as at least lock step execution per fragment quad (2x2 fragments) is required to compute texture coordinate derivatives in the Texture Unit.

In our current configuration files the value of this parameter is TRUE.

Format

Boolean (TRUE/FALSE)

Vector Shader parameters

Parameters used by the Vector Shader model.


VectorShader

Description

When this parameter is set to TRUE the Vector Shader model is used to simulate the shader processors. The Vector Shader model is only supported for the unified shader architecture.

When this parameter is set to FALSE the legacy shader model is used to simulate the shader processors or fragment shader processors for the legacy non unified shader architecture.

The old shader model remains as legacy and for compatibility and validation. The Vector Shader model will be the base for future research and development of the shader processors so in our current simulation files the value for this parameter is TRUE.

Format

Boolean (TRUE/FALSE)

VectorThreads

Description

Defines the number of threads supported in the vector shader processor. Each thread is associated with a number of shader elements defined by the VectorLength parameter. All the shader elements in the thread execute the same instructions in lock-step (as in a normal vector architecture). The value assigned to this parameter is the maximum number of threads supported by the vector shader processor; due to resource limitations or threads blocked waiting on memory, the actual number of threads that are executable at a given point in time may be significantly smaller.

The value assigned to this parameter, multiplied by the value of VectorLength and divided by VectorALUWidth, determines the maximum memory access latency (in cycles) that the shader processor can hide (at least when the SwapOnBlock parameter is set to FALSE, otherwise it is just a good approximation).

In our current configuration files the value of this parameter is 128 threads (similar to ATI/AMD R5xx-RV7xx GPUs). Coupled with VectorLength set to 64 the total number of shader elements that can be on execution in the vector shader processor would be 8192.
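
A sketch of the latency hiding estimate given above, using the quoted values (VectorThreads = 128, VectorLength = 64, VectorALUWidth = 64); as the text notes, this is only an approximation, not simulator code.

 # Illustrative: approximate memory latency (cycles) the processor can hide.
 vector_threads = 128
 vector_length = 64
 vector_alu_width = 64
 hidable_cycles = vector_threads * vector_length // vector_alu_width   # 128
 total_elements = vector_threads * vector_length                       # 8192
 print(hidable_cycles, total_elements)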

Format

Integer

VectorResources

Description

Defines how many resources are available per vector thread. When a vector thread is loaded in the vector shader processor it must allocate a number of resources based on the number of active input or output attributes for the shader element type and the temporal registers required by the shader program (value computed by the API/driver). In the current implementation one resource represents one vector register (128-bit register x VectorLength elements).

The value of this parameter limits how many vector threads are actually available for execution depending on the characteristics of the shader programs and shader elements of the different types that are in the vector shader processor.

The value of this parameter must be at least equal to the value of the VectorThreads parameter. The minimum requirement is 1 resource/vector register per vector thread.

In our current configuration files the value of this parameter is 512 or four vector registers per vector thread (with VectorThread set to 128). This value should be similar to the registers per vector in the AMD RV7xx GPUs.


Format

Integer

VectorLength

Description

Defines the number of shader elements in a shader vector thread. All the shader elements in a vector thread share the thread state (ready/blocked state, program counter, etc) and execute the same shader instructions in lock-step.

The value of this parameter must be a multiple of the value assigned to the VectorALUWidth parameter.

The value of this parameter must be a multiple of the values assigned to InputsPerCycle and OutputsPerCycle parameters.

In our current configuration files the value of this parameter is 64 shader elements. This value is similar to the vector length for the AMD R600 and RV770 GPUs.

Format

Integer

VectorALUWidth

Description

Defines the number of ALUs in the vector ALU array of the vector shader processor.

The value assigned to this parameter determines how many cycles are required to execute one instruction (or group of instructions for the SIMD4+scalar architecture) over all the elements in a vector thread. The number of iterations required is computed by dividing the value of the VectorLength parameter by this value. The value of the VectorLength parameter must be a multiple of this value. The fetch and decode stages will stall until all the elements in the vector thread have started executing the instruction.

In our current configuration files the value of this parameter is 64 ALUs. This value is the same as the width of the ALU array in the shader processors of the AMD R600 and RV770 GPUs.
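
The iteration count per instruction follows from the two parameters; the sketch below uses the quoted values and, for contrast, a hypothetical narrower ALU array (the 16-wide case is only an assumption for illustration).

 # Illustrative: cycles (iterations) to run one instruction over a vector thread.
 def iterations(vector_length, vector_alu_width):
     assert vector_length % vector_alu_width == 0
     return vector_length // vector_alu_width
 print(iterations(64, 64))   # current configuration -> 1 iteration
 print(iterations(64, 16))   # hypothetical 16-wide ALU array -> 4 iterations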


Format

Integer

VectorALUConfig

Description

Defines the configuration of the per element ALU in the vector ALU array.

The current implementation of the vector shader model supports multiple per-element ALU configurations, including "simd4+scalar" and "scalar" (SOA).

In the current implementation if the parameter is set to the "scalar" configuration the EnableDriverShaderTranslation parameter (SIMULATOR section) must also be set to TRUE. The conversion from AOS (array-of-structs) shader programs (OGL ARB or D3D ISAs) to SOA (struct of arrays) is performed in the driver translation function.

In our current configuration files the value of this parameter is set to "simd4+scalar".


Format

String

VectorWaitOnStall

Description

When this parameter is set to TRUE the decode stage will stall the vector shader processor if the next shader instruction to execute has a pending dependency with a previous instruction. When multiple instructions are fetched as a group (VectorALUConfig parameter with value "simd4+scalar") only the first instruction (in program order) can actually stall the vector shader processor; if the other instruction has a pending dependency the decode stage will just send a message to the fetch stage to set back the program counter (PC) of the corresponding vector thread to the PC of the dependent instruction.

When this parameter is set to FALSE the decode stage will never stall the vector shader processor. When a pending dependency is detected on a fetched instruction, that instruction, and any other instruction following the instruction with the dependency if multiple instructions are fetched as a group (VectorALUConfig parameter with value "simd4+scalar"), will be dropped before the execution stage and the decode stage will send a message to the fetch stage to set back the program counter (PC) of the corresponding vector thread to the PC of the dependent instruction.

The performance effect of this parameter will be evaluated in the future.

In our current configuration files this parameter is set to FALSE. The implementation of this feature is still in an experimental stage.


Format

Boolean (TRUE/FALSE)

VectorExplicitBlock

Description

When this parameter is set to TRUE a vector thread will only become blocked at shader instructions with the wait point flag set (explicit wait/blocking point). The decode stage will check if there are pending requests to the Texture Units for the vector thread and will block the vector thread until all the pending requests have returned from the Texture Units. When a vector thread issues the last instruction (end flag set) to the vector ALU array, the vector thread is also blocked as a temporary step before transitioning to the finished state.

When this parameter is set to FALSE a vector thread will be blocked when a texture instruction is executed and won't be resumed until the corresponding texture request has returned from the Texture Units.

In the current implementation if this parameter is set to TRUE the EnableDriverShaderTranslation parameter (SIMULATOR section) must be set to TRUE as the explicit wait points for texture results are currently set by the driver shader translation function.

The performance effect of this parameter will be evaluated in the future.

In our current configuration files the parameter is set to FALSE. The implementation of this feature is still in an experimental stage.


Format

Boolean (TRUE/FALSE)

Texture Unit parameters

AddressALULatency

Description

Defines the latency in cycles of the Texture Unit stage that converts texture coordinates into memory addresses.

The computations performed by the Address ALU include deriving, for each required bilinear sample, the addresses of the texels to read from the Texture Cache.

In the current implementation the Address ALU can generate addresses for a bilinear sample (2x2 texels) for a fragment quad (2x2 fragments) in a single cycle. When multiple bilinear samples are required (trilinear filtering, 3D textures or anisotropic filtering) the texture request will require multiple cycles through the address ALU, and therefore the throughput will be reduced.

In our current simulation files the latency is set to 15 cycles.
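The following is a rough, hedged C++ sketch of the per-request cost implied by the description above; it is not the simulator's code, and the exact sample counts per filter mode are our assumption:

 // Hedged sketch: estimate the number of passes a texture request needs
 // through the Address ALU, assuming one pass per bilinear sample (2x2 texels)
 // for a fragment quad, as described above.
 int addressALUPasses(bool trilinear, int anisoRatio)
 {
     int samples = (anisoRatio > 1) ? anisoRatio : 1;   // one bilinear sample per anisotropic step
     if (trilinear)
         samples *= 2;                                  // two mipmap levels sampled per step
     return samples;                                    // one Address ALU pass per bilinear sample
 }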

Format

Integer

FilterALULatency

Description

Defines the latency in cycles of the Texture Unit Filter ALU. The Filter ALU computes the bilinear sample result from the texels read from the Texture Cache and combines bilinear sample results for complex filter modes (trilinear filtering, 3D textures, anisotropic filtering).

In the current implementation the Filter ALU can generate a bilinear sample result for a fragment quad (2x2 fragments) per cycle. For texture requests with complex filter modes the request will iterate through the Filter ALU until all the corresponding bilinear sample results are computed so the throughput will be reduced.

In our current configuration files the latency is set to 10 cycles.


Format

Integer

AnisotropyAlgorithm

Description

Defines the implementation of the anisotropic algorithm that the Texture Unit will use.

There are four implementations currently available, selected by an integer code; the ones referenced in this document are 1 (four axis, low quality/performance) and 3 (EWA, high quality).

In future implementations this parameter may be converted to string type.

In our current configuration files this parameter is set to 3 (EWA, high quality). However, if the simulated traces are limited by texture filtering, and depending on the objective of the experiments being performed, it may be more reasonable to set the parameter to 1 (four axis, low quality/performance).


Format

Integer

ForceMaxAnisotropy

Description

When this parameter is set to TRUE the maximum anisotropy defined by the MaxAnisotropy parameter is forced for all texture requests.

When this parameter is set to FALSE the anisotropy used for texture requests is the one computed by the selected anisotropic algorithm (AnisotropyAlgorithm parameter) clamped to the maximum anisotropy defined by the MaxAnisotropy parameter.

The value of this parameter may affect the performance of the Texture Unit by reducing or increasing the texture filtering workload of the application.

In our current configuration files this parameter is set to FALSE.


Format

Boolean (TRUE/FALSE)

MaxAnisotropy

Description

Defines the maximum anisotropy that the Texture Unit will apply. This parameter defines the maximum supported anisotropic ratio or, equivalently, the maximum number of bilinear samples that can be requested. All texture requests will be clamped to the defined maximum anisotropy. When ForceMaxAnisotropy is set all texture requests (with anisotropic filtering enabled) will be forced to request the defined maximum number of bilinear samples.

The maximum anisotropy supported by the current implementation of the Texture Unit is 16 samples. The computed anisotropic ratio will first be clamped to 16 (constant defined in GPU.h) and then clamped to the value defined by the parameter. Values greater than 16 are not supported for this parameter.

This parameter may be used to reduce (or even increase when combined with the ForceMaxAnisotropy parameter) the texture filtering workload of the simulated traces.

In the current implementation this parameter is set to 16 samples.
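The following is a minimal hedged C++ sketch of how the two parameters interact, based only on the description above; it is not the simulator's code and the function and variable names are ours:

 #include <algorithm>

 // Hedged sketch: derive the anisotropic ratio actually used for a request.
 float finalAnisotropy(float computedRatio, float maxAnisotropy, bool forceMaxAnisotropy)
 {
     const float HW_MAX_ANISO = 16.0f;                    // hardware limit (constant defined in GPU.h)
     if (forceMaxAnisotropy)
         return std::min(maxAnisotropy, HW_MAX_ANISO);    // anisotropic requests forced to the configured maximum
     float ratio = std::min(computedRatio, HW_MAX_ANISO); // clamp to the hardware limit first
     return std::min(ratio, maxAnisotropy);               // then clamp to the configured maximum
 }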

Format

Integer

TrilinearPrecision

Description

Defines the precision, as a number of bits, used to compute the fractional part of the level of detail (lod), which is also used as the weight for combining the two bilinear sample results corresponding to a trilinear sample.

In the current implementation the trilinear weight computation is performed using 32-bit floating point operations and the resulting lod fraction is then clamped to the specified precision.

When, due to precision limitations, the lod fraction becomes 0.0 or 1.0 only one bilinear sample from a single lod is taken. The BrilinearThreshold parameter can be used to increase the range of fractional values for which only one bilinear sample is taken.

In the current implementation the maximum value that can be used for this parameter is 32 bits.

In our current configuration files this parameter is set to 8 bits.


Format

Integer

BrilinearThreshold

Description

Defines the range of fractional lod values for which samples from two lods will be taken.

The range for the given value is defined as:

[value / 2^prec .. 1.0 - value / 2^prec]

Where 'value' is the value defined for this parameter and 'prec' is the precision in bits defined by the TrilinearPrecision parameter.

If the value of this parameter is set to 0 then the full precision of the current implementation will be used (32-bit floating point).

The name 'brilinear' comes from a performance optimization used by AMD and NVidia drivers that greatly reduces the range of fractional lod values for which true trilinear filtering is used.

This parameter can be used to reduce the texture filtering workload of the simulated traces.

The value of this parameter cannot be greater than (2^prec - 1).

In our current configuration files this parameter is set to 0.
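As a worked example with a hypothetical value of 32 and a precision of 8 bits, the range would be [32 / 256 .. 1.0 - 32 / 256] = [0.125 .. 0.875]. The following hedged C++ sketch shows the decision implied by the two parameters; it is not the simulator's code:

 #include <cmath>

 // Hedged sketch: decide whether a trilinear lookup takes bilinear samples from
 // two lods (true) or falls back to a single bilinear sample (false).
 bool takeTwoBilinearSamples(float lodFraction, unsigned prec, unsigned value)
 {
     if (value == 0)                                      // full precision: only exact 0.0 or 1.0 use one lod
         return lodFraction > 0.0f && lodFraction < 1.0f;
     float step = std::ldexp(1.0f, -int(prec));           // 1 / 2^prec
     return lodFraction >= value * step &&
            lodFraction <= 1.0f - value * step;           // inside the trilinear range
 }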


Format

Integer

AnisoRoundPrecision

Description

Defines the precision, as a number of bits, used to compute and round the anisotropic ratio to the next valid anisotropic ratio value.

In the current implementation the anisotropic ratio is computed using 32-bit floating point operations. The full precision value is then clamped and rounded based on the value of this parameter and the values of the AnisoRoundThreshold and AnisoRatioMultOfTwo parameters.

This parameter affects the texture filtering workload by changing how many anisotropic samples are actually required for a texture request.

In the current implementation the maximum value for this parameter is 32 bits.

In our current configuration files this parameter is set to 8 bits.

Format

Integer

AnisoRoundThreshold

Description

Defines how the anisotropic ratio is rounded to the next valid anisotropic ratio value.

If the AnisoRatioMultOfTwo parameter is set to FALSE the anisotropic ratio is rounded to the next integer anisotropic ratio value if the fractional ratio is greater than (value / 2^prec) where 'value' is the value of the parameter and prec is the precision defined by the AnisoRoundPrecision parameter.

If the AnisoRatioMultOfTwo parameter is set to TRUE the anisotropic ratio is rounded to the next even integer anisotropic ratio value if the fractional ratio is greater than (2.0 * (value / 2^prec)) where 'value' is the value of the parameter and 'prec' is the precision defined by the AnisoRoundPrecision parameter.

In the current implementation the anisotropic ratio is computed with 32-bit floating point operations. If the value of this parameter is 0 the full precision of the implementation is used and the anisotropic ratio is rounded to the nearest smaller integer.

This parameter affects the texture filtering workload of the simulated traces.

The value of this parameter cannot be greater than (2^prec - 1) where 'prec' is the precision defined by the AnisoRoundPrecision parameter.

In our current configuration files the value of this parameter is 0.
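The following hedged C++ sketch implements the rounding for the AnisoRatioMultOfTwo = FALSE case as described above; it is not the simulator's code, and the TRUE case would use a step of 2 and a doubled threshold:

 #include <cmath>

 // Hedged sketch: round the full precision anisotropic ratio to the next valid
 // (integer) ratio using the AnisoRoundPrecision ('prec') and
 // AnisoRoundThreshold ('value') parameters.
 float roundAnisoRatio(float ratio, unsigned prec, unsigned value)
 {
     float base = std::floor(ratio);                          // previous integer ratio
     if (value == 0)
         return base;                                         // value 0: round down at full precision
     float threshold = value * std::ldexp(1.0f, -int(prec));  // value / 2^prec
     float frac = ratio - base;                               // fractional part of the ratio
     return (frac > threshold) ? base + 1.0f : base;          // round up only past the threshold
 }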


Format

Integer

AnisoRatioMultOfTwo

Description

When this parameter is set to TRUE the final anisotropic ratio, and therefore the number of samples taken for a texture request with anisotropic filtering enabled, must be 1 or an even number.

This parameter affects how the computed anisotropic ratio is rounded to the next valid anisotropic ratio value (see AnisoRoundThreshold parameter for details).

This parameter affects the texture filtering workload of the simulated traces.

In our current configuration files this parameter is set to FALSE.


Format

Boolean (TRUE/FALSE)

TextureBlockDimension

Description

Defines the size in texels of the first level texture tiles. Textures are stored in memory using tiles of the defined size to improve access locality. The first level texture tile is a square tile with a size of 2^n x 2^n texels, where n is the value assigned to the parameter.

The first level texture tile size is related to the Texture Cache line size.

In our current configuration files this parameter is set to 2 for a size of 4x4 texels (for 32-bit texture formats 64 bytes per cache line).
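As a worked example with the default value: 2^2 x 2^2 = 4 x 4 texels per first level tile, and 4 x 4 texels * 4 bytes per texel = 64 bytes for a 32-bit texture format, matching the 64-byte Texture Cache line size used below.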


Format

Integer

TextureSuperBlockDimension

Description

Defines the size of the second level texture tiles in first level texture tiles. Textures are stored in memory as tiles of tiles. Each second level texture tile corresponds to 2^n x 2^n first level tiles, where n is the value assigned to this parameter. The size of the first level tiles in texels is defined by the TextureBlockDimension parameter.

The second level texture tile size is related to the size of the texture cache.

In our current configuration files this parameter is set to 4 for a size of 16x16 first level tiles (for 32-bit texture formats this corresponds to a 16 KB texture cache).
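As a worked example with the default values: 2^4 x 2^4 = 16 x 16 first level tiles per second level tile, and 16 x 16 tiles * 64 bytes per tile = 16384 bytes (16 KB) of uncompressed 32-bit texture data, matching the 16 KB texture cache noted above.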

Format

Integer

TextureRequestQueueSize

Description

Defines the size of the buffer (FIFO) that holds texture requests received from the shader processors. The texture requests are issued from this queue to the Address ALU stage of the Texture Unit.

In the current implementation the size of this buffer should be large enough to hold all the pending requests that a shader processor could generate. The reason is that the shader processor doesn't have a mechanism to store texture requests for later issue, so if the Texture Unit stops accepting new requests the shader threads may stall.

The size of this buffer must be at least twice the number of texture requests that the shader processor can issue per cycle, as defined by the TextureRequestRate parameter.

The performance impact of this parameter in the current implementation has not been evaluated.

In our current configuration files the value of this parameter is 512 texture requests (each request corresponds to a fragment quad).

Format

Integer

TextureAccessQueue

Description

Defines the size of the buffer (FIFO) that holds texture requests being processed through the whole Texture Unit pipeline.

In the current implementation the defined size acts as a global buffer for requests computing addresses (Address ALU), fetching data (Texture Cache), reading data (Texture Cache) or performing filtering (Filter ALU). Given that the Texture Unit is a relatively long pipeline with limited buffering between stages, coupled with a very large buffer for memory latency hiding, the main purpose of this buffer is to model the actual latency hiding capability of the Texture Unit.

The actual performance effect of this parameter has not been evaluated.

In our current configuration files the value of this parameter is set to 256 texture requests (each texture request corresponds to a fragment quad).


Format

Integer

TextureResultQueue

Description

Defines the size of the buffer (FIFO) that holds texture results pending to be sent back to the corresponding shader processor.

In the current implementation the size of this buffer doesn't have to be very large as the shader processor can consume the results at the same rate that the Texture Unit produces them.

In our current configuration files the value of this parameter is set to 4 texture results (each texture result corresponds to a fragment quad).

Format

Integer

TextureWaitReadWindow

Description

Defines the size of the buffer (window) where read operations to the Texture Cache remain until the corresponding Texture Cache line is received.

This structure is used to support out of order processing of texture requests. After a texture request has fetched all the required data from the Texture Cache it is either moved to the queue of texture requests pending to read the Texture Cache, if all the corresponding Texture Cache lines are already present, or to the wait window, if any line has to be requested from memory. When the pending Texture Cache lines are received the texture requests are moved, out of order, to the read queue.

Small sizes for this structure will likely reduce the out of order and latency hiding capabilities of the Texture Unit.

The actual performance impact of this structure has not been tested. It has also not been verified that texture requests are actually processed out of order.

In the current implementation the value for this parameter is 128 texture requests (each texture request corresponds to a fragment quad).

Format

Integer

TwoLevelTextureCache

Description

When the parameter is set to TRUE the Texture Cache is implemented as two caches. The first and smaller cache holds uncompressed texture data and should have enough bandwidth to service all the texels for a bilinear request for a fragment quad in a cycle (at least for 32-bit texture formats). The second cache, larger in order to hide latency and to profit from texture access locality, holds texture data as stored in memory (which may be compressed, see TextureCacheLineSizeL1), and its bandwidth should be enough to keep the first level cache filled.

When the parameter is set to FALSE the Texture Cache is implemented as a single cache. The cache would hold uncompressed texture data.

In the current implementation, due to the large memory transaction size (64 bytes), when the parameter is set to FALSE the cache line size should be at least 256 bytes or memory bandwidth would be wasted for compressed textures (the uncompressed size is 8x for DXT1 and 4x for DXT3/DXT5).

In our current configuration files the parameter is set to TRUE.


Format

Boolean (TRUE/FALSE)

TextureCacheLineSize

Description

When TwoLevelTextureCache is set to TRUE this parameter defines the size of the first level (L0) cache line in bytes.

When TwoLevelTextureCache is set to FALSE this parameter defines the size of the Texture Cache line in bytes.

In both cases the defined cache line size is for uncompressed data, so take into account the minimum request size to the next memory level and the possible loss of bandwidth due to the decompression ratio.

Cache size is computed as :

TextureCacheWays * TextureCacheLines * TextureCacheLineSize

In our current configuration files the line size is set to 64 bytes.


Format

Integer

TextureCacheWays

Description

When TwoLevelTextureCache is set to TRUE this parameter defines the number of ways (associativity) in the first level (L0) cache.

When TwoLevelTextureCache is set to FALSE this parameter defines the number of ways (associativity) in the Texture Cache.

Cache size is computed as :

TextureCacheWays * TextureCacheLines * TextureCacheLineSize

In our current configuration files this parameter is set to 8 ways.

Format

Integer

TextureCacheLines

Description

When TwoLevelTextureCache is set to TRUE this parameter defines the number of lines per way in the first level (L0) cache.

When TwoLevelTextureCache is set to FALSE this parameter defines the number of lines per way in the Texture Cache.

Cache size is computed as :

TextureCacheWays * TextureCacheLines * TextureCacheLineSize

In our current configuration files this parameter is set to 8 lines.
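With the default values quoted in this section the first level (L0) cache size works out as: 8 ways * 8 lines * 64 bytes = 4096 bytes (4 KB).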

Format

Integer

TextureCachePortWidth

Description

Defines how many bytes are read from each 'port' in the first level (L0) Texture Cache, or in the Texture Cache when it is implemented as a single cache. There are 4x4 ports (4 texels per fragment, 4 fragments) to service all the texels in a bilinear sample for a fragment quad in a single cycle without restrictions.

In our current configuration files this parameter is set to 4 bytes.


Format

Integer

TextureCacheRequestQueueSize

Description

Defines the size of the buffer (FIFO) that stores cache line requests to memory.

The buffer defined by this parameter is implemented in the FetchCache class and stores requests for texture misses (fills from memory, and spills to memory if eviction were required, which is not the case for the Texture Cache).

This buffer limits the maximum number of outstanding misses for the Texture Cache. Given that the Texture Unit requires a large number of outstanding misses to completely hide memory latency a relatively large number of entries is required for good performance.

When the TwoLevelTextureCache parameter is set to TRUE the value of this parameter is used for both instances of the FetchCache class: one for the first level cache (L0) and another for the second level cache (L1).

A proper performance evaluation of this parameter, and of the number of outstanding misses the Texture Cache requires for a given memory configuration (latency), has not been performed.

In our current configuration files the value of this parameter is set to 128 misses.

Format

Integer

TextureCacheInputQueue

Description

Defines the size of the buffer (FIFO) that holds Texture Cache requests for misses.

The buffer defined by this parameter is implemented in the TextureCache class. The value of this parameter limits the number of outstanding misses supported by the Texture Cache. The actual limit is the minimum of the value of this parameter and the value of the TextureCacheRequestQueueSize parameter.

When TwoLevelTextureCache is set to TRUE this defines the size of the buffer for the first level cache (L0).

The performance impact of this parameter and the actual number of outstanding misses required to completely hide the latency of a given memory configuration have not been properly evaluated.

In our current configuration files the value of this parameter is 128 misses.

Format

Integer

TextureCacheMissesPerCycle

Description

Defines how many misses can be generated per cycle by the Texture Cache.

In a single cycle the Texture Cache supports up to 16 fetch operations (4 texels of a bilinear sample for each of 4 fragments). If the number of misses for those 16 fetch operations exceeds the value defined for this parameter the Texture Cache logic will iterate over the same group of 16 fetch operations until all the required missed lines are enqueued in the miss buffers of the Texture Cache.

The actual performance impact of this parameter has not been evaluated.

In our current configuration files the value of this parameter is 8 misses. It is unclear whether this value was chosen based on experimental results or on the analysis of actual hardware implementations.
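As a worked example of the iteration described above: if all 16 fetch operations of a group miss, with the default of 8 misses per cycle the Texture Cache needs 16 / 8 = 2 cycles to enqueue the missed lines for that group.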


Format

Integer

TextureCacheDecompressLatency

Description

Defines the latency in cycles of the texture decompression stage implemented in the Texture Cache.

In the current implementation the texture decompression stage isn't pipelined so only a single cache line can be decompressed at a time. Latencies greater than 1 reduce the throughput of the decompression stage and thus the actual bandwidth into the Texture Cache. This may be changed in future implementations.

In our current configuration files this parameter is set to 1 cycle. The reason is that the current implementation doesn't model a pipelined texture decompression stage and a latency higher than 1 would greatly reduce the bandwidth into the Texture Cache.

Format

Integer

TextureCacheLineSizeL1

Description

When TwoLevelTextureCache is set to TRUE this parameter defines the size of the second level (L1) cache line in bytes.

When TwoLevelTextureCache is set to FALSE this parameter is not used.

Lines in the second level cache may hold compressed texture data.

Second level (L1) cache size is computed as :

TextureCacheWaysL1 * TextureCacheLinesL1 * TextureCacheLineSizeL1.

In our current configuration files the line size is set to 64 bytes.

Format

Integer

TextureCacheWaysL1

Description

When TwoLevelTextureCache is set to TRUE this parameter defines the number of ways (associativity) in the second level (L1) cache.

When TwoLevelTextureCache is set to FALSE this parameter is not used.

Second level (L1) Texture Cache size is computed as :

TextureCacheWaysL1 * TextureCacheLinesL1 * TextureCacheLineSizeL1 

In our current configuration files this parameter is set to 8 ways.

Format

Integer

TextureCacheLinesL1

Description

When TwoLevelTextureCache is set to TRUE this parameter defines the number of lines per way in the second level (L1) cache.

When TwoLevelTextureCache is set to FALSE this parameter is not used.

Second level (L1) Texture Cache size is computed as :

TextureCacheWaysL1 * TextureCacheLinesL1 * TextureCacheLineSizeL1 

In our current configuration files this parameter is set to 8 lines.

Format

Integer

TextureCacheInputQueueL1

Description

Defines the size of the buffer (FIFO) that holds Texture Cache requests for misses.

The buffer defined by this parameter is implemented in the TextureCache class. The value of this parameter limits the number of outstanding misses supported by the Texture Cache. The actual limit is the minimum of the value of this parameter and the value of the TextureCacheRequestQueueSize parameter.

When TwoLevelTextureCache is set to TRUE this defines the size of the buffer for the second level cache (L1).

When TwoLevelTextureCache is set to FALSE this parameter is not used.

In the current implementation the value of this parameter and TextureCacheInputQueue (first level cache) may limit each other. A proper evaluation of the current implementation is required.

The performance impact of this parameter and the actual number of outstanding misses required to completely hide the latency of a given memory configuration have not been properly evaluated.

In our current configuration files the value of this parameter is 128 misses.

Format

Integer

ZSTENCILTEST section

ROPZ parameters

StampsPerCycle (ROPZ)

Description

Defines the number of fragment quads per cycle that the Z and Stencil Test unit (ROPZ) can receive from and return back to Fragment FIFO.

In the current implementation the Z and Stencil Test unit (ROPZ) internal throughput is limited to a single fragment quad per cycle and when multisampling antialiasing is enabled to a single sample per fragment read or written to the Z Cache. This may change in future implementations.

The value of this parameter should match the value defined for the StampsPerCycle (RASTERIZER section) divided by the value defined by the NumStampPipes parameter (GPU section).

In our current configuration files this parameter is set to 1 fragment quad.
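As a hedged example (the rasterizer values here are hypothetical and not taken from our configuration files): with a rasterizer throughput of 2 fragment quads per cycle (StampsPerCycle in the RASTERIZER section) and 2 ROP pipes (NumStampPipes), this parameter would be set to 2 / 2 = 1.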


Format

Integer

BytesPerPixel (ROPZ)

Description

In the original implementation this value represented the bytes per fragment for depth+stencil. In the current implementation this parameter is deprecated (the bit depth of the depth stencil buffer may be fixed or configurable through a GPU register) and no longer used.

This parameter will eventually be removed.


Format

Integer

Z Cache parameters

ZCacheWays

Description

Defines the number of ways (associativity) of the Z Cache.

The Z Cache size is computed as :

ZCacheWays * ZCacheLines * ZCacheStampsPerLine * 16

In the current implementation this parameter is set to 4 ways (Z Cache size of 16 KBs).

Format

Integer

ZCacheLines

Description

Defines the number of lines per way of the Z Cache.

The Z Cache size is computed as :

ZCacheWays * ZCacheLines * ZCacheStampsPerLine * 16

In our current configuration files this parameter is set to 16 lines (Z Cache size of 16 KBs).

Format

Integer

ZCacheStampsPerLine

Description

Defines the size of the Z Cache lines.

In the current implementation the size is defined as number of fragment quads (2x2 fragments) for 32-bit (4 bytes) fragments. The actual cache line size in bytes can be obtained by multiplying the value assigned to this parameter by 16 bytes.

The Z Cache size is computed as :

ZCacheWays * ZCacheLines * ZCacheStampsPerLine * 16

In our current configuration files this parameter is set to 16 (256 bytes per cache line, total 16 KBs for the Z Cache).
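With the default values quoted in this section the Z Cache size works out as: 4 ways * 16 lines * 16 quads per line * 16 bytes per quad = 16384 bytes (16 KB).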

Format

Integer

ZCachePortWidth

Description

Defines how many bytes can be read or written through the Z Cache ports. This defines the bandwidth between the Z Cache and the Z and Stencil Test unit.

As the Z and Stencil Test unit can process one fragment quad per cycle with 32-bit (4 byte) per fragment depth+stencil data, the minimum bandwidth required with the Z Cache for read or write operations is 16 bytes.

In our current implementation this parameter is set to 32 bytes. I think that with the current implementation the value should be 16. The current value would be valid if the Z and Stencil Test unit could read or write two samples per cycle but that's not the case. The implementation will likely change to fix this limitation.

Format

Integer

ZCacheExtraReadPort

Description

When this parameter is set to TRUE an extra read port is modeled in the Z Cache. This allows for a read operation from Z and Stencil Test unit in parallel with a read for a Z cache line eviction.

When the Z cache was first implemented it was detected that without the two read and two write ports the performance reduction due to contention from cache line evictions and fills was quite noticeable. This feature and parameter were added to prevent that performance reduction.

In our current configuration files this parameter is set to TRUE. Eventually a more realistic model should be implemented but some mechanism (banked cache) to reduce contention may be required.


Format

Boolean (TRUE/FALSE)

ZCacheExtraWritePort

Description

When this parameter is set to TRUE an extra write port is modeled in the Z Cache. This allows for a write operation from Z and Stencil Test unit in parallel with a write for a Z cache line fill.

When the Z cache was first implemented it was detected that without the two read and two write ports the performance reduction due to contention from cache line evictions and fills was quite noticeable. This feature and parameter were added to prevent that performance reduction.

In our current configuration files this parameter is set to TRUE. Eventually a more realistic model should be implemented but some mechanism (banked cache) to reduce contention may be required.

Format

Boolean (TRUE/FALSE)

ZCacheRequestQueueSize

Description

Defines the size of the buffer (FIFO) that stores misses while they are being serviced.

This buffer is implemented in the FetchCache class and each entry stores information for both the spill (eviction) and fill (miss service) associated with the miss.

The value assigned to this parameter limits the number of outstanding misses supported by the Z Cache. This number affects the memory latency hiding capability of the Z and Stencil Test stage.

It has been observed that a reduced number of outstanding misses reduces the performance of the Memory Controller.

In our current configuration files this parameter is set to 128 misses.

Format

Integer

ZCacheInputQueueSize

Description

Defines the size of the buffer (FIFO) that stores cache line fill operations while they are being serviced from memory.

This buffer is implemented in the ROPCache class and each entry stores information about the cache line fill operation associated with a cache miss.

The value assigned to this parameter limits the number of outstanding misses supported by the Z Cache. This number affects the memory latency hiding capability of the Z and Stencil Test stage.

It has been observed that a reduced number of outstanding misses reduces the performance of the Memory Controller.

In our current configuration files this parameter is set to 128 misses.

Format

Integer

ZCacheOutputQueueSize

Description

Defines the size of the buffer (FIFO) that stores cache line spill (eviction) operations while they are being serviced to memory.

This buffer is implemented in the ROPCache class and each entry stores information about the cache line spill operation associated with a cache miss.

The value assigned to this parameter limits the number of outstanding misses supported by the Z Cache. This number affects the memory latency hiding capability of the Z and Stencil Test stage.

It has been observed that a reduced number of outstanding misses reduces the performance of the Memory Controller.

In our current configuration files this parameter is set to 128 misses.

Format

Integer

BlockStateMemorySize (ROPZ)

Description

Defines for how many framebuffer blocks (1 block = 1 cache line) the Z cache can hold the compression and clear state.

For framebuffer compression and fast clear support the Z Cache stores, per cache line (framebuffer block), a value of a few bits (for example, 2 to 4 bits) that defines the current state of the corresponding cache line (framebuffer block) as stored in memory: cleared, compressed with different compression ratios or algorithms, or uncompressed.

In the current implementation the total size of the block state memory through all the Z Caches limits the maximum size of the depth+stencil buffer with fast clear, compression and Hierarchical Z buffer support.

Take into account that when multisampling antialiasing is enabled the maximum size of the framebuffer in terms of pixel resolution decreases, as the size in bytes of a framebuffer block (cache line) doesn't change.

In the current implementation this parameter is set to 262144 blocks.
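As a hedged worked example (assuming the 256-byte block size implied by the ZCacheStampsPerLine default above): 262144 blocks * 256 bytes per block = 64 MB of depth+stencil buffer covered by the block state memory of a single Z Cache.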

Format

Integer

BlocksClearedPerCycle (ROPZ)

Description

Defines the number of blocks for which the state can be cleared per cycle in the block state memory of the Z Cache. Defines how fast the fast clear operation is actually performed.

As a block state requires only 2 to 4 bits, clearing (actually writing) a relatively large number of values per cycle may be possible.

Fast clear operations are already fast due to using an internal state memory and clearing blocks, not pixels, so for performance the number of blocks cleared per cycle doesn't need to be that large.

In our current configuration files this parameter is set to 1024 blocks.

Format

Integer

DisableCompression (ROPZ)

Description

When this parameter is set to TRUE, compression of cache lines (1 cache line = 1 depth+stencil buffer block = 1 HZ buffer block) when they are evicted to memory is disabled.

This parameter has a considerable performance effect due to the increase in bandwidth required with memory and because when Z Cache compression is disabled the Hierarchical Z test has to be disabled (DisableHZ parameter in RASTERIZER section).

In our current configuration files this parameter is set to FALSE.


Format

Boolean (TRUE/FALSE)

CompressionAlgorithm (ROPZ)

Description

Defines the compression algorithm that is used to compress Z Cache lines evicted to memory.

The compression algorithms currently implemented are listed under the CompressionAlgorithm (ROPC) parameter in the COLORWRITE section.

For the Z Cache only a specialized version of the HiLo algorithm for depth data with 3 compression ratios is currently implemented. The compression algorithm code is 0, therefore this parameter can only be set to 0.

In our current configuration files this parameter is set to 0.


Format

Integer

CompressionUnitLatency (ROPZ)

Description

Defines the number of cycles required to compress a Z Cache line that is being evicted to memory.

In the current implementation the modeled compression stage is not pipelined. The value assigned to this parameter may reduce the actual bandwidth from the Z Cache to the Memory Controller.

In our current configuration files this parameter is set to 8 cycles. Note that given a cache line size of 256 bytes and a maximum bandwidth to the Memory Controller of 64 bytes per cycle the actual bandwidth is reduced to 32 bytes per cycle with this value. However the Z Cache can compress a cache line and decompress another cache line in parallel so the maximum bandwidth with the Memory Controller can actually be maxed.

Format

Integer

DecompressionUnitLatency (ROPZ)

Description

Defines the number of cycles required to decompress a Z Cache line that is being filled from memory.

In the current implementation the modeled decompression stage is not pipelined. The value assigned to this parameter may reduce the actual bandwidth between the Memory Controller and the Z Cache.

In our current configuration files this parameter is set to 8 cycles. Note that given a cache line size of 256 bytes and a maximum bandwidth to the Memory Controller of 64 bytes per cycle the actual bandwidth is reduced to 32 bytes per cycle with this value. However the Z Cache can compress a cache line and decompress another cache line in parallel so the maximum bandwidth with the Memory Controller can actually be maxed.


Format

Integer

InputQueueSize (ROPZ)

Description

Defines the size of the buffer (FIFO) that stores fragment quads received from the Fragment FIFO (Shader Work Distributor) stage.

From this queue the fragment quads are issued to the Z Cache fetch/allocate stage.

In our current configuration files this parameter is set to 8 fragment quads.


Format

Integer

FetchQueueSize (ROPZ)

Description

Defines the size of the buffer (FIFO) that stores fragment quads that have already performed the Z Cache fetch/allocate operation and are waiting to read data from the cache.

This buffer is the main limit to the memory latency hiding capability of the Z and Stencil Test stage so it should be relatively large.

In our current implementation this parameter is set to 256 fragment quads.

Format

Integer

ReadQueueSize (ROPZ)

Description

Defines the size of the buffer (FIFO) that stores fragment quads that have already read data from the Z Cache and are waiting to perform the Z and Stencil tests.

In our current configuration files this parameter is set to 16 fragment quads.


Format

Integer

OpQueueSize (ROPZ)

Description

Defines the size of the buffer (FIFO) that holds fragment quads that have already been tested and are waiting to write the results of the test back to the Z Cache.

In our current configuration files this parameter is set to 4 fragment quads.

Format

Integer

WriteQueueSize (ROPZ)

Description

Defines the size of the buffer (FIFO) that stores fully processed fragment quads that are waiting to be returned to the Fragment FIFO stage.

In our current configuration files this parameter is set to 8 fragment quads.

Format

Integer

ZALUTestRate

Description

Defines the number of iterations (cycles) through the Z and Stencil Test ALUs required to process a fragment quad.

In the current implementation this parameter is used to reduce the throughput of the Z and Stencil Test stage to less than one fragment quad per cycle.

In our current configuration files this parameter is set to 1 cycle.


Format

Integer

ZALULatency

Description

Defines the latency in cycles of the Z and Stencil Test ALU that implements the tests.

In our current configuration files this parameter is set to 2 cycles.


Format

Integer

COLORWRITE section

ROPC parameters

StampsPerCycle (ROPC)

Description

Defines the number of fragment quads per cycle that the Color Write unit (ROPC) can receive from Fragment FIFO.

In the current implementation the Color Write unit (ROPC) internal throughput is limited to a single fragment quad per cycle and when multisampling antialiasing is enabled to a single sample per fragment read or written to the Color Cache. This may change in future implementations.

The value of this parameter should match the value defined for the StampsPerCycle (RASTERIZER section) divided by the value defined by the NumStampPipes parameter (GPU section).

In our current configuration files this parameter is set to 1 fragment quad.

Format

Integer

BytesPerPixel (ROPC)

Description

In the original implementation this value represented the bytes per fragment for color. In the current implementation this parameter is deprecated (the bit depth of the color buffer may be fixed or configurable through a GPU register) and no longer used.

This parameter will eventually be removed.

Format

Integer

Color Cache parameters

ColorCacheWays

Description

Defines the number of ways (associativity) of the Color Cache.

The Color Cache size is computed as :

ColorCacheWays * ColorCacheLines * ColorCacheStampsPerLine * 16

In the current implementation this parameter is set to 4 ways (Color Cache size of 16 KBs).


Format

Integer

ColorCacheLines

Description

Defines the number of lines per way of the Color Cache.

The Color Cache size is computed as :

ColorCacheWays * ColorCacheLines * ColorCacheStampsPerLine * 16

In our current configuration files this parameter is set to 16 lines (Color Cache size of 16 KBs).


Format

Integer

ColorCacheStampsPerLine

Description

Defines the size of the Color Cache lines.

In the current implementation the size is defined as number of fragment quads (2x2 fragments) for 32-bit (4 bytes) fragments. The actual cache line size in bytes can be obtained by multiplying the value assigned to this parameter by 16 bytes.

The Color Cache size is computed as :

ColorCacheWays * ColorCacheLines * ColorCacheStampsPerLine * 16

In our current configuration files this parameter is set to 16 (256 bytes per cache line, total 16 KBs for the Color Cache).

Format

Integer

ColorCachePortWidth

Description

Defines how many bytes can be read or written through the Color Cache ports. This defines the bandwidth between the Color Cache and the Color Write unit.

As the Color Write unit can process one fragment quad per cycle with 32-bit (4 byte) per fragment color data, the minimum bandwidth required with the Color Cache for read or write operations is 16 bytes.

In our current implementation this parameter is set to 32 bytes. I think that with the current implementation the value should be 16. The current value would be valid if the Color Write unit could read or write two samples per cycle but that's not the case. The implementation will likely change to fix this limitation.

Format

Integer

ColorCacheExtraReadPort

Description

When this parameter is set to TRUE an extra read port is modeled in the Color Cache. This allows for a read operation from Color Write unit in parallel with a read for a Color cache line eviction.

When the Color cache was first implemented it was detected that without the two read and two write ports the performance reduction due to contention from cache line evictions and fills was quite noticeable. This feature and parameter were added to prevent that performance reduction.

In our current configuration files this parameter is set to TRUE. Eventually a more realistic model should be implemented but some mechanism (banked cache) to reduce contention may be required.


Format

Boolean (TRUE/FALSE)

ColorCacheExtraWritePort

Description

When this parameter is set to TRUE an extra write port is modeled in the Color Cache. This allows for a write operation from Color Write unit in parallel with a write for a Color cache line fill.

When the Color cache was first implemented it was detected that without the two read and two write ports the performance reduction due to contention from cache line evictions and fills was quite noticeable. This feature and parameter were added to prevent that performance reduction.

In our current configuration files this parameter is set to TRUE. Eventually a more realistic model should be implemented but some mechanism (banked cache) to reduce contention may be required.


Format

Boolean (TRUE/FALSE)

ColorCacheRequestQueueSize

Description

Defines the size of the buffer (FIFO) that stores misses while they are being serviced.

This buffer is implemented in the FetchCache class and each entry stores information for both the spill (eviction) and fill (miss service) associated with the miss.

The value assigned to this parameter limits the number of outstanding misses supported by the Color Cache. This number affects the memory latency hiding capability of the Color Write stage.

It has been observed that a reduced number of outstanding misses reduces the performance of the Memory Controller.

In our current configuration files this parameter is set to 128 misses.

Format

Integer

ColorCacheInputQueueSize

Description

Defines the size of the buffer (FIFO) that stores cache line fill operations while they are being serviced from memory.

This buffer is implemented in the ROPCache class and each entry stores information about the cache line fill operation associated with a cache miss.

The value assigned to this parameter limits the number of outstanding misses supported by the Color Cache. This number affects the memory latency hiding capability of the Color Write stage.

It has been observed that a reduced number of outstanding misses reduces the performance of the Memory Controller.

In our current configuration files this parameter is set to 128 misses.


Format

Integer

ColorCacheOutputQueueSize

Description

Defines the size of the buffer (FIFO) that stores cache line spill (eviction) operations while they are being serviced to memory.

This buffer is implemented in the ROPCache class and each entry stores information about the cache line spill operation associated with a cache miss.

The value assigned to this parameter limits the number of outstanding misses supported by the Color Cache. This number affects the memory latency hiding capability of the Color Write stage.

It has been observed that a reduced number of outstanding misses reduces the performance of the Memory Controller.

In our current configuration files this parameter is set to 128 misses.


Format

Integer

BlockStateMemorySize (ROPC)

Description

Defines for how many framebuffer blocks (1 block = 1 cache line) the Color cache can hold the compression and clear state.

For framebuffer compression and fast clear support the Color Cache stores, per cache line (framebuffer block), a value of a few bits (for example, 2 to 4 bits) that defines the current state of the corresponding cache line (framebuffer block) as stored in memory: cleared, compressed with different compression ratios or algorithms, or uncompressed.

In the current implementation the total size of the block state memory through all the Color Caches limits the maximum size of the color buffer with fast clear and compression support.

Take into account that when multisampling antialiasing is enabled the maximum size of the framebuffer in terms of pixel resolution decreases, as the size in bytes of a framebuffer block (cache line) doesn't change.

In the current implementation this parameter is set to 262144 blocks.


Format

Integer

BlocksClearedPerCycle (ROPC)

Description

Defines the number of blocks for which the state can be cleared per cycle in the block state memory of the Color Cache. Defines how fast the fast clear operation is actually performed.

As a block state requires only 2 to 4 bits, clearing (actually writing) a relatively large number of values per cycle may be possible.

Fast clear operations are already fast due to using an internal state memory and clearing blocks, not pixels, so for performance the number of blocks cleared per cycle doesn't need to be that large.

In our current configuration files this parameter is set to 1024 blocks.


Format

Integer

DisableCompression (ROPC)

Description

When this parameter is set to TRUE, compression of cache lines (1 cache line = 1 color buffer block) when they are evicted to memory is disabled.

This parameter has a considerable performance effect due to the increase in bandwidth required with memory.

In our current configuration files this parameter is set to FALSE.


Format

Boolean (TRUE/FALSE)

CompressionAlgorithm (ROPC)

Description

Defines the compression algorithm that is used to compress Color Cache lines evicted to memory.

Currently implemented compression algorithms are:

   * 0 : HiLo algorithm (original). The implemented algorithm is actually suitable only for depth compression and the presence of varying stencil values will greatly reduce performance. The algorithm is based on an ATI patent for depth compression. Two reference depth values are stored per block and two extra depth values are derived from the reference depth values. These four values are used as the high (MSB) bits of the block depth values. The block depth values are compressed as indices to one of these four depth values and an offset (lower bits, LSB). Only two compression ratios are supported: 2x, 4x.
   * 1 : An enhanced implementation of the HiLo algorithm that supports up to 3 compression ratios and rearranging the MSB and LSB relative to the different fragment/sample fields (stencil/depth or R/G/B/A). Implemented by Christian, look at code and his Master Thesis for details.
   * 2 : MSAA compression algorithm. Implemented by Christian, look at code and his Master Thesis for details. 

In our current configuration files this parameter is set to 0.


Format

Integer

CompressionUnitLatency (ROPC)

Description

Defines the number of cycles required to compress a Color Cache line that is being evicted to memory.

In the current implementation the modeled compression stage is not pipelined. The value assigned to this parameter may reduce the actual bandwidth from the Color Cache to the Memory Controller.

In our current configuration files this parameter is set to 8 cycles. Note that given a cache line size of 256 bytes and a maximum bandwidth to the Memory Controller of 64 bytes per cycle the actual bandwidth is reduced to 32 bytes per cycle with this value. However the Color Cache can compress a cache line and decompress another cache line in parallel so the maximum bandwidth with the Memory Controller can actually be maxed.


Format

Integer

DecompressionUnitLatency (ROPC)

Description

Defines the number of cycles required to decompress a Color Cache line that is being filled from memory.

In the current implementation the modeled decompression stage is not pipelined. The value assigned to this parameter may reduce the actual bandwidth between the Memory Controller and the Color Cache.

In our current configuration files this parameter is set to 8 cycles. Note that given a cache line size of 256 bytes and a maximum bandwidth to the Memory Controller of 64 bytes per cycle the actual bandwidth is reduced to 32 bytes per cycle with this value. However the Color Cache can compress a cache line and decompress another cache line in parallel so the maximum bandwidth with the Memory Controller can actually be maxed.


Format

Integer

InputQueueSize (ROPC)

Description

Defines the size of the buffer (FIFO) that stores fragment quads received from the Fragment FIFO (Shader Work Distributor) stage.

From this queue the fragment quads are issued to the Color Cache fetch/allocate stage.

In our current configuration files this parameter is set to 8 fragment quads.


Format

Integer

FetchQueueSize (ROPC)

Description

Defines the size of the buffer (FIFO) that stores fragment quads that have already performed the Color Cache fetch/allocate operation and are waiting to read data from the cache.

This buffer is the main limit to the memory latency hiding capability of the Color Write stage so it should be relatively large.

In our current implementation this parameter is set to 256 fragment quads.


Format

Integer

ReadQueueSize (ROPC)

Description

Defines the size of the buffer (FIFO) that stores fragment quads that have already read data from the Color Cache and are waiting to perform the color/blend operation.

In our current configuration files this parameter is set to 16 fragment quads.


Format

Integer

OpQueueSize (ROPC)

Description

Defines the size of the buffer (FIFO) that holds fragment quads that have already been operated/blended and are waiting to write the results of the operation back to the Color Cache.

In our current configuration files this parameter is set to 4 fragment quads.


Format

Integer

WriteQueueSize (ROPC)

Description

Defines the size of the buffer (FIFO) that stores fully processed fragment quads that are waiting to be eliminated (end of the fragment pipeline) or returned to the Fragment FIFO stage.

In our current configuration files this parameter is set to 8 fragment quads.


Format

Integer

BlendALUTestRate

Description

Defines the number of iterations (cycles) through the Color Blend ALUs required to process a fragment quad.

In the current implementation this parameter is used to reduce the throughput of the Color Write stage to less than one fragment quad per cycle.

In our current configuration files this parameter is set to 1 cycle.


Format

Integer

BlendALULatency

Description

Defines the latency in cycles of the Color Blend ALU that implements the color and blend operations.

In our current configuration files this parameter is set to 2 cycles.


Format

Integer

DAC section

BytesPerPixel (DAC)

Description

In the original implementation this parameter defined the bits (actually bytes) per fragment in the color buffer. The current implementation defines the bit depth of the color buffer format using a GPU register so this parameter has been deprecated and is no longer used.

This parameter will eventually be removed.


Format

Integer

BlockSize

Description

Defines the size of a framebuffer block in bytes. A framebuffer block corresponds to a Z Cache or Color Cache line.

In the current implementation the value of this parameter should match the cache line size for the Z Cache and Color Cache (or at least for the Color Cache). In future implementations the actual cache line size will be directly passed to the DAC and this parameter will be removed.

In our current configuration files this parameter is set to 256 bytes (matching Z Cache and Color Cache line sizes).


Format

Integer

BlockUpdateLatency

Description

Defines the latency in cycles for the block state memory updates received from the Color Cache.

In the current implementation the value of this parameter is the latency assigned to the signal between the Color Write units and the DAC that is used to pass the updates of the block state memory when a frame is finished.

We are not currently modeling a real DAC and the DAC unit is only used to dump the frames to a file for verification, so realistic performance is not a current objective. In any case, in a real implementation the block state memory should be accessible (through copies or direct access) to the DAC.

In our current configuration files this parameter is set to 1.


Format

Integer

BlocksUpdatedPerCycle

Description

Defines how many block state values can be received/stored per cycle from the Color Write units.

In the current implementation this parameter is set to 1024 blocks. Note that at 2 or 4 bits per block state value that's 256 or 512 bytes per cycle.

Format

Integer

BlockRequestQueueSize

Description

Defines the size of the buffer (FIFO) that stores information for framebuffer blocks that are being requested to memory and decompressed.

In our current configuration files this parameter is set to 32 blocks.


Format

Integer

DecompressionUnitLatency

Description

Defines the number of cycles required to decompress a framebuffer block.

The decompression stage in the DAC is not modeled as a pipelined unit so a value higher than 1 reduces the throughput of this stage and the effective bandwidth at which data for framebuffer blocks is read.

In our current configuration files this parameter is set to 1 cycle.


Format

Integer

RefreshRate

Description

Defines the screen refresh rate in cycles, that is, the frequency at which the DAC unit reads, decompresses and resolves the color buffer and sends the data to a display device (in the actual implementation, a file in PPM format).

Only if SynchedRefresh is set to FALSE will the value of this parameter be used to trigger the screen refresh operation in the DAC unit.

The current implementation of the DAC unit isn't used to correctly model a real DAC or the real costs of the screen refresh operation in memory bandwidth and strict timing requirements (refresh frequency, VBlank, HBlank, etc). For this reason the normal refresh mode, triggering the screen refresh operation at a given frequency, is not really used in our current simulations and the feature has not been fully tested nor used for a long time.

This parameter was defined before the GPUClock parameter (GPU section) was implemented, when there was no concept of 'real' time (defined as a base frequency) in the simulator. For this reason the refresh rate is defined in cycles. A likely change, if we decide to model a real DAC, will be to modify this parameter to define the actual refresh frequency in Hz rather than in cycles.

In our current configuration files this parameter is set to 5000000 cycles.
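As a hedged example (the clock frequency here is hypothetical): at a 500 MHz GPU clock (GPUClock parameter), 5000000 cycles corresponds to one refresh every 10 ms, that is, roughly 100 Hz.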


Format

Integer

SynchedRefresh

Description

When this parameter is set to TRUE the DAC 'screen refresh', which actually dumps the current color buffer into a file (in PPM format), is synchronized with the frame end (framebuffer swap command).

When this parameter is set to FALSE the DAC will 'refresh the screen' at the rate defined by the RefreshRate parameter. This refresh mode would be the correct one, as implemented by real DAC units, to send the framebuffer data to a screen or monitor.

The main purpose of this parameter is to dump the framebuffer content after rendering has finished, for debugging and validation purposes.

In the current implementation the purpose of the DAC unit isn't to accurately simulate the screen refresh and the associated memory bandwidth and strict timing requirements. For this reason the normal refresh mode is only partially tested and has not been used for a long time.

In our current configuration files this parameter is set to TRUE.


Format

Boolean (TRUE/FALSE)

RefreshFrame

Description

When this parameter is set to TRUE 'screen refresh' is enabled in the DAC.

In the current implementation 'screen refresh' is implemented with the DAC unit reading, decompressing and resolving (when MSAA is enabled) the framebuffer to the final on screen image as fast as possible and dumping the contents to a file (in PPM format). The main use is to validate the simulation by checking the resulting image.

In the current implementation the purpose of the DAC unit isn't to implement the correct screen refresh operation with the associated memory bandwidth and strict timing requirements.

In our current configuration files this parameter is set to TRUE.


Format

Boolean (TRUE/FALSE)

SaveBlitSourceData

Description

When this parameter is set to TRUE the source data for all blit operations (using the Blitter unit) is dumped into a file (in PPM format).

The purpose of this feature is/was to help debug the Blitter. It can also be used to log the usage of the Blitter operation (copy to texture in OpenGL).

In our current configuration this parameter is set to FALSE.


Format

Boolean (TRUE/FALSE)
