Added codebase and microarch guides and updated vortex and simulation guides slightly

This commit is contained in:
Malik Aki Burton 2021-04-23 18:31:52 -04:00
parent 2c908b4a07
commit e3e5c178ff
4 changed files with 133 additions and 5 deletions

doc/Codebase.md

@@ -0,0 +1,35 @@
# Vortex Codebase
The directory/file layout of the Vortex codebase is as follows:
- `benchmark`: contains OpenCL, RISC-V, and vector tests
- `opencl`: contains basic kernel operation tests (i.e. vector add, transpose, dot product)
- `riscv`: contains official riscv tests which are pre-compiled into binaries
- `vector`: tests for vector instructions (not yet implemented)
- `ci`: contains tests to be run during continuous integration (Travis CI)
- driver, opencl, riscv_isa, and runtime tests
- `driver`: contains driver software implementation (software that is run on the host to communicate with the vortex processor)
- `opae`: contains code for driver that runs on FPGA
- `rtlsim`: contains code for the driver that runs on the local machine (driver built using Verilator, which converts the RTL into a C++ binary)
- `simx`: contains code for the driver that runs on the local machine (using the SimX simulator)
- `include`: contains vortex.h which has the vortex API that is used by the drivers
- `runtime`: contains software used inside kernel programs to expose GPGPU capabilities
- `include`: contains vortex API needed for runtime
- `linker`: contains linker file for compiling kernels
- `src`: contains implementation of vortex API (from include folder)
- `tests`: contains runtime tests
- `simple`: contains test for GPGPU functionality allowed in vortex
- `simx`: contains simX, the cycle approximate simulator for vortex
- `miscs`: contains old code that is no longer used
- `hw`:
- `unit_tests`: contains unit tests for the cache and queue RTL
- `syn`: contains all synthesis scripts (quartus and yosys)
- `quartus`: contains code to synthesize the cache, core, pipeline, top, and vortex modules stand-alone
- `simulate`: contains RTL simulator (verilator)
- `testbench.cpp`: runs either the riscv, runtime, or opencl tests
- `opae`: contains source code for the accelerator functional unit (AFU) and code which programs the fpga
- `rtl`: contains rtl source code
- `cache`: contains cache subsystem code
- `fp_cores`: contains floating point unit code
- `interfaces`: contains code that handles communication for each of the units of the microarchitecture
- `libs`: contains general-purpose modules (i.e., buffers, encoders, arbiters, pipe registers)

doc/Microarchitecture.md

@@ -0,0 +1,94 @@
# Vortex Microarchitecture
### Vortex GPGPU Execution Model
Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with a single warp issued per cycle.
- **Threads**
- Smallest unit of computation
- Each thread has its own register file (32 int + 32 fp registers)
- Threads execute in parallel
- **Warps**
- A logical cluster of threads
- Each thread in a warp executes the same instruction
- The PC is shared; a thread mask is maintained for writeback
- Warp execution is time-multiplexed, one warp issued per cycle
- Ex. warp 0 executes at cycle 0, warp 1 executes at cycle 1
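As a concrete illustration, here is a minimal Python sketch of this execution model (all names are illustrative, not taken from the Vortex source): each warp shares one PC and carries a per-thread mask, and the scheduler time-multiplexes warps by issuing one per cycle in round-robin order.

```python
# Illustrative sketch of SIMT warp time-multiplexing (not Vortex RTL):
# one warp issues per cycle, round-robin across all warps.
NUM_WARPS, NUM_THREADS = 4, 4

class Warp:
    def __init__(self, wid):
        self.wid = wid
        self.pc = 0                        # PC is shared by all threads in the warp
        self.tmask = [True] * NUM_THREADS  # per-thread active mask

def issue_order(warps, cycles):
    """Return which warp issues on each cycle (round-robin)."""
    return [warps[c % len(warps)].wid for c in range(cycles)]

warps = [Warp(w) for w in range(NUM_WARPS)]
print(issue_order(warps, 6))  # [0, 1, 2, 3, 0, 1]
```

With four warps, warp 0 issues again on cycle 4, matching the interleaving described above.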
### Vortex RISC-V ISA Extension
- **Thread Mask Control**
- Control the number of threads to activate during execution
- `TMC` *count*: activate *count* threads
- **Warp Scheduling**
- Control the number of warps to activate during execution
- `WSPAWN` *count, addr*: activate *count* warps and jump to *addr* location
- **Control-Flow Divergence**
- Control threads to activate when a branch diverges
- `SPLIT` *predicate*: apply the 'taken' predicate as the thread mask and save the 'not-taken' mask onto the IPDOM stack
- `JOIN`: restore the 'not-taken' thread mask
- **Warp Synchronization**
- `BAR` *id, count*: stall warps entering barrier *id* until *count* is reached
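The SPLIT/JOIN mechanism can be sketched in a few lines of Python (a simplified model under assumed semantics, not the Vortex RTL): SPLIT pushes the original mask and the 'not-taken' mask onto the IPDOM stack, so the first JOIN resumes the 'not-taken' path and the second JOIN reconverges to the original mask.

```python
# Simplified IPDOM-stack model of control-flow divergence (illustrative only).
def split(tmask, predicate, ipdom_stack):
    """Apply the 'taken' predicate as the new thread mask; save the rest."""
    taken = [t and p for t, p in zip(tmask, predicate)]
    not_taken = [t and not p for t, p in zip(tmask, predicate)]
    ipdom_stack.append(tmask)      # restored by the second JOIN (reconvergence)
    ipdom_stack.append(not_taken)  # restored by the first JOIN
    return taken

def join(ipdom_stack):
    """Restore the thread mask saved on the IPDOM stack."""
    return ipdom_stack.pop()

stack = []
mask = split([True] * 4, [True, False, True, False], stack)
print(mask)         # [True, False, True, False]  -> 'taken' path
print(join(stack))  # [False, True, False, True]  -> 'not-taken' path
print(join(stack))  # [True, True, True, True]    -> reconverged
```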
### Vortex Pipeline/Datapath
![Image of Vortex Microarchitecture](vortex_microarchitecture_v2.png)
Vortex has a 5-stage pipeline: Fetch | Decode | Issue | Execute | Commit/Writeback.
- **Fetch**
- Warp Scheduler
- Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
- Instruction Cache
- Retrieve instruction from cache, issue I-cache requests/responses
- **Decode**
- Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
- Branch, TMC, SPLIT/JOIN, WSPAWN
- Precompute used_regs mask (needed for Issue stage)
- **Issue**
- Scheduling
- In-order issue (operands/execute unit ready), out-of-order commit
- IBuffer
- Stores fetched instructions in separate per-warp queues; selects the next warp through round-robin scheduling
- Scoreboard
- Track in-use registers
- GPRs (General-Purpose Registers) stage
- Fetch issued instruction operands and send operands to execute unit
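The scoreboard's role in in-order issue can be sketched as follows (a minimal model with illustrative names, not the Vortex implementation): each issued instruction reserves its destination register, and a later instruction may issue only when none of its registers are still in flight.

```python
# Simplified per-warp scoreboard (illustrative, not Vortex source):
# tracks in-use registers between issue and writeback.
class Scoreboard:
    def __init__(self):
        self.in_use = set()  # (warp_id, reg) pairs pending writeback

    def can_issue(self, wid, rd, rs_list):
        """An instruction may issue only if none of its registers are pending."""
        return all((wid, r) not in self.in_use for r in [rd] + rs_list)

    def issue(self, wid, rd):
        self.in_use.add((wid, rd))      # reserve destination until writeback

    def writeback(self, wid, rd):
        self.in_use.discard((wid, rd))  # release register, unblocking dependents

sb = Scoreboard()
sb.issue(0, 5)                   # warp 0 writes r5
print(sb.can_issue(0, 6, [5]))   # False: r5 still in flight
sb.writeback(0, 5)
print(sb.can_issue(0, 6, [5]))   # True
```

Note that registers are tracked per warp, since each warp (and each thread) has its own register file.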
- **Execute**
- ALU Unit
- Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
- MULDIV Unit
- Multiplier - done in 2 cycles
- Divider - division and remainder, done in 32 cycles
- Implements a serial algorithm (stalls the pipeline)
- FPU Unit
- Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
- CSR Unit
- Stores control/status registers - device caps, FPU status flags, performance counters
- Handle external CSR requests (requests from host CPU)
- LSU Unit
- Handle load/store operations, issue D-cache requests, handle D-cache responses
- Commit load responses - saves storage, Scoreboard tracks completion
- GPGPU Unit
- Handle GPGPU instructions
- TMC, WSPAWN, SPLIT, BAR
- JOIN is handled by Warp Scheduler (upon SPLIT response)
- **Commit**
- Commit
- Update CSR flags, update performance counters
- Writeback
- Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
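The writeback selection can be modeled as a fixed-priority arbiter. In the sketch below only the ALU-first rule comes from the text; the order of the remaining units is an assumption for illustration.

```python
# Fixed-priority writeback arbiter (illustrative). Only "ALU has highest
# priority" is stated in the doc; the rest of this order is assumed.
PRIORITY = ["ALU", "LSU", "FPU", "MULDIV", "CSR", "GPGPU"]

def select_writeback(ready_units):
    """Pick the highest-priority execute unit with a result ready."""
    for unit in PRIORITY:
        if unit in ready_units:
            return unit
    return None  # nothing to write back this cycle

print(select_writeback({"FPU", "ALU"}))  # ALU always wins
print(select_writeback(set()))           # None
```

Giving the single-cycle ALU top priority keeps its results from queueing up behind long-latency units.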
- **Clustering**
- Group multiple cores into clusters (optionally share L2 cache)
- Group multiple clusters (optionally share L3 cache)
- Configurable at build time
- Default configuration:
- #Clusters = 1
- #Cores = 4
- #Warps = 4
- #Threads = 4
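Multiplying out the default configuration gives the total number of concurrent hardware threads:

```python
# Total hardware threads in the default build (values from the list above).
clusters, cores, warps, threads = 1, 4, 4, 4
total = clusters * cores * warps * threads
print(total)  # 64
```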
- **FPGA AFU Interface**
- Manage CPU-GPU communication
- Query device caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
- Local Memory - GPU access to local DRAM
- Reserved I/O addresses - redirect to host CPU, console output


@@ -24,10 +24,9 @@ Running tests under specific drivers (rtlsim, simx, fpga) is done using the script
- *L3cache* - used to enable the shared l3cache among the Vortex clusters.
- *Driver* - used to specify which driver to run the Vortex simulation (either rtlsim, vlsim, fpga, or simx).
- *Debug* - used to enable debug mode for the Vortex simulation.
- *Scope* -
- *Perf* - is used to enable the detailed performance counters within the Vortex simulation.
- *App* - is used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder though not all of them work with the current version of Vortex.
- *Args* -
- *Perf* - used to enable the detailed performance counters within the Vortex simulation.
- *App* - used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder though not all of them work with the current version of Vortex.
- *Args* - used to pass additional arguments to the application.
Example use of command line arguments: Run the sgemm benchmark using the vlsim driver with a Vortex configuration of 1 cluster, 4 cores, 4 warps, and 4 threads.


@@ -2,7 +2,7 @@
### Table of Contents
- Vortex Architecture
- Vortex Microarchitecture
- Vortex Software
- [Vortex Simulation](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Simulation.md)
- [FPGA](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Flubber_FPGA_Startup_Guide.md)