This commit is contained in:
Blaise Tine 2021-04-29 07:17:59 -04:00
commit 9f58750664
9 changed files with 299 additions and 10 deletions

@@ -18,7 +18,7 @@ Directory structure
- benchmarks: OpenCL and RISC-V benchmarks
- docs: [documentation](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Vortex.md).
- hw: hardware sources.

doc/Codebase.md Normal file

@@ -0,0 +1,35 @@
# Vortex Codebase
The directory/file layout of the Vortex codebase is as follows:
- `benchmark`: contains OpenCL, RISC-V, and vector tests
- `opencl`: contains basic kernel operation tests (i.e. vector add, transpose, dot product)
- `riscv`: contains official riscv tests which are pre-compiled into binaries
- `vector`: tests for vector instructions (not yet implemented)
- `ci`: contains tests to be run during continuous integration (Travis CI)
- driver, opencl, riscv_isa, and runtime tests
- `driver`: contains driver software implementation (software that is run on the host to communicate with the vortex processor)
- `opae`: contains code for driver that runs on FPGA
- `rtlsim`: contains code for driver that runs on local machine (driver built using verilator which converts rtl to c++ binary)
- `simx`: contains code for driver that runs on local machine (vortex)
- `include`: contains vortex.h which has the vortex API that is used by the drivers
- `runtime`: contains software used inside kernel programs to expose GPGPU capabilities
- `include`: contains vortex API needed for runtime
- `linker`: contains linker file for compiling kernels
- `src`: contains implementation of vortex API (from include folder)
- `tests`: contains runtime tests
- `simple`: contains test for GPGPU functionality allowed in vortex
- `simx`: contains simX, the cycle approximate simulator for vortex
- `miscs`: contains old code that is no longer used
- `hw`:
- `unit_tests`: contains unit tests for the cache and queue RTL
- `syn`: contains all synthesis scripts (quartus and yosys)
- `quartus`: contains code to synthesize the cache, core, pipeline, top, and vortex stand-alone
- `simulate`: contains RTL simulator (verilator)
- `testbench.cpp`: runs either the riscv, runtime, or opencl tests
- `opae`: contains source code for the accelerator functional unit (AFU) and code which programs the fpga
- `rtl`: contains rtl source code
- `cache`: contains cache subsystem code
- `fp_cores`: contains floating point unit code
- `interfaces`: contains code that handles communication for each of the units of the microarchitecture
- `libs`: contains general-purpose modules (i.e., buffers, encoders, arbiters, pipe registers)

Binary file not shown.


doc/Microarchitecture.md Normal file

@@ -0,0 +1,94 @@
# Vortex Microarchitecture
### Vortex GPGPU Execution Model
Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with a single warp issued per cycle.
- **Threads**
- Smallest unit of computation
- Each thread has its own register file (32 int + 32 fp registers)
- Threads execute in parallel
- **Warps**
- A logical cluster of threads
- Each thread in a warp executes the same instruction
- The PC is shared; a thread mask is maintained for writeback
- Warp execution is time-multiplexed, with one warp issuing per cycle
- Ex. warp 0 executes at cycle 0, warp 1 executes at cycle 1
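The time-multiplexing above can be sketched with a short script (illustrative only; the warp count here is a made-up value, not a real build flag):

```shell
#!/bin/bash
# Illustrative sketch of round-robin warp time-multiplexing:
# each cycle, the next warp in rotation issues one instruction.
NUM_WARPS=4   # hypothetical value for illustration
for cycle in 0 1 2 3 4 5; do
  warp=$((cycle % NUM_WARPS))
  echo "cycle ${cycle}: issue warp ${warp}"
done
```

With 4 warps, cycle 4 wraps back to warp 0, matching the round-robin behavior described above.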
### Vortex RISC-V ISA Extension
- **Thread Mask Control**
- Control the number of threads to activate during execution
- `TMC` *count*: activate count threads
- **Warp Scheduling**
- Control the number of warps to activate during execution
- `WSPAWN` *count, addr*: activate count warps and jump to addr location
- **Control-Flow Divergence**
- Control threads to activate when a branch diverges
- `SPLIT` *predicate*: apply the 'taken' predicate thread mask and save the 'not-taken' mask onto the IPDOM stack
- `JOIN`: restore 'not-taken' thread mask
- **Warp Synchronization**
- `BAR` *id, count*: stall warps entering barrier *id* until count is reached
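A minimal sketch of the IPDOM-stack mechanics behind `SPLIT`/`JOIN`, using a bash array as the stack (the bit-string masks are purely illustrative; this models the concept, not the RTL):

```shell
#!/bin/bash
# Conceptual model of SPLIT/JOIN: SPLIT pushes the 'not-taken' mask onto
# the IPDOM stack and continues with the 'taken' mask; JOIN pops and
# restores the saved mask. Masks are bit strings (1 = active thread).
ipdom_stack=()
current_mask="1111"

split() {   # $1 = 'taken' mask; its complement is saved for JOIN
  local taken=$1 not_taken=""
  for ((i = 0; i < ${#taken}; i++)); do
    [ "${taken:$i:1}" = "1" ] && not_taken+="0" || not_taken+="1"
  done
  ipdom_stack+=("$not_taken")   # push not-taken mask
  current_mask=$taken
}

join() {    # pop the stack and restore the saved mask
  local top=$(( ${#ipdom_stack[@]} - 1 ))
  current_mask=${ipdom_stack[$top]}
  unset "ipdom_stack[$top]"
}

split "1100"
echo "after split: $current_mask"   # taken path active: 1100
join
echo "after join:  $current_mask"   # not-taken path restored: 0011
```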
### Vortex Pipeline/Datapath
![Image of Vortex Microarchitecture](./Images/vortex_microarchitecture_v2.png)
Vortex has a 5-stage pipeline: FI | ID | Issue | EX | WB.
- **Fetch**
- Warp Scheduler
- Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
- Instruction Cache
- Retrieve instruction from cache, issue I-cache requests/responses
- **Decode**
- Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
- Branch, TMC, SPLIT/JOIN, WSPAWN
- Precompute used_regs mask (needed for Issue stage)
- **Issue**
- Scheduling
- In-order issue (operands/execute unit ready), out-of-order commit
- IBuffer
- Stores fetched instructions in separate per-warp queues; selects the next warp through round-robin scheduling
- Scoreboard
- Track in-use registers
- GPRs (General-Purpose Registers) stage
- Fetch issued instruction operands and send operands to execute unit
- **Execute**
- ALU Unit
- Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
- MULDIV Unit
- Multiplier - done in 2 cycles
- Divider - division and remainder, done in 32 cycles
- Implements a serial algorithm (stalls the pipeline)
- FPU Unit
- Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
- CSR Unit
- Stores control/status registers - device caps, FPU status flags, performance counters
- Handle external CSR requests (requests from host CPU)
- LSU Unit
- Handle load/store operations, issue D-cache requests, handle D-cache responses
- Commit load responses - saves storage, Scoreboard tracks completion
- GPGPU Unit
- Handle GPGPU instructions
- TMC, WSPAWN, SPLIT, BAR
- JOIN is handled by Warp Scheduler (upon SPLIT response)
- **Commit**
- Commit
- Update CSR flags, update performance counters
- Writeback
- Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
- **Clustering**
- Group multiple cores into clusters (optionally share L2 cache)
- Group multiple clusters (optionally share L3 cache)
- Configurable at build time
- Default configuration:
- #Clusters = 1
- #Cores = 4
- #Warps = 4
- #Threads = 4
- **FPGA AFU Interface**
- Manage CPU-GPU communication
- Query device caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
- Local Memory - GPU access to local DRAM
- Reserved I/O addresses - redirect to host CPU, console output
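As a quick sanity check, the default configuration listed above yields the following total hardware thread count:

```shell
#!/bin/bash
# Total hardware threads for the default build configuration:
# 1 cluster x 4 cores x 4 warps x 4 threads.
clusters=1; cores=4; warps=4; threads=4
total=$((clusters * cores * warps * threads))
echo "total hardware threads: ${total}"   # prints 64
```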

@@ -24,12 +24,27 @@ Running tests under specific drivers (rtlsim,simx,fpga) is done using the script
- *L3cache* - used to enable the shared l3cache among the Vortex clusters.
- *Driver* - used to specify which driver to run the Vortex simulation (either rtlsim, vlsim, fpga, or simx).
- *Debug* - used to enable debug mode for the Vortex simulation.
- *Scope* -
- *Perf* - used to enable the detailed performance counters within the Vortex simulation.
- *App* - used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder though not all of them work with the current version of Vortex.
- *Args* - used to pass additional arguments to the application.
Example use of command line arguments: Run the sgemm benchmark using the vlsim driver with a Vortex configuration of 1 cluster, 4 cores, 4 warps, and 4 threads.
```
$ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=4 --driver=vlsim --app=sgemm
```
Output from terminal:
```
Create context
Create program from kernel source
Upload source buffers
Execute the kernel
Elapsed time: 2463 ms
Download destination buffer
Verify result
PASSED!
PERF: core0: instrs=90802, cycles=52776, IPC=1.720517
PERF: core1: instrs=90693, cycles=53108, IPC=1.707709
PERF: core2: instrs=90849, cycles=53107, IPC=1.710678
PERF: core3: instrs=90836, cycles=50347, IPC=1.804199
PERF: instrs=363180, cycles=53108, IPC=6.838518
```

doc/Vortex.md Normal file

@@ -0,0 +1,31 @@
# Vortex Documentation
### Table of Contents
- [Vortex Codebase Layout](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Codebase.md)
- [Vortex Microarchitecture and Extended RISC-V ISA](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Microarchitecture.md)
- Vortex Software
- [Vortex Simulation](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Simulation.md)
- [FPGA Configuration, Program and Test](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Flubber_FPGA_Startup_Guide.md)
- Debugging
- Useful Links
### Quick Start
Setup Vortex environment:
```
$ export RISCV_TOOLCHAIN_PATH=/opt/riscv-gnu-toolchain
$ export PATH=:/opt/verilator/bin:$PATH
$ export VERILATOR_ROOT=/opt/verilator
```
Test Vortex with different drivers and configurations:
- Run basic driver test with the rtlsim driver and a Vortex config of 2 clusters, 2 cores, 2 warps, 4 threads
$ ./ci/blackbox.sh --clusters=2 --cores=2 --warps=2 --threads=4 --driver=rtlsim --app=basic
- Run demo driver test with the vlsim driver and a Vortex config of 1 cluster, 4 cores, 4 warps, 2 threads
$ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=2 --driver=vlsim --app=demo
- Run dogfood driver test with the simx driver and a Vortex config of 4 clusters, 4 cores, 8 warps, 6 threads
$ ./ci/blackbox.sh --clusters=4 --cores=4 --warps=8 --threads=6 --driver=simx --app=dogfood
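Sweeping several configurations can be scripted; the sketch below only prints the commands it would run (a dry run built from the three examples above):

```shell
#!/bin/bash
# Dry run: print the blackbox.sh invocation for each example config
# (clusters cores warps threads driver app) without executing anything.
for cfg in "2 2 2 4 rtlsim basic" "1 4 4 2 vlsim demo" "4 4 8 6 simx dogfood"; do
  set -- $cfg   # word-split the config string into positional parameters
  echo "./ci/blackbox.sh --clusters=$1 --cores=$2 --warps=$3 --threads=$4 --driver=$5 --app=$6"
done
```

Replacing `echo` with an actual invocation would run the sweep.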

@@ -5,19 +5,16 @@ Description: Makes the build in the opae directory with the specified core
exists, a make clean command is run before the build. Script waits
until the inteldev script or quartus program is finished running.
Usage: ./build.sh -c [1|2|4|8|16] [-p [y|n]]
Options:
-c
Core count (1, 2, 4, 8, or 16).
-p
Performance profiling enable (y or n). Changes the source file in the
opae directory to include/exclude "+define+PERF_ENABLE".
_______________________________________________________________________________
@@ -27,6 +24,7 @@ Description: Runs build.sh with performance profiling enabled for all valid
core configurations.
_______________________________________________________________________________
_______________________________________________________________________________
-program_fpga.sh-
@@ -41,6 +39,7 @@ Options:
Core count (1, 2, 4, 8, or 16).
_______________________________________________________________________________
_______________________________________________________________________________
-gather_perf_results.sh-
@@ -65,3 +64,53 @@ _______________________________________________________________________________
Description: Programs fpga and runs gather_perf_results.sh for all valid core
configurations. All builds should already be made before running
this.
_______________________________________________________________________________
_______________________________________________________________________________
-export_csv.sh-
Description: Creates specified .csv output file from an input directory, file,
and parameter. The .csv file contains two columns: cores and the input
parameter. The output file is located within the directory specified with -d.
Usage: ./export_csv.sh -c [cores] -d [directory] -i [input filename] -o
[output filename] -p '[parameter]'
Example: ./export_csv.sh -c 16 -d perf_2021_03_07 -i sgemm.result -o output.csv
-p 'PERF: scoreboard stalls'
Options:
-c
Upper limit of cores to be read in. Core directories should exist in
the directory specified by -d e.g. 1c, 2c, 4c for -c 4.
-d
The directory of the form perf_{date} located in the evaluation
directory.
-i
The input filename located in each core directory within the
directory specified by -d.
-o
The output filename to be created within the directory specified
by -d.
-p
The parameter corresponding to the core count in the .csv file. The
full name of the parameter from the start of the line should be
inputted to avoid the parameter name being matched multiple times.
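The parameter matching described above reduces to a single `sed` substitution; here is a self-contained illustration (the result file and stall value are made up for the demo):

```shell
#!/bin/bash
# Reproduce the extraction export_csv.sh performs, on a made-up result file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
PERF: scoreboard stalls=1234
PERF: instrs=363180, cycles=53108, IPC=6.838518
EOF
# Keep only what follows "PERF: scoreboard stalls=" on matching lines.
sed -n 's/PERF: scoreboard stalls=\(.*\)/\1/p' "$tmp"   # prints 1234
rm -f "$tmp"
```

Passing the parameter from the start of the line ("PERF: scoreboard stalls") avoids matching other lines that merely contain the word "stalls".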
_______________________________________________________________________________
-export_ipc_csv.sh-
Description: Runs export_csv.sh for the parameter IPC.
Usage: ./export_ipc_csv.sh -c [cores] -d [directory] -i [input filename] -o
[output filename]
Example: ./export_ipc_csv.sh -c 16 -d perf_2021_03_07 -i sgemm.result -o output.csv

@@ -0,0 +1,33 @@
#!/bin/bash
while getopts c:d:i:o:p: flag
do
case "${flag}" in
c) cores=${OPTARG};; #1, 2, 4, 8, 16
d) dir=${OPTARG};; #directory name (e.g. perf_2021_03_07)
i) ifile=${OPTARG};; #input filename
o) ofile=${OPTARG};; #output filename
p) param=${OPTARG};; #parameter to be made into csv
esac
done
if [[ ! "$cores" =~ ^(1|2|4|8|16)$ ]]; then
echo 'Invalid parameter for argument -c (1, 2, 4, 8, or 16 expected)'
exit 1
fi
if [ -z "$ifile" ]; then
echo 'No input filename given for argument -i'
exit 1
fi
if [ -z "$dir" ]; then
echo 'No directory given for argument -d'
exit 1
fi
printf "cores,${param}\n" > "../${dir}/${ofile}"
for ((i=1; i<=$cores; i=i*2)); do
printf "${i}," >> "../${dir}/${ofile}"
(sed -n "s/${param}=\(.*\)/\1/p" < "../${dir}/${i}c/${ifile}") >> "../${dir}/${ofile}"
done

@@ -0,0 +1,32 @@
#!/bin/bash
while getopts c:d:i:o: flag
do
case "${flag}" in
c) cores=${OPTARG};; #1, 2, 4, 8, 16
d) dir=${OPTARG};; #directory name (e.g. perf_2021_03_07)
i) ifile=${OPTARG};; #input filename
o) ofile=${OPTARG};; #output filename
esac
done
if [[ ! "$cores" =~ ^(1|2|4|8|16)$ ]]; then
echo 'Invalid parameter for argument -c (1, 2, 4, 8, or 16 expected)'
exit 1
fi
if [ -z "$ifile" ]; then
echo 'No input filename given for argument -i'
exit 1
fi
if [ -z "$dir" ]; then
echo 'No directory given for argument -d'
exit 1
fi
printf "cores,IPC\n" > "../${dir}/${ofile}"
for ((i=1; i<=$cores; i=i*2)); do
printf "${i}," >> "../${dir}/${ofile}"
(sed -n "s/IPC=\(.*\)/\1/p" < "../${dir}/${i}c/${ifile}" | awk 'END {print $NF}') >> "../${dir}/${ofile}"
done
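Since a result file contains one IPC line per core plus a total, the `awk 'END {print $NF}'` stage above keeps only the last match. A self-contained illustration (values copied from the sample run output earlier in this commit):

```shell
#!/bin/bash
# Show that the sed|awk pipeline keeps only the final IPC value in a file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
PERF: core0: instrs=90802, cycles=52776, IPC=1.720517
PERF: instrs=363180, cycles=53108, IPC=6.838518
EOF
# sed rewrites each matching line to end with the bare IPC value;
# awk's END block prints the last field of the last line only.
sed -n 's/IPC=\(.*\)/\1/p' "$tmp" | awk 'END {print $NF}'   # prints 6.838518
rm -f "$tmp"
```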