Introduction
This PR adds support for Zbkb extension in the CVA6 core. It also adds the documentation for this extension. These changes have been tested with self-written single instruction tests and with the riscv-arch-tests. This PR will be followed by other PRs that will add complete support for the Zkn - NIST Algorithm Suite extension.
Implementation
Zbkb Extension:
Added support for the Zbkb instruction set. It essentially expands the Zbb extension with additional instructions useful in cryptography. These instructions are pack, packh, packw, brev8, unzip and zip.
Modifications
1. A new bit ZKN was added. The complete Zkn extension will be added under this bit for ease of use. This configuration will also require the RVB (bitmanip) bit to be set.
2. Updated the ALU and decoder to recognize and handle Zbkb instructions.
Documentation and Reference
The official RISC-V Cryptography Extensions Volume I was followed to ensure alignment with ratification. The relevant documentation for the Zbkb instruction was also added.
Verification
Assembly Tests:
The instructions were tested and verified with the K module of both 32 bit and 64 bit versions of the riscv-arch-tests to ensure proper functionality. These tests check for ISA compliance, edge cases and use assertions to ensure expected behavior. The tests include:
pack-01.S
packh-01.S
packw-01.S
brev8-01.S
unzip-01.S
zip-01.S
As mentioned in the spec, we need to perform a canonical check on the virtual address for instruction fetch, load, and store. If the check fails, it will cause the page-fault exception.
This PR fixes the above two:
- Changes INSTR_ACCESS_FAULT to INSTR_PAGE_FAULT
- Adding virtual address check on data accesses as well
in wt_axi_adapter, axi_rd_blen and axi_wr_blen are defined like this:
logic [$clog2(AxiNumWords)-1:0] axi_rd_blen, axi_wr_blen;
However, if AxiNumWords=1, this gives a synthesis error. This happens if the cache line is set to 64 bits (same as AXI width).
It can be fixed by changing to:
logic [AxiNumWords > 1 ? $clog2(AxiNumWords) : AxiNumWords-1:0] axi_rd_blen, axi_wr_blen;
cva6/core/cache_subsystem/wt_dcache_missunit.sv
Line 202 in b718824
.OutWidth ($clog2(CVA6Cfg.DCACHE_SET_ASSOC))
Better to use the width parameter which already contemplates the case of 0 to avoid issues if associativity is set to 1
cva6/core/include/build_config_pkg.sv
Line 134 in b718824
cfg.DCACHE_SET_ASSOC_WIDTH = CVA6Cfg.DcacheSetAssoc > 1 ? $clog2(CVA6Cfg.DcacheSetAssoc) : CVA6Cfg.DcacheSetAssoc;
The third optimization for Altera FPGA is to move the register file to LUTRAM. Same as before, the reason why the optimization previously done for Xilinx is not working, is that in that case asynchronous RAM primitives are used, and Altera does not support asynchronous RAM. Therefore, this optimization consists in using synchronous RAM for the register file.
The main changes to the existing code are:
Changes in ariane_regfile_fpga.sv file: The idea is the same as before, since synchronous RAM takes one clock cycle to read, we need to store the data when it is written, in case it is read right after. For this there is an auxiliary register that stores the last written data. On the read side, we need to identify if the data to be read is available in the RAM or if it is still in the auxiliary register (read after write). To compensate for the synchronous RAM delay the address is advanced one clock cycle. In this case there is a multiplexer in the output to select the block from where data is read, here we need to keep the read address for one clock cycle to select the right block when data is available.
Changes in issue_read_operands.sv file: adjust address to read from register file (when synchronous RAM is used reads take one cycle, so we advance the address). Since this address is an input, we need a new input port that brings the address in advance “issue_instr_i_prev”.
Changes in issue_stage.sv file: To connect the new input port that brings the address in advance “decoded_instr_i_prev”.
Changes in id_stage.sv file: To output the instruction to be issued before registering it (one clock cycle in advance). A new output port is needed for this “issue_entry_o_prev”
Changes in cva6.sv file: To connect the new output of the id_stage to the issue_stage to bring the address in advance to the register file (issue_entry_id_issue_prev)
The second optimization for Altera FPGA is to move the BHT to LUTRAM. Same as before, the reason why the optimization previously done for Xilinx is not working, is that in that case asynchronous RAM primitives are used, and Altera does not support asynchronous RAM. Therefore, this optimization consists in using synchronous RAM for the BHT.
The main changes to the existing code are:
New RAM module to infer synchronous RAM in altera with 2 independent read ports and one write port (SyncThreePortRam.sv)
Changes in the frontend.sv file: modify input to vpc_i port of BHT, by advancing the address to read, in order to compensate for the delay of synchronous RAM.
Changes in the bht.sv file: This case is more complex because of the logic operations that need to be performed inside the BHT. First, the pc pointed by bht_update_i is read from the memory, modified according to the saturation counter and valid bit, and finally written again in the memory. The prediction output is given based on the vpc_i. With asynchronous memory, the new data written via update_i is available one clock cycle after writing it. So, if vpc_i tries to read the address that was previously written by update_i, everything is fine. However, in the case of synchronous memory there are three clock cycles of latency (one for reading the pc content (read port 1), another one for writing it, and another one for reading in the other port (read port 0)). For this reason, there is the need to adapt the design to these new latency constraints:
First, there is the need for a delay on the address write of the synchronous RAM, to wait for the previous pc read and store the right modified data.
Once this is solved, similarly to the FIFO case, there is the need for an auxiliary buffer that will store the data written in the FIFO, allowing to have it available 2 clock cycles after the update_i was valid. This is because after having the correct data, the RAM takes 2 clock cycles until data can be available in the output (one clock cycle for writing and one for reading).
Finally, there is a multiplexer in the output that permits to deliver the correct prediction providing the data from the update logic (1 cycle of delay), the auxiliary register (2 cycles of delay), or the RAM (3 or more cycles of delay), depending on the delay since the update_i was valid (i.e. written to the memory).
The first optimization for Altera FPGA is to move the instruction queue to LUTRAM. The reason why the optimization previously done for Xilinx is not working, is that in that case asynchronous RAM primitives are used, and Altera does not support asynchronous RAM. Therefore, this optimization consists in using synchronous RAM for the instruction queue and FIFOs inside wt axi adapter.
The main changes to the existing code are:
New RAM module to infer synchronous RAM in altera with independent read and write ports (SyncDpRam_ind_r_w.sv)
Changes inside cva6_fifo_v3 to adapt to the use of synchronous RAM instead of asynchronous:
When the FIFO is not empty, next data is always read and available at the output hiding the reading latency introduced by synchronous RAM (similar to fall-through approach). This is a simplification that is possible because in a FIFO we always know what is the next address to be read.
When data is read right after write, we can’t use the previous method because there is a latency to first write the data in the FIFO, and then to read it. For this reason, in the new design there is an auxiliary register used to hide this latency. This is used only if the FIFO is empty, so we detect when the word written is first word, and keep it in this register. If the next cycle comes a read, the data out is taken from the aux register. Afterwards the data is already available in the RAM and can be read continuously as in the first case.
All this is only used inf FpgaAlteraEn parameter is enabled, otherwise the previous implementation with asynchronous RAM applies (when FpgaEn is set), or the register based implementation (when FpgaEn is not set).
* Fill docs/design/design-manual/source/cva6_issue_stage.adoc
* Add variables to docs/design/design-manual/source/design.adoc
* Update port doc comments in core/issue_stage.sv, core/issue_read_operands.sv and core/scoreboard.sv
Expands all glob port maps in the core/ directory of this repository except the core/cache_subsystem/ directory, despite the glob port maps in core/cache_subsystem/miss_handler.sv and core/cache_subsystem/std_nbdcache.sv.
Also reorders port maps to keep the same order as port declarations.
For `XLEN = 64`, some tools (e.g. VCS) still elaborate the offset generation block for `XLEN = 32`, throwing an elaboration error (illegal bit access). Fix this by generating the AXI offset in an equivalent, parameter-agnostic and tool-friendly way.
The controller flushes the pipeline and all the unissued instructions in the presence of instructions with side effects (e.g., fence).
The accelerator dispatcher buffer (now used with the Ara RVV Vector processor) is flushed when this happens and avoids accepting a new instruction in that cycle, but it does not prevent the actual issuing of instructions during a flush cycle.
This fix avoids the issue during a flush cycle.
- FIX: Replace riscv_pkg:VLEN by CVA6Cfg.VLEN
- Declare VLEN as new CVA6 parameter
- smoke-hwconfig: run with vcs-uvm and use return0 tests to speed-up CI light stage timing execution
- Use dedicated linker scripts for 65x configuration.
- Use dedicated spike.yaml for 65x configuration.
- Set BHTEntries=128, cache=WT, scoreboard entries=8 to improve Coremark and Dhrystone results
- Run 4 iterations of coremark to improve results
Fix following requirement:
The assertion included in the always_comb block apparently violates the requirements in [section 9.2.2.2.2 of the SystemVerilog standard](https://ieeexplore.ieee.org/document/10458102):
Statements in an always_comb shall not include those that block, have blocking timing or event
controls, or fork-join statements.
`exception_o.valid` is always 0 here because the context requires:
- `commit_ack_o[0]`
- `commit_instr_i[0].fu != CSR`
Proof.
***********************************************************
We have `commit_ack_o[0]` and `commit_instr_i[0].fu != CSR`
We need to prove `!exception_o.valid`.
***********************************************************
`exception_o.valid` is built such that:
`exception_o.valid => !commit_drop_i[0] && (csr_exception_i.valid || commit_instr_i[0].ex.valid)`
By contraposition:
`!(!commit_drop_i[0] && (csr_exception_i.valid || commit_instr_i[0].ex.valid)) => !exception_o.valid`
By De Morgan:
`commit_drop_i[0] || !(csr_exception_i.valid || commit_instr_i[0].ex.valid) => !exception_o.valid`
`commit_drop_i[0] || (!csr_exception_i.valid && !commit_instr_i[0].ex.valid) => !exception_o.valid`
`(commit_drop_i[0] || !csr_exception_i.valid) && (commit_drop_i[0] || !commit_instr_i[0].ex.valid) => !exception_o.valid`
Goal split.
***********************************************************
We have `commit_ack_o[0]` and `commit_instr_i[0].fu != CSR`
We need to prove both:
1.`commit_drop_i[0] || !csr_exception_i.valid`
2. `commit_drop_i[0] || !commit_instr_i[0].ex.valid`
***********************************************************
`csr_exception_i.valid` is built such that (see `core/csr_regfile.sv`):
`csr_exception_i.valid => commit_instr_i[0].fu == CSR`
By contraposition:
`commit_instr_i[0].fu != CSR => !csr_exception_i.valid`
Because `commit_instr_i[0].fu != CSR`:
`!csr_exception_i.valid`
By implication:
`commit_drop_i[0] || !csr_exception_i.valid`
Goal 1 reached.
`commit_ack_o[0]` is built such that:
`commit_ack_o[0] => commit_instr_i[0].ex.valid && commit_drop_i[0] || !commit_instr_i[0].ex_valid`
Which can be simplified (AB+!A = AB+!A(B+1) = AB+!AB+!A = (A+!A)B+!A = B+!A):
`commit_ack_o[0] => commit_drop_i[0] || !commit_instr_i[0].ex_valid`
Because `commit_ack_o[0]`:
`commit_drop_i[0] || !commit_instr_i[0].ex.valid`
Goal 2 reached.
Qed.
* Remove misaligned_ex computation: get it from outside
* Remove data and instr pmps, get match_execution from outside
* Get data and instr allow from outside
* Simplify fetch_instruction exception when instr not allow by pmp
* Simplify exception when data not allow by pmp, getting it from outside
* Apply verible format
* First public version of extracted pmp
* Integrate PMP fully outside MMU
* fix translation_valid and dtlb_ppn when no mmu
* Add pmp_data_if in needed file lists
* Fix exception tval when translation is enabled
* integrate no_locked assertions for pmp: now in blocking assignments to avoid raise condition in simulation
* Fix mixed assignment for no_locked_if
* Remove assertion no_locked from pmp: need clk and reset
* Apply verible format
---------
Co-authored-by: Olivier Betschi <olivier.betschi@fr.bosch.com>