Remove the SpeculativeRequest parameter, and replace it with a
policy. If the cache is disabled, make the request on the basis that we
definitely won't hit in the cache and so should make the bus request
asap.
Stop fill buffers from fetching complete cache lines when the cache is
disabled. When the core branches into the middle of a line, the lower
words would normally be fetched to complete the line. This is
unnecessary when the cache is disabled since those words will just be
thrown away and the core will stall while they are being fetched.
These two changes make the performance using the disabled icache the
same as using non-icache prefetch buffer.
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
This commit amends some paths in the vendoring hjson file (and updates
config files to use things at the new paths). Finally it re-runs the
vendoring tool:
Update code from upstream repository
https://github.com/lowRISC/opentitan to revision
92e9242424c72c59008e267dd3779e2af5ec8e83
which just ends up with a load of file renames.
Signed-off-by: Rupert Swarbrick <rswarbrick@lowrisc.org>
A few tweaks to fix broken links, make explanations more clear and
update the information to reflect the present (e.g. Spike master now
implements bit-manip, the seperate branch is gone).
Add parameter `DbgHwBreakNum` to configure the number of HW breakpoints.
The parameters controls the number of trigger registers available if
debug support is enabled with `DbgTriggerEn`.
Closes#1070
- Document that SecureIbex cannot be used without a multiplier and add
an assertion in the rtl. This fixes#1080.
- Move the PC checking hardware onto its own parameter to match all the
other individual security features.
- Make the PC increment behavior more sensible on fetch errors (and make
it match the icache behavior). Factor this into the PC increment check
to prevent false triggering, fixes#1094.
- Stop the PC mismatch checker firing on dummy instructions, fixes
#1095.
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
Option to control the state of the Ibex tracer.
Defaults to enable, but allows a setting to disable the trace log.
In long running simulations this file can get quite big and the contents
might not be needed. Use the argument `+ibex_tracer_enable=0` to prevent
the creation of the file.
Restructure the existing documentation to group the content by intended
audience. This produces four sections:
* An introduction section, relevant to "newcomers" to Ibex.
* An user guide, intended for hardware designers (integrators) and
software developers who want to integrate Ibex, and develop software
for it.
* A reference guide, which provides background information on the
design. This section is essential when working on Ibex, and also
documents our design decisions.
* A developer guide aimed at people modifying Ibex itself. It consists
mostly of process and tool documentation: how to run the verification
after a code change, how to use GitHub, etc.
This commit is large, but text is mostly unchanged. A couple of
introductions and tables of content were added, but no significant
changes to the text have been made. These will be done in follow-ups.
Signed-off-by: Philipp Wagner <phw@lowrisc.org>
This commit replaces the previous combination of `RV32M` bit parameter
used to en/disable the M extension and the `MultiplierImplementation`
used to select the multiplier implementation by a single enum parameter.
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
- Add a technology map for latches (only works with nandgate45 library
at the moment)
- Add a real latch-based clock gating cell
- Update timing path reporting to differentiate between register and
latch paths
- Update summary results in README to reflect the latch-based numbers,
plus add numbers for a micro-riscy-style (RV32EC) config
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
This removes the manually copied version at dv/uvm/core_ibex/common
and vendors things properly now that the vendor tool supports such
things (this picks up the same OpenTitan version as the previous
commit: lowRISC/opentitan@067272a2).
- Add SECDED ECC checking to the register file when SecureIbex is
enabled
- No correction is attempted, but an alert is raised for the system to
intervene
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
- Add a major and minor alert output which can be used by the system to
react to fault injection attacks
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
Support for this extension is not experimental (it's fully verified using
RISCV-DV) but the extension might change before being ratified.
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
By now, ibex depends not only on the a clock gating cell, but also on
assert macros and a LFSR; there will be more to come. This removed piece
of documentation was from the early days, when only a clock gating cell
was to be provided (and we didn't ship one).
Today, users need to either use FuseSoC, or run fusesoc on the simple
system and "harvest" the resulting files if they want to copy-paste Ibex
into their build system. That's not ideal, but not something we can very
easily fix -- so let's remove the outdated documentation first to at
least reduce the confusion.
This commit contains some final optimizations regarding the bit
manipulation extension as well as the parametrization into a balanced
version and a full performance version.
Balanced Version:
* Supports ZBB, ZBS, ZBF and ZBT extensions
* Dual cycle instructions:
ror[i], rol, cmov, cmix fsl, fsr[i]
* Everything else completes in a single cycle.
Full Version:
* Supports all 32b sub extensions.
* Dual cycle instructions:
ror[i], rol, cmov, cmix fsl, fsr[i], crc32[c], bext, bdep
* Everything else completes in a single cycle.
Notable Changes:
* bext/bdep are now multi-cycle: Sharing additional register
with multiplier module
* grev/gorc instructions are implemented in separate structures
rather than sharing the shifter or butterfly network.
* Speed up decision on using rs1 or rs3 for alu_operand_a by
introducing single-bit register, to identify ternary
instructions in their first cycle.
* Introduce enumerated parameter to chose bit manipulation
implementation
Signed-off-by: ganoam <gnoam@live.com>
One aspect of (i)cache design that I didn't know about before writing
test code for this block is the problem of multi-way hits. The icache,
as implemented, stores data to parallel ways and it's possible for a
fetch to match more than one way. The data from matching ways all gets
ORed together, which doesn't matter so long as it never
changes (because V | V == V for all V).
Of course, things go poorly if you have two different values, V and W,
at an address which are both stored in the cache. Then the result is V
| W, which isn't necessarily equal to either instruction.
Avoiding this needs priority encoders, which are rather large, so it
seems the usual approach is to disallow branching to modified code
before flushing the cache. This patch teaches the testbench to do this
properly.
Sadly, this means there's now a connection between the core agent and
the memory agent: the memory agent can no longer generate new seeds
whenever it pleases.
- Drive a speculative version of the branch signal into the IF stage to
drive address muxing
- The speculative signal is the same as the regular branch signal but
assumes all conditional branches are taken
- This breaks the timing path from branch condition calculation into
address muxing (and therefore PMP error calculation)
- When the branch is not taken, any external request we might otherwise
have made is suppressed
- This has a minor performance cost (0.8% without I$, ~0% with I$)
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
This commit implements the Bit Manipulation Extension ZBR instruction
group: crc32[c].[bhw].
CRC-32 (CRC-32/ISO-HDLC) and CRC-32C (CRC-32/ISCSI) are directly
implemented. The CRC operation solves the following equation using
binary polynomial arithmetic:
rev(rd)(x) = rev(rs1)(x) * x**n mod {1, P}(x),
where {1,P}(x) denotes the crc polynomial. Using barret reduction one
can write this as
rd = (rs1 >> n) ^ rev(rev( (rs1 << (32-1)) cx rev(mu)) cx P)
^-- cycle 0--------------------^
^-- cycle 1 ------------------------------------------^
Where cx denotes carry-less multiplication and mu = polydiv(x**64,
{1,P}), omitting the MSB (bit 32).
The implementation increases area consumption by ~0.6kGE for synthesis
with relaxed timing constraints. With tight timing constraints that is
~1.6kGE. There is no significant impact on frequency.
Signed-off-by: ganoam <gnoam@live.com>
- Adds a new module in the IF stage to inject dummy instructions into
the pipeline
- Control / frequency of insertion is governed by configuration CSRs
- Extra CSR added to allow reseed of the internal LFSR useed for
randomizing insertion
- Extra logic added to the register file to make dummy instruction
writebacks look like real intructions (via the zero register)
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
This commit implements the Bit Manipulation Extension ZBC instruction
group: clmul[rh] (carry-less multiply [reverse][high])
Carry-less multiplication can be understood as multiplication based on
the addition interpreted as the bit-wise xor operation.
Example: 1101 X 1011 = 1111111:
1011 X 1101
-----------
1101
xor 1101
---------
10111
xor 0000
----------
010111
xor 1101
-----------
1111111
Architectural details:
A 32 x 32-bit array
[ operand_b[i] ? (operand_a << i) : '0 for i in 0 ... 31 ]
is generated. The entries of the array are pairwise 'xor-ed'
together in a 5-stage binary tree.
The area increase when synthesized with relaxed timing constraints is
1.6-1.7kGE.
Timing figures are improve by 0.1 ns for the 3-stage configuration and
worsen by 0.04ns for the 2-stage implementation. This suggests
fluctuations due to the heuristic nature of the synthesis tools.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBF instruction
group, which consists only of the one instruction bfp (bit-field
place).
This instruction places a field of length len < 16 from rs2 in rs1 at
offset off.
Architectureal details:
The implementation works exactly the same as proposed by Claire
Wolf in her reference implementation.
1. bfp_mask = slo(o, len)
2. bfp_result =
(rs1 & ~(bfp_mask << off)) | (rs2 & bfp_mask) << off
^------ shifter-^
The existing shifter structure is shared for the indicated
operation.
Impact on area:
* When synthesizing without the B-extension, the 2 stage
design seems to move the timing bottleneck, leading to
optimizations which result in an area increase by 1 kGE,
when synthesized with tight timing constraints. For the
3 stage configuration there is no change.
When synthesized with relaxed timing constraints there is no
significant change in either configuration.
* With the B-extension enabled, the area increase for tight
timing constraints is 1.1-1.2 kGE. For relaxed timing
constraints that is ~0.4kGE
Impact on timing: No significant impact.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBE instruction
group: bext (bit extract) and bdep (bit deposit).
Architectural details:
* bext/bdep: A new butterfly and inverse butterfly network is
implemented. The generation of its controlbits depend on a
parallel prefix bitcount of the deposit / extract mask.
* bitcounter: The path for bext / bdep instructions traverses
the bit counter and the butterfly network, resulting in both a
larger delay and area. To mitigate the bitcounter has been
changed from a serial bit counter to a radix-2 tree structure.
* grev/gorc: Zbp instructions general reverse and general
or-combine have as of yet shared the shifters reversal
structure. It has proven benefitial to area and timing to reuse
the novel butterfly network instead
The butterfly network itself consumes ~3.5kGE and ~1.1kGE for synthesis
with tight and relaxed timing constraints respectively. Including the
optimizations of the bitcounter and grev/gorc, the overall change in
area consumption is +4.6kGE (+1.2kGE) and +3.3kGE (+1.1kGE) for
synthesis with tight (relaxed) timing constraints for 2- and 3-stage
configurations respectively. For tight timing constraints that is a
growth by around ~10%, for relaxed ~5%.
The impact on the maximum frequency is negligable.
Signed-off-by: ganoam <gnoam@live.com>
Once the cache has passed an error to the core, we now allow it to
wiggle its valid, addr, rdata, err and err_plus2 lines however it sees
fit until the core issues a new branch.
Since the core isn't allowed to assert ready until then, the values
will not be read and this won't matter.
This was exposed by
make -C dv/uvm/icache/dv run SEED=1314810947 WAVES=1
This commit implements the Bit Manipulation Extension ZBP instruction
group: grev[i] (generalized reverse), gorc[i] (generalized or-combine)
and [un]shfl[i] (generalized shuffle) and all of their
pseudo-instructions.
Architectural details:
* grev / gorc: The shifter structure features only a right
shift structure. In order to perform a left shift therefore the
operand needs to be reversed, shifted and reversed again. The
architecture of the back-reversal is implemented in stages
which are activated using the general reverse / orcombine
operand, or a signal marking left-shifts.
* shfl / unshfl: Also known as zip / unzip or interlace /
uninterlace operation. These instructions are implemented
in their own structure using a permutation networ of 6 stages.
4 stages thereof implement the shuffle permutations. the first
and last stage is the flip stage, which effectively reverse s
the order of the inner stages, for unshuffle operations.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension SBS instruction
group: sbset[i], sbclr[i], sbinv[i] and sbext[i]. These instructions
set, clear, invert or extract bit rs1[rs2] or rs1[imm] for reg-reg and
reg-imm instructions respectively.
Archtectural details:
* A multiplexer is added to the shifter structure in order to
chose between 32'h1, used for the single-bit instructions as
summarized below, and regular operand_b input.
* Dedicated bitwise-logic blocks are introduced for multicycle
shifts and cmix instructions (fsr, fsl, ror, rol),
single-bit instructions (sbset, sbclr, sbinv, sbext), and
stanard-ALU and zbb instructions (or, and xor, orn, andn,
xnor).
Instruction details: All of the zbs instructions rely on sharing the
existing shifter structure. The instructions are carried out in
one cycle.
* sbset, sbclr, sbinv:
shift_result = 32'h1 << rs2[4:0];
singlebit_result = rs1 [|, ^ , &~] shift_result;
* sbext:
shift_result = rs1 >> rs2[4:0];
singlebit_result = {31'0,shift_resutl[0]};
Signed-off-by: ganoam <gnoam@live.com>
Ibex only provides the necessary interfaces and core-internal
functionality for run-control debug. To get a fully working,
"debuggable" toplevel design, more components are needed. Describe where
to get them from, and include OpenTitan as an exemplary integration.
This commits implements the Bit Manipulateion Extension ZBT instruction
group: cmix, cmov, fsr[i] and fsl. Those are instructions depend on
three ALU operands. Completeion of these instructions takes 2 clock
cycles. Additionally, the rotation shifts rol and ror are made
multicycle instructions.
All multicycle instructions take exactly two cycles to complete.
Architectural additions:
* Multicycle Stage Register in ID stage.
multicycle_op_stage_reg
* Decoder generates alu_multicycle signal, to stall pipeline
* For all ternary instructions:
1. cycle: connect alu operands a and b to rs1 and rs2
respectively
2. cycle: connect operands a and be to rs3 and rs2
respectively
* Reduce the physical size of the shifter from 64 bit to 63
bit: 32-bit operand + 1 bit for arithmetic / one-shift
* Make rotation shifts multicycle instructions.
Instruction Details:
* cmov:
1. store operand a (rs1) in stage reg.
2. return stage reg output (rs2) or rs3.
if rs2 != 0 the output (rs1) is already known in the
first cycle. -> variable latency implementation is
possible.
* cmix:
1. store rs1 & rs2 in stage reg
2. return stage_reg_q | (rs2 & ~rs3)
reusing bwlogic from zbb
* rol/ror: (here: ror)
shift_amt = rs2 & 31;
shift_amt_compl = (32 - shift_amt) & 31
1. store (rs1 >> shift_amt) in stage reg
2. return (rs1 << shift_amt_compl) | stage_reg_q
* fsl/fsr:
For funnel shifts, the order of applying the shift
amount or its complement is determined by bit [5] of
shift_amt. Pseudocode for fsr:
shift_amt = rs2 & 63
shift_amt_compl = (32 - shift_amt[4:0])
1. if (shift_amt >= 33):
store (rs1 >> shift_amt_compl[4:0]) in stage reg
else if (shift_amt <0 && shift_amt <= 31):
store (rs1 << shift_amt[4:0]) in stage reg
else if (shift_amt == 32 || shift_amt == 0):
store rs1 in stage reg
2. if (shift_amt >= 33):
return stage_reg_q | (rs3 << shift_amt[4:0])
else if (shift_amt <0 && shift_amt <= 31):
return stage_reg_q | (rs3 >> shift_amt_compl[4:0])
else if (shift_amt == 32):
return rs3
else if (shift_amt == 0):
return rs1
Signed-off-by: ganoam <gnoam@live.com>
The new information is:
- Branch addresses must be 16-bit aligned.
- Explicitly allow top 16 bits of rdata to change when lower 16 bits
contain a compressed instruction.
- Explicitly allow the core to drop ready without valid.
I've also rejigged the layout slightly, improving (I think!) the
description of compressed and uncompressed instructions.
- A new parameter and a run-time control bit (DataIndTiming and
data_ind_timing) enabling different behaviour for running security critical
code sections.
- In the new mode, all branches act as if taken, with not-taken
branches executing as a branch to the next instruction.
- This should give similar execution time/power characteristics
regardless of the branch condition.
- Note that with the BranchTargetALU, branches stall an extra cycle in
secure mode to avoid factoring the branch-taken decision into the
branch target address mux.
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
Add a few sentences to describe the behaviour/meaning of the err_plus2_o
signal, and how it is used by the core.
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>