RV32BOTEarlGrey selects the Zba, Zbb, Zbc, Zbs sub-extensions from
v.1.0.0 of the bitmanip spec and the Zbf, Zbp, Zbr, Zbt sub-extensions
from draft v.0.93. Zbe (bcompress/bdecompress) is supported by RV32BFull
only.
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
This invovles the following changes:
- Rename pcnt to cpop
- Switch encoding of max and minu
- Remove rev from Balanced version, only available in Full version via
grev (Zbp)
- Include sext.b/h (previously in Zb_tmp)
- Remove slo[i] and sro[i] from Balanced version, only available in Full
version (Zbp)
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
This change is related to the bitmanip draft version 0.94. It's needed
as in draft version 0.93 as well as in version 1.00 sbext from Zbs
changes to bext, leading to two completely different instructions having
the same name.
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
Add support for the Zba extension added in v0.93 of the bit manipulation
specification (unchanged in v1.0.0). The new instructions added are:
- sh1add: rd = (rs1 << 1) + rs2
- sh2add: rd = (rs1 << 2) + rs2
- sh3add: rd = (rs1 << 3) + rs2
The instructions are single cycle and have been implemented using the
adder in the ALU.
Signed-off-by: Michael Munday <mike.munday@lowrisc.org>
With data-independent timing enabled and BranchTargetALU configured,
branches will stall for a cycle causing an illegal value to be decoded
for the B Operand. No functional impact of this, but an assertion fires
so we might as well tie it off properly.
Fixes#1367
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
- Move various unused signal fixes from the waiver file to the rtl, so
that all tools can pick them up.
- Fix some oversize line issues.
- Fix some signed / unsigned casting issues.
- Remove some extraneous semicolons.
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
- Without the B extension, a gorci instruction should decode as an
illegal instruction.
- Relates to #1129
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
This commit replaces the previous combination of `RV32M` bit parameter
used to en/disable the M extension and the `MultiplierImplementation`
used to select the multiplier implementation by a single enum parameter.
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
Previously, these bits were not checked when decoding slli, srli and
srai, causing some illegal instruction encodings not to trigger an
illegal instructions exception.
This resolveslowRISC/Ibex#1018.
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
This commit contains some final optimizations regarding the bit
manipulation extension as well as the parametrization into a balanced
version and a full performance version.
Balanced Version:
* Supports ZBB, ZBS, ZBF and ZBT extensions
* Dual cycle instructions:
ror[i], rol, cmov, cmix fsl, fsr[i]
* Everything else completes in a single cycle.
Full Version:
* Supports all 32b sub extensions.
* Dual cycle instructions:
ror[i], rol, cmov, cmix fsl, fsr[i], crc32[c], bext, bdep
* Everything else completes in a single cycle.
Notable Changes:
* bext/bdep are now multi-cycle: Sharing additional register
with multiplier module
* grev/gorc instructions are implemented in separate structures
rather than sharing the shifter or butterfly network.
* Speed up decision on using rs1 or rs3 for alu_operand_a by
introducing single-bit register, to identify ternary
instructions in their first cycle.
* Introduce enumerated parameter to chose bit manipulation
implementation
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBR instruction
group: crc32[c].[bhw].
CRC-32 (CRC-32/ISO-HDLC) and CRC-32C (CRC-32/ISCSI) are directly
implemented. The CRC operation solves the following equation using
binary polynomial arithmetic:
rev(rd)(x) = rev(rs1)(x) * x**n mod {1, P}(x),
where {1,P}(x) denotes the crc polynomial. Using barret reduction one
can write this as
rd = (rs1 >> n) ^ rev(rev( (rs1 << (32-1)) cx rev(mu)) cx P)
^-- cycle 0--------------------^
^-- cycle 1 ------------------------------------------^
Where cx denotes carry-less multiplication and mu = polydiv(x**64,
{1,P}), omitting the MSB (bit 32).
The implementation increases area consumption by ~0.6kGE for synthesis
with relaxed timing constraints. With tight timing constraints that is
~1.6kGE. There is no significant impact on frequency.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBC instruction
group: clmul[rh] (carry-less multiply [reverse][high])
Carry-less multiplication can be understood as multiplication based on
the addition interpreted as the bit-wise xor operation.
Example: 1101 X 1011 = 1111111:
1011 X 1101
-----------
1101
xor 1101
---------
10111
xor 0000
----------
010111
xor 1101
-----------
1111111
Architectural details:
A 32 x 32-bit array
[ operand_b[i] ? (operand_a << i) : '0 for i in 0 ... 31 ]
is generated. The entries of the array are pairwise 'xor-ed'
together in a 5-stage binary tree.
The area increase when synthesized with relaxed timing constraints is
1.6-1.7kGE.
Timing figures are improve by 0.1 ns for the 3-stage configuration and
worsen by 0.04ns for the 2-stage implementation. This suggests
fluctuations due to the heuristic nature of the synthesis tools.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension sign-extend
instructions: sext.b (sign-extend byte) and sext.h (sign-extend half
word).
The implementation is basically a one-liner, duplicating the msb of the
byte / half-word into the msb of the output register.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBF instruction
group, which consists only of the one instruction bfp (bit-field
place).
This instruction places a field of length len < 16 from rs2 in rs1 at
offset off.
Architectureal details:
The implementation works exactly the same as proposed by Claire
Wolf in her reference implementation.
1. bfp_mask = slo(o, len)
2. bfp_result =
(rs1 & ~(bfp_mask << off)) | (rs2 & bfp_mask) << off
^------ shifter-^
The existing shifter structure is shared for the indicated
operation.
Impact on area:
* When synthesizing without the B-extension, the 2 stage
design seems to move the timing bottleneck, leading to
optimizations which result in an area increase by 1 kGE,
when synthesized with tight timing constraints. For the
3 stage configuration there is no change.
When synthesized with relaxed timing constraints there is no
significant change in either configuration.
* With the B-extension enabled, the area increase for tight
timing constraints is 1.1-1.2 kGE. For relaxed timing
constraints that is ~0.4kGE
Impact on timing: No significant impact.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBE instruction
group: bext (bit extract) and bdep (bit deposit).
Architectural details:
* bext/bdep: A new butterfly and inverse butterfly network is
implemented. The generation of its controlbits depend on a
parallel prefix bitcount of the deposit / extract mask.
* bitcounter: The path for bext / bdep instructions traverses
the bit counter and the butterfly network, resulting in both a
larger delay and area. To mitigate the bitcounter has been
changed from a serial bit counter to a radix-2 tree structure.
* grev/gorc: Zbp instructions general reverse and general
or-combine have as of yet shared the shifters reversal
structure. It has proven benefitial to area and timing to reuse
the novel butterfly network instead
The butterfly network itself consumes ~3.5kGE and ~1.1kGE for synthesis
with tight and relaxed timing constraints respectively. Including the
optimizations of the bitcounter and grev/gorc, the overall change in
area consumption is +4.6kGE (+1.2kGE) and +3.3kGE (+1.1kGE) for
synthesis with tight (relaxed) timing constraints for 2- and 3-stage
configurations respectively. For tight timing constraints that is a
growth by around ~10%, for relaxed ~5%.
The impact on the maximum frequency is negligable.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBP instruction
group: grev[i] (generalized reverse), gorc[i] (generalized or-combine)
and [un]shfl[i] (generalized shuffle) and all of their
pseudo-instructions.
Architectural details:
* grev / gorc: The shifter structure features only a right
shift structure. In order to perform a left shift therefore the
operand needs to be reversed, shifted and reversed again. The
architecture of the back-reversal is implemented in stages
which are activated using the general reverse / orcombine
operand, or a signal marking left-shifts.
* shfl / unshfl: Also known as zip / unzip or interlace /
uninterlace operation. These instructions are implemented
in their own structure using a permutation networ of 6 stages.
4 stages thereof implement the shuffle permutations. the first
and last stage is the flip stage, which effectively reverse s
the order of the inner stages, for unshuffle operations.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension SBS instruction
group: sbset[i], sbclr[i], sbinv[i] and sbext[i]. These instructions
set, clear, invert or extract bit rs1[rs2] or rs1[imm] for reg-reg and
reg-imm instructions respectively.
Archtectural details:
* A multiplexer is added to the shifter structure in order to
chose between 32'h1, used for the single-bit instructions as
summarized below, and regular operand_b input.
* Dedicated bitwise-logic blocks are introduced for multicycle
shifts and cmix instructions (fsr, fsl, ror, rol),
single-bit instructions (sbset, sbclr, sbinv, sbext), and
stanard-ALU and zbb instructions (or, and xor, orn, andn,
xnor).
Instruction details: All of the zbs instructions rely on sharing the
existing shifter structure. The instructions are carried out in
one cycle.
* sbset, sbclr, sbinv:
shift_result = 32'h1 << rs2[4:0];
singlebit_result = rs1 [|, ^ , &~] shift_result;
* sbext:
shift_result = rs1 >> rs2[4:0];
singlebit_result = {31'0,shift_resutl[0]};
Signed-off-by: ganoam <gnoam@live.com>
This commit fixes three possible cases for erroneous generation of
illegal instruction signals. Also, the bit-slices considered for
decoding ALU instructions are corrected to better reflect their
encoding specifications.
* Fix decoding of orc_b in illegal_insn generation.
* Insn[31] is no longer checked for generation of illegal instructions:
This bit is part of the rs3 register adress for ternary
bitmanipulation instructions (zbt).
* Correct bit-slicing for ALU reg-immediate instructions according
to specification: immediates are encoded in the range
insn[26:20] in all cases. Where a shift-amount is encoded, bits
[26:25] will have no effect, but will no longer generate
illegal instructions.
Signed-off-by: ganoam <gnoam@live.com>
This commits implements the Bit Manipulateion Extension ZBT instruction
group: cmix, cmov, fsr[i] and fsl. Those are instructions depend on
three ALU operands. Completeion of these instructions takes 2 clock
cycles. Additionally, the rotation shifts rol and ror are made
multicycle instructions.
All multicycle instructions take exactly two cycles to complete.
Architectural additions:
* Multicycle Stage Register in ID stage.
multicycle_op_stage_reg
* Decoder generates alu_multicycle signal, to stall pipeline
* For all ternary instructions:
1. cycle: connect alu operands a and b to rs1 and rs2
respectively
2. cycle: connect operands a and be to rs3 and rs2
respectively
* Reduce the physical size of the shifter from 64 bit to 63
bit: 32-bit operand + 1 bit for arithmetic / one-shift
* Make rotation shifts multicycle instructions.
Instruction Details:
* cmov:
1. store operand a (rs1) in stage reg.
2. return stage reg output (rs2) or rs3.
if rs2 != 0 the output (rs1) is already known in the
first cycle. -> variable latency implementation is
possible.
* cmix:
1. store rs1 & rs2 in stage reg
2. return stage_reg_q | (rs2 & ~rs3)
reusing bwlogic from zbb
* rol/ror: (here: ror)
shift_amt = rs2 & 31;
shift_amt_compl = (32 - shift_amt) & 31
1. store (rs1 >> shift_amt) in stage reg
2. return (rs1 << shift_amt_compl) | stage_reg_q
* fsl/fsr:
For funnel shifts, the order of applying the shift
amount or its complement is determined by bit [5] of
shift_amt. Pseudocode for fsr:
shift_amt = rs2 & 63
shift_amt_compl = (32 - shift_amt[4:0])
1. if (shift_amt >= 33):
store (rs1 >> shift_amt_compl[4:0]) in stage reg
else if (shift_amt <0 && shift_amt <= 31):
store (rs1 << shift_amt[4:0]) in stage reg
else if (shift_amt == 32 || shift_amt == 0):
store rs1 in stage reg
2. if (shift_amt >= 33):
return stage_reg_q | (rs3 << shift_amt[4:0])
else if (shift_amt <0 && shift_amt <= 31):
return stage_reg_q | (rs3 >> shift_amt_compl[4:0])
else if (shift_amt == 32):
return rs3
else if (shift_amt == 0):
return rs1
Signed-off-by: ganoam <gnoam@live.com>
- A new parameter and a run-time control bit (DataIndTiming and
data_ind_timing) enabling different behaviour for running security critical
code sections.
- In the new mode, all branches act as if taken, with not-taken
branches executing as a branch to the next instruction.
- This should give similar execution time/power characteristics
regardless of the branch condition.
- Note that with the BranchTargetALU, branches stall an extra cycle in
secure mode to avoid factoring the branch-taken decision into the
branch target address mux.
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
This commit implements the Bit Manipulation Extension ZBB instruction
group: clz, ctz, pcnt, slo, sro, rol, ror, rev, rev8, orcb, pack
packu, packh, min, max, andn, orn, and xnor.
* Bit counting instructions clz, ctz and pcnt can be implemented to
share much of the architecture:
clz: Count Leading Zeros. Counts the number of 0 bits at the
MSB end of the argument.
ctz: Count Trailing Zeros. Counts the number of 0 bits at the
LSB end of the argument.
pcnt: Counts the number of set bits of the argument.
The implementation uses:
- 32 one bit adders, counting the set bits of a signal
bitcnt_bits, starting from the LSB end.
- For pcnt the argument is fed directly into bitcnt_bits.
- For clz, the operand is reversed such that leading zeros are
located at the LSB end of bitcnt_bits.
- For ctz and clz: counter enable signal for 1-bit counter i
is high, if the previous enable signal, and
its corresponting bitcnt_bit was high.
* Instructions sll[i], srl[i],slo[i], sro[i], rol, ror[i], rev, rev8
and orc.b are summarized as shifting instructions and related:
The following instructions are slight variations of the
existing base spec's sll, srl and sra instructions.
- slo[i] and sro[i]: shift left/right ones: similar to
shift-logical operations from base spec, but shifting
in ones instead of zeros.
- rol and ror[i]: rotate left/right ones: circular shift
operations. shifting in values from the oposite end
of the operand instead of zeros.
Those instructions can be implemented, sharing the base spec's
shifting structure. In order to support rotate operations, a
64-bit shifting structure is needed.
In the existing ALU, hardware is described only for right
shifts. For left shifts the operand is initially reversed,
right shifted and the result is reversed back. This gives rise
to an additional resource sharing oportunity for some more
zbb operations:
- rev: bitwise reversal.
- rev8: byte-order swap.
- orc.b: byte-wise reverse and or-combine.
* Instructions min, max:
For the B-extension's min/max instructions, we can share the
existing comparison operations. The result is obtained by
activating the comparison structure accordingly and
multiplexing the operands using the comparison result.
* Logic-with-negate instructions andn, orn, xnor:
For the B-extension's logic-with-negate instructions we can
share the structures of the base spec's logic structures
already present for 'xnor', 'or' and 'and' instructions as
well as the conditionally negated b operand generated for
subtraction operations.
* Instructions pack, packu, packh:
For the pack, packh and packu instructions I don't see any
opportunities for resource sharing. However, the architecture
is quite simple.
- pack: pack the lower halves of rs1 and rs2 into rd, with rs1
in the lower half and rs2 in the upper half.
- packu: pack the upper halves of rs1 and rs2 into rd, with
rs1 in the lower half and rs2 in the upper half.
- packh: pack the LSB bytes of rs1 and rs2 into rd, with rs1
in the lower half and rs2 in the upper half.
Signed-off-by: ganoam <gnoam@live.com>
- Create separate operand muxes for the branch/jump target ALU
- Complete jump instructions in one cycle when BT ALU configured
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
- Add parameters and actual instantiation of icache
- Add a custom CSR in the M-mode custom RW range to enable the cache
- Wire up the cache invalidation signal to trigger on fence.i
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
The third pipeline stage is a new writeback stage. Ibex can now be
configured as the original two stage design or the new three stage
design using the `WritebackStage` parameter in ibex_core. This defaults
to 0 (giving the original two stage design).
The three stage design is *EXPERIMENTAL*
In the three stage design all register write back occurs in the third,
final stage. This allows a cycle for responses to loads and stores so
when the memory system can respond in a single cycle there will be no
stall. This offers significant performance benefits.
Documentation of the three stage design is still to be written so
existing documentation applies to the two stage design only as various
aspects of Ibex behaviour will change in the three stage design.
Signed-off-by: Greg Chadwick <gac@lowrisc.org>
multdiv_sel signals the mult/div operand should be selected for the ALU
inputs. Previously the mult_en/div_en signals were used but these factor
in whether the instruction is actually happening which is not relevant
for the mux select. The dedicated select signal gives better timing.
Adds a second set of instruction flops that are used to determine ALU
operation and operand selection. This reduces fanout from the
instruction flops and so helps timing.
prim_assert.sv is a file containing assertion macros (defines).
Previously, prim_assert.sv was compiled as normal SystemVerilog file.
This made the defines available for the whole compilation unit as soon
as they were defined. Since all cores using prim_assert depended (in
fusesoc) on the lowrisc:prim:assert core, prim_assert was always
compiled first, and the defines were visible in subsequent files.
All of that is only true if all files end up in one comilation unit. The
SV standard states that what makes up a compilation unit is
tool-defined, but also states that typically, passing multiple files (or
a file list/.f file) to a single tool invocation means that all files
end up in one compilation unit; if the tool is called multiple times,
then the files end up in separate compilation units.
Edalize (the fusesoc backend) doesn't guarantee either behavior, and so
it happens that for Vivado, Verilator, Cadence and Synopsys simulators,
all files are compiled into a single compilation unit. But for Riviera,
each file is a separate compilation unit.
To avoid relying on the definition of compilation units, and to do the
generally right thing (TM), this commit changes the code to always
include the prim_assert.sv file when it is used in a source file.
Include guards are introduced in the prim_assert.sv file to avoid
defining things twice.
This commit replaces all X assignments in the RTL with defined
values. In addition, SystemVerilog Assertions are added to catch
invalid signal values in simulation. A new file containing the
corresponding assertion macros is added as well.
Signed-off-by: Pirmin Vogel <vogelpi@lowrisc.org>
We currently have a documentation block at the beginning of each file,
containing author credits and module-level documentation. The
module-level documentation is retained for historic reasons and
duplicated with the newer comments below it.
For the authors, maintaining author credits in the file is error-prone,
as this information gets outdated very soon. A more reliable way to see
who modified a file is to use the history information in git.
Additionally, we now have the CREDITS.md file, which lists all
contributors, even the ones which don't appear in the git history (e.g.
because the code was copied and commited by someone else).
This commit clarifies why CSR-related pipeline flushes are needed
(e.g. when enabling interrupts), only introduces them when doing the
critical modifications (write/set bits in `mstatus` and `mie` CSRs
for enabling interrupts, reads and clears are uncritical for these
CSRs), and makes sure the controller is actually able to start
handling interrupts while doing a CSR-related pipeline flush.
This resolveslowrisc/ibex#6.
This file doesn't contain defines any more, but a normal SV package.
The diff is best viewed without whitespace changes, as the reindents
cause a lof of diff noise.
Fixeslowrisc/ibex#173
The custom reg-reg load instruction was added in the original design but
is no longer needed. This commit removes it. Also, load instructions
with `instr[14:12] == 3'b110` are now decoded as illegal.
This resolves#25.
Without this commit, the PC is still set to a possible wrong jump
target on illegal JALR instructions ultimately causing the wrong PC
being saved to `mepc` during the illegal instruction exception.
This bug has been reported by @taoliug. This commit resolves#170.
This commit fixes two bugs in the decoder:
1. For illegal branch condition selections, the illegal instruction
condition must be signaled as long as the instruction is being executed
and not just during the first cycle, as the controller cannot interrupt
multicycle instructions.
2. Illegal instructions should also be signaled when `instr[28]` is set
for register-register ALU operations. Previously, these were not
signaled as the original design used `instr[28]` to encode custom bit-
manipulation instructions.
These bugs were discovered by @taoliug. This resolves issue #163.