This test constrains the address range (giving the cache a chance to
do some caching), but leaves the cache disabled. Seed changes are more
frequent than usual, to give us a good chance to spot any caching that
shouldn't have happened.
- Adds a new module in the IF stage to inject dummy instructions into
the pipeline
- Control / frequency of insertion is governed by configuration CSRs
- Extra CSR added to allow reseed of the internal LFSR used for
randomizing insertion (see the sketch below)
- Extra logic added to the register file to make dummy instruction
writebacks look like real instructions (via the zero register)
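As a rough sketch of the reseed hook (the module, port names and
polynomial here are illustrative, not the actual RTL):

  module dummy_instr_lfsr (
    input  logic        clk_i,
    input  logic        rst_ni,
    input  logic        reseed_en_i,   // pulsed on a write to the reseed CSR
    input  logic [31:0] reseed_val_i,  // value written to the reseed CSR
    output logic [31:0] lfsr_o         // randomness for insertion decisions
  );
    always_ff @(posedge clk_i or negedge rst_ni) begin
      if (!rst_ni) begin
        lfsr_o <= 32'hdead_beef;  // arbitrary non-zero reset value
      end else if (reseed_en_i) begin
        lfsr_o <= reseed_val_i;
      end else begin
        // Galois step of a standard maximal-length polynomial
        // (x^32 + x^22 + x^2 + x + 1)
        lfsr_o <= {lfsr_o[30:0], 1'b0} ^ (lfsr_o[31] ? 32'h0040_0007 : 32'h0);
      end
    end
  endmodule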
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
The code before this patch maintained a mailbox, where it would add an
item for each request it saw, and then pop items off until finding the
right address whenever it saw a grant.
Most of the time, you might expect to see a sequence like this:
request 100
grant 100
request 104
grant 104
request 108
grant 108
This scheme is also resilient when glitches (to do with the
delta-cycle scheduling in the simulator) mean you actually see
something like:
request 999
request 100
grant 100
request 104
grant 104
...
However, there's another source of "mismatch" possible too: the cache
can change the request address if the request hasn't been granted (as
opposed to a ready/valid interface, where this sort of tomfoolery is
not allowed!).
When the cache is branching all over the place, as in the sanity
sequence, this doesn't really matter. But if the branch destinations
are constrained, as in the passthru sequence, you can see things like
this:
request 100 (1)
request 120 (2)
request 100 (3)
grant 100 (4)
request 104
grant 104
...
Note that the mailbox has two entries for address 100 when searching
at point (4). This might be ok, but will cause failures if we get a
new seed at (2) or (3).
This patch replaces the mailbox with a queue. New requests get
inserted at the end, as before, but grants search from the end, rather
than the start. This means that when we get to (4) in the example
above, we'll pick the latest seed (and duplicate entries disappear
quickly).
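A sketch of the new scheme (the struct and helper names are
illustrative; the real entries carry more than an address and seed):

  typedef struct { logic [31:0] addr; int unsigned seed; } pending_req_t;
  pending_req_t pending_reqs[$];

  // On a request: push onto the back of the queue
  function void on_req(logic [31:0] a, int unsigned s);
    pending_reqs.push_back('{addr: a, seed: s});
  endfunction

  // On a grant: search from the back, so the latest matching entry
  // (and hence the latest seed) wins
  function int unsigned on_grant(logic [31:0] a);
    for (int i = pending_reqs.size() - 1; i >= 0; i--) begin
      if (pending_reqs[i].addr == a) begin
        int unsigned s = pending_reqs[i].seed;
        pending_reqs.delete(i);
        return s;
      end
    end
    return 0;  // no match: the real code flags an error here
  endfunction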
When the memory model sees a new fetch on the bus, it might decide to
pick a new seed for the backing memory. Before this patch, the seed
applied to every fetch strictly after this one. Now, it applies to
this fetch too.
This is what the scoreboard expects. In particular, you can trigger
problems here by disabling the cache and branching lots: things will
go wrong if we pick a new seed at the same time as handling the
branch.
To fix things, we either have to teach the scoreboard to "look one
seed backwards" when the cache is disabled, which is ugly and makes
the check less sensitive to errors in the cache, or we have to apply
the new seed immediately. This is a little painful, because we end up
having to randomize the response item and then calculate a field based
on a possible new seed (see the logic between start_item and
finish_item in take_req), but I think it's cleaner than the
alternative.
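Roughly the shape of that logic (helper names are illustrative):

  // Possibly pick a new seed for this very fetch, before building the
  // response:
  seed = maybe_pick_new_seed(req.address);

  start_item(rsp);
  if (!rsp.randomize())
    `uvm_fatal("SEQ", "Cannot randomize response item")
  // This field depends on the possibly-new seed, so it is calculated
  // after randomization rather than being constrained:
  rsp.rdata = mem_data_for_seed(seed, req.address);
  finish_item(rsp);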
As part of the patch, I've also split the "req" and "grant" handling
code into separate tasks. There's no real change there, except to get
rid of a level of indentation, but I think it makes the code a bit
easier to understand.
The --seed argument has kept its original meaning: run the one and
only iteration of the test with this seed. We've added another
argument, --start_seed, to riscv-dv's run.py and our sim.py, which
says "run the first iteration with this seed, and count up for later
iterations".
This should fix issue #859.
If --iterations is 1, this is equivalent to the existing --seed
argument (which we're keeping unchanged). If --iterations is
0 (reading iteration counts from the config) or positive, successive
test iterations use successive seeds. So if you pass --start_seed 123
and run ten iterations, they will run with seeds 123, 124, ... through
132.
Lots of the added code checks that you don't do something silly like
--seed=123 --iterations=10. Since the next patch will convert the
Makefile that runs this script to using --start_seed, that checking
will all become dead code. Maybe we should get rid of the --seed
argument at some point.
This commit implements the Bit Manipulation Extension ZBC instruction
group: clmul[rh] (carry-less multiply [reverse][high])
Carry-less multiplication can be understood as ordinary binary
multiplication with addition replaced by the bit-wise xor operation.
Example: 1101 X 1011 = 1111111:

       1101        (1101 << 0, since bit 0 of 1011 is set)
xor   11010        (1101 << 1, since bit 1 of 1011 is set)
-----------
      10111
xor 0000000        (bit 2 of 1011 is clear)
-----------
      10111
xor 1101000        (1101 << 3, since bit 3 of 1011 is set)
-----------
    1111111
Architectural details:
A 32 x 32-bit array
[ operand_b[i] ? (operand_a << i) : '0 for i in 0 ... 31 ]
is generated. The entries of the array are pairwise 'xor-ed'
together in a 5-stage binary tree.
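Behaviourally, the low half (clmul) is equivalent to the sketch below;
the RTL reduces the partial products as a balanced tree rather than a
serial chain, and clmulh/clmulr return the upper-half / bit-reversed
variants:

  function automatic logic [31:0] clmul32(logic [31:0] a, logic [31:0] b);
    logic [31:0] result = '0;
    for (int i = 0; i < 32; i++) begin
      // Partial product: (a << i) if bit i of b is set, else zero
      if (b[i]) result ^= a << i;
    end
    // e.g. clmul32(32'b1101, 32'b1011) == 32'b1111111, as worked above
    return result;
  endfunction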
The area increase when synthesized with relaxed timing constraints is
1.6-1.7kGE.
Timing figures improve by 0.1 ns for the 3-stage configuration and
worsen by 0.04 ns for the 2-stage implementation. This suggests the
differences are fluctuations due to the heuristic nature of the
synthesis tools.
Signed-off-by: ganoam <gnoam@live.com>
I think these represent the test cases we discussed. I've also removed
non-existent entries from the "tests" keys: I didn't really understand
how dvsim.py worked when I wrote the original version and they just
cause irritating warnings.
The previous code kind of worked, but we were making the "should I
make a new seed" decision in the monitor, rather than the sequence.
The problem is that this is difficult to customize with other test
sequences (they sit adjacent to the monitor in the class hierarchy,
not above it).
The new code seems a little cleaner. We generate new seeds in the
sequence (which is in charge of keeping track of the current seed
anyway). These new seeds get passed to the driver, which has an
analysis port by which it can tell the scoreboard about them. Note
that we have to pass them from the driver, rather than the monitor,
because the new seed doesn't directly appear on the interface.
The rest of the changes simplify the ibex_icache_mem_bus_item class,
which now only has two modes, and remove the seed field from the
ibex_icache_mem_req_item class.
- The controller state machine could only progress to FLUSH to handle an
exception if instr_valid_i was set
- When the exception comes from a load/store in the Writeback stage, and
no new instruction has been driven into the ID stage, this could cause
the exception to be missed
- The instr_valid_i qualification is therefore removed from the state
machine, as all relevant signals inside that if block are already
qualified by instr_valid_i anyway
- Fixes #849
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
Change the default to 4 rather than 0. This makes no difference when
PMPEnable==0, and gets rid of lint failures due to referencing a
zero-sized array (0 is an unsupported value for this parameter).
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
This commit implements the Bit Manipulation Extension sign-extend
instructions: sext.b (sign-extend byte) and sext.h (sign-extend half
word).
The implementation is basically a one-liner, replicating the msb of the
byte / half-word across the upper bits of the output register.
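A sketch of the behaviour:

  function automatic logic [31:0] sext_b(logic [31:0] rs1);
    return {{24{rs1[7]}}, rs1[7:0]};    // replicate bit 7 across [31:8]
  endfunction

  function automatic logic [31:0] sext_h(logic [31:0] rs1);
    return {{16{rs1[15]}}, rs1[15:0]};  // replicate bit 15 across [31:16]
  endfunction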
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBF instruction
group, which consists only of the one instruction bfp (bit-field
place).
This instruction places a field of length len < 16 from rs2 in rs1 at
offset off.
Architectural details:
The implementation works exactly the same as proposed by Claire
Wolf in her reference implementation.
1. bfp_mask = slo(0, len)
2. bfp_result =
   (rs1 & ~(bfp_mask << off)) | ((rs2 & bfp_mask) << off)
                                ^------- shifter -------^
The existing shifter structure is shared for the indicated
operation.
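As a behavioural sketch, with len and off assumed already decoded from
rs2 (the decoding itself is omitted here):

  function automatic logic [31:0] bfp32(logic [31:0] rs1, logic [31:0] rs2,
                                        logic [4:0]  len, logic [4:0]  off);
    // slo(0, len): a run of len ones in the low bits
    logic [31:0] bfp_mask = ~(32'hffff_ffff << len);
    return (rs1 & ~(bfp_mask << off)) | ((rs2 & bfp_mask) << off);
  endfunction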
Impact on area:
* When synthesizing without the B-extension, the 2 stage
design seems to move the timing bottleneck, leading to
optimizations which result in an area increase by 1 kGE,
when synthesized with tight timing constraints. For the
3 stage configuration there is no change.
When synthesized with relaxed timing constraints there is no
significant change in either configuration.
* With the B-extension enabled, the area increase for tight
timing constraints is 1.1-1.2 kGE. For relaxed timing
constraints it is ~0.4 kGE.
Impact on timing: No significant impact.
Signed-off-by: ganoam <gnoam@live.com>
This commit implements the Bit Manipulation Extension ZBE instruction
group: bext (bit extract) and bdep (bit deposit).
Architectural details:
* bext/bdep: A new butterfly and inverse butterfly network is
implemented. The generation of its control bits depends on a
parallel prefix bit count of the deposit / extract mask. (A
behavioural reference is sketched after this list.)
* bitcounter: The path for bext / bdep instructions traverses
the bit counter and the butterfly network, resulting in both a
larger delay and area. To mitigate this, the bit counter has been
changed from a serial bit counter to a radix-2 tree structure.
* grev/gorc: The Zbp instructions general reverse and general
or-combine have so far shared the shifter's reversal
structure. It has proven beneficial to area and timing to reuse
the new butterfly network instead.
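For reference, the architectural behaviour of the two instructions
(the butterfly network computes the same results in logarithmic depth
rather than with a serial loop):

  // bext: gather the rs1 bits selected by the mask in rs2 into the LSBs
  function automatic logic [31:0] bext32(logic [31:0] rs1, logic [31:0] rs2);
    logic [31:0] res = '0;
    int j = 0;
    for (int i = 0; i < 32; i++) begin
      if (rs2[i]) begin
        if (rs1[i]) res[j] = 1'b1;
        j++;
      end
    end
    return res;
  endfunction

  // bdep: scatter the LSBs of rs1 into the bit positions set in rs2
  function automatic logic [31:0] bdep32(logic [31:0] rs1, logic [31:0] rs2);
    logic [31:0] res = '0;
    int j = 0;
    for (int i = 0; i < 32; i++) begin
      if (rs2[i]) begin
        if (rs1[j]) res[i] = 1'b1;
        j++;
      end
    end
    return res;
  endfunction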
The butterfly network itself consumes ~3.5kGE and ~1.1kGE for synthesis
with tight and relaxed timing constraints respectively. Including the
optimizations of the bitcounter and grev/gorc, the overall change in
area consumption is +4.6kGE (+1.2kGE) and +3.3kGE (+1.1kGE) for
synthesis with tight (relaxed) timing constraints for 2- and 3-stage
configurations respectively. For tight timing constraints that is a
growth by around ~10%, for relaxed ~5%.
The impact on the maximum frequency is negligible.
Signed-off-by: ganoam <gnoam@live.com>
Once the cache has passed an error to the core, we now allow it to
wiggle its valid, addr, rdata, err and err_plus2 lines however it sees
fit until the core issues a new branch.
Since the core isn't allowed to assert ready until then, the values
will not be read and this won't matter.
This was exposed by:
  make -C dv/uvm/icache/dv run SEED=1314810947 WAVES=1
This assertion is supposed to say "the core may not request more data
from the cache when there's no valid address".
Unfortunately, I'd represented "requesting more data" by req being
high, rather than ready being high. This is wrong: req is a signal
saying "the core isn't currently asleep". ready (of a ready/valid
pair) is the one I wanted.
Our hjson-based logic for constructing VCS commands always passes
-elfile, but this doesn't work if the following list of arguments is
empty.
It seems difficult to figure out how to teach dvsim.py to do something
like "prepend X to Y if Y is nonempty", so let's just add an empty
file for now.
- Only specifying the signals that the TB cares about means people will
no longer have to update this file every time the cs_regs port list
changes
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
As discussed in issue #845, this tells VCS to wait for signals to
settle in combinatorial blocks before checking uniqueness in
constructs like unique case.
Otherwise things like this can cause spurious warnings:
  always_comb b = ~in;
  always_comb c = in;

  always_comb begin
    unique case (1'b1)
      b: x = 1;
      c: x = 0;
      default: x = 0; // not that it matters, but this won't happen
    endcase
  end
For example, on a falling edge of the in signal, if the processes are
executed in the order 1, 3, 2 then the unique case block runs after b
has risen but before c has fallen, so it appears to see both b and c
true at the same time.
- All primitives the icache uses are specified in distinct core files
with names that match those existing (or about to exist) in OpenTitan
- When vendoring-in Ibex, none of those primitives need to be copied
across, since OpenTitan will use its own versions
- Relates to lowRISC/opentitan#1231
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
The flow for a memory fetch is:
1. Cache requests data for a memory address
2. Agent spots the request, maybe signalling a PMP error
3. Grant line goes high, at which point the request is granted.
4. Sometime later (in-order pipeline), agent sends a response
Occasionally, we need to pick a new seed for the backing memory.
Before this patch, we picked these seeds at point (3).
Unfortunately this was wrong in the following case:
1. We're switching from seed S0 to seed S1.
2. The request is spotted with seed S0 and doesn't signal a PMP error
3. The request is granted and we switch to seed S1.
4. We respond with data from memory based on S1, with no memory
error either
If S1 would have caused a PMP error, the resulting fetch (no error,
but data from S1) doesn't match any possible seed and the scoreboard
gets confused.
This patch solves the problem by picking new seeds at (2) instead.
This isn't quite enough by itself, because if a request is granted on
a clock-edge, a new request address might appear and there isn't a
guaranteed ordering in the simulation between the new request and the
old grant (both things happen at the same time). To fix this, the
response sequence now maintains a queue of requests and their
corresponding seeds to make sure that all the checks for a fetch are
done with a single seed.
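A sketch of that pairing (type and helper names are hypothetical):

  typedef struct { logic [31:0] addr; int unsigned seed; } fetch_req_t;
  fetch_req_t req_queue[$];

  // At point (2): pick the seed (possibly new) and record it with the
  // request; maybe_pick_new_seed is a hypothetical helper
  function void record_req(logic [31:0] a);
    req_queue.push_back('{addr: a, seed: maybe_pick_new_seed(a)});
  endfunction

  // At point (4): every check for this fetch (PMP error, memory error,
  // rdata) uses the single seed recorded with its request
  function fetch_req_t next_fetch();
    return req_queue.pop_front();
  endfunction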
The patch also gets rid of the seed state in the memory model. It
turns out that this didn't really help: the scoreboard is always
asking "what would I get with this seed?", and now the sequence is
doing something similar.
- Data valid should only be signalled when the current beat is
signalling an error
- PMP errors for future beats can sneak in while waiting for the
current beat
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
- Stalls due to preceding memory accesses in the WB stage shouldn't
cause the jump signal to remain high.
- The jump signal being stuck high causes repeated memory accesses to
the same address, and unnecessary stalling.
- Fixes lowRISC/opentitan#2099
Signed-off-by: Tom Roberts <tomroberts@lowrisc.org>
Firstly, the pulse shouldn't be zero length (since that wouldn't
actually do anything).
Also, 1 time in 500 is too rare for either invalidations or "long
invalidations", I think, so this patch also increases how often we see
each.
Now that we have a working framework, let's drive some more items
through (a bit more efficient than running more tests, and also less
skewed by the initial cache invalidation).
This correctly tracks fetch addresses and fetched data. It understands
changing memory seeds, errors, invalidations and enable/disable.
Most of the complexity is in checking whether a fetch got the right
answer, given the set of memory seeds that it might have used. This
isn't conceptually hard (use a local memory model; check each seed and
see whether it matches), but is a bit fiddly in practice. In
particular, a misaligned 4-byte load might actually correspond to two
different seeds: note the nested loops in check_compatible_2.
The general flow of these checks is:
check_compatible
  -> check_compatible_<i>         (loop over plausible seeds)
    -> is_seed_compatible_<i>     (ask the memory what data to expect)
      -> is_fetch_compatible_<i>  (compare seen/expected data)
Note that the check_compatible_<i> functions have a "chatty"
parameter. This is to help with debugging when something goes wrong:
if the check fails then the check_compatible function calls it again
with chatty=1, which dumps a list of the seeds we checked and why they
didn't match.
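The pattern looks roughly like this (simplified to a single check
level; fetch_item_t and the seed list are illustrative):

  int unsigned plausible_seeds[$];

  function automatic bit check_compatible(fetch_item_t item, bit chatty);
    foreach (plausible_seeds[i]) begin
      // With chatty=1, the inner check explains why a seed was rejected
      if (is_seed_compatible(item, plausible_seeds[i], chatty)) return 1'b1;
    end
    return 1'b0;
  endfunction

  // On a mismatch, run the same check again noisily, then report the error
  if (!check_compatible(item, 1'b0)) begin
    void'(check_compatible(item, 1'b1));
    `uvm_error("SCB", "Fetch doesn't match any plausible seed")
  end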
Other than the scoreboard itself, this patch also adds a "seed" field
to ibex_icache_mem_bus_item. This is used by the monitor to inform the
scoreboard when a new memory seed has been set. Obviously, this
doesn't correspond to any actual monitored signals, but we need some
sort of back channel, and this looked like a sensible way to do it.
The patch also stops reporting PMP responses to the scoreboard: it
turns out the scoreboard doesn't need to care about them, so we can
simplify things slightly this way.
At the moment, the scoreboard doesn't check that fetched data isn't
stored when the cache is disabled. You could see this by disabling the
cache, fetching from an address, enabling the cache and changing the
memory seed, and fetching from the address again. I think it would be
reasonably easy to make an imprecise version of the check, where a
seed gets discarded from the queue if its live period is completely
within a period where the cache was disabled, but I want to wait until
we've got some tests that actually get cache hits before I implement
this.
There's also a slight imprecision in the busy line check that needs
tightening up.
Both of these to-do items have TODO comments in the code.