cva6/docs/design/design-manual/source/cva6_frontend.adoc

////
   Copyright 2021 Thales DIS design services SAS
   Licensed under the Solderpad Hardware Licence, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   SPDX-License-Identifier: Apache-2.0 WITH SHL-2.0
   You may obtain a copy of the License at https://solderpad.org/licenses/

   Original Author: Jean-Roch COULON - Thales
////

[[CVA6_FRONTEND]]

FRONTEND Module
~~~~~~~~~~~~~~~

[[frontend-description]]
Description
^^^^^^^^^^^

The FRONTEND module implements two first stages of the cva6 pipeline,
PC gen and Fetch stages.

PC gen stage is responsible for generating the next program counter.
It hosts a Branch Target Buffer (BTB), a Branch History Table (BHT) and a Return Address Stack (RAS) to speculate on control flow instructions.

Fetch stage requests data to the CACHE module, realigns the data to store them in instruction queue and transmits the instructions to the DECODE module.
FRONTEND can fetch up to {instr-per-fetch} instructions per cycle, but DECODE module decodes up to {issue-width} instruction(s) per cycle.

The module is connected to:

* CACHES module provides fethed instructions to FRONTEND.
* DECODE module receives instructions from FRONTEND.
* CONTROLLER module can order to flush and to halt FRONTEND PC gen stage
* EXECUTE, CONTROLLER, CSR and COMMIT modules trigger PC jumping due to
a branch misprediction, an exception, a return from an exception, a
debug entry or a pipeline flush. They provides the PC next value.
* CSR module states about debug mode.

include::port_frontend.adoc[]

[[frontend-functionality]]
Functionality
^^^^^^^^^^^^^

[[pc-generation-stage]]
PC Generation stage
^^^^^^^^^^^^^^^^^^^

PC gen generates the next program counter. The next PC can originate from the following sources (listed in order of precedence):

* *Reset state:* At reset, the PC is assigned to the boot address.

* *Branch Prediction:* The fetched instruction is predecoded by the
instr_scan submodule. When the instruction is a control flow, three
cases are considered:
+
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
1.  When the instruction is a JALR which corresponds to a return (rs1 = x1 or rs1 = x5).
RAS provides next PC as a prediction.
2.  When the instruction is a JALR which *does not* correspond to areturn.
If BTB (Branch Target Buffer) returns a valid address, then BTBpredicts next PC.
Else JALR is not considered as a control flow instruction, which will generate a mispredict.
3.  When the instruction is a conditional branch.
If BHT (Branch History table) returns a valid address, then BHT predicts next PC.
Else the prediction depends on the PC relative jump offset sign: if sign is negative the prediction is taken, otherwise the prediction is not taken.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+
Then the PC gen informs the Fetch stage that it performed a prediction on the PC.

* *Default:* The next {ifetch-len}-bit block is fetched.
PC Gen fetches word boundary {ifetch-len}-bits block from CACHES module.
And the fetch stage identifies the instructions from the {ifetch-len}-bits blocks.

* *Mispredict:* Misprediction are feedbacked by EX_STAGE module.
In any case we need to correct our action and start fetching from the correct address.

* *Replay instruction fetch:* When the instruction queue is full, the instr_queue submodule asks the fetch replay and provides the address to be replayed.

* *Return from environment call:* When CSR requests a return from an environment call, next PC takes the value of the PC of the instruction after the one pointed to by the mepc CSR.

* *Exception/Interrupt:* If an exception is triggered by CSR_REGISTER, next PC takes the value of the trap vector base address CSR.
ifdef::RVS-true,RVU-true[]
The trap vector base address can be different depending on whether the exception traps to S-Mode or M-Mode.
It is the purpose of the CSR Unit to figure out where to trap to and present the correct address to PC Gen.
endif::[]

* *Pipeline starting fetching from COMMIT PC:* When the commit stage is halted by a WFI instruction or when the pipeline has been flushed due to CSR change, next PC takes the value of the PC coming from the COMMIT submodule.
As CSR instructions do not exist in a compressed form, PC is unconditionally incremented by 4.

ifeval::[{DebugEn} == true]
* *Debug:* Debug has the highest order of precedence as it can interrupt any control flow requests. It also the only source of control flow change which can actually happen simultaneously to any other of the forced control flow changes.
The debug jump is requested by CSR.
The address to be jumped into is HW coded.
endif::[]

All program counters are logical addressed.
ifeval::[{MmuPresent} == true]
If the logical to physical mapping changes, a ``fence.vm`` instruction should be used to flush the pipeline and TLBs.
endif::[]

[[fetch-stage]]
Fetch Stage
^^^^^^^^^^^

Fetch stage controls the CACHE module by a handshaking protocol.
Fetched data is a {ifetch-len}-bit block with a word-aligned address.
A granted fetch is processed by the instr_realign submodule to produce instructions.
Then instructions are pushed into an internal instruction FIFO called instruction queue (instr_queue submodule).
This submodule stores the instructions with its associated address and sends them to the DECODE module.

Before sending the instructions to the DECODE stage, the frontend calculates a prediction address in case of a JUMP or Branch. This predicted address is sent to the DECODE stage along with the instruction and its fetch address.
The prediction address is not valid if there is no prediction.
Instructions following a predicted taken control flow instruction are dropped.

// TO_BE_COMPLETED MMU also feedback an exception, but not present in 65X

Memory can feedback potential exceptions which can be bus errors, invalid accesses or instruction page faults.
The FRONTEND transmits the exception from CACHES to DECODE.


[[submodules]]
Submodules
^^^^^^^^^^

image:frontend_modules.png[FRONTEND submodules]

image:ZoominFrontend.png[FRONTEND submodule interconnections]

[[instr_realign-submodule]]
Instr_realign submodule
+++++++++++++++++++++++

The {ifetch-len}-bit aligned block coming from the CACHE module enters the instr_realign submodule.
This submodule extracts the instructions from the {ifetch-len}-bit blocks.
Based on the fetch address and the fetched data, the instr_realign module extracts the valid instructions to be sent to the queue.
It is possible to fetch up to {instr-per-fetch} instructions per cycle when C extension is used.
A not-compressed instruction can be misaligned on the block size, interleaved with two cache blocks.
In that case, two cache accesses are needed to get the whole instruction.
The instr_realign submodule provides up to {instr-per-fetch} instructions per cycle when compressed extension is enabled, else one instruction per cycle.
Incomplete instruction is stored in instr_realign submodule until its second half is fetched.

Below is a table that explains how the instr_realign works:

* _C : compressed instruction_
* _I : not compressed instruction_
* _UI: Incomplete instruction stored in the instr_realign_

ifeval::[{Superscalar} == true]

|==========================================================
|*Address[2:1]*	|*Incomplete Access* |*64-48* |*48-32* |*32-16*	|*16-0*
.13+^.^|0

.5+^.^|1
|UI
2.+^.^|I
|UI
|C
2.+^.^|I
|UI
2.+^.^|I
|C |UI
|C |C |C |UI
|UI |C |C |UI

.8+^.^|0
|UI
2.+^.^|I
|C
|C
2.+^.^|I
|C
2.+^.^|I
|C
|C
|UI |C |C |C
|C |C |C |C
|UI |C
2.+^.^|I
|C |C
2.+^.^|I
2.+^.^|I
2.+^.^|I

.5+^.^|1
.5+^.^|__
|* |UI
2.+^.^|I
|* |C
2.+^.^|I
|*
2.+^.^|I
|C
|* |UI |C |C
|* |C |C |C

.3+^.^|2
.3+^.^|__
2.+^.^|*
|UI |C
2.+^.^|*
|C |C
2.+^.^|*
2.+^.^|I

.2+^.^|3
.2+^.^|__
3.+^.^|*
|UI
3.+^.^|*
|C

|==========================================================

endif::[]
ifeval::[{Superscalar} == false]

|==========================================================
|*Address[2:1]*	|*Incomplete Access* |*32-16*	|*16-0*
.5+^.^|0

.3+^.^|0
2.+^.^|I
|C
|C
|UI
|C

.2+^.^|1
|UI
|UI
|C
|UI

.2+^.^|1
.2+^.^|__
|* |C
|* |UI

|==========================================================

endif::[]

The Instr_realign can be flushed when the frontend requests the cache to kill the incoming instruction, in this case the incomplete instruction is deleted.

include::port_instr_realign.adoc[]

[[instr_queue-submodule]]
Instr_queue submodule
+++++++++++++++++++++

The instr_queue receives mutliple instructions from instr_realign submodule to create a valid stream to be executed.
Frontend pushes instructions and all related information into the FIFO for storage, including details needed in case of a misprediction or exception: the instructions themselves, instruction control flow type, exception, exception address, and predicted address.
DECODE pops them when decode stage is ready and indicates to the FRONTEND the instruction has been consummed.


The instruction queue contains two FIFOs: one for instructions and one for addresses, which stores addresses in case of a prediction.
The instruction FIFO can hold up to 4×{instr-per-fetch} instructions, while the address FIFO can hold up to 2 addresses.
If the instruction FIFO is full, a replay request is sent to inform the fetch mechanism to replay the fetch.
If the address FIFO is full and there is a prediction, a replay request is sent to inform the fetch mechanism to replay the fetch, even if the instruction FIFO is not full.

The instruction queue can be flushed by the flush signal coming from the CONTROLLER.


include::port_instr_queue.adoc[]

[[instr_scan-submodule]]
Instr_scan submodule
++++++++++++++++++++

As compressed extension is enabled, {instr-per-fetch} instr_scan are instantiated to handle up to {instr-per-fetch} instructions per cycle.

Each instr_scan submodule pre-decodes the fetched instructions coming from the instr_realign module, also calculate the immediate, instructions could be compressed or not.
The instr_scan submodule is a flox controler which provides the intruction type: branch, jump, return, jalr, imm, call or others.
These outputs are used by the branch prediction feature.

include::port_instr_scan.adoc[]

ifeval::[{BHTEntries} > 0]
[[bht-branch-history-table-submodule]]
BHT (Branch History Table) submodule
++++++++++++++++++++++++++++++++++++

The BHT is implemented as a memory which is composed of {BHTEntries} entries.
The BHT is a two-dimensional table:

*	The first dimension represents the access address, with a length equal to `{BHTEntries} / {instr-per-fetch}`.
*	The second dimension represents the row index, with a length equal to `{instr-per-fetch}`.

In the case of branch prediction, the BHT uses only part of the virtual address to get the value of the saturation counter.
In the case of a valid misprediction, the BHT uses only part of the misprediction address to access the BHT table and update the saturation counter.

`UPPER_ADDRESS_INDEX = $clog2(BHTDepth) + ((RVC == 1) ? 1 : 2)`

`LOWER_ADDRESS_INDEX = (RVC == 1) ? 1 : 2 + $clog2(INSTR_PER_FETCH)`

`ACCESS_ADDRESS = PC/MISPREDICT_ADDRESS [ UPPER_ADDRESS_INDEX : LOWER_ADDRESS_INDEX ]`

The lower address bits of the virtual address point to the memory entry.

`UPPER_ADDRESS_INDEX = (RVC == 1) ? 1 : 2 +  $clog2(INSTR_PER_FETCH)`

`LOWER_ADDRESS_INDEX = (RVC == 1) ? 1 : 2 +    $clog2(INSTR_PER_FETCH)`

`ACCESS_INDEX = PC/MISPREDICT_ADDRESS [ UPPER_ADDRESS_INDEX : LOWER_ADDRESS_INDEX]`

_Two distinct branches with different addresses can share the same BHT entry if they have the same ACCESS_ADDRESS._

Each BHT entry contains a two-bit saturating counter and a valid bit.
On reset, the counters are set to 0 (strongly not taken) and the valid bits are cleared.
When a branch instruction is resolved by the EX_STAGE module, the valid bit is set and the counter is updated.
The two bit counter is updated by the successive execution of the instructions as shown in the following figure.

image:bht.png[BHT saturation]

When a branch instruction is pre-decoded by instr_scan submodule, the BHT valids whether the PC address is inside the BHT and provides the taken or not prediction.
The prediction is the most significant bit from the counter, where 1 means "taken".

When the Execute stage processes a branch instruction, it sends the branch status (whether it's taken or not) to the Frontend to update the BHT table

ifeval::[{DebugEn} == true]
FIXME The BHT is not updated if processor is in debug mode.
endif::[]


The BHT is never flushed.

include::port_bht.adoc[]
endif::[]

[[btb-branch-target-buffer-submodule]]
BTB (Branch Target Buffer) submodule
++++++++++++++++++++++++++++++++++++

ifeval::[{BTBEntries} > 0]
BTB is implemented as an array which is composed of {BTBEntries} entries.
The lower address bits of the virtual address point to the memory entry.

When an JALR instruction is found mispredicted by the EX_STAGE module, the JALR PC and the target address are stored into the BTB.

// TODO: Specify the behaviour when BTB is saturated

// TODO: when debug enabled, The BTB is not updated if processor is in debug mode.

When a JALR instruction is pre-decoded by instr_scan submodule, the BTB informs whether the input PC address is in the BTB.
In this case, the BTB provides the predicted target address.

The BTB is never flushed.

include::port_btb.adoc[]
endif::[]
ifeval::[{BTBEntries} == 0]
There is no BTB in {ohg-config}.
As a consequence, no valid address is returned from BTB.
endif::[]

ifeval::[{RASDepth} > 0]
[[ras-return-address-stack-submodule]]
RAS (Return Address Stack) submodule
++++++++++++++++++++++++++++++++++++

RAS is implemented as a LIFO which is composed of {RASDepth} entries.

When a "call" JAL instruction (rd = x1 or x5) is added to the instruction queue, the PC of the instruction following the JAL instruction is pushed into the stack.

When a "ret" JALR instruction (rs1 = x1 or x5, and rd != rs1) is added to the instruction queue, the predicted return address is popped from the stack.
If the predicted return address is wrong due for instance to speculation or RAS depth limitation, a mispredict will be generated.

The RAS is never flushed.

include::port_ras.adoc[]
endif::[]