Update CFU example: use XTEA as "real world" demo application (#855)

This commit is contained in:
stnolting 2024-03-19 17:29:59 +01:00 committed by GitHub
commit c6fa219f5d
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
5 changed files with 368 additions and 255 deletions

View file

@ -29,6 +29,7 @@ mimpid = 0x01040312 -> Version 01.04.03.12 -> v1.4.3.12
| Date | Version | Comment | Link |
|:----:|:-------:|:--------|:----:|
| 18.03.2024 | 1.9.6.9 | :sparkles: update CFU example: now implementing the Extended Tiny Encryption Algorithm (XTEA) | [#855](https://github.com/stnolting/neorv32/pull/855) |
| 16.03.2024 | 1.9.6.8 | rework cache system: L1 + L2 caches, all based on the generic cache component | [#853](https://github.com/stnolting/neorv32/pull/853) |
| 16.03.2024 | 1.9.6.7 | cache optimizations: add read-only option, add option to disable direct/uncached accesses | [#851](https://github.com/stnolting/neorv32/pull/851) |
| 15.03.2024 | 1.9.6.6 | :warning: clean-up configuration generics (remove XBUS endianness configuration; refine JEDED/VENDORID configuration); rearrange SYSINFO.SOC bits | [#850](https://github.com/stnolting/neorv32/pull/850) |

View file

@ -2,8 +2,8 @@
:sectnums:
=== Custom Functions Unit (CFU)
The Custom Functions Unit is the central part of the <<_zxcfu_isa_extension>> and represents
the actual hardware module, which can be used to implement _custom RISC-V instructions_.
The Custom Functions Unit (CFU) is the central part of the NEORV32-specific <<_zxcfu_isa_extension>> and
represents the actual hardware module that can be used to implement **custom RISC-V instructions**.
The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or
program memory requirements when implemented entirely in software. Some potential application fields and exemplary
@ -13,23 +13,27 @@ use-cases might include:
* **Cryptographic:** bit substitution and permutation
* **Communication:** conversions like binary to gray-code; multiply-add operations
* **Image processing:** look-up-tables for color space transformations
* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32
* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by NEORV32
[NOTE]
The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators
The CFU is not intended for complex and **CPU-independent** functional units that implement complete accelerators
(like block-based AES encryption). These kind of accelerators should be implemented as memory-mapped
<<_custom_functions_subsystem_cfs>>. A comparison of all NEORV32-specific chip-internal hardware extension
options is provided in the user guide section
https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].
.Default CFU Hardware Example
[TIP]
The default CFU module (`rtl/core/neorv32_cpu_cp_cfu.vhd`) implements the _Extended Tiny Encryption Algorithm (XTEA)_
as "real world" application example.
:sectnums:
==== CFU Instruction Formats
The custom instructions executed by the CFU utilize a specific opcode space in the `rv32` 32-bit instruction
space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("Guaranteed
Non-Standard Encoding Space"). The NEORV32 CFU uses the `custom` opcodes to identify the instructions implemented
by the CFU and to differentiate between the different instruction formats. The according binary encoding of these
encoding space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("Guaranteed
Non-Standard Encoding Space"). The NEORV32 CFU uses the `custom-*` opcodes to identify the instructions implemented
by the CFU and to differentiate between the available instruction formats. The according binary encoding of these
opcodes is shown below:
* `custom-0`: `0001011` RISC-V standard, used for <<_cfu_r3_type_instructions>>
@ -44,9 +48,10 @@ opcodes is shown below:
The R3-type CFU instructions operate on two source registers `rs1` and `rs2` and return the processing result to
the destination register `rd`. The actual operation can be defined by using the `funct7` and `funct3` bit fields.
These immediates can also be used to pass additional data to the CFU like offsets, look-up-tables addresses or
shift-amounts. However, the actual functionality is entirely user-defined.
shift-amounts. However, the actual functionality is entirely user-defined. Note that all immediate values are
always compile-time-static.
Example operation: `rd <= rs1 xnor rs2`
Example operation: `rd <= rs1 xnor rs2` (bit-wise XNOR)
.CFU R3-type instruction format
image::cfu_r3type_instruction.png[align=center]
@ -74,9 +79,10 @@ R3-type instructions can be implemented (7-bit + 3-bit = 10 bit -> 1024 differen
The R4-type CFU instructions operate on three source registers `rs1`, `rs2` and `rs2` and return the processing
result to the destination register `rd`. The actual operation can be defined by using the `funct3` bit field.
Alternatively, this immediate can also be used to pass additional data to the CFU like offsets, look-up-tables
addresses or shift-amounts. However, the actual functionality is entirely user-defined.
addresses or shift-amounts. However, the actual functionality is entirely user-defined. Note that all immediate
values are always compile-time-static.
Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]`
Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]` (multiply-and-accumulate; "MAC")
.CFU R4-type instruction format
image::cfu_r4type_instruction.png[align=center]
@ -111,9 +117,9 @@ The R5-type CFU instructions operate on four source registers `rs1`, `rs2`, `rs3
processing result to the destination register `rd`. As all bits of the instruction word are used to encode the
five registers and the opcode, no further immediate bits are available to specify the actual operation. There
are two different R5-type instruction with two different opcodes available. Hence, only two R5-type operations
can be implemented out of the box.
can be implemented by default.
Example operation: `rd <= rs1 & rs2 & rs3 & rs4`
Example operation: `rd <= rs1 & rs2 & rs3 & rs4` (bit-wise AND of 4 operands)
.CFU R5-type instruction A format
image::cfu_r5type_instruction_a.png[align=center]
@ -207,7 +213,6 @@ neorv32_cpu_csr_write(CSR_CFUREG0, 0xabcdabcd); // write data to CFU CSR 0
uint32_t tmp = neorv32_cpu_csr_read(CSR_CFUREG3); // read data from CFU CSR 3
----
.Additional CFU-internal CSRs
[TIP]
If more than four CFU-internal CSRs are required the designer can implement an "indirect access mechanism" based
@ -215,35 +220,35 @@ on just two of the default CSRs: one CSR is used to configure the index while th
data with the indexed CFU-internal CSR - this concept is similar to the RISC-V Indirect CSR Access Extension
Specification (`Smcsrind`).
.Security Considerations
[NOTE]
The CFU CSRs are mapped to the user-mode CSR space so software running at _any privilege level_ can access these
CSRs. However, accesses can be constrained to certain privilege level (see <<_custom_instructions_hardware>>).
:sectnums:
==== Custom Instructions Hardware
The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside
the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`.
the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`. This file is highly commented to illustrate the
hardware design considerations.
CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of
the current clock cycle. Operations can also take several clock cycles to complete (like multiplications)
and may also include internal states and memories. The CFU's internal control unit takes care of
interfacing the custom user logic to the CPU pipeline.
.CFU Hardware Example & More Details
[TIP]
The default CFU hardware module already implement some exemplary instructions that are used for illustration
by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
is highly commented to explain the available signals, implementation options and the handshake with the CPU pipeline.
.CFU Hardware Resource Requirements
[NOTE]
Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using
the according operands for the CFU hardware) will add one or two, respectively, additional read ports to
the core's register file significantly increasing resource requirements.
.CFU Access
.CFU Access Privilege Levels
[NOTE]
The CFU is accessible from all privilege modes (including CFU-internal registers accessed via the indirects CSR
access mechanism). It is the task of the CFU designers to add according access-constraining logic if certain CFU
states shall not be exposed to all privilege levels (i.e. exncryption keys).
states shall not be exposed to all privilege levels (i.e. encryption keys).
.CFU Execution Time
[NOTE]

View file

@ -1,5 +1,5 @@
-- #################################################################################################
-- # << NEORV32 CPU - Co-Processor: Custom (Instructions) Functions Unit >> #
-- # << NEORV32 CPU - Co-Processor: Custom (RISC-V Instructions) Functions Unit (CFU) >> #
-- # ********************************************************************************************* #
-- # For custom/user-defined RISC-V instructions (R3-type, R4-type and R5-type formats). See the #
-- # CPU's documentation for more information. Also take a look at the "software-counterpart" of #
@ -67,7 +67,7 @@ end neorv32_cpu_cp_cfu;
architecture neorv32_cpu_cp_cfu_rtl of neorv32_cpu_cp_cfu is
-- CFU Control - do not modify! ----------------------------
-- CFU Control ---------------------------------------------
-- ------------------------------------------------------------
type control_t is record
busy : std_ulogic; -- CFU is busy
@ -85,29 +85,39 @@ architecture neorv32_cpu_cp_cfu_rtl of neorv32_cpu_cp_cfu is
constant r5typeA_c : std_ulogic_vector(1 downto 0) := "10"; -- R5-type instruction A (custom-2 opcode)
constant r5typeB_c : std_ulogic_vector(1 downto 0) := "11"; -- R5-type instruction B (custom-3 opcode)
-- User-Defined Logic --------------------------------------
-- ------------------------------------------------------------
-- multiply-add unit (r4-type instruction example) --
type madd_t is record
sreg : std_ulogic_vector(2 downto 0); -- 3 cycles latency = 3 bits in arbitration shift register
done : std_ulogic;
--
opa : std_ulogic_vector(XLEN-1 downto 0);
opb : std_ulogic_vector(XLEN-1 downto 0);
opc : std_ulogic_vector(XLEN-1 downto 0);
mul : std_ulogic_vector(2*XLEN-1 downto 0);
res : std_ulogic_vector(2*XLEN-1 downto 0);
end record;
signal madd : madd_t;
-- xtea instructions (funct3 bit-field) --
constant xtea_enc_v0_c : std_ulogic_vector(2 downto 0) := "000";
constant xtea_enc_v1_c : std_ulogic_vector(2 downto 0) := "001";
constant xtea_dec_v0_c : std_ulogic_vector(2 downto 0) := "010";
constant xtea_dec_v1_c : std_ulogic_vector(2 downto 0) := "011";
constant xtea_init_c : std_ulogic_vector(2 downto 0) := "100";
-- custom control and status registers (CSRs) --
signal cfu_csr_0, cfu_csr_1 : std_ulogic_vector(XLEN-1 downto 0);
-- xtea round-key adjusting --
constant xtea_delta_c : std_ulogic_vector(31 downto 0) := x"9e3779b9";
-- xtea key storage (accessed via CFU CSRs) --
type key_mem_t is array (0 to 3) of std_ulogic_vector(31 downto 0);
signal key_mem : key_mem_t;
-- xtea processing logic --
type xtea_t is record
done : std_ulogic_vector(1 downto 0); -- multi-cycle operation SREG
opa : std_ulogic_vector(31 downto 0); -- input operand a
opb : std_ulogic_vector(31 downto 0); -- input operand b
sum : std_ulogic_vector(31 downto 0); -- round key buffer
res : std_ulogic_vector(31 downto 0); -- operation results
end record;
signal xtea : xtea_t;
-- xtea helper --
signal tmp_a, tmp_b, tmp_x, tmp_y, tmp_z, tmp_r : std_ulogic_vector(31 downto 0);
begin
-- **************************************************************************************************************************
-- This controller is required to handle the CFU <-> CPU interface. Do not modify!
-- This controller is required to handle the CFU <-> CPU interface.
-- **************************************************************************************************************************
-- CFU Controller -------------------------------------------------------------------------
@ -122,13 +132,13 @@ begin
control.busy <= '0';
elsif rising_edge(clk_i) then
res_o <= (others => '0'); -- default; all CPU co-processor outputs are logically OR-ed
if (control.busy = '0') then -- idle
if (start_i = '1') then -- trigger new CFU operation
control.busy <= '1';
if (control.busy = '0') then -- CFU is idle
control.busy <= start_i; -- trigger new CFU operation
else -- CFU operation in progress
res_o <= control.result; -- output result only if CFU is processing; has to be all-zero otherwise
if (control.done = '1') or (ctrl_i.cpu_trap = '1') then -- operation done or abort if trap (exception)
control.busy <= '0';
end if;
elsif (control.done = '1') or (ctrl_i.cpu_trap = '1') then -- operation done? abort if trap (exception)
res_o <= control.result; -- output result for just one cycle, CFU output has to be all-zero otherwise
control.busy <= '0';
end if;
end if;
end process cfu_control;
@ -143,7 +153,7 @@ begin
-- **************************************************************************************************************************
-- CFU Interface Documentation
-- CFU Hardware Documentation
-- **************************************************************************************************************************
-- ----------------------------------------------------------------------------------------
@ -221,7 +231,6 @@ begin
--
-- [NOTE] If the <control.done> signal is not set within a bound time window (default = 512 cycles) the CFU operation is
-- automatically terminated by the hardware and an illegal instruction exception is raised. This feature can also be
-- be used to implement custom CFU exceptions (for example to indicate invalid CFU operations).
-- ----------------------------------------------------------------------------------------
-- CFU-Internal Control and Status Registers (CFU-CSRs)
@ -241,145 +250,130 @@ begin
-- **************************************************************************************************************************
-- Actual CFU User Logic Example - replace this with your custom logic
-- Actual CFU User Logic Example: XTEA - Extended Tiny Encryption Algorithm (replace this with your custom logic)
-- **************************************************************************************************************************
-- CFU-Internal Control and Status Registers (CFU-CSRs) -----------------------------------
-- This CFU example implements the Extended Tiny Encryption Algorithm (XTEA).
-- The CFU provides 5 custom instructions to accelerate encryption and decryption using dedicated hardware.
-- The RTL code is not optimized (not for area, not for clock speed, not for performance) and was
-- implemented according to a software C reference (https://de.wikipedia.org/wiki/Extended_Tiny_Encryption_Algorithm).
-- CFU-Internal Control and Status Registers (CFU-CSRs): 128-Bit Key Storage --------------
-- -------------------------------------------------------------------------------------------
-- synchronous write access --
csr_write_access: process(rstn_i, clk_i)
begin
if (rstn_i = '0') then
cfu_csr_0 <= (others => '0');
cfu_csr_1 <= (others => '0');
key_mem <= (others => (others => '0'));
elsif rising_edge(clk_i) then
if (csr_we_i = '1') and (csr_addr_i = "00") then
cfu_csr_0 <= csr_wdata_i;
end if;
if (csr_we_i = '1') and (csr_addr_i = "01") then
cfu_csr_1 <= csr_wdata_i;
if (csr_we_i = '1') then
key_mem(to_integer(unsigned(csr_addr_i))) <= csr_wdata_i;
end if;
end if;
end process csr_write_access;
-- asynchronous read access --
csr_read_access: process(csr_addr_i, cfu_csr_0, cfu_csr_1)
begin
case csr_addr_i is
when "00" => csr_rdata_o <= cfu_csr_0; -- CSR0: simple read/write register
when "01" => csr_rdata_o <= cfu_csr_1; -- CSR1: simple read/write register
when "10" => csr_rdata_o <= x"1234abcd"; -- CSR2: hardwired/read-only register
when others => csr_rdata_o <= (others => '0'); -- CSR3: not implemented
end case;
end process csr_read_access;
csr_rdata_o <= key_mem(to_integer(unsigned(csr_addr_i)));
-- Iterative Multiply-Add Unit ------------------------------------------------------------
-- XTEA Processing Core ------------------------------------------------------------------
-- -------------------------------------------------------------------------------------------
-- iteration control --
madd_control: process(rstn_i, clk_i)
xtea_core: process(rstn_i, clk_i)
begin
if (rstn_i = '0') then
madd.sreg <= (others => '0');
xtea.done <= (others => '0');
xtea.opa <= (others => '0');
xtea.opb <= (others => '0');
xtea.sum <= (others => '0');
xtea.res <= (others => '0');
elsif rising_edge(clk_i) then
-- operation trigger --
if (control.busy = '0') and -- CFU is idle (ready for next operation)
(start_i = '1') and -- CFU is actually triggered by a custom instruction word
(control.rtype = r4type_c) and -- this is a R4-type instruction
(control.funct3(2 downto 1) = "00") then -- trigger only for specific funct3 values
madd.sreg(0) <= '1';
else
madd.sreg(0) <= '0';
-- shift register for computation delay --
xtea.done(0) <= '0'; -- default
xtea.done(1) <= xtea.done(0);
-- trigger new operation --
if (start_i = '1') and (control.rtype = r3type_c) then -- execution trigger and correct instruction type
xtea.opa <= rs1_i; -- buffer input operand rs1 (for improved physical timing)
xtea.opb <= rs2_i; -- buffer input operand rs2 (for improved physical timing)
xtea.done(0) <= '1'; -- result is available in the 2nd cycle
end if;
-- simple shift register for tracking operation --
madd.sreg(madd.sreg'left downto 1) <= madd.sreg(madd.sreg'left-1 downto 0); -- shift left
-- data processing --
if (xtea.done(0) = '1') then -- second-stage execution trigger
-- update "sum" round key --
if (control.funct3(2) = '1') then -- initialize
xtea.sum <= xtea.opa; -- set initial round key
elsif (control.funct3(1 downto 0) = xtea_enc_v0_c(1 downto 0)) then -- encrypt v0
xtea.sum <= std_ulogic_vector(unsigned(xtea.sum) + unsigned(xtea_delta_c));
elsif (control.funct3(1 downto 0) = xtea_dec_v1_c(1 downto 0)) then -- decrypt v1
xtea.sum <= std_ulogic_vector(unsigned(xtea.sum) - unsigned(xtea_delta_c));
end if;
-- process "v" operands --
if (control.funct3(1) = '0') then -- encrypt
xtea.res <= std_ulogic_vector(unsigned(tmp_b) + unsigned(tmp_r));
else -- decrypt
xtea.res <= std_ulogic_vector(unsigned(tmp_b) - unsigned(tmp_r));
end if;
end if;
end if;
end process madd_control;
end process xtea_core;
-- processing has reached last stage (= done) when sreg's MSB is set --
madd.done <= madd.sreg(madd.sreg'left);
-- arithmetic core --
madd_core: process(rstn_i, clk_i)
begin
if (rstn_i = '0') then
madd.opa <= (others => '0');
madd.opb <= (others => '0');
madd.opc <= (others => '0');
madd.mul <= (others => '0');
madd.res <= (others => '0');
elsif rising_edge(clk_i) then
-- stage 0: buffer input operands --
madd.opa <= rs1_i;
madd.opb <= rs2_i;
madd.opc <= rs3_i;
-- stage 1: multiply rs1 and rs2 --
madd.mul <= std_ulogic_vector(unsigned(madd.opa) * unsigned(madd.opb));
-- stage 2: add rs3 to multiplication result --
madd.res <= std_ulogic_vector(unsigned(madd.mul) + unsigned(madd.opc));
end if;
end process madd_core;
-- helpers --
tmp_a <= xtea.opb when (control.funct3(0) = '0') else xtea.opa; -- v1 / v0 select
tmp_b <= xtea.opa when (control.funct3(0) = '0') else xtea.opb; -- v0 / v1 select
tmp_x <= xtea.opb(27 downto 0) & "0000" when (control.funct3(0) = '0') else xtea.opa(27 downto 0) & "0000"; -- v << 4
tmp_y <= "00000" & xtea.opb(31 downto 5) when (control.funct3(0) = '0') else "00000" & xtea.opa(31 downto 5); -- v >> 5
tmp_z <= key_mem(to_integer(unsigned(xtea.sum(1 downto 0)))) when (control.funct3(0) = '0') else -- key[sum & 3]
key_mem(to_integer(unsigned(xtea.sum(12 downto 11)))); -- key[(sum >> 11) & 3]
tmp_r <= std_ulogic_vector(unsigned(tmp_x xor tmp_y) + unsigned(tmp_a)) xor std_ulogic_vector(unsigned(xtea.sum) + unsigned(tmp_z));
-- Output select --------------------------------------------------------------------------
-- Function Result Select -----------------------------------------------------------------
-- -------------------------------------------------------------------------------------------
out_select: process(control, rs1_i, rs2_i, rs3_i, rs4_i, madd)
result_select: process(control, xtea)
begin
case control.rtype is
when r3type_c => -- R3-type instructions
when r3type_c => -- R3-type instructions; function select via "funct7" and "funct3"
-- ----------------------------------------------------------------------
-- This is a simple ALU that implements four pure-combinatorial instructions.
-- The actual function is selected by the "funct3" bit-field.
case control.funct3 is
when "000" => -- funct3 = "000": bit-reversal of rs1
control.result <= bit_rev_f(rs1_i);
case control.funct3 is -- Just check "funct3" here; "funct7" bit-field is ignored
when xtea_enc_v0_c | xtea_enc_v1_c | xtea_dec_v0_c | xtea_dec_v1_c => -- encryption/decryption
control.result <= xtea.res; -- processing result
control.done <= xtea.done(xtea.done'left); -- multi-cycle processing done when set
when others => -- initialization and all further unspecified operations
control.result <= (others => '0'); -- just output zero
control.done <= '1'; -- pure-combinatorial, so we are done "immediately"
when "001" => -- funct3 = "001": XNOR input operands
control.result <= not (rs1_i xor rs2_i);
control.done <= '1'; -- pure-combinatorial, so we are done "immediately"
when others => -- not implemented
control.result <= (others => '0');
control.done <= '0'; -- this will cause an illegal instruction exception after timeout
end case;
when r4type_c => -- R4-type instructions
when r4type_c => -- R4-type instructions; function select via "funct3"
-- ----------------------------------------------------------------------
-- This is an iterative multiply-and-add unit that requires several cycles for processing.
-- The actual function is selected by the lowest bit of the "funct3" bit-field.
case control.funct3 is
when "000" => -- funct3 = "000": multiply-add low-part result: rs1*rs2+r3 [31:0]
control.result <= madd.res(31 downto 0);
control.done <= madd.done; -- iterative, wait for unit to finish
when "001" => -- funct3 = "001": multiply-add high-part result: rs1*rs2+r3 [63:32]
control.result <= madd.res(63 downto 32);
control.done <= madd.done; -- iterative, wait for unit to finish
when others => -- not implemented
control.result <= (others => '0');
control.done <= '0'; -- this will cause an illegal instruction exception after timeout
end case;
control.result <= (others => '0'); -- no logic implemented
control.done <= '0'; -- this will cause an illegal instruction exception after timeout
when r5typeA_c => -- R5-type instruction A
-- ----------------------------------------------------------------------
-- No function/immediate bit-fields are available for this instruction type.
-- Hence, there is just one operation that can be implemented.
control.result <= rs1_i and rs2_i and rs3_i and rs4_i; -- AND-all
control.done <= '1'; -- pure-combinatorial, so we are done "immediately"
control.result <= (others => '0'); -- no logic implemented
control.done <= '0'; -- this will cause an illegal instruction exception after timeout
when r5typeB_c => -- R5-type instruction B
-- ----------------------------------------------------------------------
-- No function/immediate bit-fields are available for this instruction type.
-- Hence, there is just one operation that can be implemented.
control.result <= rs1_i xor rs2_i xor rs3_i xor rs4_i; -- XOR-all
control.done <= '1'; -- pure-combinatorial, so we are done "immediately"
control.result <= (others => '0'); -- no logic implemented
control.done <= '0'; -- this will cause an illegal instruction exception after timeout
when others => -- undefined
-- ----------------------------------------------------------------------
control.result <= (others => '0');
control.done <= '0';
control.result <= (others => '0'); -- no logic implemented
control.done <= '0'; -- this will cause an illegal instruction exception after timeout
end case;
end process out_select;
end process result_select;
end neorv32_cpu_cp_cfu_rtl;

View file

@ -52,7 +52,7 @@ package neorv32_package is
-- Architecture Constants -----------------------------------------------------------------
-- -------------------------------------------------------------------------------------------
constant hw_version_c : std_ulogic_vector(31 downto 0) := x"01090608"; -- hardware version
constant hw_version_c : std_ulogic_vector(31 downto 0) := x"01090609"; -- hardware version
constant archid_c : natural := 19; -- official RISC-V architecture ID
constant XLEN : natural := 32; -- native data path width

View file

@ -36,7 +36,7 @@
/**********************************************************************//**
* @file demo_cfu/main.c
* @author Stephan Nolting
* @brief Example program showing how to use the CFU's custom instructions.
* @brief Example program showing how to use the CFU's custom instructions (XTEA example).
* @note Take a look at the highly-commented "hardware-counterpart" of this CFU
* example in 'rtl/core/neorv32_cpu_cp_cfu.vhd'.
**************************************************************************/
@ -49,8 +49,59 @@
/**@{*/
/** UART BAUD rate */
#define BAUD_RATE 19200
/** Number of test cases per CFU instruction */
#define TESTCASES 4
/** Number XTEA cycles */
#define XTEA_CYCLES 20
/** Input data size (in number of 32-bit words), has to be even */
#define DATA_NUM 64
/**@}*/
/**********************************************************************//**
* @name Define macros for easy custom instruction wrapping
**************************************************************************/
/**@{*/
#define xtea_hw_init(sum) neorv32_cfu_r3_instr(0b0000000, 0b100, sum, 0)
#define xtea_hw_enc_v0_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b000, v0, v1)
#define xtea_hw_enc_v1_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b001, v0, v1)
#define xtea_hw_dec_v0_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b010, v0, v1)
#define xtea_hw_dec_v1_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b011, v0, v1)
/**@}*/
// The CFU custom instructions can be used as plain C functions as they are simple "intrinsics".
// There are 4 "prototype primitives" for the CFU instructions (define in sw/lib/include/neorv32_cfu.h):
//
// > neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) - for r3-type instructions (custom-0 opcode)
// > neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) - for r4-type instructions (custom-1 opcode)
// > neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) - for r5-type instruction A (custom-2 opcode)
// > neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) - for r5-type instruction B (custom-3 opcode)
//
// Every instance of these functions is converted into a single 32-bit RISC-V instruction word
// without any calling overhead at all (see the generated assembly code).
//
// The "rs*" source operands can be literals, variables, function return values, ... - you name it.
// The 7-bit immediate ("funct7") and the 3-bit immediate ("funct3") values can be used to pass
// compile-time static literal data to the CFU or to do a fine-grained function selection.
//
// Each "neorv32_cfu_r*" function returns a 32-bit data word of type uint32_t that represents
// the processing result of the according instruction.
/**********************************************************************//**
* @name Global variables
**************************************************************************/
/**@{*/
/** XTEA delta (round-key update) */
const uint32_t xtea_delta = 0x9e3779b9;
/** Encryption/decryption key (128-bit) */
const uint32_t key[4] = {0x207230ba, 0x1ffba710, 0xc45271ef, 0xdd01768a};
/** Encryption input data */
uint32_t input_data[DATA_NUM];
/** Encryption results */
uint32_t cypher_data_sw[DATA_NUM], cypher_data_hw[DATA_NUM];
/** Decryption results */
uint32_t plain_data_sw[DATA_NUM], plain_data_hw[DATA_NUM];
/** Timing data */
uint32_t time_enc_sw, time_enc_hw, time_dec_sw, time_dec_hw;
/**@}*/
@ -72,173 +123,235 @@ uint32_t xorshift32(void) {
/**********************************************************************//**
* Main function
* XTEA encryption - software reference
*
* @note This program requires the CFU and UART0.
* Source: https://de.wikipedia.org/wiki/Extended_Tiny_Encryption_Algorithm
*
* @param[in] num_cycles Number of encryption cycles.
* @param[in,out] v Encryption data/result array (2x32-bit).
* @param[in] k Encryption key array (4x32-bit).
**************************************************************************/
void xtea_sw_encipher(uint32_t num_cycles, uint32_t *v, const uint32_t k[4]) {
uint32_t i = 0;
uint32_t v0 = v[0];
uint32_t v1 = v[1];
uint32_t sum = 0;
for (i=0; i < num_cycles; i++) {
v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + k[sum & 3]);
sum += xtea_delta;
v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + k[(sum>>11) & 3]);
}
v[0] = v0;
v[1] = v1;
}
/**********************************************************************//**
* XTEA decryption - software reference
*
* Source: https://de.wikipedia.org/wiki/Extended_Tiny_Encryption_Algorithm
*
* @param[in] num_cycles Number of encryption cycles.
* @param[in,out] v Decryption data/result array (2x32-bit).
* @param[in] k Decryption key array (4x32-bit).
**************************************************************************/
void xtea_sw_decipher(unsigned int num_cycles, uint32_t *v, const uint32_t k[4]) {
uint32_t i = 0;
uint32_t v0 = v[0];
uint32_t v1 = v[1];
uint32_t sum = xtea_delta * num_cycles;
for (i=0; i < num_cycles; i++) {
v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + k[(sum>>11) & 3]);
sum -= xtea_delta;
v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + k[sum & 3]);
}
v[0] = v0;
v[1] = v1;
}
/**********************************************************************//**
* Main function: run pure-SW XTEA and compare with HW-XTEA
*
* @note This program requires the CFU, UART0 and the Zicntr ISA extension.
*
* @return 0 if execution was successful
**************************************************************************/
int main() {
uint32_t i, rs1, rs2, rs3, rs4;
uint32_t i, j;
uint32_t v[2];
// initialize NEORV32 run-time environment
neorv32_rte_setup();
// setup UART at default baud rate, no interrupts
neorv32_uart0_setup(BAUD_RATE, 0);
// check if UART0 is implemented
if (neorv32_uart0_available() == 0) {
return 1; // UART0 not available, exit
return -1; // UART0 not available, exit
}
// check if the CFU is implemented at all (the CFU is wrapped in the core's "Zxcfu" ISA extension)
// setup UART0 at default baud rate, no interrupts
neorv32_uart0_setup(BAUD_RATE, 0);
// check if the CFU is implemented (the CFU is wrapped in the core's "Zxcfu" ISA extension)
if (neorv32_cpu_cfu_available() == 0) {
neorv32_uart0_printf("ERROR! CFU ('Zxcfu' ISA extensions) not implemented!\n");
return 1;
return -1;
}
// check if the CPU base counters are implemented
if ((neorv32_cpu_csr_read(CSR_MXISA) & (1 << CSR_MXISA_ZICNTR)) == 0) {
neorv32_uart0_printf("ERROR! Base counters ('Zicntr' ISA extensions) not implemented!\n");
return -1;
}
// check if data size configuration is even
if ((DATA_NUM & 1) != 0) {
neorv32_uart0_printf("ERROR! DATA_NUM has to be even!\n");
return -1;
}
// intro
neorv32_uart0_printf("\n<<< NEORV32 Custom Functions Unit (CFU) - Custom Instructions Example >>>\n\n");
neorv32_uart0_printf("[NOTE] This program assumes the _default_ CFU hardware module, which\n"
" implements simple and exemplary data processing instructions.\n\n");
/*
The CFU custom instructions can be used as plain C functions as they are simple "intrinsics".
There are 4 "prototype primitives" for the CFU instructions (define in sw/lib/include/neorv32_cfu.h):
> neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) - for r3-type instructions (custom-0 opcode)
> neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) - for r4-type instructions (custom-1 opcode)
> neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) - for r5-type instruction A (custom-2 opcode)
> neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) - for r5-type instruction B (custom-3 opcode)
Every "call" of these functions is turned into a single 32-bit ISC-V instruction word
without any calling overhead at all (see the generated assembly code).
The "rs*" operands can be literals, variables, function return values, ... - you name it.
The 7-bit immediate ("funct7") and the 3-bit immediate ("funct3") values can be used to pass
_compile-time static_ literals to the CFU or to do a fine-grained function selection.
Each "neorv32_cfu_r*" function returns a 32-bit data word of type uint32_t that represents
the result of the according instruction.
*/
neorv32_uart0_printf("[NOTE] This program assumes the default CFU hardware module that\n"
" implements the Extended Tiny Encryption Algorithm (XTEA).\n\n");
// ----------------------------------------------------------
// R3-type instructions (up to 1024 custom instructions)
// XTEA example
// ----------------------------------------------------------
neorv32_uart0_printf("\n--- CFU R3-Type: Bit-Reversal Instruction ---\n");
for (i=0; i<TESTCASES; i++) {
rs1 = xorshift32();
neorv32_uart0_printf("%u: neorv32_cfu_r3_instr( funct7=0b1111111, funct3=0b000, [rs1]=0x%x, [rs2]=0x%x ) = ", i, rs1, 0);
// here we are setting the funct7 bit-field to all-one; however, this is not used at all by the default CFU hardware module.
neorv32_uart0_printf("0x%x\n", neorv32_cfu_r3_instr(0b1111111, 0b000, rs1, 0));
}
// set XTEA-CFU key storage (the CFU CSRs)
neorv32_cpu_csr_write(CSR_CFUREG0, key[0]);
neorv32_cpu_csr_write(CSR_CFUREG1, key[1]);
neorv32_cpu_csr_write(CSR_CFUREG2, key[2]);
neorv32_cpu_csr_write(CSR_CFUREG3, key[3]);
neorv32_uart0_printf("\n--- CFU R3-Type: XNOR Instruction ---\n");
for (i=0; i<TESTCASES; i++) {
rs1 = xorshift32();
rs2 = xorshift32();
neorv32_uart0_printf("%u: neorv32_cfu_r3_instr( funct7=0b0000000, funct3=0b001, [rs1]=0x%x, [rs2]=0x%x ) = ", i, rs1, rs2);
neorv32_uart0_printf("0x%x\n", neorv32_cfu_r3_instr(0b0000000, 0b001, rs1, rs2));
// read-back CSRs and print key
neorv32_uart0_printf("XTEA key: 0x%x%x%x%x\n\n",
neorv32_cpu_csr_read(CSR_CFUREG0),
neorv32_cpu_csr_read(CSR_CFUREG1),
neorv32_cpu_csr_read(CSR_CFUREG2),
neorv32_cpu_csr_read(CSR_CFUREG3));
// generate "random" data for the plain text
for (i=0; i<DATA_NUM; i++) {
input_data[i] = xorshift32();
}
// ----------------------------------------------------------
// R4-type instructions (up to 8 custom instructions)
// XTEA encryption
// ----------------------------------------------------------
// You can use macros to simplify the usage of the custom instructions.
#define madd_lo(a, b, c) neorv32_cfu_r4_instr(0b000, a, b, c)
#define madd_hi(a, b, c) neorv32_cfu_r4_instr(0b001, a, b, c)
// encryption using software only
neorv32_uart0_printf("XTEA SW encryption (%u rounds, %u words)...\n", 2*XTEA_CYCLES, DATA_NUM);
neorv32_uart0_printf("\n--- CFU R4-Type: Multiply-Add (Low-Part) Instruction ---\n");
for (i=0; i<TESTCASES; i++) {
rs1 = xorshift32();
rs2 = xorshift32();
rs3 = xorshift32();
neorv32_uart0_printf("%u: neorv32_cfu_r4_instr( funct3=0b000, [rs1]=0x%x, [rs2]=0x%x, [rs3]=0x%x ) = ", i, rs1, rs2, rs3);
neorv32_uart0_printf("0x%x\n", madd_lo(rs1, rs2, rs3));
neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
for (i=0; i<(DATA_NUM/2); i++) {
v[0] = input_data[i*2+0];
v[1] = input_data[i*2+1];
xtea_sw_encipher(XTEA_CYCLES, v, key);
cypher_data_sw[i*2+0] = v[0];
cypher_data_sw[i*2+1] = v[1];
}
time_enc_sw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing
neorv32_uart0_printf("\n--- CFU R4-Type: Multiply-Add (High-Part) Instruction ---\n");
for (i=0; i<TESTCASES; i++) {
rs1 = xorshift32();
rs2 = xorshift32();
rs3 = xorshift32();
neorv32_uart0_printf("%u: neorv32_cfu_r4_instr( funct3=0b001, [rs1]=0x%x, [rs2]=0x%x, [rs3]=0x%x ) = ", i, rs1, rs2, rs3);
neorv32_uart0_printf("0x%x\n", madd_hi(rs1, rs2, rs3));
// encryption using the XTEA CFU
neorv32_uart0_printf("XTEA HW encryption (%u rounds, %u words)...\n", 2*XTEA_CYCLES, DATA_NUM);
neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
for (i=0; i<(DATA_NUM/2); i++) {
v[0] = input_data[i*2+0];
v[1] = input_data[i*2+1];
xtea_hw_init(0);
for (j=0; j<XTEA_CYCLES; j++) {
v[0] = xtea_hw_enc_v0_step(v[0], v[1]);
v[1] = xtea_hw_enc_v1_step(v[0], v[1]);
}
cypher_data_hw[i*2+0] = v[0];
cypher_data_hw[i*2+1] = v[1];
}
time_enc_hw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing
// ----------------------------------------------------------
// R5-type instruction A (only 1 custom instruction)
// ----------------------------------------------------------
neorv32_uart0_printf("\n--- CFU R5-Type A: AND-All Instruction ---\n");
for (i=0; i<TESTCASES; i++) {
rs1 = xorshift32();
rs2 = xorshift32();
rs3 = xorshift32();
rs4 = xorshift32();
neorv32_uart0_printf("%u: neorv32_cfu_r5_instr_a( [rs1]=0x%x, [rs2]=0x%x, [rs3]=0x%x, [rs3]=0x%x ) = ", i, rs1, rs2, rs3, rs4);
neorv32_uart0_printf("0x%x\n", neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4));
// compare results
neorv32_uart0_printf("Comparing results... ");
for (i=0; i<DATA_NUM; i++) {
if (cypher_data_sw[i] != cypher_data_hw[i]) {
neorv32_uart0_printf("FAILED\n");
return -1;
}
}
neorv32_uart0_printf("OK\n");
// ----------------------------------------------------------
// R5-type instruction B (only 1 custom instruction)
// XTEA decryption
// ----------------------------------------------------------
neorv32_uart0_printf("\n");
neorv32_uart0_printf("\n--- CFU R5-Type B: XOR-All Instruction ---\n");
for (i=0; i<TESTCASES; i++) {
rs1 = xorshift32();
rs2 = xorshift32();
rs3 = xorshift32();
rs4 = xorshift32();
neorv32_uart0_printf("%u: neorv32_cfu_r5_instr_b( [rs1]=0x%x, [rs2]=0x%x, [rs3]=0x%x, [rs3]=0x%x ) = ", i, rs1, rs2, rs3, rs4);
neorv32_uart0_printf("0x%x\n", neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4));
// decryption using software only
neorv32_uart0_printf("XTEA SW decryption (%u rounds, %u words)...\n", 2*XTEA_CYCLES, DATA_NUM);
neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
for (i=0; i<(DATA_NUM/2); i++) {
v[0] = cypher_data_sw[i*2+0];
v[1] = cypher_data_sw[i*2+1];
xtea_sw_decipher(XTEA_CYCLES, v, key);
plain_data_sw[i*2+0] = v[0];
plain_data_sw[i*2+1] = v[1];
}
time_dec_sw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing
// ----------------------------------------------------------
// Unimplemented R3-type (=illegal) instruction
// ----------------------------------------------------------
// decryption using the XTEA CFU
neorv32_uart0_printf("XTEA HW decryption (%u rounds, %u words)...\n", 2*XTEA_CYCLES, DATA_NUM);
neorv32_uart0_printf("\n--- CFU Unimplemented (= illegal) R3-Type ---\n");
for (i=0; i<TESTCASES; i++) {
rs1 = xorshift32();
rs2 = xorshift32();
// this funct3 is NOT implemented by the default CFU hardware causing an illegal instruction exception
// due to a multi-cycle execution timeout (processing does not complete within a bound time)
neorv32_uart0_printf("%u: neorv32_cfu_r3_instr( funct7=0b0000000, funct3=0b111, [rs1]=0x%x, [rs2]=0x%x ) = ", i, rs1, rs2);
neorv32_uart0_printf("0x%x\n", neorv32_cfu_r3_instr(0b0000000, 0b111, rs1, rs2));
neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
for (i=0; i<(DATA_NUM/2); i++) {
v[0] = cypher_data_hw[i*2+0];
v[1] = cypher_data_hw[i*2+1];
xtea_hw_init(XTEA_CYCLES * xtea_delta);
for (j=0; j<XTEA_CYCLES; j++) {
v[1] = xtea_hw_dec_v1_step(v[0], v[1]);
v[0] = xtea_hw_dec_v0_step(v[0], v[1]);
}
plain_data_hw[i*2+0] = v[0];
plain_data_hw[i*2+1] = v[1];
}
time_dec_hw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing
// compare results
neorv32_uart0_printf("Comparing results... ");
for (i=0; i<DATA_NUM; i++) {
if (plain_data_sw[i] != plain_data_hw[i]) {
neorv32_uart0_printf("FAILED\n");
return -1;
}
}
neorv32_uart0_printf("OK\n");
// ----------------------------------------------------------
// CFU-internal control and status registers (CFU-CSRs)
// Print benchmarking results
// ----------------------------------------------------------
neorv32_uart0_printf("\n--- CFU CSRs: Control and Status Registers ---\n");
neorv32_uart0_printf("\nExecution benchmarking:\n");
neorv32_uart0_printf("ENC SW = %u cycles\n", time_enc_sw);
neorv32_uart0_printf("ENC HW = %u cycles\n", time_enc_hw);
neorv32_uart0_printf("DEC SW = %u cycles\n", time_dec_sw);
neorv32_uart0_printf("DEC HW = %u cycles\n", time_dec_hw);
neorv32_cpu_csr_write(CSR_CFUREG0, 0xffffffff); // just write some exemplary data to CSR
neorv32_uart0_printf("CFU-CSR 0 = 0x%x\n", neorv32_cpu_csr_read(CSR_CFUREG0)); // read-back data from CSR
neorv32_cpu_csr_write(CSR_CFUREG1, 0x12345678);
neorv32_uart0_printf("CFU-CSR 1 = 0x%x\n", neorv32_cpu_csr_read(CSR_CFUREG1));
neorv32_cpu_csr_write(CSR_CFUREG2, 0x22334455);
neorv32_uart0_printf("CFU-CSR 2 = 0x%x\n", neorv32_cpu_csr_read(CSR_CFUREG2));
neorv32_cpu_csr_write(CSR_CFUREG3, 0xabcdabcd);
neorv32_uart0_printf("CFU-CSR 3 = 0x%x\n", neorv32_cpu_csr_read(CSR_CFUREG3));
neorv32_uart0_printf("\nCFU demo program completed.\n");
return 0;