[docs] extended section "Rational"

Why a multi-cycle architecture?
2025-04-24 22:27:21 -04:00 · 2022-07-21 21:31:01 +02:00 · d84f56c581
commit d84f56c581
parent eed2be4bca
1 changed file with 35 additions and 0 deletions
--- a/docs/datasheet/rationale.adoc
+++ b/docs/datasheet/rationale.adoc
@@ -68,3 +68,38 @@ Furthermore, the NEORV32 pays special focus on _execution safety_ using <<_full_
 provide fall-backs for _everything that could go wrong_. This includes malformed instruction words, privilege escalations
 and even memory accesses that are checked for address space holes and deterministic response times of memory-mapped
 devices. Precise exceptions allow a defined and fully-synchronized state of the CPU at all times and in every situation.
+
+
+**A multi-cycle architecture?!?**
+
+Most mainstream CPUs out there are pipelined architectures to increase throughput. In contrast, most CPUs used for
+teaching are single-cycle designs since they are probably the easiest to understand. But what about
+multi-cycle architectures?
+
+In terms of energy, throughput, area and maximum clock frequency, multi-cycle architectures are somewhere in between
+single-cycle and fully-pipelined designs: they provide higher throughput and clock speed when compared to their
+single-cycle counterparts and have less complexity (= area) than fully-pipelined designs. I decided to use the
+multi-cycle approach for the following reasons:
+
+* Multi-cycle architectures are damn small! There is no need for pipeline hazard detection and resolution logic
+(e.g. forwarding), plus you can "re-use" parts of the core to do several tasks (e.g. the ALU is used for the actual data
+processing, but also for address generation, branch condition checks and branch target computation).
+* Single-cycle architectures require memories that can be read asynchronously - something that is not feasible to implement
+in real-world applications (e.g. FPGA block RAM is entirely synchronous). Furthermore, such designs usually have a very (very!)
+long critical path, which tremendously reduces the maximum operating frequency.
+* Pipelined designs increase performance by having several instructions "in flight" at the same time. But this also means
+there is some kind of "out-of-order" behavior: if an instruction at the end of the pipeline causes an exception,
+all the instructions in earlier stages have to be invalidated. Potential architecture state changes have to be _undone_,
+requiring additional (-> exception-handling) logic. In a multi-cycle architecture this situation cannot occur because only a
+single instruction is "in flight" at a time.
+* Having only a single instruction in flight not only reduces hardware costs, it also simplifies simulation/verification/debugging,
+state preservation/restoration during exceptions and extensibility (no need to care about pipeline hazards) - but of course at the
+cost of reduced throughput.
+* To partly counteract this loss of performance, the NEORV32 CPU uses a _mixed_ approach: instruction fetch (front-end) and
+instruction execution (back-end) are de-coupled to operate independently of each other. Data is exchanged via a queue,
+forming a simple 2-stage pipeline. Each "pipeline" stage in turn is implemented as a multi-cycle architecture to simplify
+the hardware and to provide _precise_ state control (e.g. during exceptions).
+
+.CPU Architecture Details
+[TIP]
+Want to know more? Check out the description in the CPU's <<_architecture>> section.