Previously, when an access touched MMIO space after splitting, we
generated the exception in the MisalignBuffer, which prevented the
exception address from being written to the ExceptionBuffer.
Therefore, we now generate the exception in the pipeline so that the
exception address can be written correctly.
Misaligned loads that cause a TLB miss are no longer allowed to enter
the LoadMisalignBuffer, because they would make subsequent exception
addresses incorrect:
the original address of the misaligned load is 0x07, while the first
request address after splitting is 0x00.
Thus, when a page fault occurs, 0x00 would be reported as the
exception address instead of the original 0x07.
---
Stores have always been handled this way and do not require modification.
**Bug Trigger:** In a self-modifying program, the program modifies its
own instructions in a region where PBMT=NC and PMA=MM. If difftest is
skipped in this case, NEMU will not execute the corresponding memory
access instruction. This causes NEMU and DUT to execute different
instructions later on, ultimately leading to an error.
**Solution:** For regions where PBMT=NC and PMA=MM, difftest should not
be skipped, since PMA=MM indicates that NEMU can perform normal
synchronization. However, for regions with PMA=IO, difftest should still
be skipped because NEMU might not be able to access the corresponding
devices. Instruction self-modification in PMA=IO regions is generally
not a concern, as such regions are typically non-writable. Therefore,
synchronization of self-modifying IO instructions is not handled here
(as doing so would be overly complex).
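A minimal sketch of the resulting skip rule, assuming illustrative
attribute bits `isNC` (PBMT=NC), `isMM` (PMA=MM), and `isIO` (PMA=IO)
for the access; these are not the real signal names:

```scala
// Hedged sketch (inside the difftest glue): skip difftest only for PMA=IO,
// since NEMU may not model the device behind it; a PBMT=NC access backed by
// main memory (PMA=MM) stays visible so NEMU can synchronize normally.
val skipDifftest = isIO
val syncDifftest = isNC && isMM // explicitly not skipped
```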
* MMIO or NC accesses should report a `Hardware Error` when the response carries `nderr`
* the LoadUnit should report a `Hardware Error` when the load is to be
`delay kill`ed by fast replay
The prior design reassigned `io.lsq.ldin.bits.rep_info.need_rep` to 0
when the source came from the MisalignBuffer, preventing cancellation
of rar/raw enqueue requests during misaligned-instruction reissue.
Thus, we must use `io.misalign_ldout.bits.rep_info.need_rep` to
determine whether to revoke rar/raw enqueue requests when the source is
the MisalignBuffer.
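A minimal sketch of the intended selection, assuming a flag
`s3_frm_mis_buf` that marks a MisalignBuffer reissue:

```scala
// Pick the replay flag by source; only the misalign_ldout copy survives the
// need_rep reassignment described above.
val s3_need_rep = Mux(s3_frm_mis_buf,
  io.misalign_ldout.bits.rep_info.need_rep,
  io.lsq.ldin.bits.rep_info.need_rep)
val s3_revoke_rar_raw = s3_valid && s3_need_rep // cancel rar/raw enqueue
```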
For misaligned accesses, if an access after the split falls into `nc`
space, a misaligned exception should also be generated.
Co-authored-by: Yanqin Li <maxpicca@qq.com>
PerfCCT (performance counter commit trace) is an instruction-level
granularity perf counter, similar to GEM5's.
How to use it:
1. Build with the "WITH_CHISELDB=1" argument
2. Run with "--dump-db --dump-select-db lifetime" to get the database
3. To visualize instruction lifetimes, run "python3 scripts/perfcct.py
"the-db-file-path" -p 1 -v | less"
4. The analysis script currently lives in the XS-GEM5 repo, see
https://github.com/OpenXiangShan/GEM5/blob/xs-dev/util/ClockAnalysis.py
How it works:
1. Allocate one unique tag, "seqNum" (as in GEM5), for each instruction
at the fetch stage
2. Pass the "seqNum" through each pipeline stage
3. Record perf data through the DPI-C interface
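A minimal sketch of step 1, assuming a free-running 64-bit counter and
a hypothetical `fetchFire` pulse per fetched instruction:

```scala
// Allocate a unique, monotonically increasing seqNum at fetch; the tag is
// then carried in the uop bundle through every stage and sampled via DPI-C.
val seqNum = RegInit(0.U(64.W))
when (fetchFire) { seqNum := seqNum + 1.U }
```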
As the comment says, even if a `PF` is generated, an address is still
produced for `PMP/PMA` checking, which can lead to some strange
responses.
Since a previous modification
(https://github.com/OpenXiangShan/XiangShan/pull/4426) removed
`s2_exception`, `s2_uncache` was generated incorrectly.
This is now expressed with clearer semantics:
`s2_actually_uncache`: the real physical address is in uncache space.
`s2_uncache` has been retained to distinguish whether the request comes
from prefetching, which may be handled in a subsequent change by **YQ
senior sister**.
I synchronised the StoreUnit changes in this PR
(https://github.com/OpenXiangShan/XiangShan/pull/4441).
A loadAddrMisaligned exception is generated when a misaligned access
touches uncache space.
---
A misaligned load sets the loadAddrMisaligned exception flag at s0 to
ensure that it only enters the LoadMisalignBuffer and has no other side
effects.
However, this prevents `s2_uncache` from being generated properly.
Previously we used an additional `s2_un_misalign_exception` to flag
this.
Now, after examining the semantics of `s2_uncache`, those semantics can
be represented appropriately by directly removing the exception-related
signals.
Since a load that crosses a 16-byte boundary must be split and accessed
twice, it enters the `RAR Queue` twice but occupies only one `virtual
load queue` entry. In the extreme case, 36 loads that span 16 bytes can
therefore fill all 72 `RAR Queue` entries.
---
There was a problem with our previous handling: if the oldest load
instruction spanning 16 bytes enters the `replayqueue` while an
instruction in the `loadmisalignbuffer` cannot finish executing because
the `RAR Queue` is full, then the oldest load can never be issued,
because the `loadmisalignbuffer` always has an instruction in it.
---
Therefore, we use a more brute-force scheme:
when the RAR Queue is full, the misaligned load generates a rollback,
and the next load instruction the loadmisalignbuffer may accept must be
the oldest one (if it is misaligned).
* To meet timing, the RAR enqueue conditions have to be relaxed; the
worst timing paths come from `pmp` and `missQueue`. The gating is
sketched below.
* If `LoadQueueRARSize` == `VirtualLoadQueueSize`, we only need to
exclude prefetches.
* If `LoadQueueRARSize` < `VirtualLoadQueueSize`, the `s2_can_query`
situation must also be considered.
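A minimal sketch of that gating, assuming `s2_isPrefetch` and
`s2_can_query` are the signals named above:

```scala
// Elaboration-time choice: with a full-size RAR queue only prefetches are
// excluded; with a smaller queue, s2_can_query must also hold.
val s2_rar_can_enq =
  if (LoadQueueRARSize == VirtualLoadQueueSize) !s2_isPrefetch
  else !s2_isPrefetch && s2_can_query
```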
`s0_src_valid_vec` is not `s0_src_select_vec`: a bit of
`s0_src_valid_vec` is set when the corresponding input is `valid`.
Therefore, `misalign wakeup` needs to gate `s0_valid` globally.
* Because `LoadQueueRARSize == VirtualLoadQueueSize`, no additional
logic is needed for RAR enqueue
* When no fast replay is needed, the LoadUnit allocates a RAW entry
When `io.dcache.req.ready` is false, a misaligned load stalls, but
`wakeup` still fires normally and is not canceled in `s3`, which causes
the backend to receive wrong data.
`prefetch.w` sends a write request to `TLB/PMA/PMP`.
As a result, `PMA/PMP` returns a permission check (`io.pmp.st`) for the
write request.
---
Previously, we only handled the case where `prefetch.r` lacked read
permission, not the case where `prefetch.w` lacked write permission.
**So, when `prefetch.w` targets an address without write permission,
the request is still sent to the DCache, which generates an error.**
**This PR fixes that: when `PMA/PMP` asserts `io.pmp.st`, we generate
`dcache.s2_kill`.**
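A minimal sketch of the fix, assuming a flag `s2_isPrefWrite` that
distinguishes `prefetch.w` from `prefetch.r`:

```scala
// Kill the DCache access when PMP/PMA denies the prefetch: st for prefetch.w,
// ld for prefetch.r (io.pmp.* as named above; other kill terms omitted).
val s2_pmp_deny = Mux(s2_isPrefWrite, io.pmp.st, io.pmp.ld)
io.dcache.s2_kill := s2_pmp_deny
```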
# Background
## Problem
How to design a more efficient entry rule for a new load/store request
when a load/store with the same address already exists in the `ubuffer`?
* **Old Design**: Always **reject** the new request.
* **New Design**: Consider **merging** requests.
## Merge Scenarios
‼️ A new request can be merged into an existing one only if both are
`NC`.
1. **New Store Request:**
   1. **Existing Store:** Merge (the new store is younger).
   2. **Existing Load:** Reject.
2. **New Load Request:**
   1. **Existing Load:** Merge (the new load may be younger or older;
both are OK to merge).
   2. **Existing Store:** Reject.
# What does this PR do?
## 1. Entry Actions
1. **Allocate** a new entry and mark it `valid`:
   1. When there is no matching address.
2. **Allocate** a new entry and mark it `valid` and `waitSame`:
   1. When there is a matching address, and:
      * the virtual addresses and attributes are the same, and
      * the older entry is either selected to issue or already issued.
3. **Merge** into an existing entry:
   1. When there is a matching address, and:
      * the virtual addresses and attributes are the same, and
      * the older entry is **not** selected to issue or issued.
4. **Reject** the new request:
   1. When the ubuffer is full.
   2. When there is a matching address, but:
      * the virtual addresses or attributes are **different**.
**NOTE:** According to the definition in the TL-UL spec, the `mask`
must be contiguous and naturally aligned, and `addr` must correspond to
the mask. Therefore, "**same attributes**" here introduces a new
condition: the merged `mask` must remain contiguous and naturally
aligned (function `continueAndAlign`). During merging, the block offset
of `addr` must be updated synchronously in `UncacheEntry.update`.
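A minimal sketch of such a check for an 8-byte beat; the real
`continueAndAlign` may differ:

```scala
import chisel3._

// A TL-UL legal mask is contiguous, naturally aligned, and 2^k bytes wide;
// for an 8-bit mask this is a small set of patterns we can enumerate.
def continueAndAlign(mask: UInt): Bool = Seq(
  "b0000_0001", "b0000_0010", "b0000_0100", "b0000_1000",
  "b0001_0000", "b0010_0000", "b0100_0000", "b1000_0000", // 1 byte
  "b0000_0011", "b0000_1100", "b0011_0000", "b1100_0000", // 2 bytes
  "b0000_1111", "b1111_0000",                             // 4 bytes
  "b1111_1111"                                            // 8 bytes
).map(p => mask === p.U).reduce(_ || _)
```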
## 2. Handshake Mechanism Between `LoadQueueUncache (M)` and `Uncache (S)`
> `mid`: master id
>
> `sid`: slave id

**Old Design:**
- `M` sends a `req` with a **`mid`**.
- `S` receives the `req` and records the **`mid`**.
- `S` sends a `resp` with the **`mid`**.
- `M` receives the `resp` and matches it against the recorded **`mid`**.

**New Design:**
- `M` sends a `req` with a **`mid`**.
- `S` receives the `req` and responds with `{mid, sid}`.
- `M` matches on the **`mid`** and updates its record with the received
**`sid`**.
- `S` sends a `resp` with its **`sid`**.
- `M` receives the `resp` and matches it against the recorded **`sid`**.

**Benefit:** The new design allows `S` to merge requests when a new
request enters.
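A minimal sketch of the new flow from `M`'s side, assuming an `idResp`
channel carrying `{mid, sid}` and per-entry state registers as named
here:

```scala
// On request fire the entry waits for its slave id; the final response is
// matched on sid rather than mid, so S is free to merge behind the ids.
when (io.uncache.req.fire) { entry.state := s_wait_id }
when (io.uncache.idResp.valid && io.uncache.idResp.bits.mid === entry.mid) {
  entry.sid   := io.uncache.idResp.bits.sid
  entry.state := s_inflight
}
when (io.uncache.resp.fire && io.uncache.resp.bits.sid === entry.sid) {
  entry.state := s_finished
}
```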
## 3. Forwarding Mechanism
**Old Design:** Each address in the `ubuffer` is **unique**, so
forwarding is straightforward based on a match.
**New Design:**
* A single address may have up to two entries matched in the `ubuffer`.
* If it has two matched enties, it must be true that one entry is marked
`inflight` and the other entry is marked `waitSame`. In this case, the
forwarded data comes from the merged data of two entries, with the
`inflight` entry being the older one.
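A minimal sketch of the two-entry merge for a 64-bit forward, assuming
byte-granular masks; the younger `waitSame` entry's bytes override the
older `inflight` entry's:

```scala
// Per byte: take the waitSame (younger) entry's data where its mask is set,
// otherwise fall back to the inflight (older) entry.
val fwdData = VecInit((0 until 8).map { i =>
  Mux(waitSameEntry.mask(i),
      waitSameEntry.data(8 * i + 7, 8 * i),
      inflightEntry.data(8 * i + 7, 8 * i))
})
```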
## 4. Bug Fixes
1. In the `loadUnit`, `!tlbMiss` cannot be used directly as `tlbHit`,
because when `tlbValid` is false, `!tlbMiss` can still be true (see the
sketch after the figure).
2. `Uncache` state machine transition: the state indicating "**able to
send requests**" (previously `s_refill_req`, now `s_inflight`) should
be triggered not by `reqFire` but by `acquireFire`.
<img width="747" alt="image"
src="https://github.com/user-attachments/assets/75fbc761-1da8-43d9-a0e6-615cc58cefef"
/>
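A minimal sketch of fix 1 referenced above:

```scala
// A TLB hit needs a valid response; !tlbMiss alone is also true while the
// TLB response is invalid.
val tlbHit = tlbValid && !tlbMiss
```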
# Evaluation
- ✅ timing
- ✅ performance
| Type | 4B×1000 | Speedup vs IO | 1B×4096 | Speedup vs IO |
| -------------- | ------- | ------------- | ------- | ------------- |
| IO | 51026 | 1.00 | 208149 | 1.00 |
| NC | 42343 | 1.21 | 169248 | 1.23 |
| NC+OT | 20379 | 2.50 | 160101 | 1.30 |
| NC+OT+mergeOpt | 16308 | 3.13 | 126369 | 1.65 |
| cache | 1298 | 39.31 | 4410 | 47.20 |
Optimize the load unit writeback data generation logic:
* merge multi-source data at `s2`; select and expand data at `s3`
* select data using a one-hot mux instead of a shifter
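A minimal sketch of the one-hot selection for a 64-bit beat, assuming a
one-hot byte offset `offsetOH`:

```scala
import chisel3._
import chisel3.util._

// Mux1H picks the addressed byte directly from precomputed lanes instead of
// synthesizing a barrel shifter for (data >> (offset << 3)).
def selectByte(data: UInt, offsetOH: UInt): UInt =
  Mux1H(offsetOH, (0 until 8).map(i => data(8 * i + 7, 8 * i)))
```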
For `fast replay`, there is no need to request access to the `RAR/RAW
Queue`.
This prevents the `RAW Queue` from constantly ping-ponging between `not
full` and `full` due to `revoke`.
These two lines were removed because they would lead to combinational
logic loops and the condition was unwanted:
dfc474ebe1/src/main/scala/xiangshan/mem/pipeline/LoadUnit.scala (L1269-L1270)
---
**This may result in some performance gains.**
Fixed a bug where the exception signal was lost at writeback.
Previously, we intended to compare only the writeback ports that
triggered an exception and pick the oldest.
But surprisingly, I just realised that the implementation doesn't match
the comment: if a writeback port without an exception is older, the
port that triggered the exception is not selected.
Also use `s3_exception` to try to optimise timing.
1. `lqIdx` has a smaller bit width.
2. For vectors, the `robIdx` is the same across multiple `flow`s.
Previously, for vectors, we additionally used `uopIdx` for the
comparison, but in theory only `lqIdx/sqIdx` is needed.
Here we change the age comparison for vectors to `lqIdx` to make it
accurate, and change the scalar age comparison to `lqIdx` as well to
reduce cost.
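A minimal sketch of the unified comparison, assuming the circular-queue
helper `isAfter` from the XiangShan utilities is in scope:

```scala
// One rule for scalar and vector: the load whose lqIdx is later in
// circular-queue order is the younger one.
val aIsYounger = isAfter(uopA.lqIdx, uopB.lqIdx)
```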
Vector loads should behave the same as scalar loads: priority must be
arbitrated together with the instructions waiting for replay, otherwise
a deadlock can occur.
Currently, when `RAW` is full, a `RAW nack` is generated, which sends
the load to `LoadQueueReplay`.
When `RAW` is no longer full, instructions are reissued from `Replay`.
Currently, a load instruction enters the `LoadUnit` at `S2`, and if an
exception occurs, a `revoke` is generated at `S3`.
Therefore, the following can happen:
`RAW` has only one entry remaining.
An instruction in `LoadQueueReplay` is sent to `LoadUnit1`.
A load also exists in `LoadUnit0`, so `LoadUnit0` gets the `RAW` entry,
while the load in `LoadUnit1` produces a `RAW nack`.
`LoadUnit0` and `LoadUnit1` then generate a `bank conflict`, causing
`LoadUnit0` to reach `S3` and generate a `fast replay` and a `revoke`.
This makes `RAW` non-full again, so the instruction in
`LoadQueueReplay` that got the `RAW nack` is allowed to reissue.
The reissued instruction in turn creates a `bank conflict` with the
`fast replay` and, due to priority, produces another `RAW nack` for
itself.
As this loop unfolds, the same situation occurs over and over again,
leading to a jam.
`Wu Shen` suggested that this bug could be solved by allowing `fast
replay` to be generated only once.
When a DCache refill response has `denied` or `corrupt` asserted, the
loads belonging to that cache line should report a load access fault.
This is accomplished by including a `corrupt` bit in the DCache MSHR
forwarding and TileLink channel-D forwarding logic and triggering an
exception when `corrupt` is detected.
A store non-data error that comes from a DCache store miss cannot
trigger a precise access fault trap, only an imprecise bus-error
interrupt; that will be included in another commit.
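A minimal sketch of the load-side propagation, with illustrative bundle
names for the two forwarding paths:

```scala
// Fold the corrupt bit from MSHR forwarding and channel-D forwarding into a
// load access fault at writeback (other exception sources omitted).
val fwdCorrupt = (mshrFwd.valid && mshrFwd.corrupt) ||
                 (dChanFwd.valid && dChanFwd.corrupt)
io.ldout.bits.uop.exceptionVec(loadAccessFault) := fwdCorrupt
```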
1. Only if no `pf/af` occurs can an access be considered `mmio`, thus
allowing a misaligned load to generate a misalign exception.
Stores suffer from the same problem, but I will modify `StoreUnit`
later in some other way.
2. Prefetching should not produce misalignment; I had previously placed
the prefetch-handling logic in the wrong place.
Because data inside a WhenContext is not accessible from another
module, to support XSLog collection we move all XSLog calls and related
signals outside the WhenContext. For example,
`when(cond1){ XSDebug(cond2, pable) }` becomes
`XSDebug(cond1 && cond2, pable)`.
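A minimal sketch of the rewrite pattern (condition and format names
illustrative):

```scala
// before: the log call is nested in a WhenContext, so its data cannot be
// collected from another module
when (cond1) {
  XSDebug(cond2, p"value = $dataReg\n")
}
// after: hoist the guard into the predicate; the call now sits at module
// scope and its signals are visible to the XSLog collector
XSDebug(cond1 && cond2, p"value = $dataReg\n")
```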
When a misaligned access encounters MMIO, we should actually generate
the misaligned exception and write it back directly. Therefore
`s2_real_exception`, instead of `s2_exception`, should be used for the
`s_safe_writeback` and `s2_wakeup` judgement.
# L1 DCache RAS extension support
The L1 DCache supports part of the Reliability, Availability, and
Serviceability (RAS) extension.
* L1 DCache protection with Single Error Correct, Double Error Detect
(SECDED) ECC on the RAMs, including the L1 DCache tag and data RAMs.
Erroneous tag or data is not recovered.
* Fault handling interrupt (Bus Error Unit interrupt, BEU, 65)
* Error injection
## ECC Error Detect
An error might be triggered when accessing the L1 DCache.
* **Error Report**:
  * Tag ECC error: as long as an ECC error occurs on any way, an ECC
error is judged to have occurred.
  * Data ECC error: if an ECC error occurs in the hit line, an ECC
error is considered to have occurred; if there is no hit, it is not
processed.
  * If an instruction access triggers an ECC error, a hardware error is
assumed and an exception is reported.
  * Whenever an error occurs, an error message needs to be sent to the
BEU.
  * When the hardware detects an error, it reports it to the BEU, which
triggers the NMI external interrupt (65).
* **Load instruction**:
  * Only tag or data ECC errors can be triggered during execution; the
errors are reported to the BEU and a `Hardware Error` is reported.
* **Probe/Snoop**:
  * If a tag ECC error occurs, there is no need to change the cache
status, and a `ProbeAck` with `corrupt=1` needs to be returned to L2.
  * If a data ECC error occurs, change the cache status according to
the rules. If data needs to be returned, a `ProbeAckData` with
`corrupt=1` needs to be returned to L2.
* **Replace/Evict**:
  * A `ReleaseData` with `corrupt=1` needs to be returned to L2.
* **Store to L1 DCache**:
  * If a tag ECC error occurs, the cacheline is released according to
the `Replace/Evict` process and the data is written to the L1 DCache
without reporting an error to L2.
  * If a data ECC error occurs, the data is written directly without
reporting the error to L2.
* **Atomics**:
  * Report a `Hardware Error`; do not report errors to L2.
## Error Inject
Each core's L1 DCache is configured with a memory-mapped,
register-controlled injection controller, and each hardware unit that
supports ECC is configured with a control bank. After the bank
registers are configured, the L1 DCache will trigger an ECC error on
the first subsequent L1 DCache access.
<div style="text-align: center;">
<img
src="https://github.com/user-attachments/assets/8c4d23c5-0324-4e52-bcf4-29b47a282d72"
alt="err_inject" width="200" />
</div>
### Address Space
The address space is `0x38022000`-`0x3802207F`, 128 bytes in total;
this space is local to each hart.
<div style="text-align: center;">
<img width="292" alt="ctl_bank"
src="https://github.com/user-attachments/assets/89f88b24-37a4-4786-a192-401759eb95cf">
</div>
### L1 DCache Control Bank
Each control bank contains the registers `ECCCTL`, `ECCEID`, and
`ECCMASK`; each register is 8 bytes.
<img width="414" alt="eccctl"
src="https://github.com/user-attachments/assets/b22ff437-d05d-4b3c-a353-dbea1afdc156">
* ECCCTL (ECC Control): ECC injection control register.
  * `ese (error signaling enable)`: indicates that the injection is
armed; initialized to 0. When the injection succeeds and `pst==0`,
`ese` is cleared.
  * `pst (persist)`: persistent injection. When `pst==1`, after the
`ECCEID` counter decreases to 0 and the injection succeeds, the
injection timer is restored to the last configured `ECCEID` and
injection repeats; when `pst==0`, injection happens only once.
  * `ede (error delay enable)`: indicates that the counter is in use;
initialized to 0. If
    * `ese==1` and `ede==0`, error injection takes effect immediately;
    * `ese==1` and `ede==1`, injection takes effect only after `ECCEID`
decrements to 0.
  * `cmp (component)`: injection target, initialized to 0.
    * 1'b0: the injection target is the tag.
    * 1'b1: the injection target is the data.
  * `bank`: bank valid signal, initialized to 0. When a bit in `bank`
is set, the corresponding mask is valid.
<img width="414" alt="ecceid"
src="https://github.com/user-attachments/assets/8cea0d8d-2540-44b1-b1f9-c1ed6ec5341e">
* ECCEID (ECC Error Inject Delay): ECC injection delay counter.
  * When `ese==1` and `ede==1`, it starts to decrement until it reaches
0. Currently it runs on the core clock, which could also be divided.
Since ECC injection relies on an L1 DCache access, the time when `EID`
expires and the time when the ECC error actually triggers may differ.
<img width="414" alt="eccmask"
src="https://github.com/user-attachments/assets/b1be83fd-17a6-4324-8aa6-45858249c476">
* ECCMASK (ECC Mask): ECC injection mask register.
  * 0 means no inversion; 1 means flip the corresponding bit. Tag
injection only uses the bits of `ECCMASK0` corresponding to the tag
length.
### Error Inject Example
```
# set control bank base address
li x3, $(BASEADDR)

# set eid
li x5, 500      # delay 500 cycles
sd x5, 8(x3)    # mmio store

# set mask
li x5, 0x1      # flip bit 0
sd x5, 16(x3)   # mmio store

# set ctl
li x5, 0x7      # cmp = 0, ede = 1, pst = 1, ese = 1
sd x5, 0(x3)    # mmio store
```
* style(pbmt): remove outstanding constant which was just for self-test
* fix(uncache): add mask comparison for `addrMatch`
* style(mem): code normalization
* fix(pbmt): handle cases where the load unit access is byte, word, etc.
* style(uncache): fix an import
* fix(uncache): address match should use the non-offset address when forwarding.
  In this case, to ensure correct forwarding, stores with the same address but overlapping masks cannot be enqueued at the same time.
* style(RAR): remove redundant design of the `nc` reg