...
- Modified Virtual Address: For the Zynq it's the same as the plain old Virtual Address.
- Point of Coherency (data accesses): All levels of the memory hierarchy, starting from the L1 data cache of the CPU making the change out to and including the PoC, must be adjusted to reflect the change in order to guarantee that all agents in the system can see it. An agent can be a CPU, a DMA engine, or the like. For the Zynq the PoC is main memory. Note that I said that agents can see the change, not that they will: if they have any data caches between themselves and the PoC then they must be notified so that they can invalidate the right entries, or some coherence mechanism must do it for them. On the Zynq the Snoop Control Unit will examine the attributes of the VA and invalidate data cache entries at L1 for at least some of the other CPUs, and at L2 if need be:
- Normal, cached memory: The CPUs affected will be those in the sharability domain specified for the VA.
- Strongly ordered memory: The CPUs in the same Outer sharability domain as the CPU making the change will be affected.
- Shared, cached device memory: The ARMv7-A Architecture Manual says the behavior is implementation defined in the absence of the LVA extension, but the Cortex-A9 tech refs don't define it.
- Point of Unification (instruction accesses): All levels of the memory hierarchy from the L1 instruction cache out to and including the PoU get entries invalidated; the PoU is the level that is common to the CPU's instruction fetches, data fetches, and translation table walk fetches. For the Zynq the PoU is the unified L2 cache. In this case the SCU won't invalidate any instruction cache entries for other CPUs. It seems as if code modification such as that performed by a dynamic linker will have to involve inter-CPU signalling in order to get software to perform all the required instruction cache invalidations. Or perhaps we can just make a region of memory unshared and non-executable, load code into it and perform the relocations, then make the memory shared and executable again.
- Sharability domain: A set of memory bus masters (or "agents"), e.g., CPUs and DMA engines, that share access to a given VA, according to the sharing attributes assigned to the VA by the MMU. A system can be partitioned into a set of disjoint "outer" domains, each of which can be further partitioned into disjoint "inner" domains. Domain membership is determined by how the various memories and caches are wired up to each other and to bus masters. Whether the local CPU's changes to a physical location are to be shared only with other members of the same inner domain ("inner sharable") or with all members of the same outer domain ("outer sharable") is determined by attributes in the VA's translation table entry in the local MMU, provided that the hardware allows sharing to take place.
Operations:
- DCIMVAC (Data Cache Invalidate by MVA to PoC)
- DCCMVAC (like the above but cleans)
- DCCIMVAC (like the above but cleans and invalidates)
- DCCMVAU (Data Cache Clean by MVA to PoU)
- ICIMVAU (Instruction Cache Invalidate by MVA to PoU)
Coprocessor instructions for cache maintenance
Cache maintenance operations are implemented as operations on coprocessor 15 (p15) involving coprocessor register 7 (c7) along with various opcodes and secondary coprocessor registers:
Operation | Instruction | GPR operand |
---|---|---|
ICIALLU | MCR P15, 0, GPR, C7, C5, 0 | Ignored |
BPIALL | MCR P15, 0, GPR, C7, C5, 6 | " |
ICIALLUIS | MCR P15, 0, GPR, C7, C1, 0 | " |
BPIALLIS | MCR P15, 0, GPR, C7, C1, 6 | " |
DCISW | MCR P15, 0, GPR, C7, C6, 2 | Packed level, set, way |
DCCSW | MCR P15, 0, GPR, C7, C10, 2 | " |
DCCISW | MCR P15, 0, GPR, C7, C14, 2 | " |
DCIMVAC | MCR P15, 0, GPR, C7, C6, 1 | Virtual address |
DCCMVAC | MCR P15, 0, GPR, C7, C10, 1 | " |
DCCIMVAC | MCR P15, 0, GPR, C7, C14, 1 | " |
DCCMVAU | MCR P15, 0, GPR, C7, C11, 1 | " |
ICIMVAU | MCR P15, 0, GPR, C7, C5, 1 | " |
For set/way operations the format of the operand in the general-purpose register is
(way << (32 - A)) | (set << L) | (level << 1) where:
- way is the way number, starting at 0
- A is ceil(log2(number of ways))
- set is the set number starting from 0
- L is log2(cache line size)
- level is the cache level (0 for L1, 1 for L2, etc.)
For the Zynq L1 I and D caches this reduces to (way << 30) | (set << 5), where way < 4 and set < 256. For the L2 cache it reduces to (way << 29) | (set << 5) | 2 where way < 8 and set < 2048.
For VA operations the GPR contains the virtual address. It needn't be cache-line aligned. The number of bytes affected by any individual operation is hard to determine due to the possible presence of a merging write-back buffer between cache and main memory. For this reason you should use the set/way form in a loop in order to operate on an entire data cache.
Synchronizing after cache maintenance for SMP
By themselves the cache maintenance operations don't do the whole job; you also have to use memory barrier operations to broadcast the result of a data cache or instruction cache operation to the other CPUs. Even that doesn't do everything, because the remote CPUs still need to be told to flush their instruction fetch pipelines, which might have become inconsistent with their new I-cache states. Some sort of explicit inter-CPU signaling is needed; in the following example from ARM the code assumes the use of a simple semaphore:
```
; First CPU P1:
STR R11, [R1]    ; R11 contains a new instruction to store in program memory
DCCMVAU R1       ; clean to PoU makes visible to instruction cache
DSB              ; ensure completion of the clean on all processors
ICIMVAU R1       ; ensure instruction cache/branch predictor
BPIMVA R1        ; discards stale data
DSB              ; ensure completion of the ICache and branch predictor
                 ; invalidation on all processors
STR R0, [R2]     ; set flag to signal completion
ISB              ; synchronize context on this processor
BX R1            ; branch to new code

; Second CPU P2-Px:
WAIT ([R2] == 1) ; wait for flag signaling completion
ISB              ; synchronize context on this processor
BX R1            ; branch to new code
```
Initializing the caches after a cold reset
...