...

  • Modified Virtual Address: For the Zynq it's the same as the plain old Virtual Address.
  • Point of Coherency (data accesses): All the levels of the memory hierarchy starting from L1 data cache of the CPU making the change out to and including the PoC must be adjusted to reflect the change in order to guarantee that all agents in the system can see the change. An agent can be a CPU, DMA engine, or whatnot. For the Zynq the PoC is main memory. Note that I said that agents can see the change, not that they will. If they have any data caches between themselves and the PoC then they will need to be notified so that they can invalidate the right entries in them, or some coherence mechanism must do it for them. On the Zynq the Snoop Control Unit will examine the attributes of the VA and invalidate data cache entries at L1 for at least some of the other CPUs and at L2 if need be:
    • Normal, cached memory: The CPUs affected will be those in the sharability domain specified for the VA.
    • Strongly ordered, cached memory: The CPUs in the same Outer sharability domain as the CPU making the change will be affected.
    • Shared, cached device memory: The ARMv7-A Architecture Manual says the behavior is implementation defined in the absence of the LVA extension, but the Cortex-A9 tech refs don't define it.
  • Point of Unification (instruction accesses): Entries are invalidated at all levels of the memory hierarchy from the L1 instruction cache out to and including the PoU, the level that is common to the CPU's instruction fetches, data accesses, and translation table walks. For the Zynq the PoU is the unified L2 cache. In this case the SCU won't invalidate any instruction cache entries for other CPUs. It seems as if code modification such as that performed by a dynamic linker will have to involve inter-CPU signalling in order to get software to perform all the required instruction cache invalidations. Or perhaps we can just make a region of memory unshared and non-executable, load code into it and perform the relocations, then make the memory shared and executable again.
  • Sharability domain: A set of memory bus masters (or "agents"), e.g., CPUs and DMA engines, that share access to a given VA, according to the sharing attributes assigned to the VA by the MMU. A system can be partitioned into a set of disjoint "outer" domains, each of which can be further partitioned into disjoint "inner" domains; an "outer" domain can include all CPUs. Domain membership is determined by how the various memories and caches are wired up to each other and to bus masters. Whether the local CPU's changes to a physical location are to be shared only with other members of the same inner domain ("inner sharable") or with all members of the same outer domain ("outer sharable") is determined by attributes in the VA's translation table entry in the local MMU, provided that the hardware allows sharing to take place.

Operations:

  • DCIMVAC (Data Cache Invalidate by MVA to PoC)
  • DCCMVAC (like the above but cleans)
  • DCCIMVAC (like the above but cleans and invalidates)
  • DCCMVAU (Data Cache Clean by MVA to PoU)
  • ICIMVAU (Instruction Cache Invalidate by MVA to PoU)

Coprocessor instructions for cache maintenance

Cache maintenance operations are implemented as operations on coprocessor 15 (p15) involving coprocessor register 7 (c7) along with various opcodes and secondary coprocessor registers:

Operation  | Instruction                  | GPR operand
-----------|------------------------------|-----------------------
ICIALLU    | MCR P15, 0, GPR, C7, C5, 0   | Ignored
BPIALL     | MCR P15, 0, GPR, C7, C5, 6   | Ignored
ICIALLUIS  | MCR P15, 0, GPR, C7, C1, 0   | Ignored
BPIALLIS   | MCR P15, 0, GPR, C7, C1, 6   | Ignored
DCISW      | MCR P15, 0, GPR, C7, C6, 2   | Packed level, set, way
DCCSW      | MCR P15, 0, GPR, C7, C10, 2  | Packed level, set, way
DCCISW     | MCR P15, 0, GPR, C7, C14, 2  | Packed level, set, way
DCIMVAC    | MCR P15, 0, GPR, C7, C6, 1   | Virtual address
DCCMVAC    | MCR P15, 0, GPR, C7, C10, 1  | Virtual address
DCCIMVAC   | MCR P15, 0, GPR, C7, C14, 1  | Virtual address
DCCMVAU    | MCR P15, 0, GPR, C7, C11, 1  | Virtual address
ICIMVAU    | MCR P15, 0, GPR, C7, C5, 1   | Virtual address

For set/way operations the format of the operand in the general-purpose register is
(way << (32 - A)) | (set << L) | (level << 1) where:

  • way is the way number, starting at 0
  • A is ceil(log2(number of ways))
  • set is the set number starting from 0
  • L is log2(cache line size in bytes)
  • level is the cache level (0 for L1, 1 for L2, etc.)

For the Zynq L1 I and D caches this reduces to (way << 30) | (set << 5), where way < 4 and set < 256. For the L2 cache it reduces to (way << 29) | (set << 5) | 2 where way < 8 and set < 2048.

For VA operations the GPR contains the virtual address. It needn't be cache-line aligned. The number of bytes affected by any individual operation is hard to determine due to the possible presence of a merging write-back buffer between cache and main memory. For this reason you should use the set/way form in a loop in order to operate on an entire data cache.

Synchronizing after cache maintenance for SMP

By themselves the cache maintenance operations don't do the whole job; you also have to use memory barrier operations to broadcast the results of data cache and instruction cache operations to the other CPUs. Even that doesn't do everything, because the remote CPUs still need to be told to dump their instruction fetch pipelines; they might have become inconsistent with their new I-cache state. Some sort of explicit inter-CPU signaling is needed; in the following example from ARM the code assumes the use of a simple semaphore:

No Format
; First CPU
P1:
    STR R11, [R1]    ; R11 contains a new instruction to store in program memory
    DCCMVAU R1       ; clean to PoU makes visible to instruction cache
    DSB              ; ensure completion of the clean on all processors
    ICIMVAU R1       ; ensure instruction cache/branch predictor discards stale data
    BPIMVA R1
    DSB              ; ensure completion of the ICache and branch predictor
                     ; invalidation on all processors
    STR R0, [R2]     ; set flag to signal completion
    ISB              ; synchronize context on this processor
    BX R1            ; branch to new code

; Second CPU
P2-Px:
    WAIT ([R2] == 1)     ; wait for flag signaling completion
    ISB              ; synchronize context on this processor
    BX R1            ; branch to new code

Initializing the caches after a cold reset

...