...

Features and the terms to look for:

  • Architecture: ARMv7-A
  • Processor: Cortex-A9, Cortex-A9 MPCore
  • Instruction sets: ARM, Thumb, Jazelle, ThumbEE
  • Floating point: VFP3-32
  • Vector operations: NEON, Advanced SIMD
  • DSP-like ops: EDSP
  • Timers: Generic Timer
  • Extra security: TrustZone, Security Extension
  • Debugging: JTAG, CoreSight
  • Multiprocessing: SMP, MPCore, cache coherence, Snoop Control Unit (SCU)

The Cortex family of ARM processors incorporates as standard some features that used to be optional in earlier families and were designated by letters following the family names: the (T)humb instruction set, (D)ebugging using JTAG, faster (M)ultiplication instructions, embedded (I)CE trace/debug, and the (E)nhanced DSP instructions. Oddly, the Cortex-A9 doesn't have any integer division instructions. MPCore variants have new synchronization instructions that are favored over the older, now-deprecated Swap (SWP): Load Register Exclusive (LDREX) and Store Register Exclusive (STREX).
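
As an illustration of the exclusive-access primitives, here is a minimal sketch of an atomic increment built on an LDREX/STREX retry loop. It assumes an ARMv7-A target and GCC or Clang inline assembly; the function name is just illustrative.

    #include <stdint.h>

    /* Atomically increment *p using an LDREX/STREX retry loop.
     * STREX writes 0 to "fail" only if the exclusive monitor still holds
     * the reservation established by LDREX; otherwise we retry. */
    static inline uint32_t atomic_increment(volatile uint32_t *p)
    {
        uint32_t value, fail;
        do {
            __asm__ volatile(
                "ldrex   %0, [%2]\n\t"    /* load and set the exclusive monitor */
                "add     %0, %0, #1\n\t"  /* compute the new value              */
                "strex   %1, %0, [%2]"    /* try to store; %1 = 0 on success    */
                : "=&r"(value), "=&r"(fail)
                : "r"(p)
                : "memory");
        } while (fail != 0);
        return value;
    }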

...

Automatic replacement of TLB entries normally uses a "pseudo round robin-random" algorithm, not the "least recently used" algorithm implemented in the PowerPC. The only way to keep heavily used entries in the TLB indefinitely is to explicitly lock them in, which you can do with up to four entries. These locked entries occupy a special part of the TLB which is separate from the normal main TLB, so you don't lose entry slots if you use locking.

...

When the MMU is fetching translation table entries, it ignores the L1 cache unless you set some special bits in the Translation Table Base Register telling it that the table is write-back cached. Apparently write-through caching isn't good enough, but ignoring the L1 cache in that case is correct, if slow.

Caches

The Zynq-7000 system has both L1 and L2 caching. Each CPU has its own L1 instruction and data caches. All CPUs share a unified L2 cache. The L1 caches don't support entry locking but the L2 cache does. The L2 cache can operate in a so-called exclusion mode, which prevents a cache line from being present in an L1 data cache and in the L2 cache at the same time.

Both the L1 caches and the L2 cache have the same line size of 32 bytes.

Cache sizes (the sketch after this list shows one way to read them back at run time):

  • L1 I and L1 D: 32KB, 256 sets, 4 ways (lines)/set.
  • L2: 512KB, 2K sets, 8 ways/set.
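
These figures can be cross-checked at run time by selecting a cache with CSSELR and reading CCSIDR. The sketch below is a rough illustration assuming privileged (PL1) execution and GCC/Clang inline assembly; the function name is made up. Note that only the L1 caches of the Cortex-A9 are visible this way; the external L2 controller has its own registers.

    #include <stdint.h>

    /* Read the geometry of a cache from CCSIDR after selecting it in CSSELR.
     * level is 0 for L1; icache is nonzero to select the instruction cache. */
    static void cache_geometry(uint32_t level, int icache,
                               uint32_t *sets, uint32_t *ways, uint32_t *line_bytes)
    {
        uint32_t csselr = (level << 1) | (icache ? 1u : 0u);
        uint32_t ccsidr;

        __asm__ volatile("mcr p15, 2, %0, c0, c0, 0" :: "r"(csselr)); /* CSSELR */
        __asm__ volatile("isb");
        __asm__ volatile("mrc p15, 1, %0, c0, c0, 0" : "=r"(ccsidr)); /* CCSIDR */

        *line_bytes = 1u << ((ccsidr & 0x7u) + 4);      /* field is log2(words) - 2 */
        *ways       = ((ccsidr >> 3) & 0x3FFu) + 1;
        *sets       = ((ccsidr >> 13) & 0x7FFFu) + 1;
    }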

The Snoop Control Unit, part of the Multiprocessing extension, ensures that all the L1 data caches and the L2 cache remain coherent, that is, that they provide the same view of memory contents to all CPUs. The SCU does not, however, keep the L1 instruction caches coherent with the other caches. To handle modification of executable code, the programmer has to use explicit cache maintenance operations on the CPU performing the modifications, cleaning the affected data cache lines and invalidating the affected instruction cache lines. That brings the L1 instruction cache into line with the L1 data cache on the CPU making the changes; the SCU will do the rest, provided the programmer has used the right variants of the cache operations. Because of this, maintenance need only be done on the L1 caches; the programmer can ignore the L2 cache.

For ARMv7 VMSA the only cache policy choice implemented is write-back with line allocation on writes.

Cache maintenance operations

There are fewer valid cache operations for ARMv7 VMSA than for v6, some of the remaining valid ones are deprecated, and v7 supports maintenance of more than one level of cache. One must be mindful of this when reading books that don't cover v7 or when using generic ARM support code present in RTEMS or other OSes. In addition, the meaning of a given cache operation is profoundly affected by the presence of the Multiprocessing and Security extensions (and by the Large Physical Address and Virtualization extensions, which we don't have to worry about).

Cache maintenance operations can be classified into two major groups:

  1. Those that affect all cache lines or affect lines specified by cache level, set and way. This group is generally used during system initialization, before the SMP mode of operation has been established. For example, all cache lines and branch predictions need to be invalidated before the caches and MMU are enabled (a sketch of this follows the list). They will also be used for cache cleaning prior to system shutdown.
  2. Those that affect the cache line corresponding to a given virtual address. These would be used to complete changes to the properties of memory regions during normal operation. In SMP mode these operations would be propagated to all CPUs via the SCU. There are variations on this set of operations that restrict propagation to those CPUs that are in the same "inner sharability domain" as the one making the changes, but for RTEMS at least I don't think we'll be defining any such domains.
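
For the first group, a minimal sketch of the per-CPU invalidations typically done before the caches and MMU are enabled is shown below, assuming privileged execution and GCC/Clang inline assembly; the function name is made up. The data cache has to be handled separately by looping over sets and ways, as sketched later on this page.

    /* Early-boot invalidation for the local CPU: instruction cache,
     * branch predictor and TLB. The register value written is ignored
     * by these operations. */
    static void cpu_invalidate_all_local(void)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" :: "r"(0) : "memory"); /* ICIALLU */
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(0) : "memory"); /* BPIALL  */
        __asm__ volatile("mcr p15, 0, %0, c8, c7, 0" :: "r"(0) : "memory"); /* TLBIALL */
        __asm__ volatile("dsb" ::: "memory");
        __asm__ volatile("isb" ::: "memory");
    }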

ARM manuals speak of "invalidation" or "flushing", which means marking cache lines as unused, and "cleaning", which means writing cache lines out to the next lower level of caching (or to main memory) and marking them as clean. Note that most folks use "flushing" to mean what ARM calls "cleaning".

Entire cache or by (level, set, way)

Entire cache:

  • ICIALLU (Instruction Cache Invalidate All to point of Unification): invalidate the L1 I-cache for the local CPU and invalidate all its branch predictions.
  • ICIALLUIS (Instruction Cache Invalidate All to point of Unification for the Inner Shared domain): invalidate the L1 I-caches and branch predictions for all CPUs in the same Inner Sharability domain.
  • BPIALL, BPIALLIS: Similar to the above but invalidate only branch predictions, leaving the I-caches untouched.

By level, set, and way:

  • DCISW (Data Cache Invalidate by Set/Way): Invalidate for the local CPU a single cache line which is selected by specifying the set and the way.
  • DCCSW (Data Cache Clean by Set/Way): Clean a single cache line for the local CPU.
  • DCCISW (Data Cache Clean and Invalidate by Set/Way): Clean and invalidate a single cache line for the local CPU.

Notice that there's no set/way operation that invalidates or cleans an entire data cache; to do that one has to loop over all sets and ways, as in the sketch below. Nor are there variants affecting multiple CPUs; for that one needs to use operations that take virtual addresses. It's not clear to me whether the SCU obviates the corresponding L2 operation after one manipulates an L1 cache line this way (no pun intended). It can't hurt to do an explicit L2 invalidation at startup after the L1 invalidation is done.
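
Here is a rough sketch of such a loop over the local L1 data cache, assuming the Cortex-A9 geometry given above (256 sets, 4 ways, 32-byte lines) and GCC/Clang inline assembly; substituting DCCSW or DCCISW for DCISW gives the clean or clean-and-invalidate versions. The function name is made up.

    #include <stdint.h>

    /* Invalidate the whole local L1 data cache by set/way using DCISW.
     * Set/way word layout: way in bits [31:30] (32 - log2(4 ways)),
     * set in bits [12:5] (shifted by log2 of the 32-byte line size),
     * cache level (0 = L1) in bits [3:1]. */
    static void l1_dcache_invalidate_all(void)
    {
        for (uint32_t way = 0; way < 4; way++) {
            for (uint32_t set = 0; set < 256; set++) {
                uint32_t sw = (way << 30) | (set << 5);   /* level-0 bits stay zero */
                __asm__ volatile("mcr p15, 0, %0, c7, c6, 2"   /* DCISW */
                                 :: "r"(sw) : "memory");
            }
        }
        __asm__ volatile("dsb" ::: "memory");  /* make sure the invalidations complete */
    }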

By virtual address

For a virtual address that is Normal and cached, the following operations affect all CPUs in the sharability domain corresponding to the sharability attributes of the address as specified in the MMU's translation map. For a VA with the Strongly-ordered attribute, the CPUs affected are those in the Outer sharability domain that includes the local CPU.

ARM-speak:

  • Modified Virtual Address: For Zynq it's the same as the plain old Virtual Address.
  • Point of Coherency (data accesses): All the levels of the memory hierarchy starting from L1 data cache of the CPU making the change out to and including the PoC must be adjusted to reflect the change in order to guarantee that all agents in the system can see the change. An agent can be a CPU, DMA engine, or whatnot. For the Zynq the PoC is main memory. Note that I said that agents can see the change, not that they will. If they have any data caches between themselves and the PoC then they will need to be notified so that they can invalidate the right entries in them, or some coherence mechanism must do it for them. On the Zynq the Snoop Control Unit will examine the attributes of the VA and invalidate data cache entries at L1 for at least some of the other CPUs and at L2 if need be:
    • Normal, cached memory: The CPUs affected will be those in the sharability domain specified for the VA.
    • Strongly ordered, cached memory: The CPUs in the same Outer sharability domain as the CPU making the change will be affected.
    • Shared, cached device memory: The ARMv7-A Architecture Manual says the behavior is implementation defined in the absence of the Large Physical Address extension, but the Cortex-A9 tech refs don't define it.
  • Point of Unification (instruction accesses): Entries are invalidated at all levels of the memory hierarchy from the L1 instruction cache out to and including the PoU, the level at which the CPU's instruction fetches, data fetches, and translation table walks see the same copy of a location. For the Zynq the PoU is the unified L2 cache. In this case the SCU won't invalidate any instruction cache entries for other CPUs. It seems as if code modification, such as that performed by a dynamic linker, will have to involve inter-CPU signalling in order to get software to perform all the required instruction cache invalidations (see the sketch after the operation list below).

Operations:

  • DCIMVAC (Data Cache Invalidate by MVA to PoC)
  • DCCMVAC (like the above but cleans)
  • DCCIMVAC (like the above but cleans and invalidates)
  • DCCMVAU (Data Cache Clean by MVA to PoU)
  • ICIMVAU (Instruction Cache Invalidate by MVA to PoU)
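
Putting the PoU operations together, one plausible sequence for making freshly written code visible to the local CPU's instruction stream is sketched below. It assumes 32-byte cache lines and GCC/Clang inline assembly, and the other CPUs still have to be signalled to invalidate their own instruction caches, as noted above; the function name is made up.

    #include <stdint.h>

    /* Synchronize the local instruction stream with code just written to
     * [start, end): clean the D-cache lines to the PoU, invalidate the
     * corresponding I-cache lines and the branch predictor, then barrier. */
    static void sync_code_region(uintptr_t start, uintptr_t end)
    {
        const uintptr_t line = 32;
        uintptr_t a;

        for (a = start & ~(line - 1); a < end; a += line)
            __asm__ volatile("mcr p15, 0, %0, c7, c11, 1" :: "r"(a) : "memory"); /* DCCMVAU */
        __asm__ volatile("dsb" ::: "memory");  /* cleans finish before invalidates start */

        for (a = start & ~(line - 1); a < end; a += line)
            __asm__ volatile("mcr p15, 0, %0, c7, c5, 1" :: "r"(a) : "memory");  /* ICIMVAU */
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(0) : "memory");      /* BPIALL  */

        __asm__ volatile("dsb" ::: "memory");
        __asm__ volatile("isb" ::: "memory");  /* refetch the new instructions */
    }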

Initializing the caches after a cold reset

CPU0 is the first to come up, so presumably it does the L2 initialization and CPU1 does not. On the Zynq the L2 cache is an external L2C-310 (PL310) controller, so it is not covered by the CP15 set/way operations; it is invalidated through the controller's own memory-mapped maintenance registers before it is enabled.

Each CPU should invalidate all entries in its own L1 data and instruction caches before entering SMP mode.
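
For the L2 side, here is a rough sketch of an invalidate-all at startup. It assumes the L2C-310 register block lives at the Zynq address 0xF8F02000 with the controller's usual offsets (Invalidate by Way at 0x77C, Cache Sync at 0x730); the addresses, the 8-way mask and the function name should be checked against the Zynq and L2C-310 manuals before use.

    #include <stdint.h>

    /* Presumed Zynq-7000 L2C-310 register addresses; verify against the TRM. */
    #define L2C_INV_WAY     (*(volatile uint32_t *)0xF8F0277Cu)
    #define L2C_CACHE_SYNC  (*(volatile uint32_t *)0xF8F02730u)

    /* Invalidate all 8 ways of the (still disabled) L2 cache. */
    static void l2_invalidate_all(void)
    {
        L2C_INV_WAY = 0xFFu;             /* one bit per way                 */
        while (L2C_INV_WAY & 0xFFu)      /* bits clear as each way finishes */
            ;
        L2C_CACHE_SYNC = 0u;             /* drain the controller's buffers  */
        while (L2C_CACHE_SYNC & 1u)
            ;
    }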

Multiprocessor support

Sharable memory

...

and sharability domains

SMP mode

TLB broadcast mode

System state immediately after a reset

...