
Classification of ARM implementations

ARM is an old semi-RISC processor design; the first design was released in 1985. Implementations are classified broadly by architecture and by the design of the processor core implementing the architecture:

  • Architecture. This is the view of the processor seen by programmers (privileged and not). The architecture revision is referred to as ARMv5, ARMv7, etc. There used to be only one current version of the architecture but lately this has been split into three "profiles":
    • A or Application profile. Intended for use with multi-user operating systems such as Linux, A-profile architectures include a Virtual Memory System Architecture (VMSA) wherein an MMU (or several) provide full-blown address remapping and memory attributes such as cached, non-executable, etc.
    • R or Real-Time profile. Meant for single-user RTOSes such as VxWorks or RTEMS. Incorporates a Protected Memory System Architecture (PMSA) wherein an MPU provides memory attributes but not address remapping.
    • M or Microcontroller profile. Intended for the simplest embedded systems which don't run a true operating system.
  • Processor core implementation. There have been many implementations, until recently designated by "ARM" followed by a processor family number and a bunch of letters telling what extra features are present, e.g., ARM7TDMI. Note that the family number doesn't indicate which ARM architecture revision is implemented, e.g., the ARM7TDMI implements architecture v4T. Lately this scheme has been abandoned in favor of a family name, architecture profile letter and family revision number such as "Cortex-A9".
  • Number of cores. In later systems a number of ARM cores may share main memory, some or all peripherals and some caches. Such systems have the word "MPCore" appended to the classification.

Classification and feature set of the Zynq-7000 SoC

I'll list the system features here along with key terms you should look for when navigating the ARM documentation forest.

Feature            Look for
Architecture       ARMv7-A
Processor          Cortex-A9, Cortex-A9 MPCore
Instruction sets   ARM, Thumb, Jazelle, ThumbEE
Floating point     VFP3-32
Vector operations  NEON, Advanced SIMD
DSP-like ops       EDSP
Timers             Generic Timer
Extra security     TrustZone, Security Extension
Debugging          JTAG, CoreSight
Multiprocessing    SMP, MPCore, cache coherence, Snoop Control Unit (SCU)

The Cortex family of ARM processors incorporate as standard some features that used to be optional in earlier families and were designated by letters following the family names: (T)humb instruction set, (D)ebugging using JTAG, faster (M)ultiplication instructions, embedded (I)CE trace/debug and (E)nhanced DSP instructions. Oddly, the Cortex-A9 has no integer division instructions. MPCore variants have new synchronization instructions favored over the older Swap (SWP): Load Register Exclusive (LDREX) and Store Register Exclusive (STREX).
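
As an illustration of the LDREX/STREX pattern (a sketch of mine, not RTEMS code), here's a minimal spin lock in GCC inline assembly; it needs an ARMv7 target such as -mcpu=cortex-a9:

#include <stdint.h>

static inline void spin_lock(volatile uint32_t *lock)
{
  uint32_t tmp;
  __asm__ volatile(
    "1: ldrex %0, [%1]      \n" /* read lock and mark it for exclusive access */
    "   cmp   %0, #0        \n" /* already held?                              */
    "   bne   1b            \n"
    "   strex %0, %2, [%1]  \n" /* try to store 1; %0 == 0 only on success    */
    "   cmp   %0, #0        \n"
    "   bne   1b            \n" /* lost the exclusive monitor: retry          */
    "   dmb                 \n" /* barrier before entering the critical region */
    : "=&r"(tmp)
    : "r"(lock), "r"(1u)
    : "cc", "memory");
}

static inline void spin_unlock(volatile uint32_t *lock)
{
  __asm__ volatile("dmb" : : : "memory"); /* make prior writes visible first */
  *lock = 0u;
}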

The same block of silicon, NEON, implements scalar single- and double-precision floating-point operations as well as SIMD operations on integer and single-precision operands.

The following extensions are not implemented in the processor: the obsolete floating-point implementations independent of NEON, the alternate VFP floating-point variants (VFPv3-D16 or any VFPv4 variant), 40-bit physical addressing (Large Physical Address Extension), and virtualization (hypervisor support).

GNU toolkit options

Use -mcpu=cortex-a9 when compiling in order to get the full instruction set including LDREX and STREX. This is already done in our make system. If you don't specify this you'll get the default -mcpu=arm7tdmi which is for a much older ARM implementation.
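
As a tiny safeguard (my own suggestion, not something in the make system), a source file can refuse to build for the wrong target; __ARM_ARCH_7A__ is the GCC predefine for an ARMv7-A target:

/* Fail the build if someone forgot -mcpu=cortex-a9 (or another ARMv7-A CPU). */
#if !defined(__ARM_ARCH_7A__)
#error "This code expects an ARMv7-A target; compile with -mcpu=cortex-a9"
#endif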

Processor "state" vs. "mode" and "privilege level"

Both mode and state are reflected in bits in the Current Program Status Register, or CPSR. "State" refers to the instruction set being executed. "Mode" and "privilege" determine the view of the processor the programmer sees; some instructions may be forbidden and the visible bank of registers may differ.

Instruction sets:

  • The standard ARM instruction set. Each instruction is 32 bits long and aligned on a 32-bit boundary. The full set of general registers is available. Shift operations may be combined with arithmetic and logical operations. This is the instruction set we'll be using for our project. Oddly, an integer divide instruction is optional and the Zynq CPUs don't have it.
  • Thumb-2. Designed for greater code density. Contains a mix of 16-bit and 32-bit instructions. Many instructions can access only general registers 0-7.
  • Jazelle. Direct execution of Java byte code.
  • ThumbEE. A sort of hybrid of Thumb and Jazelle, actually a CPU operation mode. Intended for environments where code modification is frequent, such as ones having a JIT compiler.

Privilege levels:

  • Level 0 (PL0): Unprivileged.
  • Level 1 (PL1): Privileged.

The two privilege levels are duplicated in Secure mode and Non-Secure mode, which have distinct address spaces. Under RTEMS we'll be using only Secure mode.

Coprocessors

The ARM instruction set has a standard coprocessor interface which allows up to 16 distinct coprocessors.

Coprocessor 15, CP15, is a pseudo-coprocessor which performs cache and MMU control as well as other system control functions.

CPs 10 and 11 are reserved for floating point and vector hardware, which in this system are both part of the NEON extension.
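
For example, reading the CP15 Main ID Register uses the standard MRC form (the wrapper function is just my illustration):

#include <stdint.h>

static inline uint32_t read_midr(void)
{
  uint32_t midr;
  /* Main ID Register: MRC p15, 0, Rt, c0, c0, 0 */
  __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(midr));
  return midr;
}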

MMU

There can be up to four independent MMUs per CPU (though they may be implemented with a single block of silicon with multiple banks of control registers). Without the security or virtualization extensions there is just one MMU which is used for both privileged and non-privileged accesses. Adding the security extension adds another for secure code, again for both privilege levels. Adding the virtualization extension adds two more MMUs, one for the hypervisor and one for a second stage of translation for code running in a virtual machine. The first stage of translation in a virtual machine maps VM-virtual to VM-real addresses while the second stage maps VM-real to actual hardware addresses. The hypervisor's MMU maps only once, from hyper-virtual to actual hardware addresses.

The Zynq CPUs have just the security extension and so each has two MMUs. All the MMUs present come up disabled after a reset, with garbage in the TLB entries. If all the relevant MMUs for a particular CPU state are disabled the system is still operable. Data accesses assume a memory type of Ordered, so there is no prefetching or reordering; data caches must be disabled or contain only invalid entries since a cache hit in this state results in unpredictable action. Instruction fetches assume a memory type of Normal, are uncached but still speculative, so that addresses up to 8 KB above the start of the current instruction may be accessed.

Unlike on the PowerPC, there is no way to examine MMU TLB entries nor to set them directly; you have to have a page table of some sort set up when you enable the MMU. In the RTEMS Git repository the lpc32xx BSP implements a simple and practical approach. It maintains a single-level page table with 4096 32-bit entries, where each entry covers a 1 MB section of address space. The whole table therefore covers the entire 4 GB address space of the CPU. The upper 12 bits of each virtual address are the same as the upper 12 bits of the physical address so that the address mapping is the identity. Finding the index of the table entry for a particular address just requires a logical right shift of the VA by 20. The 1 MB granularity of memory properties imposed by this organization seems a small price to pay for avoiding the complications of second-level page tables and variable page sizes.
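
A minimal sketch of the indexing and identity mapping just described (my own; the attribute bits are left to the caller):

#include <stdint.h>

/* One first-level entry per 1 MB section, 4096 entries for 4 GB. */
static inline uint32_t section_index(uint32_t va)
{
  return va >> 20;
}

/* Identity-mapped short-descriptor Section entry: bits [31:20] hold the
   section base (same as the VA), bits [1:0] = 0b10 mark a Section entry,
   and the caller supplies the remaining attribute bits. */
static inline uint32_t identity_section_entry(uint32_t va, uint32_t attrs)
{
  return (va & 0xfff00000u) | attrs | 0x2u;
}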

Automatic replacement of TLB entries normally uses a "pseudo-random" or "round robin" algorithm, not the "least recently used" algorithm implemented in the PowerPC. The only way to keep heavily used entries in the TLB indefinitely is to explicitly lock them in, which you can do with up to four entries. These locked entries occupy a special part of the TLB which is separate from the normal main TLB, so you don't lose entry slots if you use locking.

In a multi-core system like the Zynq all the CPUs can share the same translation table if all of the following conditions are met:

  1. All CPUs are in SMP mode.
  2. All CPUs are in TLB maintenance broadcast mode.
  3. All the MMUs are given the same real base address for the translation table.
  4. The translation table is in memory marked Normal, Sharable with write-back caching.

Under these conditions any CPU can change an address translation as if it were alone and have the changes broadcast to the other CPUs.

When the MMU is fetching translation table entries it will ignore the L1 cache unless you set some special bits in the Translation Table Base Register telling it that the table is write-back cached. Apparently write-through caching isn't good enough but ignoring the L1 cache in that case is correct, if slow.

Proposed MMU translation tables for RTEMS

As far as possible we keep to a single Level 1 translation table, where each entry describes a 1 MB "section" of address space. 4096 entries will cover all of the 4 GB address space. All entries specify access domain zero which will be set up for Client access, meaning that the translation table entries specify access permissions. All entries will be global entries, meaning that they apply to all code regardless of threading; the table is never modified by context switches. The address mapping is the identity, i. e., virtual == real.
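
A minimal sketch of the domain setup this implies: domain zero is set to Client (encoding 0b01) in the Domain Access Control Register, which is CP15 register c3. The function name is mine:

#include <stdint.h>

static inline void set_domain0_client(void)
{
  uint32_t dacr = 0x1u; /* domain 0 = Client, all other domains = No access */
  __asm__ volatile("mcr p15, 0, %0, c3, c0, 0" : : "r"(dacr) : "memory");
}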

For a few 1 MB sections we require a finer granularity and provide second-level tables for "small" pages of 4 KB each. The on-chip memory and the RCE protocol plugins will most likely receive this treatment; how many addresses the latter require will be calculated at system startup.

The first level translation table is part of the system image containing RTEMS. It has its own object code section named ".mmu_table" so that the linker script used to create the image can put it in some suitable place independently of the placement of other data; this includes giving it the proper alignment. If we keep to a small number of second-level tables, say ten or so, we can reserve space for them statically at the end of the .mmu_table section. Each second-level table occupies 1 KB.
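
Here's a rough sketch of how that static layout might look in C; the array names and the count of ten second-level tables are my assumptions, while the alignments come from the ARM short-descriptor format (16 KB for the first-level table, 1 KB per second-level table):

#include <stdint.h>

/* First-level table: 4096 section entries covering 4 GB, 16 KB aligned. */
uint32_t mmu_first_level[4096]
  __attribute__((section(".mmu_table"), aligned(16 * 1024)));

/* A small static pool of second-level tables, 256 small-page entries (1 KB)
   each, intended to land at the end of the same .mmu_table section. */
uint32_t mmu_second_level[10][256]
  __attribute__((section(".mmu_table"), aligned(1024)));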

The default Level 1 memory map

This default map is established at system startup. OCM and DDR have the attributes Normal, Read/Write, Executable and Cacheable. Memory mapped I/O regions are Device, Read/Write, Non-executable and Outer Shareable. Regions reserved by Xilinx will be No-access, causing a Permission exception on attempted use.

Address range           Region size   Mapping type   Resources included
0x00000000-0x3fffffff   1 GB          RAM            DDR + low OCM
0x40000000-0x7fffffff   1 GB          I/O            PL AXI slave port 0
0x80000000-0xbfffffff   1 GB          I/O            PL AXI slave port 1
0xc0000000-0xdfffffff   512 MB        No access
0xe0000000-0xefffffff   256 MB        I/O            IOP devices
0xf0000000-0xf7ffffff   128 MB        No access
0xf8000000-0xf9ffffff   32 MB         I/O            Registers on AMBA APB bus
0xfa000000-0xfbffffff   32 MB         No access
0xfc000000-0xfdffffff   32 MB         I/O            Quad-SPI
0xfe000000-0xffffffff   32 MB         No access      Unsupported Quad-SPI + high OCM
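
To make the map concrete, here's a hedged sketch of how it could be poured into the first-level table sketched in the previous section. map_region() and the three attribute parameters are illustrative placeholders; the real attribute encodings (Normal cacheable, Device, No-access) have to come from the ARMv7-A short-descriptor format:

#include <stdint.h>

extern uint32_t mmu_first_level[4096]; /* the .mmu_table array sketched earlier */

/* Fill identity-mapped 1 MB section entries for the range [start, end]. */
static void map_region(uint32_t start, uint32_t end, uint32_t attrs)
{
  for (uint32_t idx = start >> 20; idx <= end >> 20; ++idx)
    mmu_first_level[idx] = (idx << 20) | attrs | 0x2u; /* Section entry */
}

static void build_default_map(uint32_t ram, uint32_t io, uint32_t none)
{
  map_region(0x00000000u, 0x3fffffffu, ram);  /* DDR + low OCM                  */
  map_region(0x40000000u, 0x7fffffffu, io);   /* PL AXI slave port 0            */
  map_region(0x80000000u, 0xbfffffffu, io);   /* PL AXI slave port 1            */
  map_region(0xc0000000u, 0xdfffffffu, none); /* reserved                       */
  map_region(0xe0000000u, 0xefffffffu, io);   /* IOP devices                    */
  map_region(0xf0000000u, 0xf7ffffffu, none); /* reserved                       */
  map_region(0xf8000000u, 0xf9ffffffu, io);   /* AMBA APB registers             */
  map_region(0xfa000000u, 0xfbffffffu, none); /* reserved                       */
  map_region(0xfc000000u, 0xfdffffffu, io);   /* Quad-SPI                       */
  map_region(0xfe000000u, 0xffffffffu, none); /* unsupported Quad-SPI + high OCM */
}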

Caches

The Zynq-7000 system has both L1 and L2 caching. Each CPU has its own L1 instruction and data caches. All CPUs share a unified L2 cache. The L1 caches don't support entry locking but the L2 cache does. The L2 cache can operate in a so-called exclusion mode which prevents a cache line from appearing both in any L1 data cache and at the same time in the L2 cache.

Both the L1 caches and the L2 cache have the same line size of 32 bytes.

Cache sizes:

  • L1 I and L1 D: 32 KB, 256 sets, 4 ways (lines) per set.
  • L2: 512 KB, 2K sets, 8 ways per set.

The Snoop Control Unit, part of the Multiprocessing extension, ensures that all the L1 data caches and the L2 cache remain coherent, that is, provide the same view of memory contents to all CPUs. The SCU does not, however, keep the L1 instruction caches coherent with the other caches. To handle modification of executable code the programmer has to use explicit cache maintenance operations on the CPU that's performing the modifications, cleaning the affected data cache lines and invalidating the affected instruction cache lines. That brings the L1 instruction cache into line with the L1 data cache on the CPU making the changes; the SCU will do the rest provided the programmer has used the right variants of the cache operations. Because of this the maintenance need only be done on the L1 caches; the programmer can ignore the L2 cache.

For ARMv7 VMSA the only cache policy choice implemented is write-back with line allocation on writes.

Cache maintenance operations

There are fewer valid cache operations for ARMv7 VMSA than for v6, some of the remaining valid ones are deprecated, and v7 supports maintenance of more than one level of cache. One must be mindful of this when reading books that don't cover v7 or when using generic ARM support code present in RTEMS or other OSes. In addition, the meaning of a given cache operation is profoundly affected by the presence of the Multiprocessing and Security extensions (and by the Large Physical Address and Virtualization extensions, which we don't have to worry about).

Cache maintenance operations can be classified into two major groups:

  1. Those that affect all cache lines or affect lines specified by cache level, set and way. This group is generally used during system initialization when the SMP mode of operation has not been established. For example, all cache lines and branch predictions need to be invalidated before caches and MMU are enabled. They will also be used for cache cleaning prior to system shutdown.
  2. Those that affect the cache line corresponding to a given virtual address. These would be used to complete changes to the properties of memory regions during normal operation. In SMP mode these operations would be propagated to all CPUs via the SCU. There are variations on this set of operations that restrict propagation to those CPUs that are in the same "inner sharability domain" as the one making the changes but, for RTEMS at least, I don't think we'll be defining any such domains.

ARM manuals speak of "invalidation" or "flushing", which means marking cache lines as unused, and "cleaning", which means writing dirty cache lines out to the next lower level of caching (or to main memory) and marking them as clean. Note that most folks use "flushing" to mean what ARM calls "cleaning".

Entire cache or by (level, set, way)

Entire cache:

  • ICIALLU (Instruction Cache Invalidate All to point of Unification): invalidate the L1 I-cache for the local CPU and invalidate all its branch predictions.
  • ICIALLUIS (Instruction Cache Invalidate All to point of Unification for the Inner Shared domain): invalidate the L1 I-caches and branch predictions for all CPUs in the same Inner Sharability domain.
  • BPIALL, BPIALLIS: Similar to the above but invalidate only branch predictions, leaving the I-caches untouched.

By level, set, and way:

  • DCISW (Data Cache Invalidate by Set/Way): Invalidate for the local CPU a single cache line which is selected by specifying the set and the way.
  • DCCSW (Data Cache Clean by Set/Way): Clean a single cache line for the local CPU.
  • DCCISW (Data Cache Clean and Invalidate by Set/Way): Clean and invalidate a single cache line for the local CPU.

Notice that there's no set/way operation that invalidates or cleans an entire data cache; to do that one has to loop over all sets and ways. Nor are there variants affecting multiple CPUs; for that one needs to use operations that take virtual addresses. It's not clear to me whether the SCU obviates the corresponding L2 operation after one manipulates an L1 cache line this way (no pun intended). It can't hurt to do an explicit L2 invalidation at startup after the L1 invalidation is done.

By virtual address

ARM-speak:

  • Modified Virtual Address: For the Zynq it's the same as the plain old Virtual Address.
  • Point of Coherency (data accesses): All the levels of the memory hierarchy starting from L1 data cache of the CPU making the change out to and including the PoC must be adjusted to reflect the change in order to guarantee that all agents in the system can see the change. An agent can be a CPU, DMA engine, or whatnot. For the Zynq the PoC is main memory. Note that I said that agents can see the change, not that they will. If they have any data caches between themselves and the PoC then they will need to be notified so that they can invalidate the right entries in them, or some coherence mechanism must do it for them. On the Zynq the Snoop Control Unit will examine the attributes of the VA and invalidate data cache entries at L1 for at least some of the other CPUs and at L2 if need be:
    • Normal, cached memory: The CPUs affected will be those in the sharability domain specified for the VA.
    • Strongly ordered, cached memory: The CPUs in the same Outer sharability domain as the CPU making the change will be affected.
    • Shared, cached device memory: The ARMv7-A Architecture Manual says the behavior is implementation defined in the absence of the LVA extension, but the Cortex-A9 tech refs don't define it.
  • Point of Unification (instruction accesses): Entries are invalidated at all levels of the memory hierarchy from the L1 instruction cache out to and including the PoU, the level at which the CPU's instruction fetches, data fetches, and translation table walks see the same copy of a location. For the Zynq the PoU is the unified L2 cache. In this case the SCU won't invalidate any instruction cache entries for other CPUs. It seems as if code modification such as that performed by a dynamic linker will have to involve inter-CPU signalling in order to get software to perform all the required instruction cache invalidations. Or perhaps we can just make a region of memory unshared and non-executable, load code into it and perform the relocations, then make the memory shared and executable again.
  • Sharability domain: A set of memory bus masters (or "agents"), e.g., CPUs and DMA engines, that share access. A system can be partitioned into a set of disjoint "outer" domains each of which can be further partitioned into disjoint "inner" domains. Domain membership is determined by how the various memories and caches are wired up to each other and to bus masters. Whether the local CPU's changes to a physical location are to be shared only with other members of the same inner domain ("inner sharable") or with all members of the same outer domain ("outer sharable") is determined by attributes in the VA's translation table entry in the local MMU provided that the hardware allows sharing to take place.

Operations:

  • DCIMVAC (Data Cache Invalidate by MVA to PoC)
  • DCCMVAC (like the above but cleans)
  • DCCIMVAC (like the above but cleans and invalidates)
  • DCCMVAU (Data Cache Clean by MVA to PoU)
  • ICIMVAU (Instruction Cache Invalidate by MVA to PoU)

Coprocessor instructions for cache maintenance

Cache maintenance operations are implemented as coprocessor 15 (p15) operations involving coprocessor register 7 (c7) along with various opcodes and secondary coprocessor registers:

Operation   Instruction                    GPR operand
ICIALLU     MCR P15, 0, GPR, C7, C5, 0     Ignored
BPIALL      MCR P15, 0, GPR, C7, C5, 6     Ignored
ICIALLUIS   MCR P15, 0, GPR, C7, C1, 0     Ignored
BPIALLIS    MCR P15, 0, GPR, C7, C1, 6     Ignored
DCISW       MCR P15, 0, GPR, C7, C6, 2     Packed level, set, way
DCCSW       MCR P15, 0, GPR, C7, C10, 2    Packed level, set, way
DCCISW      MCR P15, 0, GPR, C7, C14, 2    Packed level, set, way
DCIMVAC     MCR P15, 0, GPR, C7, C6, 1     Virtual address
DCCMVAC     MCR P15, 0, GPR, C7, C10, 1    Virtual address
DCCIMVAC    MCR P15, 0, GPR, C7, C14, 1    Virtual address
DCCMVAU     MCR P15, 0, GPR, C7, C11, 1    Virtual address
ICIMVAU     MCR P15, 0, GPR, C7, C5, 1     Virtual address

For set/way operations the format of the operand in the general-purpose register is
(way << (32 - A)) | (set << L) | (level << 1) where:

  • way is the way number, starting at 0
  • A is ceil(log2(number of ways))
  • set is the set number starting from 0
  • L is log2(cache line size)
  • level is the cache level (0 for L1, 1 for L2, etc.)

For the Zynq L1 I and D caches this reduces to (way << 30) | (set << 5), where way < 4 and set < 256. For the L2 cache it reduces to (way << 29) | (set << 5) | 2 where way < 8 and set < 2048.
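
As a sketch, here is the whole local L1 data cache invalidated with DCISW using the packed L1 operand above; the function name is mine and the trailing DSB just waits for the operations to complete:

#include <stdint.h>

static void l1_dcache_invalidate_all(void)
{
  for (uint32_t way = 0; way < 4; ++way) {
    for (uint32_t set = 0; set < 256; ++set) {
      uint32_t sw = (way << 30) | (set << 5); /* level field is 0 for L1 */
      /* DCISW: MCR p15, 0, Rt, c7, c6, 2 (from the table above) */
      __asm__ volatile("mcr p15, 0, %0, c7, c6, 2" : : "r"(sw) : "memory");
    }
  }
  __asm__ volatile("dsb" : : : "memory"); /* wait for completion */
}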

For VA operations the GPR contains the virtual address. It needn't be cache-line aligned. The number of bytes affected by any individual operation is hard to determine due to the possible presence of a merging write-back buffer between cache and main memory. For this reason you should use the set/way form in a loop in order to operate on an entire data cache.
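
For completeness, here's a sketch of the usual by-VA pattern for a specific buffer (say, one about to be handed to a DMA engine): one DCCMVAC per 32-byte line across the range, then a DSB. The function name and the buffer-oriented usage are my own illustration:

#include <stdint.h>

#define CACHE_LINE 32u /* L1 and L2 line size given above */

static void dcache_clean_range_to_poc(uintptr_t start, uintptr_t end)
{
  for (uintptr_t va = start & ~(uintptr_t)(CACHE_LINE - 1u); va < end; va += CACHE_LINE) {
    /* DCCMVAC: MCR p15, 0, Rt, c7, c10, 1 (clean by MVA to PoC) */
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" : : "r"(va) : "memory");
  }
  __asm__ volatile("dsb" : : : "memory");
}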

Synchronizing after cache maintenance for SMP

By themselves the cache maintenance operations don't do the whole job; you also have to use memory barrier operations to broadcast the result of a data cache or instruction cache operation to other CPUs. Even that doesn't do everything because the remote CPUs still need to be told to dump their instruction fetch pipelines, which might have become inconsistent with their new I-cache states. Some sort of explicit inter-CPU signaling is needed; in the following example from ARM the code assumes the use of a simple semaphore. (The DCCMVAU, ICIMVAU and BPIMVA lines below are shorthand for the corresponding CP15 operations; the MCR forms of the first two appear in the coprocessor table above.)

; First CPU
P1:
    STR R11, [R1]    ; R11 contains a new instruction to store in program memory
    DCCMVAU R1       ; clean to PoU makes visible to instruction cache
    DSB              ; ensure completion of the clean on all processors
    ICIMVAU R1       ; ensure instruction cache/branch predictor discards stale data
    BPIMVA R1
    DSB              ; ensure completion of the ICache and branch predictor
                     ; invalidation on all processors
    STR R0, [R2]     ; set flag to signal completion
    ISB              ; synchronize context on this processor
    BX R1            ; branch to new code

; Second CPU
P2-Px:
    WAIT ([R2] == 1) ; wait for flag signaling completion
    ISB              ; synchronize context on this processor
    BX R1            ; branch to new code

Initializing the caches after a cold reset

CPU0 is the first to come up. Presumably it does the L2 initialization and CPU1 does not. The L2 cache can be invalidated using set/way cache operations.

Each CPU should invalidate all entries in its own L1 data and instruction caches before entering SMP mode.

Multiprocessor support

Sharable memory and sharability domains

SMP mode

TLB broadcast mode

System state immediately after a reset

State                 ARM
Mode                  Supervisor
Privilege level       1
Exceptions            All disabled save for Reset
Security level        Secure
Secure mode MMU       Disabled (SCTLR.M == 0)
Nonsecure mode MMU    Random (SCTLR.M random) but not functioning since Secure mode is active
PC                    0x00000000 or 0xFFFF0000 depending on reset behavior of SCTLR.V bit
Caches                All disabled, contents random
Branch tracer (BTAC)  ???
Snooper               ???
SMP mode              ???
TLB broadcast mode    ???
SP                    Random
Other GPRs            Random
CPSR I-bit            1
CPSR F-bit            1

The MMU's TLB entries have random content so one must at least invalidate all TLB entries before enabling the MMU. With the MMU disabled all instruction fetches are assumed to be to Normal memory while data accesses are assumed to be to Ordered memory.

All cache lines should be invalidated prior to enabling the caches.

The Reset exception is of the highest priority so in effect all others are disabled. Reset itself can never be disabled.

The Reset handler must set the initial SP values for all modes.
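
A sketch of what that might look like, called from the reset path before any code that needs a stack. The CPSR mode numbers are architectural; the function name and the *_stack_top symbols (which would come from the linker script) are mine:

/* Switch through each exception mode, load its banked SP, then return to
   Supervisor.  CPSID also keeps IRQ and FIQ masked throughout. */
extern char fiq_stack_top[], irq_stack_top[], abt_stack_top[],
            und_stack_top[], svc_stack_top[];

__attribute__((naked)) void reset_set_stacks(void)
{
  __asm__ volatile(
    "cpsid if, #0x11           \n" /* FIQ mode        */
    "ldr   sp, =fiq_stack_top  \n"
    "cpsid if, #0x12           \n" /* IRQ mode        */
    "ldr   sp, =irq_stack_top  \n"
    "cpsid if, #0x17           \n" /* Abort mode      */
    "ldr   sp, =abt_stack_top  \n"
    "cpsid if, #0x1b           \n" /* Undefined mode  */
    "ldr   sp, =und_stack_top  \n"
    "cpsid if, #0x13           \n" /* Supervisor mode */
    "ldr   sp, =svc_stack_top  \n"
    "bx    lr                  \n"
    ".ltorg                    \n");
}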

References
