System overview

The Zynq system-on-a-chip at the highest level consists of:

...

Programs running in the UPCPUs of the cores see the non-core resources of the PS, such as the L2 cache controller, the UART and the GPIC, as sets of memory-mapped registers. They see the in-core resources through the standard coprocessor interface instructions, which can address up to 16 coprocessors numbered 0-15. Coprocessor 15 is reserved for the MMU, L1 caches and various core control functions. Coprocessors 10 and 11 are reserved for the SIMD and floating-point operations provided by the NEON.
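
As an illustration (not code from any BSP), here's how a CP15 access looks from C; MRC p15, 0, <Rt>, c0, c0, 0 reads the Main ID Register:

No Format

/* Minimal sketch: reading the Main ID Register (MIDR) through the
 * coprocessor interface, using GCC inline assembly. */
#include <stdint.h>

static inline uint32_t read_midr(void)
{
    uint32_t midr;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(midr));
    return midr;
}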

The ARM Cortex-A9

ARM is an old semi-RISC processor design; the first version of it was released in 1985. Implementations are classified broadly by architecture and by the design of the processor core implementing the architecture:

  • Architecture. This is the view of the UPCPU seen by programmers (privileged and not). The architecture revision is referred to as ARMv5, ARMv7, etc. There used to be only one current version of the architecture but lately this has been split into three "profiles":
    • A or Application profile. Intended for use with multi-user operating systems such as Linux, A-profile architectures include a Virtual Memory System Architecture (VMSA) wherein an MMU (or several) provide full-blown address remapping and memory attributes such as cached, non-executable, etc.
    • R or Real-Time profile. Meant for single-user RTOSes such as VxWorks or RTEMS. Incorporates a Protected Memory System Architecture (PMSA) wherein an MPU provides memory attributes but not address mapping. Integer division is implemented in hardware. There are "tightly coupled" memories that are closer to the UPCPU than any L1 caches.
    • M or Microcontroller profile. Intended for the simplest embedded systems which don't run a true operating system.
  • UPCPU implementation. There have been many implementations, until recently designated by "ARM" followed by a processor family number and a string of letters indicating which extra features are present, e.g., ARM7TDMI. Note that the family number doesn't indicate which ARM architecture revision is implemented; the ARM7TDMI, for example, implements architecture v4T. Lately this scheme has been abandoned in favor of a family name, architecture profile letter and family revision number, such as "Cortex-A9".
  • Number of cores. In later systems a number of ARM cores may share main memory, some or all peripherals and some caches. Such systems have the word "MPCore" appended to the classification.

ARM features implemented in the Zynq

The APU on the Zynq implements these ARM features:

...

  • Obsolete floating-point (FP) independent of NEON
  • Alternate NEON floating point (VFP3-16 or VFP4-anything)
  • 40-bit physical addresses (Large Physical Address Extension)
  • Virtualization (hypervisor support)
  • SWP instruction.

UPCPU "state" vs. "mode" and "privilege level"

Both mode and state are reflected in bits in the Current Program Status Register, or CPSR. "State" refers to the instruction set being executed. "Mode" and "privilege" determine the view of the UPCPU the programmer sees; some instructions may be forbidden and the visible bank of registers may differ.
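
For illustration, a minimal sketch of reading the CPSR from C and picking out the mode and Thumb-state bits (the helper names are ours, not RTEMS code):

No Format

/* Sketch: read the CPSR; the mode field is bits [4:0] and the Thumb
 * state bit is bit 5. */
#include <stdint.h>

static inline uint32_t read_cpsr(void)
{
    uint32_t cpsr;
    __asm__ volatile("mrs %0, cpsr" : "=r"(cpsr));
    return cpsr;
}

#define CPSR_MODE(cpsr)   ((cpsr) & 0x1fu)    /* e.g., 0x13 = Supervisor */
#define CPSR_THUMB(cpsr)  (((cpsr) >> 5) & 1u)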

...

The two privilege levels are duplicated in Secure mode and Non-Secure mode, which have distinct address spaces. Under RTEMS we'll be using only Secure mode running at PL1.

MMU

There can be up to four logical MMUs per ARM core (they may be implemented with a single block of silicon with multiple banks of working registers). Without the security or virtualization extensions there is just one logical MMU which is used for both privileged and non-privileged accesses. Adding the security extension adds another for secure code, again for both privilege levels. Adding the virtualization extension adds two more logical MMUs, one for the hypervisor and one for a second stage of translation for code running in a virtual machine. The first stage of translation in a virtual machine maps VM-virtual to VM-real addresses while the second stage maps VM-real to actual hardware addresses. The hypervisor's MMU maps only once, from hyper-virtual to actual hardware addresses.

...

Unlike for PowerPC there is no way to examine MMU TLB entries or to set them directly; you have to have a page table of some sort set up when you enable the MMU. In the RTEMS Git repository the lpc32xx BSP implements a simple and practical approach. It maintains a single-level page table with 4096 32-bit entries, where each entry covers a 1 MB section of address space, so the whole table covers the entire 4 GB address space of the CPU. The upper 12 bits of each virtual address are the same as the upper 12 bits of the physical address, so the address mapping is the identity. Finding the index of the table entry for a particular address just requires a logical right shift of the VA by 20. The 1 MB granularity of memory properties imposed by this organization seems a small price to pay for avoiding the complications of second-level page tables and variable page sizes.
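
A sketch of that approach in C (the names and the attribute bits passed in are illustrative, not the lpc32xx BSP's actual code):

No Format

/* Single-level, identity-mapped translation table: 4096 entries, one per
 * 1 MB section; the entry for a VA is simply mmu_table[va >> 20]. */
#include <stdint.h>

#define SECTION_ENTRIES 4096u   /* 4 GB / 1 MB */

/* The first-level table must be 16 KB aligned. */
static uint32_t mmu_table[SECTION_ENTRIES] __attribute__((aligned(16384)));

static void build_identity_map(uint32_t section_flags)
{
    for (uint32_t i = 0; i < SECTION_ENTRIES; ++i) {
        /* Section base address in bits [31:20], attribute bits supplied by
         * the caller, bits [1:0] = 0b10 marking a section descriptor. */
        mmu_table[i] = (i << 20) | section_flags | 0x2u;
    }
}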

Automatic replacement of TLB entries normally uses a "pseudo-random" or "round robin" algorithm, not the "least recently used" algorithm implemented in the PowerPC. On the Zynq pseudo-random is the power-on default. The only way to keep heavily used entries in the TLB indefinitely is to explicitly lock them in, which you can do with up to four entries. These locked entries occupy a special part of the TLB which is separate from the normal main TLB, so you don't lose entry slots if you use locking.

In a multi-core system like the Zynq all the ARM cores can share the same MMU translation table if all of the following conditions are met:

  1. All cores have SMP enabled.
  2. All cores have TLB maintenance broadcast enabled.
  3. All the MMUs are given the same real base address for the translation table.
  4. The translation table is in memory marked Normal, Shareable, with write-back caching.

Under these conditions any core can change an address translation as if it were alone and have the changes broadcast to the others.

When the MMU is fetching translation table entries it will ignore the L1 cache unless you set some low-order bits in the Translation Table Base Register indicating that the table is write-back cached. Apparently write-through caching isn't good enough but ignoring the L1 cache in that case is correct, if slow.

Translation table entries specify an access domain number from zero to 15. For each domain there are two bits in the Domain Access Control Register which determine the access rights for the domain. The choices are:

  • No Access.
  • Client. Further bits in the translation table entry determine access rights.
  • Manager. All rights granted.

If No Access is specified in the DACR, access attempts generate Domain exceptions. If a type of access is denied by the translation table entry it causes a Permission exception.
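
For illustration, a sketch of composing and writing a DACR value (the helper names are made up; the encoding is 0b00 = No Access, 0b01 = Client, 0b11 = Manager, with domain n occupying bits [2n+1:2n]):

No Format

#include <stdint.h>

enum { DOMAIN_NO_ACCESS = 0u, DOMAIN_CLIENT = 1u, DOMAIN_MANAGER = 3u };

/* Access bits for domain n live in bits [2n+1:2n] of the DACR. */
static inline uint32_t dacr_field(unsigned domain, uint32_t access)
{
    return access << (2u * domain);
}

static inline void write_dacr(uint32_t value)
{
    /* The DACR is CP15 c3: MCR p15, 0, <Rt>, c3, c0, 0. */
    __asm__ volatile("mcr p15, 0, %0, c3, c0, 0" :: "r"(value));
}

/* Example: domain 0 as Client, all other domains No Access. */
/* write_dacr(dacr_field(0, DOMAIN_CLIENT)); */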

Proposed MMU translation tables for RTEMS

As far as possible we keep to a single first-level translation table, where each entry describes a 1 MB "section" of address space; 4096 entries cover the entire 4 GB address space. All entries specify access domain zero, which will be set up for Client access, meaning that the translation table entries specify access permissions. All entries will be global entries, meaning that they apply to all code regardless of threading; the table is never modified by context switches. The address mapping is the identity, i.e., virtual == real.

For a few 1 MB sections we require a finer granularity and provide second-level tables for "small" pages of 4 KB each. The on-chip memory and the RCE protocol plugins will most likely receive this treatment; how many addresses the latter require will be calculated at system startup.

The first-level translation table is part of the system image containing RTEMS. It has its own object code section named ".mmu_tbl" so that the linker script used to create the image can put it in some suitable place independently of the placement of other data; this includes giving it the proper alignment. If we keep to a small number of second-level tables, say ten or so, we can reserve space for them statically at the end of the .mmu_tbl section. Each second-level table occupies 1 KB.
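
A sketch of how the reservation might look in C (the section name comes from the text above; the count of ten second-level tables is just the example figure):

No Format

#include <stdint.h>

#define NUM_SECOND_LEVEL 10   /* "say ten or so" */

/* First-level table: 4096 section entries, 16 KB aligned. */
static uint32_t first_level[4096]
    __attribute__((section(".mmu_tbl"), aligned(16384)));

/* Second-level tables for 4 KB small pages: 256 entries (1 KB) each. */
static uint32_t second_level[NUM_SECOND_LEVEL][256]
    __attribute__((section(".mmu_tbl"), aligned(1024)));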

The default memory map

This default map is established at system startup. OCM and DDR have the attributes Normal, Read/Write, Executable and Cacheable. Memory mapped I/O regions are Device, Read/Write, Non-executable and Outer Shareable. Regions reserved by Xilinx will have No Access specified in the translation table entries, so that attempted use will generate Permission exceptions.
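
A sketch of how such attribute combinations map onto short-descriptor section entries (bit positions per the ARMv7-A short-descriptor format; the particular combinations chosen here are illustrative, not the actual values used):

No Format

/* Short-descriptor section entry: bits [1:0] = 0b10, B = bit 2, C = bit 3,
 * XN = bit 4, Domain = bits [8:5], AP[1:0] = bits [11:10], TEX = [14:12],
 * AP[2] = bit 15, S = bit 16, section base address = bits [31:20]. */
#define SEC_B         (1u << 2)
#define SEC_C         (1u << 3)
#define SEC_XN        (1u << 4)
#define SEC_AP_RW     (3u << 10)   /* read/write at both privilege levels */
#define SEC_S         (1u << 16)

/* Illustrative attribute sets for the default map: */
#define SEC_NORMAL_WB (SEC_C | SEC_B | SEC_AP_RW)           /* OCM, DDR */
#define SEC_DEVICE    (SEC_B | SEC_XN | SEC_AP_RW | SEC_S)  /* I/O      */
#define SEC_NO_ACCESS 0u   /* AP = 0: any access faults                 */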

MMU translation tables implemented for RTEMS

The 4 GB address space is divided into Regions, where each Region is assigned a page size and default memory access rights based on the intended use of the Region. Simple dynamic allocation is supported for Regions where that makes sense. In Regions where we don't need to change access permissions on relatively small areas, e.g., I/O buffers, we set the page size to 1 MB, the size of a section. Regions which require finer-grained control of access permissions, e.g., where shared libraries are loaded, are given a page size of 4 KB. All MMU table entries specify access domain 15, which is set up for Client access, meaning that the translation table entries specify access permissions. All entries are global entries, meaning that they apply to all code regardless of thread; the table is never modified by context switches. The address mapping is the identity, i.e., virtual == real.
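
To make the scheme concrete, a sketch of the kind of Region descriptor this implies (the structure and field names are ours, not RTEMS code):

No Format

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    const char *name;        /* e.g., "Workspace"                        */
    uint32_t    start;       /* first address in the Region              */
    uint32_t    end;         /* last address in the Region               */
    uint32_t    page_size;   /* 0x100000 (1 MB section) or 0x1000 (4 KB) */
    bool        allocatable; /* simple dynamic allocation supported?     */
    uint32_t    permissions; /* default R/W/X/C/S attributes             */
} Region;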

 

The following table shows the memory map currently being used for RTEMS. Address ranges not in the table are reserved and will cause an access exception if a program tries to use them. The access permissions shown below are the defaults. When a shared library is loaded the memory occupied by each of its loadable segments is given access permissions compatible with the segment's flags, so executable segments are given execution permission but not write permission, and data segments are given write permission but not execute permission.
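
As a sketch of the segment-flag rule just described (the helper is hypothetical; PF_R/PF_W/PF_X are the standard ELF program-header flag values):

No Format

#include <stdint.h>

/* Standard ELF program-header flag values. */
#define PF_X 0x1u
#define PF_W 0x2u
#define PF_R 0x4u

enum { PERM_R = 1u << 0, PERM_W = 1u << 1, PERM_X = 1u << 2 };

/* Executable segments get X but not W; writable data segments get W but
 * not X; everything is readable. */
static uint32_t perms_for_segment(uint32_t p_flags)
{
    uint32_t perms = PERM_R;
    if (p_flags & PF_X)
        perms |= PERM_X;
    else if (p_flags & PF_W)
        perms |= PERM_W;
    return perms;
}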

Permissions: (R)ead, (W)rite, e(X)ecute, (C)ached, (S)hared.

 

Region           | Addresses             | Mem type         | Page size | Allocatable | Permissions | Purpose
-----------------+-----------------------+------------------+-----------+-------------+-------------+-----------------------------------------------
Null catcher     | 0x00000000-0x00003fff | Reserved         | 4K        | n/a         | None        | Catch attempts to dereference null pointers.
Syslog           | 0x00004000-0x00103fff | Normal           | 4K        | Yes         | RWXC.       | Circular buffer for captured console output.
MMU              | 0x00104000-0x00167fff | Strongly ordered | 4K        | Yes         | R...S       | MMU tables.
Run-time support | 0x00168000-0x0a9fffff | Normal           | 4K        | Yes         | RWXC.       | Loaded shared libraries.
Workspace        | 0x0aa00000-0x153fffff | Normal           | 1M        | Yes         | RW.C.       | RTEMS workspace, C/C++ heap, application data.
Uncached         | 0x15400000-0x1fffffff | Normal           | 1M        | Yes         | RW...       | I/O buffers.
AXI 0            | 0x40000000-0x400fffff | Device           | 1M        | No          | RW..S       | AXI (firmware) window 0 (unused for now).
AXI 1            | 0x80000000-0x80bfffff | Device           | 1M        | No          | RW..S       | AXI window 1 (RCE firmware).
IOP              | 0xe0000000-0xe02fffff | Device           | 1M        | No          | RW..S       | Various I/O devices.
Static           | 0xe1000000-0xe5ffffff | Device           | 1M        | No          | RW..S       | Static memory controllers, e.g., flash.
High-reg         | 0xf8000000-0xfffbffff | Device           | 4K        | No          | RW..S       | Various I/O devices.
OCM              | 0xfffc0000-0xffffffff | Normal           | 4K        | Yes         | RW.CS       | On-chip memory.

Address range         | Region size | Mapping type | Use
----------------------+-------------+--------------+---------------------------------
0x00000000-0x3fffffff | 1 GB        | RAM          | DDR + low OCM
0x40000000-0x7fffffff | 1 GB        | I/O          | PL AXI slave port 0
0x80000000-0xbfffffff | 1 GB        | I/O          | PL AXI slave port 1
0xc0000000-0xdfffffff | 512 MB      | No access    | Undocumented
0xe0000000-0xefffffff | 256 MB      | I/O          | IOP devices
0xf0000000-0xf7ffffff | 128 MB      | No access    | Reserved by Xilinx
0xf8000000-0xf9ffffff | 32 MB       | I/O          | Registers on AMBA APB bus
0xfa000000-0xfbffffff | 32 MB       | No access    | Reserved by Xilinx
0xfc000000-0xfdffffff | 32 MB       | I/O          | Quad-SPI
0xfe000000-0xffffffff | 32 MB       | No access    | Unsupported Quad-SPI + high OCM

Caches

The Zynq-7000 system has two levels of cache between a memory-access instruction and main memory. Each ARM core in the APU has semi-private L1 instruction and data caches; "semi-private" because the SCU keeps the L1 caches of the cores in sync. Outside of the ARM cores the APU has a unified instruction+data cache, also managed by the SCU, which keeps it coherent with the L1 caches in the cores. Xilinx refers to this shared, unified cache as an L2 cache. This nomenclature clashes with the definition of a level of cache as defined by the ARMv7 architecture; such a cache would be managed via coprocessor 15 and would be considered part of the ARM core. The APU L2 cache by contrast does not belong to either ARM core and is managed via memory-mapped APU registers. In the ARM architecture manual such caches are referred to as "system caches". Anyway, from here on in I'll just use the term "L2 cache" as Xilinx does.

...

  • L1 I and L1 D: 32 KB, 256 sets, 4 ways (lines)/set.
  • L2: 512 KB, 2K sets, 8 ways/set.

L1

Barring any explicit cache manipulation by software the Snoop Control Unit ensures that the L1 data caches in both cores remain coherent and does the same for both cores' L1 instruction caches. The SCU does not, however, keep the instruction caches coherent with the data caches; it doesn't need to unless instructions have been modified. To handle that case the programmer has to use explicit cache maintenance operations in the ARM core that's performing the modifications, cleaning the affected data cache lines and invalidating the affected instruction cache lines. That brings the instruction cache into line with the data cache on the CPU core making the changes; the SCU will do the rest provided the programmer has used the right variants of the cache operations.

...

  • Point/Level of Coherency: According to the ARM Architecture, a cache operation going up to the point of coherency ensures that all "agents" accessing memory have a coherent picture of it. For the Zynq an agent would therefore appear to be anything accessing memory via the SCU, since the SCU is what maintains data coherency. The MMUs, branch predictors and L1 caches in each ARM core all access memory this way, as does programmable logic which uses the ACP. However, PL that uses the direct access route will not in general get a view of memory coherent with these other users of memory. The level of coherency is simply the number of levels of cache you have to manipulate in order to make sure that your changes to memory reach the point of coherency. You operate first on the level of cache closest to the CPU (L1), then the next further out (if any), etc.
  • Point/Level of Unification, Uniprocessor: This is the point at which changes to memory become visible to the CPU, branch predictors, MMU and L1 caches belonging to the core making the changes, and the number of levels of cache you need to manipulate to get to that point. Often ARM docs drop the "Uniprocessor" when discussing this.
  • Point/Level of Unification, Inner Shareable: The point/level at which changes become visible to the CPUs, branch predictors, L1 caches and MMUs of all the cores in the Inner Shareable domain to which the core making the change belongs. For the Zynq this means both cores.

...

By the way, ARM manuals speak of "invalidation" or "flushing", which means marking cache lines as unused, and "cleaning", which means writing dirty cache lines out to the next lower level of caching (or to main memory) and marking them as clean. Note that most folks use "flushing" to mean what ARM calls "cleaning".

Whole-cache operations or those requiring (level, set, way)

Entire cache:

  • ICIALLU (Instruction Cache Invalidate All to point of Unification): invalidate the L1 I-cache for the local CPU and invalidate all its branch predictions.
  • ICIALLUIS (Instruction Cache Invalidate All to point of Unification for the Inner Shared domain): invalidate the L1 I-caches and branch predictions for all CPUs in the same Inner Sharability domain.
  • BPIALL, BPIALLIS: Similar to the above but invalidate only branch predictions, leaving the I-caches untouched.

...

Notice that there's no operation that invalidates or cleans an entire data cache; to do that one has to loop over all sets and ways.
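
A sketch of such a loop for the L1 D-cache geometry quoted earlier (256 sets, 4 ways, 32-byte lines); DCCISW is CP15 c7, c14, 2, with the way in bits [31:30], the set in bits [12:5] and the cache level (0 for L1) in bits [3:1]:

No Format

#include <stdint.h>

/* Clean and invalidate the local L1 data cache by set/way. */
static void l1_dcache_clean_invalidate_all(void)
{
    for (uint32_t way = 0; way < 4; ++way) {
        for (uint32_t set = 0; set < 256; ++set) {
            uint32_t sw = (way << 30) | (set << 5);   /* level = 0 */
            __asm__ volatile("mcr p15, 0, %0, c7, c14, 2" :: "r"(sw));
        }
    }
    __asm__ volatile("dsb" ::: "memory");   /* wait for completion */
}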

Operations using virtual addresses

Operations:

  • DCIMVAC (Data Cache Invalidate by MVA to PoC)
  • DCCMVAC (like the above but cleans)
  • DCCIMVAC (like the above but cleans and invalidates)
  • DCCMVAU (Data Cache Clean by MVA to PoU)
  • ICIMVAU (Instruction Cache Invalidate by MVA to PoU)

Coprocessor instructions for cache maintenance

Cache maintenance operations are implemented as operations on coprocessor 15 (p15) involving coprocessor register 7 (c7), along with various opcodes and secondary coprocessor registers:

...

For VA operations the GPR contains the virtual address. It needn't be cache-line aligned. The entire cache line containing the VA is affected.
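
For example (a sketch using GCC inline assembly), DCCMVAC is c7, c10, 1 with the VA in the GPR:

No Format

/* Clean one data cache line, by virtual address, to the point of coherency. */
static inline void dccmvac(const void *va)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(va) : "memory");
}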

Synchronizing after cache maintenance for SMP

By themselves the cache maintenance operations don't do the whole job; you also have to use memory barrier operations to broadcast the result of a data cache or instruction cache operation to other CPUs. Even that doesn't do everything because the remote CPUs still need to be told to dump their instruction fetch pipelines; they might have become inconsistent with their new I-cache states. Some sort of explicit inter-CPU signaling is needed; in the following example from ARM the code assumes the use of a simple semaphore:

No Format

; First CPU
P1:
    STR R11, [R1]    ; R11 contains a new instruction to store in program memory
    DCCMVAU R1       ; clean to PoU makes visible to instruction cache
    DSB              ; ensure completion of the clean on all processors
    ICIMVAU R1       ; ensure instruction cache/branch predictor discards stale data
    BPIMVA R1
    DSB              ; ensure completion of the ICache and branch predictor
                     ; invalidation on all processors
    STR R0, [R2]     ; set flag to signal completion
    ISB              ; synchronize context on this processor
    BX R1            ; branch to new code

; Second CPU
P2-Px:
    WAIT ([R2] == 1) ; wait for flag signaling completion
    ISB              ; synchronize context on this processor
    BX R1            ; branch to new code

Initializing the caches after a cold reset

CPU0 is the first to come up.

Each CPU should invalidate all entries in its own L1 data and instruction caches before entering SMP mode.

Multiprocessor support

...

Shareable memory and

...

shareability domains

SMP mode

TLB broadcast mode

System state immediately after a reset

...