System overview

The Zynq system-on-a-chip at the highest level consists of:

  • The Processing System (PS)
  • Programmable Logic (PL)

These notes will concentrate on the PS.

The PS consists of:

  • The Application Processor Unit (APU)
  • I/O devices
    • USB, UART, Gigabit Ethernet, I2C, SD card interface, etc.
  • Memory interfaces
    • DDR, flash of various sorts, SRAM
  • A Central Interconnect tying together all of the above
  • Logic for reset and for clock generation
  • Direct (non-coherent) interface from PL to DDR and to OCM (see below)
  • Accelerator Coherency Port (ACP) from PL to SCU (see below).

The APU consists of:

  • Cortex-A9 MPCore processor with two cores which share the other APU resources
  • 512 KB unified L2 cache and controller
  • 256 KB of static On-Chip Memory (OCM)
  • Generic Interrupt Controller (GIC)
  • Snoop Control Unit (SCU)
  • Various timers and counters
  • DMA channel

Each core in the APU consists of:

  • Cortex-A9 uniprocessor CPU (UPCPU)
  • 32 KB L1 instruction cache
  • 32 KB L1 data cache
  • Memory Management Unit (MMU)
  • NEON floating point and vector unit

The SCU makes sure that the L1 caches in each core and the shared L2 cache in the APU are coherent, that is, they agree on the value associated with a given address. It also keeps the two cores' views of memory coherent. In order to do the latter it has access to the L1 caches and to the translation look-aside buffers (TLBs) in the MMUs.

Programs running in the UPCPUs of the cores see the non-core resources of the PS, such as the L2 cache controller, the UART and the GIC, as sets of memory-mapped registers. They see the in-core resources through the standard coprocessor interface instructions, which can address up to 16 coprocessors numbered 0-15. Coprocessor 15 is reserved for the MMU, L1 caches and various core control functions. Coprocessors 10 and 11 are used for the SIMD and floating-point operations provided by the NEON unit.
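For example, identifying the CPU means reading an in-core CP15 register with MRC, while talking to a peripheral means loads and stores to its memory-mapped registers. A minimal sketch in C with GCC inline assembly (the 0xE0000000 base matches the IOP region in the memory map later in these notes; the register offset is left to the caller):

    #include <stdint.h>

    /* Read the Main ID Register (MIDR): MRC p15, 0, <Rt>, c0, c0, 0. */
    static inline uint32_t read_midr(void)
    {
        uint32_t midr;
        __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(midr));
        return midr;
    }

    /* Non-core resources are plain memory-mapped registers; 0xE0000000
       is the start of the Zynq I/O peripheral (IOP) region. */
    static inline uint32_t iop_reg_read(uint32_t base, uint32_t offset)
    {
        return *(volatile uint32_t *)(base + offset);
    }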

The ARM Cortex-A9

ARM is an old semi-RISC processor design; the first version of it was released in 1985. Implementations are classified broadly by architecture and by the design of the processor core implementing the architecture:

  • Architecture. This is the view of the UPCPU seen by programmers (privileged and not). The architecture revision is referred to as ARMv5, ARMv7, etc. There used to be only one current version of the architecture, but lately it has been split into three "profiles":
    • A or Application profile. Intended for use with multi-user operating systems such as Linux, A-profile architectures include a Virtual Memory System Architecture (VMSA) wherein an MMU (or several) provide full-blown address remapping and memory attributes such as cached, non-executable, etc.
    • R or Real-Time profile. Meant for single-user RTOSes such as VxWorks or RTEMS. Incorporates a Protected Memory System Architecture (PMSA) wherein an MPU provides memory attributes but not address mapping. Integer division is implemented in hardware. There are "tightly coupled" memories that are closer to the UPCPU than any L1 caches.
    • M or Microcontroller profile. Intended for the simplest embedded systems which don't run a true operating system.
  • UPCPU implementation. There have been many implementations, until recently designated by "ARM" followed by a processor family number and a bunch of letters telling what extra features are present, e.g., ARM7TDMI. Note that the family number doesn't indicate which ARM architecture revision is implemented, e.g., the ARM7TDMI implements architecture v4T. Lately this scheme has been abandoned in favor of a family name, architecture profile letter and family revision number, such as "Cortex-A9".
  • Number of cores. In later systems a number of ARM cores may share main memory, some or all peripherals and some caches. Such systems have the word "MPCore" appended to the classification.

ARM features implemented in the Zynq

The APU on the Zynq implements these ARM features:

  • ARM architecture v7 (ARMv7) for the UPCPUs
  • NEON floating point and advanced SIMD
    • Floating point VFP3-32
  • VMSA, i.e., ARMv7-A
  • Generic Timer
  • Security extension
  • Multiprocessor extension
  • Debugging
    • JTAG
    • CoreSight
  • Instruction sets:
    • ARM
    • DSP-like instructions (EDSP feature)
    • Thumb-2
    • ARM/Thumb inter-operation instructions
    • Jazelle and ThumbEE, hardware assists for Java Virtual Machines

The Cortex family of ARM processors incorporates as standard some features that used to be optional in earlier families and were designated by letters following the family names: (T)humb instruction set, (D)ebugging using JTAG, faster (M)ultiplication instructions, embedded (I)CE trace/debug and (E)nhanced DSP instructions. MPCore variants have synchronization instructions that replace the older Swap (SWP): Load Register Exclusive (LDREX) and Store Register Exclusive (STREX).
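As a sketch of how LDREX/STREX replace SWP, here is a minimal spinlock in C with GCC inline assembly (the function names are illustrative, not the RTEMS API):

    #include <stdint.h>

    /* Acquire: LDREX marks the address for exclusive access; STREX stores
       only if no other observer touched it in between, writing 0 to
       'status' on success and 1 on failure, in which case we retry. */
    static void spin_lock(volatile uint32_t *lock)
    {
        uint32_t value, status;
        do {
            do {
                __asm__ volatile("ldrex %0, [%1]" : "=r"(value) : "r"(lock));
            } while (value != 0);                  /* spin while lock is held */
            __asm__ volatile("strex %0, %2, [%1]"
                             : "=&r"(status)
                             : "r"(lock), "r"(1u)
                             : "memory");
        } while (status != 0);                     /* lost exclusivity: retry */
        __asm__ volatile("dmb" ::: "memory");      /* entry barrier */
    }

    static void spin_unlock(volatile uint32_t *lock)
    {
        __asm__ volatile("dmb" ::: "memory");      /* drain critical-section writes */
        *lock = 0;                                 /* ordinary store releases */
    }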

The following are not implemented:

  • Obsolete floating-point (FP) independent of NEON
  • Alternate NEON floating point (VFP3-16 or VFP4-anything)
  • 40-bit physical addresses (Large Physical Address Extension)
  • Virtualization (hypervisor support)
  • SWP instruction.

UPCPU "state" vs. "mode" and "privilege level"

Both mode and state are reflected in bits in the Current Program Status Register (CPSR). "State" refers to the instruction set being executed. "Mode" and "privilege" determine the view of the UPCPU the programmer sees; some instructions may be forbidden and the visible bank of registers may differ.

Instruction sets:

  • The standard ARM instruction set. Each instruction is 32 bits long and aligned on a 32-bit boundary. The full set of general registers is available. Shift operations may be combined with arithmetic and logical operations. This is the instruction set we'll be using for our project.
  • Thumb-2. Designed for greater code density. Contains a mix of 16-bit and 32-bit instructions. Many instructions can access only general registers 0-7.
  • Jazelle. Java bytecode, executed directly by the hardware.
  • ThumbEE. A sort of hybrid of Thumb and Jazelle, actually a CPU operation mode. Intended for environments where code modification is frequent, such as ones having a JIT compiler.

Privilege levels:

  • Level 0 (PL0): Unprivileged.
  • Level 1 (PL1): Privileged.

The two privilege levels are duplicated in Secure mode and Non-Secure mode, which have distinct address spaces. Under RTEMS we'll be using only Secure mode running at PL1.
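A quick way to see the mode, and hence the privilege level, from code is to read the CPSR mode field; a small sketch:

    #include <stdint.h>

    /* CPSR bits 4:0 hold the mode. User mode (0x10) is the only PL0 mode;
       all the others (e.g., Supervisor, 0x13) run at PL1. */
    static inline uint32_t current_mode(void)
    {
        uint32_t cpsr;
        __asm__ volatile("mrs %0, cpsr" : "=r"(cpsr));
        return cpsr & 0x1fu;
    }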

MMU

There can be up to four logical MMUs per ARM core (they may be implemented with a single block of silicon with multiple banks of working registers). Without the security or virtualization extensions there is just one logical MMU which is used for both privileged and non-privileged accesses. Adding the security extension adds another for secure code, again for both privilege levels. Adding the virtualization extension adds two more logical MMUs, one for the hypervisor and one for a second stage of translation for code running in a virtual machine. The first stage of translation in a virtual machine maps VM-virtual to VM-real addresses while the second stage maps VM-real to actual hardware addresses. The hypervisor's MMU maps only once, from hyper-virtual to actual hardware addresses.

The Zynq CPU cores have just the security extension and so each has two logical MMUs. Since under RTEMS we'll be running only in secure mode, only one MMU per core will be of use to us.

All the MMUs come up disabled after a reset, with garbage in their TLB entries. If all the relevant MMUs for a particular CPU state are disabled the system is still operable. Data accesses assume a memory type of Ordered, so there is no prefetching or reordering; data caches must be disabled or contain only invalid entries since a cache hit in this state results in unpredictable action. Instruction fetches assume a memory type of Normal, are uncached but still speculative, so that addresses up to 8 KB above the start of the current instruction may be accessed.

Unlike the PowerPC, ARM provides no way to examine MMU TLB entries or to set them directly; you must have a page table of some sort set up when you enable the MMU.

Automatic replacement of TLB entries normally uses a "pseudo-random" or "round robin" algorithm, not the "least recently used" algorithm implemented in the PowerPC. On the Zynq pseudo-random is the power-on default. The only way to keep heavily used entries in the TLB indefinitely is to explicitly lock them in, which you can do with up to four entries. These locked entries occupy a special part of the TLB which is separate from the normal main TLB, so you don't lose entry slots if you use locking.

In a multi-core system like the Zynq all the ARM cores can share the same MMU translation table if all of the following conditions are met:

  1. All cores have SMP mode enabled.
  2. All cores have TLB maintenance broadcast enabled.
  3. All the MMUs are given the same real base address for the translation table.
  4. The translation table is in memory marked Normal, shareable, with write-back caching.

Under these conditions any core can change an address translation as if it were alone and have the changes broadcast to the others.

When the MMU is fetching translation table entries it will ignore the L1 cache unless you set some low-order bits in the Translation Table Base Register indicating that the table is write-back cached. Apparently write-through caching isn't good enough but ignoring the L1 cache in that case is correct, if slow.
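A sketch of what setting those bits might look like, assuming the Multiprocessor Extensions layout of TTBR0 (IRGN split across bits 6 and 0, RGN in bits 4:3, S in bit 1); verify the exact encodings against the ARM ARM:

    #include <stdint.h>

    #define TTBR_IRGN_WBWA (1u << 6)  /* IRGN = 0b01: inner write-back, write-allocate */
    #define TTBR_RGN_WBWA  (1u << 3)  /* RGN  = 0b01: outer write-back, write-allocate */
    #define TTBR_S         (1u << 1)  /* table is in shareable memory */

    /* Point TTBR0 at a 16 KB-aligned translation table and tell the MMU
       the table itself is write-back cached, so walks may hit the L1. */
    static inline void set_ttbr0(uint32_t table_base)
    {
        uint32_t ttbr0 = table_base | TTBR_IRGN_WBWA | TTBR_RGN_WBWA | TTBR_S;
        __asm__ volatile("mcr p15, 0, %0, c2, c0, 0" :: "r"(ttbr0));
        __asm__ volatile("isb");
    }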

Translation table entries specify an access domain number from zero to 15. For each domain there are two bits in the Domain Access Control Register which determine the access rights for the domain. The choices are:

  • No Access.
  • Client. Further bits in the translation table entry determine access rights.
  • Manager. All rights granted.

If No Access is specified in the DACR, access attempts generate Domain exceptions. If a type of access is denied by the translation table entry it causes a Permission exception.
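For instance, the RTEMS configuration described below uses domain 15 as a Client domain; a sketch of programming the DACR (CP15 register c3):

    #include <stdint.h>

    #define DOMAIN_NO_ACCESS 0u  /* accesses fault with a Domain exception */
    #define DOMAIN_CLIENT    1u  /* translation table entries decide       */
    #define DOMAIN_MANAGER   3u  /* all accesses allowed                   */

    /* Each domain owns two bits of the DACR: domain n is bits 2n+1:2n. */
    static inline void set_domain(unsigned domain, uint32_t access)
    {
        uint32_t dacr;
        __asm__ volatile("mrc p15, 0, %0, c3, c0, 0" : "=r"(dacr));
        dacr = (dacr & ~(3u << (2 * domain))) | (access << (2 * domain));
        __asm__ volatile("mcr p15, 0, %0, c3, c0, 0" :: "r"(dacr));
    }

    /* E.g., set_domain(15, DOMAIN_CLIENT); */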

MMU translation tables implemented for RTEMS

The 4 GB address space is divided into Regions, where each Region is assigned a page size and default memory access rights based on the intended use of the Region. Simple dynamic allocation is supported for Regions where that makes sense. In Regions where we don't need to change access permissions on relatively small areas, e.g., I/O buffers, we set the page size to 1 MB, the size of a section. Regions which require finer-grained control of access permissions, e.g., where shared libraries are loaded, are given a page size of 4 KB. All MMU table entries specify access domain 15, which is set up for Client access, meaning that the translation table entries specify access permissions. All entries are global entries, meaning that they apply to all code regardless of thread; the table is never modified by context switches. The address mapping is the identity, i.e., virtual == real.
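To make the table format concrete, here is a sketch of building one identity-mapped 1 MB section entry in the ARMv7 short-descriptor format (TEX remapping off); the helper is illustrative, not the actual RTEMS code:

    #include <stdint.h>

    /* Short-descriptor section entry: base in bits 31:20, S bit 16,
       AP[2] bit 15, AP[1:0] bits 11:10, domain bits 8:5, XN bit 4,
       C bit 3, B bit 2, type 0b10 in bits 1:0. */
    #define SEC_TYPE      0x2u
    #define SEC_B         (1u << 2)
    #define SEC_C         (1u << 3)
    #define SEC_XN        (1u << 4)
    #define SEC_DOMAIN(d) ((uint32_t)(d) << 5)
    #define SEC_AP_RW     (0x3u << 10)  /* read/write at PL0 and PL1 */
    #define SEC_S         (1u << 16)

    /* Identity-map the 1 MB section containing 'addr' as Normal,
       write-back cached, shareable, read/write, no-execute, domain 15. */
    static void map_section_rw(uint32_t table[4096], uint32_t addr)
    {
        table[addr >> 20] = (addr & 0xfff00000u) | SEC_TYPE | SEC_B | SEC_C
                          | SEC_XN | SEC_DOMAIN(15) | SEC_AP_RW | SEC_S;
    }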

The following table shows the memory map currently being used for RTEMS. Address ranges not in the table are reserved and will cause an access exception if a program tries to use them. The access permissions shown below are the defaults. When a shared library is loaded the memory occupied by each of its loadable segments is given access permissions compatible with the segment's flags: executable segments are given execution permission but not write permission, and data segments are given write permission but not execute permission.

Permissions: (R)ead, (W)rite, e(X)ecute, (C)ached, (S)hared.

Region                    Addresses              Mem type          Page size  Allocatable  RWXCS  Purpose
Null catcher              0x00000000-0x00003fff  Reserved          4K         n/a          None   Catch attempts to dereference null pointers.
Syslog                    0x00004000-0x00103fff  Normal            4K         Yes          RWXC.  Circular buffer for captured console output.
MMU                       0x00104000-0x00167fff  Strongly ordered  4K         Yes          R...S  MMU tables.
Run-time support          0x00168000-0x0a9fffff  Normal            4K         Yes          RWXC.  Loaded shared libraries.
Workspace                 0x0aa00000-0x153fffff  Normal            1M         Yes          RW.C.  RTEMS workspace, C/C++ heap, application data.
Uncached                  0x15400000-0x1fffffff  Normal            1M         Yes          RW...  I/O buffers.
Socket                    0x40000000-0x500fffff  Device            1M         No           RW..S  Interfaces to firmware sockets.
AXI0 test                 0x50100000-0x501fffff  Device            1M         No           RW..S  Reserved for interfaces to non-standard firmware.
Firmware Version/Control  0x80000000-0x800fffff  Device            1M         No           RW..S  Version info, some control registers.
BootStrap Information     0x84000000-0x840fffff  Device            1M         No           RW..S  Bootstrap info structure.
AXI1 test                 0x84100000-0x841fffff  Device            1M         No           RW..S  Reserved for interfaces to non-standard firmware.
IOP                       0xe0000000-0xe02fffff  Device            1M         No           RW..S  Various I/O devices.
Static                    0xe1000000-0xe5ffffff  Device            1M         No           RW..S  Static memory controllers, e.g., flash.
High-reg                  0xf8000000-0xfffbffff  Device            4K         No           RW..S  Various I/O devices.
OCM                       0xfffc0000-0xffffffff  Normal            4K         Yes          RW.CS  On-chip memory.

Caches

The Zynq-7000 system has two levels of cache between a memory-access instruction and main memory. Each ARM core in the APU has semi-private L1 instruction and data caches; "semi-private" because the SCU keeps the L1 caches of the cores in sync. Outside of the ARM cores the APU has a unified instruction+data cache, also managed by the SCU, which keeps it coherent with the L1 caches in the cores. Xilinx refers to this shared, unified cache as an L2 cache. This nomenclature clashes with the definition of a level of cache in the ARMv7 architecture: such a cache would be managed via coprocessor 15 and would be considered part of the ARM core. The APU L2 cache by contrast does not belong to either ARM core and is managed via memory-mapped APU registers. The ARM architecture manual refers to such caches as "system caches". Anyway, from here on in I'll just use the term "L2 cache" as Xilinx does.

The L1 caches don't support entry locking but the L2 cache does. The L2 cache can operate in a so-called exclusion mode which prevents a cache line from appearing simultaneously in any L1 data cache and in the L2 cache.

Both the L1 caches and the L2 cache have the same line size of 32 bytes.

Cache sizes:

  • L1 I and L1 D: 32 KB, 256 sets, 4 ways (lines)/set.
  • L2: 512 KB, 2K sets, 8 ways/set.

L1

Barring any explicit cache manipulation by software the Snoop Control Unit ensures that the L1 data caches in both cores remain coherent and does the same for both cores' L1 instruction caches. The SCU does not, however, keep the instruction caches coherent with the data caches; it doesn't need to unless instructions have been modified. To handle that case the programmer has to use explicit cache maintenance operations in the ARM core that's performing the modifications, cleaning the affected data cache lines and invalidating the affected instruction cache lines. That brings the instruction cache into line with the data cache on the CPU core making the changes; the SCU will do the rest provided the programmer has used the right variants of the cache operations.

For ARMv7 VMSA the only cache policy choice implemented is write-back with line allocation on writes.

There are fewer valid cache operations for ARMv7 than for v6, some of the remaining valid ones are deprecated, and v7 supports the maintenance of more than one level of cache. One must be mindful of this when reading books that don't cover v7 or when using generic ARM support code present in RTEMS or other OSes. In addition, the meaning of a given cache operation is profoundly affected by the presence of the Multiprocessing and Security extensions (and by the Large Physical Address and Virtualization extensions, which we don't have to worry about).

Remember also that the CP15 interface in a Zynq APU core covers only the L1 caches belonging to that core. The Cache Level ID Register will report that the core has only one level of cache with separate I and D caches.

The Cache Level ID Register will also report that the levels of coherence and unification are all equal to one. These terms are used to define the effects of cache operations that take virtual addresses as arguments:

  • Point/Level of Coherency: According to the ARM Architecture, a cache operation going up to the point of coherency ensures that all "agents" accessing memory have a coherent picture of it. For the Zynq an agent would therefore appear to be anything accessing memory via the SCU, since the SCU is what maintains data coherency. The MMUs, branch predictors and L1 caches in each ARM core all access memory this way, as does programmable logic which uses the ACP. However, PL that uses the direct access route will not in general get a view of memory coherent with these other users of memory. The level of coherency is simply the number of levels of cache you have to manipulate in order to make sure that your changes to memory reach the point of coherency. You operate first on the level of cache closest to the CPU (L1), then the next further out (if any), etc.
  • Point/Level of Unification, Uniprocessor: This is the point at which changes to memory become visible to the CPU, branch predictors, MMU and L1 caches belonging to the core making the changes, and the number of levels of cache you need to manipulate to get to that point. ARM docs often drop the "Uniprocessor" when discussing this.
  • Point/Level of Unification, Inner Shared: The point/level at which changes become visible to the CPUs, branch predictors, L1 caches and MMUs for all the cores in the Inner shareable group to which the core making the change belongs. For the Zynq this means both cores.

All this would seem to imply that the SCU won't maintain coherence between cores unless you use the Inner Shared versions of the cache operations that have them. The non-IS versions of these operations, and the versions that take set/way, would then affect only the local core, requiring one to also use resources outside the core in order to get coherency. For the Zynq that would mean manipulating the APU's L2 cache.

By the way, ARM manuals speak of "invalidation" or "flushing", which means marking cache lines as unused, and "cleaning", which means writing dirty cache lines out to the next lower level of caching (or to main memory) and marking them as clean. Note that most folks use "flushing" to mean what ARM calls "cleaning".

Whole-cache operations or those requiring (level, set, way)

Entire cache:

  • ICIALLU (Instruction Cache Invalidate All to point of Unification): invalidate the L1 I-cache for the local CPU and invalidate all its branch predictions.
  • ICIALLUIS (Instruction Cache Invalidate All to point of Unification for the Inner Shared domain): invalidate the L1 I-caches and branch predictions for all CPUs in the same Inner Sharability domain.
  • BPIALL, BPIALLIS: Similar to the above but invalidate only branch predictions, leaving the I-caches untouched.

By level, set, and way:

  • DCISW (Data Cache Invalidate by Set/Way): Invalidate for the local CPU a single cache line which is selected by specifying the set and the way.
  • DCCSW (Data Cache Clean by Set/Way): Clean a single cache line for the local CPU.
  • DCCISW (Data Cache Clean and Invalidate by Set/Way): Clean and invalidate a single cache line for the local CPU.

Notice that there's no operation that invalidates or cleans an entire data cache; to do that one has to loop over all sets and ways (a sketch of such a loop appears after the operand format description below).

Operations using virtual addresses

Operations:

  • DCIMVAC (Data Cache Invalidate by MVA to PoC)
  • DCCMVAC (like the above but cleans)
  • DCCIMVAC (like the above but cleans and invalidates)
  • DCCMVAU (Data Cache Clean by MVA to PoU)
  • ICIMVAU (Instruction Cache Invalidate by MVA to PoU)
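As an example of using the VA operations, here is a sketch that cleans a buffer to the point of coherency, e.g., before handing it to PL that reaches DDR through the non-coherent direct path (32-byte line size, per the Caches section above):

    #include <stdint.h>

    #define CACHE_LINE 32u

    /* Write back every L1 D-cache line overlapping [buf, buf + len). */
    static void dcache_clean_range(const void *buf, uint32_t len)
    {
        uint32_t addr = (uint32_t)buf & ~(CACHE_LINE - 1u);
        uint32_t end  = (uint32_t)buf + len;

        for (; addr < end; addr += CACHE_LINE)
            __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(addr)); /* DCCMVAC */

        __asm__ volatile("dsb" ::: "memory");  /* wait for the cleans to finish */
    }

Remember that the CP15 interface reaches only the local L1 caches; getting the data all the way to DDR also requires cleaning the corresponding lines in the L2 cache through its memory-mapped registers.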

Coprocessor instructions for cache maintenance

Cache maintenance operations are implemented as operations on coprocessor 15 (p15) involving coprocessor register 7 (c7), along with various opcodes and secondary coprocessor registers:

Operation  Instruction                    GPR operand
ICIALLU    MCR P15, 0, GPR, C7, C5, 0     Ignored
BPIALL     MCR P15, 0, GPR, C7, C5, 6     Ignored
ICIALLUIS  MCR P15, 0, GPR, C7, C1, 0     Ignored
BPIALLIS   MCR P15, 0, GPR, C7, C1, 6     Ignored
DCISW      MCR P15, 0, GPR, C7, C6, 2     Packed level, set, way
DCCSW      MCR P15, 0, GPR, C7, C10, 2    Packed level, set, way
DCCISW     MCR P15, 0, GPR, C7, C14, 2    Packed level, set, way
DCIMVAC    MCR P15, 0, GPR, C7, C6, 1     Virtual address
DCCMVAC    MCR P15, 0, GPR, C7, C10, 1    Virtual address
DCCIMVAC   MCR P15, 0, GPR, C7, C14, 1    Virtual address
DCCMVAU    MCR P15, 0, GPR, C7, C11, 1    Virtual address
ICIMVAU    MCR P15, 0, GPR, C7, C5, 1     Virtual address

For set/way operations the format of the operand in the general-purpose register is
(way << (32 - A)) | (set << L) | (level << 1) where:

  • way is the way number, starting at 0
  • A is ceil(log2(number of ways))
  • set is the set number starting from 0
  • L is log2(cache line size)
  • level is the cache level (0 for L1, etc.)

For the Zynq L1 I and D caches this reduces to (way << 30) | (set << 5), where way < 4 and set < 256.
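Putting that together, here is a sketch that cleans and invalidates the entire local L1 data cache by looping over all 256 sets and 4 ways with DCCISW:

    #include <stdint.h>

    /* Clean and invalidate the whole local L1 D-cache by set/way.
       Zynq L1 operand: (way << 30) | (set << 5), level bits zero. */
    static void l1_dcache_clean_invalidate_all(void)
    {
        for (uint32_t way = 0; way < 4; way++)
            for (uint32_t set = 0; set < 256; set++) {
                uint32_t sw = (way << 30) | (set << 5);
                __asm__ volatile("mcr p15, 0, %0, c7, c14, 2" :: "r"(sw)); /* DCCISW */
            }
        __asm__ volatile("dsb" ::: "memory");
    }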

For VA operations the GPR contains the virtual address. It needn't be cache-line aligned. The entire cache line containing the VA is affected.

Synchronizing after cache maintenance for SMP

By themselves the cache maintenance operations don't do the whole job; you also have to use memory barrier operations to broadcast the result of a data cache or instruction cache operation to other CPUs. Even that doesn't do everything, because the remote CPUs still need to be told to dump their instruction fetch pipelines, which may have become inconsistent with their new I-cache states. Some sort of explicit inter-CPU signaling is needed. In the following example, adapted from ARM's documentation, a simple flag variable is used and the cache and branch-predictor pseudo-ops are spelled out as their MCR encodings:

; First CPU
P1:
    STR R11, [R1]              ; R11 contains a new instruction to store in program memory
    MCR p15, 0, R1, c7, c11, 1 ; DCCMVAU: clean to PoU makes the store visible to the I-cache
    DSB                        ; ensure completion of the clean on all processors
    MCR p15, 0, R1, c7, c5, 1  ; ICIMVAU: discard the stale instruction cache line
    MCR p15, 0, R1, c7, c5, 7  ; BPIMVA: discard stale branch predictions for this address
    DSB                        ; ensure completion of the ICache and branch predictor
                               ; invalidation on all processors
    MOV R0, #1
    STR R0, [R2]               ; set flag to signal completion
    ISB                        ; synchronize context on this processor
    BX R1                      ; branch to new code

; Second and later CPUs
P2:
    LDR R0, [R2]               ; wait for flag signaling completion
    CMP R0, #1
    BNE P2
    ISB                        ; synchronize context on this processor
    BX R1                      ; branch to new code

Initializing the caches after a cold reset

CPU0 is the first to come up.

Each CPU should invalidate all entries in its own L1 data and instruction caches before entering SMP mode.
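A sketch of that initialization: ICIALLU for the instruction cache plus a set/way loop using DCISW, invalidate-only so the garbage contents are never written back:

    #include <stdint.h>

    /* Invalidate (without cleaning!) this CPU's L1 caches after cold reset. */
    static void l1_caches_invalidate_all(void)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" :: "r"(0));      /* ICIALLU */

        for (uint32_t way = 0; way < 4; way++)
            for (uint32_t set = 0; set < 256; set++) {
                uint32_t sw = (way << 30) | (set << 5);
                __asm__ volatile("mcr p15, 0, %0, c7, c6, 2" :: "r"(sw)); /* DCISW */
            }

        __asm__ volatile("dsb" ::: "memory");
        __asm__ volatile("isb");
    }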

Multiprocessor support

Shareable memory and shareability domains

SMP mode

TLB broadcast mode

System state immediately after a reset

Item                  Value
State                 ARM
Mode                  Supervisor
Privilege level       1
Exceptions            All disabled save for Reset
Security level        Secure
Secure mode MMU       Disabled (SCTLR.M == 0)
Nonsecure mode MMU    Random (SCTLR.M random) but not functioning since Secure mode is active
PC                    0x00000000 or 0xFFFF0000 depending on reset behavior of SCTLR.V bit
Caches                All disabled, contents random
Branch tracer (BTAC)  ???
Snooper               ???
SMP mode              ???
TLB broadcast mode    ???
SP                    Random
Other GPRs            Random
CPSR I-bit            1
CPSR F-bit            1

The MMU's TLB entries have random content so one must at least invalidate all TLB entries before enabling the MMU. With the MMU disabled all instruction fetches are assumed to be to Normal memory while data accesses are assumed to be to Ordered memory.
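A sketch of the resulting bring-up order, assuming the translation table, TTBR0 and DACR are already set up: invalidate the TLBs, then set SCTLR.M:

    #include <stdint.h>

    /* Invalidate all unlocked TLB entries (TLBIALL), then enable the MMU
       by setting the M bit in the System Control Register (SCTLR). */
    static void mmu_enable(void)
    {
        uint32_t sctlr;

        __asm__ volatile("mcr p15, 0, %0, c8, c7, 0" :: "r"(0));  /* TLBIALL */
        __asm__ volatile("dsb" ::: "memory");

        __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
        sctlr |= 1u;                                               /* SCTLR.M */
        __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr) : "memory");
        __asm__ volatile("isb");           /* fetch with the new translations */
    }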

All cache lines should be invalidated prior to enabling the caches.

The Reset exception is of the highest priority so in effect all others are disabled. Reset itself can never be disabled.

The Reset handler must set the initial SP values for all modes.
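A sketch of one way to do that from C, using CPS to visit each banked mode and load its SP (mode numbers from the ARMv7 ARM; the stack-top addresses are the caller's choice):

    #include <stdint.h>

    /* Runs in Supervisor mode right after reset. Only SP (and LR) are
       banked, so the operand registers survive the mode changes. */
    static void init_mode_stacks(uint32_t irq_top, uint32_t fiq_top,
                                 uint32_t abt_top, uint32_t und_top,
                                 uint32_t sys_top)
    {
        __asm__ volatile(
            "cps #0x12 \n mov sp, %0 \n"   /* IRQ                          */
            "cps #0x11 \n mov sp, %1 \n"   /* FIQ                          */
            "cps #0x17 \n mov sp, %2 \n"   /* Abort                        */
            "cps #0x1b \n mov sp, %3 \n"   /* Undefined                    */
            "cps #0x1f \n mov sp, %4 \n"   /* System (shares SP with User) */
            "cps #0x13 \n"                 /* back to Supervisor           */
            :: "r"(irq_top), "r"(fiq_top), "r"(abt_top),
               "r"(und_top), "r"(sys_top));
    }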

References
