...

  • The Application Processor Unit (APU)
  • I/O devices
    • USB, UART, Gigabit Ethernet, I2C, SD card interface, etc.
  • Memory interfaces
    • DDR, flash of various sorts, SRAM
  • A Central Interconnect tying together all of the above
  • Logic for reset and for clock generation
  • High-speed Direct (non-coherent) interface from PL to DDR interface and to OCM (see below)
  • Lower-speed Accelerator Coherency Port (ACP) from PL to SCU (see below).

...

  • Cortex-A9 uniprocessor CPU (UPCPU)
  • 32 KB L1 instruction cache
  • 32 KB L1 data cache
  • Memory Management Unit (MMU)
  • NEON floating point and vector unit

The SCU makes sure that the L1 caches in each core and the shared L2 cache in the APU are coherent, that is, they agree on the value associated with a given address. It also keeps the two cores' views of memory coherent. In order to do the latter it has access to the L1 caches and to the translation look-aside buffers (TLBs) in the MMUs.

Programs running in the UPCPUs of the cores see the non-core resources of the PS, such as the L2 cache controller, the UART and the GPIC, as sets of memory-mapped registers. They see the in-core resources through the standard coprocessor interface instructions, which can address up to 16 coprocessors numbered 0-15. Coprocessor 15 is reserved for the MMU, L1 caches and various core control functions. Coprocessors 10 and 11 are reserved for the floating-point and SIMD operations provided by the NEON unit.
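
To make the distinction concrete, here is a minimal sketch (GCC inline assembly, not taken from RTEMS) contrasting the two kinds of access: an in-core read of the Main ID Register through coprocessor 15 versus a load from a memory-mapped peripheral register. The UART0 base address is an assumption to be checked against the Zynq TRM.

    #include <stdint.h>

    /* In-core resource: read the Main ID Register via coprocessor 15
     * (MRC p15, 0, <Rt>, c0, c0, 0). */
    static inline uint32_t read_midr(void)
    {
        uint32_t midr;
        __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(midr));
        return midr;
    }

    /* Non-core resource: a UART register is just a memory-mapped location.
     * 0xE0000000 is assumed here to be the UART0 base on the Zynq; check
     * the TRM for the real address and register layout. */
    #define UART0_BASE 0xE0000000u

    static inline uint32_t read_uart0_reg0(void)
    {
        return *(volatile uint32_t *)UART0_BASE;
    }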

...

  • Architecture. This is the view of the UPCPU seen by programmers (privileged and not). The architecture revision is referred to as ARMv5, ARMv7, etc. There used to be only one current version of the architecture but lately this has been split into three "profiles":
    • A or Application profile. Intended for use with multi-user operating systems such as Linux, A-profile architectures include a Virtual Memory System Architecture (VMSA) wherein an MMU (or several) provide full-blown address remapping and memory attributes such as cached, non-executable, etc.
    • R or Real-Time profile. Meant for single-user RTOSes such as VxWorks or RTEMS. Incorporates a Protected Memory System Architecture (PMSA) wherein an MPU provides memory attributes but not address mapping. Integer division is implemented in hardware. There are "tightly coupled" memories that are closer to the UPCPU than any L1 caches.
    • M or Microcontroller profile. Intended for the simplest embedded systems which don't run a true operating system.
  • UPCPU implementation. There have been many implementations, until recently designated by "ARM" followed by a processor family number and a bunch of letters telling what extra features are present, e.g., ARM7TDMI. Note that the family number doesn't indicate which ARM architecture revision is implemented, e.g., the ARM7TDMI implements architecture v4T. Lately this scheme has been abandoned in favor of a family name, architecture profile letter and family revision number such as "Cortex-A9".
  • Number of cores. In later systems a number of ARM cores may share main memory, some or all peripherals and some caches. Such systems have the word "MPCore" appended to the classification.

ARM features implemented

...

in the Zynq

The APU on the Zynq implements these ARM features:

  • ARM architecture v7 (ARMv7) for the UPCPUs
  • NEON floating point and advanced SIMD
    • Floating point VFP3-32
  • VMSA, i.e., ARMv7-A
  • Generic Timer
  • Security extension
  • Multiprocessor extension
  • Debugging
    • JTAG
    • CoreSight
  • Instruction sets:
    • ARM
    • DSP-like instructions (EDSP feature)
    • Thumb-2
    • ARM/Thumb inter-operation instructions
    • Jazelle and ThumbEE, hardware assists for Java Virtual Machines

The Cortex family of ARM processors incorporates as standard some features that used to be optional in earlier families and were designated by letters following the family names: (T)humb instruction set, (D)ebugging using JTAG, faster (M)ultiplication instructions, embedded (I)CE trace/debug and (E)xtended instructions allowing easy switching between ARM and Thumb code. Oddly, the Cortex-A9 doesn't have any integer division instructions. MPCore variants have new synchronization instructions that replace the older Swap (SWP): Load Register Exclusive (LDREX) and Store Register Exclusive (STREX).
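
As an illustration, a minimal atomic-increment sketch using LDREX/STREX (the usual ARMv7 idiom with GCC inline assembly, not code taken from RTEMS):

    #include <stdint.h>

    /* Atomically increment *addr with an LDREX/STREX retry loop.  STREX
     * writes 0 to 'failed' on success and 1 if the exclusive monitor was
     * lost, in which case the store did not happen and we retry. */
    static inline uint32_t atomic_increment(volatile uint32_t *addr)
    {
        uint32_t value, failed;

        do {
            __asm__ volatile(
                "ldrex   %0, [%2]\n\t"
                "add     %0, %0, #1\n\t"
                "strex   %1, %0, [%2]"
                : "=&r"(value), "=&r"(failed)
                : "r"(addr)
                : "memory");
        } while (failed != 0);

        return value;   /* the newly stored value */
    }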

...

  • Obsolete floating-point (FP) independent of NEON
  • Alternate NEON floating point (VFP3-16 or VFP4-anything)
  • 40-bit physical addresses (Large Physical Address Extension)
  • Virtualization (hypervisor support)
  • SWP instruction.

...

UPCPU "state" vs. "mode" and "privilege level"

Both mode and state are reflected in bits in the Current Program Status Register (CPSR). "State" refers to the instruction set being executed. "Mode" and "privilege" determine the view of the UPCPU the programmer sees; some instructions may be forbidden and the visible bank of registers may differ.
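
As a small example, a sketch of reading the CPSR and picking out the mode and state fields (bit positions per the ARMv7 architecture manual):

    #include <stdint.h>

    static inline uint32_t read_cpsr(void)
    {
        uint32_t cpsr;
        __asm__ volatile("mrs %0, cpsr" : "=r"(cpsr));
        return cpsr;
    }

    /* CPSR[4:0] is the mode field: 0x10 User, 0x13 Supervisor, 0x1F System, etc.
     * Bits T (5) and J (24) encode the instruction-set state:
     * {J,T} = 00 ARM, 01 Thumb, 10 Jazelle, 11 ThumbEE. */
    #define CPSR_MODE(cpsr)  ((cpsr) & 0x1Fu)
    #define CPSR_T(cpsr)     (((cpsr) >> 5) & 1u)
    #define CPSR_J(cpsr)     (((cpsr) >> 24) & 1u)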

...

The two privilege levels are duplicated in Secure mode and Non-Secure mode, which have distinct address spaces. Under RTEMS we'll be using only Secure mode running at PL1.

MMU

There can be up to four logical MMUs per ARM core (they may be implemented as a single block of silicon with multiple banks of working registers). Without the security or virtualization extensions there is just one logical MMU, which is used for both privileged and non-privileged accesses. Adding the security extension adds another for secure code, again for both privilege levels. Adding the virtualization extension adds two more logical MMUs, one for the hypervisor and one for a second stage of translation for code running in a virtual machine. The first stage of translation in a virtual machine maps VM-virtual to VM-real addresses while the second stage maps VM-real to actual hardware addresses. The hypervisor's MMU maps only once, from hyper-virtual to actual hardware addresses.

The Zynq's ARM cores have just the security extension and so each has two logical MMUs. Since under RTEMS we'll be running only in secure mode, only one MMU per core will be of use to us.

All the MMUs come up disabled after a reset, with garbage in their TLB entries. If all the relevant MMUs for a particular CPU state are disabled the system is still operable. Data accesses assume a memory type of Ordered, so there is no prefetching or reordering; data caches must be disabled or contain only invalid entries, since a cache hit in this state results in unpredictable action. Instruction fetches assume a memory type of Normal and are uncached but still speculative, so addresses up to 8 KB above the start of the current instruction may be accessed.
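
A sketch of the early-boot consequence of this: invalidate the TLBs, instruction cache and branch predictor before turning the MMU on (CP15 encodings per the ARMv7 manual; the set/way loop needed to invalidate the data caches is omitted):

    #include <stdint.h>

    static void mmu_early_init(void)
    {
        uint32_t sctlr;

        /* TLBs and predictors hold garbage after reset, so invalidate them. */
        __asm__ volatile("mcr p15, 0, %0, c8, c7, 0" :: "r"(0));   /* TLBIALL */
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" :: "r"(0));   /* ICIALLU */
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(0));   /* BPIALL  */
        __asm__ volatile("dsb" ::: "memory");
        __asm__ volatile("isb" ::: "memory");

        /* ... set up TTBR0, DACR and the translation table, then: */
        __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
        sctlr |= 1u << 0;                            /* SCTLR.M: enable the MMU */
        __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr));
        __asm__ volatile("isb" ::: "memory");
    }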

...

In a multi-core system like the Zynq all the ARM cores can share the same MMU translation table if all of the following conditions are met:

  1. All cores have SMP mode enabled.
  2. All cores have TLB maintenance broadcast enabled.
  3. All the MMUs are given the same real base address for the translation table.
  4. The translation table is in memory marked Normal, Sharable with write-back caching.

Under these conditions any core can change an address translation as if it were alone and have the changes broadcast to the others.
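
On the Cortex-A9, conditions 1 and 2 amount to setting two bits in each core's Auxiliary Control Register; a sketch (bit positions per the Cortex-A9 TRM, to be run on each core before its MMU is enabled):

    #include <stdint.h>

    /* Cortex-A9 ACTLR: bit 6 = SMP (core participates in coherency),
     * bit 0 = FW (forward cache/TLB maintenance ops to the SCU so they
     * are broadcast to the other core). */
    #define A9_ACTLR_SMP  (1u << 6)
    #define A9_ACTLR_FW   (1u << 0)

    static void join_smp_coherency_domain(void)
    {
        uint32_t actlr;

        __asm__ volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(actlr));
        actlr |= A9_ACTLR_SMP | A9_ACTLR_FW;
        __asm__ volatile("mcr p15, 0, %0, c1, c0, 1" :: "r"(actlr));
        __asm__ volatile("isb" ::: "memory");
    }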

When the MMU is fetching translation table entries it will ignore the L1 cache unless you set some special low-order bits in the Translation Table Base Register indicating that the table is write-back cached. Apparently write-through caching isn't good enough, but ignoring the L1 cache in that case is correct, if slow.
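
A sketch of setting TTBR0 with those bits (my reading of the ARMv7 short-descriptor TTBR0 format with the Multiprocessing Extensions - IRGN in bits {6,0}, S in bit 1, RGN in bits [4:3] - so verify the encodings against the architecture manual):

    #include <stdint.h>

    #define TTBR_IRGN_WBWA  (1u << 0)   /* inner write-back, write-allocate */
    #define TTBR_SHAREABLE  (1u << 1)   /* table walks are coherent         */
    #define TTBR_RGN_WBWA   (1u << 3)   /* outer write-back, write-allocate */

    static void set_ttbr0(const uint32_t *table_base)   /* must be 16 KB aligned */
    {
        uint32_t ttbr0 = (uint32_t)(uintptr_t)table_base
                       | TTBR_IRGN_WBWA | TTBR_SHAREABLE | TTBR_RGN_WBWA;

        __asm__ volatile("mcr p15, 0, %0, c2, c0, 0" :: "r"(ttbr0));  /* TTBR0 */
        __asm__ volatile("isb" ::: "memory");
    }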

Translation table entries specify an access domain number from zero to 15. For each domain there are two bits in the Domain Access Control Register which determine the access rights for the domain. The choices are:

  • No Access.
  • Client. Further bits in the translation table entry determine access rights.
  • Manager. All rights granted.

If No Access is specified in the DACR, access attempts generate Domain exceptions. If a type of access is denied by the translation table entry it causes a Permission exception.
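
A sketch of programming the DACR accordingly, with domain 0 set to Client and all other domains left as No Access (two bits per domain: 0b00 No Access, 0b01 Client, 0b11 Manager):

    #include <stdint.h>

    static void set_domain_access(void)
    {
        uint32_t dacr = 0x1u;   /* domain 0 = Client, domains 1-15 = No Access */

        __asm__ volatile("mcr p15, 0, %0, c3, c0, 0" :: "r"(dacr));  /* DACR */
        __asm__ volatile("isb" ::: "memory");
    }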

Proposed MMU translation tables for RTEMS

As far as possible we keep to a single first-level translation table, where each entry describes a 1 MB "section" of address space; 4096 entries cover the whole 4 GB address space. All entries specify access domain zero, which will be set up for Client access, meaning that the translation table entries specify access permissions. All entries will be global entries, meaning that they apply to all code regardless of threading; the table is never modified by context switches. The address mapping is the identity, i.e., virtual == real.
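
A sketch of how such a table might be built in C (section-entry bit positions per the ARMv7 short-descriptor format; attrs_for_region() is a hypothetical helper returning the C/B/TEX/AP/XN/S bits appropriate to each 1 MB section):

    #include <stdint.h>

    #define L1_ENTRIES   4096u                  /* one entry per 1 MB section */
    #define L1_SECTION   0x2u                   /* descriptor bits [1:0] = 0b10 */
    #define L1_DOMAIN(d) ((uint32_t)(d) << 5)   /* domain field, bits [8:5]   */

    extern uint32_t attrs_for_region(uint32_t mb);   /* hypothetical helper */

    static void build_identity_map(uint32_t table[L1_ENTRIES])
    {
        for (uint32_t mb = 0; mb < L1_ENTRIES; ++mb) {
            /* Identity mapping: the section base equals the virtual address.
             * The nG bit is left clear, so every entry is global. */
            table[mb] = (mb << 20) | L1_SECTION | L1_DOMAIN(0) | attrs_for_region(mb);
        }
    }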

...

The first-level translation table is part of the system image containing RTEMS. It has its own object code section named ".mmu_tbl" so that the linker script used to create the image can put it in some suitable place independently of the placement of other data; this includes giving it the proper alignment. If we keep to a small number of second-level tables, say ten or so, we can reserve space for them statically at the end of the .mmu_tbl section. Each second-level table occupies 1 KB.
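
A sketch of how the tables might be declared so they land in that section with the alignment the hardware requires (16 KB for the first-level table, since TTBR0 holds only its upper address bits, and 1 KB for each second-level table); the identifiers are illustrative, not taken from the RTEMS source:

    #include <stdint.h>

    /* First-level table: 4096 section entries of 4 bytes each = 16 KB. */
    uint32_t mmu_l1_table[4096]
        __attribute__((section(".mmu_tbl"), aligned(16 * 1024)));

    /* Static pool of second-level tables, 256 entries (1 KB) each,
     * reserved at the end of the same section. */
    uint32_t mmu_l2_tables[10][256]
        __attribute__((section(".mmu_tbl"), aligned(1024)));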

...

This default map is established at system startup. OCM and DDR have the attributes Normal, Read/Write, Executable and Cacheable. Memory-mapped I/O regions are Device, Read/Write, Non-executable and Outer Shareable. Regions reserved by Xilinx have No Access specified in the translation table entries, so that attempted use will generate Permission exceptions.

Address range            Region size   Mapping type   Use
0x00000000-0x3fffffff    1 GB          RAM            DDR + low OCM
0x40000000-0x7fffffff    1 GB          I/O            PL AXI slave port 0
0x80000000-0xbfffffff    1 GB          I/O            PL AXI slave port 1
0xc0000000-0xdfffffff    512 MB        No access      Undocumented
0xe0000000-0xefffffff    256 MB        I/O            IOP devices
0xf0000000-0xf7ffffff    128 MB        No access      Reserved by Xilinx
0xf8000000-0xf9ffffff    32 MB         I/O            Registers on AMBA APB bus
0xfa000000-0xfbffffff    32 MB         No access      Reserved by Xilinx
0xfc000000-0xfdffffff    32 MB         I/O            Quad-SPI
0xfe000000-0xffffffff    32 MB         No access      Unsupported Quad-SPI + high OCM
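
The table above translates naturally into a region list that startup code can walk when filling in the first-level table; a sketch (the ATTR_* values are placeholders for the real section attribute bits):

    #include <stdint.h>

    /* Placeholder attribute encodings; real code would use the actual
     * C/B/TEX/AP/XN/S section bits for RAM, Device and No-access mappings. */
    #define ATTR_RAM   1u
    #define ATTR_IO    2u
    #define ATTR_NONE  0u

    struct map_region {
        uint32_t base;      /* first address of the region */
        uint32_t size_mb;   /* size in 1 MB sections       */
        uint32_t attrs;     /* section attributes          */
    };

    static const struct map_region default_map[] = {
        { 0x00000000u, 1024, ATTR_RAM  },   /* DDR + low OCM               */
        { 0x40000000u, 1024, ATTR_IO   },   /* PL AXI slave port 0         */
        { 0x80000000u, 1024, ATTR_IO   },   /* PL AXI slave port 1         */
        { 0xc0000000u,  512, ATTR_NONE },   /* undocumented                */
        { 0xe0000000u,  256, ATTR_IO   },   /* IOP devices                 */
        { 0xf0000000u,  128, ATTR_NONE },   /* reserved by Xilinx          */
        { 0xf8000000u,   32, ATTR_IO   },   /* registers on AMBA APB bus   */
        { 0xfa000000u,   32, ATTR_NONE },   /* reserved by Xilinx          */
        { 0xfc000000u,   32, ATTR_IO   },   /* Quad-SPI                    */
        { 0xfe000000u,   32, ATTR_NONE },   /* unsupported QSPI + high OCM */
    };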

Caches

The Zynq-7000 system has two levels of cache between a memory-access instruction and main memory. Each ARM core in the APU has semi-private L1 instruction and data caches; "semi-private" because the SCU keeps the L1 caches of the cores in sync. When the programmer needs to explicitly manage the L1 caches he uses control registers defined for coprocessor 15. Outside of the ARM cores the APU has a shared, unified instruction+data cache, also managed by the SCU, which keeps it coherent with the L1 caches in the cores. Xilinx refers to this cache as an L2 cache. This nomenclature clashes with the definition of a level of cache in the ARMv7 architecture; an architectural L2 cache would be managed via coprocessor 15 and would be considered part of the ARM core. The APU L2 cache by contrast does not belong to either ARM core and is managed via memory-mapped registers defined for the L2 controller. In the ARM architecture manual such caches are referred to as "system caches". Anyway, from here on in I'll just use the term "L2 cache" as Xilinx does.

The L1 caches don't support entry locking but the L2 cache does. The L2 cache can operate in a so-called exclusive mode which prevents a cache line from appearing simultaneously in any L1 data cache and in the L2 cache.

...

Barring any explicit cache manipulation by software the Snoop Control Unit ensures that the L1 data caches in both cores remain coherent and does the same for both cores' L1 instruction caches. The SCU does not, however, keep the instruction caches coherent with the data caches; it doesn't need to unless instructions have been modified. To handle that case the programmer has to use explicit cache maintenance operations in the ARM core that's performing the modifications, cleaning the affected data cache lines and invalidating the affected instruction cache lines. That brings the instruction cache into line with the data cache on the core making the changes; the SCU will do the rest provided the programmer has used the right variants of the cache operations.
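
A sketch of that sequence for a range of freshly written code (CP15 encodings per the ARMv7 manual; 32 bytes is the Cortex-A9 L1 line size):

    #include <stdint.h>

    static void sync_icache_range(uintptr_t start, uintptr_t end)
    {
        const uintptr_t line = 32;   /* Cortex-A9 L1 cache line size */
        uintptr_t mva;

        /* Clean the modified data out of the D-cache to the point of
         * unification... */
        for (mva = start & ~(line - 1); mva < end; mva += line)
            __asm__ volatile("mcr p15, 0, %0, c7, c11, 1" :: "r"(mva)); /* DCCMVAU */
        __asm__ volatile("dsb" ::: "memory");

        /* ...then invalidate the stale instruction cache lines and the
         * branch predictor, and resynchronize the pipeline. */
        for (mva = start & ~(line - 1); mva < end; mva += line)
            __asm__ volatile("mcr p15, 0, %0, c7, c5, 1" :: "r"(mva));  /* ICIMVAU */
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(0));        /* BPIALL  */
        __asm__ volatile("dsb" ::: "memory");
        __asm__ volatile("isb" ::: "memory");
    }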

...

There are fewer valid cache operations for ARMv7 VMSA than for v6, some of the remaining valid ones are deprecated, and v7 supports the maintenance of more than one level of cache. One must be mindful of this when reading books not covering v7 or when using generic ARM support code present in RTEMS or other OSes. In addition the meaning of a given cache operation is profoundly affected by the presence of the Multiprocessing and Security extensions (and by the Large Physical Address and Virtualization extensions, which we don't have to worry about).

...

The Cache Level ID Register will also report that the levels of coherence and unification are all equal to one. These terms are used to define the effects of L1 cache operations that take virtual addresses as arguments:

  • Point/Level of Coherency: According to the ARM Architecture, a cache operation going up to the point of coherency ensures that all "agents" accessing memory have a coherent picture of it. For the Zynq an agent would therefore appear to be anything accessing memory via the SCU, since the SCU is what maintains data coherency. The UPCPUs, MMUs, branch predictors and L1 caches in the ARM cores all access memory this way, as does programmable logic which uses the ACP. However, PL that uses the direct access route will not in general get a view of memory coherent with these other users of memory. The level of coherency is simply the number of levels of cache you have to manipulate in order to make sure that your changes to memory reach the point of coherency. You operate first on the level of cache closest to the UPCPU (L1), then the next further out (if any), etc.
  • Point/Level of Unification, Uniprocessor: This is the point at which changes to memory become visible to the UPCPU, branch predictors, MMU and L1 caches belonging to the core making the changes, and the number of levels of cache you need to manipulate to get to that point. Often ARM docs drop the "Uniprocessor" when discussing this.
  • Point/Level of Unification, Inner Shared: The point/level at which changes become visible to the UPCPUs, branch predictors, L1 caches and MMUs for all the cores in the Inner Sharable group to which the core making the change belongs. For the Zynq this means both cores.
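
In CP15 terms the distinction shows up as two different clean operations; a sketch:

    #include <stdint.h>

    /* DCCMVAU: clean a D-cache line by virtual address to the Point of
     * Unification - enough for this core's instruction fetches, branch
     * predictor and table walks to see the new data. */
    static inline void clean_dcache_line_pou(uintptr_t mva)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c11, 1" :: "r"(mva));
    }

    /* DCCMVAC: clean a D-cache line by virtual address to the Point of
     * Coherency - what you want before handing a buffer to an agent
     * outside the SCU's view, e.g. PL using the direct path to DDR.
     * (The shared L2, being a system cache outside CP15, may still need
     * its own maintenance through the L2 controller's registers.) */
    static inline void clean_dcache_line_poc(uintptr_t mva)
    {
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(mva));
    }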

...