Basic architecture

The architecture implemented is ARMv7-A. The "v7" is the instruction set revision whereas the "A" stands for "Application profile". Such processors implement virtual memory; in ARM parlance they define a Virtual Memory System Architecture (VMSA). This means that they have a Memory Management Unit which provides both address remapping and memory attributes such as cached, non-executable, etc.

The other profiles are R and M. Profile R processors have a Memory Protection Unit which, like an MMU, provides memory attributes but which does no address mapping. ARM calls this a Protected Memory System Architecture (PMSA). Profile M, for Microcontroller, processors have neither VMSA nor PMSA.

Instruction set

v7 actually comprises several different instruction sets:

The standard ARM instruction set. Each instruction is 32 bits long and aligned on a 32-bit boundary. The full set of general registers is available. Shift operations may be combined with arithmetic and logical operations. This is the instruction set we'll be using for our project. Oddly, an integer divide instruction is optional and the Zynq CPUs don't have it.
Thumb-2. Designed for greater code density. Contains a mix of 16-bit and 32-bit instructions. Many instructions can access only general registers 0-7.
Jazelle. Similar to Java byte code.
ThumbEE. A sort of hybrid of Thumb and Jazelle, actually a CPU operation mode. Intended for environments where code modification is frequent, such as ones having a JIT compiler.
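Since the integer divide instructions are optional, startup or diagnostic code may want to check for them rather than assume. The following is a minimal sketch, assuming the ARMv7-A layout of the ID_ISAR0 identification register, in which bits [27:24] (the Divide_instrs field) report divide support. The function and enum names are made up for illustration; reading the register itself (MRC p15, 0, Rt, c0, c2, 0 in privileged code) is left out so that the decoding can be shown on its own.

```c
#include <stdint.h>

/* Values of the Divide_instrs field, bits [27:24] of ID_ISAR0.
 * The names below are illustrative, not taken from any header. */
enum divide_support {
    DIV_NONE          = 0,  /* no SDIV/UDIV in any instruction set */
    DIV_THUMB_ONLY    = 1,  /* SDIV/UDIV in the Thumb instruction set only */
    DIV_ARM_AND_THUMB = 2   /* SDIV/UDIV in both ARM and Thumb */
};

/* Extract the 4-bit Divide_instrs field from a raw ID_ISAR0 value. */
static inline unsigned divide_instrs_field(uint32_t id_isar0)
{
    return (id_isar0 >> 24) & 0xFu;
}

/* Nonzero if SDIV/UDIV may be used from the ARM instruction set. */
static inline int has_arm_divide(uint32_t id_isar0)
{
    return divide_instrs_field(id_isar0) >= DIV_ARM_AND_THUMB;
}
```

On a CPU where has_arm_divide() returns zero, division has to be done in software, typically by a compiler runtime helper.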

Coprocessors

The ARM instruction set has a standard coprocessor interface which allows up to 16 distinct coprocessors.

Coprocessor 15, CP15, is a pseudo-coprocessor which performs cache and MMU control as well as other system control functions.
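CP15 registers are read with MRC and written with MCR, and each register is named by the tuple (opc1, CRn, CRm, opc2). As an illustration of how that tuple ends up in the instruction word, here is a small sketch that encodes an unconditional MRC or MCR in the ARM instruction set. The function name is hypothetical; in real code you would normally let the assembler do this via inline assembly.

```c
#include <stdint.h>

/* Encode an always-executed (condition AL) MRC or MCR instruction.
 *   is_read != 0 : MRC, coprocessor register -> ARM register Rt
 *   is_read == 0 : MCR, ARM register Rt -> coprocessor register
 * Field layout: cond[31:28] 1110[27:24] opc1[23:21] L[20] CRn[19:16]
 *               Rt[15:12] coproc[11:8] opc2[7:5] 1[4] CRm[3:0]      */
static uint32_t encode_cp_reg_xfer(int is_read, unsigned coproc, unsigned opc1,
                                   unsigned rt, unsigned crn, unsigned crm,
                                   unsigned opc2)
{
    return 0xEE000010u                      /* cond = AL, fixed bits */
         | ((opc1   & 0x7u) << 21)
         | ((is_read ? 1u : 0u) << 20)
         | ((crn    & 0xFu) << 16)
         | ((rt     & 0xFu) << 12)
         | ((coproc & 0xFu) << 8)
         | ((opc2   & 0x7u) << 5)
         |  (crm    & 0xFu);
}
```

For example, encode_cp_reg_xfer(1, 15, 0, 0, 1, 0, 0) gives the word for "MRC p15, 0, r0, c1, c0, 0", which reads the System Control Register into r0.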

CPs 10 and 11 are reserved for floating point and vector hardware, which in this system are both part of the NEON extension.

Options and extensions

There are a number of options and extensions available for a Cortex-A CPU. Some features can change the way you have to program the processor even if you don't want to use them. The following table lists them and indicates whether they are available on the Zynq.

Name	On Zynq?	Description
ARM instruction set
Thumb-2
Jazelle
ThumbEE
Integer divide instructions
VMSA		Has at least one MMU
PMSA		Has at least one MPU
Fast multiply		Improved integer multiplication
VFP3-32		Scalar floating point rev. 3 with 32 double-sized registers
VFP3-16		Like VFP3-32 but with half the number of registers
VFP4-x		Scalar floating point rev. 4
NEON (Advanced SIMD)		Vector operations with integers and single floats
Large Physical Address (LPA)		40-bit physical addresses
Generic Timer		System counter (clock) plus count-down and count-up timers
Multiprocessing (MPCore)		Multiple CPU cores sharing memory
Security (TrustZone)		Adds a distinction between secure and non-secure code
Virtualization (Hypervisor)		Allows creation of virtual machines to run non-secure code

Two extensions that are obsolete as of ARMv7 are Fast Context Switch (FCSE) and old-style Floating Point (FP). You can ignore what the ARM architecture manual has to say about these.

MMU

There can be up to four independent MMUs per CPU (though they may be implemented with a single block of silicon). Without the security or virtualization extensions there is just one MMU which is used for both privileged and non-privileged accesses. Adding the security extension adds another for secure code, again for both privilege levels. Adding the virtualization extension adds two more MMUs, one for the hypervisor and one for a second stage of translation for code running in a virtual machine. The first stage of translation in a virtual machine maps VM-virtual to VM-real addresses while the second stage maps VM-real to actual hardware addresses. The hypervisor's MMU maps only once, from hyper-virtual to actual hardware addresses.

The Zynq CPUs have just the security extension and so each has two MMUs. All the MMUs present come up disabled after a reset, with TLBs disabled and garbage in the TLB entries. If all the relevant MMUs for a particular CPU state are disabled the system is still operable. Data accesses assume a memory type of Strongly-ordered, so there is no prefetching or reordering; data caches must be disabled or contain only invalid entries, since a cache hit in this state results in unpredictable behavior. Instruction fetches assume a memory type of Normal and are uncached but still speculative, so that addresses up to 8 KB above the start of the current instruction may be accessed.

Unlike for PowerPC there is no way to examine MMU TLB entries nor to set them directly; you have to have a page table of some sort set up when you enable the MMU. In the RTEMS GIT repository the lpc32xx BSP implements a simple and practical approach. It maintains a single-level page table with 4096 32-bit entries, where each entry covers a 1 MB section of address space. The whole table therefore covers the entire 4 GB address space of the CPU. The upper 12 bits of each virtual address are the same as the upper 12 bits of the physical address, so the address mapping is the identity. Finding the index of the table entry for a particular address just requires a logical right shift of the VA by 20. The property granularity of 1 MB imposed by this organization seems a small price to pay for avoiding the complications of second-level page tables and variable page sizes.
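The single-level scheme described above can be sketched in a few lines. This is not the lpc32xx BSP code; it is a minimal illustration, assuming the ARMv7-A short-descriptor format in which a section entry has 0b10 in bits [1:0] and the section base address in bits [31:20]. All names are invented for the example, and the attribute bits (access permissions, cacheability and so on) are passed through as an opaque value.

```c
#include <stdint.h>

#define SECTION_SHIFT  20u     /* each entry covers 1 MB = 2^20 bytes */
#define SECTION_COUNT  4096u   /* 4096 x 1 MB = 4 GB */
#define SECTION_TYPE   0x2u    /* bits [1:0] = 0b10 marks a section entry */
#define ATTR_MASK      0x0003FFFCu  /* attribute bits [17:2]; bit 18 stays 0 */

/* Index of the table entry for a virtual address: logical shift by 20. */
static inline uint32_t section_index(uint32_t va)
{
    return va >> SECTION_SHIFT;
}

/* Build an identity-mapped entry: the physical base equals the virtual base. */
static inline uint32_t identity_section_entry(uint32_t index, uint32_t attrs)
{
    return (index << SECTION_SHIFT) | (attrs & ATTR_MASK) | SECTION_TYPE;
}

/* Fill the whole table with identity mappings sharing one attribute set.
 * On real hardware the table itself must be 16 KB aligned. */
static void fill_identity_table(uint32_t table[SECTION_COUNT], uint32_t attrs)
{
    for (uint32_t i = 0; i < SECTION_COUNT; ++i)
        table[i] = identity_section_entry(i, attrs);
}

/* Change the attributes of the 1 MB section containing va. */
static void set_section_attrs(uint32_t table[SECTION_COUNT], uint32_t va,
                              uint32_t attrs)
{
    uint32_t i = section_index(va);
    table[i] = identity_section_entry(i, attrs);
}
```

After changing an entry in a live table, the corresponding TLB entries would still have to be invalidated; that step is omitted here.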

Automatic replacement of TLB entries normally uses a "pseudo round robin" algorithm, not the "least recently used" algorithm implemented in the PowerPC. The only way to keep heavily used entries in the TLB indefinitely is to explicitly lock them in, which you can do with up to four entries. These locked entries occupy a special part of the TLB which is separate from the normal main TLB, so you don't lose entry slots if you use locking.

In a multi-core system like the Zynq all the CPUs can share the same translation table if all of the following conditions are met:

All CPUs are in SMP mode.
All CPUs are in TLB maintenance broadcast mode.
All the MMUs are given the same real base address for the translation table.
The translation table is in memory marked Normal, Sharable with write-back caching.

Under these conditions any CPU can change an address translation as if it were alone and have the changes broadcast to the other CPUs.

When the MMU is fetching translation table entries it will ignore the L1 cache unless you set some special bits in the Translation Table Base Register telling it that the table is write-back cached. Apparently write-through caching isn't good enough but ignoring the L1 cache in that case is correct, if slow.
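To make the "special bits" concrete, here is a hedged sketch of composing a TTBR0 value for a shared, write-back cached translation table. It assumes the short-descriptor TTBR0 layout used with the Multiprocessing Extensions, where the inner cacheability field IRGN is split across bits 6 and 0, the outer cacheability field RGN is bits [4:3] and S (shareable) is bit 1. The macro and function names are illustrative, and the choice of write-allocate for both inner and outer caches is an assumption, not something stated above.

```c
#include <stdint.h>

/* TTBR0 table-walk attribute bits (short-descriptor format, MP extensions). */
#define TTBR_IRGN_WBWA  (1u << 6)  /* IRGN = 0b01: inner write-back, write-allocate
                                      (IRGN[0] is bit 6, IRGN[1] is bit 0) */
#define TTBR_RGN_WBWA   (1u << 3)  /* RGN  = 0b01: outer write-back, write-allocate */
#define TTBR_S          (1u << 1)  /* table walks are to shareable memory */

#define TTBR_BASE_MASK  0xFFFFC000u  /* 16 KB aligned base when TTBCR.N = 0 */

/* Combine the physical base address of the table with the walk attributes.
 * Low bits of a misaligned address are silently dropped in this sketch;
 * real code should treat misalignment as an error. */
static uint32_t make_ttbr0(uint32_t table_pa)
{
    return (table_pa & TTBR_BASE_MASK)
         | TTBR_IRGN_WBWA | TTBR_RGN_WBWA | TTBR_S;
}
```

The resulting value would be written to TTBR0 with "MCR p15, 0, Rt, c2, c0, 0" on every CPU sharing the table. With IRGN left at zero the table walks bypass the L1 cache, which is the slow but correct case described above.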

Caches

SMP support

System state after a reset

MMU: disabled with TLB disabled. Contents of TLB entries are random so one must at least invalidate all TLB entries before enabling the MMU. With the MMU disabled all instruction fetches are assumed to be to Normal memory while data accesses are assumed to be to Strongly-ordered memory.
