Interface Between Boot Code and Application Code
------------------------------------------------

christopher o'grady, oct. 26, 2006 (updated feb. 13, 2007).

o There are two independent pieces of code involved in the virtex4
  booting process: the "boot code" will be loaded into readonly (from
  the powerPC perspective) BRAM along with the FPGA fabric from xilinx
  Platform Flash.  This code will in turn load the samsung flash
  "application" image selected by a hardware dipswitch (in the case of a
  hard reset) or the image selected by the transfer control register
  defined by MEH (in the case of a soft reset).

o The format of the application image in the samsung flash is not
  currently defined, but the most straightforward option would be a
  direct copy of the memory after the elf-executable has been loaded.

o All flash related mtdcr/mfdcr instructions will be located
  in the readonly BRAM.  This is a partial, but certainly not
  bulletproof, protection mechanism for the flash.

o The publicly exported flash control functions will be

  // upper 16 bits of readFile status are block number (in case of block err).
  // lower 16 bits of readFile status are standard status bits.
  unsigned (*readFile)     (unsigned fileNum, void* memAddr);
  unsigned (*readPage)     (unsigned blockNum,unsigned pageNum,void* memAddr);
  unsigned (*readPageNoECC)(unsigned blockNum,unsigned pageNum,void* memAddr);
  unsigned (*writePage)    (unsigned blockNum,unsigned pageNum,void* memAddr);
  unsigned (*imageSelect)  (unsigned imageNum);
  unsigned (*eraseBlock)   (unsigned blockNum);

  where "page" is the smallest possible data unit that
  can be transferred to and from samsung flash, and "block"
  is the larger size that the chip uses to erase memory.

o Each function will return 0 if the operation is successful; otherwise
  it will return a non-zero error code to be specified by the boot code.

o Application code will call the flash control functions using a
  well-known pointer to an agreed-upon structure of jump instructions
  (i.e. PowerPC unconditional "B" branch instructions).  These jump
  instructions will transfer control to the "real" BRAM-based routines.
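
As a minimal sketch of how application code might locate an entry in
such a jump table, assume that each entry is a single 4-byte branch
instruction, that the table starts at the .xfrvec address quoted later
in this note (0xffffffc0), and that the entries follow the order of the
declarations above.  The entry order and the XFR_* names are assumptions
made for illustration only; the real order is whatever the jump table in
the boot code says it is.

```c
#include <stdint.h>

/* Base of the transfer vector; value taken from the .xfrvec note. */
#define XFRVEC_BASE 0xffffffc0u

/* HYPOTHETICAL entry order: mirrors the declaration order in this note.
 * The authoritative order is the jump table in the boot code itself. */
enum xfr_index {
    XFR_READ_FILE = 0,
    XFR_READ_PAGE,
    XFR_READ_PAGE_NO_ECC,
    XFR_WRITE_PAGE,
    XFR_IMAGE_SELECT,
    XFR_ERASE_BLOCK,
    XFR_COUNT
};

/* Each entry is assumed to be one 32-bit PowerPC branch instruction,
 * so entry i lives at base + 4*i. */
static uint32_t xfr_entry_addr(enum xfr_index i)
{
    return XFRVEC_BASE + 4u * (uint32_t)i;
}

/* On the target, a call might look like this (not executed on a host):
 *
 *   typedef unsigned (*eraseBlock_fn)(unsigned blockNum);
 *   eraseBlock_fn eraseBlock =
 *       (eraseBlock_fn)(uintptr_t)xfr_entry_addr(XFR_ERASE_BLOCK);
 *   unsigned status = eraseBlock(42);
 */
```

Under these assumptions the six entries occupy 24 bytes, which leaves
the table clear of the reset branch at 0xfffffffc.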

o The boot code will poll to determine when flash operations are
  completed.  Application code will have the same behaviour, since
  it calls the boot code for flash operations.  This isn't as efficient
  as an interrupt-driven interface, but it simplifies 

o Transfer of control from boot code to application code will be done
  with a branch to the transfer address defined by MEH in chapter 3 of
  the Network Packet Adapter Description document.  At the time of the
  branch, the processor should be in a state identical to the one after
  soft-reset described in the xilinx document "PowerPC Processor
  Reference Guide".  We can't do a soft-reset directly because the
  powerPC first-instruction address is in write protected BRAM, so we
  can't do a write to that address.

o Application code is responsible for managing re-entrancy issues
  associated with the flash filesystem.

o Application code is responsible for ensuring that the image-number
  selected by the dipswitch is always usable, so that control of the
  system is not lost.

**********************************************************************

Eric's BramFunction Return Values

I've attached some sample code for the flash package.  Ultimately I
may clean things up a bit by switching to symbolic names for some of
the bit fields.  At this point, I mostly want to verify with you that
I've got the status reporting approximately commensurate with what
you're expecting.  I've taken the following comment from one of your
old emails and extended it to be valid for all functions:

// upper 16 bits of readFile status are block number (in case of block err).
// lower 16 bits of readFile status are standard status bits.

Of course, imageSelect doesn't really return status, since it results
in an immediate reboot.  Anyway, let me make sure that we're clear on
the following points:

* The upper 16 bits always return the block number of the last attempted
flash operation.  Since the block number is only an 11-bit entity, the
upper 5 bits should always be zeroes.  Note that the block number is
always valid, regardless of whether the "standard" status bits in the
lower 16 bits are zero or non-zero.

*At the moment, there are 6 defined bits in the "standard" status bits.
Not all bits are meaningful for any given function.  (n.b.: by
"meaningful" I really mean "might possibly be asserted.")

*Bit 0 (little-endian) is the correctable R-S decoder error flag -
only meaningful in readPage, readPageNoECC, or readFile.

*Bit 1 is the uncorrectable R-S decoder error flag - again, only
meaningful for those 3 read functions.

*Bit 2 is the chip's hardware error flag - meaningful only for
writePage or eraseBlock.

*Bit 3 is the unexpected null file link flag - meaningful only for
readFile.  This flag is set if the code encounters a null 16-bit value
for a file link.  This can happen either for the file link within a file
handle in page 0, or in the file link in the last used page within a
used block.  One uses the value of the block number in the upper 16 bits
to distinguish between these two cases.  The first case implies that the
file number points to an unused file handle in the file handle vector;
the second implies that one effectively sees an EOF (in the sense that
one can't locate any additional blocks in the file) before exhausting
the file length specified in the file handle.  Note that the readFile
code doesn't really care whether or not all 32 pages in a block (other
than the last block in the file) are used.  As long as there is a valid
file link in the last used page in a block, things continue with the
next block.

*Bit 4 is the missing null file link flag - meaningful only for
readFile.  This is basically the converse condition.  The code expects
the file link in the last used page in the last used block to be null,
and sets this flag if one exhausts the file length specified in the file
handle but there is a non-null file link in the last page.

*Bit 5 is the excess data flag - meaningful only for readFile.  This
flag is set if the "page range" field for the last block indicates that
there are more used pages in that last block than are required to
exhaust the file length specified in the file handle.
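
The status layout described above could be decoded along the following
lines.  This is only a sketch: the macro and function names are
illustrative (the sample code does not yet use symbolic names), and
"ok" is taken here to mean that the lower 16 bits are all zero, since
the upper 16 bits always carry a block number.

```c
#include <stdint.h>

/* Bit positions follow the list above; the names are illustrative. */
#define FL_ST_RS_CORRECTABLE    (1u << 0)  /* read functions only   */
#define FL_ST_RS_UNCORRECTABLE  (1u << 1)  /* read functions only   */
#define FL_ST_CHIP_ERROR        (1u << 2)  /* writePage, eraseBlock */
#define FL_ST_UNEXPECTED_NULL   (1u << 3)  /* readFile only         */
#define FL_ST_MISSING_NULL      (1u << 4)  /* readFile only         */
#define FL_ST_EXCESS_DATA       (1u << 5)  /* readFile only         */

/* Block number of the last attempted flash operation (upper 16 bits). */
static unsigned fl_status_block(uint32_t status)
{
    return (unsigned)(status >> 16);
}

/* The "standard" status bits (lower 16 bits). */
static unsigned fl_status_bits(uint32_t status)
{
    return (unsigned)(status & 0xffffu);
}

/* Non-zero when no "standard" status bit is asserted. */
static int fl_status_ok(uint32_t status)
{
    return fl_status_bits(status) == 0;
}
```

A caller that wants to distinguish the two "unexpected null file link"
cases would test FL_ST_UNEXPECTED_NULL and then look at
fl_status_block().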

*Note that readPage and readPageNoECC return the specified data volume
(520 bytes for readPage and 528 bytes for readPageNoECC) to the user
buffer regardless of the presence or absence of R-S decoder errors.  The
same is true for the 512 bytes for the internal version of readPage
(FLRDPG) employed by readFile.  However, in that latter case readFile
aborts the read attempt as soon as it encounters a page with any errors.
 (If you don't do that, then it becomes difficult, although perhaps not
impossible, to come up with a clean definition of whether the block
number reported in the upper half of the status refers to the block with
the hardware error or the block with some subsequent file link or excess
data error.)  This tends to support a block replacement philosophy which
leans towards early replacement as soon as you start encountering any
correctable errors.  However, in order to recover the data from the
block with correctable errors, you have to employ readPage (presumably
followed by writePage to a page in the replacement block) for each of
the used pages in the failing block in turn.
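
A rough sketch of that page-by-page recovery is given below.  It
assumes that the buffer filled by readPage can be handed unchanged to
writePage, that a correctable error still leaves usable data in the
buffer, and that the caller knows which pages are in use; none of this
is confirmed by the sample code.  salvage_block and the fake_* routines
are hypothetical names, and the fake_* routines exist only so that the
sketch can be exercised on a host.

```c
#include <stdint.h>
#include <string.h>

#define FL_PAGE_BYTES          520u       /* data volume readPage returns */
#define FL_ST_RS_CORRECTABLE   (1u << 0)
#define FL_ST_RS_UNCORRECTABLE (1u << 1)
#define FL_ST_CHIP_ERROR       (1u << 2)

typedef unsigned (*page_fn)(unsigned blockNum, unsigned pageNum,
                            void *memAddr);

/* Copies pages firstPage..lastPage of block `from` into block `to`.
 * Returns 0 on success, otherwise the first status word that reports
 * an uncorrectable read error or a chip error on write. */
static unsigned salvage_block(page_fn readPage, page_fn writePage,
                              unsigned from, unsigned to,
                              unsigned firstPage, unsigned lastPage)
{
    uint32_t buf[FL_PAGE_BYTES / 4];
    unsigned page;

    for (page = firstPage; page <= lastPage; page++) {
        unsigned st = readPage(from, page, buf);
        if (st & FL_ST_RS_UNCORRECTABLE)
            return st;                    /* data in buf is not trustworthy */
        st = writePage(to, page, buf);
        if (st & FL_ST_CHIP_ERROR)
            return st;
    }
    return 0;
}

/* Host-side stand-ins for the BRAM routines (4 blocks, 32 pages each). */
static unsigned char fake_flash[4][32][FL_PAGE_BYTES];

static unsigned fake_readPage(unsigned b, unsigned p, void *mem)
{
    memcpy(mem, fake_flash[b][p], FL_PAGE_BYTES);
    return (b << 16) | FL_ST_RS_CORRECTABLE;  /* every read needed correction */
}

static unsigned fake_writePage(unsigned b, unsigned p, void *mem)
{
    memcpy(fake_flash[b][p], mem, FL_PAGE_BYTES);
    return b << 16;                           /* block number, no error bits */
}
```

Note that the sketch deliberately ignores the correctable-error bit
while copying, since a correctable error is the reason for replacing
the block in the first place.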

*Of course, this makes you wonder what you do when you get correctable
errors in block 0.  I suspect that the answer is to replace the flash
chip.  However, note that the chip is only "guaranteed" to not require
error correction for block 0 for the first 1k program/erase cycles. 
That in turn suggests that you really want to use an in-memory cache for
at least the contents of the bit maps in page 31 of block 0, and
certainly want to avoid rewriting block zero each time you allocate
another page to add onto a file that you're writing.  In other words,
you only want to rewrite block 0 once for each file that you add to the
flash (or remove from the flash).

*For the moment, I have located the .xfrvec section at address
0xffffffc0.  The order of the entries is given by the jump table that is
included in this code.

*The code does NOT make any stack references.  It manages to get by
with using only registers R0 and R3-R6 for the page-level routines, and
R0 and R3-R11 for readFile.  Those are all considered to be scratch
registers in a function call in the PowerPC run-time C environment -
i.e. it is up to the function caller to preserve their contents (if they
are meaningful to the caller), and not up to the function itself.

**********************************************************************

PMC Bootstrap Process

1. DSOCM is at addresses 0 through 0xffff.
2. RLDRAM is at addresses 0x08000000 through 0x0fffffff.
3. ISOCM is at addresses 0xfffe0000 through 0xffffffff.
4. The on-processor DCR address map is defined by the fact that
TIEDCRADDR is connected to 6 bits of zeroes.
5. The off-processor DCR address map is defined by the second half
(starting with PATRN00 at 0x380) of the attached dcr.h file.
6. It's not obvious to me that you're ever going to try to use
interrupts on this board, but if you do, I'll give you some
documentation on the interrupt controllers in this hardware.

ISOCM Software Overview:

1. The code that is in the ISOCM for the high priority processor (first
in the JTAG chain) is a highly modified copy of the work-in-progress
code for the PMC board in its intended application as a PCI to
multi-channel fiber link interface.  However, the invocation of the vast
majority of the code has been stripped out of the initialization thread,
with the result that most of the code that is actually in the ISOCM is
dead code.  The only active code is in the attached two modules (plus
the branch instruction at 0xfffffffc that goes to INBGIN, plus a
two-instruction endless loop with a one-instruction setup at LOABT that
sets the wait state [WE] bit in the machine state register and then
branches back to itself).

2. The code in the .inproc section of inproc.S (beginning with INBGIN)
ends up at the beginning of ISOCM, at address 0xfffe0000.

3. The single instruction in the .kntext section from this module (at
INERR) ends up at 0xfffe1214.

4. The LOABT code is at 0xffffffb4, with the actual two-instruction loop
at 0xffffffb8.

5. The transfer vector (.xfrvec section in flash.S) is at 0xffffffc0.

6. The .kntext section of flash.S (beginning with readPage) ends up at
0xfffe31b8.

7. The majority of the code in inproc.S is tested, up to the point where
it aborts with R-S decoder errors when reading page 0 of block 0 to
obtain the boot option vector (since the ECC bytes for this block are
not yet written).  (Note that I've left in the sanity check that the
file number for entry 0 in this vector is not -1, although it turned out
to be unnecessary because an uninitialized flash chip already results in
this R-S error.)  However, the sanity check, the call to readFile, and
the subsequent jump into the loaded boot file are untested.

8. The flash code itself in flash.S is untested, although based on code
that I've used for checking out the hardware.  In general, it's been
carefully read, and searched for my great nemesis - attempting to add an
immediate constant to R0 (if you don't know the PowerPC machine
instruction set, this instruction is a special "feature" that skips
adding the constant to R0 and just loads the constant itself into the
destination register - it's the way that they implemented the "load
immediate" instruction without needing a separate instruction for that
purpose).

9. If one were to strip out the dead code and leave only the necessary
boot loader code and flash support package in these two modules (and
move the abort loop into this code), the resulting code volume would
currently be less than one kilobyte.  As a result, my contention that
you could build the Petacache or LSST chip with only two BRAMs per
processor (32 bits of a 64 bit word in each) for a total of 4 kB of
ISOCM rather than the 64 BRAMs per processor (plus 32 per processor for
DSOCM) in this current hardware design seems to have a factor of four
safety margin.
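
For readers who don't know the PowerPC instruction set, the R0 behavior
mentioned in item 8 can be modeled in a few lines of C.  This is a host
model of the architectural rule for "addi rD,rA,SIMM", not code from
flash.S; addi_model is a hypothetical name.

```c
#include <stdint.h>

/* Model of PowerPC "addi rD,rA,SIMM": when the rA field is 0, the
 * instruction adds the immediate to the literal value 0 rather than
 * to the contents of R0, which is what makes "li rD,SIMM" possible. */
static uint32_t addi_model(const uint32_t gpr[32], unsigned rA,
                           int16_t simm)
{
    uint32_t base = (rA == 0) ? 0u : gpr[rA];

    /* The 16-bit immediate is sign-extended before the add. */
    return base + (uint32_t)(int32_t)simm;
}
```

With the same value in R0 and R3, adding 5 via R3 gives value+5, while
"adding" 5 via R0 gives just 5.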

Current Initialization Thread Flow:

1. Puts 0x00000001 into LEDs.
2. Code to initialize CCR0 is commented out (including setting
instruction prefetch enable bits).
3. Invalidates instruction and data caches.
4. Code to clear SGR is commented out.
5. Loads 0x40000000 into ICCR and DCCR to enable caching of all 128
megabytes of RLDRAM.
6. Puts 0x00000002 into LEDs.
7. Fills all of RLDRAM with zeroes.
8. Tests all of RLDRAM for zeroes; aborts if any non-zero word found.
9. Puts 0x00000003 into LEDs.
10. Fills all of RLDRAM with 0xffffffff.
11. Tests all of RLDRAM for 0xffffffff; aborts if any non-matching word
found.
12. Puts 0x00000004 into LEDs.
13. Fills all of RLDRAM with byte offset of current word from RLDRAM
base address.
14. Tests all of RLDRAM for byte offset of current word; aborts if any
non-matching word found.
15. Refills all of RLDRAM with zeroes.
16. Puts 0xdead0000 into LEDs.  (N.B.: It typically takes 6-8 seconds to
reach this point; 1-2 seconds to configure FPGA from platform flash and
the balance to make the 7 passes through all of RLDRAM.)
17. Initiates read of page 0 of block 0 of flash into flash interface
BRAM.
18. Waits for read command to be started by flash interface.
19. Puts 0xdead0001 into LEDs.
20. Waits for read command to be completed by flash interface.
21. Puts 0xdead0n02 into LEDs, where "n" is the low nibble of the
"standard" flash status.
22. Aborts if "n" is non-zero.  (N.B.: The current behavior with
uninitialized flash chip is that hardware reports both correctable and
non-correctable R-S decode errors, so this abort is taken with final
value of 0xdead0302 in LEDs.)
23. Checks word at byte offset 4 (file number for boot option 0) from
flash interface BRAM; aborts if -1.
24. Puts 0xdead0003 into LEDs.
25. Extracts appropriate element of boot option vector from flash
interface BRAM and saves in processor registers.
26. Invokes readFile function using specified file number and load
address from element of boot option vector.
27. Puts 0xdeadnn04 into LEDs, where "nn" is the low byte of the
"standard" status returned by readFile.
28. Aborts if "nn" is non-zero.
29. Clears LEDs.
30. Enters boot file at the transfer address specified in element of
boot option vector, using the flags specified in that element as the
only entry parameter (passed in R3).
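
The RLDRAM passes (steps 7 through 15) and the LED progress words
(steps 21 and 27) can be modeled in C roughly as follows.  The real
thread is assembly code in inproc.S; the function names here are
hypothetical and the sketch is only meant to pin down the patterns and
the bit positions described in the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill the region with one pattern, then verify it.  When use_offset
 * is non-zero the pattern is each word's byte offset from base.
 * Returns 0 on success, -1 on the first non-matching word. */
static int fill_and_check(volatile uint32_t *base, size_t words,
                          uint32_t value, int use_offset)
{
    size_t i;

    for (i = 0; i < words; i++)
        base[i] = use_offset ? (uint32_t)(i * 4u) : value;
    for (i = 0; i < words; i++)
        if (base[i] != (use_offset ? (uint32_t)(i * 4u) : value))
            return -1;
    return 0;
}

/* Seven passes in all: three fill/verify pairs plus a final zero fill. */
static int ram_test(volatile uint32_t *base, size_t words)
{
    size_t i;

    if (fill_and_check(base, words, 0x00000000u, 0)) return -1;
    if (fill_and_check(base, words, 0xffffffffu, 0)) return -1;
    if (fill_and_check(base, words, 0, 1)) return -1;
    for (i = 0; i < words; i++)
        base[i] = 0;
    return 0;
}

/* Step 21: 0xdead0n02, "n" = low nibble of the standard status. */
static uint32_t led_after_page_read(uint32_t status)
{
    return 0xdead0002u | ((status & 0xfu) << 8);
}

/* Step 27: 0xdeadnn04, "nn" = low byte of the standard status. */
static uint32_t led_after_read_file(uint32_t status)
{
    return 0xdead0004u | ((status & 0xffu) << 8);
}
```

With both R-S error bits set (status bits 0 and 1),
led_after_page_read gives 0xdead0302, the value quoted in step 22.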

The bottom line here is that the only processor registers (other than
the general purpose registers) that are disturbed from their
post-reset values prior to entering the boot file code are ICCR and
DCCR.  In addition, the caches are invalidated.  If it's important, I
think that one could actually restore DCCR to its post-reset value,
thus disabling the data cache.  Of course, one would possibly have to
repeat the flush before re-enabling that cache, and one would also
have to avoid all data accesses to RLDRAM until that re-enable was
performed.  However, I don't think that there's any way to disable the
instruction cache once you've started executing code from the RLDRAM,
as the interface from the I-side PLB to the RLDRAM controller doesn't
(currently) support single-word fetches.  BTW, you're venturing into
the great unknown when you do start executing code from the RLDRAM, as
the I-side PLB-to-RLDRAM interface hasn't been formally exercised yet
in this chip.  However, I did briefly exercise the corresponding
interface in my Virtex-II Pro chip around 18 months ago, and the VHDL
for this interface is a subset of the corresponding interface for the
D-side PLB (without write capability).
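
As a cross-check on the 0x40000000 value loaded into ICCR and DCCR in
step 5, the sketch below assumes the usual PPC405 real-mode convention:
each bit of these registers covers one 128 megabyte region, with the
most significant bit covering the lowest region.  The function name is
hypothetical.

```c
#include <stdint.h>

#define RLDRAM_BASE 0x08000000u   /* from the address map in this note */
#define RLDRAM_LAST 0x0fffffffu

/* ICCR/DCCR bit for the 128 MB region that contains addr.  The top
 * five address bits select the region; region 0 maps to the MSB. */
static uint32_t ccr_region_bit(uint32_t addr)
{
    return 0x80000000u >> (addr >> 27);
}
```

Both ends of the RLDRAM range fall in region 1, so a single bit
(0x40000000) covers all of it.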