Interface Between Boot Code and Application Code
------------------------------------------------

Christopher O'Grady, Oct. 26, 2006 (updated Feb. 13, 2007).

o There are two independent pieces of code involved in the Virtex-4 booting process. The "boot code" will be loaded from Xilinx Platform Flash, along with the FPGA fabric, into BRAM that is read-only from the PowerPC's perspective. This code will in turn load the Samsung flash "application" image selected either by a hardware dipswitch (after a hard reset) or by the transfer control register defined by MEH (after a soft reset).

o The format of the application image in the Samsung flash is not currently defined, but the most straightforward option would be a direct copy of memory after the ELF executable has been loaded.

o All flash-related mtdcr/mfdcr instructions will be located in the read-only BRAM. This is a partial, though certainly not bulletproof, protection mechanism for the flash.

o The publicly exported flash control functions will be:

    // upper 16 bits of readFile status are block number (in case of block err).
    // lower 16 bits of readFile status are standard status bits.
    unsigned (*readFile)     (unsigned fileNum, void* memAddr);
    unsigned (*readPage)     (unsigned blockNum, unsigned pageNum, void* memAddr);
    unsigned (*readPageNoECC)(unsigned blockNum, unsigned pageNum, void* memAddr);
    unsigned (*writePage)    (unsigned blockNum, unsigned pageNum, void* memAddr);
    unsigned (*imageSelect)  (unsigned imageNum);
    unsigned (*eraseBlock)   (unsigned blockNum);

  where a "page" is the smallest data unit that can be transferred to and from the Samsung flash, and a "block" is the larger unit that the chip uses when erasing memory.

o Return values will be 0 if an operation is successful; otherwise the function returns a non-zero error code, to be specified by the boot code.

o Application code will call the flash control functions through a well-known pointer to an agreed-upon structure of jump instructions (i.e. PowerPC unconditional "B" branch instructions). These jump instructions transfer control to the "real" BRAM-based routines.

o The boot code will poll to determine when flash operations have completed. Application code will have the same behaviour, since it calls the boot code for flash operations. This isn't as efficient as an interrupt-driven interface, but it is simpler.

o Transfer of control from boot code to application code will be done with a branch to the transfer address defined by MEH in chapter 3 of the Network Packet Adapter Description document. At the time of the branch, the processor should be in a state identical to the one after a soft reset, as described in the Xilinx "PowerPC Processor Reference Guide". We can't do a soft reset directly because the PowerPC first-instruction address is in write-protected BRAM, so we can't write to that address.

o Application code is responsible for managing re-entrancy issues associated with the flash filesystem.

o Application code is responsible for ensuring that the image number selected by the dipswitch is always usable, so that control of the system is not lost.

**********************************************************************

Eric's BramFunction Return Values

I've attached some sample code for the flash package. Ultimately I may clean things up a bit by switching to symbolic names for some of the bit fields. At this point, I mostly want to verify with you that I've got the status reporting approximately commensurate with what you're expecting.

I've taken the following comment from one of your old emails and extended it to be valid for all functions:

    // upper 16 bits of readFile status are block number (in case of block err).
    // lower 16 bits of readFile status are standard status bits.

Of course, imageSelect doesn't really return status, since it results in an immediate reboot.
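The status layout in that comment can be made concrete with a minimal host-side C sketch. It assumes a 32-bit status word with the block number in the upper 16 bits and the "standard" bits in the lower 16, and it uses the six bit positions from Eric's list that follows. The enum names and helper functions are hypothetical, since the flash package does not use symbolic names yet. Because the block number is always reported, even a successful call can return a non-zero word, so the sketch tests only the low half for success.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical symbolic names for the "standard" status bits
 * (bit 0 = least significant bit, as in Eric's list). */
enum {
    FL_ST_RS_CORRECTABLE   = 1u << 0, /* readPage, readPageNoECC, readFile */
    FL_ST_RS_UNCORRECTABLE = 1u << 1, /* same three read functions         */
    FL_ST_HW_ERROR         = 1u << 2, /* writePage, eraseBlock             */
    FL_ST_NULL_LINK        = 1u << 3, /* readFile: unexpected null link    */
    FL_ST_MISSING_NULL     = 1u << 4, /* readFile: missing null link       */
    FL_ST_EXCESS_DATA      = 1u << 5  /* readFile: excess data             */
};

/* Hypothetical helper: build a status word the way the boot code is
 * assumed to report it. */
static inline uint32_t make_status(unsigned block, unsigned bits) {
    assert(block < 2048u);   /* block number is an 11-bit entity  */
    assert(bits <= 0xffffu); /* standard bits live in the low half */
    return ((uint32_t)block << 16) | bits;
}

/* Block number of the last attempted flash operation (always valid). */
static inline unsigned status_block(uint32_t status) {
    return (unsigned)(status >> 16);
}

/* "Standard" status bits; zero means the operation succeeded. */
static inline unsigned status_bits(uint32_t status) {
    return (unsigned)(status & 0xffffu);
}

/* Success test: ignore the block number, look only at the low half. */
static inline int status_ok(uint32_t status) {
    return status_bits(status) == 0;
}

/* The upper 5 bits of the 16-bit block field should always be zero. */
static inline int status_block_field_sane(uint32_t status) {
    return (status >> 27) == 0;
}
```

A caller would normally check status_ok() first and consult status_block() only to locate the failing block, for example to tell apart the two null-file-link cases described below.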
Anyway, let me make sure that we're clear on the following points:

* The upper 16 bits always return the block number of the last attempted flash operation. Since the block number is only an 11-bit entity, the upper 5 bits should always be zeroes. Note that the block number is always valid, regardless of whether the "standard" status bits in the lower 16 bits are zero or non-zero.

* At the moment, there are 6 defined bits in the "standard" status bits. Not all bits are meaningful for any given function. (N.B.: by "meaningful" I really mean "might possibly be asserted.")

* Bit 0 (little-endian numbering, i.e. the least significant bit) is the correctable R-S decoder error flag - meaningful only for readPage, readPageNoECC, or readFile.

* Bit 1 is the uncorrectable R-S decoder error flag - again, meaningful only for those 3 read functions.

* Bit 2 is the chip's hardware error flag - meaningful only for writePage or eraseBlock.

* Bit 3 is the unexpected null file link flag - meaningful only for readFile. This flag is set if the code encounters a null 16-bit value for a file link. This can happen either for the file link within a file handle in page 0, or for the file link in the last used page within a used block. The block number in the upper 16 bits distinguishes between these two cases. The first case implies that the file number points to an unused file handle in the file handle vector; the second implies that one effectively sees an EOF (in the sense that no additional blocks in the file can be located) before exhausting the file length specified in the file handle. Note that the readFile code doesn't really care whether or not all 32 pages in a block (other than the last block in the file) are used. As long as there is a valid file link in the last used page in a block, things continue with the next block.

* Bit 4 is the missing null file link flag - meaningful only for readFile. This is basically the converse condition.
  The code expects the file link in the last used page of the last used block to be null, and sets this flag if the file length specified in the file handle is exhausted but there is a non-null file link in that last page.

* Bit 5 is the excess data flag - meaningful only for readFile. This flag is set if the "page range" field for the last block indicates that there are more used pages in that block than are required to exhaust the file length specified in the file handle.

* Note that readPage and readPageNoECC return the specified data volume (520 bytes for readPage and 528 bytes for readPageNoECC) to the user buffer regardless of the presence or absence of R-S decoder errors. The same is true for the 512 bytes returned by the internal version of readPage (FLRDPG) employed by readFile. However, in that latter case readFile aborts the read attempt as soon as it encounters a page with any errors. (If you don't do that, then it becomes difficult, although perhaps not impossible, to define cleanly whether the block number reported in the upper half of the status refers to the block with the hardware error or to a block with some subsequent file link or excess data error.) This tends to support a block replacement philosophy that leans towards early replacement as soon as you start encountering any correctable errors. However, to recover the data from a block with correctable errors, you have to use readPage (presumably followed by writePage to a page in the replacement block) on each of the used pages in the failing block in turn.

* Of course, this makes you wonder what you do when you get correctable errors in block 0. I suspect that the answer is to replace the flash chip. However, note that the chip is only "guaranteed" not to require error correction for block 0 for the first 1k program/erase cycles.
  That in turn suggests that you really want to use an in-memory cache for at least the contents of the bit maps in page 31 of block 0, and certainly want to avoid rewriting block 0 each time you allocate another page to add onto a file that you're writing. In other words, you only want to rewrite block 0 once for each file that you add to (or remove from) the flash.

* For the moment, I have located the .xfrvec section at address 0xffffffc0. The order of the entries is given by the jump table that is included in this code.

* The code does NOT make any stack references. It gets by using only registers R0 and R3-R6 for the page-level routines, and R0 and R3-R11 for readFile. These are all considered scratch (volatile) registers in a function call in the PowerPC run-time C environment - i.e. it is up to the caller to preserve their contents (if they are meaningful to the caller), not the called function.

**********************************************************************

PMC Bootstrap Process

1. DSOCM is at addresses 0 through 0xffff.
2. RLDRAM is at addresses 0x08000000 through 0x0fffffff.
3. ISOCM is at addresses 0xfffe0000 through 0xffffffff.
4. The on-processor DCR address map is defined by the fact that TIEDCRADDR is connected to 6 bits of zeroes.
5. The off-processor DCR address map is defined by the second half (starting with PATRN00 at 0x380) of the attached dcr.h file.
6. It's not obvious to me that you're ever going to try to use interrupts on this board, but if you do, I'll give you some documentation on the interrupt controllers in this hardware.

ISOCM Software Overview:

1. The code in the ISOCM for the high priority processor (first in the JTAG chain) is a highly modified copy of the work-in-progress code for the PMC board in its intended application as a PCI to multi-channel fiber link interface.
   However, invocation of the vast majority of the code has been stripped out of the initialization thread, with the result that most of the code actually in the ISOCM is dead code. The only active code is in the two attached modules (plus the branch instruction at 0xfffffffc that goes to INBGIN, plus a two-instruction endless loop, preceded by a one-instruction setup at LOABT, that sets the wait state [WE] bit in the machine state register and then branches back to itself).
2. The code in the .inproc section of inproc.S (beginning with INBGIN) ends up at the beginning of ISOCM, at address 0xfffe0000.
3. The single instruction in the .kntext section of this module (at INERR) ends up at 0xfffe1214.
4. The LOABT code is at 0xffffffb4, with the actual two-instruction loop at 0xffffffb8.
5. The transfer vector (.xfrvec section in flash.S) is at 0xffffffc0.
6. The .kntext section of flash.S (beginning with readPage) ends up at 0xfffe31b8.
7. Most of the code in inproc.S has been tested, up to the point where it aborts with R-S decoder errors when reading page 0 of block 0 to obtain the boot option vector (since the ECC bytes for this block have not yet been written). (Note that I've left in the sanity check that the file number for entry 0 in this vector is not -1, although it turned out to be unnecessary because an uninitialized flash chip already produces this R-S error.) However, the sanity check, the call to readFile, and the subsequent jump into the loaded boot file are untested.
8. The flash code itself in flash.S is untested, although it is based on code that I've used for checking out the hardware.
   In general, it's been carefully read, and searched for my great nemesis - attempting to add an immediate constant to R0. (If you don't know the PowerPC instruction set, this is a special "feature": when R0 is specified as the source register of an add-immediate instruction, the constant is not added to R0's contents; the constant itself is simply loaded into the destination register. That's how they implemented the "load immediate" instruction without needing a separate opcode for that purpose.)
9. If one were to strip out the dead code and leave only the necessary boot loader code and flash support package in these two modules (and move the abort loop into this code), the resulting code volume would currently be less than one kilobyte. As a result, my contention that you could build the Petacache or LSST chip with only two BRAMs per processor (32 bits of a 64-bit word in each), for a total of 4 kB of ISOCM, rather than the 64 BRAMs per processor (plus 32 per processor for DSOCM) in this current hardware design, seems to have a factor of four safety margin (4 kB of ISOCM against less than 1 kB of code).

Current Initialization Thread Flow:

1. Puts 0x00000001 into LEDs.
2. Code to initialize CCR0 is commented out (including setting instruction prefetch enable bits).
3. Invalidates instruction and data caches.
4. Code to clear SGR is commented out.
5. Loads 0x40000000 into ICCR and DCCR to enable caching of all 128 megabytes of RLDRAM.
6. Puts 0x00000002 into LEDs.
7. Fills all of RLDRAM with zeroes.
8. Tests all of RLDRAM for zeroes; aborts if any non-zero word is found.
9. Puts 0x00000003 into LEDs.
10. Fills all of RLDRAM with 0xffffffff.
11. Tests all of RLDRAM for 0xffffffff; aborts if any non-matching word is found.
12. Puts 0x00000004 into LEDs.
13. Fills all of RLDRAM with the byte offset of the current word from the RLDRAM base address.
14. Tests all of RLDRAM for the byte offset of the current word; aborts if any non-matching word is found.
15. Refills all of RLDRAM with zeroes.
16. Puts 0xdead0000 into LEDs.
    (N.B.: It typically takes 6-8 seconds to reach this point: 1-2 seconds to configure the FPGA from Platform Flash, and the balance to make the 7 passes through all of RLDRAM.)
17. Initiates a read of page 0 of block 0 of flash into the flash interface BRAM.
18. Waits for the read command to be started by the flash interface.
19. Puts 0xdead0001 into LEDs.
20. Waits for the read command to be completed by the flash interface.
21. Puts 0xdead0n02 into LEDs, where "n" is the low nibble of the "standard" flash status.
22. Aborts if "n" is non-zero. (N.B.: The current behavior with an uninitialized flash chip is that the hardware reports both correctable and uncorrectable R-S decode errors, so this abort is taken with a final value of 0xdead0302 in the LEDs.)
23. Checks the word at byte offset 4 (file number for boot option 0) in the flash interface BRAM; aborts if it is -1.
24. Puts 0xdead0003 into LEDs.
25. Extracts the appropriate element of the boot option vector from the flash interface BRAM and saves it in processor registers.
26. Invokes the readFile function using the file number and load address specified in that element of the boot option vector.
27. Puts 0xdeadnn04 into LEDs, where "nn" is the low byte of the "standard" status returned by readFile.
28. Aborts if "nn" is non-zero.
29. Clears LEDs.
30. Enters the boot file at the transfer address specified in that element of the boot option vector, using the flags specified in the element as the only entry parameter (passed in R3).

The bottom line here is that the only processor registers (other than the general purpose registers) that are disturbed from their post-reset values prior to entering the boot file code are ICCR and DCCR. In addition, the caches are invalidated. If it's important, I think that one could actually restore DCCR to its post-reset value, thus disabling the data cache. Of course, one would possibly have to repeat the flush before re-enabling that cache, and one would also have to avoid all data accesses to RLDRAM until that re-enable was performed.
However, I don't think that there's any way to disable the instruction cache once you've started executing code from the RLDRAM, as the interface from the I-side PLB to the RLDRAM controller doesn't (currently) support single-word fetches. BTW, you're venturing into the great unknown when you do start executing code from the RLDRAM, as the I-side PLB-to-RLDRAM interface hasn't been formally exercised yet in this chip. However, I did briefly exercise the corresponding interface in my Virtex-II Pro chip around 18 months ago, and the VHDL for this interface is a subset of the corresponding interface for the D-side PLB (without write capability).
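A few of the constants quoted in these notes can be cross-checked with a small host-side C sketch: the reset-vector branch at 0xfffffffc to INBGIN (0xfffe0000), encoded as the same kind of PowerPC unconditional "B" instruction used in the .xfrvec jump table; the ICCR/DCCR value 0x40000000 for RLDRAM at 0x08000000; and the LED codes from steps 21 and 27. The function names are hypothetical, and the cacheability helper assumes the PPC405 convention that each ICCR/DCCR bit covers one 128 MB region, with bit 0 (the MSB) covering the region starting at address 0.

```c
#include <assert.h>
#include <stdint.h>

/* Encode the PowerPC unconditional branch "b target" (I-form, opcode 18,
 * AA = 0, LK = 0) placed at address `from`. The LI field is a signed word
 * displacement, giving a reach of -32 MB .. +32 MB - 4 from `from`. */
static inline uint32_t ppc_b(uint32_t from, uint32_t to) {
    int32_t disp = (int32_t)(to - from);              /* two's complement */
    assert((disp & 3) == 0);                          /* word aligned     */
    assert(disp >= -0x2000000 && disp <= 0x1fffffc);  /* 26-bit range     */
    return 0x48000000u | ((uint32_t)disp & 0x03fffffcu);
}

/* ICCR/DCCR mask covering [lo, hi], assuming the PPC405 layout in which
 * bit 0 (the MSB) controls 0x00000000-0x07ffffff and each following bit
 * controls the next 128 MB region. */
static inline uint32_t ccr_mask(uint32_t lo, uint32_t hi) {
    uint32_t mask = 0;
    assert(lo <= hi);
    for (uint32_t region = lo >> 27; region <= (hi >> 27); region++)
        mask |= 0x80000000u >> region;
    return mask;
}

/* Step 21: 0xdead0n02, where n is the low nibble of the flash status. */
static inline uint32_t led_page0(unsigned status) {
    return 0xdead0002u | ((uint32_t)(status & 0xfu) << 8);
}

/* Step 27: 0xdeadnn04, where nn is the low byte of the readFile status. */
static inline uint32_t led_readfile(unsigned status) {
    return 0xdead0004u | ((uint32_t)(status & 0xffu) << 8);
}
```

For example, ppc_b(0xfffffffc, 0xfffe0000) yields 0x4bfe0004, ccr_mask(0x08000000, 0x0fffffff) yields the 0x40000000 loaded in step 5, and led_page0(0x3), with both R-S error bits set, yields the 0xdead0302 seen with an uninitialized flash chip.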