
The first completed work in the new, refactored scoreboarding code covers two improvements: a scheme for checkpointing and crash recovery of the scoreboards, and alignment of the scoreboarding with regular clock hours and day periods. Both are discussed below.

Checkpointing

To implement checkpointing, I save the state of the scoreboards to disk while processing the scoreboards for each flow-file. When the program starts, it checks for checkpoints on disk before starting the scoreboarding of each direction; if a checkpoint is present, it unmarshals the data from the checkpoint and resumes updating the scoreboards from that point on.
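As a rough illustration, the startup check might look like the sketch below. It assumes the checkpoints were marshalled with Perl's Storable module and follow the per-direction directory layout shown later on this page; the function name and arguments are hypothetical.

use strict;
use warnings;
use Storable qw(retrieve);

# Return the saved scoreboard state for one report period, or undef if no
# checkpoint exists yet (in which case the scoreboard starts from scratch).
sub restore_checkpoint_if_present {
    my ($scoredir, $period) = @_;                   # e.g. '/u1/flows/score/<direction>', 60
    my $file = "$scoredir/${period}checkpoint.dat";
    return undef unless -e $file;                   # no checkpoint on disk
    my $state = retrieve($file);                    # unmarshal the saved state
    return $state;                                  # caller copies counters back into mylist
}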

Saving the Checkpoints 

The first issue was how to save the state to disk. I decided to marshal and save to disk the part of the nested hash structure that contains the current scoreboard. Specifically, mylist{direction}{$direction}{scoreboard}{aggregate}{report} points to an array of nested hashes, one for each report type defined (as highlighted in the thumbnail below). Let $ref = mylist{direction}{$direction}{scoreboard}{aggregate}{report}[$i] for the $i-th report type. Then $ref->{aggdata}{tuplevalues} is a nested hash containing the current counters for flows/packets/bytes, both inward and outward, for a particular 'direction' (such as 'ATLAS BNL to CERN') specified by the $direction variable in mylist.
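The following sketch shows how the per-report counters are reached inside that nested hash. It is illustrative only: the helper name is hypothetical, and $mylist is passed in as a hash reference to the existing JKFlow structure described above.

use strict;
use warnings;

# Collect the counter hashes ($ref->{aggdata}{tuplevalues}) for every report
# type defined for one monitored direction.
sub report_counters {
    my ($mylist, $direction) = @_;    # $mylist: hashref to the JKFlow nested structure
    my $reports = $mylist->{direction}{$direction}{scoreboard}{aggregate}{report};
    my @counters;
    for my $i (0 .. $#{$reports}) {
        my $ref = $reports->[$i];                    # the $i-th report type
        push @counters, $ref->{aggdata}{tuplevalues}; # in/out flow, packet and byte counters
    }
    return @counters;
}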

Each time the scoreboarding function in JKFlow.pm is called for a particular flow-file and direction, I create a temporary data structure and store in it the values of $ref->{aggdata}{tuplevalues}, together with certain other important variables representing the state of the scoreboard, including $ref->{count}, $ref->{startperiod} and $ref->{counter}. The structure is then marshalled and written to disk using the technique I explained on this page.
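A minimal sketch of the save step is shown below, assuming Storable is used for the marshalling (the page linked above may describe a different mechanism). The function name is hypothetical; the fields copied are the ones listed above, and the filename follows the <period>checkpoint.dat convention shown in the directory listing below.

use strict;
use warnings;
use Storable qw(nstore);

# Snapshot the scoreboard state for one report type and overwrite its
# single checkpoint file on disk.
sub save_checkpoint {
    my ($scoredir, $period, $ref) = @_;
    my %state = (
        tuplevalues => $ref->{aggdata}{tuplevalues},   # current in/out counters
        count       => $ref->{count},
        startperiod => $ref->{startperiod},
        counter     => $ref->{counter},
    );
    nstore(\%state, "$scoredir/${period}checkpoint.dat");
}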

The checkpoints are written so that there is a single checkpoint per report type, updated for each flow-file. For example, the score directory for one monitored direction is shown below:

akbar@iepm-resp $ pwd
/u1/flows/score/USATLAS Computing Farm - CERN
akbar@iepm-resp $
akbar@iepm-resp $ ls -ahl
total 32K
drwxr-xr-x    4 akbar    sg           4.0K May  8 00:01 ./
drwxr-xr-x   94 akbar    sg           4.0K May  8 00:01 ../
-rw-r--r--    1 akbar    sg            561 May  8 00:01 1440checkpoint.dat
-rw-r--r--    1 akbar    sg            561 May  8 00:01 15checkpoint.dat
drwxr-xr-x    2 akbar    sg           4.0K May  8 00:01 2007-03-21/
-rw-r--r--    1 akbar    sg            561 May  8 00:01 360checkpoint.dat
-rw-r--r--    1 akbar    sg            561 May  8 00:01 60checkpoint.dat
drwxr-xr-x    3 akbar    sg           4.0K May  8 00:01 other/
akbar@iepm-resp $

This is the top level of the score directory for the 'USATLAS Computing Farm - CERN' direction. There are four checkpoints in this directory. The number before 'checkpoint' in each filename is the duration in minutes of the report type (as defined in JKFlow.xml) that the checkpoint belongs to. Thus 1440checkpoint.dat is the checkpoint for the daily aggregate report type and 360checkpoint.dat is the checkpoint for the six-hourly report.
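Putting the pieces together, a hypothetical driver could walk the report periods shown in this listing and restore each one at startup using the restore sketch above; the periods correspond to the 15-minute, hourly, six-hourly and daily report types.

my $scoredir = '/u1/flows/score/USATLAS Computing Farm - CERN';
for my $period (15, 60, 360, 1440) {
    my $state = restore_checkpoint_if_present($scoredir, $period);
    next unless $state;   # no checkpoint yet for this report type
    # ... copy $state->{tuplevalues}, {count}, {startperiod} and {counter}
    #     back into the corresponding report $ref before processing resumes ...
}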
