Details of Checkpoint and Alignment Implementation

The first completed feature in the newly refactored scoreboarding code is a scheme for checkpointing and crash recovery of the scoreboards, together with alignment of the scoreboarding to regular clock hours and day periods. Both improvements are discussed below:

CheckPointing

To implement checkpointing, I save the state of the scoreboards to disk while the scoreboards are processed for each flow-file. When the program starts, it checks for a checkpoint on disk before scoreboarding each direction; if one is present, it unmarshals the data from the checkpoint and continues updating the scoreboards from that point on.

Saving the Checkpoints

The first issue was how to save the state to disk. I decided to marshal the part of the nested hash structure that contains the current scoreboard and save it to disk. Specifically, mylist{direction}{$direction}{scoreboard}{aggregate}{report} points to an array of nested hashes, one for each type of report defined (highlighted in the thumbnail below). Let $ref = mylist{direction}{$direction}{scoreboard}{aggregate}{report}[$i] for the $i-th report type. Then $ref->{aggdata}{tuplevalues} is a nested hash holding the current counters for flows, packets and bytes, both inward and outward, for the particular direction (such as 'ATLAS BNL to CERN') specified by the $direction variable in mylist.

Each time the scoreboarding function in JKFlow.pm is called for a particular flow-file and then for a particular direction, I create a temporary data structure and store in it the values of $ref->{aggdata}{tuplevalues} along with other important variables representing the state of the scoreboard, including $ref->{count}, $ref->{startperiod} and $ref->{counter}. The structure is then marshalled and written to disk using Storable, a module written in C. The technique is explained on this page.
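The saving step described above can be sketched roughly as follows. This is a minimal illustration, not the code from JKFlow.pm: the function name save_checkpoint, the sample contents of tuplevalues (the in/out, flows/pkts/bytes keys and the address) and the write-then-rename step are assumptions added for the example. Only nstore and retrieve are actual Storable calls.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical stand-in for one report entry, i.e. the $ref described above.
# The layout inside tuplevalues is invented for illustration.
my $ref = {
    count       => 3,
    counter     => 42,
    startperiod => 1178582400,
    aggdata     => {
        tuplevalues => {
            '10.0.0.1' => {
                in  => { flows => 5, pkts => 120, bytes => 98_304 },
                out => { flows => 2, pkts => 40,  bytes => 12_288 },
            },
        },
    },
};

# Copy the fields that describe the scoreboard state into a temporary
# structure, then marshal that structure to disk with Storable.
sub save_checkpoint {
    my ($ref, $file) = @_;
    my %state = (
        tuplevalues => $ref->{aggdata}{tuplevalues},
        count       => $ref->{count},
        startperiod => $ref->{startperiod},
        counter     => $ref->{counter},
    );
    # Assumption, not from the original text: write to a temporary file and
    # rename it, so a crash during the write cannot leave a truncated file.
    my $tmp = "$file.tmp";
    nstore(\%state, $tmp) or die "Cannot store checkpoint in $tmp";
    rename $tmp, $file or die "Cannot rename $tmp to $file: $!";
    return;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, '15checkpoint.dat');
save_checkpoint($ref, $file);

# Read the checkpoint back to confirm the round trip.
my $restored = retrieve($file);
print "counter=$restored->{counter}\n";   # prints "counter=42"
```

nstore is used here rather than store so that the file is written in network byte order and stays readable if the score directory is moved to a machine with a different architecture; whether the original code uses store or nstore is not stated.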

The checkpoints are written so that there is a single checkpoint for each type of report, and it is updated for each flow-file. For example, the score directory for one monitored direction is shown below:

akbar@iepm-resp $ pwd
/u1/flows/score/USATLAS Computing Farm - CERN
akbar@iepm-resp $
akbar@iepm-resp $ ls -ahl
total 32K
drwxr-xr-x    4 akbar    sg           4.0K May  8 00:01 ./
drwxr-xr-x   94 akbar    sg           4.0K May  8 00:01 ../
-rw-r--r--    1 akbar    sg            561 May  8 00:01 1440checkpoint.dat
-rw-r--r--    1 akbar    sg            561 May  8 00:01 15checkpoint.dat
drwxr-xr-x    2 akbar    sg           4.0K May  8 00:01 2007-03-21/
-rw-r--r--    1 akbar    sg            561 May  8 00:01 360checkpoint.dat
-rw-r--r--    1 akbar    sg            561 May  8 00:01 60checkpoint.dat
drwxr-xr-x    3 akbar    sg           4.0K May  8 00:01 other/
akbar@iepm-resp $

This is the top level of the score directory for the 'USATLAS Computing Farm - CERN' direction. There are four checkpoints in this directory. The number at the start of each checkpoint file name is the duration in minutes of the report type (as defined in JKFlow.xml) that the checkpoint belongs to. Thus 1440checkpoint.dat is the checkpoint for the daily aggregate report type and 360checkpoint.dat is the checkpoint for the six-hourly report.
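As a small illustration of this naming scheme, the checkpoint path for a given direction and report duration could be built as below. The function name checkpoint_path is hypothetical; the directory and file-name pattern are taken from the listing above.

```perl
use strict;
use warnings;
use File::Spec;

# Build the checkpoint path for one direction and one report duration
# (in minutes), following the "<minutes>checkpoint.dat" pattern above.
# The function name is illustrative, not taken from JKFlow.pm.
sub checkpoint_path {
    my ($scoredir, $direction, $minutes) = @_;
    return File::Spec->catfile($scoredir, $direction, "${minutes}checkpoint.dat");
}

# One checkpoint per report type: 15 minutes, hourly, six-hourly, daily.
my @paths = map {
    checkpoint_path('/u1/flows/score', 'USATLAS Computing Farm - CERN', $_)
} (15, 60, 360, 1440);

print "$_\n" for @paths;
```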

Loading the Checkpoints

The checkpoints only need to be loaded at the start of a program run. For this purpose I have a variable called FIRSTTIME, which has module-wide visibility for all objects of JKFlow. Its value defaults to 0 and is set to 1 after the first flow-file has been processed. Thus the checkpoint (if present) is loaded only while the first flow-file of a run is being processed.

Once I know that this is the first run of the program, I check whether a checkpoint is available for the particular type of report. If it is, I recover the values using the retrieve function of the Storable module. Some checks then determine whether the checkpoint is valid, i.e.

the checkpoint is after the start-time of the current report.
the checkpoint is not after the end time of the current report.

Once these checks have passed, I restore the state of the counters in the mylist data structure and continue with the scoreboarding.

In this way, if the machine crashes and is restarted, the scoreboarding resumes from the point where it left off.
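The loading steps described above could look roughly like the sketch below. Several details are assumptions made for the example: the text does not say which timestamp is compared against the report period, so the sketch uses a hypothetical lasttime field stored in the checkpoint; the function names load_checkpoint and restore_state are invented; and the period boundaries are treated as inclusive. The retrieve call is the real Storable function.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

my $FIRSTTIME = 0;   # stays 0 until the first flow-file has been processed

# Return the saved state if a checkpoint exists and lies inside the current
# report period, otherwise undef. 'lasttime' is a hypothetical field.
sub load_checkpoint {
    my ($file, $period_start, $period_end) = @_;
    return undef unless -e $file;
    my $state = eval { retrieve($file) };    # a corrupt file must not be fatal
    return undef unless ref($state) eq 'HASH';
    my $t = $state->{lasttime};
    return undef unless defined $t;
    return undef if $t < $period_start;      # from before the current report
    return undef if $t > $period_end;        # from after the current report
    return $state;
}

# Put the saved counters back into the report entry ($ref in the text above).
sub restore_state {
    my ($ref, $state) = @_;
    $ref->{aggdata}{tuplevalues} = $state->{tuplevalues};
    $ref->{$_} = $state->{$_} for qw(count startperiod counter);
    return;
}

# Demonstration with a made-up hourly checkpoint.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/60checkpoint.dat";
nstore({
    lasttime    => 1_000_900,
    startperiod => 1_000_000,
    count       => 3,
    counter     => 7,
    tuplevalues => { example => 1 },
}, $file) or die "Cannot store $file";

my $ref = { aggdata => { tuplevalues => {} } };
if ($FIRSTTIME == 0) {
    my $state = load_checkpoint($file, 1_000_000, 1_003_600);
    restore_state($ref, $state) if $state;
    $FIRSTTIME = 1;   # later flow-files skip the checkpoint lookup
}
print "restored counter: $ref->{counter}\n";   # prints "restored counter: 7"

# A checkpoint that falls outside the requested period is ignored.
my $stale = load_checkpoint($file, 2_000_000, 2_003_600);
print "stale checkpoint ignored\n" unless defined $stale;
```

Wrapping retrieve in eval means a damaged checkpoint is treated the same as a missing one, so the program falls back to starting the report from empty counters instead of aborting.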
