Details of Checkpoint and Alignment Implementation

The first feature completed in the newly refactored scoreboarding code is a scheme for checkpointing and crash recovery of the scoreboards, together with alignment of the scoreboarding to regular clock-hour and day-period boundaries. Both improvements are discussed below.

Checkpointing

To implement checkpointing, I save the state of the scoreboards to disk while processing the scoreboards for each flow-file. When the program starts, it checks for checkpoints on disk before scoreboarding each direction; if a checkpoint is present, it unmarshals the data from the checkpoint and continues updating the scoreboards from that point on.
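The startup check described above can be sketched roughly as follows. This is a minimal sketch rather than the actual JKFlow.pm code: it assumes the checkpoint was written with Perl's core Storable module, and the helper name load_checkpoint and the key names of the empty fallback state are hypothetical.

```perl
use strict;
use warnings;
use Storable qw(retrieve);

# Hypothetical helper: return the saved state if a usable checkpoint file
# exists, otherwise a fresh, empty state so scoreboarding starts from scratch.
sub load_checkpoint {
    my ($file) = @_;
    if (-e $file) {
        # retrieve() dies on a corrupt file, so trap that with eval.
        my $state = eval { retrieve($file) };
        return $state if ref $state eq 'HASH';
        warn "Ignoring unreadable checkpoint $file\n";
    }
    # Assumed layout of an empty state (not taken from JKFlow.pm).
    return { tuplevalues => {}, count => 0, counter => 0, startperiod => 0 };
}
```

Falling back to an empty state when the file is missing or unreadable means a damaged checkpoint costs at most the counters of the current period instead of stopping the run.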

The first issue was how to save the state to disk. I decided to marshal and save the part of the nested hash structure that contains the current scoreboard. Specifically, mylist{direction}{$direction}{scoreboard}{aggregate}{report} points to an array of nested hashes, one for each type of report defined (highlighted in the thumbnail below). Let $ref = mylist{direction}{$direction}{scoreboard}{aggregate}{report}[$i] for the $i-th report type. Then $ref->{aggdata}{tuplevalues} is a nested hash holding the current counters for flows, packets and bytes, both inward and outward, for the particular 'direction' (such as 'ATLAS BNL to CERN') named by the $direction variable in mylist.
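To make the nesting concrete, here is a small self-contained illustration of that shape. The outer keys follow the description above; the keys inside tuplevalues (the tuple string and the in/out counter names) and all the numbers are invented for the example and are not taken from JKFlow.pm.

```perl
use strict;
use warnings;

my %mylist;
my $direction = 'ATLAS BNL to CERN';

# Two report types. The inner layout of tuplevalues is an assumption
# made only for this illustration.
$mylist{direction}{$direction}{scoreboard}{aggregate}{report} = [
    { aggdata => { tuplevalues => {
        '192.0.2.1' => {
            in  => { flows => 3, pkts => 40, bytes => 5200 },
            out => { flows => 1, pkts => 12, bytes => 900 },
        },
    } } },
    { aggdata => { tuplevalues => {} } },
];

# Walk the array of reports the same way the text does with $ref and $i.
my $reports = $mylist{direction}{$direction}{scoreboard}{aggregate}{report};
for my $i (0 .. $#{$reports}) {
    my $ref    = $reports->[$i];
    my $tuples = keys %{ $ref->{aggdata}{tuplevalues} };
    print "report $i: $tuples tuple(s)\n";
}
```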

Each time the scoreboarding function in JKFlow.pm is called for a particular flow-file and then for a particular direction, I create a temporary data structure and store in it the values of $ref->{aggdata}{tuplevalues}, as well as certain other important variables representing the state of the scoreboard, including $ref->{count}, $ref->{startperiod} and $ref->{counter}. This structure is then marshalled and written to disk using the technique I explained on this page.
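A rough sketch of this save step is given below, under stated assumptions: it uses Perl's core Storable module as the marshalling technique (the page referred to above may use something else), the helper name save_checkpoint is hypothetical, and the write-to-temporary-file-then-rename step is an addition for crash safety that is not described in the text.

```perl
use strict;
use warnings;
use Storable qw(nstore);

# Hypothetical helper: copy the checkpointed fields out of one report's
# state ($ref) into a temporary structure and write it to $file.
sub save_checkpoint {
    my ($ref, $file) = @_;
    my %snapshot = (
        tuplevalues => $ref->{aggdata}{tuplevalues},
        count       => $ref->{count},
        startperiod => $ref->{startperiod},
        counter     => $ref->{counter},
    );
    # Assumption, not from the source: write to a temporary file and then
    # rename it, so a crash in the middle of the write cannot leave a
    # truncated checkpoint behind.
    my $tmp = "$file.tmp";
    nstore(\%snapshot, $tmp) or die "cannot write $tmp: $!";
    rename $tmp, $file or die "cannot rename $tmp to $file: $!";
    return 1;
}
```

Calling save_checkpoint($ref, $path) once per report type after each flow-file would keep exactly one current checkpoint per report.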

The checkpoints are written so that there is a single checkpoint file for each type of report, and that file is updated for each flow-file. For example, the score directory for one monitored direction is shown below:

akbar@iepm-resp $ pwd
/u1/flows/score/USATLAS Computing Farm - CERN
akbar@iepm-resp $
akbar@iepm-resp $ ls -ahl
total 32K
drwxr-xr-x    4 akbar    sg           4.0K May  8 00:01 ./
drwxr-xr-x   94 akbar    sg           4.0K May  8 00:01 ../
-rw-r--r--    1 akbar    sg            561 May  8 00:01 1440checkpoint.dat
-rw-r--r--    1 akbar    sg            561 May  8 00:01 15checkpoint.dat
drwxr-xr-x    2 akbar    sg           4.0K May  8 00:01 2007-03-21/
-rw-r--r--    1 akbar    sg            561 May  8 00:01 360checkpoint.dat
-rw-r--r--    1 akbar    sg            561 May  8 00:01 60checkpoint.dat
drwxr-xr-x    3 akbar    sg           4.0K May  8 00:01 other/
akbar@iepm-resp $
