The first feature completed in the newly refactored scoreboarding code is a scheme for checkpointing and crash recovery of the scoreboards, together with alignment of the scoreboarding with regular clock hours and day periods. Both improvements are discussed below:
Checkpointing
Saving the Checkpoints
To implement checkpointing, I save the state of the scoreboards to disk while processing the scoreboards for each flow-file. When the program starts, it checks for checkpoints on disk before starting the scoreboarding of each direction. If a checkpoint is present, it unmarshals the data from the checkpoint and continues updating the scoreboards from that point on.
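The startup check described above could be sketched as follows. This is a minimal illustration, not the code in JKFlow.pm: it assumes Perl's core Storable module for unmarshalling, and the helper name restore_checkpoint and its arguments ($score_dir, $minutes) are hypothetical. The state fields and the "<minutes>checkpoint.dat" filename follow the description elsewhere in this section.

```perl
use strict;
use warnings;
use Storable qw(retrieve);

# Hypothetical helper: restore one report's state from its checkpoint, if any.
#   $ref       - the report hash (one element of the {report} array)
#   $score_dir - score directory of the direction being processed
#   $minutes   - report duration in minutes, used in the checkpoint filename
# Returns 1 if a checkpoint was loaded, 0 if the scoreboard starts fresh.
sub restore_checkpoint {
    my ($ref, $score_dir, $minutes) = @_;
    my $file = "$score_dir/${minutes}checkpoint.dat";
    return 0 unless -e $file;

    # retrieve() may die on a corrupt or truncated file; treat that as "no checkpoint".
    my $state = eval { retrieve($file) };
    return 0 unless ref $state eq 'HASH';

    # Copy the saved counters and bookkeeping values back into the live structure.
    $ref->{aggdata}{tuplevalues} = $state->{tuplevalues};
    $ref->{$_} = $state->{$_} for qw(count startperiod counter);
    return 1;
}
```

Treating an unreadable checkpoint the same as a missing one is a design choice of this sketch; it trades the loss of one period's partial counters for a clean restart.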
The first issue was how to save the state to disk. I decided to marshal the part of the nested hash structure that contains the current scoreboard and save it to disk. Specifically, mylist{direction}{$direction}{scoreboard}{aggregate}{report} points to an array of nested hashes, one for each type of report defined (highlighted in the thumbnail below). Let $ref = mylist{direction}{$direction}{scoreboard}{aggregate}{report}[$i] for the $i-th report type. Then $ref->{aggdata}{tuplevalues} contains a nested hash holding the current flow, packet, and byte counters, both inward and outward, for the particular direction (such as 'ATLAS BNL to CERN') selected by the $direction variable in mylist.
Each time the scoreboarding function in JKFlow.pm is called for a particular flow-file and then for a particular direction, I create a temporary data structure and store in it the values of $ref->{aggdata}{tuplevalues} together with other important variables representing the state of the scoreboard, including $ref->{count}, $ref->{startperiod}, and $ref->{counter}. This structure is then marshalled and written to disk using the technique I explained on this page.
The checkpoints are written such that there is a single checkpoint for each type of report, and it is updated for each flow-file. For example, the score directory for a particular monitored direction is shown below:
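As a rough sketch of the saving step, the temporary structure can be built and marshalled like this. It assumes Perl's core Storable module, which may differ from the marshalling technique actually used; the helper name save_checkpoint, its arguments, and the write-then-rename step are illustrative additions rather than part of JKFlow.pm.

```perl
use strict;
use warnings;
use Storable qw(nstore);

# Hypothetical helper: write the checkpoint for one report type.
#   $ref       - the report hash holding the live scoreboard state
#   $score_dir - score directory of the direction being processed
#   $minutes   - report duration in minutes, used in the checkpoint filename
sub save_checkpoint {
    my ($ref, $score_dir, $minutes) = @_;

    # Temporary structure holding only the state needed to resume.
    my %state = (
        tuplevalues => $ref->{aggdata}{tuplevalues},
        count       => $ref->{count},
        startperiod => $ref->{startperiod},
        counter     => $ref->{counter},
    );

    my $file = "$score_dir/${minutes}checkpoint.dat";
    my $tmp  = "$file.tmp";

    # Assumption of this sketch: write to a temporary file, then rename,
    # so a crash during the write never leaves a half-written checkpoint.
    nstore(\%state, $tmp)  or die "cannot write $tmp: $!";
    rename($tmp, $file)    or die "cannot rename $tmp to $file: $!";
    return $file;
}
```

Calling save_checkpoint($ref, $score_dir, 15) after each flow-file would overwrite the single 15-minute checkpoint for that direction, matching the one-checkpoint-per-report-type layout.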
    akbar@iepm-resp $ pwd
    /u1/flows/score/USATLAS Computing Farm - CERN
    akbar@iepm-resp $
    akbar@iepm-resp $ ls -ahl
    total 32K
    drwxr-xr-x  4 akbar sg 4.0K May 8 00:01 ./
    drwxr-xr-x 94 akbar sg 4.0K May 8 00:01 ../
    -rw-r--r--  1 akbar sg  561 May 8 00:01 1440checkpoint.dat
    -rw-r--r--  1 akbar sg  561 May 8 00:01 15checkpoint.dat
    drwxr-xr-x  2 akbar sg 4.0K May 8 00:01 2007-03-21/
    -rw-r--r--  1 akbar sg  561 May 8 00:01 360checkpoint.dat
    -rw-r--r--  1 akbar sg  561 May 8 00:01 60checkpoint.dat
    drwxr-xr-x  3 akbar sg 4.0K May 8 00:01 other/
    akbar@iepm-resp $
This is the top level of the score directory for the 'USATLAS Computing Farm - CERN' direction. There are four checkpoints in this directory. The number at the start of each checkpoint filename is the duration in minutes of the report type (as defined in JKFlow.xml) that the checkpoint belongs to. Thus 1440checkpoint.dat is the checkpoint for the daily aggregate report type and 360checkpoint.dat is the checkpoint for the six-hourly report.
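The naming convention can be illustrated with a pair of small helpers. Both function names are hypothetical and are not taken from JKFlow.pm; the sketch only encodes the "<minutes>checkpoint.dat" pattern visible in the listing.

```perl
use strict;
use warnings;

# Hypothetical helper: build the checkpoint filename for a report duration in minutes.
sub checkpoint_name {
    my ($minutes) = @_;
    return "${minutes}checkpoint.dat";
}

# Hypothetical helper: recover the duration in minutes from a checkpoint filename.
# Returns undef for names that do not follow the pattern.
sub checkpoint_minutes {
    my ($name) = @_;
    return $name =~ /^(\d+)checkpoint\.dat$/ ? $1 : undef;
}

# Print the report period, in minutes and hours, for each file in the listing.
for my $name (qw(15checkpoint.dat 60checkpoint.dat 360checkpoint.dat 1440checkpoint.dat)) {
    my $minutes = checkpoint_minutes($name);
    printf "%-20s %4d min = %g h\n", $name, $minutes, $minutes / 60;
}
```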