• Jobs disappear from view once they are submitted. Navid should give the pipeline the LSF job number when a job is submitted (see the sketch after this list). We should have a mechanism for monitoring jobs from the web.
  • The express queue sometimes takes more than 45 minutes to start jobs, so Navid's cleanup jobs can be delayed for a long time.
  • We sometimes exceed 600 simultaneous connections to the database. The new connection pool will queue excess connections and mitigate this problem.
  • A database hiccup can leave the pipeline status very confused. The actual problem observed was caused by two job schedulers running simultaneously on the test pipeline.
  • There is no script to nuke running jobs (needed, for example, ahead of a pending database outage).
  • We should have a web interface for looking at batch jobs. This could be tied into the pipeline if the pipeline knew the LSF job number.
  • The log file cannot be viewed while a job is running. Navid writes the job's log file to a temporary location and moves it at the end of the job (why? we should ask Navid). Dan knows the location of the log file, so a web interface could show it.
  • Log files lack timestamps. Tom suggests writing a timestamp plus the node name to the log file at every transition. Matt suggests using a Perl logger for messages printed by the batch system, and replacing print statements with a C++ logger, but Richard points out that Gaudi already has a message service.
  • Writing all log files to u05 will eventually cause NFS problems.
  • We have not been able to use scratch space reliably on non-BaBar machines. We should raise this with SCCS. (Perhaps we can use LSF to tell us which machines have scratch space.)
  • Tom would like to be able to change queues for tasks.
  • We would like a flexible way to specify file locations.
  • It is hard to grep the log files because roughly 2000 file names together exceed the Unix command-line length limit.
  • We should not have to edit the Gleam wrapper script.
  • Display a summary of the running jobs in a task (e.g. CPU time used, node, time queued, etc.).
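
A rough sketch of how the submission step could capture and record the LSF job number. The bsub reply parsed here is LSF's standard "Job <N> is submitted to queue <Q>." message; the queue name, wrapper path, and record_job_id() hook are placeholders, not the pipeline's actual interface:

    use strict;
    use warnings;

    sub submit_job {
        my ($queue, $command) = @_;
        my $reply = `bsub -q $queue $command 2>&1`;
        if ($reply =~ /Job <(\d+)> is submitted to queue <([^>]+)>/) {
            return ($1, $2);    # LSF job number and the queue it landed in
        }
        die "bsub failed: $reply";
    }

    sub record_job_id {
        my ($job_id, $queue) = @_;
        # placeholder: the pipeline would store the id in its database here,
        # so a web page (or "bjobs <id>") can monitor the job later
        print "recorded LSF job $job_id on queue $queue\n";
    }

    my ($job_id, $queue) = submit_job('express', './GleamWrapper.pl');
    record_job_id($job_id, $queue);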

2 Comments

  1. Just want to add a few points. First, I expressed the opinion that the job status information must be absolutely reliable. Second, that to identify scaling problems (Oracle, NFS, whatever), sub-second time stamps embedded in the log file(s) will be crucial. Third, that for debugging we need to keep all generations of log files, especially from failed runs. Most of my other comments derive from these points.

    ===

    I hope to work on adding some pre-processing and post-processing steps to the GleamWrapper*.pl scripts. These steps can, for example:

    • manage old and new log files to prevent them from being overwritten
    • output time stamps
    • provide primitive job state information in the form of 0-byte files (e.g. gleam-in.start, gleam-in.success, gleam-in.failed)
    • perform post-processing scans of the log file, looking for tell-tale signs of problems, collecting job statistics, etc., and writing out a short summary.

    After the task is complete, these short job summaries can be harvested and combined to provide a task-wide summary.
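
    As a rough illustration only (the run_gleam.sh command, file names, and regular expressions below are placeholders, not the actual wrapper code), the pre/post-processing could look something like this in Perl:

    use strict;
    use warnings;

    my $stem    = 'gleam-in';      # marker-file stem, as in the list above
    my $logfile = 'gleam_in.log';  # assumed log file name

    # pre-processing: drop a 0-byte "start" marker
    open my $fh, '>', "$stem.start" or die "cannot create $stem.start: $!";
    close $fh;

    # the real work (placeholder for the actual gleam invocation)
    my $status = system('./run_gleam.sh');

    # post-processing: record success or failure as another 0-byte marker ...
    my $marker = ($status == 0) ? "$stem.success" : "$stem.failed";
    open $fh, '>', $marker or die "cannot create $marker: $!";
    close $fh;

    # ... then scan the log for tell-tale lines and write a short summary
    # that can later be harvested into a task-wide report
    my ($events, $errors) = (0, 0);
    open my $log, '<', $logfile or die "cannot read $logfile: $!";
    while (<$log>) {
        $events = $1 if /number of events\D*(\d+)/i;
        $errors++   if /ERROR|FATAL/;
    }
    close $log;

    open my $sum, '>', "$stem.summary" or die "cannot write $stem.summary: $!";
    print $sum "status=$status events=$events errors=$errors\n";
    close $sum;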

    A time stamp along the lines of the following would suffice:

    noma0442 2005 09 29 10:04:13.9447 - <some short message>

    This contains the machine name and the full date and (sub-second) time.
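
    A minimal Perl helper that emits stamps in this format, using only the standard Time::HiRes, Sys::Hostname, and POSIX modules, could look like:

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday);
    use Sys::Hostname qw(hostname);
    use POSIX qw(strftime);

    # returns e.g. "noma0442 2005 09 29 10:04:13.9447 - starting gleam"
    sub stamp {
        my ($msg) = @_;
        my ($sec, $usec) = gettimeofday();
        my $when = strftime('%Y %m %d %H:%M:%S', localtime($sec));
        return sprintf "%s %s.%04d - %s", hostname(), $when, $usec / 100, $msg;
    }

    print stamp('starting gleam'), "\n";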

    Currently, our log files have names like this:

    2piMuon10GeV-GR-HEAD1.651_00001_gleam_in_LSF_out.log

    It's nice to have the task name embedded in the name, but due to Unix command line length limitations (10,240 characters), I recommend shortening this to:

    gleam_in.log

    It

  2. (oops, hit the wrong button and prematurely submitted the last comment)

    While having the task name is nice, it is also redundant with an upper-level directory name, so we can probably do without it reappearing in the log file name.

    It may be desirable to rethink the organization of the output file directory structure. Currently we have:

    <task-name>/rootData/<run-number>/log/<logfilename>.log

    Attempting to "grep" for information in the log files is what causes the Unix command line limit to be exceeded. Besides shortening the log file name, we might also elect to remove the "/log" level of directory structure.

    At some point we may need to distribute log and root files to different file systems either to avoid contention or simply because the filesystem is too small to handle the needs of the job. By writing files locally (to /scratch space on the farm machines), we can know exactly how many bytes are needed to store the output and can imagine a storage server of sorts that can dynamically assign a specific location.

    To help in understanding just how the pipeline works, I also encourage the developers to produce some simple documentation describing it. Useful information would include:

    • a block diagram of the software components used and their relationships to one another (and their dependencies, especially on external products such as LSF, Oracle and mysql)
    • a time-line indicating the life-cycle of a single job request
    • exactly when and under what circumstances Oracle transactions are performed
    • ditto for mysql
    • exactly when the run directories are created
    • ditto for the log files

    Getting real-time access to in-progress log files is essential, especially if one can use the "tail -f" command, which automatically follows the progress of a given file.

    Being able to cancel an in-progress task addresses situations other than an Oracle outage. One might start a task without first awaiting a final okay from an expert on the quality of a build (for example, because the feedback is delayed or an affirmative response is anticipated), or when testing features of the pipeline in a realistic setting.

    For preserving old log files, something along the lines of simply appending a number at the end of the file name would convey both the original intent of the file name and the generation of the processing attempt, e.g.

    gleam.log
    gleam.log0
    gleam.log1
    gleam.log2
    ...

    where "gleam.log" is the most recent and successful log file, while the others represent earlier attempts. This structure would allow an easy way to ignore all but the successful attempt when trying to scan many log files for problems, e.g.,

    grep "number of events" /.log

    Once some of these steps are implemented, I will feel more confident about being able to follow a task in progress, assess its status, assess its success, perform debugging, and create a report.

    • Tom