Currently it is not possible to roll back jobs without first deleting the working directory. This needs to be fixed.

Constraints

  • Operations on rollback should be performed by the job submission engine, since this is the only service which can be assumed to have direct access to the (possibly remote) file system.
    • The job submission system has limited knowledge of what is going on: basically, it knows about the working directory and the files that are submitted with the job (normally just pipeline_wrapper)
    • We don't want to make the job submission engine much more complex, because it needs to operate at multiple locations, and needs to be highly reliable.
  • We would like the arrangement of working directories not to be any more complex than it is now (and eventually simpler or more logical – see discussion PII-170@jira).
    • Putting the "execution count" into the directory structure, at least for the "latest" execution, seems to violate this principle
  • We would like pointers to the working directory and log file to continue to work for old executions after they are rolled back. This is so that both the web interface and the pfind command can function properly.
    • Currently the working directory is computed and stored in the DB when the process is submitted. The log file location is stored when the job-started e-mail is received from the job.
      • We could relativise the log file location with respect to the working directory to make subsequent changes easier (see the sketch after this list).
      • The DB does know when a process is no longer the "latest" as a result of either stream or process rollback.
  • We need to handle both stream rollback and process rollback.
    • During stream rollback processes within substreams are implicitly rolled back, without the actual processinstance being modified.
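
To illustrate the relativisation idea above, a minimal sketch using java.nio.file (the paths shown are hypothetical examples, not the real directory layout):

  import java.nio.file.Path;
  import java.nio.file.Paths;

  public class RelativeLogPath {
      public static void main(String[] args) {
          // Hypothetical absolute paths as they might be stored in the DB today.
          Path workingDir = Paths.get("/data/jobs/stream_A/process_7");
          Path logFile    = Paths.get("/data/jobs/stream_A/process_7/logs/logFile.txt");

          // Store only the part relative to the working directory...
          Path relativeLog = workingDir.relativize(logFile);   // logs/logFile.txt

          // ...so that if the working directory is later moved (e.g. archived),
          // the log file pointer can be recovered from the new location.
          Path archivedDir = Paths.get("/data/jobs/stream_A/process_7/archive/123456");
          System.out.println(archivedDir.resolve(relativeLog));
      }
  }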

Potential fixes

Make the default working directory path longer so that it also includes the execution number for streams and processes.

  • (smile) Simple to implement
  • (sad) Makes paths more complicated and confusing for users
  • (sad) Doesn't work if the user specifies the working directory themselves
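
For concreteness, a rough sketch of what such an extended default path might look like (the root, layout and names below are illustrative assumptions, not the current scheme):

  import java.nio.file.Path;
  import java.nio.file.Paths;

  public class DefaultWorkingDir {
      // Hypothetical builder that appends execution numbers for the stream
      // and the process to the default working directory path.
      static Path build(Path root, String stream, int streamExec,
                        String process, int processExec) {
          // e.g. /data/jobs/stream_A/exec_2/process_7/exec_1
          return root.resolve(stream)
                     .resolve("exec_" + streamExec)
                     .resolve(process)
                     .resolve("exec_" + processExec);
      }

      public static void main(String[] args) {
          System.out.println(build(Paths.get("/data/jobs"), "stream_A", 2, "process_7", 1));
      }
  }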

Allow job submission engine to move old working directories

This would involve the following:

  1. Modify the job submission engine so that it can take an extra option moveOldWorkingDir(String location)
    • This would indicate to the job submission engine that, if the working directory already exists, its contents should be moved to location before creating the new job. Some smarts would be required so that location could be relative to the current working dir (e.g. archive/nnnnnn) and already existing archives would not be moved (see the sketch after the pros/cons list below).
  2. When submitting jobs (after rollback?) we would do the following:
    • See if any existing job used the same working directory (fairly easy DB query)
    • If so, get either the processinstance id (a big number) or the execution instance path (e.g. _1_3_1) and add the moveOldWorkingDir option to the job submission
    • Update the database for the old job so that it knows the new location of its working directory and log file
  • (smile) Maintains reasonably understandable working directory structure
  • (smile) Works for any user-defined working directory structure, as long as different jobs (other than rollbacks) do not share the same working directory
  • (smile) Database maintains knowledge about all working directories, log files
  • (sad) Working directories are not moved until the rolled-back job is submitted
  • (sad) Name of directory in archive not a simple ordinal
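
A minimal sketch of how the job submission engine might handle the proposed moveOldWorkingDir option (method names, error handling and the archive layout are assumptions):

  import java.io.IOException;
  import java.nio.file.*;

  public class MoveOldWorkingDir {
      // Hypothetical handling of the proposed moveOldWorkingDir(String location) option.
      // If the working directory already exists, its contents are moved into 'location'
      // (resolved relative to the working directory when not absolute), except for the
      // archive tree itself, so previously archived executions stay where they are.
      static void moveOldWorkingDir(Path workingDir, String location) throws IOException {
          if (!Files.isDirectory(workingDir)) {
              return;                                   // nothing to archive
          }
          Path target = workingDir.resolve(location);   // e.g. archive/123456
          Files.createDirectories(target);
          try (DirectoryStream<Path> entries = Files.newDirectoryStream(workingDir)) {
              for (Path entry : entries) {
                  if (target.startsWith(entry)) {
                      continue;                         // skip the archive parent / target itself
                  }
                  Files.move(entry, target.resolve(entry.getFileName()));
              }
          }
      }
  }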

Some other additions to the job submission engine to allow files to be moved before the new job is submitted

Not thought through.

3 Comments

  1. In item 2., bullet 2, in the second "potential fix", could not one simply use _N, where "N" is the process instance for that particular process as the archive directory name? That is, why use the entire (lengthy) execution instance path where one is only concerned with the instance for that particular process step?

    In item 1., is it necessary to create two new directory levels (archive/nnnnn) rather than just one (archive_N)?

    1. Using _N where N is the process instance # (large database number) is possible.
      Using _N where N is the execution number (1,2,3...) is not possible, because when a stream is rolled back the execution number of the process instance in the new stream is still 1. We therefore have to use a combination of the process instance's execution number and the execution numbers of all the streams (and substreams) above it, i.e. something like _1_3?

      archive_nnnnn rather than archive/nnnnn is possible, although it perhaps makes it a little trickier to avoid moving the already-archived files into the new archive directory.

      1. A slight variant of my original suggestion is to use "N" where N=1,2,3,... but having no relation to the instance number. This is how I have implemented my own private rollback preparation scheme. For example, prior to rolling back a run, I work out the path to the log directory and create a directory "archive_N" into which I place everything (unless it is another archive directory), e.g., after a 2nd run of this stream:

        archive_1/ eventId.txt logFile.txt pipeline_env pipeline_summary pipeline_wrapper

        The value of "N" is automatically incremented should a subsequent iteration be needed. This scheme provides a unique and concise naming convention, and it preserves all processing history.