Note: All work is on TEST unless otherwise noted.
4/23/2008:
morning:
Tony and Dan got the TEST server up to date and running the newest code. Tony installed his dummy batch submitter to run "DATA" jobs. Dan created a test task "StreamStress" which is 4 task levels deep and exercises all dependency types. Dan also installed the Oracle Jobs which generate usage statistics. Put in a fix (PII-322) that makes the PL/SQL function 'GetStreamIsLatestPath(streamPK)' 33% faster.
2pm:
Started running StreamStress streams. Throughput seems slow: no more than about 200 jobs are reaching the running or pending states (according to the usage plots).
4:45pm:
Changed ProcessStatusCondition dependency checking to only evaluate dependents if the transition was to a final state. Reran a large stream params=(4,20,100) id=7, ver=1.0.
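The idea of the change, as a minimal sketch in Java (the names isFinal() and evaluateDependents() are invented; the real change may live in the server code or the stored procedures, and this is not that code):

enum ProcessingStatus {
    WAITING, READY, RUNNING, SUCCESS, FAILED, SKIPPED;

    // Assumed set of final states, purely for the sake of the sketch.
    boolean isFinal() {
        return this == SUCCESS || this == FAILED || this == SKIPPED;
    }
}

class DependencyCheckerSketch {
    void onStatusTransition(long processInstance, ProcessingStatus newStatus) {
        // Before: dependents were evaluated on every status transition.
        // After: only transitions into a final state trigger evaluation.
        if (newStatus.isFinal()) {
            evaluateDependents(processInstance);
        }
    }

    void evaluateDependents(long processInstance) {
        // ... re-check status/completion conditions for dependent processes ...
    }
}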
4/24/2008:
12:15am
Loaded stored procedures that do stream-by-stream locking rather than top-level locking.
Ran another large stream id=8 ver=1.0.
This one did not work out well because of a bug in my code: a helper routine was supposed to change an object, but it wasn't actually changing it, because I reassigned the object reference rather than modifying the object itself.
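This is the classic pass-by-value-of-reference mistake; a minimal illustration in Java (hypothetical names -- the actual fix went out with the re-uploaded stored procedures below, so this only shows the general pattern):

class Holder {
    String status = "WAITING";
}

class ReferenceBugDemo {
    // Broken: reassigning the parameter only changes the local reference;
    // the caller's object is never touched.
    static void markDoneBroken(Holder h) {
        h = new Holder();
        h.status = "DONE";
    }

    // Fixed: mutate the object the reference points at.
    static void markDoneFixed(Holder h) {
        h.status = "DONE";
    }

    public static void main(String[] args) {
        Holder h = new Holder();
        markDoneBroken(h);
        System.out.println(h.status);  // still WAITING
        markDoneFixed(h);
        System.out.println(h.status);  // DONE
    }
}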
3:20pm
Fixed the problem with the last run. Reuploaded stored procedures, running stream id=9, ver=1.0.
6pm
The last run took 2.25 hours, where the previous runs had taken about 1.5 hours. Oracle finally offered new execution plans for some of the new stored procedure queries (those that find status/completion dependencies). I accepted the proposed changes (which seem to be different from those offered in PROD!!!)
Running stream id=10, ver=1.0.
10pm
Stream 10 ran with same results as stream 9 -- took about 2.25 hours. More SQL tuning recommendations from Oracle:
(1) select COUNT(1) from ProcessStatusCondition PSC join ProcessInstance PI on (PSC.Process = PI.Process) join Stream S on PI.Stream = S.Stream where RowNum = 1 and PSC.DependentProcess = :1 and PSC.ProcessingStatus <> PI.ProcessingStatus and (S.ParentStream = :2 or S.Stream = :3) and S.IsLatest = 1 and PI.IsLatest = 1
The explain plan for this one is now the same as what was offered in PROD. I've accepted the new plan.
(2) select NVL(MAX(ExecutionNumber)+1, 1) AS EXECUTIONNUMBER from Stream where Task = :1 and ParentStream = :2 and StreamId = :3
This one seems to be a problem mainly at the beginning of a stream, probably because that's when all the substream / processinstance creation is happening. This query and another (lost it) were supposed to benefit from the following index:
ProcessInstance (Stream, Process)
I added this index: ProcessInstance (Stream, Process, IsLatest), since the other query also used IsLatest in its WHERE clause.
(3) select processname from processinstance i join process p on i.process=p.process join task t on p.task=t.parenttask join createsubtaskcondition st on st.process=p.process and st.subtask=t.task where i.processinstance=:1 and t.taskname=:2 and :3 not in (select s.streamid from stream s where s.parentstream=i.stream and s.task=t.task and s.IsLatest=1)
Oracle recommends building two indexes to make this guy faster. I think the query could be rewritten using Exists instead of In. For now, I've created the recommended indexes -- they seem reasonable for many of our other queries, anyway.
Oracle's recommendation text: "Consider running the Access Advisor to improve the physical schema design or creating the recommended index."
GLAST_DP_TEST.STREAM("PARENTSTREAM","TASK","ISLATEST")
GLAST_DP_TEST.TASK("TASKNAME")
(4) Oracle also had a new plan for executing the main-page (task list) query. I accepted the new plan.
11:15pm
Running stream 11.
11:45pm
Deleted stream 11. Tony had a problem in the mail receiver.
11:53pm
Recreated stream 11.
4/25/2008:
Stream 11 finished in <1.5 hrs.
11:00am
Tony added a mail backlog meter to the server process.
Tony put parallel mail handling into the server.
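Roughly what a backlog meter plus parallel handling amounts to (a sketch with invented names and pool size; not Tony's actual server code): count the messages received but not yet handled, and farm the handling out to a pool of threads.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class MailBacklogSketch {
    private final AtomicInteger backlog = new AtomicInteger(0);
    private final ExecutorService handlers = Executors.newFixedThreadPool(4);

    // Called when a batch-status mail arrives.
    void submit(String message) {
        backlog.incrementAndGet();
        handlers.submit(() -> {
            try {
                handle(message);
            } finally {
                backlog.decrementAndGet();
            }
        });
    }

    // The value watched/plotted as the "mail backlog".
    int currentBacklog() {
        return backlog.get();
    }

    private void handle(String message) {
        // ... parse the status mail and update the pipeline ...
    }
}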
11:22am
Created stream 12.
2:20pm
Stream 12 ran to completion in roughly the same time as stream 11 (~1h40m). Now we seem to be seeing much more accurate statistics for job numbers in the usage plots. We see a very clear sawtooth pattern with a period of just over 5 minutes. Watching the log file and the real job-running and email-backlog numbers leads me to believe that the queue is getting flooded with scriptlets as the jobs all finish at roughly the same time, so no new jobs get started while the scriptlets are executed.
Tony modified the mailhandler so it sends the mail message to the thread rather than having each thread read from the spool -- the concurrent reading seemed to have caused some problems in stream 11.
Tony also added a failure rate to the jobcontrol bean, so I can see if failures and skipping work properly.
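On the mail-handler change above (handing the message to the thread instead of having each thread read from the spool), a rough sketch of the pattern (hypothetical names, not the real code):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MailHandlerSketch {
    private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();

    // Single reader thread: the only place the spool is read, so there are
    // no concurrent reads of the spool.
    void readerLoop() throws InterruptedException {
        while (true) {
            String message = readNextFromSpool();  // hypothetical spool read
            inbox.put(message);                    // hand the message to a worker
        }
    }

    // Worker threads: never touch the spool, just take messages off the queue.
    void workerLoop() throws InterruptedException {
        while (true) {
            String message = inbox.take();
            process(message);
        }
    }

    private String readNextFromSpool() { return ""; }
    private void process(String message) { /* update process instances, etc. */ }
}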
2:25pm
I created stream 13. This guy has a new sigma. Jobs used to be mean=100s, sigma=5s. This time they are mean=100s, sigma=20s. I intend to smooth their execution out in time so that the pipeline is less hammered by scriptlets. Hopefully this will increase throughput by allowing us to keep the job numbers up.
Lucky stream 13 also has a failure rate of 1/1000. This will allow me to test rollback, but since roughly 12 processes will fail randomly, the comparison to the last test will be harder to make. I will rely on the processing plot (which now seems accurate) to give me a feel for how the throughput has improved. I've had to do this because we're running out of time before the SCCS shutdown at 5pm. When central computing comes back up, I can run more thorough tests.
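For reference, the job model behind these parameters is just Gaussian run times plus a per-job failure probability; a sketch (hypothetical code, not the actual dummy submitter):

import java.util.Random;

class DummyJobModelSketch {
    private final Random rng = new Random();

    // Stream 13: mean=100s, sigma=20s (earlier streams used sigma=5s).
    // Widening sigma spreads job completions out in time, so scriptlets
    // don't all hit the queue in one burst.
    double sampleDurationSeconds(double meanSec, double sigmaSec) {
        return Math.max(0.0, meanSec + sigmaSec * rng.nextGaussian());
    }

    // Stream 13: failure rate of 1/1000 per job.
    boolean sampleFailure(double failureRate) {
        return rng.nextDouble() < failureRate;
    }
}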
4pm
Stream 13 had two failures right at the end. The dependents skipped correctly. (TODO: say more about how it ran.)
4:51pm
Stream 14 run with 10% failure rate, params=(2,2,2), 3+/-1s job length. Trying to see if this causes failures of the type seen in StreamStressFail streams 1,2.
4/28/2008:
9:30am
Stream 15 created with 0% failure rate, job-time=3+/-1s, params=(2,2,2). It ran to successful completion with no indication of email concurrency problems. Looks as if Tony's patch to add ProcessInstance locking to the mail handler is working.
Streams 16,17,18 created with same parameters as 15, but 10% failure rate. All seem to have completed correctly. Failures were followed by correct skipping of dependents.
10am
Stream 19 created with 10% failure rate, job-time=3+/-1s, params=(2,4,50).
10:15am
I terminated all non-final processes in Stream 14 which were throwing the status counts off in the plots.
10:30am
Stream 19 failed in the very first process. I rolled back the first process, which worked fine. There were then 49 failures at various points. I rolled these back using the command-line '--minimum' option and it worked great. After 3 successive minimum rollbacks, the stream ran to successful completion. Process rollback seems to be working. Should try rolling back a stream just to make sure that works as well.
10:55am
Stream 20 created with params=(4,10,50) and failure rate of 1/1000. Job time 3+/-1s. This guy will give us a chance to look at scriptlet performance.
1:20pm
Stream 20 had some problems with concurrent mail processing. It seems that the time format object Tony was using in BatchStatusHandler was not thread-safe. He moved the object from class level (static) to instance level. I will try rerunning a new stream.
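If that object was a java.text.SimpleDateFormat (an assumption; the log only says "time format object"), this is a well-known pitfall: SimpleDateFormat keeps mutable internal state and is not thread-safe. A before/after sketch (the pattern string and class layout are invented):

import java.text.SimpleDateFormat;

class BatchStatusHandlerSketch {
    // Before (broken under concurrency): one formatter shared by every handler thread.
    // private static final SimpleDateFormat TIME_FORMAT =
    //         new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    // After: each handler instance owns its own formatter (assuming one handler
    // instance per worker thread), so no two threads share a SimpleDateFormat.
    private final SimpleDateFormat timeFormat =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
}

A ThreadLocal<SimpleDateFormat> is another common fix for the same problem.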
Tony moved 'Data' tasks back to LSF and 'MC' to the dummy submitter in anticipation of running an L1 Stream. I have moved StreamStress's batch processes to the SLACMC jobsite.
Created Stream 21 with same parameters as Stream 20 in order to test whether Tony's fix to BatchStatusHandler worked. (It turns out I jumped the gun and created #21 before restarting the server to pick up the new changes. I stopped the server, deleted #21, then restarted the server and resubmitted #21.)
2:25pm
Stream 21 went to completion with one process failure (from the dummy submitter, not an error) which I rolled back. The problem with the BatchStatusHandler seems to be fixed.
I moved stream 14 from 'RUNNING' to 'TERMINATED'.
Bryson re-ran HalfPipe for stream 080428001. The original data files were encoded without an "MC Key". I will be running L1 in the TEST server as soon as a copy of the raw data has been made. (I moved /nfs/farm/g/glast/u52/rootData/dl/080428001 --> 080428001-badkey and copied Bryson's fixed raw-data-directory to 080428001.)
3:05pm
Created stream 80428001 of L1Proc in TEST with parameters:
DOWNLINK_ID=080428001,DOWNLINK_RAWDIR=/afs/slac/g/glast/ground/PipelineStaging/halfPipe/080428001
4/30/2008:
We're seeing problems with process instances left in waiting.
9:45pm
Stream 22 (2,4,50) had all of its process instances that were waiting on a process instance within a subtask (for both kinds of conditions) left in the waiting state even after the processes they depended upon completed.
10:40pm
Got rid of the 'continue' statements that were added yesterday. Replaced them with a log message to indicate which processes are involved and under what circumstances.
Stream 23 (2,2,2) and stream 24 (2,4,50) both got stuck waiting, but neither put any messages into the log indicating that the problem is associated with the code that was added to prevent (SKIPPED->SKIPPED) transitions.
11pm
Disregard 23 and 24: I installed the modified stored procedures to DEV instead of TEST.
I've patched TEST and created Stream 25 (2,4,50).
11:30pm
The problem seems to have been that I was using != to compare two strings rather than a negated .equals(). I've corrected the problem and launched Stream 26 (2,4,50).
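The pitfall in miniature (illustrative only, not the actual server code):

public class StringCompareDemo {
    public static void main(String[] args) {
        String a = "SUCCESS";
        String b = new String("SUCCESS");

        // != compares object identity, so two distinct String objects with the
        // same contents still look "different".
        System.out.println(a != b);        // true

        // equals() compares contents, which is what the status check needed.
        System.out.println(!a.equals(b));  // false
    }
}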
11:55pm
It's working again. I patched the DEV SPs as well. I moved the newest version, 1.1.1-SNAPSHOT, into TEST (but not DEV). The newest version adds ProcessInstance locking around setting process instance statuses in the server code for script processes, to avoid problems that occur if threads are pulled off the queue before the queuer is able to set the status of the associated process instance.
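Roughly the shape of that locking, as a sketch (hypothetical names and structure; this is not the actual 1.1.1-SNAPSHOT code):

import java.util.concurrent.ConcurrentHashMap;

class ScriptProcessQueuerSketch {
    // One lock object per process instance, so the queuer and a worker
    // serialize status updates for the same instance without blocking
    // updates to unrelated instances.
    private final ConcurrentHashMap<Long, Object> locks = new ConcurrentHashMap<>();

    private Object lockFor(long processInstance) {
        return locks.computeIfAbsent(processInstance, id -> new Object());
    }

    // Queuer side: hold the instance lock across enqueueing AND statusing,
    // so a worker that grabs the item immediately still can't status the
    // instance before the queuer has finished.
    void queueScriptProcess(long processInstance) {
        synchronized (lockFor(processInstance)) {
            enqueue(processInstance);
            setStatus(processInstance, "QUEUED");
        }
    }

    // Worker side: takes the same lock before statusing the instance.
    void runScriptProcess(long processInstance) {
        synchronized (lockFor(processInstance)) {
            setStatus(processInstance, "RUNNING");
        }
        // ... actually run the scriptlet ...
    }

    private void enqueue(long processInstance) { /* put on the work queue */ }
    private void setStatus(long processInstance, String status) { /* update ProcessInstance */ }
}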
Launched Stream 27(2,4,50) to test that Tony's newest 1.1.1-SNAPSHOT is working.