...

Dan:
Several of our analysis pipelines currently use the batch system: the burst advocate analysis, the gravitational wave followup, and FAVA. The gravitational wave analysis typically launches thousands of jobs to cover a large portion of the sky, so I think it's probably hopeless to keep it running during the outage. FAVA runs on weekly timescales, so we can probably catch up on that analysis safely once the batch farm is back to full strength. The burst advocate analysis is launched a little more than once a day: over the past week we had 11 triggers in 7 days. Each trigger launches 6 jobs, and each job goes to the medium queue using rhel6.

I can deactivate the gravitational wave followup analysis and FAVA leading up to the outage. Let me know if you think we'd be able to keep the burst advocate analysis running, and I'll take the appropriate actions.
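As a rough aid for judging whether the burst advocate analysis can stay up, the figures quoted above can be turned into a per-day load estimate. This is only a sketch built on last week's numbers from the message (11 triggers, 7 days, 6 jobs per trigger); the function name is illustrative, and the trigger rate will vary from week to week.

```python
# Figures taken from the message above; a single week is a small sample.
TRIGGERS_LAST_WEEK = 11
DAYS = 7
JOBS_PER_TRIGGER = 6


def burst_advocate_load(triggers, days, jobs_per_trigger):
    """Return (triggers per day, jobs per day, total jobs in the period)."""
    total_jobs = triggers * jobs_per_trigger
    return triggers / days, total_jobs / days, total_jobs


triggers_per_day, jobs_per_day, jobs_per_week = burst_advocate_load(
    TRIGGERS_LAST_WEEK, DAYS, JOBS_PER_TRIGGER
)
print(
    f"{triggers_per_day:.2f} triggers/day, "
    f"{jobs_per_day:.1f} jobs/day, {jobs_per_week} jobs/week"
)
# prints: 1.57 triggers/day, 9.4 jobs/day, 66 jobs/week
```

On these numbers the burst advocate analysis needs roughly 9 to 10 medium-queue jobs per day, compared with the thousands launched by the gravitational wave followup.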

Brian's proposal to move all VMs to HA:

I think we can move all fermilnx VMs to HA without oversubscribing memory or disk. Can we verify this?
* I think each fermilnx VM, except for fermilnx01 and fermilnx02, has 384 GB memory.
* I think we have two VMware hypervisors in HA.
I'd suggest distributing the VMs such that:
* fermilnx01 is on one hypervisor
* fermilnx02 is on another hypervisor (I think this is currently the case)
* All other fermilnx-v* VMs are distributed between the other two hypervisors (live migration if possible)
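One way to do the verification asked for above is to sum the memory of the VMs assigned to each hypervisor and compare it with that hypervisor's capacity. The sketch below is hypothetical throughout: the hypervisor names, their capacities, the VM names, and the memory of fermilnx01 and fermilnx02 are placeholders rather than real values, and only the 384 GB figure comes from the proposal (where it is itself stated as a guess). The real inventory would need to be substituted before drawing any conclusion.

```python
def check_memory(vm_memory_gb, host_capacity_gb, placement):
    """Return {host: (used_gb, capacity_gb, fits)} for a proposed placement.

    placement maps each VM name to the hypervisor it would run on.
    """
    used = {host: 0 for host in host_capacity_gb}
    for vm, host in placement.items():
        used[host] += vm_memory_gb[vm]
    return {
        host: (used[host], capacity, used[host] <= capacity)
        for host, capacity in host_capacity_gb.items()
    }


# Placeholder inventory: every name and number here is an assumption,
# except 384 GB per fermilnx-v* VM, which is the proposal's own estimate.
host_capacity_gb = {"hv-a": 1024, "hv-b": 1024}
vm_memory_gb = {
    "fermilnx01": 64,
    "fermilnx02": 64,
    "fermilnx-v01": 384,
    "fermilnx-v02": 384,
    "fermilnx-v03": 384,
    "fermilnx-v04": 384,
    "fermilnx-v05": 384,
}
placement = {
    "fermilnx01": "hv-a",
    "fermilnx02": "hv-b",
    "fermilnx-v01": "hv-a",
    "fermilnx-v02": "hv-a",
    "fermilnx-v03": "hv-b",
    "fermilnx-v04": "hv-b",
    "fermilnx-v05": "hv-b",
}

report = check_memory(vm_memory_gb, host_capacity_gb, placement)
for host, (used_gb, capacity_gb, fits) in report.items():
    status = "ok" if fits else "OVERSUBSCRIBED"
    print(f"{host}: {used_gb}/{capacity_gb} GB {status}")
```

With these placeholder numbers one hypervisor fits and the other does not, which is the kind of imbalance the check is meant to expose. The same function can be reused for disk by passing disk sizes and capacities instead of memory.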

 

Gotchas from the Dec 2017 outage

...