VM Plans and Choices

Snapshot of my current (June, 2012) thinking on how Fermi might make use of VMs to freeze on a particular OS, presumably rhel6.

Scope

A provisional list of activities to take place inside VMs:

Interactive code development, testing and debugging
SCons release manager
Sys tests
MC production
Reprocessing

Depending on exactly when rhel6 becomes deprecated, the following might also have to run in VMs (but I hope not):

L1 processing
Online monitoring
ASP

Resources

Which activities require which resources? Here I consider only the more restricted set of activities. If online activities are added more resources would be involved as well.

Activity\Resource	RM db (MySQL)	Calib db (MySQL)	Moot db (MySQL)	Calibrations (archive)	Moot configs (archive)	CVS	RM builds	Batch system
Code devel	no	yes	yes	yes (or copy)	yes (or copy)	yes	yes	would be nice
RM	yes	yes (for running test programs)	yes	yes	yes	yes	yes	yes
Sys tests	no	yes	yes	yes	yes	no	yes	yes
MC prod	no	yes	yes	yes	yes	no	yes	yes
Reprocessing	no	yes	yes	yes	yes	no	yes	yes

Note that "batch system" should be interpreted loosely. It might be the centrally-supported batch system (lsf or its descendant), or it might be something specialized for use from VMs.
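
For scripting purposes the table above could be kept in machine-readable form. The following is only a sketch: the entries are copied from the table, while the names REQUIREMENTS and activities_needing are made up for illustration, and "yes (or copy)" and similar qualified entries are treated as plain "yes".

```python
# Resource columns, in the same order as in the table above.
RESOURCES = [
    "RM db", "Calib db", "Moot db", "Calibrations",
    "Moot configs", "CVS", "RM builds", "Batch system",
]

# One row per activity; entries are transcribed from the table.
REQUIREMENTS = {
    "Code devel": ["no", "yes", "yes", "yes (or copy)", "yes (or copy)",
                   "yes", "yes", "would be nice"],
    "RM": ["yes", "yes (for running test programs)", "yes", "yes", "yes",
           "yes", "yes", "yes"],
    "Sys tests": ["no", "yes", "yes", "yes", "yes", "no", "yes", "yes"],
    "MC prod": ["no", "yes", "yes", "yes", "yes", "no", "yes", "yes"],
    "Reprocessing": ["no", "yes", "yes", "yes", "yes", "no", "yes", "yes"],
}


def activities_needing(resource):
    """Return the activities whose table entry for resource starts with 'yes'."""
    column = RESOURCES.index(resource)
    return [activity for activity, row in REQUIREMENTS.items()
            if row[column].startswith("yes")]
```

For example, activities_needing("CVS") lists the activities for which CVS access has to cross the line between VM and centrally-supported resources.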

Where to draw the line

The BaBar design puts everything behind barbed wire, or at least a moat. This can seem like overkill, but it does simplify some things. We're aiming for a hybrid scheme in which we use standard centrally-supported resources as much as possible, but that means that every interaction which has to cross the line has to be examined very carefully. Areas to consider include the following:

Where do activities run?

Assuming all the activities occur inside VMs there are at least three possibilities:

Activity is available from an "appliance": a pre-configured VM. It must be used on a machine that can act as host, that is, one which has VirtualBox or similar software installed. For more about how this works, see Tom's page on a ScienceTools appliance.
Activity is available in a VM which is normally already up and running (e.g., with a SLAC-maintained machine as host).
Activity is available in a transient VM: a VM exists which has been configured to support the activity, but it may not be up and running. A "start VM" step is required before using it.
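
For the transient case, the "start VM" step could be scripted around VBoxManage. The following is a minimal sketch, not a tested procedure: it assumes VBoxManage is on the host's PATH, and the function names are made up for illustration. It relies on the startvm and "list runningvms" subcommands, the latter printing one line per running VM of the form "name" {uuid}.

```python
import subprocess

VBOXMANAGE = "VBoxManage"  # assumption: found on the host's PATH


def start_command(vm_name):
    """Command line that boots vm_name without opening a display window."""
    return [VBOXMANAGE, "startvm", vm_name, "--type", "headless"]


def running_vms(listing):
    """Extract VM names from the text printed by 'VBoxManage list runningvms'."""
    names = []
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith('"') and line.count('"') >= 2:
            # The name is everything between the first and last double quote.
            names.append(line[1:line.rindex('"')])
    return names


def ensure_started(vm_name):
    """Start vm_name unless VirtualBox already reports it as running."""
    listing = subprocess.check_output([VBOXMANAGE, "list", "runningvms"]).decode()
    if vm_name not in running_vms(listing):
        subprocess.check_call(start_command(vm_name))
```

Note that startvm returning successfully only means the boot has begun, not that the guest is ready to do work.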

Where do resources reside?

There are two plausible places for MySQL and CVS servers:

1. SLAC centrally-maintained server (where they are now)
2. Stable VM running within a SLAC centrally-maintained host

Option 1 will most likely be preferable, so that we won't have the burden of maintaining the servers and so that the databases may be easily read, e.g. by the server for SCons RM web pages. It seems likely but not certain that CVS and MySQL will continue to be supported on centrally-maintained machines through the end of Fermi's lifetime. If not, we'll have to go with option 2.

File collections which VM-sequestered processes can write to, including the CVS archive, calibration archive, moot archive and RM builds, will need some special handling.

Interaction with batch

How will jobs to be run in VMs be specified? Will they be submitted to the host or directly to the guest? If the latter, the VM nodes must always be up. If the former, there must be a separate step to start up the guest, and there has to be some synchronization to ensure the guest is ready to execute by the time it receives a request. (I have been unable to find a good way to do this with VirtualBox. The only technique seems to be to make an initial guess at how long the boot will take, then try to execute, perhaps with retries and a sleep in between.)
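
The guess-then-retry technique described above might look like the following sketch. Everything here is illustrative: probe stands for whatever cheap command is tried in the guest (it should return True once the guest responds), and the default wait times are placeholders, not measurements.

```python
import time


def wait_until_ready(probe, attempts=10, delay=5.0, initial_wait=30.0,
                     sleep=time.sleep):
    """Wait for a freshly started guest to become usable.

    Sleeps for initial_wait seconds (the guess at boot time), then calls
    probe() up to attempts times, sleeping delay seconds between tries.
    Returns True as soon as probe() succeeds, False if every try fails.
    """
    sleep(initial_wait)
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A host-side job wrapper would call this after the "start VM" step and submit the real work only if it returns True. The sleep argument exists only so the waiting can be replaced in tests.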

Security

Points to be considered include:

applications allowed to run in VMs (e.g., exclude web browsers, email programs, etc.?)
network access allowed from VMs
access to file systems from VMs (e.g., exclude SLAC home directories?)
logins. The VirtualBox API ExecuteProcess routine has required username and password arguments; I don't believe there is any other means of authentication. Accounts may have an empty password.

To explore

Remote logins

With VirtualBox, if you're sitting in front of the host you can boot up a VM interactively and log into it via its display just as if it were a physical machine. It is also possible to log in via programs like Remote Desktop (Windows) or the Linux equivalent, rdesktop. I got this to work some months ago but failed in more recent attempts. We need this to work!

MySQL, CVS access

Make sure VM can get to these SLAC resources.
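
A first check, run from inside the VM, could simply test whether a TCP connection to each server can be opened. In this sketch the host names are placeholders, not real SLAC machines; 3306 and 2401 are the default MySQL and CVS pserver ports, and the CVS entry would need changing if the repository is reached some other way (e.g. over ssh).

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
    conn.close()
    return True


# Placeholder host names; replace with the actual servers.
SERVICES = {
    "MySQL": ("mysql.example.org", 3306),
    "CVS pserver": ("cvs.example.org", 2401),
}


def report(services=SERVICES):
    """Map each service name to whether it is reachable from this machine."""
    return {name: can_connect(host, port)
            for name, (host, port) in services.items()}
```

A successful connection shows only that the network path is open; authentication and actual queries still have to be tried separately.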

VirtualBox and beyond

Is VirtualBox the right product for all our needs?

VirtualBox ease of use, features

VirtualBox supports all platforms of interest and appears to have the most comprehensive set of features, all accessible via the API. But the documentation is incomplete and confusing, especially for use directly from C++ (rather than COM). The command-line program VBoxManage is much easier to use and exposes essentially everything available from the API.

VirtualBox reliability

I've encountered some unpleasant behavior with VirtualBox. Once or twice I think it caused my laptop (host) to shut down when only the VM should have shut down. Another time a VM booted "headless" got bogged down after being given a couple of ExecuteProcess commands. Execution slowed to a crawl; even shutting it down took a very long time.

Alternatives

The most attractive general-purpose alternative is probably VMware. My impression is that it's a little behind VirtualBox in variety of platforms supported, especially newer OS versions, and the API might not be as complete, but in both of these areas VMware would probably be adequate for our needs. It's certainly worth investigating if VirtualBox reliability is questionable.

There was a meeting in April about future lsf versions, including a presentation about Platform LSF8. One of its features, known as "Platform Adaptive Cluster" (discussion starts on slide 78), involves use of VMs. Conceivably this could handle our batch VM needs; at least we wouldn't have to worry about integration with lsf! But we would still need some other form of VM for interactive code development and debugging.

Timeline

The scheme should be implemented and checked out at least a year before the end of the Red Hat 6 "Production 2" phase since, at that point, the OS will no longer be updated to accommodate new hardware. The current estimate for the end of Production 2 is Q2 of 2017, so the target is about 4 years from now. That seems like ample time to get the work done, especially since various pieces can be done in parallel, as long as initial decisions concerning tools and architecture are made in a timely fashion (and correctly!).
