VM Plans and Choices

Snapshot of my current (June, 2012) thinking on how Fermi might make use of VMs to freeze on a particular OS, presumably rhel6.

Scope

A provisional list of activities to take place inside VMs:

Interactive code development, testing and debugging
SCons release manager
sys tests
MC production
Reprocessing

Depending on exactly when rhel6 becomes deprecated, the following might also have to run in VMs (but I hope not):

L1 processing
Online monitoring
ASP

Resources

Which activities require which resources? Here I consider only the more restricted set of activities. If online activities are added more resources would be involved as well.

Activity\Resource	RM db (MySQL)	Calib db (MySQL)	Moot db (MySQL)	Calibrations (archive)	Moot configs (archive)	CVS	RM builds	Batch system
Code devel	no	yes	yes	yes (or copy)	yes (or copy)	yes	yes	would be nice
RM	yes	yes (for running test programs)	yes	yes	yes	yes	yes	yes
Sys tests	no	yes	yes	yes	yes	no	yes	yes
MC prod	no	yes	yes	yes	yes	no	yes	yes
Reprocessing	no	yes	yes	yes	yes	no	yes	yes

Note "batch system" should be interpreted loosely. It might be the centrally-supported batch system (lsf or its descendant); it might be something specialized for use from VMs.

Where to draw the line

The BaBar design puts everything behind barbed wire, or at least a moat. This can seem like overkill, but it does simplify some things. We're aiming for a hybrid scheme in which we use standard centrally-supported resources as much as possible, but that means that every interaction which has to cross the line has to be examined very carefully. Among areas to consider

Where do activities run?

Assuming all the activities occur inside VMs there are at least three possibilities:

Activity is available from an "appliance": pre-configured VM. Must be used in a machine that can act as "host"; that is, has VirtualBox or similar software installed. For more about how this works, see Tom's page on a ScienceTools appliance.
Activity is available in a VM which is normally already up and running (e.g., with SLAC-maintained machine as host)
Activity is available in a transient VM: VM exists which has been configured to support the activity, but it may not be up and running. A "start VM" step is required before using it.
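For the transient case, the "start VM" step could be scripted roughly as follows. This is a minimal sketch, assuming VBoxManage is on the host's PATH. The function names are hypothetical. The `run` parameter exists only so the logic can be exercised without a real VirtualBox installation.

```python
import subprocess


def parse_vm_state(machinereadable_text):
    """Return the VMState value from 'showvminfo --machinereadable' output, or None."""
    for line in machinereadable_text.splitlines():
        key, sep, value = line.partition("=")
        # Match the key exactly so that VMStateChangeTime is not picked up.
        if sep and key.strip() == "VMState":
            return value.strip().strip('"')
    return None


def ensure_running(vm_name, run=subprocess.run):
    """Start vm_name headless unless it already reports 'running'.

    Returns True if a start command was issued, False otherwise.
    """
    info = run(["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
               capture_output=True, text=True, check=True)
    if parse_vm_state(info.stdout) == "running":
        return False
    run(["VBoxManage", "startvm", vm_name, "--type", "headless"], check=True)
    return True
```

Note that a VM reported as "running" is not necessarily ready to accept work, since the guest OS may still be booting.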

Where do resources reside?

There are two plausible places for MySQL and CVS servers

1. SLAC centrally-maintained server (where they are now)
2. Stable VM running within a SLAC centrally-maintained host.

Option 1 will most likely be preferable so that we won't have the burden of maintaining them and so that the databases may be easily read, e.g. by the server for SCons RM web pages. It seems likely but not certain that CVS and MySQL will continue to be supported by centrally-maintained machines through the end of Fermi's lifetime. If not, we'll have to go with option 2.

File collections including CVS archive, calibration archive, moot archive and RM builds which VM-sequestered processes can write to will need some special handling.

Interaction with batch

How will jobs to be run in VMs be specified? Will they be submitted to the host or directly to the guest? If the latter, the VM nodes must always be up. If the former, there must be a separate step to start up the guest and there has to be some synchronization to ensure the guest is ready to execute by the time it receives a request. (I have been unable to find a good way to do this with VirtualBox. The only technique seems to be an initial guess at how long the boot will take, then try to execute, perhaps with retries and a sleep in-between.)
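The guess-then-retry technique described above might look something like this. It is a sketch only. `probe` is a hypothetical callable that attempts some trivial command in the guest and returns True on success, and the default wait times are placeholders rather than measured boot times.

```python
import time


def wait_until_ready(probe, attempts=10, delay=15.0, initial_wait=30.0,
                     sleep=time.sleep):
    """Wait for a freshly started guest to respond.

    Sleeps for initial_wait seconds (the initial guess at boot time), then
    calls probe() up to `attempts` times with `delay` seconds in between.
    Returns the 1-based number of the successful attempt, or raises
    TimeoutError if every attempt fails.
    """
    sleep(initial_wait)
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return attempt
        except OSError:
            pass  # guest not reachable yet; treat as a failed probe
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError("guest did not respond after %d attempts" % attempts)
```

The `sleep` argument is injectable so the loop can be tested without actually waiting.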

Security

Points to be considered include

applications allowed to run in VMs (e.g., exclude web browsers, email programs, etc.?)
network access allowed from VMs
access to file systems from VMs (e.g., exclude SLAC home directories?)
logins. The VirtualBox API ExecuteProcess routine has required username and password arguments; I don't believe there is any other means of authentication. Accounts may have an empty password.

Since the redhat 6 production phase (including, among other things, security patches) is now projected to last through November, 2020, isolation of VMs for security is no longer an urgent concern. There might still be a need for it toward the end of Fermi offline activity, however, so the architecture chosen should allow for such isolation, even if it's not turned on initially.

To explore

Remote logins

With VirtualBox if you're sitting in front of the host you can boot up a VM interactively and log into it via its display just as if it were a physical machine. It is also possible to log in via programs like Remote Desktop (Windows) or the Linux equivalent, rdesktop. I got this to work some months ago but failed in more recent attempts.

Turned out I needed to re-install the Oracle extension pack which has server support for this. I'm not sure why but it's possible I upgraded VirtualBox without also upgrading the extension pack, a no-no.

In order to allow connection to more than one guest on the same server, guests within the same host should be configured to use distinct ports.
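One way to hand out distinct remote-display ports is sketched below, assuming the VBoxManage modifyvm options `--vrde` and `--vrdeport` provided by the extension pack. The VM names and the base port are made up for illustration. The function only builds the commands; it does not run them.

```python
def vrde_commands(vm_names, base_port=5000):
    """Build one VBoxManage command per guest, each with its own VRDE port.

    Ports are assigned sequentially starting at base_port (an arbitrary
    choice here; any unused ports on the host would do).
    """
    commands = []
    for offset, name in enumerate(vm_names):
        port = base_port + offset
        commands.append(["VBoxManage", "modifyvm", name,
                         "--vrde", "on", "--vrdeport", str(port)])
    return commands


if __name__ == "__main__":
    for cmd in vrde_commands(["devel-vm", "rm-vm"]):
        print(" ".join(cmd))
```

A client would then connect with something like `rdesktop hostnode:5001`. As far as I know, modifyvm has to be applied while the VM is powered off.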

MySQL, CVS access

Make sure VM can get to these SLAC resources.

Without any special configuration, processes on VMs can access resources such as mysql databases or cvs archives across the net just as if they were running on the host.
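A quick way to confirm this from inside a guest is a plain TCP connection test, sketched below. The host names in the example are placeholders, not the real SLAC servers. The ports are the usual defaults (3306 for MySQL, 2401 for CVS pserver) and may not match the actual setup.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical server names, for illustration only.
    for name, host, port in [("MySQL", "mysql.example.org", 3306),
                             ("CVS", "cvs.example.org", 2401)]:
        print(name, "reachable" if can_reach(host, port) else "NOT reachable")
```

This only shows that the port is open; it says nothing about whether authentication will succeed.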

ssh

I was able to ssh out of the VM just as if I were logged directly into the host.

To ssh in to the guest, the VM configuration has to be adjusted by adding a port forwarding entry. This is easy to do from the VirtualBox gui but the end result is not ideal. I can ssh in like this:

ssh -p alternate-port user@hostnode

but, depending on the ssh configuration for the node I'm coming from, I might get a complaint about host keys not matching. ssh expects the host key to be the same when the host is the same, regardless of port number. If the host key doesn't match what's in your known_hosts file it won't connect; you have to edit out the old entry in known_hosts first.

There is a trick - a little clunky, but it works. You need to tell ssh to use a dummy known_hosts and to just go ahead and add new keys without asking, like this:

ssh -o "UserKnownHostsFile /dev/null" -o "StrictHostKeyChecking no" -p alternate-port user@hostnode
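The port forwarding entry can also be added from the command line rather than the GUI. The sketch below builds both that command and the matching ssh command. It assumes the guest uses NAT on its first network adapter (hence `--natpf1`). The rule name, VM name and port number are arbitrary examples.

```python
def port_forward_command(vm_name, host_port, guest_port=22, rule_name="guestssh"):
    """Build a VBoxManage command forwarding host_port to guest_port over TCP.

    The rule format is name,protocol,host-ip,host-port,guest-ip,guest-port;
    the two IP fields are left empty.
    """
    rule = "%s,tcp,,%d,,%d" % (rule_name, host_port, guest_port)
    return ["VBoxManage", "modifyvm", vm_name, "--natpf1", rule]


def ssh_command(user, host, port, ignore_host_key=False):
    """Build the ssh command line, optionally with the dummy known_hosts trick."""
    cmd = ["ssh", "-p", str(port)]
    if ignore_host_key:
        cmd += ["-o", "UserKnownHostsFile=/dev/null",
                "-o", "StrictHostKeyChecking=no"]
    cmd.append("%s@%s" % (user, host))
    return cmd


if __name__ == "__main__":
    print(" ".join(port_forward_command("devel-vm", 2222)))
    print(" ".join(ssh_command("user", "hostnode", 2222, ignore_host_key=True)))
```

Bear in mind that skipping host key checking gives up protection against connecting to the wrong machine, so it is best limited to trusted networks.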

Appliances

My first experience making (exporting) and using (importing) an appliance went reasonably well. I used the VirtualBox GUI for both export (on rhel5 laptop) and import (old Windows XP desktop). Procedures were simple and clear. The import wasn't 100% smooth but that's no reflection on VirtualBox. The two problems I encountered were

1. shared folder definitions didn't carry over because the Windows host didn't have directories of the proper name - or even of the proper form. Either appliances shouldn't have shared folders or users should be prepared to adjust them after import.
2. memory assigned to VM (768 Mbytes) was too much for the Windows host. VirtualBox complains if the VM asks for more than 1/2 of the host's memory.

It's easy enough to change the characteristics of the appliance once it's been imported. To address 1, I removed all the shared folder definitions. For 2, I reduced the VM's memory to 512 MB, but the real solution is not to attempt to run the VM on such a resource-poor machine.
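I made these adjustments through the GUI, but they could presumably be scripted with VBoxManage as well. A rough sketch follows. The helper names, the VM name and the folder names are invented for illustration, and the "half of host memory" cap simply encodes the complaint threshold mentioned above.

```python
def capped_memory_mb(requested_mb, host_mb):
    """Return requested_mb, reduced if necessary to at most half of host_mb."""
    return min(requested_mb, host_mb // 2)


def post_import_commands(vm_name, requested_mb, host_mb, shared_folders=()):
    """Build VBoxManage commands to tidy up an appliance after import.

    Removes the named shared folder definitions and sets the VM memory
    to a value the host can accommodate.
    """
    commands = [["VBoxManage", "sharedfolder", "remove", vm_name, "--name", folder]
                for folder in shared_folders]
    memory = capped_memory_mb(requested_mb, host_mb)
    commands.append(["VBoxManage", "modifyvm", vm_name, "--memory", str(memory)])
    return commands


if __name__ == "__main__":
    # Example: a 768 MB appliance imported on a hypothetical 1024 MB host.
    for cmd in post_import_commands("imported-vm", 768, 1024, ["src", "data"]):
        print(" ".join(cmd))
```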

See more ruminations on appliances as well as a link to a "bare bones" appliance in the GlastRelease Virtual Machines child page.

VirtualBox and beyond

Is VirtualBox the right product for all our needs?

VirtualBox ease of use, features

VirtualBox supports all platforms of interest and appears to have the most comprehensive set of features, all accessible via the API. But the documentation is incomplete and confusing, especially for use directly from C++ (rather than COM). The command-line program VBoxManage is much easier to use and exposes essentially everything available from the API.

VirtualBox reliability

I've encountered some unpleasant behavior with VirtualBox. Once or twice I think it caused my laptop (host) to shut down when only the VM should have shut down. Another time a VM booted "headless" got bogged down after being given a couple of ExecuteProcess commands. Execution slowed to a crawl; even shutting it down took a very long time.

Alternatives

The most attractive general-purpose alternative is probably VMware. My impression is that it's a little behind VirtualBox in variety of platforms supported, especially newer OS versions, and the API might not be as complete, but in both of these areas VMware would probably be adequate for our needs. It's certainly worth investigating if VirtualBox reliability is questionable.

There was a meeting in April about future lsf versions, including a presentation about Platform LSF8. One of its features, known as "Platform Adaptive Cluster" (discussion starts on slide 78), involves use of VMs. Conceivably this could handle our batch VM needs; at least we wouldn't have to worry about integration with lsf! But we would still need some other form of VM for interactive code development and debugging.

Timeline

Should be implemented and checked out at least a year before end of Redhat 6 "Production 2" phase, since, at that point, the OS will not be updated to accommodate new hardware. Current estimate for end of Production 2 is Q2 of 2017, so target is about 4 years from now. That seems like ample time to get the work done, especially since various pieces can be done in parallel - as long as initial decisions concerning tools and architecture are made in a timely fashion (and correctly!).
