Introduction

We want to improve the robustness and reliability of the batch compute environment by applying stricter resource controls. Running jobs in a "sandbox" protects them from each other and prevents any single job from consuming all of the resources on a machine. LSF version 9.1.2 uses Linux Control Groups (cgroups) to limit the CPU cores and memory that a job can use. These cgroup resource restrictions are not currently in our production LSF configuration. We want to understand the potential impact on users and get feedback from stakeholders. I have outlined some examples below using our test cluster.

CPU core affinity example

Consider the following simple script, mploadtest.csh, which explicitly forks several CPU-bound tasks. In reality, many users submit "single-slot" jobs that behave in a similar manner or that call API functions which spawn multiple child threads or processes.

-----start of mploadtest.csh -----------
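#!/bin/csh
# Start three CPU-bound pipelines in the background; each compresses 80 MB of random data.
dd if=/dev/urandom bs=1M count=80 | bzip2 -9 > /dev/null &
dd if=/dev/urandom bs=1M count=80 | bzip2 -9 > /dev/null &
dd if=/dev/urandom bs=1M count=80 | bzip2 -9 > /dev/null &
# Run a fourth, larger pipeline in the foreground so the script stays alive while the others run.
dd if=/dev/urandom bs=1M count=100 | bzip2 -9 > /dev/null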
-----end of mploadtest.csh -----------

Run the script as a single-slot job on a specific idle host, for example: "bsub -q mpitest -m bullet0019 ./mploadtest.csh". Open a terminal session on the chosen host and observe the load across the CPU cores. You can do this by running "top" in interactive mode and pressing "1" for the per-core load view. You'll notice that the child processes associated with your running job are distributed across several cores, even though the job is "single-slot".
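If you prefer a one-shot listing to the interactive top view, the sketch below prints each of your processes together with the core it last ran on. It assumes the procps version of ps found on most Linux hosts, whose "psr" output column reports the processor a process was last scheduled on. The function name list_cores is only illustrative.

```shell
#!/bin/sh
# List a user's processes with the core each one last ran on (PSR column).
# Assumes procps "ps" on Linux; list_cores is a hypothetical helper name.
list_cores() {
    # $1 is a user name; defaults to the user running this script.
    ps -u "${1:-$(id -un)}" -o pid,psr,pcpu,comm
}

list_cores
```

Run it a few times on the execution host while the job is active. Without affinity, the dd and bzip2 processes would be expected to show several different PSR values.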

Wait for the job to finish, then resubmit it to the same host, this time requesting CPU affinity: "bsub -q mpitest -m bullet0019 -R 'affinity[core:membind=localprefer]' ./mploadtest.csh". Observe the job again using the per-core load view in top. This time you should see that all of the load is confined to a single core. The number of assigned cores matches the number of job slots, so submitting the job with "-n 3" results in the job using 3 CPU cores.
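A job can also report its own binding. The sketch below reads the Cpus_allowed_list field that the Linux kernel exposes in /proc/PID/status. It assumes LSF enforces the binding through the kernel's CPU affinity mask or a cpuset, in which case the field should list only the assigned cores. The function name allowed_cpus is only illustrative, and no LSF commands are involved.

```shell
#!/bin/sh
# Print the CPU cores a process is allowed to run on (Linux only).
# allowed_cpus is a hypothetical helper name, not an LSF command.
allowed_cpus() {
    # $1 is a PID; defaults to the current shell.
    pid="${1:-$$}"
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$pid/status"
}

echo "allowed CPUs: $(allowed_cpus)"
```

Placing the echo line at the top of a test job script would record the allowed cores in the job output, which makes it easy to compare a plain submission with an affinity submission.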
