
...

We had planned to intersperse storage purchases with compute node purchases. The new cluster architecture relies on Lustre as a shared file system and also to provide scratch space for batch jobs. Such an upgrade was anticipated, and the existing 170 TB of space can be doubled by adding trays to the existing servers for about $60k.

Purchase Options


This proposal is to expand the bullet cluster with combined funds from PPA, ATLAS, and Theory.  This would double our existing parallel file system size (173->346TB) and add either 1616 or 1904 cores depending on which option we choose.  The first option is to provision InfiniBand (IB) in all nodes and add IB switches to allow additional future expansion of the IB network.  Because of the IB network topology, allowing future expansion implies a jump in the number of core switches from 4 to 8.  The second option would split the cluster into IB and non-IB parts, with the ATLAS nodes being non-IB.  Note that the pricing below is based on several different quotes that have been refreshed; it is still to be verified but very close to actual.  The details are:

...

Option 1: Expand to 18 fully populated chassis, all-IB, with future expansion capability (revised for increased IB cost, +6k/chassis)

...

  • 6 full chassis @97.227k => 583.4k (quote 661664648)
  • 7 blades w/IB added to existing empty slots @5.018k => 35.2k (quote 661664256)
  • 4 IB switches with cables => 50.2k (includes active fiber IB cables for the Lustre switch) (quote 662008993)
  • 2 60x2TB disk trays with controllers @30.3k => 60.6k (quote 661663331)

Total is $729.3k for 1616 cores and the storage expansion.
Gross bullet cluster core count would then be 2960+1616=4576 (all IB).
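As a sanity check on the Option 1 arithmetic, the line items can be totaled directly (prices in $k, taken from the quotes above; the small rounding difference against the $729.3k figure reflects the approximate per-item totals):

```python
# Sanity check of the Option 1 line items (prices in $k, from the quotes above).
chassis  = 6 * 97.227   # 6 full IB chassis
blades   = 35.2         # 7 blades w/IB for existing empty slots
switches = 50.2         # 4 IB switches with cables
trays    = 2 * 30.3     # 2 60x2TB disk trays with controllers

total = chassis + blades + switches + trays
print(f"Option 1 total: ${total:.1f}k")    # ~ $729.4k, vs the quoted $729.3k

# Core count: existing 2960 cores plus the 1616 added in this option.
print(f"Gross core count: {2960 + 1616}")  # 4576, all IB
```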

costs are:

  • ATLAS: 60 blades @4.9673k => 298.04k          (960c) (based on quote 659024769 for a non-IB full chassis)
  • Theory: 97.227k + 5*5.018k => 122.32k         (336c)
  • PPA: 310.83k - 2.7k => 308.93k                (320c)

...


Notes:

  • PPA cost/core is not meaningful because it includes the storage expansion and subsidizes the IB infrastructure for the ATLAS blades.

The benefit of this option is that we end up with a uniform, all-IB cluster.

...

Option 2: Expand to 15 full IB chassis and 4 non-IB chassis

...

  • 4 full non-IB chassis @79k => 316k
  • 3 full IB chassis @97k => 291k
  • 7 blades w/IB @4.817k => 33.72k
  • 2 60x2TB disk trays with controllers @34k => 68k

Total is $709k for 1904 cores and storage.
Gross bullet IB cores is then 3840.
Gross bullet non-IB cores is 1024.
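The Option 2 numbers can be checked the same way. The per-chassis and per-blade core counts used here (256 and 16) are inferred from the totals quoted above rather than stated explicitly:

```python
# Sanity check of the Option 2 line items (prices in $k).
non_ib_chassis = 4 * 79.0    # 4 full non-IB chassis (ATLAS)
ib_chassis     = 3 * 97.0    # 3 full IB chassis
blades         = 7 * 4.817   # 7 blades w/IB
trays          = 2 * 34.0    # 2 60x2TB disk trays with controllers

total = non_ib_chassis + ib_chassis + blades + trays
print(f"Option 2 total: ${total:.0f}k")          # ~ $709k

# Assumed: 256 cores per full chassis, 16 per blade (consistent with the
# quoted counts).  New cores: 4 non-IB chassis + 3 IB chassis + 7 blades.
new_cores = (4 + 3) * 256 + 7 * 16
print(f"New cores: {new_cores}")                 # 1904
print(f"Gross IB cores: {2960 + 3*256 + 7*16}")  # 3840
print(f"Gross non-IB cores: {4 * 256}")          # 1024
```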

costs are:

  • ATLAS: 4 chassis @79k => 316k                 (1024c)
  • Theory: 97k + 7x4.817k => 130.72k             (368c)
  • PPA: 262k                                     (512c)

...

Notes

...

  • Revised on 8/21 to account for the IB price change since the original nodes were purchased.  The full IB chassis price changed from 91k->97k when IB changed from QDR to FDR (an increase in performance).
  • Need to verify Theory (Hoeche) budget (is 131k too large?)
  • Need to verify Atlas budget (option 2 is over 300k)
  • Revised 9/3 with latest quotes and to reflect choice of option one and the actual budget estimates.
Add-on to either option:

We could get new GPU servers (KIPAC's are old!) that are equivalent to bullet blades with ~5000 GPU cores for ~10k each.  So we could top off to 300k if we got 3 of these.  We (PPA) do need to replace our existing GPU "system" that is hosted by KIPAC.  A good case for adding some of these is presented here by Debbie Bard.

Pros and Cons of these options should be discussed.