Introduction
PPA has started buying hardware for the common good across the directorate. This effort was initiated in 2012, and the first purchase, 185 "Bullet" nodes, followed in early 2013. These are InfiniBand-connected, with Lustre storage. Historically the cluster was provisioned largely for BABAR, with other experiments riding its coattails. Currently there are three projects of comparable batch allocation size: BABAR, ATLAS and Fermi. BABAR stopped taking data in 2009 and it is presumed that its usage will tail off; Fermi is in routine operations with modest near-real-time needs and a 1.5-2 year program of intensive work around its "Pass8" reconstruction revamp; ATLAS operates a Tier2 center at SLAC, which can be viewed as a contractual agreement to provide a certain level of cycles continuously. It is expected that at some point LSST will start increasing its needs, but at this time - 8 years from first light - those needs are still unspecified.
The modeling has 3 components:
- inventory of existing hardware
- model for retirement vs time
- model for project needs vs time
Purchase Record of Existing Hardware
Purchase Year | Node type | Node Count | Cores per node | CPU factor |
---|---|---|---|---|
2006 | yili | 156 | 4 | 8.46 |
2007 | bali | 252 | 4 | 10. |
2007 | boer | 135 | 4 | 10. |
2008 | fell | 164+179 | 8 | 11. |
2009 | hequ | 192 | 8 | 14.6 |
2009 | orange | 96 | 8 | 10. |
2010 | dole | 38 | 12 | 15.6 |
2011 | kiso | 68 | 24 | 12.2 |
2013 | bullet | 185 | 16 | 14. |
Of these, ATLAS owns 78 boers, 40 fells, 40 hequs, 38 doles and 68 kisos. Also, note that as of 2013-07-15, the Black Boxes were retired, taking with them all the balis, all the boers and all but 25 of the yilis.
Snapshot of Inventory for Modeling - ATLAS hardware removed
Purchase Year | Node type | Node Count | Cores per node | CPU factor |
---|---|---|---|---|
2006 | yili | 25 | 4 | 8.46 |
2008 | fell | 277 | 8 | 11. |
2009 | hequ | 192 | 8 | 14.6 |
2009 | orange | 96 | 8 | 10. |
2013 | bullet | 185 | 16 | 14. |
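As an illustration of how the snapshot can feed the model, the following minimal Python sketch totals the cores and a CPU-factor-weighted capacity for the table above. It assumes that the CPU factor is a per-core weight, so that a batch's capacity is node count × cores per node × CPU factor; the page does not define the factor, and the names `NodeBatch`, `SNAPSHOT`, `total_cores` and `total_capacity` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeBatch:
    """One row of the inventory snapshot table."""
    year: int
    node_type: str
    nodes: int
    cores_per_node: int
    cpu_factor: float

    @property
    def cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def capacity(self) -> float:
        # Assumption: CPU factor is a per-core performance weight.
        return self.cores * self.cpu_factor


# Values copied from the snapshot table (ATLAS hardware removed).
SNAPSHOT = [
    NodeBatch(2006, "yili", 25, 4, 8.46),
    NodeBatch(2008, "fell", 277, 8, 11.0),
    NodeBatch(2009, "hequ", 192, 8, 14.6),
    NodeBatch(2009, "orange", 96, 8, 10.0),
    NodeBatch(2013, "bullet", 185, 16, 14.0),
]

total_cores = sum(batch.cores for batch in SNAPSHOT)
total_capacity = sum(batch.capacity for batch in SNAPSHOT)

for batch in SNAPSHOT:
    print(f"{batch.year} {batch.node_type:<7}{batch.cores:>6} cores"
          f"{batch.capacity:>11.1f} weighted")
print(f"total   {total_cores:>6} cores{total_capacity:>11.1f} weighted")
```

Under that assumption the snapshot comes to 7580 cores, and the 2013 bullet purchase alone contributes a little over 40% of the weighted capacity.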
Retirement Models
Two models have been considered: a strict age cut (e.g., all machines older than 5 years are retired) and a do-not-resuscitate model ("DNR": machines out of Maintech support are left to die). The age cut presumably allows better planning of the physical layout of the data center, as the DNR model would leave holes by happenstance. On the other hand, the DNR model leaves useful hardware in place with minimal effort, but it does assume that floor space and power are not factors in the cost.
In practice, we may adopt a hybrid of the two, especially since a strict age cutoff would cause sudden drops in capacity, given our acquisition history.
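The two retirement models can be compared with a short Python sketch applied to the snapshot inventory. Several things here are assumptions rather than part of the plan: capacity is taken as node count × cores per node × CPU factor, age is counted in whole calendar years, and the DNR parameters `support_years` and `annual_loss` (the fraction of unsupported nodes that die each year) are purely illustrative placeholders, since no failure rate or support term is given. The function names are hypothetical.

```python
# (purchase year, node type, nodes, cores per node, CPU factor),
# copied from the snapshot table (ATLAS hardware removed).
SNAPSHOT = [
    (2006, "yili", 25, 4, 8.46),
    (2008, "fell", 277, 8, 11.0),
    (2009, "hequ", 192, 8, 14.6),
    (2009, "orange", 96, 8, 10.0),
    (2013, "bullet", 185, 16, 14.0),
]
SNAPSHOT_YEAR = 2013  # node counts above are as of this year


def age_cut_capacity(year, max_age=5):
    """Capacity in `year` if batches older than `max_age` years are retired."""
    return sum(
        nodes * cores * factor
        for bought, _, nodes, cores, factor in SNAPSHOT
        if bought <= year and year - bought <= max_age
    )


def dnr_capacity(year, support_years=5, annual_loss=0.2):
    """Capacity in `year` if unsupported nodes are left to die.

    Assumption: once a batch is out of support, a fixed fraction
    `annual_loss` of its surviving nodes is lost each year. Losses are
    counted only from SNAPSHOT_YEAR on, because the table already
    reflects earlier attrition.
    """
    total = 0.0
    for bought, _, nodes, cores, factor in SNAPSHOT:
        if bought > year:
            continue
        decay_start = max(bought + support_years, SNAPSHOT_YEAR)
        years_unsupported = max(0, year - decay_start)
        surviving = nodes * (1.0 - annual_loss) ** years_unsupported
        total += surviving * cores * factor
    return total


for year in range(2013, 2020):
    print(f"{year}  age cut: {age_cut_capacity(year):>9.1f}"
          f"  DNR: {dnr_capacity(year):>9.1f}")
```

With a 5-year cut, the weighted capacity in this sketch falls from about 95,900 in 2013 to about 71,500 in 2014 and 41,400 in 2015, as the fell and then the hequ and orange batches age out together. That is the sudden-drop behavior a hybrid approach would smooth; the DNR curve declines gradually instead, at a rate set entirely by the assumed loss fraction.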