Introduction

PPA has started buying hardware for common use across the directorate. The effort was initiated in 2012, with a first purchase of 185 "Bullet" nodes in early 2013. These are InfiniBand-connected, with Lustre storage. Historically the cluster was provisioned largely for BABAR, with other experiments riding its coattails. Currently there are three projects of comparable batch allocation size: BABAR, ATLAS and Fermi. BABAR stopped taking data in 2009 and its usage is presumed to tail off; Fermi is in routine operations, with modest near-real-time needs and a 1.5-2 year program of intensive work around its "Pass8" reconstruction revamp; ATLAS operates a Tier2 center at SLAC, which can be viewed as a contractual agreement to provide a certain level of cycles continuously. LSST is expected to start increasing its needs at some point, but at this time, 8 years from first light, those needs are still unspecified.

The modeling has three components:

  • inventory of existing hardware
  • model for retirement vs time
  • model for project needs vs time

A Python script has been developed to do the modeling. We use the "CPU factor" as the computing unit, to account for the differing performance of the various node types in the farm.
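As a minimal sketch of how the CPU factor can serve as a computing unit, the snippet below weights core counts by CPU factor. It assumes that the CPU factor applies per core, and that the "kiso-units" in the tables further down are SLAC-units divided by the kiso CPU factor (12.2); both assumptions are inferred from the tabulated numbers and are not stated by the modeling script itself. The function names are illustrative only.

```python
# Assumption: CPU factor is a per-core weight, so capacity scales as
# nodes x cores per node x CPU factor.
KISO_CPU_FACTOR = 12.2  # CPU factor listed for the kiso node type


def slac_units(node_count, cores_per_node, cpu_factor):
    """Capacity of one node type in CPU-factor-weighted cores."""
    return node_count * cores_per_node * cpu_factor


def kiso_units(slac):
    """Same capacity expressed in kiso-core equivalents (assumed conversion)."""
    return slac / KISO_CPU_FACTOR


if __name__ == "__main__":
    # The 2013 bullet purchase: 185 nodes, 16 cores each, CPU factor 14.
    bullet = slac_units(185, 16, 14.0)
    print(f"bullet: {bullet:.0f} SLAC-units, {kiso_units(bullet):.0f} kiso-units")
```

Summing `slac_units` over all node types in an inventory gives a single capacity figure that can be compared year to year.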

Purchase Record of Existing Hardware

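| Purchase Year | Node type | Node Count | Cores per node | CPU factor |
|---|---|---|---|---|
| 2006 | yili | 156 | 4 | 8.46 |
| 2007 | bali | 252 | 4 | 10.0 |
| 2007 | boer | 135 | 4 | 10.0 |
| 2008 | fell | 164+179 | 8 | 11.0 |
| 2009 | hequ | 192 | 8 | 14.6 |
| 2009 | orange | 96 | 8 | 10.0 |
| 2010 | dole | 38 | 12 | 15.6 |
| 2011 | kiso | 68 | 24 | 12.2 |
| 2013 | bullet | 185 | 16 | 14.0 |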

Of these, ATLAS owns 78 boers, 40 fells, 40 hequs, 38 doles and 68 kisos. Also note that as of 2013-07-15 the Black Boxes were retired, taking with them all the balis, all the boers and all but 25 of the yilis.

Snapshot of Inventory for Modeling - ATLAS hardware removed

| Purchase Year | Node type | Node Count | Cores per node | CPU factor |
|---|---|---|---|---|
| 2006 | yili | 25 | 4 | 8.46 |
| 2008 | fell | 277 | 8 | 11.0 |
| 2009 | hequ | 192 | 8 | 14.6 |
| 2009 | orange | 96 | 8 | 10.0 |
| 2013 | bullet | 185 | 16 | 14.0 |

Retirement Models

Two models have been considered: a strict age cut (e.g. all machines older than 5 years are retired) and a do-not-resuscitate model ("DNR": machines out of Maintech support are left to die). The age cut presumably allows better planning of the physical layout of the data center, as the DNR model would leave holes by happenstance. On the other hand, the DNR model leaves useful hardware in place with minimal effort, but it assumes that floor space and power are not factors in the cost.

In practice, we may adopt a hybrid of the two, especially since a strict age cutoff would cause sudden drops in capacity, given our acquisition history.

There are industry estimates for survival rates vs time.

Age cut:

Using a strict 5 year cut, here is the survival rate of the existing hardware (in 2019 it is all gone):

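| Year | #hosts | #cores | kiso-units | SLAC-units |
|---|---|---|---|---|
| 2013 | 729 | 7164 | 7433 | 90698 |
| 2014 | 427 | 4848 | 5366 | 65476 |
| 2015 | 179 | 2864 | 3284 | 40067 |
| 2016 | 179 | 2864 | 3284 | 40067 |
| 2017 | 179 | 2864 | 3284 | 40067 |
| 2018 | 179 | 2864 | 3284 | 40067 |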

In effect, by the end of 2014 all hardware predating the bullet purchase would have been retired.
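The strict age cut can be sketched as a simple filter on purchase year. The snippet below is an illustration only: it takes its inventory from the snapshot table, and it assumes a node type is retired once the year minus the purchase year exceeds the cut. The actual modeling script may treat the starting year or the host counts differently, so this sketch is not expected to reproduce the tabulated figures exactly. The names `INVENTORY`, `surviving` and `totals` are hypothetical.

```python
# (purchase_year, node_type, node_count, cores_per_node, cpu_factor),
# copied from the "ATLAS hardware removed" snapshot table.
INVENTORY = [
    (2006, "yili", 25, 4, 8.46),
    (2008, "fell", 277, 8, 11.0),
    (2009, "hequ", 192, 8, 14.6),
    (2009, "orange", 96, 8, 10.0),
    (2013, "bullet", 185, 16, 14.0),
]


def surviving(inventory, year, max_age=5):
    """Node types still in service under a strict age cut.

    Assumption: a machine is retired once its age (year - purchase year)
    exceeds max_age.
    """
    return [row for row in inventory if year - row[0] <= max_age]


def totals(rows):
    """Return (hosts, cores) summed over the given inventory rows."""
    hosts = sum(count for _, _, count, _, _ in rows)
    cores = sum(count * cores for _, _, count, cores, _ in rows)
    return hosts, cores


if __name__ == "__main__":
    for year in range(2013, 2020):
        hosts, cores = totals(surviving(INVENTORY, year))
        print(year, hosts, cores)
```

Under this rule each purchase drops out all at once, which is the source of the sudden capacity drops noted above.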

DNR model:

No machines are kept beyond 10 years of age.

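| Year | #hosts | #cores | kiso-units | SLAC-units |
|---|---|---|---|---|
| 2013 | 729 | 7164 | 7433 | 90698 |
| 2014 | 652 | 6568 | 6862 | 83744 |
| 2015 | 575 | 5968 | 6288 | 76736 |
| 2016 | 499 | 5376 | 5718 | 69789 |
| 2017 | 420 | 4792 | 5166 | 63039 |
| 2018 | 337 | 3944 | 4275 | 52183 |
| 2019 | 203 | 2688 | 3024 | 36908 |
| 2020 | 110 | 1760 | 2018 | 24622 |
| 2021 | 87 | 1392 | 1596 | 19474 |
| 2022 | 67 | 1072 | 1229 | 14997 |
| 2023 | 49 | 784 | 899 | 10968 |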
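A DNR-style decay can be sketched as attrition that starts once support lapses, with a hard stop at 10 years of age. The support period (`support_years`) and the yearly loss fraction (`annual_loss`) below are placeholder values chosen purely for illustration; the survival estimates actually used for the table above are not given on this page, so this sketch does not attempt to reproduce those numbers.

```python
def dnr_survivors(node_count, purchase_year, year,
                  support_years=3, annual_loss=0.10, max_age=10):
    """Estimated number of nodes still running under a DNR policy.

    Hypothetical parameters (not from the source):
      support_years - years under maintenance, during which failures are repaired
      annual_loss   - fraction of remaining nodes lost per unsupported year
    max_age reflects the rule that nothing is kept beyond 10 years.
    """
    age = year - purchase_year
    if age > max_age:
        return 0
    unsupported_years = max(0, age - support_years)
    return int(round(node_count * (1.0 - annual_loss) ** unsupported_years))


if __name__ == "__main__":
    # Decay of a hypothetical batch of 100 nodes bought in 2013.
    for year in range(2013, 2025):
        print(year, dnr_survivors(100, 2013, year))
```

Replacing the constant `annual_loss` with an age-dependent survival curve would be the natural next step if industry survival estimates are available.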

Needs Estimation

PPA projects were polled for their projected needs over the next few years. This is recapped here.
