...
For general information about the High-availability racks, Shirley provided this pointer to the latest list:
"Service Now, Knowledge Base, search for 'High Availability', following the link for current servers"
And here is the current statement about high-availability functionality:
Code Block |
---|
Current Services in HA Racks
• CATER application
• Confluence application
• Data center management tool
• Drupal web
• Email lists
• Email transport infrastructure
• ERP application
• Exchange email
• EXO application
• Facilities monitoring
• Fermi application
• IT Ticketing system
• Network infrastructure
• Site Security infrastructure
• Unix authentication infrastructure
• Unix AFS infrastructure
• Unix mailboxes
• Unix monitoring
• VPN
• Windows authentication infrastructure
• Windows file servers and SAN
• Windows monitoring
• Windows web |
Supporting documentation
Email from Steve Tether with some storage-related information:
Expand | ||
---|---|---|
| ||
Change "fermilnx01 or fermilnx02" to "fermilnx01 and fermilnx02". While services can all be shifted to one of those machines, frankly it's a pain.
The partition staas-cnfs50lb:/gpfs/slac/ha/fs1/g/fermi/u23 currently has 554 GB free. This is where we store:
- Incoming FASTCopy packages (L0 data, HSK data).
- Outgoing FASTCopy packages (L1 data, mission planning).
- Unpacked LAT raw data (L0, HSK, etc.).
FASTCopy packages for both L0 and L1 data are archived daily to "astore-new" and are then deleted within 24 hours. "astore-new" is a POSIX-compliant filesystem interface to HPSS that replaced the old "astore" interface. This is driven by the old GLAST Disk Archiver service. The packages are also archived to xrootd daily. Unpacked raw data is also archived to xrootd but is retained for 60 days on u23. The unpacked raw data on xrootd is a "live" backup in the sense that it can be accessed by ISOC tools and L1 reconstruction if needed, though that option is not normally enabled.
We get something like 16 GB of L0 data daily. If archiving to astore-new is turned off, then we would have to retain the original incoming L0 FC packages, the unpacked L0 data, and the L1 FC packages. Naively assuming that all of these are about the same size, that means retaining 48 GB or more per day, so u23 would fill up in 11.5 days or less. And we'd probably start experiencing problems as it approached being 100% full.
If the astore-new archiving were kept going but the xrootd archiving were suspended, then we would retain only the 16 GB of unpacked L0 data per day which would fill up u23 in 30 days or so.
So I would recommend changing the classification of "astore (non-Fermi server)" from NC to XC for an outage of this length. And rename "astore" to "astore-new (HPSS)". I see that the Archiver server fermilnx-v03 is already classified as XC, so that's good.
The partition staas-cnfs50lb:/gpfs/slac/ha/fs1/g/fermi/u41 is used by the halfpipe to store events extracted from LAT raw data. The events would take up 16 GB daily times some modest expansion factor. That partition needs to be kept going for normal processing. I don't know how long the event data is retained but the partition currently has 4.4 TB free so it shouldn't be a problem in any event.
All the rest of the page seems OK.
|
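Steve's fill-up estimates can be sketched as a quick calculation. All numbers below come from his email; the 3x multiplier is his naive "all about the same size" assumption, and the script is illustrative only:

```python
# Sketch of the u23 fill-up estimates from Steve's email (numbers are his).
FREE_GB = 554        # current free space on u23
L0_DAILY_GB = 16     # incoming L0 data per day

# Scenario 1: astore-new archiving off -> retain incoming L0 packages,
# unpacked L0 data, and L1 packages, each naively assumed ~16 GB/day.
daily_no_astore = 3 * L0_DAILY_GB               # 48 GB/day
days_no_astore = FREE_GB / daily_no_astore      # ~11.5 days

# Scenario 2: astore-new on, xrootd archiving suspended -> retain only
# the 16 GB/day of unpacked L0 data.
days_no_xrootd = FREE_GB / L0_DAILY_GB          # ~34.6 days ("30 days or so")

print(f"No astore-new: u23 fills in ~{days_no_astore:.1f} days")
print(f"No xrootd:     u23 fills in ~{days_no_xrootd:.1f} days")
```

This reproduces the "11.5 days or less" and "30 days or so" figures quoted above.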
Wilko's statement regarding space currently available in xrootd:
Expand | ||
---|---|---|
| ||
There are currently about 290 TB free in the xrootd GPFS space, which is plenty. Also, if needed, we can always purge old recon files from disk. |
Nicola's estimate of batch power needed for GW follow up pipeline:
Expand | ||
---|---|---|
| ||
I am trying to figure out the right numbers looking at the resource plots… I'm not sure how to read the plots. I think they are running on 300 cores for about an hour, so my estimate was 30 cores for 10 hours… |
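Nicola's two readings are consistent in total work: 300 cores for 1 hour and 30 cores for 10 hours are both 300 core-hours. A trivial sketch of that equivalence (numbers are from the quote above):

```python
# Both readings of the resource plots amount to the same total work.
observed_core_hours = 300 * 1   # ~300 cores for about an hour
estimate_core_hours = 30 * 10   # Nicola's restated estimate
print(observed_core_hours, estimate_core_hours)  # 300 core-hours either way
```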
Stefano's comment on Flare Advocates:
Expand | ||
---|---|---|
| ||
for the FA shifts |
Dan's statement on various Pipelines (including FAVA):
Expand | ||
---|---|---|
| ||
We have a few analysis pipelines that currently use the batch system. These include the burst advocate analysis, the gravitational wave followup, and FAVA. The gravitational wave analysis typically requires thousands of jobs to be launched to analyze a large portion of the sky, so I think it’s probably hopeless to keep that up during the outage. FAVA runs on weekly timescales, so we can probably safely catch up that analysis once the batch farm comes back to full strength. The burst advocate analysis gets launched a little more than once a day. Counting up the past week, we had 11 triggers in 7 days. Each trigger launches 6 jobs and each job goes to the medium queue using rhel6.
I can take the appropriate steps to deactivate the gravitational wave followup analysis and FAVA leading up to the outage. Let me know if you think we’d be able to keep the burst advocate analysis running and I’ll take the appropriate actions.
| ||
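Dan's trigger counts translate into a modest average job load; a rough sketch using only the numbers from his email (actual load is bursty, since each trigger launches its jobs at once):

```python
# Burst-advocate load from Dan's numbers: 11 triggers in 7 days,
# 6 medium-queue (rhel6) jobs per trigger.
triggers_per_week = 11
jobs_per_trigger = 6
jobs_per_day = triggers_per_week * jobs_per_trigger / 7
print(f"~{jobs_per_day:.1f} burst-advocate jobs/day on average")
```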
...