Comment: Accepted Yee's changes, changed PLM to in-depth documentation, changed reason for Microsoft interest to production-network TCP evaluation experience, fixed spelling of archiving, said event code still in development

...

Note

Topics still needed:
More on Terapaths interactions.
More on why or how users/physicists care, even anecdotes, testimonials
We have room for a figure, it needs to be compelling.

 

Today's data-intensive sciences, such as High Energy Physics (HEP), need to share large amounts of data at high speeds. This in turn requires high-performance, reliable end-to-end network paths between the major collaborating sites. In addition, network administrators need alerts when anomalous events occur, and grid middleware and end users need short- and long-term forecasts of application and network performance for planning, setting expectations and troubleshooting. Meeting these needs requires a network monitoring infrastructure between the major sites that can help identify potential problems and issue notifications about them.

Active monitoring: We have developed an active network monitoring toolkit (IEPM-BW). It provides measurements, data archiving, analysis, reporting and visualization. It is now being used to make regular measurements from the following major LHC-related sites: CERN, BNL, Caltech, FNAL, SLAC, and Taiwan. About 60 locations worldwide are monitored from these sites. We select probes based on the quality of, and interest in, the path being measured, using metrics such as network routes, round-trip time, one-way delays, available bandwidth and achievable throughput. We are extending the presentation of IEPM-BW by working with the USATLAS and ULTRALIGHT groups to customize reports to their most relevant interests.
To aid the detection of incorrect host settings, we have also implemented features to query the configurations of the network monitoring hosts.
As part of this we have developed and put into production management tools for automation and robustness, including: installation and update kits; measurement and reporting of unreachable participating hosts; documentation, including a Program Logic Manual; and a database of site, host, location, contact, OS, CPU, and test parameters.

...

Event Detection and Diagnosis: It is increasingly impractical for network managers to manually review thousands of reports each morning to detect problems, so we are developing tools to automate this activity. These tools forecast performance, compare the observed values with the forecast to detect anomalous events, report the events, and automatically gather information relevant to each event to assist in diagnosis. One version of the event detection is in production use. As part of this, in the last year, we have also detected, reported (together with in-depth case studies) and helped diagnose major problems at sites such as BNL, Taiwan, SDSC, NRL, BINP, and CERN.

...

The code is still in development. We are field testing the alerts against known problems, carefully reviewing the symptoms and causes, and working with network experts and providers to understand and evaluate what is needed to diagnose the problems.
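
The forecast-and-compare idea described above can be illustrated with a minimal sketch. This is not the IEPM-BW event detection code: the exponentially weighted moving average forecast, the function name detect_events, and the parameters alpha, threshold and warmup are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Event:
    index: int        # position of the anomalous sample in the series
    observed: float   # measured value (e.g. achievable throughput in Mbit/s)
    forecast: float   # value the forecast predicted for that sample


def detect_events(samples: Sequence[float],
                  alpha: float = 0.2,
                  threshold: float = 0.4,
                  warmup: int = 5) -> List[Event]:
    """Flag samples that fall more than `threshold` (a fraction) below the forecast.

    The forecast is an exponentially weighted moving average (an assumption
    made for this sketch, not the method used in production).
    """
    events: List[Event] = []
    forecast: Optional[float] = None
    for i, x in enumerate(samples):
        if forecast is None:
            forecast = x  # seed the forecast with the first measurement
            continue
        if i >= warmup and x < forecast * (1.0 - threshold):
            events.append(Event(i, x, forecast))
            continue  # keep anomalous samples out of the forecast
        forecast = alpha * x + (1.0 - alpha) * forecast
    return events


throughput = [900, 910, 905, 895, 900, 902, 898, 400, 390, 899]
for event in detect_events(throughput):
    print(f"sample {event.index}: observed {event.observed}, "
          f"forecast {event.forecast:.1f}")
```

Because anomalous samples are excluded from the forecast, this sketch would keep alerting on a sustained step change; a production tool would need a rule for accepting a new baseline.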

High speed data transport: Our involvement in high speed TCP transport (including several publications, winning the Supercomputing Bandwidth Challenge three times in succession, and setting the Internet land-speed record twice) and our world-leadership role in evaluating TCP transport mechanisms in production networks have led Microsoft to request our help in evaluating their next generation TCP stack (CTCP). Given the extent of Windows deployment, it is critical to ensure that CTCP performs well without a negative impact upon the Internet community.
As part of this we have identified, and helped test, numerous additional features that improve the performance of the delay-based congestion control algorithm used in CTCP. Having finalized our initial report on the deployment impact of using CTCP in production environments on both long and short distance high speed Internet paths, we are now looking to publish a joint conference paper with Microsoft on our work.

...