The IEPM-BW system architecture is based on a MySQL database on each individual monitoring host, various daemons and servers, and a crontab.
There are three file 'systems' associated with IEPM-BW monitoring systems: the source code, the report storage, and the IEPM database, which is a MySQL database. Each monitoring host has all three. The target hosts have only the target system source code.
The IEPM database stores the monitoring system's configuration and its data. There are also two short configuration files, which reside in /etc/: /etc/iepm.cnf, which provides information the system absolutely requires, and /etc/my.cnf, which MySQL uses to define the IEPM database.
The MySQL database contains all the configuration information for the probes that are to be made (node information, probe types, probe options, probe timing, etc.) and the specifications for the various analyses and graphs that are to be generated. It also contains tables where the probe data is stored and some tables to save the results of the various analyses.
The IEPM data base has the following tables:
CONFIGURATION TABLE NAMES | UNIQUE INDEX | FUNCTION : Contains
NODES | nodeID | Details on all target and monitoring host nodes
MONHOSTS | monhostID | Additional details for the monitoring hosts
NODESPECS | nodespecsID | Architecture information for each target and monitoring host
PLOTSPECS | plotspecID | Specifications for various types of plots
SCHEDULE | scheduleID | An entry for each scheduled probe run
TOOLSPECS | toospecsID | Specifications for each probe for each target host

DATA STORAGE TABLE NAMES | UNIQUE INDEX | FUNCTION : Contains
ASN | asnID | IP address ASN information
ASNINFO | asninfoID | ASN details
BWDATA | bwdataID | Bandwidth probe results
OWAMPDATA | owampdataID | Owamp probe results
PINGDATA | pingdataID | Ping probe results
ROUTEDATA | routedataID | Traceroute results
ROUTENO | routenoID | Unique route number for traceroutes

ANALYSIS RESULTS TABLE NAMES | UNIQUE INDEX | FUNCTION : Contains
ALERTS | commentsID | Details on alerts detected by the analysis
GRAPHS | graphID | Graphs from the analysis
NEWALERTS | alertID | Details on alerts in another format
COMMENTS | commentsID | Comment storage

Optional Analysis Table Names (Experimental, Currently NOT used) | UNIQUE INDEX | FUNCTION : Contains
HWALERTS | | Holt-Winters alerts
HWGRAPHS | | Holt-Winters graphs
REGDATA | regdataID | Regularized data for Holt-Winters

Servers
The servers run on each monitoring host and respond to the probes from the other monitoring hosts. Each monitoring host has a customization directory called 'config', which must be customized for that host, although its contents are very often the same across all monitoring hosts. The file 'config/servers.alive' currently contains the following lines:
bw-iperf-server,bin,-s -p 5000 -w 20M
thrulayd,bin,
pathload_snd,bin, -q -i
map-updated,,
pathchirp_snd,bin,
owampd,bin, -c /afs/slac/package/netmon/bandwidth-tests/v3src/config -Z >& /tmp/owampd.log &
The first column is the name of the server, the second column is the directory it is in, and the third column is the list of parameters that it should be started with.
The servers are kept alive via the script 'keep-servers-alive'.
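The three-column layout of 'config/servers.alive' described above can be read with a few lines of Python. This is only a sketch of the file format, not the code used by 'keep-servers-alive'; the names `ServerEntry` and `parse_servers_alive` are hypothetical, and it assumes one entry per line with the parameters being everything after the second comma.

```python
from typing import List, NamedTuple


class ServerEntry(NamedTuple):
    name: str       # server program name, e.g. 'bw-iperf-server'
    directory: str  # directory holding the program (may be empty)
    params: str     # start-up parameters (may be empty)


def parse_servers_alive(text: str) -> List[ServerEntry]:
    """Parse 'name,directory,parameters' lines (assumed format, one entry per line)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two commas only, so that any comma inside
        # the parameter string is preserved.
        fields = line.split(",", 2)
        fields += [""] * (3 - len(fields))
        name, directory, params = (f.strip() for f in fields)
        entries.append(ServerEntry(name, directory, params))
    return entries
```

For example, the line 'bw-iperf-server,bin,-s -p 5000 -w 20M' yields the name 'bw-iperf-server', the directory 'bin', and the parameters '-s -p 5000 -w 20M', while 'map-updated,,' yields an entry with an empty directory and no parameters.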
Daemons
The daemons control the scheduling of probes, the probing itself, and the loading of the probe results into the database. The daemons have a directory in the MySQL data directory, '/home/iepm/mysql/keepalives'. Each daemon touches its own keep-alive file every time it cycles through. The script 'keep-em-alive' periodically checks the time stamps on the daemon keep-alive files and restarts any daemon that is no longer running.
The current daemons are:
bw-synchd.alive
load-datad.alive
load-scheduled.alive
owpingd.alive
pathchirpd.alive
pingd.alive
traced.alive
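The keep-alive mechanism can be illustrated with a small Python sketch. It is not the 'keep-em-alive' script itself: the function names and the `max_age_seconds` threshold are assumptions made for illustration, and the source does not say how old a time stamp must be before a daemon is considered dead.

```python
import os
import time
from typing import List, Optional


def touch(path: str) -> None:
    """What a daemon does each cycle: create the file if needed, then update its time stamp."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def stale_keepalives(directory: str, max_age_seconds: float,
                     now: Optional[float] = None) -> List[str]:
    """Return the '.alive' files whose time stamp is older than max_age_seconds."""
    if now is None:
        now = time.time()
    stale = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".alive"):
            continue
        mtime = os.path.getmtime(os.path.join(directory, name))
        # A daemon that has stopped cycling no longer refreshes its file.
        if now - mtime > max_age_seconds:
            stale.append(name)
    return stale
```

A supervising script in this style would call `stale_keepalives` on the keep-alive directory and restart the daemon behind each file name it returns.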
load-scheduled
'load-scheduled' reads the TOOLSPECS table periodically and looks for probes that are due to be run. Each probe has a 'lastrunepoch' field that is updated when that probe is run. 'load-scheduled' reads the 'lastrun' and 'runinterval' fields and adds them together to determine whether it is time to run another instance of the probe. If it is, it adds the command information to the SCHEDULE table.
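The due-time test reduces to one comparison. The sketch below assumes that both fields are in seconds and that the last run time is a Unix epoch value; the dictionaries stand in for TOOLSPECS rows, and the 'probe' key and the function names are hypothetical.

```python
from typing import Dict, List


def is_due(lastrun: int, runinterval: int, now: int) -> bool:
    """A probe is due once its last run time plus its interval has passed."""
    return lastrun + runinterval <= now


def due_probes(toolspecs: List[Dict], now: int) -> List[Dict]:
    """Select the rows that would be copied into the SCHEDULE table."""
    return [row for row in toolspecs
            if is_due(row["lastrun"], row["runinterval"], now)]
```

With a probe last run at 1000 and an interval of 600, the probe becomes due at time 1600 and not before.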
load-datad
'load-datad' loops over all the 'load-test-data' scripts and calls them sequentially to load the data into the database. All the results from the probes go into files in '/home/iepm/mysql/data', each named 'scheduleid.probetype'.
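Because each result file is named 'scheduleid.probetype', a loader can recover both values from the file name alone. The following is a minimal sketch under the assumption that the schedule ID contains no dot; `parse_result_filename` is a hypothetical helper, not part of the 'load-test-data' scripts.

```python
import os
from typing import Tuple


def parse_result_filename(path: str) -> Tuple[str, str]:
    """Split a 'scheduleid.probetype' file name into its two parts."""
    base = os.path.basename(path)
    # Everything before the first dot is taken as the schedule ID (assumption).
    schedule_id, sep, probe_type = base.partition(".")
    if not sep or not schedule_id or not probe_type:
        raise ValueError("not a 'scheduleid.probetype' file name: %r" % base)
    return schedule_id, probe_type
```

A loader could then use the probe type to choose which loading script handles the file, and the schedule ID to tie the results back to the SCHEDULE entry.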
bw-synchd
'bw-synchd' is the daemon which runs all probes that MUST be run sequentially and not concurrently. It reads all 'background-syn' probes that are scheduled and loops through them one at a time, running the probe command. The output from the probe is stored in the '/home/iepm/mysql/data' directory.
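The one-at-a-time behaviour can be sketched as a loop that waits for each command to finish before starting the next. This is an illustration only, not the 'bw-synchd' code: the function name, the form of the probe list, and the choice to capture standard error alongside standard output are all assumptions.

```python
import os
import subprocess
from typing import List, Tuple


def run_sequentially(probes: List[Tuple[str, List[str]]], data_dir: str) -> List[str]:
    """Run each probe command to completion before starting the next.

    probes   -- (output file name, argv) pairs, in the order they should run
    data_dir -- directory that receives one output file per probe
    Returns the paths of the output files, in run order.
    """
    written = []
    for file_name, argv in probes:
        out_path = os.path.join(data_dir, file_name)
        with open(out_path, "w") as out:
            # subprocess.run blocks until the command exits,
            # so no two probes ever overlap.
            subprocess.run(argv, stdout=out, stderr=subprocess.STDOUT, check=False)
        written.append(out_path)
    return written
```

Running probes this way keeps bandwidth-intensive measurements from competing with each other, which is why such probes are separated from the low-bandwidth ones.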
owpingd, pathchirpd, pingd, and traced
These daemons each perform the probe type that matches their name. These are all low-bandwidth probes which can be run concurrently (testtype 'background'). Each daemon fetches from the SCHEDULE table the scheduled probes that it is responsible for and executes them. The data goes in the '/home/iepm/mysql/data' directory.
Cleanup Scripts
bw-cleanup
From experience I have found that there are various probes and programs which hang for one reason or another. bw-cleanup is called from the crontab to look at a list of known troublesome processes, and if they are around longer than a specified time, it kills them. One example in here is 'gnuplot'. Gnuplot 3.7 had some bugs in it that caused it to loop and hang. There was no way to work around these problems, so I created 'bw-cleanup' and its control file '/home/iepm/v3src/config/cleanup-list' to handle these circumstances.
Currently '/home/iepm/v3src/config/cleanup-list' contains:
post-test,120
gnuplot,5
triganal,120
runperiod,120
pathload_rcv,3
pathchirp_rcv,3
/bin/ping,3
/bin/owping,3
/bin/tlaytcp,3
/bin/bw-iperf-client,3
whois,3
asn.pl,3
The first column is the unique name or character string which identifies the process. The second column is the number of minutes after the start of the process at which the process is killed. Note that the 'iperf' server is named 'bw-iperf-server' so that one can easily tell the client apart from the server. The client ('bw-iperf-client') hangs, but the server does not. Note also '/bin/ping': ping can and will hang due to some network problems.
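The cleanup rule can be sketched in Python as follows. This is not the 'bw-cleanup' script: the function names are hypothetical, the process list is passed in as plain tuples rather than read from the system, and substring matching on the command line is an assumption about how the "unique name or character string" is compared.

```python
from typing import List, Tuple


def parse_cleanup_list(text: str) -> List[Tuple[str, int]]:
    """Parse 'pattern,minutes' lines into (pattern, minutes) rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so the pattern itself may contain commas.
        pattern, minutes = line.rsplit(",", 1)
        rules.append((pattern.strip(), int(minutes)))
    return rules


def processes_to_kill(rules: List[Tuple[str, int]],
                      processes: List[Tuple[int, str, float]]) -> List[int]:
    """Return the PIDs of processes that match a rule and have run too long.

    processes -- (pid, command line, age in minutes) tuples
    """
    doomed = []
    for pid, command, age_minutes in processes:
        for pattern, limit_minutes in rules:
            if pattern in command and age_minutes > limit_minutes:
                doomed.append(pid)
                break
    return doomed
```

Under this matching rule the pattern '/bin/bw-iperf-client' never matches a 'bw-iperf-server' process, which is the reason for giving the client and the server distinct names.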