...

Then there are multiple steps for the sync to MCCO, labelled REFRESH_MCCO_IRMIS_TABLES.

Note

REFRESH_MCCO_IRMIS_TABLES will not kick off unless ALL_DATA_UPDATE finishes with a status of SUCCESS.
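
The gating rule in this note can be pictured as a simple status check. This is a minimal sketch only: the `step_statuses` mapping and the `may_start_mcco_refresh` helper are hypothetical names chosen for illustration, not part of the actual job scripts.

```python
def may_start_mcco_refresh(step_statuses):
    """Return True only if ALL_DATA_UPDATE finished with status SUCCESS.

    step_statuses is a hypothetical mapping of step name -> final status.
    A missing entry is treated the same as a failure.
    """
    return step_statuses.get("ALL_DATA_UPDATE") == "SUCCESS"


if __name__ == "__main__":
    print(may_start_mcco_refresh({"ALL_DATA_UPDATE": "SUCCESS"}))
    print(may_start_mcco_refresh({}))
```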

Some specific error checks:

  • comparison of PV counts: pre and post crawl
    An error is flagged if the total PV count has dropped by more than 5000, because this may indicate that some IOCs that were previously being crawled have been skipped. This happens when a change occurs in the IOC boot files or IOC boot structure that the crawler doesn’t understand: for example, the crawler encounters new IOC startup file syntax that it can’t parse, but that doesn’t actually crash the crawler program.
    The logfile and e-mail message will tell you which IOCs are affected. The first thing to do is check with the responsible IOC engineer(s). It’s possible that the PV count drop is “real”, i.e. the engineer(s) in question intentionally removed a large block of PVs from the system.
    • If the drop is intentional: You will need to update the database manually to enable the crawler to proceed. The count difference is found by comparing the row count of the newly populated curr_pvs view with the row count of the materialized view curr_pvs_mv, which was updated on the previous day. If the PV count has dropped by more than 5000 PVs, the synchronization is cautious: it prevents good data in MCCO from being updated until the reason for the drop is known.
      The way to say "it's ok, the drop was intentional or at least non-destructive, synchronization is now ok" is to go ahead and update the materialized view with current data. Then the next time the crawler runs, the counts will be closer (barring some other problem) and the synchronization can proceed.
      So, to enable the crawler to proceed, you need to update the materialized view with the data in the current curr_pvs view, like this:
      Log in as cddev on slcs2 and run:
      /afs/slac/g/cd/tools/irmis/cd_script/updateMaterializedViews.csh
      The next time the crawler runs (that night), the data validation will be making the comparison with correct current data, and you should be good to go.
    • If the drop is NOT intentional: Check the log for that IOC (search for “Processing iocname”) and/or ask the IOC engineer to check the IOC boot directories and IOC boot files for:
        • a syntax error in the boot file or directory
        • IOC st.cmd syntax that has changed to something the crawler hasn’t learned yet. This may require a PVCrawlerParser.pm modification to teach IRMIS the new syntax. This is rare but does occur. The IOC engineer may be able to switch back to an older-style syntax, add quotes around a string, etc. to temporarily work around the situation.
    • comparison of processed IOC counts in 2 log files
      If the numbers don't match, the PV Crawler terminated early, before it processed all the IOCs. This usually happens when the crawler encounters new IOC startup file syntax that crashes the crawler Perl program. Go to the logfile to determine the cause. This is rare, but usually requires a PVCrawlerParser.pm modification. As a temporary fix, update the IOC table and, for the offending IOC(s), set SYSTEM to something other than LCLS or NLCTA (e.g. LCLS-TEMP-NOCRAWL) so they will not be crawled next time.
    • bsa_root_names row count checks: check for a problem accessing the table, or row counts of 0 for expected bsa types
    • devices_and_attributes row count checks: check for a problem accessing the table, or a row count of 0
    • check for successful completion of all crawler steps: check to make sure all crawler steps enter a start and a completion row in controls_global.data_validation_audit.
      Some circumstances where steps are missing:
      • step launched but didn’t finish: check the status of processes launched by the cron job using ps -ef | grep. An example: when Perl DBI was hanging due to the 199-day-Linux-server-uptime bug, several LCLS PV crawler jobs had launched but had hung in the db_connect statement, and had to be killed from the Linux command line.
      • step launched and finished, but the completed step was never written: the getPwd problems cause this symptom. See entries starting 9/23 9 pm for an illustration.
      • step never launched: is the script available? Is the server up? Is crontab/trscrontab configured correctly? Are there permission problems? etc.
      • other mysteries: figure out where the job in question stopped, using ps -ef, logfiles, etc…
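
The count and completion checks in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual validation code: the function names, the `(step_name, event)` tuple layout standing in for rows of controls_global.data_validation_audit, and the STEP_A/STEP_B/STEP_C names are all hypothetical. Only the 5000-PV threshold and the curr_pvs / curr_pvs_mv comparison come from the text.

```python
PV_DROP_LIMIT = 5000  # threshold stated in the text


def pv_count_drop_exceeded(previous_count, current_count, limit=PV_DROP_LIMIT):
    """True if the PV count fell by more than `limit` since the last crawl.

    previous_count: row count of curr_pvs_mv (the previous day's snapshot)
    current_count:  row count of the newly populated curr_pvs view
    """
    return (previous_count - current_count) > limit


def ioc_counts_match(count_in_first_log, count_in_second_log):
    """True if both log files report the same number of processed IOCs."""
    return count_in_first_log == count_in_second_log


def audit_gaps(expected_steps, audit_rows):
    """Find steps that never started, or started without completing.

    expected_steps: names of the steps that should have run (assumed known)
    audit_rows:     iterable of (step_name, event) tuples, where event is
                    "start" or "complete" (a simplified, hypothetical view
                    of the audit table)
    """
    started = {step for step, event in audit_rows if event == "start"}
    completed = {step for step, event in audit_rows if event == "complete"}
    return {
        "never_started": sorted(set(expected_steps) - started),
        "not_completed": sorted(started - completed),
    }


if __name__ == "__main__":
    rows = [("STEP_A", "start"), ("STEP_A", "complete"), ("STEP_B", "start")]
    print(pv_count_drop_exceeded(100000, 94000))
    print(ioc_counts_match(120, 118))
    print(audit_gaps(["STEP_A", "STEP_B", "STEP_C"], rows))
```

Note that a step in "not_completed" covers two of the cases above (launched but hung, or finished without writing its completion row); telling them apart still requires looking at running processes and logfiles.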

Logfile messages sent in e-mail:

    • Duplicate PV report (LCLS only; near the top of the message)
      Forward these to the relevant IOC engineer.
    • List of IOCs with PV counts that changed during the current crawl (LCLS only; near the top of the message) – helpful for figuring out a big PV count drop
    • IOCS not found in IRMIS, added to IOC table (LCLS only) – nice to know which new IOCs were just added.
      If an IOC directory in $IOC is not in the IRMIS IOC database table yet, IRMIS will automatically add it and produce a message, which you'll see here.
    • PV Crawler log messages
      • Don't worry about "expecting 3 arguments to nfsMount"!
      • Quickly check "Could not locate" - usually an IOC dir that isn't quite in production yet.
      • You can send a message to the IOC engineer if you see " unreachable" - it could be a typo, or it could be intentional.
      • A "Could not locate" message for an st.cmd file will produce "Parse error: 2" messages, which can be ignored if the IOC isn't in production.
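
The handling of PV Crawler log messages described above could be expressed as a small triage table. This is only a sketch: the message fragments are the ones quoted in the list, but the exact wording of real log lines, the `triage` helper, the "review manually" fallback, and the sample lines are assumptions made for illustration.

```python
# Message fragments and suggested handling, taken from the list above.
# The first matching fragment wins.
TRIAGE_RULES = [
    ("expecting 3 arguments to nfsMount", "ignore"),
    ("Could not locate", "check whether the IOC is in production yet"),
    ("Parse error: 2", "ignore if the IOC is not in production"),
    ("unreachable", "notify the IOC engineer"),
]


def triage(log_line):
    """Return the suggested action for one crawler log line."""
    for fragment, action in TRIAGE_RULES:
        if fragment in log_line:
            return action
    # Fallback for anything not covered by the list (an assumption).
    return "review manually"


if __name__ == "__main__":
    # Made-up sample lines; real log lines may be worded differently.
    samples = [
        "warning: expecting 3 arguments to nfsMount",
        "Could not locate st.cmd for some-ioc",
        "some-path unreachable",
    ]
    for line in samples:
        print(f"{line!r} -> {triage(line)}")
```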

...

  • CURR_PVS: IRMIS PVs populate a daily current PV list, a view called curr_pvs. curr_pvs supplies LCLS PV names to one of the AIDA names load jobs (LCLS EPICS names). The current PV and IOC lists are also queried by the IRMIS GUI and by web interfaces, and are joined with data in lcls_infrastructure by Elie and co. for Greg and co.
  • BSA_ROOT_NAMES: Qualifying IRMIS PVs populate the bsa_root_names table, which is joined in with Elie's complex views to device data in lcls_infrastructure. For details on the bsa_root_names load, please have a look at the code for the stored procedure which loads it (see one of the tables above)
  • DEVICES_AND_ATTRIBUTES: PV names are parsed into device names to populate the devices_and_attributes table, which is used by lcls_infrastructure and associated processing.
  • User interfaces may be affected (but they use MCCQA, so are not affected by failure of the sync to MCCO step):
    • IRMIS gui
    • the web IOC Report
    • IRMIS iocInfo web and APEX apps
    • EPICS camdmp APEX app
    • Archiver PV search APEX app
    • ad hoc querying


3 possible worst case workarounds:

...