References

From Core Minutes, March 20, 2007:

ROOT data indexing (Eric C.): a ROOT-based metadata scheme (used by BABAR) containing pointers to locate all known data for a given event. The basic idea is to store (File, Tree, Entry) in an external table, pointed to by a single pointer in an NTuple (index file), with some additional flexibility (it works with xrootd, including when data has been migrated to tape). Tools to read and copy the event data are also provided. See https://confluence.slac.stanford.edu/download/attachments/20011/Event_Collections.pdf. The BABAR code is all online in their CVS repository, module KanEvent. This scheme might find application in GLAST for interleave, skimming, and analysis scenarios.
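A minimal sketch of the (File, Tree, Entry) idea in plain ROOT, assuming a small index tree with branches named "file", "tree" and "entry" (these names, and the function itself, are hypothetical - this is not the KanEvent code):

    // Sketch only: one row of an external index table is read and used to
    // fetch the event it points to.  Branch names are hypothetical.
    #include "TFile.h"
    #include "TTree.h"

    void fetchEvent(TTree* indexTree, Long64_t row)
    {
        Char_t   fileName[256];
        Char_t   treeName[64];
        Long64_t entry = 0;
        indexTree->SetBranchAddress("file",  fileName);
        indexTree->SetBranchAddress("tree",  treeName);
        indexTree->SetBranchAddress("entry", &entry);
        indexTree->GetEntry(row);

        // xrootd URLs ("root://...") work here as well as local paths.
        TFile* f = TFile::Open(fileName);
        if (!f || f->IsZombie()) return;
        TTree* dataTree = static_cast<TTree*>(f->Get(treeName));
        if (dataTree) dataTree->GetEntry(entry);  // event data now sits in its branches
        f->Close();
    }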

Relational DB option 

In a nutshell: while at the collaboration meeting it became apparent that the goals for this "system" were rapidly changing and the scope of the concept was increasing dramatically. For example, the pipeline people would like to be able to categorize events (passes CNO filter, passes MIP filter, GCR event, etc.); they think it would be natural to just write out a run/event number to do this, then use the "system" to read the events back. Thinking about it a bit more, it seems to me that the problem neatly divides into two pieces. One piece, given a run and event number (which I am told are the unique identifiers for all events), returns the information on where to find the actual data. The other piece, given the run and event number and the information on where to find the data, returns the actual data requested (which can be various ROOT trees - mc, relation, recon, digi, etc. - or ntuples, or...).
My argument is that the first piece is best done with a relational database. With a relational database you would use the run/event number as the key and then enter, only once, all the information on where to find the various bits of data associated with it. In addition, you can keep a few more bits of information that further categorize the event and can be used in a query to select events of a particular kind. I think this type of system will be far more extensible and much easier to manage than a pile of ROOT files trying to do the same sort of thing. In addition, it would be straightforward for the pipeline people to hook into this automatically to fill it. I also think it would be very easy to transport the database to other installations which might be repositories of large datasets (e.g. Lyon).
In any case, I think this approach also neatly divides the problem into two pieces which may well make it much easier to implement.
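Purely as an illustration of that split (none of these classes exist; the names are made up), the two pieces could look something like this:

    // Illustrative only: a possible shape for the two pieces described above.
    #include <string>
    #include <vector>

    // Piece 1: given run/event, say where the data lives.
    // This is the part proposed to sit behind a relational database.
    struct EventLocation {
        std::string file;    // path or xrootd URL
        std::string tree;    // e.g. "Digi", "Recon", "Mc"
        long long   entry;   // entry number within that tree
    };

    class EventLocator {
    public:
        virtual ~EventLocator() {}
        // Locations of all components (digi, recon, mc, ...) of one event.
        virtual std::vector<EventLocation> locate(int run, int event) = 0;
    };

    // Piece 2: given the locations, return the actual data.
    // This is the ROOT-reading part (RootIo or similar).
    class EventReader {
    public:
        virtual ~EventReader() {}
        virtual void read(const std::vector<EventLocation>& where) = 0;
    };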
And, finally, I also discovered that Joanne has already provided all the tools necessary for implementing such an approach, including some very nice GUI tools for browsing what is in the database. There would be some cost in understanding her code and then wrapping what we want to do around it, but this would be far simpler than trying to invent something ourselves.
I have the BaBar code downloaded on my laptop. I had started to think about building it but realized I needed some include files. When I asked Tom where I might find them, he cautioned me that I was beginning to pull on a very long string and it might be best not to try. So... whatever we decide to do, it sounds like our best approach may be to take the concept but do our own implementation.
Tracy

Use Cases

Level 0 Interleave

Feed the simulation a library of background events that it can randomly access in full from the original data files. So would we envision Interleave querying the DB directly during a run to get the next background event? For the interleave we might try to do it like the ntuple interleave, where you set up to read all events within a given bin, on the assumption that several events will be generated from that bin before moving to a new bin. But I think there are few enough interleave events that querying the DB each time you want one is not much overhead. I suppose we will see!
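A rough sketch of the per-bin option, just to make the access pattern concrete (the bin query and all names here are hypothetical):

    // Sketch: cache one bin's worth of background events; re-query the DB
    // only when the bin changes or the cache runs out.
    #include <cstddef>
    #include <functional>
    #include <stdexcept>
    #include <string>
    #include <utility>
    #include <vector>

    struct BkgLocation { std::string file; std::string tree; long long entry; };

    class InterleaveSource {
    public:
        // queryBin: one DB round trip returning every background event in a bin.
        explicit InterleaveSource(std::function<std::vector<BkgLocation>(int)> queryBin)
            : m_queryBin(std::move(queryBin)), m_bin(-1), m_pos(0) {}

        BkgLocation next(int bin)
        {
            if (bin != m_bin || m_pos >= m_cache.size()) {
                m_cache = m_queryBin(bin);   // one query per bin
                m_bin   = bin;
                m_pos   = 0;
            }
            if (m_cache.empty()) throw std::runtime_error("no background events in bin");
            return m_cache[m_pos++];
        }
    private:
        std::function<std::vector<BkgLocation>(int)> m_queryBin;
        std::vector<BkgLocation> m_cache;
        int         m_bin;
        std::size_t m_pos;
    };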

Pruner Skimming

A search is performed and N events are found that satisfy the criteria. With the run/event IDs in hand, the system finds the data files containing the events, pulls the events out, and ships them off for delivery to the user who performed the search.
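As a sketch of the "pull them out" step, ROOT's tree-cloning idiom can copy a chosen set of entries into a new file (the function, tree name, and entry list below are placeholders):

    // Sketch: copy selected entries from one input tree into a new output file.
    #include <vector>
    #include "TFile.h"
    #include "TTree.h"

    void skim(const char* inName, const char* treeName,
              const std::vector<Long64_t>& entries, const char* outName)
    {
        TFile* in = TFile::Open(inName);
        if (!in || in->IsZombie()) return;
        TTree* tree = static_cast<TTree*>(in->Get(treeName));
        if (!tree) return;

        TFile* out     = TFile::Open(outName, "RECREATE");
        TTree* skimmed = tree->CloneTree(0);   // same structure, no entries yet

        for (Long64_t e : entries) {
            tree->GetEntry(e);                 // load one selected event
            skimmed->Fill();                   // append it to the output tree
        }
        skimmed->Write();
        out->Close();
        in->Close();
    }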

User Analysis

User Joe has his favorite N events and wants to keep a run/event list handy to share with his friends. His friends can then use that list to retrieve the full data themselves. Right: the DB produces a list, which the ROOT reading code uses to return the events as you need them.
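A tiny sketch of the "list" side, assuming a plain text file with one "run event" pair per line (the format is just an assumption):

    // Sketch: parse a shared run/event list so it can be handed to whatever
    // service resolves run/event into file/tree/entry.
    #include <fstream>
    #include <utility>
    #include <vector>

    std::vector<std::pair<int, int> > readEventList(const char* path)
    {
        std::vector<std::pair<int, int> > list;
        std::ifstream in(path);
        int run, event;
        while (in >> run >> event)
            list.push_back(std::make_pair(run, event));
        return list;
    }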


3 Comments

  1. Unknown User (chamont)

    For what concerns the DB part, be careful not to generate a monster with much redundant data. If you store the file/tree/entry for each run/event, the same file/tree will certainly be repeated many times. I would rather see the database as a collection of datasets, where each dataset contains:

    1) the list of real files/trees, with the number of entries in each file/tree.
    2) the event component concerned (recon/mc/digi/etc.)
    3) the first and last run/event ids.

    This implies more work when trying to retrieve the file/tree/entry for a given run/event. Perhaps one could maintain in parallel a list of run/event/pointer, where the pointer points to the relevant dataset. On the other hand, if you want to complete each run/event with some additional information (passes CNO filter, passes MIP filter, GCR event, etc.), perhaps the relative weight of file/tree/entry will be negligible.
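    A sketch of the dataset-level lookup this layout implies (struct and function names are made up): it finds the file/tree whose run/event range covers the requested id; locating the exact entry then still requires a search inside that tree, or the parallel run/event/pointer list mentioned above.

        // Sketch only: resolve a run/event id to the dataset that should contain it.
        #include <string>
        #include <vector>

        struct Dataset {
            std::string file, tree;     // real file and tree
            std::string component;      // "recon", "mc", "digi", ...
            long long   nEntries;       // number of entries in the tree
            int firstRun, firstEvent;   // first run/event id covered
            int lastRun,  lastEvent;    // last run/event id covered
        };

        const Dataset* findDataset(const std::vector<Dataset>& sets,
                                   const std::string& component, int run, int event)
        {
            for (const Dataset& d : sets) {
                if (d.component != component) continue;
                bool afterFirst = (run > d.firstRun) ||
                                  (run == d.firstRun && event >= d.firstEvent);
                bool beforeLast = (run < d.lastRun) ||
                                  (run == d.lastRun && event <= d.lastEvent);
                if (afterFirst && beforeLast) return &d;
            }
            return 0;   // not covered by any dataset
        }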

    For what concerns the access to data (part B), this is close to the skimmer problem, except that the skimmer is expected to duplicate the data. Hopefully there could be some common code between the two.

  2.  Here is what I was thinking of in terms of the ideal scenario... the one where we ignore cpu time, memory limitations, etc.

    As the events roll off the pipeline (from whatever source) they will be categorized. For example, the onboard filter could categorize them as gammas, heavy-ion candidates, or MIP candidates, and an offline process might have other categorizations to apply. Very importantly, an event can have multiple categorizations (e.g. it may have passed several of the onboard filters). Ideally, we would like a system that keeps track of the run/event number (as the unique identifier), what categories the event is in (perhaps just a bit mask), and some information on where to find the pieces of the event.

    If we were dealing with millions of events I'd say the best solution was a relational database that in one place kept track of all the information needed to find each event. However, this may not be practical.

    Instead, maybe a better solution is to keep an entry for each event in a database table that records the run/event number and its categorizations. One can then obtain lists of run/event numbers in whatever categorization one likes. From these lists, one uses the run numbers to look up, in another set of relational database tables, the information on where the events for that run are stored. So this second table would index run number against all the ROOT trees, ntuples, etc., that are associated with it.
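    As a sketch only (written here as C++ structs rather than the actual MySQL tables, and with made-up names), the two tables and the bit-mask categorization might look like this:

        // Sketch of the two-table idea; the category bits are placeholders.
        #include <string>
        #include <vector>

        enum EventCategory {            // bit mask: an event may set several bits
            PassedGammaFilter = 1 << 0,
            HeavyIonCandidate = 1 << 1,
            MipCandidate      = 1 << 2,
            GcrEvent          = 1 << 3
        };

        // Table 1: one row per event -- identifier plus its categorizations.
        struct EventRow {
            int      run;
            int      event;
            unsigned categories;        // OR of EventCategory bits
        };

        // Table 2: one row per (run, component) -- where that run's data lives.
        struct RunFileRow {
            int         run;
            std::string component;      // "digi", "recon", "mc", "merit", ...
            std::string file;           // path or xrootd URL of the ROOT file/ntuple
            std::string tree;           // tree (or ntuple) name inside the file
        };

        // Typical query flow: select events by category, then map their runs to files.
        std::vector<EventRow> selectByCategory(const std::vector<EventRow>& events,
                                               unsigned mask)
        {
            std::vector<EventRow> out;
            for (const EventRow& e : events)
                if (e.categories & mask) out.push_back(e);
            return out;
        }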

    The list of desired run/event numbers and the list of input files by run number can then be fed to the RootIo service/associated algorithms to read the events. One can take advantage of Root's indexing tools to build a table of indices vs keys (run/event number in this case) for "quick" access to the desired events.
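    For that last step, something along these lines (the branch names "EvtRun" and "EvtEventId" are placeholders for whatever the trees actually store):

        // Sketch: use ROOT's built-in index to jump to an event by run/event number.
        #include "TChain.h"

        void readByRunEvent(TChain& chain, int run, int event)
        {
            chain.BuildIndex("EvtRun", "EvtEventId");     // key -> entry table, built once
            Int_t nBytes = chain.GetEntryWithIndex(run, event);
            if (nBytes <= 0) {
                // event not present in this chain
            }
        }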

    I advocate this type of solution because:
    1) The current RootIo is *almost* already set up to handle this,
    2) There exists an extensive infrastructure of tools, provided by Joanne Bogart in the context of the calibrations, to access and manage information in a relational database,
    3) The above database tools are not tied to a particular database but currently use MySQL, which is freely available (meaning we can export this solution to the other proposed major data processing centers in Lyon and Italy),
    4) I think the above solution will be extensible to other applications within GLAST (so, from interleave to, perhaps, the data server),
    5) and, finally, I mostly advocate this approach because I think it breaks the general problem into several pieces, each of which can be developed independently/further as time goes on.

    Tracy

  3. I'm rudely inserting Anders' email here so we do not forget his concerns: It sounds like the db part of what Tracy describes is a partial duplication of the data catalog/server. Is this really what we want to do? We at least need to work out the division between the two to avoid duplicating work.
    What I originally(!) had in mind is entirely covered by the second part of what Tracy describes, i.e. given a run/event number I need to know where to get that event in the different data files. This is needed both for monitoring in the L1 pipeline (we need to open the digi and some other file at the same time) and for bookkeeping in the processing. I hope to get away from inserting empty events (or at least from depending on them), so that in the case where we digitize a complete run but only reconstruct part of it, I can use the master file to enforce the 1-to-1 correspondence in an automated way in the L1 pipeline. As a user I would also like to be able to use this master file in my Root Tree analysis.
    If I need to connect to a db to get my master file it will obviously complicate things, but it's possible that may be worked out in a user-friendly way. What about performance? A pipeline job will deal with millions of events. What would be the approach to fill and read the db tables without any performance penalty? Would all interfaces now expecting a Root file have to be rewritten to expect a db connection and then a Root file?
    Individual users may also want to add files - a db approach seems less flexible in that sense. But maybe one can have a 'write out to file' option.
    Anders