Overview

Currently the data catalog is used directly by Fermi and EXO, and indirectly, as a side effect of using the pipeline, by several other groups (CDMS, CTA).

LSST is interested in using the data catalog (for test data at least, and possibly for DESC and data handling if its dependence on Oracle were removed). The download manager is currently used by SSRL, which could potentially use more of the data catalog to make data that is not currently accessible from JCSG directly available from SLAC (although they do not currently seem very interested in this).

The data catalog is also potentially usable as part of a future photon science data portal. 

This page explores the possibilities for future development of the data catalog, assuming for the moment that we have unlimited resources to achieve this.

Current features

  • Ability to define logical folders to arrange data independent of physical location
  • Ability to define groups of files of the same format (logical datasets)
  • Arbitrary meta-data associated with files, folders and groups. 
  • Download manager for downloading large numbers of files. Allows resume of partial downloads and download of updated/new files. Offline 
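The first three features can be pictured with a small in-memory model. This is only a sketch for illustration: the class names, fields, and the example paths are hypothetical and are not taken from the actual catalog schema.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A catalogued file: logical name, format, physical locations, meta-data."""
    name: str
    file_format: str
    locations: list[str]
    metadata: dict[str, object] = field(default_factory=dict)


@dataclass
class Folder:
    """A logical folder; its path says nothing about where the files live."""
    name: str
    metadata: dict[str, object] = field(default_factory=dict)
    subfolders: dict[str, "Folder"] = field(default_factory=dict)
    datasets: dict[str, Dataset] = field(default_factory=dict)

    def make_folder(self, path: str) -> "Folder":
        """Create (or return) the nested folder addressed by 'a/b/c'."""
        node = self
        for part in filter(None, path.split("/")):
            node = node.subfolders.setdefault(part, Folder(part))
        return node

    def find(self, **criteria: object) -> list[Dataset]:
        """Return datasets under this folder whose meta-data match all criteria."""
        hits = [d for d in self.datasets.values()
                if all(d.metadata.get(k) == v for k, v in criteria.items())]
        for sub in self.subfolders.values():
            hits.extend(sub.find(**criteria))
        return hits


# Hypothetical example: the logical path and the physical path are unrelated.
root = Folder("")
run = root.make_folder("EXO/raw/run42")
run.datasets["evt001"] = Dataset(
    "evt001", "root", ["/nfs/disk1/evt001.root"], {"run": 42})
matches = root.find(run=42)
```

Because the physical location is just an attribute of the dataset, files can be moved or replicated without changing the logical folder layout that users browse.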

Towards data catalog 2.0

Improved modularity

The data catalog currently has five components:

  1. Web interface
  2. Line mode client
  3. Database back-end
  4. Download manager
  5. File "crawler"

The download manager and crawler are not strongly coupled to the rest of the system, but the first three items are fairly tightly coupled; in particular, they all access the database directly. Ideally the web interface would access the database via an abstraction layer, so that it could be used with any back-end providing similar functionality. If we add RESTful interfaces to the web interface, it would probably make sense for the line mode client to use those interfaces (this would remove the current restriction that the line mode client can run only at SLAC).
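One way to picture such an abstraction layer is an interface that front ends program against, with the database hidden behind it. The sketch below is illustrative only: `CatalogBackend`, `InMemoryBackend`, and `render_listing` are hypothetical names, and the in-memory class stands in for a real database-backed or REST-backed implementation.

```python
from typing import Protocol


class CatalogBackend(Protocol):
    """Operations a front end needs (hypothetical, not the real catalog API)."""

    def list_folder(self, path: str) -> list[str]: ...

    def get_metadata(self, path: str) -> dict[str, object]: ...


class InMemoryBackend:
    """Stand-in back-end; a real one would talk to a database or REST service."""

    def __init__(self, tree: dict[str, list[str]],
                 metadata: dict[str, dict[str, object]]) -> None:
        self._tree = tree
        self._metadata = metadata

    def list_folder(self, path: str) -> list[str]:
        return sorted(self._tree.get(path, []))

    def get_metadata(self, path: str) -> dict[str, object]:
        return dict(self._metadata.get(path, {}))


def render_listing(backend: CatalogBackend, path: str) -> str:
    """Front-end code: depends only on the protocol, never on the database."""
    lines = [f"{path}:"]
    for child in backend.list_folder(path):
        meta = backend.get_metadata(f"{path.rstrip('/')}/{child}")
        lines.append(f"  {child} {meta}")
    return "\n".join(lines)


backend = InMemoryBackend({"/": ["b", "a"]}, {"/a": {"size": 1}})
listing = render_listing(backend, "/")
```

The web interface and the line mode client could both be written like `render_listing`, so swapping the back-end would not require changes to either.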

New Features

  • Access control lists. Fermi had no need for fine-grained access control: you can either log in and access the data catalog and related data, or you cannot. For EXO we added an ad hoc method that uses meta-data to "hide" the EXO data from non-EXO users. For more general use, a more flexible method of access control would probably be desirable.
  • Enhanced web interface using HTML5 features.
    • Create elements of the tree only as they are browsed for faster application loading
    • Time-consuming data (such as total size of all files in a folder) should be populated into the web interface asynchronously
    • Drag and Drop: it should be possible to drag files between the local file system and the data catalog. Potentially this would remove the need for the separate Java download manager (which does not work on mobile devices/tablets).
  • Remote access to data. Currently the functionality of the data catalog stops at the file level. Accessing any data within the files requires either downloading the files, or extracting the file paths and feeding them into a separate tool that analyses the data. Ideally the web interface would also make it possible to drill into the data files themselves, for example to:
    • View images or tables within a FITS file
    • View histograms within a ROOT file
    • Tabulate/explore tuples within files
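As one possible shape for the access control item above, the sketch below attaches a set of allowed groups to folder paths and lets sub-folders inherit from the nearest ancestor that has a rule. The inheritance policy, the class name, and the group and folder names are all assumptions made for illustration; the page does not specify a design.

```python
from dataclasses import dataclass, field


@dataclass
class AccessControl:
    # Maps a folder path to the set of groups allowed to read it.
    rules: dict[str, set[str]] = field(default_factory=dict)

    def allow(self, path: str, *groups: str) -> None:
        """Grant read access on a folder (and its descendants) to some groups."""
        self.rules.setdefault(path.rstrip("/") or "/", set()).update(groups)

    def can_read(self, path: str, user_groups: set[str]) -> bool:
        """Apply the rule on the nearest ancestor folder; deny if none exists."""
        current = path.rstrip("/") or "/"
        while True:
            if current in self.rules:
                return bool(self.rules[current] & user_groups)
            if current == "/":
                return False
            # Move to the parent folder; the parent of "/EXO" is "/".
            current = current.rpartition("/")[0] or "/"


acl = AccessControl()
acl.allow("/", "users")      # any logged-in user can read by default
acl.allow("/EXO", "exo")     # the EXO subtree is restricted to EXO members

exo_visible_to_user = acl.can_read("/EXO/raw/run42", {"users"})
exo_visible_to_member = acl.can_read("/EXO/raw/run42", {"users", "exo"})
fermi_visible_to_user = acl.can_read("/Fermi/data", {"users"})
```

A scheme of this kind would replace the meta-data based hiding with an explicit rule on the folder, which is easier to audit and applies uniformly to every client.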

Database Independence

Currently the data catalog is tightly coupled to Oracle through its use of:

  • Java stored procedures
  • Hierarchical queries
  • Miscellaneous use of Oracle-specific SQL

Ideally this dependence could be removed through the use of a proper object-relational mapping layer, such as Hibernate.
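For the hierarchical queries in particular, one portable option is a standard recursive common table expression (`WITH RECURSIVE`), which many relational databases support. The sketch below assumes the current queries use Oracle's hierarchical query syntax to walk a folder tree; the `folder` table, its columns, and the sample rows are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical folder table: each row points at its parent folder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE folder (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO folder VALUES
        (1, NULL, 'EXO'), (2, 1, 'raw'), (3, 2, 'run42'), (4, NULL, 'Fermi');
""")

# Walk the subtree below one folder, building each logical path on the way.
DESCENDANTS = """
    WITH RECURSIVE subtree(id, path) AS (
        SELECT id, name FROM folder WHERE id = ?
        UNION ALL
        SELECT f.id, s.path || '/' || f.name
        FROM folder AS f JOIN subtree AS s ON f.parent_id = s.id
    )
    SELECT path FROM subtree ORDER BY path
"""

paths = [row[0] for row in conn.execute(DESCENDANTS, (1,))]
```

An object-relational mapping layer does not by itself generate queries like this, so tree traversal would probably still need either a portable query of this form or a schema change such as storing the full path on each row.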

Integration with other tools

  • iRODS? SAM? Globus Online? DIRAC?

Implementation Language
