Monday, 5 November 2012

Progress on Mirage Data Migration

The Disk Space Issue

The main issue preventing us from rolling out Mirage for the majority of CMM's instruments is disk space.  If we did a "full" roll-out, we would fill up the available disk space within a few months, and then we would have to start deleting older files to make space.  That would mean that users would need to manage their own long-term file storage, and we would essentially be back where we started.

Then there is the new 3View system, which is capable of generating 10 GB of data in a single night, or 20 GB after rendering the images as TIFF files.

Managing Lots Of Data

We are addressing the issue of "too much data, not enough disk space" in two ways:
  • For data that needs to be kept online, we are currently implementing a scheme where individual data files are "migrated" to secondary online locations within UQ.  As far as users are concerned, the data files will be accessed as before, except that access will be just a little bit slower.
  • For data that no longer needs to be online, we will be implementing an archiving scheme in which snapshots of entire experiments complete with all relevant metadata are saved to offline storage.

Progress So Far

I am currently developing the code for the datafile migration subsystem for Mirage.  The basic file migration code is working in the Mirage test system, and the code that will decide what files to migrate and when to migrate them is in progress.  The initial migration system will take into account the size of the individual files, their file types, and when they were created and last accessed.  Later on, I intend to allow users to indicate the relative importance of files, datasets and experiments to influence the migration decision making.
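
To make the decision step more concrete, here is a minimal sketch of how file size, file type, and creation / last-access times could be combined into a ranking.  It is not the actual Mirage code: the DataFile class, the type weights, the scoring formula and the 30-day idle threshold are all assumptions made up for illustration.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataFile:
    # Hypothetical record; the real system keeps this information elsewhere.
    path: str
    size_bytes: int
    created: datetime
    last_accessed: datetime


# Assumed weights: bulky image formats are better migration candidates
# than small text files.  These numbers are illustrative only.
TYPE_WEIGHTS = {".tif": 1.0, ".txt": 0.2}
DEFAULT_TYPE_WEIGHT = 0.5


def migration_score(f, now):
    """Higher score means 'migrate sooner': big, old, rarely used files."""
    ext = os.path.splitext(f.path)[1].lower()
    weight = TYPE_WEIGHTS.get(ext, DEFAULT_TYPE_WEIGHT)
    size_gb = f.size_bytes / 1e9
    idle_days = (now - f.last_accessed).days
    age_days = (now - f.created).days
    return weight * size_gb * (idle_days + 0.1 * age_days)


def select_for_migration(files, bytes_to_free, now, min_idle_days=30):
    """Pick the highest-scoring idle files until enough space would be freed."""
    idle_cutoff = timedelta(days=min_idle_days)
    candidates = [f for f in files if now - f.last_accessed >= idle_cutoff]
    candidates.sort(key=lambda f: migration_score(f, now), reverse=True)

    selected, freed = [], 0
    for f in candidates:
        if freed >= bytes_to_free:
            break
        selected.append(f)
        freed += f.size_bytes
    return selected
```

A user-supplied "importance" rating, as mentioned above, could later be folded in as one more multiplier in the score.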

(For MyTardis folks, the Mirage migration code is actually a MyTardis "app".  To use it, you will need to set up one or more secondary "destinations", which can simply be private WebDAV servers on some other machine with lots of disk space.  Look in my MyTardis repo on GitHub for the code.)
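
For readers wondering what "migrating" a file to a WebDAV destination involves, the following is a rough sketch using only the Python standard library.  It is not the code in the MyTardis app: the function names are hypothetical, authentication is left out, and the copy-verify-remove sequence is simply one reasonable way to do it.

```python
import hashlib
import os
import urllib.request


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def webdav_put(url, data):
    """Store bytes at the URL using HTTP PUT, which WebDAV servers accept."""
    req = urllib.request.Request(url, data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status


def webdav_get(url):
    """Fetch the stored bytes back from the destination."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def migrate_file(local_path, dest_url, put=webdav_put, get=webdav_get):
    """Copy a file to the destination, verify it, then remove the local copy.

    Returns True if the file was migrated, or False if verification failed
    (in which case the local file is left untouched).  For simplicity the
    whole file is read into memory; real code would stream large files.
    """
    with open(local_path, "rb") as fh:
        data = fh.read()

    put(dest_url, data)

    # Only delete the local copy once the remote copy is known to match.
    if sha256_hex(get(dest_url)) != sha256_hex(data):
        return False

    os.remove(local_path)
    return True
```

The put and get parameters exist so that the transfer step can be swapped out, for example for a different kind of destination or for testing without a server.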

The other aspect that needs to be sorted out is actual disk space provisioning.  I have negotiated some space on the UQ HPC cluster for interim storage, but "the real thing" will be implemented on the QERN system that QCIF is currently developing.  We are currently "on the list" for transition onto QERN.