Monday 5 November 2012

Progress on Mirage Data Migration

The Disk Space Issue

The main issue that is preventing us from rolling out Mirage for the majority of CMM's instruments is disk space.  Basically, if we did a "full" roll-out, we would fill up the available disk space within a few months, and then we would have to start deleting older files to make space.  That would mean that users would need to manage their own long-term file storage, and we would essentially be back where we started.

Then there is the new 3View system, which is capable of generating 10 GB of data in a single night, or 20 GB after rendering the images as TIFF files.

Managing Lots Of Data

We are addressing the issue of "too much data, not enough disk space" in two ways:
  • For data that needs to be kept online, we are currently implementing a scheme in which individual data files are "migrated" to secondary online locations within UQ.  As far as users are concerned, the data files will be accessed as before, except that access will be just a little bit slower (see the sketch after this list).
  • For data that no longer needs to be online, we will be implementing an archiving scheme in which snapshots of entire experiments, complete with all relevant metadata, are saved to offline storage.
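
To make the first of those points a little more concrete, here is a minimal sketch of "transparent" access to a migrated file.  The names are made up for illustration (this is not the actual MyTardis/Mirage code): the record for a datafile knows whether its bytes are still on primary storage or have moved to a secondary location, and the access layer simply reads from whichever one currently holds the file.

    # Minimal sketch of transparent access to a migrated file.  The class and
    # function names here are hypothetical, not the real MyTardis/Mirage ones.
    from dataclasses import dataclass
    from typing import Optional
    import urllib.request

    @dataclass
    class DataFileRecord:
        local_path: Optional[str]      # set while the file sits on primary storage
        secondary_url: Optional[str]   # set after migration (e.g. a WebDAV URL)

    def read_datafile(record: DataFileRecord) -> bytes:
        """Return the file contents from wherever the file currently lives."""
        if record.local_path is not None:
            with open(record.local_path, "rb") as f:
                return f.read()
        if record.secondary_url is not None:
            # The slower path: fetch the bytes from the secondary location.
            with urllib.request.urlopen(record.secondary_url) as response:
                return response.read()
        raise FileNotFoundError("datafile has no known location")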

Progress So Far

I am currently developing the code for the datafile migration subsystem for Mirage.  The basic file migration code is working in the Mirage test system, and the code that will decide what files to migrate and when to migrate them is in progress.  The initial migration system will take into account the size of the individual files, their file types, and when they were created and last accessed.  Later on, I intend to allow users to indicate the relative importance of files, datasets and experiments to influence the migration decision making.
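
To give a flavour of the kind of decision making involved, here is a hypothetical scoring heuristic in Python.  It is not the actual Mirage code, and the weights and the list of "bulky" file types are pure assumptions, but it shows how file size, file type, age and time since last access could be combined to rank migration candidates.

    # Hypothetical heuristic for ranking migration candidates.  The weights
    # and the BULKY_TYPES set are illustrative assumptions, not Mirage's.
    import os
    import time

    BULKY_TYPES = {".tif", ".tiff", ".raw"}   # formats we are keen to move

    def migration_score(path, now=None):
        """Higher score = better candidate for migration to secondary storage."""
        if now is None:
            now = time.time()
        st = os.stat(path)
        days_since_access = (now - st.st_atime) / 86400.0
        days_since_created = (now - st.st_ctime) / 86400.0
        size_mb = st.st_size / (1024.0 * 1024.0)
        score = size_mb * 0.5                # big files free up more space
        score += days_since_access * 2.0     # untouched files are safer to move
        score += days_since_created * 0.5    # older files are less likely to be "hot"
        if os.path.splitext(path)[1].lower() in BULKY_TYPES:
            score *= 1.5                     # bulky image formats get a boost
        return score

    def pick_candidates(paths, space_needed_mb):
        """Pick the highest-scoring files until enough space would be freed."""
        chosen, freed = [], 0.0
        for path in sorted(paths, key=migration_score, reverse=True):
            if freed >= space_needed_mb:
                break
            chosen.append(path)
            freed += os.stat(path).st_size / (1024.0 * 1024.0)
        return chosen

A periodic migration job would then call something like pick_candidates(all_online_files, space_needed_mb) and hand the chosen files to the migration code, with the user-supplied "importance" hints eventually feeding into the score as an extra term.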

(For MyTardis folks, the Mirage migration code is actually a MyTardis "app".  To use it, you will need to set up one or more secondary "destinations", which can simply be private WebDAV servers on some other machine with lots of disk space.  Look in my MyTardis repo on GitHub for the code.)
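
Configuring a destination will look something along the following lines.  The setting names and values here are placeholders only (check the README in the repo for the real option names); the point is simply that a destination is a named endpoint, typically a WebDAV URL, plus whatever credentials it needs.

    # Illustrative settings sketch only -- these keys are placeholders, not
    # the app's real setting names; see the repo's README for the actual ones.
    MIGRATION_DESTINATIONS = [
        {
            "name": "hpc-interim",
            # A private WebDAV server on a machine with lots of disk space.
            "transfer_type": "webdav",
            "base_url": "https://webdav.example.org/mirage/",
            "username": "mirage",
            "password": "change-me",
        },
    ]

    # Which destination newly migrated files should go to by default.
    DEFAULT_MIGRATION_DESTINATION = "hpc-interim"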

The other aspect that needs to be sorted out is actual disk space provisioning.  I have negotiated some space on the UQ HPC cluster for interim storage, but "the real thing" will be implemented on the QERN system that QCIF is currently developing.  We are currently "on the list" for transition onto QERN.

2 comments:

  1. Are you migrating to QERN or QCloud? I think there is a difference in terms of sustainability: QERN is a prototype system (not sure what the sunset date on it is), whereas QCloud will be a production system governed by contractual guarantees to operate beyond 2014.

    Replies
    1. I need to clarify this, but my understanding was that QERN is the production system (pQERN was the prototype). But either way, the good news is that our virtual machine is due to be provisioned later this week.
