DATa MANager: a system for managing research MRI data

DATa MANager: a system for managing research MRI data

In 2014, we were tasked with building a system for managing incoming DICOM data from both our internal MRI center and external sites. DICOM data stores 3D data as a set of files, one for each 2D slice. For 4D data (common with functional MRI and data using for white matter fibre tracking), the data is still stored as a set of 2D slices. Therefore a single 60-slice image with many timepoints can consist of > 5000 files. The thing linking these files is their headers. These headers are not generally standardized across hardware vendors, and are even subject to change between software upgrades on a single piece of hardware. The research market makes up such a small percentage of the total MRI market that the major vendors (GE, Siemens, Phillips) to not dedicate many resources to keeping these crucial fields consistent. This is crucial, because software upgrades also cause real changes in the data that we need to be able to detect as soon as possible.

At the time, the system consisted of an FTP server, and a fine young gentleman who would semi-manually unzip incoming packages of DICOMS, run conversion software on those data to create a single file per series (making them much easier to handle while deleting most of the header information), and if any analysis was required, he also manually ran those scripts on the data. The major issue was, aside from the strategy not being scalable, that the provinance of the data was often questionable. After having done this for 3 years, there were many examples of edge-case data that were handled differently than the rest, but no paper trail existed to tell us why that was. Many computer professionals might find this alarming (it is), but what is even more alarming is that this is status-quo in research. However I was fortunate enough to work for a lab that was well funded and wanted to scale their analysis beyond what is possible to do by hand, so over the years I and a small team developed a fully-automated system.

DATMAN (the DATa MANager) is a set of Python scripts that link together a myriad of pipelines, a QC dashboard system, and an XNAT database. Data collected within the hospital is automatically uploaded to XNAT, while external users have to upload data to XNAT manually. Therefore XNAT is the first point of common contact for all data, and we use it primarily for it’s ability to automatically understand DICOM headers, and organize the data accordingly. They’ve gone through a lot of pain (I imagine) making that system work across all of the major platforms, so it has been very helpful to leverage that. Paired data (e.g., notes acquired at the scanner, or behavioural scores required to analyze the functional MRI data), can be stored in XNAT using a flat folder strucutre.

Staying Organized

Next, we have a set of scripts that interact with DATMAN via it’s REST interface to

  1. Identify all DICOM series available.
  2. Use RegEx to map each series to a set of ‘tags’.
  3. Export a single DICOM (2D) image from each series (so we can hold on to the complete header if we need it later).
  4. Export a ‘compiled’ series (3D/4D) for ease-of-analysis.
  5. Export all paired data.

The tag system is how we map from the unique information in each DICOM header to the probable ‘kind’ of data each series represents. This lets us feed files through the appropriate QC or analytic pipeline downstream. For example, the _T1_ tag denotes an anatomical MRI, and we configure the system to measure the signal-to-noise ratio of these kinds of images, as well as calculate statistics about brain-region volume and thickness of the cortex.

The core concept in DATMAN is the structure of the filenames. They contain all the information needed to know:

  1. Who this scan came from.
  2. For what purpose the scan was collected.
  3. The exact pipelines that should be run on this scan.

Filenames look like this: STUDY_SITE_SUBJECT_TIMEPOINT_REPEAT, e.g. (SPINS_CMH_0001_01_01).

The above tells me:

  • This subject came from the SPINS study.
  • It was collected at the site CMH (Centre for Addiction and Mental Health).
  • It is subject 0001.
  • This subject was scanned for session 01. In multi-timepoint studies (e.g., baseline and 6-month follow up), this field tracks which timepoint we’re at.
  • This data is the first repeat of the baseline. Sometimes people need to be brought back in for a repeat scan (in the case of data corruption or scheduling conflicts), so we needed a way to distinguish between the repeat of the baseline, vs. the first acquisition of the 6-month follow up. If we only kept track of ordinal information, we couldn’t distinguish between a repeat of a baseline or a 6-month follow up.

Quality Control

At this point, we have converted data available on our system. Now we need to make sure of a few things:

  1. Did we get everything we were promised?
  2. Is everything we recieved useful?

To ensure we have everything, we use a simple configuration file that maps between the DICOM headers and the eventual tags, as well as the expected number of files that should match that tag. If there’s a mismatch, the system warns us.

To ensure everything is useful, we first run all of the data through a myriad of quality control pipelines. The majority of them I’ve organized in the qcmon project. Without getting into too much detail, this system:

  1. Quantifies the signal-to-noise, and other simple quality metrics of each file.
  2. Quantifies and flags artifacts. The big ones here are slice-dropouts in white matter tracking data, and head motion in the functional MRI data.
  3. Matches key fields of the file’s headers with ‘gold standard’ DICOMs stored at the beginning of the study to catch any parameter-drift.
  4. Generates visualizations that make it easy to spot structured noise in the data that could signify hardware issues.

This information is then collected and organized into a fairly sophisticated dashboard system that can be used to perform quality checks, as well as track these metrics over time (to ensure quality does not drift too much over the course of a multi-year study). More on the dashboard in another post.


Finally, to facilitate the researchers, the data that passes QC is fed through a large number of analytic pipelines, and the results of them are stored in a regular way by our system. Each is quite complicated and demands it’s own post, so what follows is an overview:

  1. Cortical surface segmentation (freesurfer, civet, ciftify).
  2. Functional MRI preprocessing (epitome).
  3. White matter tracking (UKFTractography, DTIPrep, ENIGMA-DTI, camino).
  4. Subcortical segmentation (MAGtT-Brain).

Broadly, the above are used to measure the thickness and volume of brain regions, including the cortex and sub-cortical regions; connectivity between brain regions as measured by both white matter tracking and functional MRI; brain engagement while performing actions in the scanner; and metrics that represent brain health, or integrity, such as the purity of a white matter tracts. These metrics have proven to be useful in understanding the subpopulations hidden among the patients we study.