I'm a very big fan of organizing datasets (and not just fMRI datasets) following "just enough" of the BIDS standard. We now have a strategy that is working fairly well for a number of datasets and collaborations, and this (soon to be) series of posts is intended to serve as an introduction and quick-start guide, especially for people not already familiar with BIDS. A recording of me giving a talk covering much of the same material is at https://osf.io/w7zkc/files/ycgd3.
But first, a warning. I use BIDS in two distinct contexts: 1) transient, made for fMRIPrep preprocessing and then deleted; and 2) permanent, for managing a particular research project's dataset. "Just enough" BIDS is only for the second case; fMRIPrep (and other BIDS Apps) absolutely require the input files to be fully valid BIDS (and things can go badly wrong if anything is mis-specified). I'll return to the idea of "transient" BIDS later; the focus here is on the dataset management context.
why do we need any BIDS?
This seems to be a case of "if you know, you know" ... many of us can confirm that trying to understand data collected by someone else (or by you, a few years ago) can be an absolute nightmare; even answering basic questions like how many participants the dataset has and what tasks they performed can take substantial effort. Worse still are cases where a shared dataset turns out to be unusable because key information is missing, or when confusion about it leads to incorrect conclusions. Even deciding how to arrange your own datasets can be daunting, and poor initial organization choices can be costly to correct later.
a tiny bit of BIDS: six top-level directories
Arranging a dataset in six top-level directories is a (relatively) small step that can simplify management. For example, below is what I see in the project root directory of "MLX", a dataset I'm currently working with. Its files are in six subdirectories: analysis, code, derivatives, rawdata, sourcedata, and stimuli, the names and contents of which follow the BIDS standard.
Also in the root is MLX_datasetDescription.docx, an informal Scientific Data descriptor-type article. Its purpose is to hold key information that anyone (in the lab or publicly) considering using the dataset should know. This overlaps with parts of the BIDS README, CHANGES, and dataset_description.json files, but we've had more implementation success asking people to read and maintain a single flexible-format word processing document than the formal separate files. (The last file, fuzzKittens.JPG, is a picture of tiny foster kittens, used to confirm that the directory permissions were working but too cute to delete after.)
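If you'd like to start a new project with this layout, something like the following Python sketch could set it up and flag anything else sitting in the project root. It is only an illustration, not a script we actually use: the function names make_skeleton and unexpected_entries are made up for this example, and the only thing taken from the description above is the list of six directory names.

```python
from pathlib import Path

# The six top-level directory names described in this post.
TOP_LEVEL_DIRS = ("analysis", "code", "derivatives", "rawdata", "sourcedata", "stimuli")


def make_skeleton(project_root):
    """Create whichever of the six directories are missing; return their names.

    Hypothetical helper for illustration only.
    """
    root = Path(project_root)
    created = []
    for name in TOP_LEVEL_DIRS:
        path = root / name
        if not path.is_dir():
            path.mkdir(parents=True)
            created.append(name)
    return created


def unexpected_entries(project_root):
    """List items in the project root that are not one of the six directories.

    A description document (or a kitten picture) will show up here; that is
    fine, the point is just to notice anything that has drifted into the root.
    """
    root = Path(project_root)
    return sorted(p.name for p in root.iterdir() if p.name not in TOP_LEVEL_DIRS)
```

Running make_skeleton on an empty directory creates all six subdirectories; running it again creates nothing and returns an empty list, so it is safe to rerun on an existing project.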
What goes into each of the six top-level directories?
/stimuli/ is for task stimuli; more generally, all the files and details needed to run or understand what the participants (and experimenters) did, arranged as desired. This can include SOPs, experimenter scripts, REDCap dictionaries, E-Prime files, training materials, recordings of someone doing the tasks, etc., etc.
These screen captures show the stimuli directory for two different projects to give a sense of how they can vary; the project at left had multiple visits, tasks, and types of data collection so has many more subdirectories than the single-task LifespanStroopCW project at right.
/sourcedata/ is for the data files as originally collected, unaltered. There's not a particular naming, format, or organization required for sourcedata: however you collected 'em is fine (if documented). The MLX and MTurk2018 examples below have different sourcedata subdirectories, each with an idiosyncratic combination of file names and types.
Some datasets may need to have much of the sourcedata archived elsewhere. For example, the dualmechanisms dataset below only has a readme.txt file for sourcedata because it was an fMRI study, and our fMRI files go directly from the scanners to a university XNAT database. Keeping a copy of them all under sourcedata would be redundant (and a lot of storage overhead), and could jeopardize participant privacy (e.g., DICOM files usually have identifiable details).
/rawdata/ is for data files that have been rearranged, renamed, and probably converted to a different format from the one in which they were collected (and stored in sourcedata), but not fundamentally transformed. I suggest that the rearranging and renaming in rawdata follow the BIDS standard more closely, as will be explained in a later post (and the recorded talk).
Here are examples of rawdata from three different datasets; note that in each case the directory structure and file types are consistent.
/derivatives/ is for altered versions of the rawdata and sourcedata files.
For fMRI datasets, preprocessed images are the derivatives: these images have been spatially normalized, smoothed, converted to surfaces, etc., which are much more substantial changes than converting a file from DICOM (sourcedata) to NIfTI (rawdata).
But the idea can apply to many other types of data.
For example, some of our projects have a lot of trial-level behavioral data for each participant. The BIDS standard is for these to be arranged in multiple _beh.tsv files (which we do in rawdata), but some of our collaborators prefer to work with fewer larger files of data from multiple participants and sessions, which we store in derivatives subdirectories. I generally put files with data from more than one participant and/or session under derivatives, including individual differences questionnaire responses (in NDAR format; more details and examples will be in a later post).
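As a rough illustration of that rawdata-to-derivatives step, here is a minimal Python sketch that stacks every _beh.tsv file found under rawdata into one larger tab-delimited file. This is not our actual conversion code: the function name combine_beh_files, the added source_file column, and the choice to fill missing columns with "n/a" (the BIDS convention for missing values) are all assumptions made for the example.

```python
import csv
from pathlib import Path


def combine_beh_files(rawdata_dir, out_path):
    """Stack all *_beh.tsv files under rawdata_dir into a single TSV.

    Hypothetical helper: adds a source_file column so each row can be traced
    back to the per-participant file it came from. Returns the number of rows.
    """
    rows = []
    columns = ["source_file"]
    for tsv in sorted(Path(rawdata_dir).rglob("*_beh.tsv")):
        with tsv.open(newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                row["source_file"] = tsv.name
                rows.append(row)
                # Collect the union of column names, in first-seen order.
                for key in row:
                    if key not in columns:
                        columns.append(key)

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as handle:
        # restval fills columns that a given input file did not have.
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t", restval="n/a"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The output path would point somewhere under derivatives (for example, a made-up derivatives/combined/ subdirectory), so the combined file never mixes with the per-participant files in rawdata.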
The derivatives for these three example datasets have some similarities but substantial variations, due to the different data types and analysis requirements. For instance, the precisionneuroscience dataset has EEG data, which is being processed by BVA. BVA has its own requirements for dataset organization and its preprocessing can be done in various ways, so we decided to put each into its own derivatives subdirectory.
/code/ is for code (backed up in a version control system), especially that used for processing, converting, and quality-checking the dataset.
/analysis/ is for quality control summary documents, analysis code, results files, manuscripts, etc.
BIDS has neither file name nor content requirements for the analysis or code subdirectories.
It is working reasonably well in our collaborations to use the code subdirectory for "official" dataset maintenance and conversion-type scripts, and to keep analysis very flexible (e.g., with individuals making their own "working" subdirectories in which they run ongoing analyses however they wish).
more examples!
I think it's easiest to get accustomed to BIDS by working with some already-formatted datasets, but hopefully the various examples and explanation here (and in the talk) will give a sense and serve as a starting point. For dataset management, I believe adopting the convention of these six top-level directories plus a description document is beneficial on its own, without adding too much procedural overhead.
several full ascii tree-style examples after the jump