This post is the second in a series introducing "just enough" BIDS; please start with the introduction. As explained in that first post (and this talk), organizing a dataset into six top-level directories following the BIDS standard can simplify and improve its sharing, storage, and management.
When introducing each of the six directories I explained their contents only in general terms: each should hold specific types of information, but the files don't need to be organized and named in a particular way ... except for rawdata. As I put it in that first post,
/rawdata/ is for data files that have been rearranged, renamed, and probably converted to a different format than they were collected (and stored in sourcedata) in, but not fundamentally transformed. I suggest that the rearranging and renaming in rawdata be to follow the BIDS standard more closely, as will be explained in a later post here (and the recorded talk).
For "just enough" BIDS, I suggest starting by using the standard's file naming and organization scheme (i.e., subdirectory structure) in rawdata, then adding key .json files (or their equivalent), and finally considering converting all files to the specified types (e.g., .tsv instead of .csv). I'll explain each of these pieces in turn. (Warning: if you'll be running fMRIPrep or another BIDS App on the rawdata it's not safe to disregard any part of the BIDS standard!)
The screen captures in this post are from various datasets, but always of rawdata. I hope that as you read the explanations you will start to recognize the datasets' consistent structure and logic, and so be able to more easily find the information you're looking for in any BIDS dataset.
BIDS file naming
BIDS file names are usually (very) long, but systematic and interpretable. For example, from the names in the screen capture below we know that these files contain behavioral (_beh) data for participant (sub-) f1027ao's (_ses-) wave1bas lab visit, during which they did the Axcpt, Cuedts, Stern, and Stroop tasks (_task-) twice each (_run-).
The documentation goes into a lot of detail on the abbreviations, keys, ordering, etc. to use in BIDS filenames. I suggest following its entity definitions as closely as possible (and throughout the experiment); consistent use of even just the sub-, _ses-, _task-, and the data type suffix (here, _beh) makes it much easier to understand (and navigate) the contents of a dataset. People have all sorts of naming conventions, but I think the clarity and ease of communication of a universal standard outweigh most other considerations (e.g., "visit" instead of "session", aesthetic dislike of long file names, capitalization preferences).
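Because the names are systematic, they can also be read by a program. Below is a minimal Python sketch of that idea (Python is used here only for illustration). The function name parse_bids_name and the example file name are hypothetical; the example name is assembled from the entities described above, not copied from the dataset.

```python
def parse_bids_name(filename):
    """Split a BIDS-style file name into its entities, suffix, and extension."""
    # Everything before the first "." is the name; the rest is the extension.
    stem, _, extension = filename.partition(".")
    # The last underscore-separated piece is the suffix (e.g., "beh");
    # the earlier pieces are key-value entities such as "sub-f1027ao".
    *pairs, suffix = stem.split("_")
    entities = {}
    for pair in pairs:
        key, _, value = pair.partition("-")
        entities[key] = value
    if extension:
        extension = "." + extension
    return entities, suffix, extension


# Hypothetical file name, built from the entities discussed in the text.
example = "sub-f1027ao_ses-wave1bas_task-Axcpt_run-1_beh.tsv"
entities, suffix, extension = parse_bids_name(example)
print(entities, suffix, extension)
```

This sketch does no validation (it does not check entity order or allowed characters); it only shows that consistent naming makes the pieces recoverable.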
For "just enough" BIDS I recommend closely following the file naming definitions, but relaxing some aspects when reasonable. For example, imagine an online behavioral study in which data was collected from hundreds of individuals in a single day. If the experimenters downloaded that data as a large spreadsheet, reformatting it into many hundreds of separate files (one per person per task, named sub-1_task-A_beh.tsv, sub-1_task-B_beh.tsv, sub-2_task-A_beh.tsv, ....) may seem ridiculous and a waste of time. If so, I'd recommend creating two large files instead: sub-all_task-A_beh.tsv and sub-all_task-B_beh.tsv. sub-all is not valid BIDS (sub- should refer to a single participant), but in my opinion does follow the logic and spirit of BIDS, makes the dataset easier to use and understand, and is far preferable to arbitrary file names and formats (e.g., download.xlsx).
BIDS subdirectories
The BIDS standard is for each person to have their own rawdata subdirectory (named sub- then the subject ID), which in turn has subdirectories for each session (ses-) and/or data type. In the example above, we can see from the subdirectories that participant f1027ao (sub-) came to the lab three times, completing the wave1bas, wave1pro, and wave1rea sessions (ses-), while f1342ku (below) has three additional (wave2) sessions.
Sessions often correspond to separate lab visits, but not always (as put in the BIDS standard, a "session" is "A logical grouping of neuroimaging and behavioral data consistent across subjects."). If each participant only interacted with the study once (like in the online study example), the study doesn't have sessions, and so participants don't need ses- subdirectories.
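The subdirectory logic can be sketched as a small path-building function. This is an illustration under assumptions, not part of any BIDS tool: the function name rawdata_path and its arguments are hypothetical, and the session and data type levels are simply left out when they are not given.

```python
from pathlib import PurePosixPath


def rawdata_path(sub, suffix, extension, ses=None, task=None, run=None, datatype=None):
    """Build a rawdata path; session and data type levels are optional."""
    # File name: key-value entities in order, skipping any that are absent.
    parts = [("sub", sub), ("ses", ses), ("task", task), ("run", run)]
    name = "_".join(f"{key}-{value}" for key, value in parts if value is not None)
    # Directory: one level per participant, then (optionally) session and data type.
    directory = PurePosixPath("rawdata") / f"sub-{sub}"
    if ses is not None:
        directory = directory / f"ses-{ses}"
    if datatype is not None:
        directory = directory / datatype
    return directory / f"{name}_{suffix}{extension}"


# A study with sessions, and a single-contact study without them.
print(rawdata_path("78", "beh", ".tsv", ses="pre", task="AxcptBas"))
print(rawdata_path("1", "beh", ".tsv", task="A"))
```

The second call shows the no-session case: the file sits directly in the participant's subdirectory and the name has no ses- entity.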
In the MLX study (below right) participants came to the lab twice for data collection: the "pre" and "post" sessions. But there is a third, "ses-ema", subdirectory for participant 8, because participants were asked to complete surveys on some days between the pre and post sessions. The survey responses weren't collected in the lab nor on a single day, but they are a consistent type of behavioral data collected during a specific time frame, so we decided it would be clearest to group them into a single "session".
Deciding how to group the data can be difficult; there's not always a single correct answer. For example, suppose that the MLX researchers designed the surveys not as a unique part of the overall experiment, but instead as a continuation of the pre session (e.g., the surveys were training to reinforce what participants learned during the first session). In such a study it could be more logical to store the survey files in ses-pre and not have a separate ses-ema subdirectory; the experiment design and planned analyses inform how the dataset should be structured.
Notice in the MLX screen capture above that sub-8 has three session subdirectories (pre, ema, post), while 11 only has pre and 12, only pre and post. This is because subjects 11 and 12 didn't complete all parts of the study; some of their data is missing. Omitting empty subdirectories makes it easy to identify missings, makes the dataset easier to read, and is consistent with the standard; there's no need to have lots of empty subdirectories.
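When empty subdirectories are omitted, missing sessions can be listed just by looking at which directories exist. A minimal Python sketch, assuming the sub-/ses- layout described above; the function names and the list of expected sessions are hypothetical and would come from the study design.

```python
from pathlib import Path


def sessions_by_subject(rawdata):
    """Map each sub- directory name to the sorted names of its ses- subdirectories."""
    table = {}
    for sub_dir in sorted(Path(rawdata).glob("sub-*")):
        if sub_dir.is_dir():
            table[sub_dir.name] = sorted(
                p.name for p in sub_dir.glob("ses-*") if p.is_dir()
            )
    return table


def missing_sessions(rawdata, expected):
    """For each participant, list the expected sessions that have no subdirectory."""
    missing = {}
    for sub, found in sessions_by_subject(rawdata).items():
        absent = [ses for ses in expected if ses not in found]
        if absent:
            missing[sub] = absent
    return missing
```

For the MLX layout described above, calling missing_sessions with the rawdata path and ["ses-pre", "ses-ema", "ses-post"] would report the absent sessions for subjects 11 and 12 and nothing for subject 8.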
Similarly, I could have made /beh/ subdirectories within each of the MLX session subdirectories, but did not because it would be redundant: all the data files for each session are of the same (_beh) type. That's not the case for the study below, in which ses-1 has both beh and eeg subdirectories, with different files in each. In this study participants come to the lab many times, doing different tasks in each visit. In session 1 this participant did the "StroopMULTIReward" task while EEG was being recorded, so there's an eeg subdirectory with the various files BIDS specifies for describing EEG data collection. At the start of every visit participants completed a session intake survey; the survey questions were the same regardless of the tasks done later in the session, so we decided to store the responses separately from the task data, in beh subdirectories.
I suggest following the standard very closely when your study includes data of a type specifically covered by BIDS ((f)MRI, many others). Methods researchers have worked on the standard over the years, adding and refining its fields and formats to capture the details needed for analyses, but not more (i.e., for complete, but de-identified, datasets). The BIDS standard has modality-specific subsections because each modality has such different requirements; files explaining electrode placement (_coordsystem.json, _electrodes.tsv) are important for EEG, but meaningless for fMRI. In my opinion, if the data for your study is valid BIDS, you can feel reasonably confident that you're not missing critical details that will be needed to use the dataset (and you'll be ready to use a BIDS-App for preprocessing).
BIDS sidecar files (.json)
Very broadly, details about particular data files are in the "sidecar" .json file with a matching name. The standard specifies .json files for particular data types (e.g., /ses-1/eeg/ above has an _eeg.json file); if you have a described data type, I recommend following its standard (e.g., record your local power frequency in Hz in the "PowerLineFrequency" .json line; don't arbitrarily use "powerFreq" instead). There are programs to make BIDS sidecar files for particular data types (e.g., the splendid dcm2niix for fMRI, BV2BIDS for Brain Vision EEG); I recommend using them, if applicable.
.json can look intimidating, but it's just a text file with particular formatting, and is (more-or-less) human readable in any text editor. The OpenNeuro website displays .json files as source text or nicely formatted, and there are many examples throughout the BIDS standard. While you can write a valid .json file by "hand" (keeping track of a lot of {} [] , and :), I don't recommend it; it's much simpler to use a dedicated package. (I use jsonlite in R; I'll eventually post some example code, or please ask.)
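To give a sense of what "use a dedicated package" looks like, here is a minimal sketch using Python's built-in json module (swapped in for illustration; this is not the jsonlite R code mentioned above). The PowerLineFrequency value of 60 is an assumed example, and the function names are hypothetical.

```python
import json

# Example sidecar content. "PowerLineFrequency" and "TaskName" are field names
# from the BIDS EEG specification; the value 60 (Hz) is an assumption here.
eeg_sidecar = {
    "TaskName": "StroopMULTIReward",
    "PowerLineFrequency": 60,
}


def sidecar_text(content):
    """Return the content as indented JSON text; the package handles the punctuation."""
    return json.dumps(content, indent=2)


def write_sidecar(path, content):
    """Write the content to a .json file at the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sidecar_text(content) + "\n")


print(sidecar_text(eeg_sidecar))
```

The point of the sketch is that the content is written as an ordinary dictionary, and the braces, quotes, commas, and colons are generated rather than typed.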
Writing BIDS .json files is mostly necessary to describe columns in behavioral data files; to provide study-specific details like the meaning of particular event codes or participant group assignments (unlike, say, PowerLineFrequency, which is relevant to any EEG data and so can be covered by a standard). Since these fields are study-specific, more detail is generally better. For example, below left is part of MLX sub-78_ses-pre_task-AxcptBas_beh.tsv, and the "probeAnswer" column has rows with 1, 4, and null ... which isn't very informative. For the explanation, we look at the .json file for the task, in this case, /rawdata/task-Axcpt_beh.json, and find the section for "probeAnswer" (below right):
To be fair, this explanation may still seem uninformative. But it is useful when combined with the task information in the dataset description document and stimuli files. Note that the .json descriptions include the name of the corresponding E-Prime (software used to administer this experiment) fields; these sorts of links can be very helpful for later analyses (and/or debugging).
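As a sketch of how such column descriptions can be used, the Python below looks up the meaning of a value in a sidecar-style dictionary. Everything in the example content is invented for illustration: the "accuracy" column, its wording, and the "SourceField" key (a hypothetical custom key for recording the E-Prime field name) are not taken from the MLX dataset. "Description" and "Levels" are the field names BIDS uses for describing tabular columns.

```python
# Hypothetical sidecar content for an invented "accuracy" column.
example_sidecar = {
    "accuracy": {
        "Description": "Whether the probe response was scored as correct.",
        "Levels": {"0": "incorrect", "1": "correct"},
        "SourceField": "Probe.ACC",  # hypothetical key and E-Prime field name
    }
}


def explain(sidecar, column, value):
    """Describe one value of one column using the sidecar's entry for that column."""
    entry = sidecar.get(column)
    if entry is None:
        return f"{column}: not described in the sidecar"
    # JSON object keys are always strings, so convert the value before the lookup.
    meaning = entry.get("Levels", {}).get(str(value), "no level listed")
    description = entry.get("Description", "no description")
    return f"{column}={value}: {meaning} ({description})"


print(explain(example_sidecar, "accuracy", 1))
```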
There's only one task-Axcpt_beh.json file for the entire MLX dataset, in the root of rawdata, because the same data columns are used in sub-78_ses-pre_task-AxcptBas_beh.tsv and sub-78_ses-pre_task-AxcptPro_beh.tsv, as well as the ses-post files, and for all participants. Since all of the Axcpt _beh.tsv variants have the same fields, only one explanatory .json is needed (BIDS "inheritance"). (Fully valid BIDS might need both task-AxcptBas_beh.json and task-AxcptPro_beh.json, I'm not sure. But since this is a "just enough" BIDS dataset, I decided it was less confusing to have a single file for both Axcpt variants than two identical ones differing only in filename.)
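The lookup logic can be sketched in Python as below. This is a simplified reading of the inheritance idea, not the validator's implementation: a .json file in the same directory or higher applies when it has the same suffix and no entity that conflicts with the data file's name. The loose_task option is a hypothetical addition that mimics the "just enough" choice described above, letting task-Axcpt stand for task-AxcptBas and task-AxcptPro.

```python
from pathlib import Path


def split_name(filename):
    """Return (entities, suffix) for a BIDS-style file name."""
    stem = filename.split(".", 1)[0]
    *pairs, suffix = stem.split("_")
    entities = {}
    for pair in pairs:
        key, _, value = pair.partition("-")
        entities[key] = value
    return entities, suffix


def sidecar_applies(sidecar_name, data_name, loose_task=False):
    """Check whether a sidecar's name is compatible with a data file's name."""
    side_entities, side_suffix = split_name(sidecar_name)
    data_entities, data_suffix = split_name(data_name)
    if side_suffix != data_suffix:
        return False
    for key, value in side_entities.items():
        data_value = data_entities.get(key)
        if data_value is None:
            return False  # the sidecar names an entity the data file lacks
        if data_value != value:
            # Hypothetical relaxation: accept a task label that is a prefix.
            if not (loose_task and key == "task" and data_value.startswith(value)):
                return False
    return True


def find_sidecars(data_file, rawdata, loose_task=False):
    """Collect applicable .json files from the data file's directory up to rawdata."""
    data_file, rawdata = Path(data_file), Path(rawdata)
    found = []
    folder = data_file.parent
    while True:
        for candidate in sorted(folder.glob("*.json")):
            if sidecar_applies(candidate.name, data_file.name, loose_task):
                found.append(candidate)
        if folder == rawdata or folder == folder.parent:
            break
        folder = folder.parent
    return found
```

Under the strict check, task-Axcpt_beh.json is not matched to sub-78_ses-pre_task-AxcptBas_beh.tsv because the task labels differ; with loose_task=True it is, which is the single-file arrangement used in the MLX dataset.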
participants.tsv & participants.json
The BIDS standard has these as "recommended" files, but I consider them "required", even for "just enough" BIDS: a dataset's participants.tsv is my gold-standard, definitive, absolutely-correct source for a list of its participants. It may seem odd that I'm putting so much emphasis on having a definitive list of participants, but subject ID discrepancies happen surprisingly often, and can lead to all sorts of trouble (at minimum, missing data going undetected).
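One way to catch subject ID discrepancies is to compare participants.tsv against the sub- subdirectories actually present. A minimal Python sketch, assuming participants.tsv is tab-delimited with a participant_id column holding values like sub-8 (as the BIDS standard specifies); the function names are hypothetical.

```python
import csv
from pathlib import Path


def listed_participants(participants_tsv):
    """Read the participant_id column of a tab-delimited participants file."""
    with open(participants_tsv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["participant_id"] for row in reader}


def compare_participants(rawdata):
    """Report IDs that appear in only one of participants.tsv and the sub- directories."""
    rawdata = Path(rawdata)
    listed = listed_participants(rawdata / "participants.tsv")
    on_disk = {p.name for p in rawdata.glob("sub-*") if p.is_dir()}
    return {
        "listed_without_data": sorted(listed - on_disk),
        "data_without_listing": sorted(on_disk - listed),
    }
```

An entry under listed_without_data is not necessarily an error (for example, an enrolled participant who dropped before any data was collected), but an entry under data_without_listing usually deserves a closer look.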
I usually recommend listing every fully-enrolled (consented) subject in participants.tsv, even if they dropped before much data was collected, because it simplifies record keeping (e.g., cross-checking with other reports about the same study) and reduces ambiguity. While complete files are important for private versions of the dataset and studies in active data collection, usually only a subset of the data can be shared (e.g., to maintain participant privacy), which requires making a separate version of the dataset specifically for sharing. In this case it's generally best to have a separate participants.tsv for the shared dataset, with only the details which can be shared. For example, the DMCC55B participants.tsv has little besides the subject IDs for its 55 people (the full DMCC dataset has more participants and details), because that is all that can be shared without restrictions.
In addition to the subject IDs, it is generally useful to store demographic-type information in participants.tsv. For example, if all participants in a study filled out a questionnaire asking for their age and handedness, I'd recommend storing those responses in participants.tsv columns, with the questionnaire text (and field names, if applicable) in the corresponding participants.json. Studies often ask for information in different ways, and the variations can matter (e.g., whether "prefer not to answer" is an option, how numerical ranges are presented, free text entry vs. selecting from a list).
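A minimal Python sketch of writing the pair of files follows. The rows, column names, and questionnaire wording are all invented for illustration, and write_participants is a hypothetical helper; the only BIDS conventions relied on are the participant_id column, tab-delimited text, and n/a for missing values.

```python
import csv
import json

# Invented example data: one participant did not report an age.
rows = [
    {"participant_id": "sub-1", "age": 25, "handedness": "right"},
    {"participant_id": "sub-2", "age": None, "handedness": "left"},
]

# Invented column descriptions, including the (hypothetical) questionnaire text.
descriptions = {
    "age": {"Description": "Response to 'What is your age in years?' (free text entry)."},
    "handedness": {
        "Description": "Response to 'Which hand do you write with?' (selected from a list).",
        "Levels": {"right": "right hand", "left": "left hand"},
    },
}


def write_participants(tsv_path, json_path, rows, descriptions):
    """Write participants.tsv (n/a for missing values) and its .json sidecar."""
    columns = ["participant_id"] + list(descriptions)
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
        for row in rows:
            writer.writerow(
                ["n/a" if row.get(col) in (None, "") else row[col] for col in columns]
            )
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(descriptions, handle, indent=2)
```

Keeping the column names in one place (the descriptions dictionary) means a column cannot be written to participants.tsv without also having an entry in participants.json.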
musings
In this post (and particularly the previous) I've described one way of arranging valid BIDS* datasets, but it's not the only way; for example, subject data directories can be in the root, they don't have to be within rawdata. I organized the DMCC55B dataset this way, as you can see on OpenNeuro: only derivatives, code, and sub- subdirectories. I no longer recommend this organization nor use it internally for dataset management (we do use it transiently for preprocessing individual participants, but this post is already very long; happy to discuss further sometime).
While I gave a few mentions of "just enough" modifications in this post, my general advice is for rawdata to be properly valid BIDS, doubly so when the study includes datatypes covered by the specification (see the "Modality specific files" list in the left menu of the website). I'm not especially fond of .json and .tsv formats, nor of using n/a for missings instead of NA, but judge those sorts of hassles to be a small price to pay compared to the benefits of standardization: standardized datasets are far more likely to be understandable (and usable, maintainable, etc.) by colleagues, now and in the future.
In practice, researchers accustomed to their own, idiosyncratic dataset scheme are reluctant to change. The six top-level directories and "just enough" BIDS logic can be a stepping stone; providing a high-level standard structure while "hiding" the full, formal BIDS in the rawdata subdirectory.
* A valid BIDS dataset would require (at minimum) two files I didn't discuss above, rawdata/dataset_description.json and rawdata/README. These are important when sharing datasets and I have no objection to them in principle, but we've found a combined description document (.docx, .odt) in the root easier to maintain and more likely for collaborators to read.