Thursday, June 18, 2026

"just enough" BIDS 2: proper BIDS in rawdata

This post is the second in a series introducing "just enough" BIDS; please start with the first. As explained there (and in this talk), organizing a dataset into six top-level directories following the BIDS standard can simplify and improve its sharing, storage, and management. 

When introducing the six directories I described their contents only in general terms: each should hold specific types of information, but the files don't need to be organized and named in a particular way ... except for rawdata. As I put it in that first post, 

/rawdata/ is for data files that have been rearranged, renamed, and probably converted to a different format than they were collected (and stored in sourcedata) in, but not fundamentally transformed. I suggest that the rearranging and renaming in rawdata be to follow the BIDS standard more closely, as will be explained in a later post here (and the recorded talk).

For "just enough" BIDS, I suggest starting by using the standard's file naming and organization scheme (i.e., subdirectory structure) in rawdata, then start adding key .json files (or their equivalent), and finally consider converting all files to the specified types (e.g., .tsv instead of .csv). I'll explain each of these pieces in turn. (Warning: if you'll be running fMRIPrep or another BIDS App on the rawdata it's not safe to disregard any part of the BIDS standard!)

The screen captures in this post are from various datasets, but always of rawdata. I hope that as you read the explanations you will start to recognize the datasets' consistent structure and logic, and so be able to more easily find the information you're looking for in any BIDS dataset.

BIDS file naming

BIDS file names are usually (very) long, but systematic and interpretable. For example, from the names in the screen capture below we know that these files contain behavioral (_beh) data for participant (sub-) f1027ao's (_ses-) wave1bas lab visit, during which they did the Axcpt, Cuedts, Stern, and Stroop tasks (_task-) twice each (_run-). 

directory tree: /rawdata/sub-f1027ao/ses-wave1bas/sub-f1027ao_ses-wave1bas_task-Axcpt_run-1_beh.tsv 
The documentation goes into a lot of detail on the abbreviations, keys, ordering, etc. to use in BIDS filenames. I suggest following its entity definitions as closely as possible (and throughout the experiment); consistent use of even just the sub-, _ses-, _task-, and the data type suffix (here, _beh) makes it much easier to understand (and navigate) the contents of a dataset. People have all sorts of naming conventions, but I think the clarity and ease of communication of a universal standard outweigh most other considerations (e.g., "visit" instead of "session", aesthetic dislike of long file names, capitalization preferences). 
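To show how systematic these names are, here is a minimal R sketch that assembles one from its entities. The helper bids_name() and its arguments are my own invention for illustration (not part of any BIDS tool), and it only handles the four entities discussed in this post.

```r
# hypothetical helper: build a BIDS-style file name from its entities
bids_name <- function(sub, suffix, ext, ses=NULL, task=NULL, run=NULL) {
  parts <- paste0("sub-", sub);   # sub- always comes first
  if (!is.null(ses)) { parts <- c(parts, paste0("ses-", ses)); }
  if (!is.null(task)) { parts <- c(parts, paste0("task-", task)); }
  if (!is.null(run)) { parts <- c(parts, paste0("run-", run)); }
  paste0(paste(c(parts, suffix), collapse="_"), ext);   # entities joined by _, suffix last
}

fname <- bids_name(sub="f1027ao", ses="wave1bas", task="Axcpt", run=1, suffix="beh", ext=".tsv");
print(fname);   # sub-f1027ao_ses-wave1bas_task-Axcpt_run-1_beh.tsv
```

Leaving out an entity just drops it from the name, so a study without sessions or runs can use the same function.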

For "just enough" BIDS I recommend closely following the file naming definitions, but relaxing some aspects when reasonable. For example, imagine an online behavioral study in which data was collected from hundreds of individuals in a single day. If the experimenters downloaded that data as a large spreadsheet, reformatting it into many hundreds of separate files (one per person per task, named sub-1_task-A_beh.tsv, sub-1_task-B_beh.tsv,  sub-2_task-A_beh.tsv, ....) may seem ridiculous and a waste of time. If so, I'd recommend creating two large files instead: sub-all_task-A_beh.tsv and sub-all_task-B_beh.tsv. sub-all is not valid BIDS (sub- should refer to a single participant), but in my opinion does follow the logic and spirit of BIDS, makes the dataset easier to use and understand, and is far preferable to arbitrary file names and formats (e.g., download.xlsx).
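If you do want the one-file-per-person-per-task version, it need not be much work. The sketch below assumes the download has already been read into a data frame; the column names (participant, task, rt), the example values, and the temporary output location are all made up for illustration, and the files go directly in each sub- directory (no /beh/ level).

```r
# made-up example data standing in for the downloaded spreadsheet
all.tbl <- data.frame(participant=c(1,1,2,2), task=c("A","B","A","B"), rt=c(512, 640, NA, 480));
out.path <- file.path(tempdir(), "rawdata");   # assumption: a temporary stand-in for rawdata

for (sid in unique(all.tbl$participant)) {
  for (tid in unique(all.tbl$task)) {
    sub.tbl <- all.tbl[which(all.tbl$participant == sid & all.tbl$task == tid),];
    if (nrow(sub.tbl) > 0) {   # skip person-task combinations without data
      sub.path <- file.path(out.path, paste0("sub-", sid));
      dir.create(sub.path, recursive=TRUE, showWarnings=FALSE);
      fname <- file.path(sub.path, paste0("sub-", sid, "_task-", tid, "_beh.tsv"));
      write.table(sub.tbl, fname, sep="\t", quote=FALSE, row.names=FALSE, na="n/a");
    }
  }
}
made.files <- list.files(out.path, recursive=TRUE);   # one file per person per task
print(made.files);
```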

BIDS subdirectories

The BIDS standard is for each person to have their own rawdata subdirectory (named sub- then the subject ID), which in turn has subdirectories for each session (ses-) and/or data type. In the example above, we can see from the subdirectories that participant f1027ao (sub-) came to the lab three times, completing the wave1bas, wave1pro, and wave1rea sessions (ses-), while f1342ku (below) has three additional (wave2) sessions.

Sessions often correspond to separate lab visits, but not always (as put in the BIDS standard, a "session" is "A logical grouping of neuroimaging and behavioral data consistent across subjects."). If each participant only interacted with the study once (like in the online study example), the study doesn't have sessions, and so participants don't need ses- subdirectories. 

In the MLX study (below right) participants came to the lab twice for data collection: the "pre" and "post" sessions. But there is a third, "ses-ema", subdirectory for participant 8, because participants were asked to complete surveys on some days between the pre and post sessions. The survey responses weren't collected in the lab nor on a single day, but they are a consistent type of behavioral data collected during a specific time frame, so we decided it would be clearest to group them into a single "session".   

  

Deciding how to group the data can be difficult; there's not always a single correct answer. For example, suppose that the MLX researchers designed the surveys not as a unique part of the overall experiment, but instead as a continuation of the pre session (e.g., the surveys were training to reinforce what participants learned during the first session). In such a study it could be more logical to store the survey files in ses-pre and not have a separate ses-ema subdirectory; the experiment design and planned analyses inform how the dataset should be structured.

Notice in the MLX screen capture above that sub-8 has three session subdirectories (pre, ema, post), while 11 only has pre and 12, only pre and post. This is because subjects 11 and 12 didn't complete all parts of the study; some of their data is missing. Omitting empty subdirectories makes it easy to identify missings, makes the dataset easier to read, and is consistent with the standard; there's no need to have lots of empty subdirectories.
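Because missing sessions simply have no subdirectory, a few lines of R can tabulate who has what. This is only a sketch: it first builds a tiny made-up directory tree (mimicking the three MLX participants described above) under a temporary directory, and the variable names are mine.

```r
# build a tiny made-up rawdata tree like the MLX example
raw.path <- file.path(tempdir(), "rawdata_sessions");
for (d in c("sub-8/ses-pre", "sub-8/ses-ema", "sub-8/ses-post", "sub-11/ses-pre", "sub-12/ses-pre", "sub-12/ses-post")) {
  dir.create(file.path(raw.path, d), recursive=TRUE, showWarnings=FALSE);
}

sub.ids <- list.dirs(raw.path, full.names=FALSE, recursive=FALSE);   # the sub- subdirectories
ses.ids <- c("ses-pre", "ses-ema", "ses-post");   # sessions a participant could have
# TRUE where the session subdirectory exists for the participant
has.tbl <- sapply(ses.ids, function(ses) { dir.exists(file.path(raw.path, sub.ids, ses)) });
rownames(has.tbl) <- sub.ids;
print(has.tbl);   # rows are participants, columns are sessions
```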

Similarly, I could have made /beh/ subdirectories within each of the MLX session subdirectories, but did not because it would be redundant: all the data files for each session are of the same (_beh) type. That's not the case for the study below, in which ses-1 has both beh and eeg subdirectories, with different files in each. In this study participants come to the lab many times, doing different tasks in each visit. In session 1 this participant did the "StroopMULTIReward" task while EEG was being recorded, so there's an eeg subdirectory with the various files BIDS specifies for describing EEG data collection. At the start of every visit participants completed a session intake survey; the survey questions were the same regardless of the tasks done later in the session, so we decided to store the responses separately from the task data, in beh subdirectories.

I suggest following the standard very closely when your study includes data of a type specifically covered by BIDS ((f)MRI, many others). Methods researchers have worked on the standard over the years, adding and refining its fields and formats to capture the details needed for analyses, but not more (i.e., for complete, but de-identified, datasets). The BIDS standard has modality-specific subsections because each modality has such different requirements; files explaining electrode placement (_coordsystem.json, _electrodes.tsv) are important for EEG, but meaningless for fMRI. In my opinion, if the data for your study is valid BIDS, you can feel reasonably confident that you're not missing critical details that will be needed to use the dataset (and you'll be ready to use a BIDS-App for preprocessing).


BIDS sidecar files (.json)

Very broadly, details about particular data files are in the "sidecar" .json file with a matching name. The standard specifies .json files for particular data types (e.g., /ses-1/eeg/ above has an _eeg.json file); if you have a described data type, I recommend following its standard (e.g., record your local power frequency in Hz in the "PowerLineFrequency" .json line; don't arbitrarily use "powerFreq" instead). There are programs to make BIDS sidecar files for particular data types (e.g., the splendid dcm2niix for fMRI, BV2BIDS for Brain Vision EEG); I recommend using them, if applicable.

.json can look intimidating, but it's just a text file with particular formatting; the files are (more-or-less) human readable in any text editor. The OpenNeuro website displays .json files as source text or nicely formatted, and there are many examples throughout the BIDS standard. While you can write a valid .json file by "hand" (keeping track of a lot of {} [] , and :), I don't recommend it; it's much simpler to use a dedicated package. (I use jsonlite in R; I'll eventually post some example code, or please ask.)

Writing BIDS .json files is mostly necessary to describe columns in behavioral data files; to provide study-specific details like the meaning of particular event codes or participant group assignments (unlike, say, PowerLineFrequency, which is relevant to any EEG data and so can be covered by a standard). Since these fields are study-specific, more detail is generally better. For example, below left is part of MLX sub-78_ses-pre_task-AxcptBas_beh.tsv, and the "probeAnswer" column has rows with 1, 4, and null ... which isn't very informative. For the explanation, we look at the .json file for the task, in this case, /rawdata/task-Axcpt_beh.json, and find the section for "probeAnswer" (below right): 

To be fair, this explanation may still seem uninformative. But it is useful when combined with the task information in the dataset description document and stimuli files. Note that the .json descriptions include the name of the corresponding e-prime (software used to administer this experiment) fields; these sorts of links can be very helpful for later analyses (and/or debugging).
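For a sense of how such a column description can be written from R, here is a minimal sketch using jsonlite (assumed to be installed). LongName, Description, and Levels are field names from the BIDS specification for describing tabular columns, but all of the text values are placeholders, NOT the actual contents of the MLX file, and the output goes to a temporary directory.

```r
library(jsonlite);   # assumption: jsonlite is installed

# placeholder text only; not the real MLX descriptions
desc.list <- list(
  probeAnswer=list(
    LongName="placeholder long name",
    Description="placeholder description, including the corresponding e-prime field name",
    Levels=list("1"="placeholder meaning of 1", "4"="placeholder meaning of 4")
  )
);
json.fname <- file.path(tempdir(), "task-Axcpt_beh.json");
write_json(desc.list, json.fname, pretty=TRUE, auto_unbox=TRUE);   # auto_unbox so strings aren't written as arrays

# read it back to confirm the structure survived the round trip
in.list <- fromJSON(json.fname);
print(names(in.list$probeAnswer$Levels));
```

Building the descriptions as nested R lists means the package handles all the brackets and quoting.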

There's only one task-Axcpt_beh.json file for the entire MLX dataset, in the root of rawdata, because the same data columns are used in sub-78_ses-pre_task-AxcptBas_beh.tsv and sub-78_ses-pre_task-AxcptPro_beh.tsv, as well as the ses-post files, and for all participants. Since all of the task-Axcpt _beh.tsv variants have the same fields, only one explanatory .json is needed (BIDS "inheritance"). (Fully valid BIDS might need both task-AxcptBas_beh.json and task-AxcptPro_beh.json; I'm not sure. But since this is a "just enough" BIDS dataset, I decided it was less confusing to have a single file for both Axcpt variants than two identical ones differing only in filename.)

participants.tsv & participants.json

The BIDS standard has these as "recommended" files, but I consider them "required", even for "just enough" BIDS: a dataset's participants.tsv is my gold-standard, definitive, absolutely-correct source for a list of its participants. It may seem odd that I'm putting so much emphasis on having a definitive list of participants, but subject ID discrepancies happen surprisingly often, and can lead to all sorts of trouble (at minimum, missing data going undetected).
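One way to catch ID discrepancies is to compare participants.tsv against the sub- directories actually present. The sketch below makes up a tiny example (three listed participants, two with directories) in a temporary directory; participant_id is the column name BIDS specifies, everything else is illustrative.

```r
# made-up example: three participants listed, but only two with data directories
raw.path <- file.path(tempdir(), "rawdata_check");
dir.create(file.path(raw.path, "sub-01"), recursive=TRUE, showWarnings=FALSE);
dir.create(file.path(raw.path, "sub-03"), recursive=TRUE, showWarnings=FALSE);
writeLines(c("participant_id", "sub-01", "sub-02", "sub-03"), file.path(raw.path, "participants.tsv"));

p.tbl <- read.delim(file.path(raw.path, "participants.tsv"), stringsAsFactors=FALSE);
dir.ids <- list.dirs(raw.path, full.names=FALSE, recursive=FALSE);
dir.ids <- dir.ids[grepl("^sub-", dir.ids)];   # keep only subject directories

no.data <- setdiff(p.tbl$participant_id, dir.ids);      # listed, but no directory
not.listed <- setdiff(dir.ids, p.tbl$participant_id);   # directory, but not listed
print(no.data);
print(not.listed);
```

A listed participant without a directory may be fine (e.g., someone who dropped early), but a directory without a participants.tsv row is worth investigating.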

I usually recommend listing every fully-enrolled (consented) subject in participants.tsv, even if they dropped before much data was collected, because it simplifies record keeping (e.g., cross-checking with other reports about the same study) and reduces ambiguity. While complete files are important for private versions of the dataset and studies in active data collection, usually only a subset of the data can be shared (e.g., to maintain participant privacy), so a separate version of the dataset has to be made specifically for sharing. In this case it's generally best to have a separate participants.tsv for the shared dataset, with only the details which can be shared. For example, the DMCC55B participants.tsv has little besides the subject IDs for its 55 people (the full DMCC dataset has more participants and details), because that is all that can be shared without restrictions.

In addition to the subject IDs, it is generally useful to store demographic-type information in participants.tsv. For example, if all participants in a study filled out a questionnaire asking for their age and handedness, I'd recommend storing those responses in participants.tsv columns, with the questionnaire text (and field names, if applicable) in the corresponding participants.json. Studies often ask for information in different ways, and the variations can matter (e.g., whether "prefer not to answer" is an option, how numerical ranges are presented, free text entry vs. selecting from a list).
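Writing such a table from R is short; the main BIDS-specific details are the tab separator and n/a for missing values. The demographic values below are made up, and the file is written to a temporary directory.

```r
# made-up demographics; NA marks a skipped question
p.tbl <- data.frame(participant_id=c("sub-01", "sub-02", "sub-03"),
                    age=c(24, NA, 31),
                    handedness=c("right", "left", NA));
p.fname <- file.path(tempdir(), "participants.tsv");
write.table(p.tbl, p.fname, sep="\t", quote=FALSE, row.names=FALSE, na="n/a");   # NA written as n/a

# when reading, convert n/a back to NA
in.tbl <- read.delim(p.fname, na.strings="n/a", stringsAsFactors=FALSE);
print(in.tbl);
```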

musings

In this post (and particularly the previous) I've described one way of arranging valid BIDS* datasets, but it's not the only way; for example, subject data directories can be in the root, they don't have to be within rawdata. I organized the DMCC55B dataset this way, as you can see on OpenNeuro: it has only derivatives, code, and sub- subdirectories. I no longer recommend this organization nor use it internally for dataset management (we do use it transiently for preprocessing individual participants, but this post is already very long; happy to discuss further sometime).

While I gave a few mentions of "just enough" modifications in this post, my general advice is for rawdata to be properly valid BIDS, doubly so when the study includes datatypes covered by the specification (see the "Modality specific files" list in the left menu of the website). I'm not especially fond of .json and .tsv formats, nor of using n/a for missings instead of NA, but judge those sorts of hassles to be a small price to pay compared to the benefits of standardization: standardized datasets are far more likely to be understandable (and usable, maintainable, etc.) by colleagues, now and in the future. 

In practice, researchers accustomed to their own, idiosyncratic dataset scheme are reluctant to change. The six top-level directories and "just enough" BIDS logic can be a stepping stone; providing a high-level standard structure while "hiding" the full, formal BIDS in the rawdata subdirectory.


* A valid BIDS dataset would require (at minimum) two files I didn't discuss above, rawdata/dataset_description.json and rawdata/README. These are important when sharing datasets and I have no objection to them in principle, but we've found a combined description document (.docx, .odt) in the root easier to maintain and more likely for collaborators to read.


below the jump, plain text directory tree versions of the images 

Wednesday, May 27, 2026

"just enough" BIDS 1: introduction and six top-level directories

I'm a very big fan of organizing datasets (and not just fMRI datasets) following "just enough" of the BIDS standard. We now have a strategy that is working fairly well for a number of datasets and collaborations, and this (soon to be) series of posts is intended to serve as an introduction and quick-start guide, especially for people not already familiar with BIDS. A recording of me giving a talk covering much of the same material is at https://osf.io/w7zkc/files/ycgd3.

But first, a warning: I use BIDS in two distinct contexts: 1) transient, made for fMRIPrep preprocessing and then deleted; and 2) permanent, for managing a particular research project's dataset. "Just enough" BIDS is only for the second case; fMRIPrep (and other BIDS Apps) absolutely require the input files to be fully valid BIDS (and things can go badly wrong if anything is mis-specified). I'll return to the idea of "transient" BIDS later; the focus here is on the dataset management context.

why do we need any BIDS?

This seems to be a case of "if you know, you know" ... many of us can confirm that trying to understand data collected by someone else (or by you, a few years ago) can be an absolute nightmare; even answering basic questions like how many participants the dataset has and what tasks they performed can take substantial effort. Worse still are cases where a shared dataset turns out to be unusable because key information is missing, or when confusion about it leads to incorrect conclusions. Even deciding how to arrange your own datasets can be daunting, and poor initial organization choices can be costly to correct later.

a tiny bit of BIDS: six top-level directories

Arranging a dataset in six top-level directories is a (relatively) small step that can simplify management. For example, below is what I see in the project root directory of "MLX", a dataset I'm currently working with. Its files are in six subdirectories: analysis, code, derivatives, rawdata, sourcedata, and stimuli, the names and contents of which follow the BIDS standard.

the root directory of a dataset with six subdirectories: analysis, code, derivatives, rawdata, sourcedata, and stimuli, together with two files (fuzzKittens.jpg and MLX_datasetDescription.docx).
Also in the root is MLX_datasetDescription.docx, an informal Scientific Data descriptor-type article. Its purpose is to hold key information that anyone (in the lab or publicly) considering using the dataset should know. This overlaps with parts of the BIDS README, CHANGES, and dataset_description.json files, but we've had more implementation success asking people to read and maintain a single flexible-format word processing document than the formal separate files. (The last file, fuzzKittens.JPG, is a picture of tiny foster kittens, used to confirm that the directory permissions were working but too cute to delete after.)
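Setting up the six directories for a new project takes only a few lines. A minimal R sketch, using a temporary directory as a stand-in for the real project root (root.path and the "MLX_demo" name are placeholders):

```r
root.path <- file.path(tempdir(), "MLX_demo");   # assumption: stand-in for the project root
top.dirs <- c("analysis", "code", "derivatives", "rawdata", "sourcedata", "stimuli");
for (d in top.dirs) { dir.create(file.path(root.path, d), recursive=TRUE, showWarnings=FALSE); }
made.dirs <- list.dirs(root.path, full.names=FALSE, recursive=FALSE);
print(made.dirs);
```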

What goes into each of the six top-level directories?

/stimuli/ is for task stimuli; more generally, all the files and details needed to run or understand what the participants (and experimenters) did, arranged as desired. This can include SOPs, experimenter scripts, redcap dictionaries, eprime files, training materials, recordings of someone doing the tasks, etc., etc. 

These screen captures show the stimuli directory for two different projects to give a sense of how they can vary; the project at left had multiple visits, tasks, and types of data collection so has many more subdirectories than the single-task LifespanStroopCW project at right.
  

/sourcedata/ is for the data files as originally collected, unaltered. There's not a particular naming, format, or organization required for sourcedata: however you collected ‘em is fine (if documented). The MLX and MTurk2018 examples below have different sourcedata subdirectories, each with an idiosyncratic combination of file names and types.

Some datasets may need to have much of the sourcedata archived elsewhere. For example, the dualmechanisms dataset below only has a readme.txt file for sourcedata because it was an fMRI study, and our fMRI files go directly from the scanners to a university XNAT database. It would be redundant (and a lot of storage overhead) to keep a copy of them all under sourcedata, plus jeopardize participant privacy (e.g., DICOM files usually have identifiable details). 
   



/rawdata/ is for data files that have been rearranged, renamed, and probably converted to a different format than they were collected (and stored in sourcedata) in, but not fundamentally transformed. I suggest that the rearranging and renaming in rawdata be to follow the BIDS standard more closely, as will be explained in a later post (and the recorded talk).

Here are examples of rawdata from three different datasets; note that in each case the directory structure and file types are consistent.
   


/derivatives/ is for altered versions of the rawdata and sourcedata files. For fMRI datasets, preprocessed images are the derivatives: these images have been spatially normalized, smoothed, converted to surfaces, etc.; much more substantial changes than converting a file from DICOM (sourcedata) to nifti (rawdata). But the idea can apply to many other types of data. 

For example, some of our projects have a lot of trial-level behavioral data for each participant. The BIDS standard is for these to be arranged in multiple _beh.tsv files (which we do in rawdata), but some of our collaborators prefer to work with fewer larger files of data from multiple participants and sessions, which we store in derivatives subdirectories. I generally put files with data from more than one participant and/or session under derivatives, including individual differences questionnaire responses (in NDAR format; more details and examples will be in a later post). 

The derivatives for these three example datasets have some similarities but substantial variations, due to the different data types and analysis requirements. For instance, the precisionneuroscience dataset has EEG data, which is being processed by BVA. BVA has its own requirements for dataset organization and its preprocessing can be done in various ways, so we decided to put each into its own derivatives subdirectory.
  

/code/ is for code, especially that used for processing, converting, and quality-checking the dataset (and backed up in a version control system).

/analysis/ is for quality control summary documents, analysis code, results files, manuscripts, etc.

BIDS does not have file name nor content requirements for either the analysis or code subdirectories.
It is working reasonably well in our collaborations to use the code subdirectory for "official" dataset maintenance and conversion-type scripts, and to let analysis be very flexible (e.g., with individuals making their own "working" subdirectories in which they run ongoing analyses however they wish). 
  


more examples!

I think it's easiest to get accustomed to BIDS by working with some already-formatted datasets, but hopefully the various examples and explanation here (and in the talk) will give a sense and serve as a starting point. For dataset management, I believe adopting the convention of these six top-level directories plus a description document is beneficial on its own, without adding too much procedural overhead.


several full ascii tree-style examples after the jump

Friday, February 27, 2026

detrending and normalizing timecourses: afni 3dDetrend and 3dDeconvolve in R

This post introduces an expanded and updated demo of how detrending and normalizing ("scaling") of individual voxel (or vertex) timecourses works with afni 3dDeconvolve and 3dDetrend commands. 

In my original post I showed how to duplicate what 3dDetrend -normalize -polort 2 does with R code, not to replace the afni function, but to understand what it is doing. This expanded version simplifies the old code a bit, but more importantly, adds sections explaining another common method of preparing timecourses for analysis: using 3dDeconvolve with -num_stimts 0 -polort A -errts (plus censoring, motion regressors, etc.; see below) to create the residual error timeseries. 

The compiled demo is afni3dDetrend3dDeconvolve_R.pdf, made from a knitr file; its source code (with many comments) and the files required for compilation are in a section of my osf site, DOI https://doi.org/10.17605/OSF.IO/NU324 (please include that DOI in any citation). 

starting point: a voxel's timecourse

The examples use a left motor grey matter voxel from a preprocessed (with fmriprep 1.3.2) fMRI task run from the DMCC55B dataset; I chose it arbitrarily. The plot below is directly from the preprocessed nifti; this voxel has values around 6900, and the timecourse vector is length 540. DMCC55B's TR was 1.2 s, so this is a 10.8 minute-long run. The grey vertical lines are at one-minute intervals, with TR (frame) number along the x-axis and BOLD amplitude along the y-axis. (See the knitr .rnw for details, plotting code, etc.; this blog post just has a few highlights.)
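The run-length arithmetic, and one way the positions of one-minute lines could be calculated in TR units, in a few lines of R (a sketch; the variable names are mine, not necessarily those in the knitr file):

```r
n.TRs <- 540;   # length of the timecourse vector
TR <- 1.2;      # seconds
run.min <- n.TRs * TR / 60;   # run length in minutes
TRs.per.min <- 60 / TR;       # frames per minute
line.at <- seq(from=TRs.per.min, to=n.TRs, by=TRs.per.min);   # x-axis positions (in TRs) of one-minute lines
print(run.min);
print(line.at);
```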

raw (preprocessed) voxel timecourse

normalizing and detrending

Scaling alone isn't usually sufficient to prepare fMRI timeseries for analysis; we also need at least a bit of detrending. There's no universally correct degree or type of detrending to use. I generally recommend a modest amount of detrending before parcel-averaging types of analyses, specifically 3dDetrend -normalize -polort 2 . 

In the plot below, the same voxel timecourse as above is plotted after normalizing only (tc.np0, black, from 3dDetrend -normalize -polort 0), or with detrending at polort 2 (blue, tc.np2), and the more aggressive polort 5 (green, tc.np5). 

same timecourse, after 3dDetrend -normalize -polort 0, 2, or 5
Notice that the spiky parts of the timecourses are pretty much the same in all three versions, but the slower changes vary more; e.g., in the first minute the no-detrending line (tc.np0) is furthest from zero, the tc.np2 line is closer, and the tc.np5 line closest. 

It's sensible that larger polort numbers have more of an effect on the timecourse's shape, since, as explained in the afni help for 3dDetrend, -polort ppp gives "the Legendre polynomials of order up to and including 'ppp' in the list of vectors to remove", so larger -polort numbers mean removing more complex trends. The R code below shows how to do this type of normalizing and detrending; Legendre() is from Gregor Gorjanc, and requires the orthopolynom and polynom R packages.

 # R commands for 3dDetrend -normalize -polort 2  
 lm.out <- lm(tc.raw ~ Legendre(x=seq(tc.raw), n=2)); # lm with two Legendre polynomials (polort 2)  
 tmp <- residuals(lm.out);  # extract the residuals  
 tc.Rnp2 <- (tmp-mean(tmp))/sqrt(sum((tmp-mean(tmp))^2));  # normalize the residuals, afni-style   
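If you'd like to try the idea without the demo files or the Legendre() function, here is a self-contained sketch on a made-up timecourse. Note the substitution: base R's poly() is used in place of Legendre(). With the lm intercept both bases span the same constant, linear, and quadratic trends, so the residuals should be the same whichever is used; the sketch checks this against a fit with raw (non-orthogonal) polynomial terms. The simulated values are arbitrary.

```r
set.seed(42);
n.TRs <- 540;
x <- seq(n.TRs);
# made-up timecourse: baseline near 6900, slow quadratic drift, plus noise
tc.fake <- 6900 + 0.05*x - 0.0001*x^2 + rnorm(n.TRs, sd=20);

# poly() swapped in for Legendre(); intercept + poly(x, 2) spans constant, linear, quadratic
lm.poly <- lm(tc.fake ~ poly(x, degree=2));
lm.raw <- lm(tc.fake ~ x + I(x^2));   # same trends with raw polynomial terms
tmp <- residuals(lm.poly);
tc.fake.np2 <- (tmp-mean(tmp))/sqrt(sum((tmp-mean(tmp))^2));   # normalize the residuals, afni-style

same.resid <- all.equal(unname(residuals(lm.poly)), unname(residuals(lm.raw)));
print(same.resid);           # residuals agree for either basis
print(sum(tc.fake.np2^2));   # sum of squares is 1 after normalizing
```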

residual error via 3dDeconvolve

It's common to have the realignment parameters as nuisance regressors and censor high-motion frames prior to fMRI timecourse analyses, which can be done with 3dDeconvolve. Skipping rather a lot of explanations from the full demo, the afni command is:

 # errts.fname is the file made by 3dDeconvolve, from which the single voxel timecourse was extracted:  
 system2(paste0(afni.path, "3dDeconvolve"),   
     args=paste0("-input '", scale.fname, "' -polort A -float ",  
           "-censor '", c.fname, "' -num_stimts 0 ",  
           "-ortvec '", mot.fname, "' moveregs ",  
           "-nobucket -errts ", errts.fname), stdout=TRUE);  

where -num_stimts 0 means not to include any events in the model,  -errts that we want afni to write the residual error time series from the "full model fit to the input data" into file errts.fname (in this case, a .nii.gz), and -polort A that afni should set the polort level according to the run length (here, that gives 5).
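As I read the 3dDeconvolve help, -polort A picks the polynomial order from the run duration D (in seconds) as 1 + int(D/150). A tiny R sketch of that rule (the function name polort.A is mine) reproduces the 5 used for this run:

```r
# rule as given in the 3dDeconvolve help for -polort A: pnum = 1 + int(D/150)
polort.A <- function(n.TRs, TR) { 1 + floor((n.TRs * TR) / 150); }
print(polort.A(n.TRs=540, TR=1.2));   # 540 frames * 1.2 s = 648 s; 1 + floor(4.32) = 5
```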

Below is the 3dDetrend -normalize -polort 5 (tc.np5) timecourse again in green, with the new (-errts) version from the 3dDeconvolve command in pink: 

same voxel timecourse, 3dDeconvolve errts over scale(tc.np5), censored frames marked

The errts timecourse is highly correlated with the np5 version, which makes sense, since both included polort 5 detrending. They're not perfectly correlated, though: the 3dDeconvolve command also did censoring and included the motion regressors. I don't have a simple way to describe the differences in the lines; they're clearly very similar, but not identical; sometimes one is more extreme or spiky, sometimes the other.

The errts image has 0 in the censored frames. This is obvious in a 4d nifti (entire frame filled with 0s), but ambiguous in a single voxel (or vertex) timecourse like this (in the plot the censored frames are circled on the tc.errts timecourse, squared on the np5 version). For some analyses in R (e.g., averaging frames after an event for temporal compression) it'd be sensible to use NA for the censored frames.

This R code matches the 3dDeconvolve calculations:

 # made in startup code chunk; download at https://osf.io/nu324/files/c6nax  
 mot.fname <- paste0(demo.path, "sub-f1027ao_ses-wave1bas_task-Stroop_6regressors_demean.txt");   
 mot.tbl <- read.delim(mot.fname, sep=" ", header=FALSE); # 540 x 6  
   
 # made in startup code chunk; download at https://osf.io/nu324/files/bjrk8  
 c.fname <- paste0(demo.path, "sub-f1027ao_ses-wave1bas_task-Stroop_FD_mask0.5.txt");  # 0 1 censor file  
 censor.vec <- read.table(c.fname, header=FALSE)[,1];  
 censor.TRs <- which(censor.vec == 0) # [1] 300 316 431 448 497  
   
 # first, remove censored frames from the motion regressors and input timecourse  
 # The input timecourse tc.scale is the voxel from file scale.fname: the input bold.fname after  
 # scaling with 3dcalc -expr 'min(200, a/b*100)*step(a)*step(b)'   
 scale.vec <- tc.scale[-censor.TRs];  
 col1.vec <- mot.tbl$V1[-censor.TRs]; # demeaned trans_x  
 col2.vec <- mot.tbl$V2[-censor.TRs];   
 col3.vec <- mot.tbl$V3[-censor.TRs];   
 col4.vec <- mot.tbl$V4[-censor.TRs];   
 col5.vec <- mot.tbl$V5[-censor.TRs];   
 col6.vec <- mot.tbl$V6[-censor.TRs];   
   
 # fit the lm, polort 5, on the censored scale.vec and including 6 censored motion regressors  
 lm.out <- lm(scale.vec ~ Legendre(x=seq(scale.vec), n=5) + col1.vec + col2.vec + col3.vec + col4.vec + col5.vec + col6.vec);   
 tc.Rerrts <- residuals(lm.out);  # extract the residuals  
   
 # put 0s back in where the censored frames were taken out.  
 for (i in 1:length(censor.TRs)) { tc.Rerrts <- append(tc.Rerrts, 0, (censor.TRs[i]-1)); }  
   
 # the R version matches the 3dDeconvolve errts version   
 cor(tc.errts, tc.Rerrts); # almost perfect  
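The step of putting values back in at the censored frames can also be done by indexing rather than with an append() loop, which makes it easy to use NA instead of 0. A self-contained sketch with a made-up timecourse, made-up censored frames, and a stand-in linear model (none of these are from the demo):

```r
set.seed(1);
n.TRs <- 20;
tc.fake <- rnorm(n.TRs);         # made-up timecourse
censor.TRs <- c(3, 4, 11, 20);   # made-up frames to censor
y <- tc.fake[-censor.TRs];       # remove the censored frames
fake.resid <- residuals(lm(y ~ seq_along(y)));   # stand-in model fit on the kept frames

# loop version, as above: insert a 0 at each censored position, in order
tc.loop <- fake.resid;
for (i in 1:length(censor.TRs)) { tc.loop <- append(tc.loop, 0, (censor.TRs[i]-1)); }

# indexing version: start from all 0 (or all NA), then fill in the uncensored frames
tc.zero <- rep(0, n.TRs);
tc.zero[-censor.TRs] <- fake.resid;
tc.NA <- rep(NA, n.TRs);
tc.NA[-censor.TRs] <- fake.resid;

print(all.equal(unname(tc.loop), tc.zero));   # the two approaches agree
print(which(is.na(tc.NA)));                   # NA exactly at the censored frames
```

Both versions assume at least one censored frame; with none, indexing by -censor.TRs would return an empty vector, so a run without censoring needs a separate check.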

musings

Working out the R code which matched the 3dDeconvolve -errts demystified it for me; the 3dDetrend -normalize -polort 2 detrending and normalizing ("np2") I've treated as a default (for timecourse-averaging type analyses) is closer to these 3dDeconvolve residuals ("errts") than I'd thought. 

Am I going to change my default detrending method? Is the 3dDeconvolve errts "better" than the 3dDetrend np2 for an analysis like this? I think incorporating the censoring (to NA, not 0) and polort-picking calculation (which gave 5 in this demo) from 3dDeconvolve (instead of using 2 regardless of run length) would be sensible, modest improvements.

I'm less confident about whether to add in the motion regressors. My sense was that including these would somehow "account for" or "clean up" any motion effects not "corrected" by preprocessing. And including the six realignment parameter columns does change the timecourse produced by the model a bit ... but it's not much different if the actual realignment parameters or random ones are included, making me suspect the change is more due to the change in model degrees of freedom than actually "fixing" the head motion.

Here's code for the random-motion-regressor models:

 # permute numbers in each motion regressor separately  
 lm.out <- lm(scale.vec ~ Legendre(x=seq(scale.vec), n=5) + sample(col1.vec) + sample(col2.vec) + sample(col3.vec) + sample(col4.vec) + sample(col5.vec) + sample(col6.vec));   
 tc.test1 <- residuals(lm.out);  # extract the residuals  
 for (i in 1:length(censor.TRs)) { tc.test1 <- append(tc.test1, 0, (censor.TRs[i]-1)); } # put 0s back in  
   
 # random numbers for the motion regressors  
 ct <- length(scale.vec);  # how long to make each fake motion regressor column  
 lm.out <- lm(scale.vec ~ Legendre(x=seq(scale.vec), n=5) + rnorm(ct) + rnorm(ct) + rnorm(ct) + rnorm(ct) + rnorm(ct) + rnorm(ct));   
 tc.test2 <- residuals(lm.out);  # extract the residuals  
 for (i in 1:length(censor.TRs)) { tc.test2 <- append(tc.test2, 0, (censor.TRs[i]-1)); } # put 0s back in  

3dDeconvolve errts timecourse plus permuted motion regressor columns (tc.test1)

3dDeconvolve errts timecourse plus random-number motion regressor columns (tc.test2)
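The degrees-of-freedom suspicion can be illustrated with entirely made-up data: in ordinary least squares, adding any extra columns to a model (even pure noise) cannot increase the residual sum of squares. This sketch uses poly() rather than Legendre() for the detrending terms, and random numbers for both the "timecourse" and the "motion" columns, so it says nothing about real head motion; it only shows the bookkeeping effect.

```r
set.seed(2026);
n <- 535;   # made-up length, like a 540-frame run minus 5 censored frames
x <- seq(n);
y <- rnorm(n);   # made-up timecourse with no motion effect at all
fake.mot <- matrix(rnorm(n*6), ncol=6);   # six columns of random numbers

lm.base <- lm(y ~ poly(x, degree=5));              # polynomial detrending only
lm.more <- lm(y ~ poly(x, degree=5) + fake.mot);   # plus six random regressors
rss.base <- sum(residuals(lm.base)^2);
rss.more <- sum(residuals(lm.more)^2);

print(c(rss.base, rss.more));   # residual sum of squares for each model
print(df.residual(lm.base) - df.residual(lm.more));   # six fewer residual degrees of freedom
print(cor(residuals(lm.base), residuals(lm.more)));   # how similar the two residual timecourses are
```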


Thursday, January 15, 2026

so long, windows

I've always had windows on my work computers, and have built up quite a bit of "muscle memory" over the (ahem) decades for how to use it; the hard part is what analysis to do or image to look at, not how to open the image. I didn't want to change my computer setup, but am unwilling to use windows 11, given all its privacy & AI intrusions and limitations. The last few months I've switched over to linux, and it's ... pretty much fine; everything is working more-or-less like it was before.

My hope is that this post may smooth the path for other neuroscience folks considering a linux switch but hesitant or unsure how to go about it. For framing, I am most assuredly not a linux guru; I started with some experience using linux servers managed by others, but zero desire to switch my trusty mousepad for a command prompt or to spend days/weeks/months relearning how to do routine work tasks. But now you'd have to look fairly closely to notice that the software changed on my work computer in the last six months, and I consider that a good thing.

os: zorin linux

There are a lot of "flavors" of linux, and choosing one is daunting. Since my goal was to keep my computers as windows-ish as possible (and not fiddle with settings) I chose Zorin 18 pro, and highly recommend it. I was wondering if hardware would be a hassle (my desktop computer has two monitors, two hard drives, a USB webcam, keyboard, and my beloved vintage Fingerworks iGesture mousepad), but everything just worked, no problems whatsoever. (I've since also installed Zorin linux to dual-boot with windows on my laptop, also without hardware difficulty.)

My desktop computer has two drives: a smaller one for the operating system, and a larger for file storage. I had the zorin installation program reformat the smaller drive, but left the larger unchanged; it's still formatted NTFS as it was for windows. I was worried the mixed drive formats would cause trouble but haven't had any, nor any speed issues. 

Zorin comes with an array of menu/desktop appearance layouts: some mimic windows, others mac os or other linux versions. I picked a windows style, and then tweaked the start menu, colors, etc. to my preference. The oddest thing I changed from defaults was the "desktop environment" (strictly, the display server), from wayland to x11, mostly because I wanted a picture gallery screensaver (screensavers are apparently a contentious topic in linux circles). Changing the start menu was non-intuitive: it's done via the (installed) System Tools -> Main Menu program, not via settings or right-clicking on the start menu itself. 

software

Many programs have linux versions (R, zoom, etc.) and so are no problem (install from the Zorin Software collection directly or from the program's website; clicking the start menu and typing a program's name brings up an installer in many cases), but others present more of a challenge. 

The biggest hurdle for me was OneNote: LibreOffice (comes with Zorin) is fine, but doesn't have a OneNote equivalent. I first tried Logseq, but ended up going with Joplin. Which to use is definitely a matter of personal style and preference (e.g., my notes are organized hierarchically and I don't like tags); in my case some of the individual page layouts and formatting got scrambled in the conversion, but the critical text, images, and page organization all came through fine, and I can make new pages without difficulty:


I have a lot of Powerpoint and Word documents; LibreOffice has been managing them fine so far, but wanting Microsoft Office to be available "just in case" is the primary reason I installed Zorin linux on my laptop as dual-boot instead of removing windows entirely. 

In no particular order, here are some of the programs I used on windows and what I'm using instead on linux (I didn't list programs which are available for both, such as RStudio). Many came with Zorin, others I installed via its Software "store".  

  • Notepad++ -> NotepadNext
  • TigerVNC -> Remmina
  • Snipping Tool -> Gradia
  • Foxit & Sumatra (pdfs) -> qpdfview (has tabs; for knitr compiling), Document Viewer (for highlighting & notes), LibreOffice Draw (for complex editing)
  • WinMerge -> Meld
  • WinSCP -> FileZilla 
  • File Manager -> Dolphin (for navigation); Files (GNOME Nautilus; for mounting smb and turtle git)
  • tortoise git -> turtle git
  • MS Office -> LibreOffice 26.2*
  • MS OneNote -> Joplin
  • 7-zip GUI -> File Roller; run sudo apt install ark 7zip to have an Extract menu when right-clicking on .zip files.

remaining headaches

I'm still puzzled why some file operations (smb mounting, turtle git) work differently in Files and Dolphin. I prefer Dolphin's navigation-tree layout and right-click options, but can only access smb-mounted files from RStudio if I connect first via Files.


MRIcron and mango work fine for nifti and dicom image viewing, but I start the programs by double-clicking on their executables; getting pretty desktop icons or start menu items for them is tricky (you can't just right-click on the executable and make a working shortcut). Programs installed from the Software store don't have this configuration problem, so I assume it's something related to how I did the installation. Those two viewers are less essential, though, now that I have afni installed locally (I never got it working on windows except through vnc).

FileZilla often seems slower than winscp, despite connecting to the same servers. I think FileZilla disconnects more completely, requiring it to pause and reconnect when I, e.g., double-click a text file for viewing. There may be a setting for this? I had to change several FileZilla defaults, including increasing the Timeout time and changing the Double-click action on files to View/Edit. Setting file associations is still tricky; it doesn't always "see" installed programs, despite messing with the Flatseal permissions. 


* Zorin 18 came with LibreOffice 25 preinstalled. I find version 26 matches my (older windows-style) instincts more closely, especially after setting the "UI Mode" to "Standard Toolbar" and Icons to "Colibre" (Options -> LibreOffice -> Appearance). I did the upgrade in the Terminal, following the uninstall and install commands from this post. The terminal commands don't appear to change the GUI, but after the flatpak install command finishes, the start menu commands open the new version of LibreOffice, with all the file associations, etc. updated properly without rebooting (!).  [added 5 February 2026]


Wednesday, October 1, 2025

apparent motion also at 7T

WUSTL (ahem, WashU) recently set up a 7T scanner (Siemens Terra; 8Tx32Rx_Head_C head coil), and we've started a bit of piloting and exploring what we can do with it. I've been working through multiple aspects: the BOLD images themselves look different because of the stronger magnet, we're trying some new-to-me multi-echo sequences, collecting some noRF and phase images, and even the BIDS and fMRIPrep parts have taken some script updating. I think I'm getting enough of a handle on things now (thanks for help via neurostars, Chris Markiewicz and Taylor Salo!) to share some impressions, but it's all still very much in process.

We've had two pilot sessions so far, each with a different participant and somewhat different sequences. We're most interested in task fMRI, but that's tricky at the moment since the scanner doesn't (yet) have a way to present stimuli, nor to record responses. Eventually I want to do sequence comparisons with the reward-possible DMCC proactive Cued Task-Switching paradigm (and probably other tasks), like in my OHBM 2023 poster. But to get started I asked the second pilot participant to do a self-paced version of the HCP Motor task: blocks of right finger tapping, left finger tapping, right toe wiggling, left toe wiggling, tongue moving, with a bit of rest in between and a few deep breaths before and after each movement block to serve as onset/offset markers. (I'm not going to discuss the movement analysis parts yet, but so far I'm encouraged.)

Both pilot sessions included a pair (PA/AP) of runs with an acquisition similar-ish to what we used in the DMCC at 3T: 2.4 mm iso voxels, MB4, TR 1.2 s. But the 7T allows more acceleration, so they added in-plane acceleration (GRAPPA 2) and collected 4 echoes instead of just 1.

Below are the realignment parameters (from the fmriprep derivative _desc-confounds_timeseries.tsv files) for those runs from the two participants we've had so far. In both cases the grey vertical lines are at one-minute intervals; the first participant (TB7T1)'s runs were about 10.5 minutes and he alternated periods of regular and slow/deep breaths; the second participant (TD7T1)'s runs were about 6.5 minutes each and he did the blocked breathing-motor task.
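
For anyone wanting to make a similar plot, here is a minimal sketch of one way to do it in R; it is not the code that made the figures in this post. The six column names are the ones fMRIPrep uses for the realignment parameters (translations in mm, rotations in radians), but the "confounds file" is fake random numbers written to a temporary file so the example runs on its own, and the variable names and the 325-TR run length are assumptions for illustration.

```r
TR <- 1.2;   # seconds, as in the runs described above
n.TRs <- 325;   # hypothetical run length: 6.5 minutes at TR 1.2 s
mot.ids <- c("trans_x", "trans_y", "trans_z", "rot_x", "rot_y", "rot_z");

# make a fake confounds file (random numbers, NOT real data) so the sketch is self-contained
set.seed(1);
fake.tbl <- as.data.frame(matrix(rnorm(n.TRs*6, sd=0.05), ncol=6, dimnames=list(NULL, mot.ids)));
fake.tbl$framewise_displacement <- c(NA, runif(n.TRs-1, 0, 0.3));  # real files have many more columns
in.fname <- tempfile(fileext="_desc-confounds_timeseries.tsv");
write.table(fake.tbl, in.fname, sep="\t", quote=FALSE, row.names=FALSE, na="n/a");

# read the file and keep just the six realignment parameter columns
in.tbl <- read.delim(in.fname, na.strings="n/a");   # BIDS tsv files code missing values as n/a
mot.tbl <- in.tbl[, mot.ids];
minute.TRs <- seq(from=60/TR, to=nrow(mot.tbl), by=60/TR);   # TR numbers at one-minute intervals

# plot the six columns as lines, with grey vertical lines each minute
out.fname <- tempfile(fileext=".pdf");
pdf(out.fname, width=8, height=3);
matplot(mot.tbl, type='l', lty=1, col=1:6, xlab="TR", ylab="mm or radians", main="fake run");
abline(v=minute.TRs, col='grey');
legend("topleft", legend=mot.ids, col=1:6, lty=1, ncol=6, cex=0.6, bty='n');
dev.off();
```

To use it with a real run, point in.fname at the run's _desc-confounds_timeseries.tsv file and skip the fake-file lines.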


TB7T1 clearly has much more apparent motion than TD7T1; the periods of regular and deep breathing are obvious, not only in the realignment parameters but also in movies of the BOLD run. TB7T1_run-44_echo-1_bold.avi is the "rawdata" (before preprocessing) version of echo 1; TB7T1_run-44_space-MNI152NLin2009cAsym_desc-preproc_bold.avi is after preprocessing (with fMRIPrep 25.1.0 using Tedana to optimally combine the echoes). The brain sort of looks like it's "jumping" with the breaths in the raw movie; perhaps more like expanding and contracting in the preprocessed version. 

The TB7T1 participant is the same individual as in some of our previous acquisition tests (at 3T); clearly, using 7T doesn't mean we can forget about apparent motion. ... I wouldn't have forgotten about apparent motion regardless since it's a favorite topic of mine, as is probably obvious since I'm starting off this (hopefully) series of posts about our 7T piloting with it.

Each of these sessions included multiple different acquisition sequences; in all cases TB7T1 had more obvious apparent motion than did TD7T1. TB7T1 run 28 had the largest apparent motion; it also had smaller voxels and a longer TR than the other examples.


Finally, the TB7T1 run 44 motion is also striking in the greyplot version created by fmriprep (left); TD7T1 run 24 is below, right:


Wednesday, June 18, 2025

OHBM 2025 (let's chat about crescents!)

I'll be at OHBM 2025 next week (want to meet up? drop me a line), presenting a poster about the crescent artifacts (title: "Crescent" artifacts with multiband fMRI acquisitions: appearance, causes, and consequences). The poster pdf is available via OHBM's system, or at https://osf.io/pzj39.

My blog post last December covers some of the same information, but preparing the poster refined my thoughts a bit. I still don't have as clear a sense as I'd like of exactly how much the artifact affects fMRI signal quality, but am confident that there is enough likelihood of a substantial negative impact that it shouldn't be ignored. 

It seems to be a given in the MR physics literature that Nyquist ghosts appreciably degrade EPI signal to noise. Quantifying the impact in GLM results is difficult, however, especially after preprocessing and smoothing, in group analyses (artifact location and intensity vary across people), and when runs of different encoding directions are analyzed together. Qualitatively, I have seen crescents in single-subject statistical images (average activation, GLMs), but that is of course not the typical analysis. It could be informative to run a few comparison tests; for example, is the effect strength in frontal parcels (i.e., those in the path of the crescent artifact) worse in a group of participants with the artifact than without? Or in PA than AP encoding runs?

While the degree of impact is uncertain, I think there's enough evidence to recommend that crescent artifacts be one of the criteria for choosing acquisition protocols: select fMRI acquisition parameters so that crescent artifacts are minimized and/or appear in brain areas of low theoretical interest. Since the crescent artifact likely reduces BOLD signal quality somewhat, and is more often prominent in people with smaller brains, there's a risk of bias if an experimentally-important participant characteristic (e.g., age, gender) is associated with differences in head size; extra care should be taken in these cases.

Monday, March 10, 2025

"fun" with Box AI information leakage

Our university IT folks encourage employees to use their box account for data storage, including of sensitive (human subjects research data, medical records, etc.) files. I wasn't pleased to see the Box AI button appear, and asked our IT what exactly it does, and how it impacts file privacy. 

We went through several rounds of messages, including these responses: "Yes, the HIPAA protections are still in place with the BOX-AI application. Box AI securely engages an external AI provider to compute embeddings from these text chunks and the user’s question. Advanced embeddings models, such as Azure OpenAI’s ada-02, are utilized for this purpose." and "Box does not allow any access to data to outside vendors other than the isolated environment used to process the data. No retained data is being allowed. The data is processed in a bubble, then the bubble is destroyed when completed essentially."

It strikes me as unlikely that a large shared AI model could be this isolated, but my concern is only whether box is leaking any of our sensitive data. Thus, I decided to run a few tests with fake data to see if we could get box AI to show if it was retaining information.

The test file and two chat transcripts are below the jump. Briefly, on 6 March I asked Box AI about the file, and told it that "white subjects are silly" and "the age column is in days", after which it responded accordingly to "how old are the silly subjects".

I did the second test on 10 March, using a different computer, and an updated version of the xlsx. Critically, I asked the box AI, "how old are the silly subjects in years?" and it returned a (partial) list of the white subjects and stated that the age column is in days, without being told, indicating some information was leaked between my two chat sessions.

Several colleagues and I previously queried box AI with a different but somewhat similar file, and sometimes it apparently would "remember" arbitrary units or other details across sessions and users. Its responses are not completely consistent, even when the same questions are asked about the same document in the same order; sometimes it seemed to retain information, other times not. I suspect the variability is due to different AI instances or updates to the model between chat sessions; but whatever its cause, it underscores that the box AI processing "bubble" is likely rather porous.

Anyone else tried anything similar?