Mapping brain functions during naturalistic stimulation
An fMRI journey through perception, language, and emotion. 101 Dalbraintians brings together neuroscience, computation, and storytelling to reveal how human brain activity synchronizes with the natural flow of a movie.
101 Dalbraintians is an open-access multimodal naturalistic fMRI dataset that captures how the human brain perceives, understands, and feels during real-life experience. Using the full-length movie One Hundred and One Dalmatians, this resource integrates synchronized neural, visual, and auditory data with detailed manual annotations and advanced computational modeling.
The dataset was designed to provide a comprehensive tool for investigating neural dynamics in naturalistic conditions, offering unprecedented access to the perceptual, cognitive, and emotional dimensions of movie watching. It enables researchers to explore processes spanning language comprehension, social cognition, memory, attention, emotion, and cross-modal perception.
Each second of the movie has been manually annotated across multiple perceptual and semantic categories, and modeled through computational frameworks that extract both low-level (visual and acoustic) and high-level (semantic and categorical) features. This unique combination allows users to link brain activity with the movie’s multisensory structure and narrative meaning.
Designed for transparency, reproducibility, and reuse, 101 Dalbraintians invites the scientific community to explore, extend, and reinterpret its contents — from the mechanisms of sensory processing to the complexity of human cognition.
Fifty subjects took part in the study: typically developed (TD) individuals and sensory-deprived (SD) individuals, who have lacked visual or auditory experience since birth. Three samples of TD individuals each underwent a different experimental condition, consisting of the presentation of one version of the same movie: the full multimodal audiovisual (AV) version (N = 10, 35 ± 13 years, 8 females), the auditory (A) version (N = 10, 39 ± 17 years, 7 females), or the visual (V) version (N = 10, 37 ± 15 years, 5 females). SD individuals, comprising blind (N = 11, 46 ± 14 years, 3 females) and deaf (N = 9, 24 ± 4 years, 5 females) participants, were presented with the A and V movie conditions, respectively. The following visual summarizes sample size and age distribution (mean ± SD) across groups.
[Figure: participants per group; age distribution ± SD]
The categorical annotation model was designed to capture the richness of naturalistic movie perception by describing, second by second, the visual and auditory information conveyed by One Hundred and One Dalmatians. Two parallel annotation sets were created to reflect the distinct nature of the visual and auditory streams. While the narrator’s voice provides a global description, the visual track reveals localized perceptual content. Together, they form a comprehensive taxonomy of the movie’s perceptual and narrative structure.
Visual Categories
Visual annotations were defined within one-second windows, labeling all salient foreground elements and supplementary details related to color, motion, or narrative importance.
- Animals: all species portrayed on screen, from domestic pets to wildlife.
- Body-parts: isolated limbs or features such as a hand, leg, or paw, excluding faces.
- Human faces / Animal faces: close-ups of any face clearly visible regardless of viewpoint.
- Houses: single buildings or cityscapes, including farms, castles, façades.
- Location: the setting of the action, distinguishing indoor and outdoor environments.
- Landscape: natural or artificial broader spatial contexts (e.g., countryside or urban scenes).
- Objects: human-made artifacts and tools handled or visible on screen.
- Person: depictions of the full body or torso; isolated faces fall under the faces categories above.
- Vehicles: recognizable transportation means or salient parts (e.g., car hood, wheel).
Auditory Categories
The auditory stream was annotated using the same 1-second sampling, labeling all foreground and background sounds that contribute to the narrative.
- Animals: all animal vocalizations or noises clearly distinguishable from background.
- Houses: verbal or acoustic references to buildings and built environments (e.g., “outside the castle gate”).
- Objects: sounds or mentions of human-made items (e.g., clinking teacups, ringing bells).
- Person: human speech and activity sounds (dialogues, footsteps, coughing, laughter).
- Vehicles: vehicle noises and onomatopoeic descriptions (“beep”, “vroom”, “screech”).
Movie-Editing and Linguistic Features
Complementary annotations capture the formal cinematic structure—the editor’s visual and auditory choices shaping narrative continuity and engagement.
- Scenes: narrative units characterized by stable location, characters, and temporal continuity.
- Camera cuts: abrupt changes in camera angle or viewpoint between consecutive shots.
- Audio descriptions: narrator’s voice-over conveying visual or emotional information.
- Dialogues: spoken exchanges and monologues forming the linguistic backbone of the movie.
- Soundtracks: musical scores and songs accompanying or enhancing visual flow.
- Subtitles: on-screen text transcribing spoken narrative or describing environmental sounds.
- Text in frame: any written element embedded in the visual scene (e.g., signs, letters).
The annotation process spanned nearly 200 hours of expert manual labeling, ensuring high temporal precision (1 s) and semantic consistency across modalities. This multi-layered framework bridges low-level perceptual features and higher-order narrative constructs.
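To make the annotation scheme concrete, below is a minimal sketch of how per-second category labels could be loaded into a regressor matrix aligned to the fMRI sampling grid. The file name, column names, and the 2 s TR are illustrative assumptions, not the dataset's actual file layout.

```python
# Minimal sketch: load per-second categorical annotations into a binary
# design matrix and downsample it to the fMRI TR. File path and column
# names are hypothetical; check the released annotation files for the
# actual format.
import numpy as np
import pandas as pd

annotations = pd.read_csv("annotations/visual_categories.tsv", sep="\t")  # hypothetical path

categories = ["animals", "body_parts", "faces", "houses",
              "landscape", "objects", "person", "vehicles"]  # assumed column names
design = annotations[categories].to_numpy(dtype=float)       # shape: (n_seconds, n_categories)

# Average adjacent 1 s annotation bins to match an assumed 2 s TR.
tr = 2
n_trs = design.shape[0] // tr
design_tr = design[: n_trs * tr].reshape(n_trs, tr, -1).mean(axis=1)
print(design_tr.shape)  # (n_TRs, n_categories)
```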
The computational modeling framework describes how the visual and auditory movie stimuli were decomposed into low-level sensory, high-level representational, and semantic feature spaces. These models complement the manual annotations by providing an automated and hierarchical description of the sensory and cognitive dimensions of the movie.
Low-level Visual Model — Motion Energy
Motion energy features were derived from space-time Gabor filters spanning multiple orientations, spatial frequencies, and temporal frequencies (0, 2, and 4 Hz). Each two-second movie segment was characterized by 4,715 descriptors, capturing fine-grained motion energy across directions in the frames. This model mimics early visual processing in cortical areas such as V1 and MT, representing sensitivity to temporal frequency and motion direction. The MATLAB implementation used (Gallant Lab) follows the approach of Nishimoto et al. (2011).
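As an illustration of the motion-energy idea, the sketch below builds one space-time Gabor quadrature pair and computes its energy for a short grayscale clip; a full model tiles many orientations and spatial and temporal frequencies. Filter size, frame rate, and frequencies here are assumptions for illustration, and the dataset itself relies on the Gallant Lab MATLAB implementation.

```python
# Minimal sketch of a single motion-energy descriptor from a space-time
# Gabor quadrature pair, in the spirit of Nishimoto et al. (2011).
import numpy as np

def spacetime_gabor(size=32, n_frames=16, sf=0.1, tf=2.0, ori=0.0, fps=8.0, phase=0.0):
    """One space-time Gabor: sf in cycles/pixel, tf in Hz, ori in radians."""
    y, x = np.meshgrid(np.arange(size) - size / 2,
                       np.arange(size) - size / 2, indexing="ij")
    t = (np.arange(n_frames) - n_frames / 2) / fps
    xr = x * np.cos(ori) + y * np.sin(ori)
    envelope = np.exp(-(x**2 + y**2) / (2 * (size / 6) ** 2))
    carrier = np.cos(2 * np.pi * (sf * xr[None] - tf * t[:, None, None]) + phase)
    return envelope[None] * carrier  # shape: (n_frames, size, size)

def motion_energy(clip, **kw):
    """Quadrature energy of one filter pair for a grayscale clip (frames, H, W)."""
    even = np.sum(clip * spacetime_gabor(phase=0.0, **kw))
    odd = np.sum(clip * spacetime_gabor(phase=np.pi / 2, **kw))
    return even**2 + odd**2

# Toy 2-second clip at an assumed 8 fps, 32x32 pixels.
clip = np.random.rand(16, 32, 32)
print(motion_energy(clip, tf=2.0, ori=np.pi / 4))
```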
Low-level Auditory Model — Power Spectrum
The low-level auditory model was based on the power spectral density of the sound waveform, computed via Welch’s method using Gaussian windows. The resulting 449-dimensional representation describes signal power across frequencies up to ~15 kHz, capturing the spectral energy distribution and envelope dynamics over 2-second intervals.
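A hedged sketch of how such a spectral descriptor could be computed with Welch's method and a Gaussian window follows; the sampling rate, window length, and frequency cutoff are placeholders rather than the exact parameters behind the released 449-dimensional features.

```python
# Minimal sketch: Welch power spectral density of one 2 s audio chunk
# using a Gaussian window, as described above. Parameters are assumptions.
import numpy as np
from scipy.signal import welch, get_window

fs = 44100                          # assumed audio sampling rate
chunk = np.random.randn(2 * fs)     # stand-in for a 2 s mono waveform

nperseg = 4096
win = get_window(("gaussian", nperseg / 8), nperseg)  # Gaussian window (std is a placeholder)
freqs, psd = welch(chunk, fs=fs, window=win, nperseg=nperseg)

# Keep spectral power up to ~15 kHz, mirroring the model description.
mask = freqs <= 15000
print(freqs[mask].shape, psd[mask].shape)
```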
High-level Visual Model — VGG-19 Feature Space
The VGG-19 convolutional neural network was used to extract hierarchical representations of the visual stream. Intermediate layer outputs (ReLU3_1) captured low/mid-level statistics similar to early visual cortex, while deeper layers (ReLU6) encoded object- and scene-level semantics, supporting complex visual recognition processes. These features provide a bridge between visual input and neural activity in higher-order visual areas.
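The sketch below shows one way to pull intermediate VGG-19 activations from a single movie frame using forward hooks in torchvision. The layer indices for ReLU3_1 and for the first classifier ReLU are assumptions based on torchvision's layer ordering, and the frame file name is hypothetical.

```python
# Minimal sketch: extract intermediate VGG-19 activations with forward hooks.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

features = {}
def hook(name):
    def _hook(module, inputs, output):
        features[name] = output.detach().flatten(start_dim=1)
    return _hook

model.features[11].register_forward_hook(hook("relu3_1"))   # assumed index of ReLU3_1
model.classifier[1].register_forward_hook(hook("relu6"))    # ReLU after the first FC layer

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame_0001.png").convert("RGB")  # hypothetical frame file
with torch.no_grad():
    model(preprocess(frame).unsqueeze(0))

print({k: v.shape for k, v in features.items()})
```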
High-level Auditory Model — VGGish Feature Space
The VGGish network — a VGG-like architecture trained on the AudioSet dataset — was employed to extract complex auditory representations. Features from layer ReLU5.1 describe higher-order auditory content including harmonic patterns, rhythm, and semantic aspects such as speech and music. This allows for a robust mapping between sound characteristics and cortical auditory responses.
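For orientation, a minimal sketch of obtaining VGGish representations via a community PyTorch port is shown below. The torch.hub repository name and its file-based interface are assumptions, and the call returns the final 128-dimensional embeddings; intermediate activations such as ReLU5.1 could be collected with forward hooks, following the same pattern as the VGG-19 example above.

```python
# Minimal sketch: VGGish embeddings for an audio chunk via torch.hub
# (community port; repository name is an assumption).
import torch

model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

with torch.no_grad():
    embeddings = model.forward("movie_audio_chunk.wav")  # hypothetical 2 s audio file
print(embeddings.shape)  # one 128-d embedding per ~1 s of audio
```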
Compositional Semantic Features — GPT-4 Embeddings
To capture narrative meaning, the full English script of the movie was segmented at the sentence level and embedded with OpenAI's text-embedding-3-small model (1536-dimensional output). These contextual embeddings encode rich semantic relationships, encompassing syntax, pragmatics, and thematic continuity throughout the narrative. This model enables the exploration of brain activity associated with conceptual and linguistic comprehension beyond sensory modality.
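A minimal sketch of embedding one script sentence with the OpenAI embeddings API is shown below; the example sentence is invented, not a line from the actual script, and the call requires the openai package and an OPENAI_API_KEY in the environment.

```python
# Minimal sketch: sentence-level embedding with text-embedding-3-small.
from openai import OpenAI

client = OpenAI()
sentence = "Pongo watches the street from the window."  # example line, not the actual script

response = client.embeddings.create(model="text-embedding-3-small", input=[sentence])
vector = response.data[0].embedding
print(len(vector))  # 1536
```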
Defaced structural images, as well as raw and preprocessed fMRI data, are organized according to the BIDS standard and available on Figshare. The code to preprocess the (f)MRI data is publicly available in the repository under the code/ subdirectory; it includes bash scripts for preprocessing anatomical and functional data with the ANTs, AFNI, and FSL software packages. Use the link below to open the repository (placeholder). Additionally, the code for the ISC analysis is available in an OSF repository, linked below, to support reproducibility.
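As a generic illustration of inter-subject correlation (ISC), a leave-one-out sketch on parcel-averaged time series follows. This is not the code released on OSF; array shapes and the toy data are assumptions.

```python
# Minimal sketch of leave-one-out inter-subject correlation (ISC).
import numpy as np

def leave_one_out_isc(data):
    """data: (n_subjects, n_timepoints, n_regions) z-scored BOLD time series."""
    n_subjects, _, n_regions = data.shape
    isc = np.zeros((n_subjects, n_regions))
    for s in range(n_subjects):
        left_out = data[s]
        group_mean = data[np.arange(n_subjects) != s].mean(axis=0)
        # Pearson correlation per region between the left-out subject
        # and the average of the remaining subjects.
        for r in range(n_regions):
            isc[s, r] = np.corrcoef(left_out[:, r], group_mean[:, r])[0, 1]
    return isc

bold = np.random.randn(10, 300, 50)              # toy data: 10 subjects, 300 TRs, 50 regions
print(leave_one_out_isc(bold).mean(axis=0)[:5])  # group-level ISC for the first 5 regions
```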
Setti, F., Bottari, D., Leo, A., Diano, M., Bruno, V., Tinti, C., Cecchetti, L., Garbarini, F., Pietrini, P., Ricciardi, E., & Handjaras, G. 101 Dalmatians: a multimodal naturalistic fMRI dataset in typical development and congenital sensory loss. Sci Data 12, 1792 (2025).
Setti, F., Bottari, D., Leo, A., Diano, M., Bruno, V., Tinti, C., Cecchetti, L., Garbarini, F., Pietrini, P., Ricciardi, E., & Handjaras, G. 101 Dalmatians: a multimodal naturalistic fMRI dataset in typical development and congenital sensory loss. figshare (2025).
Marras, L., Teresi, L., Simonelli, Setti, F., Ingenito, A., Handjaras, G., & Ricciardi, E. Neural representation of action features across sensory modalities: A multimodal fMRI study. NeuroImage, 121439 (2025).
Orsenigo, D., Setti, F., Pagani, M., Petri, G., Tamietto, M., Luppi, A., & Ricciardi, E. Beyond reorganization: Intrinsic cortical hierarchies constrain experience-dependent plasticity in sensory-deprived humans. bioRxiv (2025).
Setti, F., Handjaras, G., Bottari, D., Leo, A., Diano, M., Bruno, V., … & Ricciardi, E. A modality-independent proto-organization of human multisensory areas. Nature Human Behaviour, 7(3), 397–410 (2023).
Lettieri, G., Handjaras, G., Cappello, E. M., Setti, F., Bottari, D., Bruno, V., … & Cecchetti, L. Dissecting abstract, modality-specific and experience-dependent coding of affect in the human brain. Science Advances, 10(10), eadk6840 (2024).
Lettieri, G., Handjaras, G., Setti, F., Cappello, E. M., Bruno, V., Diano, M., … & Cecchetti, L. Default and control network connectivity dynamics track the stream of affect at multiple timescales. Social Cognitive and Affective Neuroscience, 17(5), 461–469 (2022).