Mapping brain functions during naturalistic stimulation
An fMRI journey through perception, language, and emotion. 101 Dalbraintians brings together neuroscience, computation, and storytelling to reveal how human brain activity synchronizes with the natural flow of a movie.
101 Dalbraintians is an open-access multimodal naturalistic fMRI dataset that captures how the human brain perceives, understands, and feels during real-life experience. Using the full-length movie One Hundred and One Dalmatians, this resource integrates synchronized neural, visual, and auditory data with detailed manual annotations and advanced computational modeling.
The dataset was designed to provide a comprehensive tool for investigating neural dynamics in naturalistic conditions, offering unprecedented access to the perceptual, cognitive, and emotional dimensions of movie watching. It enables researchers to explore processes spanning language comprehension, social cognition, memory, attention, emotion, and cross-modal perception.
Each second of the movie has been manually annotated across multiple perceptual and semantic categories, and modeled through computational frameworks that extract both low-level (visual and acoustic) and high-level (semantic and categorical) features. This unique combination allows users to link brain activity with the movie’s multisensory structure and narrative meaning.
Designed for transparency, reproducibility, and reuse, 101 Dalbraintians invites the scientific community to explore, extend, and reinterpret its contents — from the mechanisms of sensory processing to the complexity of human cognition.
Fifty subjects took part in the study: typically developed (TD) individuals and sensory-deprived (SD) individuals, who have lacked visual or auditory experience since birth. Three samples of TD individuals each underwent a different experimental condition, consisting of the presentation of one version of the same movie: the full multimodal audiovisual (AV) version (N = 10, 35 ± 13 years, 8 females), the auditory (A) version (N = 10, 39 ± 17 years, 7 females), or the visual (V) version (N = 10, 37 ± 15 years, 5 females). SD individuals, comprising blind (N = 11, 46 ± 14 years, 3 females) and deaf (N = 9, 24 ± 4 years, 5 females) participants, were presented with the A and V movie conditions, respectively. The following visual summarizes sample size and age distribution (mean ± SD) across groups.
[Figure: participants per group; age distribution ± SD]
The categorical annotation model was designed to capture the richness of naturalistic movie perception by describing, second by second, the visual and auditory information conveyed by One Hundred and One Dalmatians. Two parallel annotation sets were created to reflect the distinct nature of the visual and auditory streams. While the narrator’s voice provides a global description, the visual track reveals localized perceptual content. Together, they form a comprehensive taxonomy of the movie’s perceptual and narrative structure.
Visual Categories
Visual annotations were defined within one-second windows, labeling all salient foreground elements and supplementary details related to color, motion, or narrative importance.
- Animals: all species portrayed on screen, from domestic pets to wildlife.
- Body-parts: isolated limbs or features such as a hand, leg, or paw, excluding faces.
- Human faces / Animal faces: close-ups of any face clearly visible regardless of viewpoint.
- Houses: single buildings or cityscapes, including farms, castles, façades.
- Location: the setting of the action, distinguishing indoor and outdoor environments.
- Landscape: natural or artificial broader spatial contexts (e.g., countryside or urban scenes).
- Objects: human-made artifacts and tools handled or visible on screen.
- Person: depictions of the full body or torso; isolated faces fall under the faces categories above.
- Vehicles: recognizable transportation means or salient parts (e.g., car hood, wheel).
Auditory Categories
The auditory stream was annotated using the same 1-second sampling, labeling all foreground and background sounds that contribute to the narrative.
- Animals: all animal vocalizations or noises clearly distinguishable from background.
- Houses: verbal or acoustic references to buildings and built environments (e.g., “outside the castle gate”).
- Objects: sounds or mentions of human-made items (e.g., clinking teacups, ringing bells).
- Person: human speech and activity sounds (dialogues, footsteps, coughing, laughter).
- Vehicles: vehicle noises and onomatopoeic descriptions (“beep”, “vroom”, “screech”).
Movie-Editing and Linguistic Features
Complementary annotations capture the formal cinematic structure—the editor’s visual and auditory choices shaping narrative continuity and engagement.
- Scenes: narrative units characterized by stable location, characters, and temporal continuity.
- Camera cuts: abrupt changes in camera angle or viewpoint between consecutive shots.
- Audio descriptions: narrator’s voice-over conveying visual or emotional information.
- Dialogues: spoken exchanges and monologues forming the linguistic backbone of the movie.
- Soundtracks: musical scores and songs accompanying or enhancing visual flow.
- Subtitles: on-screen text transcribing spoken narrative or describing environmental sounds.
- Text in frame: any written element embedded in the visual scene (e.g., signs, letters).
The annotation process spanned nearly 200 hours of expert manual labeling, ensuring high temporal precision (1 s) and semantic consistency across modalities. This multi-layered framework bridges low-level perceptual features and higher-order narrative constructs.
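To make the annotation scheme concrete, below is a minimal sketch of how per-second category labels could be loaded into a regressor matrix aligned to the fMRI sampling grid. The file name, column names, and the 2 s TR are illustrative assumptions, not the dataset's actual file layout.

```python
# Minimal sketch: load per-second categorical annotations into a binary
# design matrix and downsample it to the fMRI TR. File path and column
# names are hypothetical; check the released annotation files for the
# actual format.
import numpy as np
import pandas as pd

annotations = pd.read_csv("annotations/visual_categories.tsv", sep="\t")  # hypothetical path

categories = ["animals", "body_parts", "faces", "houses",
              "landscape", "objects", "person", "vehicles"]  # assumed column names
design = annotations[categories].to_numpy(dtype=float)       # shape: (n_seconds, n_categories)

# Average adjacent 1 s annotation bins to match an assumed 2 s TR.
tr = 2
n_trs = design.shape[0] // tr
design_tr = design[: n_trs * tr].reshape(n_trs, tr, -1).mean(axis=1)
print(design_tr.shape)  # (n_TRs, n_categories)
```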
The computational modeling framework describes how the visual and auditory movie stimuli were decomposed into low-level sensory, high-level representational, and semantic feature spaces. These models complement the manual annotations by providing an automated and hierarchical description of the sensory and cognitive dimensions of the movie.
Low-level Visual Model — Motion Energy
Motion energy features were derived from space-time Gabor filters spanning multiple orientations, spatial frequencies, and temporal frequencies (0, 2, and 4 Hz). Each two-second movie segment was characterized by 4,715 descriptors, capturing fine-grained motion energy across directions in the frames. This model mimics early visual processing in cortical areas such as V1 and MT, representing sensitivity to temporal frequency and motion direction. The MATLAB implementation used (Gallant Lab) follows the approach of Nishimoto et al. (2011).
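As an illustration of the motion-energy idea, the sketch below builds one space-time Gabor quadrature pair and computes its energy for a short grayscale clip; a full model tiles many orientations and spatial and temporal frequencies. Filter size, frame rate, and frequencies here are assumptions for illustration, and the dataset itself relies on the Gallant Lab MATLAB implementation.

```python
# Minimal sketch of a single motion-energy descriptor from a space-time
# Gabor quadrature pair, in the spirit of Nishimoto et al. (2011).
import numpy as np

def spacetime_gabor(size=32, n_frames=16, sf=0.1, tf=2.0, ori=0.0, fps=8.0, phase=0.0):
    """One space-time Gabor: sf in cycles/pixel, tf in Hz, ori in radians."""
    y, x = np.meshgrid(np.arange(size) - size / 2,
                       np.arange(size) - size / 2, indexing="ij")
    t = (np.arange(n_frames) - n_frames / 2) / fps
    xr = x * np.cos(ori) + y * np.sin(ori)
    envelope = np.exp(-(x**2 + y**2) / (2 * (size / 6) ** 2))
    carrier = np.cos(2 * np.pi * (sf * xr[None] - tf * t[:, None, None]) + phase)
    return envelope[None] * carrier  # shape: (n_frames, size, size)

def motion_energy(clip, **kw):
    """Quadrature energy of one filter pair for a grayscale clip (frames, H, W)."""
    even = np.sum(clip * spacetime_gabor(phase=0.0, **kw))
    odd = np.sum(clip * spacetime_gabor(phase=np.pi / 2, **kw))
    return even**2 + odd**2

# Toy 2-second clip at an assumed 8 fps, 32x32 pixels.
clip = np.random.rand(16, 32, 32)
print(motion_energy(clip, tf=2.0, ori=np.pi / 4))
```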
Low-level Auditory Model — Power Spectrum
The low-level auditory model was based on the power spectral density of the sound waveform, computed via Welch’s method using Gaussian windows. The resulting 449-dimensional representation describes signal power across frequencies up to ~15 kHz, capturing the spectral energy distribution and envelope dynamics over 2-second intervals.
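A hedged sketch of how such a spectral descriptor could be computed with Welch's method and a Gaussian window follows; the sampling rate, window length, and frequency cutoff are placeholders rather than the exact parameters behind the released 449-dimensional features.

```python
# Minimal sketch: Welch power spectral density of one 2 s audio chunk
# using a Gaussian window, as described above. Parameters are assumptions.
import numpy as np
from scipy.signal import welch, get_window

fs = 44100                          # assumed audio sampling rate
chunk = np.random.randn(2 * fs)     # stand-in for a 2 s mono waveform

nperseg = 4096
win = get_window(("gaussian", nperseg / 8), nperseg)  # Gaussian window (std is a placeholder)
freqs, psd = welch(chunk, fs=fs, window=win, nperseg=nperseg)

# Keep spectral power up to ~15 kHz, mirroring the model description.
mask = freqs <= 15000
print(freqs[mask].shape, psd[mask].shape)
```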
High-level Visual Model — VGG-19 Feature Space
The VGG-19 convolutional neural network was used to extract hierarchical representations of the visual stream. Intermediate layer outputs (ReLU3_1) captured low/mid-level statistics similar to early visual cortex, while deeper layers (ReLU6) encoded object- and scene-level semantics, supporting complex visual recognition processes. These features provide a bridge between visual input and neural activity in higher-order visual areas.
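The sketch below shows one way to pull intermediate VGG-19 activations from a single movie frame using forward hooks in torchvision. The layer indices for ReLU3_1 and for the first classifier ReLU are assumptions based on torchvision's layer ordering, and the frame file name is hypothetical.

```python
# Minimal sketch: extract intermediate VGG-19 activations with forward hooks.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

features = {}
def hook(name):
    def _hook(module, inputs, output):
        features[name] = output.detach().flatten(start_dim=1)
    return _hook

model.features[11].register_forward_hook(hook("relu3_1"))   # assumed index of ReLU3_1
model.classifier[1].register_forward_hook(hook("relu6"))    # ReLU after the first FC layer

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame_0001.png").convert("RGB")  # hypothetical frame file
with torch.no_grad():
    model(preprocess(frame).unsqueeze(0))

print({k: v.shape for k, v in features.items()})
```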
High-level Auditory Model — VGGish Feature Space
The VGGish network — a VGG-like architecture trained on the AudioSet dataset — was employed to extract complex auditory representations. Features from layer ReLU5.1 describe higher-order auditory content including harmonic patterns, rhythm, and semantic aspects such as speech and music. This allows for a robust mapping between sound characteristics and cortical auditory responses.
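For orientation, a minimal sketch of obtaining VGGish representations via a community PyTorch port is shown below. The torch.hub repository name and its file-based interface are assumptions, and the call returns the final 128-dimensional embeddings; intermediate activations such as ReLU5.1 could be collected with forward hooks, following the same pattern as the VGG-19 example above.

```python
# Minimal sketch: VGGish embeddings for an audio chunk via torch.hub
# (community port; repository name is an assumption).
import torch

model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

with torch.no_grad():
    embeddings = model.forward("movie_audio_chunk.wav")  # hypothetical 2 s audio file
print(embeddings.shape)  # one 128-d embedding per ~1 s of audio
```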
Compositional Semantic Features — GPT-4 Embeddings
To capture narrative meaning, the full English script of the movie was segmented at the sentence level and embedded with OpenAI's text-embedding-3-small model (1536-dimensional output). These contextual embeddings encode rich semantic relationships, encompassing syntax, pragmatics, and thematic continuity throughout the narrative. This model enables the exploration of brain activity associated with conceptual and linguistic comprehension beyond sensory modality.
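A minimal sketch of embedding one script sentence with the OpenAI embeddings API is shown below; the example sentence is invented, not a line from the actual script, and the call requires the openai package and an OPENAI_API_KEY in the environment.

```python
# Minimal sketch: sentence-level embedding with text-embedding-3-small.
from openai import OpenAI

client = OpenAI()
sentence = "Pongo watches the street from the window."  # example line, not the actual script

response = client.embeddings.create(model="text-embedding-3-small", input=[sentence])
vector = response.data[0].embedding
print(len(vector))  # 1536
```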
Defaced structural images, as well as raw and preprocessed fMRI data, are organized according to the BIDS standard and available on Figshare. The code to preprocess the (f)MRI data is publicly available in the repository under the code/ subdirectory; it includes bash scripts for preprocessing anatomical and functional data with the ANTs, AFNI, and FSL software packages. Use the link below to open the repository (placeholder). Additionally, the code for the ISC analysis is available in an OSF repository, linked below, to support reproducibility.
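As a generic illustration of inter-subject correlation (ISC), a leave-one-out sketch on parcel-averaged time series follows. This is not the code released on OSF; array shapes and the toy data are assumptions.

```python
# Minimal sketch of leave-one-out inter-subject correlation (ISC).
import numpy as np

def leave_one_out_isc(data):
    """data: (n_subjects, n_timepoints, n_regions) z-scored BOLD time series."""
    n_subjects, _, n_regions = data.shape
    isc = np.zeros((n_subjects, n_regions))
    for s in range(n_subjects):
        left_out = data[s]
        group_mean = data[np.arange(n_subjects) != s].mean(axis=0)
        # Pearson correlation per region between the left-out subject
        # and the average of the remaining subjects.
        for r in range(n_regions):
            isc[s, r] = np.corrcoef(left_out[:, r], group_mean[:, r])[0, 1]
    return isc

bold = np.random.randn(10, 300, 50)              # toy data: 10 subjects, 300 TRs, 50 regions
print(leave_one_out_isc(bold).mean(axis=0)[:5])  # group-level ISC for the first 5 regions
```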
Setti, F., Bottari, D., Leo, A., Diano, M., Bruno, V., Tinti, C., Cecchetti, L., Garbarini, F., Pietrini, P., Ricciardi, E., & Handjaras, G. 101 Dalmatians: a multimodal naturalistic fMRI dataset in typical development and congenital sensory loss. Sci Data 12, 1792 (2025).
Setti, F., Bottari, D., Leo, A., Diano, M., Bruno, V., Tinti, C., Cecchetti, L., Garbarini, F., Pietrini, P., Ricciardi, E., & Handjaras, G. 101 Dalmatians: a multimodal naturalistic fMRI dataset in typical development and congenital sensory loss. figshare (2025).
Marras, L., Teresi, L., Simonelli, Setti, F., Ingenito, A., Handjaras, G., & Ricciardi, E. Neural representation of action features across sensory modalities: A multimodal fMRI study. NeuroImage, 121439 (2025).
Orsenigo, D., Setti, F., Pagani, M., Petri, G., Tamietto, M., Luppi, A., & Ricciardi, E. Beyond reorganization: Intrinsic cortical hierarchies constrain experience-dependent plasticity in sensory-deprived humans. bioRxiv (2025).
Setti, F., Handjaras, G., Bottari, D., Leo, A., Diano, M., Bruno, V., … & Ricciardi, E. A modality-independent proto-organization of human multisensory areas. Nature Human Behaviour, 7(3), 397–410 (2023).
Lettieri, G., Handjaras, G., Cappello, E. M., Setti, F., Bottari, D., Bruno, V., … & Cecchetti, L. Dissecting abstract, modality-specific and experience-dependent coding of affect in the human brain. Science Advances, 10(10), eadk6840 (2024).
Lettieri, G., Handjaras, G., Setti, F., Cappello, E. M., Bruno, V., Diano, M., … & Cecchetti, L. Default and control network connectivity dynamics track the stream of affect at multiple timescales. Social Cognitive and Affective Neuroscience, 17(5), 461–469 (2022).