SoundNarratives: Rich Auditory Scene Descriptions to Support Deaf and Hard of Hearing People
Abstract
Sound recognition enhances safety, social interaction, and situational awareness for deaf and hard of hearing (DHH) individuals. However, existing sound recognition technologies primarily classify sounds into predefined categories (e.g., door opening, speech), which fail to capture the full complexity of real-world auditory scenes (e.g., temporal variations, sound transitions, overlapping sound layers). In this work, we introduce SoundNarratives, a real-time system that generates rich, contextual auditory scene descriptions tailored to DHH users. We began by conducting a formative study with 10 DHH participants to identify nine key auditory scene parameters (e.g., sound class, loudness, emotion, semantic description), and used these insights to guide prompt engineering with a state-of-the-art audio language model. A user study with 10 DHH participants demonstrated a significant preference for SoundNarratives over a baseline model, along with the potential for improved confidence and situational awareness.
Nine Sound Parameters
SoundNarratives describes each auditory scene through nine key perceptual parameters, combining acoustic and semantic cues; a brief data sketch follows the list below.
Sound Class
Categorizes the type of sound, such as speech, music, or environmental noise.
Loudness
Represents the perceived intensity of the sound.
Speaker Dynamics
Captures how speakers vary in their manner of speaking and the intentions behind their calls.
Spatial Dynamics
Describes movement and distance of sounds within the environment.
Emotion
Reflects the affective tone of the sounds, such as happy, angry, or sad.
Pace
Measures the speed or tempo of sound events over time.
Prominence
Indicates which sounds stand out or attract attention in the scene.
Pattern
Captures recurring sequences or temporal structures in sound events.
Semantic Descriptions
Provides human-understandable context and meaning for the sounds.
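To make the parameter set concrete, the sketch below shows one way the nine parameters could be grouped into a single record. The field names, types, and example values are illustrative assumptions, not the paper's actual schema.

```python
# A minimal sketch of the nine auditory scene parameters as one record.
# Field names and types are illustrative assumptions, not the authors' schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditorySceneDescription:
    sound_class: str                  # e.g., "speech", "music", "environmental noise"
    loudness: str                     # perceived intensity, e.g., "soft", "loud"
    speaker_dynamics: Optional[str]   # manner of speaking and speaker intent, if any speech
    spatial_dynamics: str             # movement and distance of sound sources
    emotion: str                      # affective tone, e.g., "happy", "angry", "sad"
    pace: str                         # speed or tempo of sound events over time
    prominence: str                   # which sounds stand out or attract attention
    pattern: str                      # recurring sequences or temporal structure
    semantic_description: str         # human-readable context and meaning of the scene
```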
System Overview
SoundNarratives processes each auditory scene with AudioFlamingo to derive nine key sound parameters, which are then summarized by GPT-4 into a concise, human-readable description.
Example output from SoundNarratives: "A crow caws loudly and repeatedly, with four caws at irregular intervals. A man speaks briefly."
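The sketch below illustrates the two-stage pipeline described above: an audio language model is prompted for each of the nine parameters, and GPT-4 condenses the results into one short narrative. The `query_audio_flamingo` helper is a hypothetical placeholder for an audio language model call, and the prompts are illustrative, not the authors' implementation.

```python
# A minimal sketch of the two-stage pipeline, assuming a hypothetical
# query_audio_flamingo helper stands in for the audio language model and
# GPT-4 (via the OpenAI Python client) performs the summarization.
from openai import OpenAI

PARAMETERS = [
    "sound class", "loudness", "speaker dynamics", "spatial dynamics",
    "emotion", "pace", "prominence", "pattern", "semantic description",
]


def query_audio_flamingo(audio_path: str, prompt: str) -> str:
    """Placeholder for a call to an audio language model such as Audio Flamingo."""
    raise NotImplementedError("Hook up an audio language model here.")


def describe_scene(audio_path: str) -> str:
    # Stage 1: prompt the audio language model once per parameter.
    parameter_values = {
        p: query_audio_flamingo(audio_path, f"Describe the {p} of this audio clip.")
        for p in PARAMETERS
    }

    # Stage 2: ask GPT-4 to condense the parameters into one concise description.
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize these auditory scene parameters into one concise, "
                    "human-readable description for a deaf or hard of hearing reader."
                ),
            },
            {"role": "user", "content": str(parameter_values)},
        ],
    )
    return response.choices[0].message.content
```

In practice, the per-parameter prompts and the summarization instruction would be tuned through the kind of prompt engineering the paper describes; this sketch only shows the overall data flow.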
Poster at CHI 2025: GAI and A11y Workshop
BibTeX
@inproceedings{10.1145/3663547.3746341,
author = {Wu, Liang-Yuan and Jain, Dhruv},
title = {SoundNarratives: Rich Auditory Scene Descriptions to Support Deaf and Hard of Hearing People},
year = {2025},
isbn = {9798400706769},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3663547.3746341},
doi = {10.1145/3663547.3746341},
abstract = {Sound recognition enhances safety, social interaction, and situational awareness for deaf and hard of hearing (DHH) individuals. However, existing sound recognition technologies primarily classify sounds into predefined categories (e.g., door opening, speech), which fail to capture the full complexity of real-world auditory scenes (e.g., temporal variations, sound transitions, overlapping sound layers). In this work, we introduce SoundNarratives, a real-time system that generates rich, contextual auditory scene descriptions tailored to DHH users. We began with conducting a formative study with 10 DHH participants to identify nine key auditory scene parameters (e.g., sound class, loudness, emotion, semantic description), and used these insights to guide prompt engineering with a state-of-the-art audio language model. A user study with 10 DHH participants demonstrated a significant preference for SoundNarratives over a baseline model, along with a potential for improved confidence and situational awareness.},
booktitle = {Proceedings of the 27th International ACM SIGACCESS Conference on Computers and Accessibility},
articleno = {68},
numpages = {15},
keywords = {Accessibility, human-AI interaction, sound awareness, deaf and hard of hearing, generative AI, prompt engineering, auditory scene analysis},
series = {ASSETS '25}
}