SpeechEQ

¹New York University, ²NVIDIA, ³Carnegie Mellon University, ⁴NYU Shanghai
Teaser figure: We synthesize spoken dialogues whose Speaker 2 utterances share identical text but differ in prosody, then evaluate whether Speech-Language Models can pick the socially appropriate rendition across fifteen EQ-i 2.0 subscales.

Overview

The challenge. As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues — tone, prosody, timing — has become a critical bottleneck for natural human–AI communication. Existing evaluations of machine emotional intelligence assess reasoning either through isolated text or passive acoustic perception, overlooking the cross-modal reasoning required for active, multi-turn dialogue.

SpeechEQ. We introduce a comprehensive framework for evaluating the sociolinguistic reasoning of Speech-Language Models (SLMs). The benchmark consists of a validated dataset of 2,265 dialogues spanning fifteen EQ-i 2.0 subscales (five composite areas: Self-Perception, Self-Expression, Interpersonal, Decision Making, and Stress Management), paired with a multi-turn evaluation protocol scored by our proposed Spoken EQ (SEQ), inspired by human EQ assessments.

Task. The model hears the dialogue and, at each of Speaker 2's two EQ-test turns (Sentence 4 and Sentence 6), is presented with two audio renditions of the same line. The spoken text is identical; only pitch, energy, tempo, and pauses differ. One rendition is the socially appropriate response to the catalyst's emotional state; the other is not. The model must select the appropriate rendition and justify the choice in a fixed JSON contract with four fields: acoustic profile, situational demand, reasoning, and selected.
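As an illustration of the JSON contract described above, here is a minimal validation sketch. The key names (`acoustic_profile`, `situational_demand`, `reasoning`, `selected`) and the rendition labels `"A"` and `"B"` are assumptions made for this example; the page names the four fields but does not give the exact schema.

```python
import json

# Hypothetical key names: the text lists the four fields only as
# "acoustic profile, situational demand, reasoning, selected".
REQUIRED_KEYS = ("acoustic_profile", "situational_demand", "reasoning", "selected")
# Assumed labels for the two renditions; the real labels are not given.
VALID_CHOICES = ("A", "B")


def parse_response(raw: str) -> dict:
    """Parse a model reply and check it against the assumed contract."""
    reply = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in reply]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if reply["selected"] not in VALID_CHOICES:
        raise ValueError(f"selected must be one of {VALID_CHOICES}")
    return reply


# Made-up reply, for illustration only.
raw_reply = json.dumps({
    "acoustic_profile": "A: slower tempo, softer energy; B: fast, flat pitch",
    "situational_demand": "Speaker 1 sounds distressed and needs reassurance",
    "reasoning": "A's softer delivery matches the need for reassurance",
    "selected": "A",
})
parsed = parse_response(raw_reply)
print(parsed["selected"])
```

Rejecting malformed replies up front keeps scoring unambiguous: a reply that fails the contract can be counted as incorrect or retried, depending on the protocol.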

Findings. End-to-end Speech-Language Models outperform cascaded pipelines that pair speech emotion recognition (SER) with an LLM. Current multimodal models nonetheless remain bottlenecked by three recurring failure modes, which SpeechEQ surfaces as concrete, subscale-level deficits: a text-reliant modality shortcut, an alignment-induced safety trap, and contextual amnesia across multi-turn dialogue. These are the barriers between today's voice models and truly emotionally aware AI.

Examples

One scenario per EQ-i 2.0 subscale (15 total). Each plays as a full six-sentence conversation; the Speaker 2 EQ-test sentences (S4 and S6) ship as a high-EQ and a low-EQ rendition with identical text — only prosody differs.

Metric

SpeechEQ reports two complementary numbers per (model, subscale) pair: pooled accuracy on the S4 + S6 forced-choice judgments, and a robust Spoken Emotional Quotient (SEQ) that places models on a shared scale centered at 100.

Accuracy

Per-subscale accuracy is the fraction of correct selections pooled across both EQ-test turns (S4 and S6) of every item assigned to that subscale. Overall is the run-level accuracy across all items, irrespective of subscale.
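The pooling described above can be sketched as follows. The record layout `(subscale, s4_correct, s6_correct)` and the toy values are assumptions made for this example, not the benchmark's actual data format.

```python
from collections import defaultdict


def pooled_accuracy(items):
    """Return (per-subscale accuracy, overall accuracy).

    items: iterable of (subscale, s4_correct, s6_correct) tuples,
    a hypothetical record layout with one tuple per dialogue.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subscale, s4_correct, s6_correct in items:
        # Each item contributes two forced-choice judgments (S4 and S6).
        correct[subscale] += int(s4_correct) + int(s6_correct)
        total[subscale] += 2
    per_subscale = {name: correct[name] / total[name] for name in total}
    # Overall accuracy pools every judgment, irrespective of subscale.
    overall = sum(correct.values()) / sum(total.values())
    return per_subscale, overall


# Made-up results for two subscales, for illustration only.
toy_items = [
    ("Empathy", True, True),
    ("Empathy", True, False),
    ("Assertiveness", False, False),
    ("Assertiveness", True, False),
]
per_subscale, overall = pooled_accuracy(toy_items)
print(per_subscale, overall)
```

Note that overall accuracy is computed from the pooled counts rather than as a mean of the per-subscale values, so subscales with more items carry more weight.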

Spoken Emotional Quotient (SEQ)

To make scores comparable across very different model families, we apply a robust normalization. Per subscale, let \(x\) denote each model's sample-level joint accuracy — the probability that the model selects correctly at both S4 and S6 for the same item:

\[ \boxed{\;\; \mathrm{SEQ} \;=\; 100 \;+\; 15 \cdot \underbrace{\frac{x \,-\, \operatorname{median}(x)}{1.4826 \cdot \operatorname{MAD}(x)}}_{\text{robust } z\text{-score}} \;\;} \]
\(x\): per-item joint accuracy, \(\Pr(\text{S4 correct} \land \text{S6 correct})\), averaged across items in the subscale.
\(\operatorname{median}(x),\ \operatorname{MAD}(x)\): robust center and spread, computed across the model pool for that subscale.
\(1.4826\): consistency constant that makes \(1.4826 \cdot \operatorname{MAD}\) a robust estimator of \(\sigma\) for Gaussian data, analogous to the IQ scale.

SEQ is centered at 100 by construction; ±15 corresponds to one robust standard deviation in the pool. Edge case: if \(1.4826 \cdot \operatorname{MAD}(x) = 0\), SEQ defaults to 100.
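A minimal sketch of the SEQ computation for a single subscale, following the formula and edge case stated above. The function names, the input layout, and the toy accuracies are assumptions made for this example.

```python
from statistics import median


def joint_accuracy(items):
    """Fraction of items answered correctly at both S4 and S6.

    items: list of (s4_correct, s6_correct) pairs (hypothetical layout).
    """
    return sum(1 for s4, s6 in items if s4 and s6) / len(items)


def seq_scores(joint_acc):
    """Map {model: joint accuracy} for one subscale to {model: SEQ}."""
    values = list(joint_acc.values())
    center = median(values)
    # Median absolute deviation across the model pool.
    mad = median([abs(v - center) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # Edge case from the text: no spread in the pool, so SEQ is 100.
        return {model: 100.0 for model in joint_acc}
    return {
        model: 100.0 + 15.0 * (v - center) / scale
        for model, v in joint_acc.items()
    }


# Made-up joint accuracies for a five-model pool, for illustration only.
pool = {"model_a": 0.2, "model_b": 0.4, "model_c": 0.5,
        "model_d": 0.6, "model_e": 0.9}
scores = seq_scores(pool)
for model, score in scores.items():
    print(f"{model}: {score:.1f}")
```

Because the median and MAD are taken over the model pool, a model's SEQ depends on which other models are included: adding or removing models shifts the scale for everyone.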

Leaderboard

Sorted by SEQ (Spoken Emotional Quotient, robust normalization centered at 100). Overall and the five EQ-i 2.0 main-scale columns are accuracy (%) on pooled S4+S6 judgments (each main-scale value averages its three subscales).

Rank | Model | Type | SEQ | Overall (%)
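The main-scale columns and the ordering described above can be sketched as follows. The subscale names follow the published EQ-i 2.0 model (this page does not list them), and the row layout, model names, and numbers are made up for illustration. How a model's leaderboard SEQ is aggregated from its per-subscale values is not stated here, so the sketch takes it as given.

```python
# EQ-i 2.0 hierarchy: five composites with three subscales each.
COMPOSITES = {
    "Self-Perception": ["Self-Regard", "Self-Actualization",
                        "Emotional Self-Awareness"],
    "Self-Expression": ["Emotional Expression", "Assertiveness",
                        "Independence"],
    "Interpersonal": ["Interpersonal Relationships", "Empathy",
                      "Social Responsibility"],
    "Decision Making": ["Problem Solving", "Reality Testing",
                        "Impulse Control"],
    "Stress Management": ["Flexibility", "Stress Tolerance", "Optimism"],
}


def composite_accuracy(subscale_acc):
    """Average each composite's three subscale accuracies (in %)."""
    return {
        composite: sum(subscale_acc[s] for s in subscales) / len(subscales)
        for composite, subscales in COMPOSITES.items()
    }


def rank_models(rows):
    """Sort leaderboard rows by SEQ, highest first, and attach a rank."""
    ordered = sorted(rows, key=lambda row: row["seq"], reverse=True)
    return [dict(row, rank=i) for i, row in enumerate(ordered, start=1)]


# Made-up numbers, for illustration only.
toy_acc = {s: 50.0 for subs in COMPOSITES.values() for s in subs}
toy_acc.update({"Self-Regard": 60.0, "Self-Actualization": 70.0,
                "Emotional Self-Awareness": 80.0})
main_scales = composite_accuracy(toy_acc)

toy_rows = [{"model": "model_a", "seq": 95.5},
            {"model": "model_b", "seq": 112.0},
            {"model": "model_c", "seq": 100.0}]
ranked = rank_models(toy_rows)
print(main_scales["Self-Perception"], [row["model"] for row in ranked])
```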

Performance visualization

Fifteen-subscale radar over the EQ-i 2.0 hierarchy (five composites, three subscales each, clockwise). Radial axis is per-subscale SEQ (median = 100, MAD-rescaled). Values are clipped to the [40, 160] window for readability; tooltips show the unclipped SEQ.

BibTeX