SpeechEQ
Overview
The challenge. As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues — tone, prosody, timing — has become a critical bottleneck for natural human–AI communication. Existing evaluations of machine emotional intelligence assess reasoning either through isolated text or passive acoustic perception, overlooking the cross-modal reasoning required for active, multi-turn dialogue.
SpeechEQ. We introduce a comprehensive framework for evaluating the sociolinguistic reasoning of Speech-Language Models (SLMs). The benchmark consists of a validated dataset of 2,265 dialogues spanning fifteen EQ-i 2.0 subscales (five composite areas: Self-Perception, Self-Expression, Interpersonal, Decision Making, and Stress Management), paired with a multi-turn evaluation protocol scored by our proposed Spoken EQ (SEQ), inspired by human EQ assessments.
Task. The model hears the dialogue and, at each of Speaker 2's two EQ-test turns (Sentence 4 and Sentence 6), is presented with two audio renditions of the same line. The spoken text is identical; only pitch, energy, tempo, and pauses differ. One rendition is the socially appropriate response to the catalyst's emotional state; the other is not. The model must select the appropriate rendition and justify the choice in a fixed JSON contract (acoustic profile, situational demand, reasoning, selected).
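As a rough illustration of the JSON contract described above, the sketch below parses a model reply and checks that the four named fields are present. The exact key spellings (`acoustic_profile`, `situational_demand`, `reasoning`, `selected`) and the rendition labels `"A"` and `"B"` are assumptions made for this example; the source names the fields but does not give their serialized form.

```python
import json

# Assumed key names: the four fields are named in the task description,
# but their exact spelling in the real contract is not specified here.
REQUIRED_KEYS = ("acoustic_profile", "situational_demand", "reasoning", "selected")
# Assumed labels for the two audio renditions.
VALID_CHOICES = ("A", "B")


def parse_response(raw: str) -> dict:
    """Parse a model reply and check it against the assumed four-field contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["selected"] not in VALID_CHOICES:
        raise ValueError(f"'selected' must be one of {VALID_CHOICES}")
    return data


# Hypothetical reply, invented for illustration only.
example = json.dumps({
    "acoustic_profile": "A: slower tempo, softer energy; B: fast tempo, flat pitch",
    "situational_demand": "Speaker 1 sounds discouraged and needs reassurance",
    "reasoning": "The slower, softer rendition fits the need for reassurance",
    "selected": "A",
})
parsed = parse_response(example)
```

Rejecting malformed replies up front keeps the forced-choice scoring separate from formatting failures, which a harness may want to count on their own.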
Findings. End-to-end Speech-Language Models outperform cascaded SER + LLM systems, but current multimodal models remain bottlenecked by three recurring failure modes that SpeechEQ surfaces as concrete, subscale-level deficits: a text-reliant modality shortcut, an alignment-induced safety trap, and contextual amnesia across multi-turn dialogue — the barriers between today's voice models and truly emotionally-aware AI.
Examples
One scenario per EQ-i 2.0 subscale (15 total). Each plays as a full six-sentence conversation; the Speaker 2 EQ-test sentences (S4 and S6) ship as a high-EQ and a low-EQ rendition with identical text — only prosody differs.
Metric
SpeechEQ reports two complementary numbers per (model, subscale) pair: pooled accuracy on the S4 + S6 forced-choice judgments, and a robust Spoken Emotional Quotient (SEQ) that places models on a shared scale centered at 100.
Accuracy
Per-subscale accuracy is the fraction of correct selections pooled across both EQ-test turns (S4 and S6) of every item assigned to that subscale. Overall is the run-level accuracy across all items, irrespective of subscale.
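The pooling described above can be sketched as follows. This is a minimal example under assumptions: the record layout (`item`, `subscale`, `turn`, `correct`) and the sample judgments are hypothetical, and "overall" is read here as accuracy over all pooled S4 and S6 judgments in the run.

```python
from collections import defaultdict

# Hypothetical records: one per (item, EQ-test turn) forced-choice judgment.
judgments = [
    {"item": "d001", "subscale": "Empathy", "turn": "S4", "correct": True},
    {"item": "d001", "subscale": "Empathy", "turn": "S6", "correct": False},
    {"item": "d002", "subscale": "Empathy", "turn": "S4", "correct": True},
    {"item": "d002", "subscale": "Empathy", "turn": "S6", "correct": True},
    {"item": "d003", "subscale": "Optimism", "turn": "S4", "correct": False},
    {"item": "d003", "subscale": "Optimism", "turn": "S6", "correct": True},
]


def pooled_accuracy(records):
    """Return (per-subscale accuracy, overall accuracy) over pooled S4 + S6 judgments."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["subscale"]] += 1
        hits[rec["subscale"]] += int(rec["correct"])
    per_subscale = {name: hits[name] / totals[name] for name in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_subscale, overall


per_subscale, overall = pooled_accuracy(judgments)
print(per_subscale, overall)
```

With these sample records, Empathy has three correct judgments out of four and Optimism one out of two, so overall accuracy is four out of six.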
Spoken Emotional Quotient (SEQ)
To make scores comparable across very different model families, we apply a robust normalization. Per subscale, let \(x\) denote each model's sample-level joint accuracy, the probability that the model selects correctly at both S4 and S6 for the same item:
\[\mathrm{SEQ} = 100 + 15 \cdot \frac{x - \operatorname{median}(x)}{1.4826 \cdot \operatorname{MAD}(x)}\]
- \(x\): per-item joint accuracy, \(\Pr(\text{S4 correct} \land \text{S6 correct})\), averaged across items in the subscale.
- \(\operatorname{median}(x),\ \operatorname{MAD}(x)\): robust center and spread, computed across the model pool for that subscale.
- \(1.4826\): consistency constant that makes \(1.4826 \cdot \operatorname{MAD}\) a robust estimator of \(\sigma\) for Gaussian data.
SEQ is centered at 100 by construction; ±15 corresponds to one robust standard deviation in the pool, analogous to the IQ scale. Edge case: if \(1.4826 \cdot \operatorname{MAD}(x) = 0\), SEQ defaults to 100.
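A minimal sketch of this normalization for one subscale, assuming the IQ-style form implied by the description: 100 plus 15 times the robust z-score, with the robust standard deviation taken as \(1.4826 \cdot \operatorname{MAD}\). The function name `seq_scores` and the model names and joint accuracies in `pool` are hypothetical.

```python
from statistics import median


def seq_scores(joint_acc):
    """Map each model's joint accuracy on one subscale to an SEQ value.

    joint_acc: dict of model name -> joint accuracy (both S4 and S6 correct).
    Assumes SEQ = 100 + 15 * (x - median) / (1.4826 * MAD) over the model pool.
    """
    values = list(joint_acc.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # Edge case from the text: zero robust spread, so every model gets 100.
        return {model: 100.0 for model in joint_acc}
    return {model: 100.0 + 15.0 * (v - center) / scale for model, v in joint_acc.items()}


# Hypothetical pool of three models on a single subscale.
pool = {"model_a": 0.40, "model_b": 0.50, "model_c": 0.60}
scores = seq_scores(pool)
print(scores)
```

Because the median and MAD are computed over the model pool, a model's SEQ depends on which other models are included, so scores are only comparable within the same pool.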
Leaderboard
Sorted by SEQ (Spoken Emotional Quotient, robust normalization centered at 100). Overall and the five EQ-i 2.0 main-scale columns are accuracy (%) on pooled S4+S6 judgments (each main-scale value averages its three subscales).
| Rank | Model | Type | SEQ | Overall (%) |
|---|---|---|---|---|
Performance visualization
Fifteen-subscale radar over the EQ-i 2.0 hierarchy (five composites, three subscales each, clockwise). Radial axis is per-subscale SEQ (median = 100, MAD-rescaled). Values are clipped to the [40, 160] window for readability; tooltips show the unclipped SEQ.