CLARIS: Clear and Intelligible Speech from Whispered and Dysarthric Voices

Submission for CHI 2026 (Paper ID: 5919)

Authors: Anonymous

Note: Publicly available checkpoints for the baselines (FreeVC, QuickVC, WESPER, and DistillW2N) were used.

Abstract:

Whispered and disordered speech, such as dysarthric speech, often lacks intelligibility, making voice interaction difficult in both everyday and clinical contexts. We present CLARIS, a lightweight autoregressive system that converts atypical input into natural-sounding speech. Unlike prior approaches that rely on paired data or handcrafted pseudo-whispers, CLARIS combines a TTS-based augmentation pipeline, adversarial alignment between synthetic and real speech, and multi-task linguistic supervision. Across benchmarks, it achieves 12.04% WER on unseen English whisper speakers, adapts to new accents with only 30 minutes of calibration, and restores intelligibility for dysarthric voices where existing models fail. We further show generalization to a language linguistically distant from English with only 7 hours of data. Listener studies confirm gains in naturalness, prosody, and perceived normalness. By enabling lightweight personalization, CLARIS points toward inclusive, private, and socially mindful voice technologies for diverse users.
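All WER figures above are word error rate: the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the number of reference words. A minimal scoring sketch, assuming the jiwer package (our choice for illustration; the paper's actual scoring tooling is not specified here):

```python
# Minimal WER computation sketch. The jiwer package and the toy
# transcripts are illustrative assumptions, not the paper's setup.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jump over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in reference
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # two substitutions over nine words -> 22.22%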

CLARIS Teaser Diagram

CLARIS restores intelligibility for atypical voices across languages and disorders. (A) For whispered Hindi speech, where formant energies are absent in the spectrogram, CLARIS reconstructs the missing cues so that listeners hear natural, intelligible speech through a mobile API. (B) For high-severity dysarthria, where articulation is severely degraded and words are nearly unintelligible, the same pipeline produces speech that can be followed in everyday communication. Together, these scenarios illustrate how CLARIS supports accessible and inclusive conversations across diverse speaking conditions.

Proposed Method

High-level overview of the CLARIS speech restoration framework. Atypical speech input, whispered or dysarthric, is collected from the user. During training, limited paired audio–text data is expanded through a TTS-based augmentation pipeline to generate synthetic inputs. Both real and synthetic audio are encoded by the AS2UT encoder; the embeddings are supervised by character decoders and aligned by the Real–Synthetic Alignment Discriminator (RSAD) via a gradient reversal layer (GRL). The unit prediction decoder, aided by a CTC decoder, then generates speech units, which the unit-to-speech renderer converts into natural speech. The red dotted path illustrates inference: user audio passes through the encoder, decoder, and renderer to restore intelligible speech.
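To make the training objectives concrete, below is a minimal PyTorch sketch of the pieces named in the caption: character-level supervision on the encoder embeddings, real–synthetic adversarial alignment through a gradient reversal layer feeding the RSAD, and CTC supervision for unit prediction. All module choices, sizes, vocabulary counts, and the equal loss weighting are illustrative assumptions, not the paper's configuration.

```python
# Minimal PyTorch sketch of CLARIS-style training losses. The GRU encoder,
# linear heads, vocabulary sizes, and loss weights are stand-ins for the
# AS2UT encoder and decoders; they are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)


class RSAD(nn.Module):
    """Real-Synthetic Alignment Discriminator: real (1) vs. TTS-synthetic (0)."""

    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, h):  # h: (batch, time, dim) encoder embeddings
        return self.net(h.mean(dim=1)).squeeze(-1)  # utterance-level logit


encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)  # log-mel in
char_head = nn.Linear(256, 30)   # character decoder (letters + specials)
unit_head = nn.Linear(256, 101)  # speech-unit inventory + CTC blank at index 0
rsad = RSAD(dim=256)

mels = torch.randn(4, 120, 80)            # batch of real + synthetic utterances
is_real = torch.tensor([1., 1., 0., 0.])  # domain labels for the RSAD

h, _ = encoder(mels)  # (4, 120, 256) frame embeddings

# 1) Adversarial alignment: the GRL makes the encoder confuse the RSAD.
adv_loss = F.binary_cross_entropy_with_logits(rsad(grad_reverse(h)), is_real)

# 2) Character supervision on the embeddings (CTC here is an assumption).
char_logp = char_head(h).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
char_tgt = torch.randint(1, 30, (4, 20))
char_loss = F.ctc_loss(char_logp, char_tgt,
                       torch.full((4,), 120), torch.full((4,), 20))

# 3) Unit prediction supervised by a CTC decoder over discrete speech units.
unit_logp = unit_head(h).log_softmax(-1).transpose(0, 1)
unit_tgt = torch.randint(1, 101, (4, 40))
unit_loss = F.ctc_loss(unit_logp, unit_tgt,
                       torch.full((4,), 120), torch.full((4,), 40))

loss = unit_loss + char_loss + adv_loss  # equal weighting is an assumption
loss.backward()
```

The gradient reversal layer is what produces the alignment effect the caption describes: minimizing the discriminator loss trains the RSAD to separate real from synthetic embeddings, while the flipped gradients push the encoder to make the two domains indistinguishable.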

R1: CLARIS trained on 44 wTIMIT speakers and evaluated on a held-out 5% split

R2: CLARIS evaluated on 4 unseen wTIMIT speakers

R3: CLARIS evaluated on an unseen Indian-accented English speaker; CLARIS (Finetuned) was adapted on 30 minutes of whisper data

R4: CLARIS trained on 6 Hindi speakers and evaluated on a held-out 5% split

R5: CLARIS trained on 8 clinically challenging dysarthric speakers and evaluated on a held-out 5% split

The following samples are from speakers in the TORGO corpus.

R6: Failure cases of CLARIS on dysarthric speakers