CogCM:
Cognition-Inspired Contextual Modeling
for Audio-Visual Speech Enhancement

Feixiang Wang1,2, Shuang Yang1,2, Shiguang Shan1,2, Xilin Chen1,2
1State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
ICCV 2025

Abstract

Audio-Visual Speech Enhancement (AVSE) leverages both audio and visual information to improve speech quality. Even in noisy real-world conditions, humans are generally able to perceive and interpret corrupted speech segments. Research in cognitive science has shown how the brain merges auditory and visual inputs to achieve this. These studies reveal four key insights, reflecting a hierarchical synergy of semantic- and signal-level processes in which visual cues enrich both levels: (1) Humans exploit high-level semantic context to reconstruct corrupted speech signals. (2) Visual cues correlate strongly with semantic information, so they can facilitate semantic context modeling. (3) Visual appearance and vocal information jointly benefit identification, implying that visual cues also strengthen low-level signal context modeling. (4) High-level semantic knowledge and low-level auditory processing operate concurrently, allowing semantics to guide signal-level context modeling. Motivated by these insights, we propose CogCM, a cognition-inspired hierarchical contextual modeling framework with three core modules: (1) a semantic context modeling module (SeCM) that captures high-level semantic context from the audio-visual modalities; (2) a signal context modeling module (SiCM) that models fine-grained temporal-spectral structures under multi-modal semantic context guidance; and (3) a semantic-to-signal guidance module (SSGM) that leverages semantic context to guide signal context modeling along both the temporal and frequency dimensions. Extensive experiments on 7 benchmarks demonstrate CogCM's superiority, achieving 63.6% SDR and 58.1% PESQ improvements at -15 dB SNR and outperforming state-of-the-art methods across all metrics.
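
To make the module layout described above concrete, below is a minimal structural sketch in PyTorch. Only the module names (SeCM, SiCM, SSGM) and their high-level roles come from the abstract; every internal choice here (feature dimensions, transformer encoders, FiLM-style scale-and-shift guidance, time-frequency masking, and helper layers such as spec_proj and mask_head) is an illustrative assumption, not the authors' implementation.

```python
# Hedged structural sketch of the CogCM pipeline described in the abstract.
# Module names (SeCM, SiCM, SSGM) come from the paper; all internals below
# (dimensions, transformer encoders, FiLM-style gating, T-F masking) are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class SeCM(nn.Module):
    """Semantic context modeling: fuse audio and visual features into a
    high-level semantic sequence (assumed: concat + transformer encoder)."""
    def __init__(self, dim=256, layers=4):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)

    def forward(self, audio_feat, visual_feat):        # each (B, T, D)
        x = self.fuse(torch.cat([audio_feat, visual_feat], dim=-1))
        return self.encoder(x)                          # semantic context (B, T, D)


class SSGM(nn.Module):
    """Semantic-to-signal guidance: modulate signal-level features with the
    semantic context along time and frequency (assumed: FiLM-style gates)."""
    def __init__(self, dim=256, freq_bins=257):
        super().__init__()
        self.time_gate = nn.Linear(dim, 2 * dim)        # per-frame scale & shift
        self.freq_gate = nn.Linear(dim, 2 * freq_bins)  # per-bin scale & shift

    def forward(self, signal_feat, spec, semantic_ctx):
        # signal_feat: (B, T, D), spec: (B, T, F), semantic_ctx: (B, T, D)
        g_t, b_t = self.time_gate(semantic_ctx).chunk(2, dim=-1)
        guided_feat = signal_feat * torch.sigmoid(g_t) + b_t
        g_f, b_f = self.freq_gate(semantic_ctx).chunk(2, dim=-1)
        guided_spec = spec * torch.sigmoid(g_f) + b_f
        return guided_feat, guided_spec


class SiCM(nn.Module):
    """Signal context modeling: fine-grained temporal-spectral modeling that
    predicts an enhancement mask (assumed: transformer + linear mask head)."""
    def __init__(self, dim=256, freq_bins=257, layers=4):
        super().__init__()
        self.proj = nn.Linear(freq_bins, dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.mask_head = nn.Linear(dim, freq_bins)

    def forward(self, guided_feat, guided_spec):
        x = self.encoder(guided_feat + self.proj(guided_spec))
        return torch.sigmoid(self.mask_head(x))         # T-F mask (B, T, F)


class CogCMSketch(nn.Module):
    """Ties the three modules together: semantic context guides signal-level
    modeling, which outputs a mask applied to the noisy spectrogram."""
    def __init__(self, dim=256, freq_bins=257):
        super().__init__()
        self.secm = SeCM(dim)
        self.ssgm = SSGM(dim, freq_bins)
        self.sicm = SiCM(dim, freq_bins)
        self.spec_proj = nn.Linear(freq_bins, dim)

    def forward(self, noisy_spec, audio_feat, visual_feat):
        semantic_ctx = self.secm(audio_feat, visual_feat)
        signal_feat = self.spec_proj(noisy_spec)
        guided_feat, guided_spec = self.ssgm(signal_feat, noisy_spec, semantic_ctx)
        mask = self.sicm(guided_feat, guided_spec)
        return mask * noisy_spec                        # enhanced magnitude spectrogram
```

Calling CogCMSketch()(noisy_spec, audio_feat, visual_feat) with tensors of shape (B, T, 257), (B, T, 256), and (B, T, 256) returns an enhanced magnitude spectrogram; the actual model's inputs, guidance mechanism, and training losses may differ from this sketch.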

DATASET: LRS3 + DNS

Sample 1: Noisy Video; spectrograms for Noisy Speech, Clean Speech, MuSE, VisualVoice, DualAVSE, and CogCM

Sample 2: Noisy Video; spectrograms for Noisy Speech, Clean Speech, MuSE, VisualVoice, DualAVSE, and CogCM

Sample 3: Noisy Video; spectrograms for Noisy Speech, Clean Speech, MuSE, VisualVoice, DualAVSE, and CogCM

DATASET: TCD-TIMIT + NTCD-TIMIT

Sample 1: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 2: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 3: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

DATASET: GRID + CHiME3

Sample 1: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 2: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 3: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 4: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM