CogCM:
Cognition-Inspired Contextual Modeling
for Audio-Visual Speech Enhancement

Feixiang Wang1,2, Shuang Yang1,2, Shiguang Shan1,2, Xilin Chen1,2
1State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
ICCV 2025

Abstract

Audio-Visual Speech Enhancement (AVSE) leverages both audio and visual information to improve speech quality. Even in noisy real-world conditions, humans are generally able to perceive and interpret corrupted speech segments. Research in cognitive science has shown how the brain merges auditory and visual inputs to achieve this. These studies reveal four key insights, reflecting a hierarchical synergy of semantic- and signal-level processes in which visual cues enrich both levels: (1) Humans exploit high-level semantic context to reconstruct corrupted speech signals. (2) Visual cues correlate strongly with semantic information, so they can facilitate semantic context modeling. (3) Visual appearance and vocal information jointly benefit identification, implying that visual cues also strengthen low-level signal context modeling. (4) High-level semantic knowledge and low-level auditory processing operate concurrently, allowing semantics to guide signal-level context modeling. Motivated by these insights, we propose CogCM, a cognition-inspired hierarchical contextual modeling framework with three core modules: (1) a semantic context modeling module (SeCM) that captures high-level semantic context from the audio-visual modalities; (2) a signal context modeling module (SiCM) that models fine-grained temporal-spectral structures under multi-modal semantic context guidance; and (3) a semantic-to-signal guidance module (SSGM) that leverages semantic context to guide signal context modeling along both the temporal and frequency dimensions. Extensive experiments on 7 benchmarks demonstrate CogCM's superiority, achieving 63.6% SDR and 58.1% PESQ improvements at -15 dB SNR and outperforming state-of-the-art methods across all metrics.
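
To make the module layout described above concrete, below is a minimal structural sketch in PyTorch. Only the module names (SeCM, SiCM, SSGM) and their high-level roles come from the abstract; every internal choice here (feature dimensions, transformer encoders, FiLM-style scale-and-shift guidance, time-frequency masking, and helper layers such as spec_proj and mask_head) is an illustrative assumption, not the authors' implementation.

```python
# Hedged structural sketch of the CogCM pipeline described in the abstract.
# Module names (SeCM, SiCM, SSGM) come from the paper; all internals below
# (dimensions, transformer encoders, FiLM-style gating, T-F masking) are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class SeCM(nn.Module):
    """Semantic context modeling: fuse audio and visual features into a
    high-level semantic sequence (assumed: concat + transformer encoder)."""
    def __init__(self, dim=256, layers=4):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)

    def forward(self, audio_feat, visual_feat):        # each (B, T, D)
        x = self.fuse(torch.cat([audio_feat, visual_feat], dim=-1))
        return self.encoder(x)                          # semantic context (B, T, D)


class SSGM(nn.Module):
    """Semantic-to-signal guidance: modulate signal-level features with the
    semantic context along time and frequency (assumed: FiLM-style gates)."""
    def __init__(self, dim=256, freq_bins=257):
        super().__init__()
        self.time_gate = nn.Linear(dim, 2 * dim)        # per-frame scale & shift
        self.freq_gate = nn.Linear(dim, 2 * freq_bins)  # per-bin scale & shift

    def forward(self, signal_feat, spec, semantic_ctx):
        # signal_feat: (B, T, D), spec: (B, T, F), semantic_ctx: (B, T, D)
        g_t, b_t = self.time_gate(semantic_ctx).chunk(2, dim=-1)
        guided_feat = signal_feat * torch.sigmoid(g_t) + b_t
        g_f, b_f = self.freq_gate(semantic_ctx).chunk(2, dim=-1)
        guided_spec = spec * torch.sigmoid(g_f) + b_f
        return guided_feat, guided_spec


class SiCM(nn.Module):
    """Signal context modeling: fine-grained temporal-spectral modeling that
    predicts an enhancement mask (assumed: transformer + linear mask head)."""
    def __init__(self, dim=256, freq_bins=257, layers=4):
        super().__init__()
        self.proj = nn.Linear(freq_bins, dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.mask_head = nn.Linear(dim, freq_bins)

    def forward(self, guided_feat, guided_spec):
        x = self.encoder(guided_feat + self.proj(guided_spec))
        return torch.sigmoid(self.mask_head(x))         # T-F mask (B, T, F)


class CogCMSketch(nn.Module):
    """Ties the three modules together: semantic context guides signal-level
    modeling, which outputs a mask applied to the noisy spectrogram."""
    def __init__(self, dim=256, freq_bins=257):
        super().__init__()
        self.secm = SeCM(dim)
        self.ssgm = SSGM(dim, freq_bins)
        self.sicm = SiCM(dim, freq_bins)
        self.spec_proj = nn.Linear(freq_bins, dim)

    def forward(self, noisy_spec, audio_feat, visual_feat):
        semantic_ctx = self.secm(audio_feat, visual_feat)
        signal_feat = self.spec_proj(noisy_spec)
        guided_feat, guided_spec = self.ssgm(signal_feat, noisy_spec, semantic_ctx)
        mask = self.sicm(guided_feat, guided_spec)
        return mask * noisy_spec                        # enhanced magnitude spectrogram
```

Calling CogCMSketch()(noisy_spec, audio_feat, visual_feat) with tensors of shape (B, T, 257), (B, T, 256), and (B, T, 256) returns an enhanced magnitude spectrogram; the actual model's inputs, guidance mechanism, and training losses may differ from this sketch.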

DATASET: LRS3 + DNS

Sample 1: Noisy Video; spectrograms for Noisy Speech, Clean Speech, MuSE, VisualVoice, DualAVSE, and CogCM

Sample 2: Noisy Video; spectrograms for Noisy Speech, Clean Speech, MuSE, VisualVoice, DualAVSE, and CogCM

Sample 3: Noisy Video; spectrograms for Noisy Speech, Clean Speech, MuSE, VisualVoice, DualAVSE, and CogCM

DATASET: TCD-TIMIT + NTCD-TIMIT

Sample 1: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 2: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 3: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

DATASET: GRID + CHiME3

Sample 1: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 2: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 3: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM

Sample 4: Noisy Video; spectrograms for Noisy Speech, Clean Speech, DualAVSE, and CogCM