DeepASMR: LLM-Based Zero-Shot ASMR Generation
for Anyone of Any Voice


Abstract

While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR)—a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.

System Overview

Figure 1: System Architecture Overview

Samples


For your listening experience, we have already normalized the volume levels of all the audio files.
We recommend that you wear headphones for better results.
The prompt audio clips are taken from the demo pages of IndexTTS2 and MaskGCT.
Transcription | Prompts | Ours Normal | Ours ASMR | CosyVoice2 + CosyVoice VC | CosyVoice2 + Seed VC
The prompt audio clips were retrieved from the ElevenLabs and MiniMax voice libraries by searching for the keyword "ASMR".
Transcription | Prompts | Ours with vocal-fold vibration | Ours without vocal-fold vibration | ElevenLabs v3 (alpha) | MiniMax Speech-02-hd | CosyVoice2 | F5TTS