AISHELL-6-Whisper
Open-source data to advance the development of artificial intelligence
AISHELL-6-Whisper Corpus
Open-source release: September 2025
Paper

Data download

Baseline system

License: CC BY-NC-SA 4.0
The AISHELL-6-Whisper Corpus was recorded in a quiet studio environment and contains approximately 29.8 hours of whispered speech and 29.5 hours of parallel normal speech, together with synchronously captured lip-movement videos.
The corpus comprises 167 speakers, each reading approximately 10 to 20 minutes of non-repeating poetry text. Of these, 121 participants were recorded with both a high-fidelity microphone and a synchronized RGB camera; the remaining 46 were recorded with audio only. Audio was captured with a single-channel high-fidelity microphone (Neumann U87) at 48 kHz, 16-bit, with a background noise level below 20 dB. The microphone was positioned below the speaker's chin to preserve sound quality without obscuring lip movement. Video was captured with an RGB camera placed one meter directly in front of the speaker, at a resolution of 1280×720 and a frame rate of 25 fps.
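The audio and video rates above imply a fixed number of audio samples per video frame, which is useful when pairing lip-movement frames with audio. The following is a minimal sketch of that arithmetic only. It assumes the audio and video streams start at the same instant with no offset, which the description does not state, and the function names are hypothetical rather than part of any corpus tooling.

```python
# Recording parameters taken from the corpus description.
SAMPLE_RATE_HZ = 48_000   # audio sampling rate
BIT_DEPTH = 16            # bits per audio sample
CHANNELS = 1              # single-channel microphone
VIDEO_FPS = 25            # video frame rate


def samples_per_video_frame(sample_rate_hz=SAMPLE_RATE_HZ, fps=VIDEO_FPS):
    """Number of audio samples that span one video frame."""
    if sample_rate_hz % fps != 0:
        raise ValueError("sample rate is not an integer multiple of the frame rate")
    return sample_rate_hz // fps


def frame_to_sample_range(frame_index, sample_rate_hz=SAMPLE_RATE_HZ, fps=VIDEO_FPS):
    """Half-open [start, end) audio sample range covered by a video frame.

    Assumption: both streams start at time zero with no offset.
    """
    n = samples_per_video_frame(sample_rate_hz, fps)
    start = frame_index * n
    return start, start + n


def pcm_bytes_per_second(sample_rate_hz=SAMPLE_RATE_HZ, bit_depth=BIT_DEPTH,
                         channels=CHANNELS):
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels
```

With these parameters each video frame spans 1920 audio samples, and uncompressed mono audio occupies 96,000 bytes per second.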
We divided the dataset into training, validation, and test subsets at a ratio of approximately 4:1:1, keeping the age and gender distributions balanced across the subsets. No speaker appears in more than one subset. In addition, the corpus provides sentence-level transcriptions, a file mapping the paths of whispered recordings to those of the corresponding normal-speech recordings, and a speaker metadata file.
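A speaker-disjoint split of this kind can be sanity-checked after download. The sketch below is one possible check, not the corpus's own tooling: it assumes the speaker IDs of each subset have already been collected into sets (for example from the speaker metadata file, whose format is not described here), and the speaker IDs and function names are made up for illustration.

```python
from itertools import combinations


def find_speaker_overlap(splits):
    """Return speakers shared between any two subsets.

    `splits` maps a subset name to a collection of speaker IDs.
    The result maps (subset_a, subset_b) to the set of shared IDs;
    an empty dict means the subsets are speaker-disjoint.
    """
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


def split_fractions(splits):
    """Fraction of all speakers that falls in each subset."""
    total = sum(len(set(ids)) for ids in splits.values())
    return {name: len(set(ids)) / total for name, ids in splits.items()}


# Toy data with hypothetical speaker IDs, laid out in a 4:1:1 ratio.
example_splits = {
    "train": {"S001", "S002", "S003", "S004"},
    "dev": {"S005"},
    "test": {"S006"},
}
```

Calling `find_speaker_overlap(example_splits)` returns an empty dict for the toy data, and `split_fractions(example_splits)` gives roughly two thirds, one sixth, and one sixth, matching a 4:1:1 ratio.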
Contact us
Business cooperation: bd@aishelldata.com
Technical support: tech@aishelldata.com
Phone: +86-010-80225006
Address:
D5-A501, Haidian Dayue Information Technology Park, Haidian District, Beijing
