The AISHELL-5 dataset is recorded inside a hybrid electric car, with a far-field microphone placed above the door handle of each of the four doors to capture far-field audio from different areas of the cabin. Additionally, each speaker wears a high-fidelity microphone to collect near-field audio for data annotation. A total of 260 participants, none with notable accents, take part in the recording. During each session, two to four speakers are randomly seated in the four positions inside the car and engage in free conversation without content restrictions, ensuring the naturalness and authenticity of the audio data. The average session lasts about 10 minutes. Transcripts for all speech data are provided in TextGrid format.
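As a rough illustration of what TextGrid annotations look like, the sketch below extracts labeled intervals from a Praat TextGrid with a regular expression. The snippet inside `SAMPLE` is made up for illustration, not real AISHELL-5 data; in practice, a dedicated parser such as the `textgrid` or `praatio` Python packages handles the full format.

```python
# Minimal sketch: pull (xmin, xmax, text) intervals out of a Praat TextGrid.
# SAMPLE is an illustrative annotation fragment, not actual AISHELL-5 content.
import re

SAMPLE = """\
        intervals [1]:
            xmin = 0.00
            xmax = 1.25
            text = "hello"
        intervals [2]:
            xmin = 1.25
            xmax = 2.80
            text = ""
"""

def parse_intervals(textgrid_text):
    """Return (xmin, xmax, text) tuples for every interval in the TextGrid."""
    pattern = re.compile(r'xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"')
    return [(float(a), float(b), t) for a, b, t in pattern.findall(textgrid_text)]

intervals = parse_intervals(SAMPLE)
```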


In normal driving scenarios, the car typically contains various noises from both inside and outside. External noises include environmental sounds, wind noise, tire noise, etc., while internal noises come from sources such as music players and air conditioning. These noises significantly impact the accuracy of in-car speech recognition systems. To comprehensively cover the noise types encountered in real-world in-car scenarios, we carefully design the recording scenes. For environmental noise, recordings are made on different driving segments (urban streets and highways) during both daytime and nighttime. For wind and tire noise, we control how far the car windows are open (fully closed, half open, and one-third open) and the car's speed (stationary, low-speed, medium-speed, and high-speed). For noise inside the car, we set the music player and air conditioning to different levels to cover a variety of in-car conditions. These sub-scenes are numbered and combined in various ways to form the final recording scenarios, resulting in over 60 recording scenarios in total.
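To make the combinatorial design concrete, the sketch below enumerates full scenarios as the Cartesian product of the sub-scene factors described above. The factor names and values paraphrase the text; the actual AISHELL-5 scene numbering scheme is not specified here, and the in-car factors (music, air conditioning) that push the total past 60 are omitted.

```python
# Sketch: combining recording sub-scenes into full scenarios via a
# Cartesian product. Factor names/values paraphrase the dataset description;
# real AISHELL-5 scene IDs may differ.
from itertools import product

factors = {
    "road": ["urban", "highway"],
    "time": ["daytime", "nighttime"],
    "window": ["closed", "one-third open", "half open"],
    "speed": ["stationary", "low", "medium", "high"],
    # Music-player and air-conditioning levels are varied on top of these.
}

scenarios = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(scenarios))  # 2 * 2 * 3 * 4 = 48 combinations before in-car factors
```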
 

The AISHELL-5 dataset contains more than 100 hours of speech data, divided into 94 hours of training data (Train), 3.3 hours of validation data (Dev), and two test sets (Eval1 and Eval2) with durations of 3.3 and 3.58 hours, respectively. Each subset includes far-field audio from 4 channels; only the training set also contains near-field audio. Additionally, to promote research on speech simulation techniques, we provide a large-scale noise dataset (Noise), recorded with the same settings as the far-field data but without any speaker speech, lasting approximately 40 hours.
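A common first step with the 4-channel far-field recordings is stacking the per-door channels of one session into a single channels-first structure. The sketch below does this with the standard-library `wave` module on synthetic audio; the `DOOR1.wav` … `DOOR4.wav` filenames are hypothetical, since the actual AISHELL-5 file layout is not described here.

```python
# Sketch: stack the four far-field door-microphone channels of one session.
# File names (DOOR1.wav..DOOR4.wav) are hypothetical; the released corpus
# may use a different naming scheme. Demonstrated on synthetic silent audio.
import struct
import tempfile
import wave
from pathlib import Path

def write_mono_wav(path, samples, rate=16000):
    """Write 16-bit mono PCM samples to a WAV file (used for the demo)."""
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))

def read_mono_wav(path):
    """Return (sample_rate, list of int16 samples) for a mono WAV file."""
    with wave.open(str(path), "rb") as w:
        n = w.getnframes()
        return w.getframerate(), list(struct.unpack(f"<{n}h", w.readframes(n)))

def stack_far_field(session_dir, n_channels=4):
    """Read DOOR1..DOOR<n> mono WAVs and return (rate, channels-first lists)."""
    channels, rate = [], None
    for ch in range(1, n_channels + 1):
        r, samples = read_mono_wav(Path(session_dir) / f"DOOR{ch}.wav")
        rate = rate or r
        channels.append(samples)
    return rate, channels

# Demo with synthetic audio: four 1-second silent channels at 16 kHz.
tmp = Path(tempfile.mkdtemp())
for ch in range(1, 5):
    write_mono_wav(tmp / f"DOOR{ch}.wav", [0] * 16000)
rate, channels = stack_far_field(tmp)
```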

 

Coming soon


AISHELL-5 In-Car Multi-Channel Multi-Speaker Speech Dataset

License: CC BY-NC 4.0

Paper

 

Data Download