Recap | 2021 SpeechHome Speech Technology Workshop
On November 28, 2021, the first SpeechHome Speech Technology Workshop came to a successful close. AISHELL was a co-organizer of the event, and in the Kaldi session its CEO Hui Bu gave a talk titled "AISHELL & AISHELL Foundation".
This SpeechHome Speech Technology Workshop was hosted by the CCF Technical Committee on Speech, Dialogue and Auditory Processing, the Shenzhen Artificial Intelligence Society, Xiaomi, Tencent Tianlai Lab, and SpeechHome, and co-organized by AISHELL, the Audio, Speech and Language Processing Group at Northwestern Polytechnical University, the Speech and Multimodal Intelligent Information Processing Lab at Duke Kunshan University, the Center for Speech and Language Technologies at Tsinghua University, the Speech Signal Processing Group at Inner Mongolia University, and Shenlan College.
What follows is a recap of the two sessions held on the morning of the 27th: Intelligent Speech Technology for Meeting Scenarios and Technical Presentations.
Intelligent Speech Technology for Meeting Scenarios
(Speakers: Ming Li, Shen Huang, Zhuo Chen, Shiliang Zhang)
Ming Li's talk was titled "Speaker Diarization for Complex Scenarios". The report first introduced the background of speaker diarization, a task that has become a research hotspot in recent years, and its potential application scenarios. It then drew on recent research results to explain the underlying techniques and research trends, mainly from the perspective of modular approaches; next, at the system level, it presented systems submitted to recent international evaluations; and finally it extended the discussion to two emerging settings for speaker diarization: online and multimodal.
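For readers unfamiliar with the modular approach mentioned in the talk, such a pipeline typically chains voice activity detection, per-segment speaker embedding extraction, and clustering. The sketch below illustrates only the final clustering stage, agglomerative clustering on cosine distance; the embedding dimension, threshold, and use of SciPy are illustrative assumptions, not the speaker's actual system.

```python
# Minimal sketch of the clustering stage in a modular diarization pipeline:
# speech segments are assumed to have already been detected (VAD) and turned
# into fixed-dimensional speaker embeddings; we group them by agglomerative
# clustering on cosine distance. Illustrative only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_speaker_embeddings(embeddings: np.ndarray, distance_threshold: float = 0.5):
    """embeddings: (num_segments, dim) array; returns a speaker label per segment."""
    z = linkage(embeddings, method="average", metric="cosine")   # average-linkage tree
    return fcluster(z, t=distance_threshold, criterion="distance")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake embeddings scattered around two random directions ("two speakers").
    base_a, base_b = rng.normal(size=128), rng.normal(size=128)
    spk_a = base_a + 0.1 * rng.normal(size=(5, 128))
    spk_b = base_b + 0.1 * rng.normal(size=(5, 128))
    print(cluster_speaker_embeddings(np.vstack([spk_a, spk_b])))
```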
Shen Huang's talk was titled "Building a Low-Latency, High-Accuracy Intelligent Speech Recognition System for Tencent Meeting". The report explained that streaming ASR in a real-time meeting communication system faces very strict latency requirements, and focused on how the design of front-end signal processing (acoustic front end, codecs, and feature extraction) paves the way, step by step, for low-latency streaming ASR. Drawing on Tencent Meeting's real-time captioning and transcription systems, it also described how state-of-the-art hybrid and end-to-end systems are used together to further optimize decoder speed, latency, and user experience while maintaining high recognition accuracy.
Zhuo Chen's talk was titled "Conversational Speech Recognition for Meetings". The report noted that, unlike conventional speech recognition applications, multi-party conversation in meetings often involves overlapping speech and rapid speaker changes. These phenomena break the single-speaker assumption of traditional speech processing systems and cause large performance degradation. He then discussed the impact and characteristics of overlapping speech in meeting systems, and shared work and experience from recent years on both online and offline processing in this area.
Shiliang Zhang's talk was titled "Introduction to the Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge". The report introduced the M2MeT challenge held at ICASSP 2022, organized by Alibaba DAMO Academy's Speech Lab together with AISHELL and other institutions as well as several influential international experts. It covered the specifics of the challenge, including the AliMeeting dataset planned for open release, the track settings, and the baseline systems, and also shared the Speech Lab's research progress on speaker diarization in meeting scenarios.
Technical Presentations
(Speakers: Xinlei Ren, Zhengkun Tian)
Xinlei Ren's talk was titled "A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement". The report presented the system Kuaishou submitted to ConferencingSpeech 2021, the far-field multi-channel speech enhancement challenge held at INTERSPEECH 2021. The system combines a multi-channel causal U-Net with a conventional beamforming structure; compared with the baseline it greatly improves noise reduction and won first place in both tracks of the challenge. The talk gave a detailed account of the challenge setup, model architecture, training data, and experimental findings.
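As a side note on what "causal" means for a real-time enhancement network like the one described above: each output frame may depend only on current and past input frames. A minimal sketch of a causal 1-D convolution (left-only padding) follows; the layer sizes are arbitrary assumptions, and this is not the Kuaishou system.

```python
# Illustrative sketch of the "causal" ingredient of a streaming enhancement
# network: a 1-D convolution padded only on the left, so the output at frame t
# never depends on future frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1          # pad past frames only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                        # x: (batch, channels, frames)
        x = F.pad(x, (self.left_pad, 0))         # no right (future) padding
        return self.conv(x)

if __name__ == "__main__":
    layer = CausalConv1d(in_ch=4, out_ch=8, kernel_size=3)
    frames = torch.randn(1, 4, 100)              # e.g. 4 mic channels, 100 frames
    print(layer(frames).shape)                   # -> torch.Size([1, 8, 100])
```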
Next is a recap of the two sessions held on the afternoon of the 27th, Kaldi Open-Source Technology and Technical Presentations, chaired by Dong Wang, associate research professor at Tsinghua University.
Kaldi Open-Source Technology
(Speakers: Daniel Povey, Hui Bu)
Daniel Povey's talk was titled "Next-Gen Kaldi: Status and Near-term Plans". The next-gen Kaldi tools, k2, Lhotse, and Icefall, are under rapid development. The talk covered the progress made so far and the near-term plans, including the work to enable streaming/real-time decoding from C++, as well as recent experiments and modeling improvements.
Hui Bu's talk was titled "AISHELL & AISHELL Foundation". Intelligent speech technology has become a core application technology for AI human-computer interaction. In today's industry environment, where algorithms are progressively open-sourced and compute is no longer a high barrier, data remains high-value and hard to obtain. Large-scale datasets are the starting point for realizing speech technology and act as a catalyst for putting it into practice, which makes the interpretability, usability, and reproducibility of datasets especially important. The report first reviewed how experimental data is currently used in the speech community, along with AISHELL's Kaldi open-source projects and the publicly available datasets it has released; it then described how AISHELL's dataset-building capability supports the tasks of cutting-edge speech technology challenges; finally, it introduced SpeechHome as a community serving AI speech developers, building a platform for industry-academia-research exchange and a talent-centered knowledge-sharing ecosystem.
Technical Presentations
(Speakers: Junbo Zhang, Hang Chen, Hengshun Zhou)
Junbo Zhang's talk was titled "Sound Component Analysis and Its Downstream Applications". The report first introduced sound component analysis as an important research direction in machine learning with very broad application scenarios that has attracted growing attention in recent years. He then shared some of his team's work in this area, including applying sound component analysis to audio source separation and voice activity detection in place of traditional methods; experiments show that the new approach has clear advantages over the traditional ones. The talk concluded with demos and applications already in production.
Hang Chen's talk was titled "Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement". We propose a visual embedding approach to improve embedding-aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and articulation-place levels. We first extract visual embeddings from lip frames using a pre-trained phone or articulation-place recognizer for visual-only EASE (VEASE). Next, we extract audio-visual embeddings from noisy speech and lip frames in an information-intersection manner, exploiting the complementarity of audio and visual features for multi-modal EASE (MEASE). Experiments on the TCD-TIMIT corpus corrupted by simulated additive noises show that the proposed subword-based VEASE approach is more effective than conventional embedding at the word level. Moreover, visual embedding at the articulation-place level, which leverages the high correlation between place of articulation and lip shapes, performs even better than embedding at the phone level. Finally, the experiments establish that the proposed MEASE framework, incorporating both audio and visual embeddings, yields significantly better speech quality and intelligibility than the best visual-only and audio-only EASE systems.
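A very rough sketch of the general embedding-aware enhancement idea, an utterance-level (visual or audio-visual) embedding conditioning a mask estimator over the noisy spectrogram, is given below. The GRU mask estimator, fusion by concatenation, and all shapes are illustrative assumptions, not the VEASE/MEASE architecture.

```python
# Sketch: an utterance-level embedding conditions a simple mask estimator that
# enhances a noisy magnitude spectrogram. Shapes and fusion are assumptions.
import torch
import torch.nn as nn

class EmbeddingAwareMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq + emb_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, embedding):
        # noisy_mag: (batch, frames, n_freq); embedding: (batch, emb_dim)
        emb = embedding.unsqueeze(1).expand(-1, noisy_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([noisy_mag, emb], dim=-1))
        return self.mask(h) * noisy_mag            # masked (enhanced) magnitude

if __name__ == "__main__":
    model = EmbeddingAwareMaskEstimator()
    out = model(torch.rand(2, 50, 257), torch.rand(2, 128))
    print(out.shape)                                # torch.Size([2, 50, 257])
```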
Hengshun Zhou's talk was titled "Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition". Multimodal emotion recognition is a challenging task in affective computing, since it is difficult to extract discriminative features that capture the subtle differences among human emotions, which are abstract concepts with many forms of expression. Moreover, how to fully utilize both audio and visual information remains an open problem. The talk described a novel multimodal fusion attention network for audio-visual emotion recognition based on adaptive and multi-level factorized bilinear pooling (FBP). Specifically, deep features from the audio encoder and the video encoder are first selected through an embedding attention mechanism to obtain the emotion-related regions for FBP fusion. Then, adaptive adjustment of the audio and video weights enables a deeper fusion. Furthermore, multi-level information from global-trunk and intra-trunk data is used to design a new network architecture. Detailed comparisons and results on different datasets were presented.
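Factorized bilinear pooling itself is a compact, well-defined fusion operator; the sketch below follows the common low-rank (MFB-style) formulation with sum pooling and power/L2 normalization. The dimensions and factor size are assumptions, and the adaptive and multi-level parts of the talk are not reproduced here.

```python
# Minimal sketch of factorized bilinear pooling (FBP) for fusing an audio
# feature and a visual feature via a low-rank bilinear interaction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    def __init__(self, audio_dim, visual_dim, out_dim, factor_k=4):
        super().__init__()
        self.k = factor_k
        self.proj_a = nn.Linear(audio_dim, out_dim * factor_k)
        self.proj_v = nn.Linear(visual_dim, out_dim * factor_k)

    def forward(self, a, v):
        joint = self.proj_a(a) * self.proj_v(v)                              # (batch, out*k)
        joint = joint.view(-1, joint.size(1) // self.k, self.k).sum(dim=2)   # sum-pool over k
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)     # power normalization
        return F.normalize(joint, dim=1)                                     # L2 normalization

if __name__ == "__main__":
    fbp = FactorizedBilinearPooling(audio_dim=256, visual_dim=512, out_dim=128)
    print(fbp(torch.rand(8, 256), torch.rand(8, 512)).shape)                 # torch.Size([8, 128])
```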
At 9:30 on the morning of the 28th, Professor Lei Xie of Northwestern Polytechnical University opened the session with remarks reviewing the excellent talks given on the 27th by speakers from industry and academia. He noted that the intelligent meeting is in fact a highly comprehensive setting for speech technology, involving front-end signal processing, speaker diarization, speech recognition, and more, with many directions still to be explored. Daniel's talk was also excellent and gave a deeper view of what sets next-generation Kaldi apart. He thanked AISHELL for the organization and Hui Bu for his open-source contributions, and mentioned that, within the open-source community, WeNet has kept up its open-source activities, with more than 1,500 GitHub stars and over 100 organizations using it. Finally, he introduced the agenda for the morning sessions on the 28th and wished the event success.
The following two sessions covered WeNet Open-Source Technology and Technical Presentations, chaired by Professor Xueliang Zhang of Inner Mongolia University.
WeNet Open-Source Technology
(Speakers: Binbin Zhang, Lu Fan, Jingjing Yin)
Binbin Zhang's talk was titled "Open-Source Speech and the wenet-e2e Open-Source Community". The report noted that open source plays an increasingly important role in the modern computing and software ecosystem; the speech community is also becoming more open, with more and more open-source speech work emerging. The wenet-e2e community and the projects under it are committed to building an open-source speech ecosystem, to solving the fundamental problems of putting speech technology into production, and to exploring speech technology better suited to industrial deployment and productization. He then introduced open-source speech and the wenet-e2e community, presented the existing open-source projects wenet, WenetSpeech, wenet-text-processing, and wenet-kws, and outlined the community's further project plans and development.
Lu Fan's talk was titled "Progress and Applications of JD's Intelligent Voice Interaction Technology". Built on state-of-the-art intelligent voice human-computer interaction technology and on the highest-frequency interaction scenarios in retail, health, logistics, and technology, JD's large-scale commercial emotional intelligent interaction system, Yanxi, is widely used in JD's internal business and has also grown rapidly in vertical industries such as government affairs, finance, transportation, culture and tourism, and healthcare. It brings merchants and customers gains in efficiency, experience, and revenue, provides 24/7 service, and covers pre-sales, in-sales, and after-sales scenarios. JD's speech technology has achieved strong results in SRE18, DCASE 2019, FFSVC 2020, and M2VoC, has published more than ten papers at conferences such as INTERSPEECH and ICASSP, and has applied these techniques in a variety of voice interaction scenarios.
Jingjing Yin's talk was titled "Speech Recognition Technology and Applications at Ximalaya". Starting from Ximalaya's concrete business scenarios, the report described how speech recognition is applied to host-side audio editing, content moderation, and intelligent transcripts; it covered the evolution of Ximalaya's in-house HiASR algorithms, experiments and explorations with end-to-end speech recognition, and the design and implementation of the HiASR microservice framework.
Technical Presentations
(Speakers: Jing Zhao, Xixin Wu, Shubo Lv, Yihui Fu)
Jing Zhao's talk was titled "Automatic Speech Recognition for Low-Resource Languages: The THUEE Systems for the IARPA OpenASR20 Evaluation". The paper introduces our Automatic Speech Recognition (ASR) systems for the IARPA Open Automatic Speech Recognition Challenge (OpenASR20), as well as some post-evaluation explorations with speech pre-training. We competed as team THUEE in the Constrained training condition for the 10 languages, under which the only speech data permitted for training each language is a 10-hour corpus. We adopt a hybrid NN-HMM acoustic model and an N-gram language model (LM) to build our basic ASR systems. The proposed acoustic model, CNN-TDNNF-A, combines a Convolutional Neural Network (CNN), a Factored Time-Delay Neural Network (TDNN-F), and a self-attention mechanism. For the low-resource condition, we apply speed and volume perturbation, SpecAugment, and reverberation for data augmentation, as well as data clean-up to filter out interfering information. A series of pre- and post-processing procedures for the evaluation set, such as Speech Activity Detection (SAD), system fusion, and results filtering, are carried out to obtain the final results. Furthermore, we exploit a wav2vec 2.0 pre-trained model to obtain more effective speech representations for the hybrid system through a series of explorations, which brings evident improvements.
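Of the augmentations listed in the abstract, SpecAugment is easy to illustrate: random frequency and time masks applied to a log-mel spectrogram. The sketch below is a minimal NumPy version with illustrative mask widths, not the THUEE recipe.

```python
# A minimal SpecAugment-style sketch: frequency and time masking on log-mel
# features. Mask counts and widths are illustrative assumptions.
import numpy as np

def spec_augment(spec: np.ndarray, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=30, rng=None) -> np.ndarray:
    """spec: (num_mels, num_frames) log-mel features; returns a masked copy."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):                      # frequency masking
        f = rng.integers(0, freq_width + 1)
        f0 = rng.integers(0, max(1, n_mels - f))
        out[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):                      # time masking
        t = rng.integers(0, time_width + 1)
        t0 = rng.integers(0, max(1, n_frames - t))
        out[:, t0:t0 + t] = 0.0
    return out

if __name__ == "__main__":
    print(spec_augment(np.random.rand(80, 300)).shape)   # (80, 300)
```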
Xixin Wu's talk was titled "Speech Emotion Recognition Using Sequential Capsule Networks". Speech emotion recognition (SER) is a fundamental step towards fluent human-machine interaction. One challenging problem in SER is obtaining an utterance-level feature representation for classification. Recent work on SER has made significant progress by using spectrogram features and introducing neural network methods, e.g., convolutional neural networks (CNNs). However, a fundamental limitation of CNNs is that they do not capture the spatial information in spectrograms, essentially the position and relationship information of low-level features such as pitch and formant frequencies. We propose a model based on capsule networks (CapsNets) that takes into account the spatial relationships of speech features in spectrograms and provides an effective routing method for obtaining utterance-level features. We also introduce a recurrent connection to improve the CapsNet model's ability to capture temporal information. Experimental results on the IEMOCAP corpus demonstrate the effectiveness of CapsNets for SER.
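The distinctive ingredient of capsule networks is routing by agreement between lower- and higher-level capsules. The sketch below shows the classic dynamic-routing step; the capsule counts and dimensions are placeholder assumptions, and the recurrent connection mentioned in the talk is omitted.

```python
# Illustrative sketch of routing-by-agreement: lower-level capsule predictions
# are iteratively weighted by how well they agree with higher-level capsules.
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: (batch, in_caps, out_caps, out_dim) prediction vectors
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)    # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                               # coupling coefficients
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))      # (batch, out_caps, out_dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)          # agreement update
    return v

if __name__ == "__main__":
    print(dynamic_routing(torch.randn(2, 32, 4, 16)).shape)   # torch.Size([2, 4, 16])
```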
Shubo Lv and Yihui Fu's talk was titled "Complex-Spectrum-Based Speech Enhancement and Dereverberation in Complex Acoustic Scenarios". Deep-learning-based speech front-end processing has developed rapidly; in particular, deep complex networks, which model magnitude and phase jointly, have further improved speech enhancement and dereverberation in recent years. The report presented some of the practical work of the Audio, Speech and Language Processing Group at Northwestern Polytechnical University on deep-complex-network speech enhancement and dereverberation in both single-channel and multi-channel settings.
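The appeal of complex-spectrum modeling is that a complex ratio mask corrects magnitude and phase in a single complex multiplication. The sketch below applies such a mask to a noisy STFT; the mask here is random, standing in for a network output, and this is not the ASLP group's model.

```python
# Sketch: applying a complex ratio mask to a noisy STFT rescales the magnitude
# and rotates the phase in one step, which is why complex-domain networks can
# correct both. The mask values here are random placeholders.
import torch

def apply_complex_mask(noisy_stft: torch.Tensor, mask_real: torch.Tensor,
                       mask_imag: torch.Tensor) -> torch.Tensor:
    """noisy_stft: complex tensor (freq, frames); masks: real tensors of the same shape."""
    mask = torch.complex(mask_real, mask_imag)
    return mask * noisy_stft      # complex multiply: scales magnitude, shifts phase

if __name__ == "__main__":
    noisy = torch.randn(257, 100, dtype=torch.complex64)
    enhanced = apply_complex_mask(noisy, torch.rand(257, 100), torch.rand(257, 100))
    print(enhanced.shape, enhanced.dtype)          # torch.Size([257, 100]) torch.complex64
```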