N号馆 Session 7 Materials


tuoxie2046

Published 2018/10/31 in the Science category

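[Title] Latest Advances in Artificial Intelligence Research: Where Sound Meets Image [Abstract] This lecture introduces the latest research results and applications of artificial intelligence in the joint processing of images and sound. In the real world, a person's voice signal is tightly coupled with lip movements and facial expressions. By learning and exploiting this coupling between sound and image, AI enables applications that are hard to achieve with audio information alone, such as audio correction, strong noise suppression and separation of multiple voices. [Speaker] Dr. Ma Lei holds a Doctor of Engineering degree from the Department of Bioengineering at the University of Tokyo and currently works as a lead researcher at Lenovo's Yamato Lab in Japan. Current research areas: computer vision and next-generation intelligent human-computer interaction. [Organizer] 余锦泽 [Host] 华促会 [Time] October 24, 20:00 [Venue] Large conference room, 1st floor, Engineering Building 9, Hongo Campus, University of Tokyo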

Deep learning  Artificial intelligence  Speech analysis  Human-computer interaction

Text content
1. AI in Audio-Visual Processing MA LEI 2018/10/24
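2. Self introduction • Graduated from China University of Mining and Technology in 2011 with a bachelor's degree in mechanical engineering • Received a master's degree in June 2014 from the Robotics Institute of Beihang University; main research topic during that period: medical robots • Received a Doctor of Engineering degree in April 2018 from the biomedical engineering program of the University of Tokyo; doctoral research covered medical image processing, surgical robot navigation systems and related topics • Currently a lead researcher at Lenovo Research Japan, Yamato Lab (birthplace of the ThinkPad); current research directions: next-generation intelligent human-computer interaction and computer vision • Interests: travel, reading the latest technology news, space-themed films and TV series, good food MA LEI, Ph.D.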
3. Contents • Background – Era of AI – Temporal co-occurrence of audio and visual signal • AI based audio-visual processing – Audio-visual feature alignment – Speaker detection – Lip reading – Voice separation • Human machine interface – Voice interaction
4. Contents • Background – Era of AI – Temporal co-occurrence of audio and visual signal • AI based audio-visual processing – Audio-visual feature alignment – Speaker detection – Lip reading – Voice separation • Human machine interface – Voice interaction
5. Era of Artificial Intelligence • AI brings breakthroughs in many important fields – Object detection – Cancer detection – Speech recognition – Self-driving – Robotics –… • Big companies are all in on AI – Google, Huawei, Samsung… • AI in audio-visual processing – Image processing and audio processing have each been studied extensively – What happens when audio and images are combined?
6. Temporal co-occurrence of audio and visual signal • Two example videos • Co-occurrence of audio and visual signal (figure: motion and audio plotted against time)
7. Single-modality feature • Vision feature: three stacked 3D convolutions • Audio feature: three stacked 1D convolutions
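To make the audio branch of this slide concrete, here is a minimal pure-Python sketch of three stacked 1D convolutions with ReLU activations. The kernel values, the stride of 2 and the names `conv1d`, `relu` and `audio_feature` are illustrative assumptions, not details taken from the talk; a real network would learn its kernels and use many channels. The vision branch applies the same idea with kernels that slide over time, height and width.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1D cross-correlation, the operation deep learning calls convolution."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Zero out negative activations."""
    return [max(0.0, v) for v in values]


def audio_feature(waveform, kernels, stride=2):
    """Pass a waveform through stacked conv1d + ReLU layers."""
    x = list(waveform)
    for kernel in kernels:
        x = relu(conv1d(x, kernel, stride))
    return x


# Hypothetical toy input: 32 samples and three hand-picked kernels.
waveform = [0.0, 1.0, 0.0, -1.0] * 8
kernels = [[0.5, 0.5], [1.0, -1.0], [0.25, 0.5, 0.25]]
features = audio_feature(waveform, kernels)
print(len(waveform), "samples ->", len(features), "features")
```

Each strided layer shortens the sequence, so the stack turns a long waveform into a short sequence of higher-level features.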
8. Contents • Background – Era of AI – Temporal co-occurrence of audio and visual signal • AI based audio-visual processing – Audio-visual feature alignment – Speaker detection – Lip reading – Voice separation • Human machine interface – Voice interaction
9. Self-supervision of audio-visual training
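One common way to get training labels without manual annotation is to treat a video's own soundtrack as the positive example and a time-shifted copy of it as the negative example. The sketch below illustrates that idea only; the circular shift, the `min_shift` parameter and the function name `make_alignment_pairs` are assumptions made for this example and are not taken from the slides.

```python
import random


def make_alignment_pairs(video_frames, audio_chunks, min_shift, rng):
    """Build one aligned (label 1) and one misaligned (label 0) example.

    The misaligned example reuses the same audio, circularly shifted by at
    least `min_shift` steps, so no human labeling is needed.
    """
    n = len(video_frames)
    if n != len(audio_chunks) or not 0 < min_shift < n:
        raise ValueError("need equal-length tracks and 0 < min_shift < length")
    shift = rng.randrange(min_shift, n)  # never 0, never a full rotation
    shifted_audio = audio_chunks[shift:] + audio_chunks[:shift]
    return [
        (video_frames, audio_chunks, 1),
        (video_frames, shifted_audio, 0),
    ]


# Hypothetical toy clip: frame and audio chunk identifiers only.
frames = ["v0", "v1", "v2", "v3", "v4", "v5"]
audio = ["a0", "a1", "a2", "a3", "a4", "a5"]
pairs = make_alignment_pairs(frames, audio, min_shift=2, rng=random.Random(0))
for _, track, label in pairs:
    print(label, track)
```

A classifier trained to tell the two cases apart has to learn which sounds go with which motions.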
10. Audio-visual feature alignment • Objective – Align the audio and visual signals – Compensate for the latency between the audio and visual signals caused by capture, processing and transmission • Solution – Align the audio signal to the visual signal using an audio-visual network
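The latency problem on this slide can be illustrated with a much simpler, classical stand-in for the audio-visual network: cross-correlating a per-frame motion signal with a per-frame audio energy signal and picking the best lag. This is a swapped-in technique for illustration, not the learned method from the talk, and the signals and function names below are hypothetical.

```python
def estimate_lag(motion, audio_energy, max_lag):
    """Return the lag (in frames) at which audio best matches motion.

    A positive result means the audio arrives that many frames late.
    """
    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    m, a = centered(motion), centered(audio_energy)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            m[t] * a[t + lag]
            for t in range(len(m))
            if 0 <= t + lag < len(a)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_audio(audio_energy, lag):
    """Undo the estimated lag, padding the end of the track with zeros."""
    n = len(audio_energy)
    return [audio_energy[t + lag] if 0 <= t + lag < n else 0.0 for t in range(n)]


# Hypothetical toy signals: the audio is the motion delayed by two frames.
motion = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
lag = estimate_lag(motion, audio, max_lag=4)
print("estimated lag:", lag)
print("aligned audio:", align_audio(audio, lag))
```

Hand-crafted correlation like this only works when motion and loudness track each other closely, which is one motivation for learning the alignment with a network instead.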
11. Audio-visual feature alignment • Align audio frames with video frames • Relate sound to the motion in the video • Figure: audio-visual network for aligning the audio-visual features. Owens, Andrew, and Alexei A. Efros. "Audio-visual scene analysis with self-supervised multisensory features." arXiv preprint arXiv:1804.03641 (2018).
12. Audio-visual feature alignment • Demo video
13. Speaker detection • Speaker detection is a specific application of audio-visual feature alignment. • Figure: pipeline to generate the audio-visual data • Figure: audio-visual network for speaker detection. Chung, Joon Son, and Andrew Zisserman. "Out of time: automated lip sync in the wild." In Asian Conference on Computer Vision, pp. 251-263. Springer, Cham, 2016.
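Because speaker detection builds on audio-visual alignment, the decision step can be sketched as picking the face whose mouth motion best matches the audio. The example below uses a plain Pearson correlation as a hand-made stand-in for the learned similarity of an audio-visual network; the face labels, the per-frame signals and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def detect_speaker(mouth_motion_by_face, audio_energy):
    """Return the face whose mouth motion correlates best with the audio."""
    scores = {
        face: pearson(motion, audio_energy)
        for face, motion in mouth_motion_by_face.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical per-frame mouth motion for two faces and the shared audio energy.
faces = {
    "left": [0.1, 0.9, 0.8, 0.1, 0.7, 0.2],
    "right": [0.5, 0.4, 0.5, 0.5, 0.4, 0.5],
}
audio_energy = [0.0, 1.0, 0.9, 0.1, 0.8, 0.1]
speaker, scores = detect_speaker(faces, audio_energy)
print("active speaker:", speaker)
```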
14. Speaker detection
15. Lip reading • 2001: A Space Odyssey
16. Lip reading • Objective – Transform lip motion during speech into text • Solution – Use voice dictation results as labeled data to train a lip-reading network • Application – Understand the user's intent in noisy environments – Spying??
17. Lip reading • LipNet: End-to-end sentence-level lipreading [1] • Combining residual networks with LSTMs for lipreading [2] [1] Assael, Yannis M., et al. "LipNet: End-to-end sentence-level lipreading." arXiv preprint arXiv:1611.01599 (2016). [2] Stafylakis, Themos, and Georgios Tzimiropoulos. "Combining residual networks with LSTMs for lipreading." arXiv preprint arXiv:1703.04105 (2017).
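Sentence-level lip readers such as LipNet [1] emit one character distribution per video frame and are trained with a CTC loss, so the per-frame outputs must be collapsed into text. The sketch below shows greedy CTC decoding on made-up scores; the three-symbol alphabet, the score values and the function name are assumptions for illustration, and production systems often use beam search instead.

```python
BLANK = "_"  # CTC blank symbol, assumed to be part of the alphabet


def ctc_greedy_decode(frame_scores, alphabet):
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    best_path = [
        alphabet[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for symbol in best_path:
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


# Hypothetical per-frame scores over the alphabet ["_", "h", "i"].
alphabet = [BLANK, "h", "i"]
frame_scores = [
    [0.1, 0.8, 0.1],    # h
    [0.2, 0.7, 0.1],    # h (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # i
    [0.1, 0.2, 0.7],    # i (repeat, merged)
    [0.8, 0.1, 0.1],    # blank separates the next i
    [0.1, 0.1, 0.8],    # i
]
text = ctc_greedy_decode(frame_scores, alphabet)
print(text)
```

The blank symbol is what lets the decoder keep a genuinely doubled letter while still merging frames that show the same mouth shape.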
18. Lip reading
19. Voice separation • Objective – Separate mixed voice streams – Suppress noise by extracting the target voice stream • Problem – Extremely difficult to achieve with audio processing alone • New solution – Separate voice streams using co-occurring visual information • Figure: mixed voice streams and co-occurring visual data → voice separation → separated voice streams
20. Voice separation • Figure: source video and mixed voices; separated videos and voices (left speaker, right speaker). Ephrat, Ariel, et al. "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation." arXiv preprint arXiv:1804.03619 (2018).
21. Voice separation: Research 1. Ephrat, Ariel, et al. "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation." arXiv preprint arXiv:1804.03619 (2018).
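Separation models of this kind typically output one time-frequency mask per speaker, which is multiplied with the spectrogram of the mixture. The toy sketch below shows only that final masking step. It computes "ideal" magnitude ratio masks from known sources, whereas a real system predicts the masks from the mixed audio and the speakers' faces; it also assumes that magnitudes add exactly, which holds only approximately for real audio. All names and values are hypothetical.

```python
def ratio_masks(sources, eps=1e-8):
    """One mask per source: its share of the total magnitude in each bin."""
    n_frames, n_bins = len(sources[0]), len(sources[0][0])
    return [
        [
            [
                src[t][f] / (sum(s[t][f] for s in sources) + eps)
                for f in range(n_bins)
            ]
            for t in range(n_frames)
        ]
        for src in sources
    ]


def apply_mask(mixture, mask):
    """Element-wise product of a spectrogram and a mask."""
    return [
        [value * weight for value, weight in zip(mix_row, mask_row)]
        for mix_row, mask_row in zip(mixture, mask)
    ]


# Hypothetical magnitude spectrograms (2 frames x 2 frequency bins).
left = [[1.0, 0.0], [0.5, 0.5]]
right = [[0.0, 2.0], [0.5, 1.5]]
mixture = [
    [l + r for l, r in zip(l_row, r_row)] for l_row, r_row in zip(left, right)
]

left_mask, right_mask = ratio_masks([left, right])
left_estimate = apply_mask(mixture, left_mask)
right_estimate = apply_mask(mixture, right_mask)
print(left_estimate)
print(right_estimate)
```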
22. Voice separation: Research 1 • Visualization of the per-frame visual contribution
23. Voice separation: Research 2 • Separation network. Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement." arXiv preprint arXiv:1804.04121 (2018).
24. Voice separation: Research 2 • Training pipeline. Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement." arXiv preprint arXiv:1804.04121 (2018).
25. Demo video of Research 2
26. Contents • Background – Era of AI – Temporal co-occurrence of audio and visual signal • AI based audio-visual processing – Audio-visual feature alignment – Speaker detection – Lip reading – Voice separation • Human machine interface – Voice interaction
27. Voice is the most natural way of communication • Voice assistant • Benefits of voice interaction – Natural for the user – Available anytime – Interaction even at a distance – Faster text input than typing – Recognizes general terms well – Multitasking
28. Comparison of different interactions • Figure: mouse/keyboard interaction, touch interaction and voice interaction plotted by naturalness of interaction against precision of interaction
29. Voice interaction • Know the user better through visual and audio data – Find who is speaking: speaker detection – Listen to the owner's voice: face recognition + voice separation – Lip reading (in the future)