Public defence in Computer Science, M.Sc. Tzu-Jui Wang
Title of the thesis: Deep Visual Understanding and Beyond: Saliency, Uncertainty and Bridges to Natural Language
Doctoral student: Tzu-Jui Wang
Opponent: Dr. Esa Rahtu, Tampere University, Finland
Custos: Prof. Samuel Kaski, Aalto University School of Science, Department of Computer Science
Deep learning methods for modeling uni-modal and multi-modal data
While the human world is dominated by visual information, other modalities, such as natural language, provide an indispensable additional channel of communication among humans. To achieve comparable cognition, a cognitive agent must be able to comprehend both uni-modal and multi-modal signals, e.g. images and text, as the situation requires.
The dissertation first studies the comprehension of visual stimuli from images and videos, focusing on two topics: visual saliency and uncertainty estimation. It then progresses towards bridging multi-modal signals by presenting a way to capture the relationships among the visual elements in an image. This is followed by visual captioning tasks aimed at generating meaningful descriptions for images and videos. Lastly, it turns to vision-language pre-training, with the goal of improving the generalization of multi-modal machine learning models to a wide variety of downstream tasks.
This work is highly relevant to ongoing research in computer vision and natural language processing. It presents various methods, built on different machine learning paradigms, for both uni-modal and multi-modal settings, along with enhancements in model robustness. It also contributes new insights to the field, such as enabling weakly-supervised learning in multi-modal tasks.
The findings presented in the study can be applied to develop more robust and effective cognitive systems capable of handling multi-modal data. In addition, the results demonstrate the effectiveness and wide applicability of different learning paradigms in improving cognitive systems' abilities in visual understanding and multi-modal reasoning.
Key Words: saliency estimation, visual captioning, scene graph, vision-language representation learning
Thesis available for public display 10 days prior to the defence at: https://aaltodoc.aalto.fi/doc_public/eonly/riiputus/
Contact information:
Email: [email protected]
Mobile: +46705674020
Doctoral theses in the School of Science: https://aaltodoc.aalto.fi/handle/123456789/52
Zoom Quick Guide: https://www.aalto.fi/en/services/zoom-quick-guide