Cross-modal retrieval bridges the modal gap between text and vision and is a critical task in many multimedia applications. Over the past decade, we have witnessed significant progress in cross-modal retrieval: the architecture of mainstream methods has evolved from joint-embedding structures to attention-based structures. More recently, many cross-modal BERT methods have emerged. Benefiting from cross-modal attention and pre-training on large-scale datasets, they have achieved state-of-the-art performance in cross-modal retrieval. Meanwhile, these advances in cross-modal retrieval models provide a more reliable foundation for multimedia advertising and have spurred the emergence of the multimedia advertising market. In this tutorial, we will introduce the basics of cross-modal image/video retrieval as well as the application of cross-modal retrieval in multimedia advertising.
Tan Yu is a senior research scientist at Baidu Research, USA. His current research interests include cross-modal retrieval and multimedia advertising. He received his Ph.D. from Nanyang Technological University. He has published more than 10 papers on multimedia retrieval and multimedia advertising in peer-reviewed international journals and conferences.
Jingjing Meng is a faculty member in the Computer Science and Engineering Department at SUNY Buffalo. Before joining UB, she worked as a research fellow at Nanyang Technological University (NTU), Singapore, and as a senior staff research engineer at Motorola's Applied Research Center, Schaumburg, IL. Jingjing received her Ph.D. from NTU, Singapore, her M.S. from Vanderbilt University, and her B.E. from Huazhong University of Science & Technology (HUST), China. Her research interests include computer vision and multimedia content retrieval.