Multimodal Foundation Models for Unified Image, Video and Text Understanding

Ayodele R. Akinyele 1, Oseghale Ihayere 2, Osayi Eromhonsele 1, Ehisuoria Aigbogun 3, Adebayo Nurudeen Kalejaiye 4 and Oluwole Olakunle Ajayi 5, *

1 Kenan-Flagler Business School, University of North Carolina at Chapel Hill, North Carolina, USA.
2 Fuqua School of Business, Duke University, Durham, North Carolina, USA.
3 Booth School of Business, University of Chicago, Illinois, USA.
4 Scheller College of Business, Georgia Institute of Technology, Georgia, USA.
5 Community and Program Specialist, UHAI For Health Inc, Worcester, Massachusetts, USA.
 
 
Review
Open Access Research Journal of Science and Technology, 2024, 12(02), 081–093.
Article DOI: 10.53022/oarjst.2024.12.2.0139
Publication history: 
Received on 08 October 2024; revised on 16 November 2024; accepted on 19 November 2024
 
Abstract: 
Advanced AI models can now interpret and understand images, videos, and text. Conventional AI models focused separately on image classification, text analysis, and video processing. Multimodal foundation models combine the analysis of these data types in a single framework to meet the need for more integrated AI systems. These models learn joint representations across modalities, enabling them to generate text from images, analyze videos with textual context, and answer visual questions. The theoretical foundations of cross-modal learning and recent architectural advances have driven the rapid growth of multimodal foundation models, and this study explores the reasons for their success. Transformer-based architectures have profoundly changed how AI models handle multiple data modalities. Self-attention and contrastive learning help these models align and integrate data across modalities, improving overall understanding. The study analyses well-known multimodal models, including CLIP, ALIGN, Flamingo, and VideoBERT, emphasizing their design, training, and performance across tasks. Their performance in caption generation, video-text retrieval, and visual reasoning has led to more adaptable AI systems that can handle complex real-world scenarios. Despite promising results, multimodal learning faces several obstacles: building effective models requires large, high-quality datasets; processing multiple modalities simultaneously is computationally demanding; and issues of bias and interpretability arise. This research also examines the limitations and ethical implications of multimodal models in healthcare and autonomous systems. The study further investigates the future of multimodal foundation models, focusing on reducing computational cost, enhancing model fairness, and extending them to audio, sensor data, and robotics. Because understanding and integrating multimodal information is essential for creating more intuitive and intelligent systems, unified multimodal models could change human-computer interaction. Overall, multimodal foundation models drive the search for generalized and adaptable AI systems. Their capacity to combine image, video, and text data could transform many applications, driving innovation across sectors and stimulating further AI research.
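To make the alignment mechanism mentioned above concrete, the following is a minimal sketch of CLIP-style contrastive alignment between image and text embeddings. It assumes pre-computed encoder outputs; the function name, temperature value, and random inputs are illustrative assumptions, not the actual CLIP implementation.

```python
# Hypothetical sketch: symmetric contrastive (InfoNCE) loss over paired
# image/text embeddings, as used in CLIP-style cross-modal alignment.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched image/text pairs."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random embeddings stand in for image- and text-encoder outputs.
    batch, dim = 8, 512
    loss = contrastive_alignment_loss(torch.randn(batch, dim),
                                      torch.randn(batch, dim))
    print(loss.item())
```

Minimizing this loss pulls embeddings of matching image-text pairs together while pushing mismatched pairs apart, which is the basic mechanism by which contrastive models such as CLIP and ALIGN align modalities in a shared representation space.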

 

Keywords: 
Multimodal; Artificial Intelligence; Transformer-based architectures; Interactions; Models
 