ETMI5: Explain to Me in 5

Within this segment of our course, we will delve into the latest research developments surrounding LLMs. Kicking off with an examination of MultiModal Large Language Models (MM-LLMs), we'll explore how this particular area is advancing swiftly. Following that, our discussion will extend to popular open-source models, focusing on their construction and contributions. Subsequently, we'll tackle the concept of agents that possess the capability to carry out tasks autonomously from inception to completion. Additionally, we'll understand the role of domain-specific models in enriching specialized knowledge across various sectors and take a closer look at groundbreaking architectures such as the Mixture of Experts and RWKV, which are set to improve the scalability and efficiency of LLMs.

Multimodal LLMs (MM-LLMs)

In the past year, there have been notable advancements in MultiModal Large Language Models (MM-LLMs). Specifically, MM-LLMs represent a significant evolution in the space of language models, as they incorporate multimodal components alongside their text processing capabilities. While progress has also been made in multimodal models in general, MM-LLMs have seen particularly substantial improvements, largely because they build heavily on LLMs, which themselves improved remarkably over the past year.

Moreover, the development of MM-LLMs has been greatly aided by the adoption of cost-effective training strategies. These strategies have enabled these models to efficiently manage inputs and outputs across multiple modalities. Unlike conventional models, MM-LLMs not only retain the impressive reasoning and decision-making capabilities inherent in Large Language Models but also expand their utility to address a diverse array of tasks spanning various modalities.

To understand how MM-LLMs function, we can go over some common architectural components. Most MM-LLMs can be divided into five main components, as shown in the image below. The components explained below are adapted from the paper “MM-LLMs: Recent Advances in MultiModal Large Language Models”. Let’s understand each of the components in detail; a minimal code sketch tying the five components together follows the list.

Figure: The general architecture of MM-LLMs, showing the five core components described below.

Image Source: https://arxiv.org/pdf/2401.13601.pdf

1. Modality Encoder: The Modality Encoder (ME) plays a pivotal role in encoding inputs from diverse modalities $I_X$ to extract corresponding features $F_X$. Various pre-trained encoder options exist for different modalities, including visual, audio, and 3D inputs. For visual inputs, options like NFNet-F6, ViT, CLIP ViT, and Eva-CLIP ViT are commonly employed. Similarly, for audio inputs, frameworks such as CFormer, HuBERT, BEATs, and Whisper are utilized. Point cloud inputs are encoded using ULIP-2 with a PointBERT backbone. Some MM-LLMs leverage ImageBind, a unified encoder covering multiple modalities, including image, video, text, audio, and heat maps.

2. Input Projector: The Input Projector $\Theta_{X \to T}$ aligns the encoded features of other modalities $F_X$ with the text feature space $T$. This alignment is crucial for effectively integrating multimodal information into the LLM Backbone. The Input Projector can be implemented through various methods such as Linear Projectors, Multi-Layer Perceptrons (MLPs), Cross-attention, Q-Former, or P-Former, each with its unique approach to aligning features across modalities.

3. LLM Backbone: The LLM Backbone serves as the core agent in MM-LLMs, inheriting notable properties from LLMs such as zero-shot generalization, few-shot In-Context Learning (ICL), Chain-of-Thought (CoT), and instruction following. The backbone processes representations from various modalities, engaging in semantic understanding, reasoning, and decision-making regarding the inputs. Additionally, some MM-LLMs incorporate Parameter-Efficient Fine-Tuning (PEFT) methods like Prefix-tuning, Adapter, or LoRA to minimize the number of additional trainable parameters.

4. Output Projector: The Output Projector $\Theta_{T \to X}$ maps signal token representations $S_X$ from the LLM Backbone into features $H_X$ understandable to the Modality Generator $MG_X$. This projection facilitates the generation of multimodal content. The Output Projector is typically implemented using a Tiny Transformer or MLP, and its optimization focuses on minimizing the distance between the mapped features $H_X$ and the conditional text representations of $MG_X$.

5. Modality Generator: The Modality Generator $MG_X$ is responsible for producing outputs in distinct modalities such as images, videos, or audio. Commonly, existing works leverage off-the-shelf Latent Diffusion Models (LDMs) for image, video, and audio synthesis. During training, ground truth content is transformed into latent features, which are then de-noised to generate multimodal content using LDMs conditioned on the mapped features $H_X$ from the Output Projector.
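To make the data flow between these five components concrete, here is a minimal, hedged sketch in PyTorch for the image modality. It is an illustrative toy rather than any specific model from the paper: the choice of CLIP ViT as the Modality Encoder, GPT-2 as the LLM Backbone, the MLP and linear projectors, and the generator conditioning dimension are all assumptions made for brevity.

```python
# Illustrative sketch of the five MM-LLM components for the image modality.
# Model names, shapes, and the "signal token" handling are assumptions.
import torch
import torch.nn as nn
from transformers import (CLIPVisionModel, CLIPImageProcessor,
                          AutoModelForCausalLM, AutoTokenizer)

class MMLLMSketch(nn.Module):
    def __init__(self, vision_name="openai/clip-vit-base-patch32", llm_name="gpt2"):
        super().__init__()
        # 1. Modality Encoder (frozen): extracts visual features F_X from image input I_X
        self.encoder = CLIPVisionModel.from_pretrained(vision_name)
        self.encoder.requires_grad_(False)
        # 3. LLM Backbone (frozen here; PEFT such as LoRA could be applied instead)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        self.llm.requires_grad_(False)
        d_vis = self.encoder.config.hidden_size   # e.g. 768 for CLIP ViT-B/32
        d_llm = self.llm.config.hidden_size       # e.g. 768 for GPT-2
        # 2. Input Projector Θ_{X→T}: aligns F_X with the text feature space
        self.input_proj = nn.Sequential(
            nn.Linear(d_vis, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))
        # 4. Output Projector Θ_{T→X}: maps signal-token states S_X to features H_X
        d_gen = 1024                               # assumed conditioning dim of the generator
        self.output_proj = nn.Linear(d_llm, d_gen)

    def forward(self, pixel_values, input_ids):
        # Encode the image and project its patch features into the LLM's embedding space
        f_x = self.encoder(pixel_values=pixel_values).last_hidden_state  # (B, N_patches, d_vis)
        visual_tokens = self.input_proj(f_x)                             # (B, N_patches, d_llm)
        # Embed the text and prepend the projected visual tokens
        text_embeds = self.llm.get_input_embeddings()(input_ids)         # (B, T, d_llm)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        out = self.llm(inputs_embeds=inputs_embeds, output_hidden_states=True)
        # 4./5. Hidden states of "signal" tokens (here simply the last position,
        # as a stand-in) are projected to H_X, which would condition the
        # Modality Generator MG_X (e.g. a latent diffusion model).
        s_x = out.hidden_states[-1][:, -1:, :]
        h_x = self.output_proj(s_x)
        return out.logits, h_x

# Usage sketch (inputs commented out; image and prompt are placeholders):
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = MMLLMSketch()
# pixel_values = processor(images=some_pil_image, return_tensors="pt").pixel_values
# input_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids
# logits, h_x = model(pixel_values, input_ids)
```

In a real MM-LLM, the signal tokens would be dedicated special tokens rather than the last position, and $H_X$ would be aligned with, and used to condition, an off-the-shelf latent diffusion model as described for the Modality Generator.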

Training

MM-LLMs are trained in two main stages: MultiModal Pre-Training (MM PT) and MultiModal Instruction-Tuning (MM IT).

MM PT: During MM PT, MM-LLMs are trained to understand and generate content from different types of data like images, videos, and text. They learn to align these different kinds of information to work together. For example, they learn to associate a picture of a cat with the word "cat" and vice versa. This stage focuses on teaching the model to handle different types of input and output.
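As a rough illustration of what MM PT optimizes, the sketch below reuses the hypothetical MMLLMSketch class from earlier: the encoder and the LLM Backbone stay frozen and only the Input Projector is trained with a next-token prediction loss on image–caption pairs (e.g. a cat photo paired with a caption containing "cat"). The data pipeline and hyperparameters are assumptions for illustration.

```python
# One hedged MM PT training step: only the Input Projector receives updates.
import torch
import torch.nn.functional as F

model = MMLLMSketch()
optimizer = torch.optim.AdamW(model.input_proj.parameters(), lr=1e-4)

def mm_pt_step(pixel_values, caption_ids):
    logits, _ = model(pixel_values, caption_ids)
    # The first positions of the sequence are visual tokens; only the caption
    # positions are supervised with next-token prediction.
    n_vis = logits.shape[1] - caption_ids.shape[1]
    text_logits = logits[:, n_vis:-1, :]   # predictions for caption tokens
    targets = caption_ids[:, 1:]           # shifted caption as labels
    loss = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```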

MM IT: In MM IT, the model is fine-tuned based on specific instructions. This helps the model adapt to new tasks and perform better on them. There are two main methods used in MM IT: