We introduce the MuMu-LLaMA framework, which handles both music understanding and multi-modal music generation in a single model, with the aim of assisting users in music-related artistic creation.
We propose a systematic approach for generating large multi-modal, music-oriented instruction datasets for training our MuMu-LLaMA model, including the largest publicly available music captioning dataset, which contains more than 1.2k hours of music and was generated using the MU-LLaMA model.
We use the MU-LLaMA model along with MosaicML's MPT-7B-Chat model to generate our music-oriented datasets. We also use the BLIP image captioning model and the VideoMAE captioning model to caption the respective modalities.
Music File -> MU-LLaMA Model -> Music Captions
Image File -> BLIP Image Captioning Model -> Image Captions
Video File -> VideoMAE Captioning Model -> Video Captions
Image Captions + Music Captions -> MPT-7B Model -> MUImage Instructions
Video Captions + Music Captions -> MPT-7B Model -> MUVideo Instructions
Music Captions + Music Captions -> MPT-7B Model -> MUEdit Instructions
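The caption-then-instruct flow listed above can be sketched in a few lines of Python. This is only an illustrative outline, not the project's actual code: the captioners and the chat model are passed in as plain callables standing in for MU-LLaMA, BLIP, VideoMAE and MPT-7B, and the names `make_instruction`, `DATASET_SOURCES`, `stub_captioners`, `stub_chat`, as well as the prompt wording, are all assumptions made for the example.

```python
from typing import Callable, Dict

Captioner = Callable[[str], str]  # hypothetical interface: file path -> caption
ChatModel = Callable[[str], str]  # hypothetical interface: prompt -> generated text

# Which kind of caption is paired with a music caption for each dataset.
DATASET_SOURCES: Dict[str, str] = {
    "MUImage": "image",
    "MUVideo": "video",
    "MUEdit": "music",
}


def make_instruction(
    dataset: str,
    source_path: str,
    music_path: str,
    captioners: Dict[str, Captioner],
    chat: ChatModel,
) -> Dict[str, str]:
    """Caption both inputs, then ask the chat model for an instruction."""
    source_kind = DATASET_SOURCES[dataset]
    source_caption = captioners[source_kind](source_path)
    music_caption = captioners["music"](music_path)
    # Placeholder prompt; the wording actually used is not given in the text.
    prompt = (
        f"{source_kind.capitalize()} caption: {source_caption}\n"
        f"Music caption: {music_caption}\n"
        "Write an instruction and a response that connect the two."
    )
    return {
        "dataset": dataset,
        "source_caption": source_caption,
        "music_caption": music_caption,
        "instruction": chat(prompt),
    }


# Stand-ins so the sketch runs without downloading any model.
stub_captioners: Dict[str, Captioner] = {
    kind: (lambda path, kind=kind: f"{kind} caption for {path}")
    for kind in ("music", "image", "video")
}


def stub_chat(prompt: str) -> str:
    return f"[instruction from prompt of {len(prompt)} characters]"


if __name__ == "__main__":
    print(make_instruction("MUImage", "cat.jpg", "song.wav", stub_captioners, stub_chat))
```

Replacing the stubs with real model wrappers would leave `make_instruction` unchanged, which is the main reason for injecting the models as callables in this sketch.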
The MuMu-LLaMA model is a music understanding and generation model capable of music question answering, music generation from text, images, videos and audio, and music editing. It uses MERT as the encoder for music understanding, ViT for image understanding and ViViT for video understanding, with MusicGen or AudioLDM2 as the music generation model (music decoder). These components are coupled with adapters and the LLaMA 2 model.
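To make the data flow concrete, the following is a rough Python sketch of how such components might be wired together: a per-modality encoder, an adapter, a language model, and an optional music decoder. It is a simplified illustration under stated assumptions, not the released implementation. The class `MusicLLMSketch`, its field names, and the `respond` method are hypothetical, and every component is a plain callable standing in for MERT, ViT, ViViT, LLaMA 2 and MusicGen/AudioLDM2. In particular, how the decoder is conditioned on the language model is not specified in the text; the sketch simply passes the language model's output to it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


@dataclass
class MusicLLMSketch:
    # Hypothetical stand-ins: "music" -> MERT, "image" -> ViT, "video" -> ViViT.
    encoders: Dict[str, Callable[[object], Vector]]
    # One adapter per modality, mapping encoder features into the LLM's space.
    adapters: Dict[str, Callable[[Vector], Vector]]
    # Stand-in for LLaMA 2: (prompt, optional adapted features) -> output.
    language_model: Callable[[str, Optional[Vector]], str]
    # Stand-in for MusicGen / AudioLDM2: LLM output -> generated audio.
    music_decoder: Callable[[str], object]

    def respond(
        self,
        prompt: str,
        modality: Optional[str] = None,
        media: object = None,
        generate_music: bool = False,
    ) -> Tuple[str, Optional[object]]:
        """Encode the optional media, run the LLM, optionally decode music."""
        features: Optional[Vector] = None
        if modality is not None:
            encoded = self.encoders[modality](media)
            features = self.adapters[modality](encoded)
        text = self.language_model(prompt, features)
        audio = self.music_decoder(text) if generate_music else None
        return text, audio


if __name__ == "__main__":
    # Toy components so the sketch runs without any model weights.
    model = MusicLLMSketch(
        encoders={"image": lambda media: [1.0, 2.0]},
        adapters={"image": lambda vec: [2 * x for x in vec]},
        language_model=lambda prompt, feats: f"{prompt} (features: {feats})",
        music_decoder=lambda text: f"audio conditioned on: {text}",
    )
    print(model.respond("Compose music for this picture", "image", "beach.jpg", True))
```

Under these assumptions, understanding-only requests (music question answering) skip the decoder, while generation and editing requests set `generate_music=True`.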
This is the demo video of our model's capabilities for Music Generation, Understanding and Editing.
This is the interactive demo of our model's capability for Text-to-Music Generation and Music Understanding.
This is the interactive demo of our model's capability for Text-to-Music Generation, Music Understanding and Music Editing.
This is the interactive demo of our model's capability for Image-to-Music Generation and Music Understanding.
This is the interactive demo of our model's capability for Image-to-Music Generation, Music Understanding and Music Editing.
This is the interactive demo of our model's capability for Video-to-Music Generation and Music Understanding.
This is the interactive demo of our model's capability for Video-to-Music Generation, Music Understanding and Music Editing.
This is the interactive demo of our model's capability for Music Editing and Music Understanding.