MU-LLaMA

MusicQA Dataset Generation

We use MosaicML's MPT-7B-Chat Model to generate our MusicQA dataset.

Tags or Captions → MPT-7B Model → QA Pairs
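The flow above (tags or captions in, question-answer text out) can be sketched as a small driver loop. This is only an illustration under stated assumptions: `generate_raw_qa`, `simple_prompt` and `stub_generate` are hypothetical names, and the stub stands in for a real call to MPT-7B-Chat, which is not shown here.

```python
from typing import Callable, Dict, Iterable


def generate_raw_qa(
    descriptions: Iterable[str],
    build_prompt: Callable[[str], str],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Map each caption or tag string to the raw text returned by the model.

    `generate` is any function that takes a prompt and returns text; in the
    real pipeline it would wrap the language model (assumption).
    """
    raw_outputs = {}
    for description in descriptions:
        raw_outputs[description] = generate(build_prompt(description))
    return raw_outputs


def simple_prompt(description: str) -> str:
    # Hypothetical placeholder prompt, not the prompt used for MusicQA.
    return "Describe the music.\nMusic Caption: " + description


def stub_generate(prompt: str) -> str:
    # Stand-in for the language model so the sketch runs without any weights.
    return "A1. (stub answer for a prompt of %d characters)" % len(prompt)


if __name__ == "__main__":
    outputs = generate_raw_qa(["slow instrumental jazz"], simple_prompt, stub_generate)
    for description, text in outputs.items():
        print(description, "->", text)
```

Passing the model in as a plain callable keeps the loop independent of any particular inference library.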

Closed-ended Question Answer Pair Generation

Generate QA From: Caption or List of Tags

Instructions:

- You are given a sentence describing an audio
- You will give answers from the audio to these questions based on the sentence
  - Describe the music
  - Describe the music in detail
  - What do you hear in the audio
  - What can be inferred from the audio
- The answers should be numbered

Music Caption:

List of Tags:

Q1. Describe the music

A1. The music is a slow instrumental jazz piece containing violin and electronic sounds.

Q2. Describe the music in detail

A2. The music is a slow-paced instrumental piece that features a violin and electronic sounds. The violin plays a melancholic melody, while the electronic sounds provide a subtle background texture.

Q3. What do you hear in the audio?

A3. I hear a violin playing a slow and melancholic melody, accompanied by electronic sounds that provide a subtle background texture.

Q4. What can be inferred from the audio?

A4. From the audio, it can be inferred that the music is intended to be relaxing and introspective, with a focus on the emotional expression of the violin. The use of electronic sounds suggests a modern twist on traditional jazz music.
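As a rough sketch, the closed-ended prompt could be assembled from the instructions and the four fixed questions listed above. The function name `build_closed_ended_prompt`, the `from_tags` flag and the exact text layout are assumptions made for illustration; the page does not show how the prompt string is actually built, and the instruction wording used for tag input may differ.

```python
# The four fixed questions, copied from the instructions above.
CLOSED_ENDED_QUESTIONS = [
    "Describe the music",
    "Describe the music in detail",
    "What do you hear in the audio",
    "What can be inferred from the audio",
]


def build_closed_ended_prompt(description: str, from_tags: bool = False) -> str:
    """Return a prompt string for one caption or one list of tags.

    The layout (line order, indentation, labels) is an assumption.
    """
    label = "List of Tags" if from_tags else "Music Caption"
    lines = [
        "Instructions:",
        "- You are given a sentence describing an audio",
        "- You will give answers from the audio to these questions based on the sentence",
    ]
    # Nest the fixed questions under the second instruction.
    lines += ["  - " + question for question in CLOSED_ENDED_QUESTIONS]
    lines.append("- The answers should be numbered")
    lines.append("")
    lines.append(label + ": " + description)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_closed_ended_prompt("A slow instrumental jazz piece."))
```

Keeping the questions in one list means the same four questions are asked for every clip, which is what makes this set "closed-ended".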

Open-ended Question Answer Pair Generation

Generate QA From: Caption or List of Tags

Instructions:

- You are given a sentence describing an audio
- You will create 5 questions related to the audio based on the sentence along with answers
- The questions should be relating to things like tempo of the music, mood of the music, instruments used, inference, etc
- The question answers should be long form
- The question answers should be numbered

Music Caption:

List of Tags:

Q1. What is the tempo of the music?

A1. The music is a slow piece, which means it has a tempo that is played at a leisurely pace.

Q2. What instruments are used in this piece?

A2. The piece contains violin and electronic sounds. The violin adds a classical touch to the music, while the electronic sounds give it a modern twist.

Q3. What is the mood of the music?

A3. The slow tempo and instrumental nature of the piece creates a calm and relaxed mood. The combination of violin and electronic sounds also gives it an introspective and contemplative feel.

Q4. What can you infer about the genre of the music?

A4. The use of violin and electronic sounds in the piece suggests that it belongs to the genre of contemporary or experimental jazz.

Q5. How does the use of electronic sounds affect the overall sound of the piece?

A5. The use of electronic sounds adds a unique and distinct quality to the piece. It enhances the overall sound by providing a contrast to the classical sound of the violin. This contrast creates a dynamic and interesting interplay between the two sounds, making the piece stand out.
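Numbered output in the `Q1.` / `A1.` style shown above can be turned into question-answer pairs with a small parser. The following is a minimal sketch, assuming each question and each answer sits on a single line; `parse_numbered_qa` is a hypothetical helper, not part of any released code.

```python
import re
from typing import Dict, List, Tuple

# Matches lines such as "Q3. What is the mood of the music?" or "A3. The mood is calm."
_QA_LINE = re.compile(r"^\s*([QA])(\d+)\.\s*(.+?)\s*$")


def parse_numbered_qa(text: str) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs, ordered by their number.

    Lines that do not start with Qn. or An. are ignored, so an answer that
    wraps onto a second line keeps only its first line (a known limitation
    of this sketch). Questions without a matching answer are dropped.
    """
    questions: Dict[int, str] = {}
    answers: Dict[int, str] = {}
    for line in text.splitlines():
        match = _QA_LINE.match(line)
        if match is None:
            continue
        kind, number, body = match.group(1), int(match.group(2)), match.group(3)
        if kind == "Q":
            questions[number] = body
        else:
            answers[number] = body
    return [(questions[n], answers[n]) for n in sorted(questions) if n in answers]


if __name__ == "__main__":
    sample = "Q1. What is the tempo of the music?\n\nA1. The music is a slow piece."
    for question, answer in parse_numbered_qa(sample):
        print(question, "|", answer)
```

Pairing by number rather than by line order tolerates blank lines between questions and answers, as in the listing above.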

Caption Generation from Tags

Caption

Soulful jazz and ambient electronic violin with saxophone, slow and jazzy.

MusicQA Dataset Samples

Below are a few sample question-answer pairs from our MusicQA dataset,
generated from the captions and tags of existing music captioning and tagging datasets.

Music	Question	Answer
	What can be inferred from the audio?	The setting of the audio is likely outdoors, perhaps in a campground or backyard, where people may be gathered around a fire.
	Describe the music	The song has a traditional jazzy feel to it.
	What is the overall tone of the recording?	The overall tone of the recording is dark and ominous.
	Describe the music in detail	The music is upbeat and has a catchy melody. The vocalist has a strong and emotive voice that conveys a sense of passion and intensity.
	What genre of music is represented by the audio?	The audio represents techno music, which is characterized by the use of electronic instruments and synthesizers.
	What type of music is this?	This is a slow and quiet electronic solo flute piece.

Music Question Answering

We use our model for music question answering and reasoning.
We compare its performance against other audio-enabled question answering models,
namely the Listen, Think and Understand (LTU) model and the LLaMA Adapter model.
The comparison uses the MTG-Jamendo dataset, which has no overlap with our training data.

Music & Question	MU-LLaMA	LLaMA Adapter	LTU
Describe the audio in detail	The music is described as being fast and danceable. The music is a mix of electronic and synth sounds, with a steady beat and a variety of percussion.	The music in the image is a hip-hop beat, featuring a synthesized sound with a strong bass line. The sound is characterized by a mix of electronic and urban elements, making it suitable for a DJ set or a music video.	The music is a mix of different instruments and sounds, with a steady beat and a clear melody
Who or what is the target audience for this audio?	The target audience for this audio is likely hard rock and heavy metal fans who enjoy fast-paced, energetic music with a strong emphasis on drums and bass.	The target audience for this audio is teenagers.	The target audience for this audio is not specified in the given audio events.
What type of dance music is it?	It is a type of dance music that is commonly associated with electronic and techno genres.	The music is a jazz-infused tune, which is a type of music that combines elements of jazz and blues with contemporary styles.	It is not specified in the given audio clip.
What do you hear in the audio?	The audio features a male vocalist singing in a punk rock style. The audio is a heavy metal song with a fast tempo, intense guitar riffs, and heavy bass and drum beats. The vocals are aggressive and powerful, with a strong emphasis on the lower register.	In the audio, you can hear the sound of a motorcycle engine revving, and the words "surf" and "life" are also visible in the background.	Labels: Musical instrument; Music; Guitar
How does the soundtrack build tension and drama?	The soundtrack uses a combination of slow, haunting melodies and subtle changes in instrumentation to create a sense of tension and drama. The soundtrack builds tension and drama by using a variety of techniques, including the use of orchestral instruments, a slow tempo, and a sense of foreboding.	The soundtrack in the image is composed of a series of musical notes, which creates a sense of tension and drama. The repetition of the same note in a row adds to the intensity and builds anticipation. This type of composition is often used in film scores to enhance the overall experience and convey a particular mood or emotion.	The soundtrack builds tension and drama by using a combination of different musical elements, such as suspenseful melodies, dramatic crescendos, and sudden changes in tempo.

Music Captioning with MU-LLaMA

We use our MU-LLaMA model to caption music files in order to create large open-source Text-to-Music generation datasets.
Here we compare its captions with those generated by the Listen, Think and Understand (LTU) model and the LP-MusicCaps model on samples from the open-source FMA dataset.

MU-LLaMA	LP-MusicCaps	LTU
The music is a slow and steady beat with a bass and drums rhythm. The melody is played by a synth and the music is accompanied by a digital drum.	The low quality recording features a electro song that consists of a punchy "4 on the floor" kick pattern, short snare hit and offbeat shimmering hi hats. It sounds groovy, weird and like something you would hear in nightclubs.	The music is characterized by its dynamic and balanced sound, with a rich and full texture, and a clear and distinct pitch and timbre.
The music is a slow tempo with a steady rhythm. The instrument used is an acoustic guitar. The mood of the music is melancholic and nostalgic.	This is a guitar cover video. There are no vocals in this piece. The acoustic guitar is playing a mellow tune with the heavy use of arpeggios. The atmosphere is gentle and relaxing.	The music is being played on a guitar with a mellow sound.
The music is slow and melancholic with a sense of longing and nostalgia. The music is played on a piano.	The low quality recording features a reverberant piano melody playing. It sounds like a cover and it is energetic.	The music is rich and full-bodied, and it is a piece of music that is played.
The music is a blues-rock song with a groovy bass line, electric guitar playing rhythm, and a keyboard playing a bluesy tune. The tempo is medium and the mood of the music is groovy.	The low quality recording features a reggae song that consists of a flat male vocal singing over shimmering shakers, groovy bass, wide electric guitar melody, punchy kick and snare hits, shimmering hi hats and simple bass. It sounds happy, fun and joyful - like something kids would listen to.	The music is described as rich and full of resonant frequencies, with a male singing in a deep voice.
The music is a fast-paced rock song with a strong beat and guitar riffs. The vocals are aggressive and energetic, with a strong emphasis on the lyrics. The drums are powerful and driving, providing a solid foundation for the rest of the music.	This song contains an acoustic drum playing a simple rock groove along with an e-bass. A e-guitar is playing a short melody panned to the right side of the speakers. Then a male voice starts singing in a higher key. This song may be playing live at a concert.	The music is described as rich and full of resonant frequencies, which is a common characteristic of rock music.

MU-LLaMA

Contributions

MU-LLaMA Model

MusicQA Dataset

T2M-Gen Dataset Generation


MusicQA Dataset Statistics

MusicQA Dataset Samples

Music Understanding LLM (MU-LLaMA)

Music Question Answering

Music Captioning with MU-LLaMA