Nvidia’s AI Model Breaks Sound Barriers: Synthesizing The Impossible

Updated on November 28, 2024

By now, anyone following AI research is familiar with generative models that can create speech or music from nothing more than a text prompt. Nvidia’s newly unveiled ‘Fugatto’ model aims to go a step further, using new synthetic training methods and inference-level combination techniques to transform any mix of music, sounds and voices, including synthesizing sounds that have never been heard before.

Although Fugatto is not yet available for public use, Nvidia has published a website filled with samples that demonstrate the model’s ability to adjust a range of audio traits and descriptions. The results range from a howling saxophone to voices speaking underwater and ambulance sirens harmonizing like a choir.

Although the results on display vary in quality, the sheer range of capabilities showcased supports Nvidia’s characterization of Fugatto as a versatile tool for sound.

A model is only as good as the data it is trained on. In a research paper, more than a dozen Nvidia researchers detail the challenge of building a training dataset that can reveal meaningful relationships between language and audio.

While conventional language models can often infer how to handle different instructions from text-based data alone, generalizing audio traits and descriptions remains difficult without specific and clear instructions.

For comparative analysis, the researchers use datasets in which one factor is held constant while the other varies, for example different emotional readings of the same text or the same notes played on different instruments.
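The idea of holding one factor constant while varying another can be illustrated with a minimal Python sketch. The sample records, field layout and the `contrast_pairs` helper below are hypothetical and invented for illustration; they are not taken from Nvidia's paper or code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical annotated clips: (clip_id, spoken_text, emotion).
samples = [
    ("a1", "hello there", "happy"),
    ("a2", "hello there", "sad"),
    ("a3", "hello there", "angry"),
    ("b1", "good night", "happy"),
]

def contrast_pairs(samples):
    """Pair clips that share the same text but differ in emotion."""
    by_text = defaultdict(list)
    for clip_id, text, emotion in samples:
        by_text[text].append((clip_id, emotion))

    pairs = []
    for text, clips in by_text.items():
        # Every pair within a group keeps the text constant.
        for (id_a, emo_a), (id_b, emo_b) in combinations(clips, 2):
            if emo_a != emo_b:
                pairs.append((text, id_a, emo_a, id_b, emo_b))
    return pairs

pairs = contrast_pairs(samples)
for pair in pairs:
    print(pair)
```

Each resulting pair differs only in the varied attribute, which is the kind of contrast the comparison datasets are described as providing. The same grouping would apply to the instrument example by swapping text for notes and emotion for instrument.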

After processing several open-source audio collections in this way, the researchers amassed a dataset of 20 million separate samples that together represent at least 50,000 hours of audio. From this, a model with 2.5 billion parameters was trained using 32 Nvidia tensor cores, and it began to show consistent performance in audio quality tests.
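For a sense of scale, the reported figures imply a short average clip length. The back-of-the-envelope calculation below assumes the 50,000 hours is the total across the whole collection and that it is spread over all 20 million samples; the variable names are illustrative only.

```python
# Figures reported in the article; 50,000 hours is stated as a lower bound.
TOTAL_HOURS = 50_000
TOTAL_SAMPLES = 20_000_000

# Convert hours to seconds, then divide by the number of samples.
avg_seconds = TOTAL_HOURS * 3600 / TOTAL_SAMPLES
print(f"Average sample length: at least {avg_seconds:.0f} seconds")
```

Under those assumptions, the average sample works out to roughly nine seconds of audio or more.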

Vikhyaat Vivek
