AI and machine learning algorithms are quite skilled at generating works of art — and highly realistic images of apartments, people, and pets to boot. But relatively few have been tuned to singing synthesis, or the task of cloning musicians’ voices.
Researchers from Amazon and Cambridge took on the challenge in a recent paper, proposing an AI system that requires “considerably” less explicit modeling of features like vibratos and note durations than previous work. It taps a Google-designed algorithm, WaveNet, to synthesize audio from mel-spectrograms (representations of the power spectrum of sounds), which another model produces after training on a combination of speech and singing data.
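To make the idea of a mel-spectrogram a little more concrete, here is a minimal sketch of the mel scale that such spectrograms are built on. It assumes one common convention for the scale (the formula with constants 2595 and 700); the article does not say which variant the researchers used, and the function names are purely illustrative.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed convention: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(f_min: float, f_max: float, n_bands: int) -> list[float]:
    """Return n_bands center frequencies (Hz) spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]
```

Because the bands are evenly spaced in mels rather than in Hz, they crowd together at low frequencies and spread out at high ones, roughly mirroring how people perceive pitch.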
The system comprises three parts. The first is a frontend that takes a musical score as input and produces note embeddings (i.e., numerical representations of notes) to be sent to an encoder. The second is a model modified to accept those embeddings, whose decoder produces mel-spectrograms. The third and final component, the WaveNet vocoder, which mimics things like stress and intonation in speech, synthesizes the spectrograms into song.
The frontend performs linguistic analysis on the score lyrics, allowing for three possible levels of vowel stress and ignoring punctuation. It then determines which phonemes (perceptually distinct units of sound) correspond to each note of the score using syllabification information specified in the score itself. It also computes the expected duration in seconds of each note, as well as the tempo and time signature of the score, which it combines into embeddings.
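The duration step is simple arithmetic, and a rough sketch may help. The following assumes tempo is given in quarter-note beats per minute and that note lengths are counted in quarter-note beats. The `Note` class and the flat feature list are hypothetical stand-ins, not the paper's actual embedding format.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """Hypothetical score note: MIDI pitch, length in quarter-note beats, sung syllable."""
    pitch_midi: int
    beats: float
    syllable: str


def note_seconds(beats: float, tempo_bpm: float) -> float:
    """Expected duration in seconds, assuming tempo counts quarter notes per minute."""
    if tempo_bpm <= 0:
        raise ValueError("tempo_bpm must be positive")
    return beats * 60.0 / tempo_bpm


def note_features(note: Note, tempo_bpm: float, time_signature: tuple[int, int]) -> list[float]:
    """Bundle per-note and score-level values into one raw feature vector.

    This is an illustrative layout only; a real system would pass such
    values through learned embedding layers.
    """
    numerator, denominator = time_signature
    return [
        float(note.pitch_midi),
        note_seconds(note.beats, tempo_bpm),
        float(tempo_bpm),
        float(numerator),
        float(denominator),
    ]
```

For instance, a one-beat note at 120 beats per minute lasts `note_seconds(1.0, 120.0)`, or half a second.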
The researchers compiled a data set of 96 songs in English, sung a cappella by a single female voice for a total of two hours and 15 seconds of music. (An additional 40 hours of recordings were used to train the WaveNet model and the baseline systems.) The data set covered several genres, including pop, blues, rock, and some children’s songs, and the songs were split into segments 20-30 seconds in length, corresponding to about 200 phonemes each. This splitting reduced the amount of computation required to train the system, the researchers say, and made it easier to transform the samples (by shifting the pitch and changing the tempo) to augment the corpus.
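The article does not describe how the pitch and tempo transformations were implemented, so the following is only a sketch of the underlying arithmetic, applied to note-level data rather than audio. It assumes equal temperament (one semitone multiplies frequency by 2^(1/12)) and represents each note as a hypothetical `(frequency_hz, duration_s)` pair.

```python
def pitch_shift_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of the given number of semitones (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def augment_notes(
    notes: list[tuple[float, float]],
    semitones: float,
    tempo_factor: float,
) -> list[tuple[float, float]]:
    """Return a transposed, tempo-scaled copy of (frequency_hz, duration_s) notes.

    tempo_factor > 1 speeds the segment up, so each duration shrinks.
    """
    if tempo_factor <= 0:
        raise ValueError("tempo_factor must be positive")
    ratio = pitch_shift_ratio(semitones)
    return [(freq * ratio, dur / tempo_factor) for freq, dur in notes]
```

Each combination of shift and tempo factor yields a new variant of a segment, which is how a small corpus of roughly two hours can be stretched into more training examples.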
The research team recruited around 22 human listeners to evaluate the quality of synthesized songs, principally by listening to segments three to five seconds in length and rating their naturalness on a scale of 0 to 100. The results show that the proposed model achieved an average rating of 58.9 out of 100, with most segments in the lower quartile containing either a vocoder glitch or mumbled words.
The model sang in tune, though it performed best on simpler songs that didn’t include extremely high- or low-pitched notes. It also learned to reproduce a good vibrato and apply it in the right places, on longer sustained notes, according to the musical context. That said, the system tended to become stuck when it encountered silence in a score, and it occasionally produced out-of-rhythm notes that were too long or too short. Nevertheless, the coauthors of the paper believe it can be stabilized with future work.