> the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone.
I have good and bad news for you - they did not! We were the first podcast to interview the audio engineer who led the audio model.
TL;DR: they did confirm that the transcript and the audio are generated separately, but yes, the TTS model is trained far beyond anything available in open source or commercially.
They didn't confirm or deny this in the episode. All I can say is that there are about one to two years of additional research behind NotebookLM's TTS; SoundStorm is more of an efficiency paper IMO.
https://www.latent.space/p/notebooklm