
> the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone.

I have good and bad news for you - they did not! We were the first podcast to interview the audio engineer who led the audio model:

https://www.latent.space/p/notebooklm

TL;DR: they did confirm that the transcript and the audio are generated separately, but yes, the TTS model is trained far beyond anything we have in OSS or commercially available.
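To make "generated separately" concrete, here is a minimal sketch of that kind of two-stage pipeline: a text pass drafts a two-speaker script, then a separate TTS pass voices each turn. Everything here (function names like `draft_script` and `synthesize_turn`, the prompts, the data shapes) is a hypothetical stand-in for illustration, not NotebookLM's actual API or internals.

```python
# Hypothetical two-stage "podcast" pipeline: script first, audio second.
# The stage boundary mirrors the separation described above; the function
# names and stub outputs are invented for illustration only.
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # e.g. "HOST_A" or "HOST_B"
    text: str      # one conversational turn of the script


def draft_script(source_doc: str) -> list[Turn]:
    """Stage 1 (stand-in): a text model turns the source into a two-speaker script."""
    # A real system would prompt an LLM here; we return a fixed stub.
    return [
        Turn("HOST_A", f"Today we're digging into: {source_doc[:40]}..."),
        Turn("HOST_B", "Right, and the interesting part is how it was built."),
    ]


def synthesize_turn(turn: Turn) -> bytes:
    """Stage 2 (stand-in): a TTS model voices one turn in that speaker's voice."""
    # A real system would call a speech model; we return placeholder bytes.
    return f"[{turn.speaker} audio] {turn.text}".encode()


def make_podcast(source_doc: str) -> bytes:
    script = draft_script(source_doc)               # text only, no audio yet
    clips = [synthesize_turn(t) for t in script]    # each turn voiced separately
    return b"\n".join(clips)


if __name__ == "__main__":
    print(make_podcast("An uploaded PDF about neural audio codecs").decode())
```

The point of the sketch is only the handoff: the script exists as plain text before any audio is produced, rather than a single model emitting speech end to end.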




They didn't confirm or deny this in the episode - all I can say is there are about 1-2 years of additional research that went into NotebookLM's TTS. SoundStorm is more of an efficiency paper, IMO.


Really good catch. Ty.


Thank you swyx. How did I miss this episode?


did you LIKE and SUBSCRIBE?? :)



