I'm hoping torch.compile is a gateway to "easy" non-Nvidia accelerator support in PyTorch.
Also, I have been using torch.compile for the Stable Diffusion unet/vae since February, to good effect. I'm guessing similar optimizations will pop up for LLaMA.
I also compile the VAE and some other modules; I'll reply again later when I can look at my local code. Some modules (like face restoration or the scheduler) still don't like torch.compile.
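For reference, the compile calls look roughly like this (a minimal sketch assuming a diffusers-style pipeline; the model ID and exact set of compiled modules are placeholders, not my actual local code):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compile only the heavy modules; the scheduler and face-restoration paths
# tend to cause graph breaks or constant recompiles, so they stay eager.
pipe.unet = torch.compile(pipe.unet)
pipe.vae.decode = torch.compile(pipe.vae.decode)
```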
I tried changing the options in the config dict one by one, but TBH nothing made a significant difference over the default settings in my benchmarks.
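For context, the kind of thing I was varying looks like this (just examples; as far as I know the `mode` presets and the inductor `options` dict are mutually exclusive arguments to torch.compile, so I swapped in one at a time, and the exact option keys depend on what your PyTorch version's inductor config exposes):

```python
import torch

# A preset mode instead of the default:
unet_autotune = torch.compile(pipe.unet, mode="max-autotune")

# Or individual inductor knobs via the options dict (mode OR options, not both):
unet_opts = torch.compile(
    pipe.unet,
    options={"max_autotune": True, "triton.cudagraphs": False},
)
```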
I haven't messed with compiling LoRA training yet, as I don't train much and it's already fast enough for me, but I'm sure it could be done.
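If I did try it, I'd probably start by compiling the forward + loss of the training step, something like this (untested sketch; the unet call follows the diffusers UNet2DConditionModel signature, everything else here is hypothetical):

```python
import torch
import torch.nn.functional as F

@torch.compile
def train_step(unet, noisy_latents, timesteps, encoder_hidden_states, target):
    # Forward through the (LoRA-augmented) unet and compute the denoising loss.
    pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    return F.mse_loss(pred.float(), target.float())
```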