I see the mixture model is ~300 GB and was trained on 256 GPUs.
I assume distilled versions can easily be run on one GPU.
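For context, if the distilled checkpoints end up on the Hugging Face Hub, running one on a single GPU would look roughly like the sketch below. The model ID and language codes are my guesses for illustration, not something confirmed by the release; at ~600M parameters in fp16 the weights would only be on the order of 1 GB, which fits comfortably on a single consumer GPU.

```python
# Rough sketch: running a distilled translation checkpoint on one GPU.
# The model ID and language codes below are assumptions, not confirmed names
# from the release -- check the actual model card before relying on them.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"  # assumed distilled checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 roughly halves the memory footprint
).to("cuda")

inputs = tokenizer("The weather is nice today.", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    # Force the decoder to start generating in the target language.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```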
So, to clarify, does this mean that companies cannot use these models in the course of business, or is the restriction more about selling the translation output directly?