I think this is unsurprising; the point of embeddings is to encode the information from the text. It's not encryption or hashing; it's a compressed representation.
Even if you couldn't recover the original words, I would expect to be able to recover equivalent words with the same meaning.
The security aspect is important to keep in mind, but a more interesting use case for me is tweaking the embedding values and finding the token inputs that correspond to them. That would let me explore the latent space.
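Roughly what I have in mind, as a sketch. The `embed` and `invert` callables are hypothetical stand-ins for an encoder and an inversion model (the paper's released code could fill those roles); the interpolation logic is the point here:

```python
import numpy as np

def explore_between(text_a: str, text_b: str, embed, invert, steps: int = 5):
    """Walk the latent space between two texts and decode each waypoint."""
    va, vb = embed(text_a), embed(text_b)
    for t in np.linspace(0.0, 1.0, steps):
        v = (1 - t) * va + t * vb          # linear interpolation in embedding space
        v = v / np.linalg.norm(v)          # re-normalize, since most embedding models output unit vectors
        print(f"t={t:.2f}: {invert(v)}")   # see what text the interpolated vector maps back to
```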
If you can recover 90% of the text exactly, does that mean the "semantic overlap" of the vectors isn't as good as it could be? E.g., semantically identical but textually different words cause larger shifts in the vector than would be desirable for certain use cases?
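One quick way to poke at that question is to compare how far a paraphrase moves the vector versus an unrelated sentence; a sketch using sentence-transformers (model choice is arbitrary, not the one from the paper):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
original   = "The physician reviewed the patient's chart."
paraphrase = "The doctor looked over the patient's records."
unrelated  = "The stock market closed higher today."

# Encode all three and compare cosine similarities.
e = model.encode([original, paraphrase, unrelated], normalize_embeddings=True)
print("original vs paraphrase:", util.cos_sim(e[0], e[1]).item())
print("original vs unrelated: ", util.cos_sim(e[0], e[2]).item())
```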
'when we build models to do a really good job of representing 32 words/tokens as vectors, you can very easily backsolve for 28/32 words just using the vectors. These results are not robust above 32 words.'
For the experiment in section 5.3, where they try to recover private information from embeddings of clinical notes, it's interesting that the model also has to spend capacity reconstructing non-private information. I wonder if you could do better at recovering names by first learning a custom distance metric that assigns a low distance to embeddings of texts that share a name, regardless of other content, and then using the inversion method to minimize that distance instead.
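Something like this rough sketch of the custom metric, i.e. a small projection head trained with a contrastive loss so that embeddings of notes sharing a name land close together (all names, dimensions, and the margin value here are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NameMetric(nn.Module):
    def __init__(self, dim: int = 768, proj_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(dim, proj_dim)  # learned projection of the frozen text embeddings

    def distance(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        za = F.normalize(self.proj(a), dim=-1)
        zb = F.normalize(self.proj(b), dim=-1)
        return 1.0 - (za * zb).sum(-1)        # cosine distance in the projected space

def contrastive_loss(metric, emb_a, emb_b, same_name, margin: float = 0.5):
    """same_name is 1.0 if the two source texts share a name, else 0.0."""
    d = metric.distance(emb_a, emb_b)
    return (same_name * d**2 + (1.0 - same_name) * F.relu(margin - d)**2).mean()
```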
It seems unlikely that this could be accomplished. The fundamental point is to capture a representation of the meaning of the original text. If that meaning is not accessible, the embeddings are not useful.
In the linked paper they demonstrate that adding Gaussian noise can obfuscate the original text. Retrieval performance drops by about 2% while reconstruction performance drops by about 87%, so that seems like a favorable tradeoff.
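A minimal sketch of what that defense looks like in practice: add isotropic Gaussian noise to each embedding before storing it (the noise scale here is a placeholder, not the paper's exact setting):

```python
import numpy as np

def noisy_embedding(v: np.ndarray, noise_scale: float = 0.01) -> np.ndarray:
    """Perturb an embedding with Gaussian noise before storing/serving it."""
    noise = np.random.normal(0.0, noise_scale, size=v.shape)
    out = v + noise
    return out / np.linalg.norm(out)   # keep it on the unit sphere for cosine-similarity retrieval
```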