The original reCAPTCHA served the dual purpose of fighting spam and training optical character recognition algorithms. It displayed a pair of words, one of which OCR had already resolved unambiguously and the other of which OCR couldn't easily read. The first word was used to distinguish humans from bots, and the second word was used to train the OCR.
Today, CAPTCHAs serve a similar purpose, except they're used to train self-driving cars' image recognition AIs. I always try to be a little subversive: I correctly identify the images the AI has clearly already classified unambiguously, then purposefully screw up identifying the image the AI struggles with. It lets me through the majority of the time, which suggests that my bad input made it into their training data.
Unlike in the days of the CAPTCHAs of yore, when machine vision simply wasn't advanced enough to solve them, today anyone has access to pre-trained vision models easily capable of identifying the unambiguously resolved buses or crosswalks in a CAPTCHA image. The deterrent to spammers is no longer that actual humans need to solve the CAPTCHA, but that it's too computationally expensive to solve them at scale. Today's CAPTCHAs are basically Hashcash proof-of-work [0], with the added benefit to Google et al. (and annoyance to users) that they help train computer vision models.
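For anyone unfamiliar with the Hashcash reference: the core idea is that minting a "stamp" requires brute-force search, while checking it costs a single hash, which is exactly the cost asymmetry that makes solving at scale expensive. A minimal sketch (this is an illustration of the scheme, not reCAPTCHA's actual implementation; the challenge string and difficulty are arbitrary):

```python
import hashlib
from itertools import count

def mint(challenge: str, bits: int = 20) -> int:
    """Brute-force a nonce so that sha256(challenge:nonce) has `bits` leading zero bits.
    Expected work: ~2**bits hash evaluations."""
    target = 1 << (256 - bits)
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, bits: int = 20) -> bool:
    """Checking a stamp costs one hash, no matter how hard minting was."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```

Each additional difficulty bit roughly doubles the minting cost, so a site can tune how expensive bulk abuse is while keeping the cost of a single legitimate request negligible.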
>> It displayed a pair of words, one of which was unambiguously resolved by an OCR and the other of which OCRs couldn’t easily read. The first word was used to disambiguate humans from bots, and the second word was used to train the OCR.
Was I the only person who always inserted some nonsense word as my answer for the clearly scanned word? It was very obvious which word was generated and would be checked, and which one was scanned and wouldn't be checked, just accepted as-is. I always typed in something other than the scanned word. I think it was just me being contrarian about being used by a corporation to do their word recognition for free.
4chan famously promoted the use of the n-word for that purpose. But you're going to get enough wrong answers even with compliant users; I'm 100% positive any dev with at least two brain cells and a passing grade in statistics would be able to filter all of these out of the dataset.
But yeah, I also cross the street on the red light.
I mean, they probably do filter: if 9/10 people said the word is "apple" and I said it's "banana", it would accept apple. But at the same time, I can't imagine I was the only one doing this.
Maybe the system has determined that you are human and you are intentionally attempting to mess with them. Since the primary goal of CAPTCHA (confirm that you are human) has been fulfilled and you appear to not be a good source for the secondary goal (crowdsource training data), the system decided to not waste any more time with you.
That's ok, the chances of several intentional saboteurs on a single image sample are presumably pretty low.
i.e. even if the saboteur rate were as high as 10%, and each image were only shown three times, only 10% × 10% × 10% = 0.1% of the data would have all three people intentionally picking the wrong answer (and to beat a consensus check, they'd all have to pick the *same* wrong answer). I suspect the rate is much lower, and 99%+ of people just want to pick the right answer so the CAPTCHA goes away as quickly as possible.
Images with fewer than 3/3 matching results in this example would presumably be retested until you reached the desired confidence level.
Then, assuming your ML model isn't overfit, you could even use it to assess your original input data and detect/flag anomalies for manual review.
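The arithmetic above is easy to sanity-check with a quick simulation (the 10% saboteur rate and three votes per image are the hypothetical numbers from the comment, not anything reCAPTCHA has published):

```python
import random

def corrupted_fraction(n_images=100_000, saboteur_rate=0.10,
                       votes_per_image=3, seed=0):
    """Estimate the fraction of images whose every vote came from a saboteur,
    i.e. images with zero honest labels among their votes."""
    rng = random.Random(seed)
    corrupted = 0
    for _ in range(n_images):
        # An image is fully corrupted only if all of its voters are saboteurs.
        if all(rng.random() < saboteur_rate for _ in range(votes_per_image)):
            corrupted += 1
    return corrupted / n_images
```

Running this lands near the 0.1% figure (0.10³ = 0.001), and that's before requiring the saboteurs to agree on the same wrong label or re-testing images with mismatched votes, both of which push the effective error rate lower still.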
I figured this out a while ago and do it as a challenge: how can I incorrectly pass the test? There are probably more of us doing this than you estimate. We could even be in the majority (unlikely, I know).
The only reason we still solve those stupid image recognition puzzles is because Google/Waymo and other self-driving car companies have managed to trick us into helping them do their training work for them.
Is there a link beyond that blog post explaining how it works as a proof-of-work, and how you keep people with thick wallets who are willing to pay for compute from exploiting it?
I'm still waiting for the check to arrive after all these years of training, especially since moving from the West to the East, now that I get 5× as many prompts.
How about the text ones Google uses, where letters in the same word are alternately compressed and stretched to the degree that many of them are completely illegible to human eyes? Some of them look like text wrapped around an invisible sphere, almost like a Mercator projection. What meaningful work domain are those CAPTCHAs targeting?
The cynic in me says because I resent being forced to help multi-billion dollar companies crowdsource their AI training.
The techno-optimist in me says because I want to force them to improve their underlying models. When their engineers notice that their model struggles with the weird edge cases I purposefully mislabel (e.g. when prompted to select images containing motorcycles, I also pick a mountain bike with fat, motorcycle-sized tires), perhaps they will contemplate how to rigorously encode the concepts of "motorcycle" and "mountain bike" into their model, rather than simply pushing an abundance of crowdsourced training data through a black-box classifier and hoping it eventually arrives at the right answer.
Not if you believe that the people working on this are going too fast and/or have a misguided goal.
I think it's reasonable to believe that real self-driving cars are not inevitable, or that, even if they are, deliberate disruption of this process is healthy; i.e. the effort shouldn't rely on something this dumb.
Don't you think that if this data were known to be widely and mostly beneficial, reCAPTCHA would be falling all over themselves to grab the good PR? The fact that regular folks hear virtually nothing about this strongly indicates that it's like most data collection: if people knew the real deal, they probably wouldn't happily sign on, and it would likely bring more questions than they want to deal with.
Not really; non-tech people don't know or care about reCAPTCHA. I still think it's evil for reCAPTCHA to be so prolific and used for data collection, but it's a positive side effect that it's also used for less evil things, like labeling data sets.
Right. But if it were mostly good, whoever runs reCAPTCHA could raise/make boatloads of money with "you're not just practicing safe computing, you're helping save children's lives" type ads/fundraising.
Then they can pay for their own Mechanical Turk labour, thank you very much. I will not be sponsoring a corporation out of the goodness of my heart, on my own time.
If I ever learn that they release that dataset to the public, my position on this may change.
If this is something important that we should rely on, any company involved should be spending the readily available resources to do it correctly, not hoping that random people trying to log in to their email pick the correct labels.
> a pair of words, one of which was unambiguously resolved by an OCR and the other of which OCRs couldn’t easily read
4chan had a lot of fun with this when it was first implemented. Perhaps unsurprisingly, there very quickly developed a campaign to have everyone insert "n**r" in place of the unknown word. Many threads were dedicated to education, onboarding, and, of course, sharing 'trophies' when such a replacement was found to have taken effect in one of Google's products (Books, iirc?).
[0] https://en.m.wikipedia.org/wiki/Hashcash