The original reCAPTCHA served the dual purpose of fighting spam and training optical character recognition algorithms. It displayed a pair of words, one of which OCR had already resolved unambiguously and the other of which OCR couldn't easily read. The first word was used to distinguish humans from bots, and the second word was used to train the OCR.
Today, CAPTCHAs serve a similar purpose, except they're used to train self-driving cars' image recognition AIs. I always try to be a little subversive: I correctly identify the images the AI has clearly already classified unambiguously, then purposefully screw up identifying the image the AI struggles with. It lets me through the majority of the time, which suggests that my bad input made it into their training data.
Unlike in the days of the CAPTCHAs of yore, when machine vision simply wasn't advanced enough to solve them, today anyone has access to pre-trained vision models easily capable of identifying the unambiguously resolved buses or crosswalks in a CAPTCHA image. The deterrent to spammers is no longer that actual humans need to solve the CAPTCHA, but that it's too computationally expensive to solve them at scale. Today's CAPTCHAs are basically Hashcash proof-of-work [0], with the added benefit to Google et al. (and annoyance to users) that they help train computer vision models.
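For anyone unfamiliar with the Hashcash reference: the core idea is that minting a "stamp" requires brute-force search, while checking it costs a single hash, which is exactly the cost asymmetry that makes solving at scale expensive. A minimal sketch (this is an illustration of the scheme, not reCAPTCHA's actual implementation; the challenge string and difficulty are arbitrary):

```python
import hashlib
from itertools import count

def mint(challenge: str, bits: int = 20) -> int:
    """Brute-force a nonce so that sha256(challenge:nonce) has `bits` leading zero bits.
    Expected work: ~2**bits hash evaluations."""
    target = 1 << (256 - bits)
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, bits: int = 20) -> bool:
    """Checking a stamp costs one hash, no matter how hard minting was."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```

Each additional difficulty bit roughly doubles the minting cost, so a site can tune how expensive bulk abuse is while keeping the cost of a single legitimate request negligible.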
>> It displayed a pair of words, one of which was unambiguously resolved by an OCR and the other of which OCRs couldn’t easily read. The first word was used to disambiguate humans from bots, and the second word was used to train the OCR.
Was I the only person who always inserted some nonsense word as my answer for the clearly scanned word? It was very obvious which word was generated and would be checked, and which one was scanned and wouldn't be checked, just accepted as-is. I always typed in something other than the scanned word. I think it was just me being contrarian about being used by a corporation to do their word recognition for free.
4chan famously promoted the use of the n-word for that purpose. But you're going to get enough wrong answers even with compliant users; I'm 100% positive any dev with at least two brain cells and a passing grade in statistics would be able to filter all of these out of the dataset.
But yeah, I also cross the street on the red light.
I mean, they probably do filter: if 9/10 people said the word is "apple" and I said it's "banana", it would accept apple. But at the same time, I can't imagine I was the only one doing this.
Maybe the system has determined that you are human and you are intentionally attempting to mess with them. Since the primary goal of CAPTCHA (confirm that you are human) has been fulfilled and you appear to not be a good source for the secondary goal (crowdsource training data), the system decided to not waste any more time with you.
That's ok, the chances of several intentional saboteurs on a single image sample are presumably pretty low.
i.e. even if the saboteur rate were as high as 10%, and each image were only shown three times, only 10% × 10% × 10% = 0.1% of the data would have all three people intentionally picking the wrong answer (and to beat a consensus check, they'd all have to pick the *same* wrong answer). I suspect the rate is much lower, and 99%+ of people just want to pick the right answer so the CAPTCHA goes away as quickly as possible.
Images with fewer than 3/3 matching results in this example would presumably be retested until you reached the desired confidence level.
Then, assuming your ML model isn't overfit, you could even use it to assess your original input data and detect/flag anomalies for manual review.
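The arithmetic above is easy to sanity-check with a quick simulation (the 10% saboteur rate and three votes per image are the hypothetical numbers from the comment, not anything reCAPTCHA has published):

```python
import random

def corrupted_fraction(n_images=100_000, saboteur_rate=0.10,
                       votes_per_image=3, seed=0):
    """Estimate the fraction of images whose every vote came from a saboteur,
    i.e. images with zero honest labels among their votes."""
    rng = random.Random(seed)
    corrupted = 0
    for _ in range(n_images):
        # An image is fully corrupted only if all of its voters are saboteurs.
        if all(rng.random() < saboteur_rate for _ in range(votes_per_image)):
            corrupted += 1
    return corrupted / n_images
```

Running this lands near the 0.1% figure (0.10³ = 0.001), and that's before requiring the saboteurs to agree on the same wrong label or re-testing images with mismatched votes, both of which push the effective error rate lower still.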
I figured this out a while ago and do it as a challenge: how can I incorrectly pass the test? There are probably more of us doing this than you estimate. We could even be in the majority (unlikely, I know).
The only reason we still solve those stupid image recognition puzzles is because Google/Waymo and other self-driving car companies have managed to trick us into helping them do their training work for them.
Is there a link beyond that blog post explaining how it works as a proof-of-work, and how you keep people with thick wallets who are willing to pay for compute from exploiting it?
I'm still waiting for the check to arrive after all these years of training, especially since moving from the West to the East, now that I get 5× as many prompts.
How about the text ones Google uses, where letters in the same word are alternately compressed and stretched to the degree that many of them are completely illegible to human eyes? Some of them look like text wrapped around an invisible sphere, almost like a Mercator projection. What meaningful work domain are those CAPTCHAs targeting?
The cynic in me says because I resent being forced to help multi-billion dollar companies crowdsource their AI training.
The techno-optimist in me says because I want to force them to improve their underlying models. When their engineers notice that their model struggles with the weird edge cases I purposefully mislabel (e.g. when prompted to select images containing motorcycles, I also pick a mountain bike with fat, motorcycle-sized tires), perhaps they will contemplate how to rigorously encode the concepts of "motorcycle" and "mountain bike" into their model, rather than simply pushing an abundance of crowdsourced training data through a black-box classifier and hoping it eventually arrives at the right answer.
Not if you believe that the people working on this are going too fast and/or have a misguided goal.
I think it's reasonable to believe that real self-driving cars are not inevitable, or that, even if they are, deliberate disruption of this process is healthy; i.e. the effort shouldn't rely on something this dumb.
Don't you think that if this data were known to be widely and mostly beneficial, reCAPTCHA would be falling all over themselves to grab the good PR? The fact that regular folks hear virtually nothing about this strongly indicates that it's like most data collection: if people knew the real deal, they probably wouldn't happily sign on, and it would likely bring more questions than they want to deal with.
Not really; non-tech people don't know or care about reCAPTCHA. I still think it's evil for reCAPTCHA to be so prolific and used for data collection, but it's a positive side effect that it's also used for less evil things, like labeling data sets.
Right. But if it were mostly good, whoever runs reCAPTCHA could raise/make boatloads of money with "you're not just practicing safe computing, you're helping save children's lives" type ads/fundraising.
Then they can pay for their own Mechanical Turk labour, thank you very much. I will not be sponsoring a corporation out of the goodness of my heart, on my own time.
If I ever learn that they release that dataset to the public, my position on this may change.
If this is something important that we should rely on, any company involved should be spending the readily available resources to do it correctly, not hoping that random people trying to log in to their email pick the correct labels.
> a pair of words, one of which was unambiguously resolved by an OCR and the other of which OCRs couldn’t easily read
4chan had a lot of fun with this when it was first implemented. Perhaps unsurprisingly, there very quickly developed a campaign to have everyone insert "n**r" in place of the unknown word. Many threads were dedicated to education, onboarding, and, of course, sharing 'trophies' when such a replacement was found to have taken effect in one of Google's products (Books, iirc?).
[0] https://en.m.wikipedia.org/wiki/Hashcash