It is tempting to rely on crowdsourcing to validate LLM outputs or to create human gold-standard data for comparison. But what if crowd workers themselves are using LLMs, for example to boost their productivity, and thus their income, on crowdsourcing platforms?
To get a general sense of the problem, the researchers behind a new study assigned an “abstract summarization” task to be completed by turkers. Through various analyses described in the paper (which has not yet been peer-reviewed), they “estimate that 33%-46% of crowd workers used LLMs when completing the task.”
To some, this will come as no surprise. Some level of automation has likely existed in turking ever since the platform started. Speed and reliability are incentivized, and if you could write a script that handled certain requests with 90% accuracy, you stood to make a fair amount of money. With so little oversight of individual contributors’ processes, it was inevitable that some of these tasks would not actually be performed by humans, as advertised.
Integrity has never been Amazon’s strong suit, so there was no sense relying on the company to police it.
The researchers who conducted this study, Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West, caution that this task is particularly suited to surreptitious automation.
Summarization is exactly the kind of work LLMs excel at, so machine-generated responses can easily pass for human-written ones. And the state of the art keeps advancing: multimodal models that handle text, images, and video are becoming increasingly popular, making it ever harder to distinguish human-written from AI-generated data.
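To make the detection problem concrete, here is a minimal, purely illustrative sketch of the kind of baseline classifier one might train to flag likely LLM-generated submissions. This is not the study authors’ methodology; the training examples, pipeline, and threshold are all hypothetical stand-ins.

```python
# Toy sketch (not the researchers' actual method): fit a simple text
# classifier on known human-written vs. LLM-generated summaries, then
# score new crowd submissions. Data and threshold are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: 0 = human-written, 1 = LLM-generated.
train_texts = [
    "study looked at how sleep affects memory in older adults",
    "In conclusion, this comprehensive study presents a novel framework",
]
train_labels = [0, 1]

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
detector.fit(train_texts, train_labels)

# Score a new submission; a high score is a reason for manual review,
# not proof of LLM use.
new_submission = "This paper proposes a novel framework for summarization"
llm_probability = detector.predict_proba([new_submission])[0][1]
if llm_probability > 0.8:  # arbitrary review threshold
    print(f"Flag for review (p={llm_probability:.2f})")
```

Even a detector like this only shifts the problem: as models improve, the stylistic signals such classifiers lean on get weaker.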
The threat of AI “eating itself” has been theorized for years, and it became a reality almost instantly once LLMs were widely deployed.
For example, Bing’s pet ChatGPT quoted its own misinformation as support for new misinformation about a COVID conspiracy. If a model can cite its own fabrications as evidence, any information generated by LLMs deserves that much more scrutiny.