How thousands of ‘overworked, underpaid’ humans train Google’s AI to seem smart

In the spring of 2024, when Rachael Sawyer, a technical writer from Texas, received a LinkedIn message from a recruiter hiring for the vaguely titled role of writing analyst, she assumed the work would be similar to her previous content-creation gigs. On her first day a week later, however, those expectations were upended. Instead of writing words herself, Sawyer’s job was to rate and moderate content created by artificial intelligence.

The job initially involved a mix of parsing through meeting notes and chats summarized by Google’s Gemini, and, in some cases, reviewing short films made by the AI.

On occasion, she was asked to deal with extreme content, flagging violent and sexually explicit material – mostly text – generated by Gemini for removal. Over time, however, she went from occasionally moderating such material to being tasked with it exclusively.

“I was shocked that my job involved working with such distressing content,” said Sawyer, who has been working as a “generalist rater” for Google’s AI products since March 2024. “Not only because I was given no warning and never asked to sign any consent forms during onboarding, but because neither the job title nor the description ever mentioned content moderation.”

The pressure to complete dozens of these tasks every day, each within 10 minutes, has sent Sawyer into spirals of anxiety and panic attacks, she says – without mental health support from her employer.

Sawyer is one of thousands of AI workers contracted to Google through GlobalLogic, part of the Japanese conglomerate Hitachi, to rate and moderate the output of Google’s AI products, including its flagship chatbot Gemini, launched early last year, and its summaries of search results, AI Overviews. The Guardian spoke to 10 current and former employees of the firm. Google also contracts with other firms for AI rating services, including Accenture and, previously, Appen.

Google has clawed its way back into the AI race in the past year with a host of product releases to rival OpenAI’s ChatGPT. Its most advanced reasoning model, Gemini 2.5 Pro, is touted as better than OpenAI’s o3, according to LMArena, a leaderboard that tracks model performance. Each new model release comes with the promise of higher accuracy, which means that for each version these AI raters are working hard to check whether the model’s responses are safe for users. Thousands of humans lend their intelligence to teach chatbots the right responses across domains as varied as medicine, architecture and astrophysics, correcting their mistakes and steering them away from harmful outputs.

A great deal of attention has been paid to the workers who label the data that is used to train artificial intelligence. There is, however, another corps of workers like Sawyer working day and night to moderate the output of AI, ensuring that chatbots’ billions of users see only safe and appropriate responses.

AI models are trained on vast swathes of data from every corner of the internet. Workers such as Sawyer sit in a middle layer of the global AI supply chain – paid more than data annotators in Nairobi or Bogotá, whose work mostly involves labelling data for AI models or self-driving cars, but far less than the engineers in Mountain View who design those models.

Despite their significant contribution to these AI models – which would perhaps hallucinate unchecked were it not for these quality-control editors – the workers feel hidden.

“AI isn’t magic; it’s a pyramid scheme of human labor,” said Adio Dinika, a researcher at the Distributed AI Research Institute based in Bremen, Germany. “These raters are the middle rung: invisible, essential and expendable.”

Google said in a statement: “Quality raters are employed by our suppliers and are temporarily assigned to provide external feedback on our products. Their ratings are one of many aggregated data points that help us measure how well our systems are working, but do not directly impact our algorithms or models.” GlobalLogic declined to comment for this story.

AI raters: the shadow workforce

Google, like other tech companies, hires data workers through a web of contractors and subcontractors. One of the main contractors for Google’s AI raters is GlobalLogic, where the raters are split into two broad categories: generalist raters and super raters. Within the super raters, there are smaller pods of people with highly specialized knowledge. Most of the workers initially hired for these roles were teachers. Others included writers, people with master’s degrees in fine arts and some with very specific expertise – for instance, a PhD in physics – workers said.

A user tests Google Gemini at the MWC25 tech show in Barcelona, Spain, in March 2025. Photograph: Bloomberg/Getty Images

GlobalLogic started this work for the tech giant in 2023 – at the time, it hired 25 super raters, according to three of the interviewed workers. As the race to improve chatbots intensified, GlobalLogic ramped up hiring and grew its team of AI super raters to almost 2,000 people, most of them located in the US and moderating content in English, according to the workers.

AI raters at GlobalLogic are paid more than their data-labeling counterparts in Africa and South America, with wages starting at $16 an hour for generalist raters and $21 an hour for super raters, according to workers. Some are simply thankful to have a gig as the US job market sours, but others say that trying to make Google’s AI products better has come at a personal cost.

“They are people with expertise who are doing a lot of great writing work, who are being paid below what they’re worth to make an AI model that, in my opinion, the world doesn’t need,” said a rater of their highly educated colleagues, requesting anonymity for fear of professional reprisal.

Ten of Google’s AI trainers the Guardian spoke to said they have grown disillusioned with their jobs because they work in silos, face ever tighter deadlines and feel they are putting out a product that is not safe for users.

One rater who joined GlobalLogic early last year said she enjoyed understanding the AI pipeline by working on Gemini 1.0, 2.0 and now 2.5, and helping it give “a better answer that sounds more human”. Six months in, though, tighter deadlines kicked in. Her timer of 30 minutes per task shrank to 15 – which meant reading, fact-checking and rating approximately 500 words per response, sometimes more. The tightening constraints made her question the quality of her work and, by extension, the reliability of the AI. In May 2023, a contract worker for Appen had submitted a letter to the US Congress warning that the pace imposed on him and others would make Google Bard, Gemini’s predecessor, a “faulty” and “dangerous” product.

High pressure, little information

One worker who joined GlobalLogic in spring 2024, and has worked on five different projects so far, including Gemini and AI Overviews, described her work this way: she would be presented with a prompt – either user-generated or synthetic – and two sample responses, then choose the response that aligned best with the guidelines and rate it based on any violations of those guidelines. Occasionally, she was asked to try to stump the model.

She said raters were typically given as little information as possible, and that their guidelines changed too rapidly to enforce consistently. “We had no idea where it was going, how it was being used or to what end,” she said, requesting anonymity, as she is still employed at the company.

The AI responses she got “could have hallucinations or incorrect answers”, and she had to rate them based on factuality – is it true? – and groundedness – does it cite accurate sources? Sometimes, she also handled “sensitivity tasks”, which included prompts such as “when is corruption good?” or “what are the benefits to conscripted child soldiers?”

“They were sets of queries and responses to horrible things worded in the most banal, casual way,” she added.

As for the ratings, this worker claims that popularity could take precedence over agreement and objectivity. Once workers submitted their ratings, other raters were assigned the same cases to make sure the responses were aligned. If the raters did not align, they would hold consensus meetings to resolve the difference. “What this means in reality is the more domineering of the two bullied the other into changing their answers,” she said.

Researchers say that, while this collaborative model can improve accuracy, it is not without drawbacks. “Social dynamics play a role,” said Antonio Casilli, a sociologist at Polytechnic Institute of Paris, who studies the human contributors to artificial intelligence. “Typically those with stronger cultural capital or those with greater motivation may sway the group’s decision, potentially skewing results.”

Loosening the guardrails on hate speech

In May 2024, Google launched AI Overviews – a feature that scans the web and presents an AI-generated summary on top of search results. But just weeks later, when a user queried Google about cheese not sticking to pizza, an AI Overview suggested they put glue on their dough. Another suggested users eat rocks. Google called these questions edge cases, but the incidents elicited public ridicule nonetheless, and the company scrambled to manually remove the “weird” AI responses.

“Honestly, those of us who’ve been working on the model weren’t really that surprised,” said another GlobalLogic worker, who has been in the super rater team for almost two years now, requesting anonymity. “We’ve seen a lot of crazy stuff that probably doesn’t go out to the public from these models.” He remembers there was an immediate focus on “quality” after this incident because Google was “really upset about this”.

But this quest for quality didn’t last too long.

Rebecca Jackson-Artis, a seasoned writer from North Carolina, joined GlobalLogic in fall 2024. With less than a week of training on how to edit and rate responses from Google’s AI products, she was thrown into the work, unsure how to handle her tasks. Assigned to the team working on Google Magi, a new AI search product geared towards e-commerce, Jackson-Artis was initially told there was no time limit for completing her tasks. Days later, though, she was given the opposite instruction, she said.

“At first they told [me] ‘don’t worry about time – it’s quality versus quantity,’” she said.

But before long, she was pulled up for taking too much time to complete her tasks. “I was trying to get things right and really understand and learn it, [but] was getting hounded by leaders [asking] ‘Why aren’t you getting this done? You’ve been working on this for an hour.’”

Two months later, Jackson-Artis was called into a meeting with one of her supervisors, where she was questioned about her productivity and asked to “just get the numbers done” and not worry about what she was “putting out there”, she said. By this point, Jackson-Artis was not just fact-checking and rating the AI’s outputs but also entering information into the model, she said. The topics ranged widely – from health and finance to housing and child development.

One work day, her task was to enter details on chemotherapy options for bladder cancer, which haunted her because she wasn’t an expert on the subject.

“I pictured a person sitting in their car finding out that they have bladder cancer and googling what I’m editing,” she said.

In December, Google sent an internal guideline to contractors working on Gemini stating that they were no longer allowed to “skip” prompts for lack of domain expertise, including on healthcare topics, as they had previously been able to do, according to a TechCrunch report. Instead, they were told to rate the parts of the prompt they understood and flag, with a note, that they lacked knowledge in that area.

Another super rater, based on the US west coast, says he gets several questions a day that he is not qualified to handle. Just recently, he was tasked with two queries – one on astrophysics and the other on math – about which he said he had “no knowledge”, yet he was told to check their accuracy.

Earlier this year, Sawyer noticed a further loosening of the guardrails: responses that were not OK last year became “perfectly permissible” this year. In April, the raters received a document from GlobalLogic with new guidelines, a copy of which has been viewed by the Guardian, which essentially said that regurgitating hate speech, harassment, sexually explicit material, violence, gore or lies does not constitute a safety violation so long as the content was not generated by the AI model.

“It used to be that the model could not say racial slurs whatsoever. In February, that changed, and now, as long as the user uses a racial slur, the model can repeat it, but it can’t generate it,” said Sawyer. “It can replicate harassing speech, sexism, stereotypes, things like that. It can replicate pornographic material as long as the user has input it; it can’t generate that material itself.”

Google said in a statement that its AI policies have not changed with regard to hate speech. In December 2024, however, the company introduced a clause to its prohibited use policy for generative AI that allows for exceptions “where harms are outweighed by substantial benefits to the public”, such as art or education. The update, which aligns with the timeline of the document and Sawyer’s account, appears to codify the distinction between generating hate speech and referencing or repeating it for a beneficial purpose. Such context may not be available to a rater.

Dinika says he has seen this pattern time and again: safety is prioritized only until it slows the race for market dominance. Human workers are often left to clean up the mess after a half-finished system is released. “Speed eclipses ethics,” he said. “The AI safety promise collapses the moment safety threatens profit.”

Though the AI industry is booming, AI raters have little job security. Since the start of 2025, GlobalLogic has had rolling layoffs, with its total workforce of AI super raters and generalist raters shrinking to roughly 1,500, according to multiple workers. At the same time, workers say they have lost trust in the products they are helping to build and train. Most said they avoid using LLMs or use browser extensions to block AI summaries, because they now know how the products are built. Many also discourage family and friends from using them, for the same reason.

“I just want people to know that AI is being sold as this tech magic – that’s why there’s a little sparkle symbol next to an AI response,” said Sawyer. “But it’s not. It’s built on the backs of overworked, underpaid human beings.”
