To Make AI Smarter, Humans Perform Oddball Low-Paid Tasks

As researchers attempt to apply artificial intelligence to daily life, they're paying "crowd actors" to film themselves performing routine tasks.

Tucked into a back corner far from the street, the baby-food section of Whole Foods in San Francisco’s SoMa district doesn’t get much foot traffic. I glance around for the security guard, then reach toward the apple and broccoli superfood puffs. After dropping them into my empty shopping cart, I put them right back. “Did you get it?” I ask my coworker filming on his iPhone. It’s my first paid acting gig. I’m helping teach software the skills future robots will need to assist people with their shopping.

Whole Foods was an unwitting participant in this program, a project of German-Canadian startup Twenty Billion Neurons. I quietly perform nine other brief actions, including opening freezers and pushing a cart from right to left, then left to right. Then I walk out without buying a thing. Later, it takes me around 30 minutes to edit the clips down to the required 2 to 5 seconds and upload them to Amazon’s crowdsourcing website, Mechanical Turk. A few days later I am paid $3.50. If Twenty Billion ever creates software for a shopping-assistant robot, it will make much more.

In sneaking around Whole Foods, I joined an invisible workforce being paid very little to do odd things in the name of advancing artificial intelligence. You may have been told AI is the gleaming pinnacle of technology. These workers are part of the messy human reality behind it.

Proponents believe every aspect of life and business should be, and will be, mediated by AI. It’s a campaign stoked by large tech companies such as Alphabet demonstrating that machine learning can master tasks like recognizing speech or images. But most current machine-learning systems, such as voice assistants, are built by training algorithms on giant stocks of labeled data. The labels come from ranks of contractors examining images, audio, or other data—that’s a koala, that’s a cat, she said “car.”

Now, researchers and entrepreneurs want to see AI understand and act in the physical world. Hence the need for workers to act out scenes in supermarkets and homes. They are generating the instructional material to teach algorithms about the world and the people in it.

That’s why I find myself lying face down on WIRED’s office floor one morning, coarse synthetic fibers pressing into my cheek. My editor snaps a photo. After uploading it to Mechanical Turk, I get paid 7 cents by an eight-person startup in Berkeley called Safely You. When I call CEO George Netscher to say thanks, he erupts in a surprised laugh, then turns mock serious. “Does that mean there’s a conflict of interest?” (The $6.30 I made reporting this article has been donated to the Haight Ashbury Free Clinics.)

Netscher’s startup makes software that monitors video feeds from elderly-care homes to detect when a resident has fallen. People with dementia often can’t remember why or how they ended up on the floor. In 11 facilities around California, Safely You’s algorithms help staff quickly find the place in a video that will solve the mystery.

Safely You was soliciting faked falls like mine to test how broad a view its system has of what a toppled human looks like. The company’s software has mostly been trained on video of elderly residents of care facilities, annotated by staff or contractors. Mixing in photos of 34-year-old journalists and anyone else willing to lie down for 7 cents should force the machine-learning algorithms to widen their understanding. “We’re trying to see how well we can generalize to arbitrary incidents or rooms or clothing,” says Netscher.

The startup that paid for my Whole Foods performance, Twenty Billion Neurons, is a bolder bet on the idea of paying people to perform for an audience of algorithms. Roland Memisevic, cofounder and CEO, is in the process of trademarking a term for what I did to earn my $3.50—crowd acting. He argues that it is the only practical path to give machines a dash of common sense about the physical world, a longstanding quest in AI. The company is gathering millions of crowd-acting videos, and using them to train software it hopes to sell clients in industries such as automobiles, retail, and home appliances.

Games like chess and Go, with their finite, regimented boards and well-defined rules, are well-suited to computers. The physical and spatial common sense we learn intuitively as children to navigate the real world is mostly beyond them. To pour a cup of coffee, you effortlessly grasp and balance cup and carafe, and control the arc of the pouring fluid. You draw on the same deep-seated knowledge, and a sense for the motivations of other humans, to interpret what you see in the world around you.

How to give some version of that to machines is a major challenge in AI. Some researchers think the techniques that work so well for recognizing speech or images won’t be much help here, and argue that new approaches are needed. Memisevic took leave from the prestigious Montreal Institute for Learning Algorithms to start Twenty Billion because he believes existing techniques can do much more for us if trained properly. “They work incredibly well,” he says. “Why not extend them to more subtle aspects of reality by forcing them to learn things about the real world?”

To do that, the startup is amassing giant collections of clips in which crowd actors perform different physical actions. The hope is that algorithms trained to distinguish them will “learn” the essence of the physical world and human actions. It’s why, while crowd acting in Whole Foods, I not only took items from shelves and refrigerators but also made near-identical clips in which I only pretended to grab the product.

Twenty Billion’s first dataset, now released as open source, is physical reality 101. Its more than 100,000 clips depict simple manipulations of everyday objects. Disembodied hands pick up shoes, place a remote control inside a cardboard box, and push a green chili along a table until it falls off. Memisevic deflects questions about the client behind the casting call I answered, which declared, “We want to build a robot that assists you while shopping in the supermarket.” He will say that automotive applications are a big area of interest; the company has worked with BMW. I see jobs posted to Mechanical Turk, with only Twenty Billion's name attached, describing a project aimed at letting a car identify what people are doing inside it. Workers were asked to feign snacking, dozing off, or reading in chairs. Software that can detect those actions might help semi-automated vehicles know when a human isn’t ready to take over the driving, or pop open a cupholder when you enter holding a drink.

Who are the crowd actors doing this work? One is Uğur Büyükşahin, a third-year geological engineering student in Ankara, Turkey, and the star of hundreds of videos in Twenty Billion’s collection. He estimates he spends 7 to 10 hours a week on Mechanical Turk, earning roughly as much as he did in a shift with good tips at the restaurant where he used to work. Büyükşahin says Twenty Billion is one of his favorite requesters because it pays well, and promptly. Its sometimes odd assignments don’t bother him. “Some people may be shy about taking hundreds of videos in the supermarket, but I’m not,” Büyükşahin says. His girlfriend, by nature less outgoing, was initially wary of the project but has come around after seeing his earnings, some of which have translated into gifts, such as a new set of curling tongs.

Büyükşahin and another Turker I speak with, Casey Cowden, a 31-year-old in Johnson City, Tennessee, tell me I’ve been doing crowd acting all wrong. All in, my 10 videos earned me an hourly rate of around $4.60. They achieve much higher rates by staying in the supermarket for hours at a stretch, bingeing on Twenty Billion’s tasks.

Büyükşahin says his personal record is 110 supermarket videos in a single hour. He uses a gimbal for steadier shots, batting away inquisitive store employees when necessary by bluffing about a university research project in AI. Cowden calculates that he and a friend each earned an hourly rate of $11.75 during two and a half hours of crowd acting in a local Walmart. That’s more than Walmart’s $11 starting wage, or the roughly $7.75 Cowden’s fiancée earns at Burger King.

Cowden seems to have more fun than Walmart employees, too. He began Turking early last year, after the construction company he was working for folded. Working from home means he can be around to care for his fiancée’s mother, who has Alzheimer’s. He says he was initially drawn to Twenty Billion’s assignments because, with the right strategy, they pay better than the data-entry work that dominates Mechanical Turk. But he also warmed to the idea of working on a technological frontier. Cowden tells me he tries to vary the backdrop, and even the clothing he wears, in different shoots. “You can’t train a robot to shop in a supermarket if the videos you have are all the same,” he says. “I try to go the whole nine yards so the programming can get a diverse view.”

Mechanical Turk has often been called a modern-day sweatshop. A recent study found that median pay was around $2 an hour. But the site lacks even the communal atmosphere of a workhouse; its labor is atomized into individuals working from homes or phones around the world.

Crowd acting sometimes gives workers a chance to look each other in the face. Twenty Billion employs contract workers who review crowd-acting videos. But in a tactic common on Mechanical Turk, the startup sometimes uses crowd workers to review other crowd workers. I am paid 10 cents to review 50 clips of crowd actors working on the startup’s automotive project. I click to indicate whether each worker stuck to the script—“falling asleep while sitting,” “drinking something from a cup or can,” or “holding something in both hands.”

A video from Twenty Billion Neurons describing its work.

The task transports me to bedrooms, lounges, and bathrooms. Many appear to be in places where 10 cents goes further than in San Francisco. I begin to appreciate different styles of acting. To fake falling asleep, a shirtless man in a darkened room leans gently backwards with a meditative look; a woman who appears to be inside a closet lets her head snap forward like a puppet with a cut string.

Some of the crowd actors are children—a breach of Amazon’s terms, which require workers to be at least 18. One Asian boy of around 9, in a school uniform, looks out from a grubby plastic chair in front of a chipped whitewashed wall, then feigns sleep. Another Asian boy, slightly older, performs “drinking from a cup or a can” while another child lies on a bed behind him. Twenty Billion’s CTO, Ingo Bax, tells me the company screens such videos out of its final datasets but can’t rule out having paid for clips of child crowd actors before they were filtered. Memisevic says the company has protocols to prevent systematic payment for such material.

Children also appear in a trove of crowd-acting videos I discover on YouTube. In dozens of clips apparently made public by accident, people act out scripts like “One person runs down the stairs laughing holding a cup of coffee, while another person is fixing the doorknob.” Most appear to have been shot on the Indian subcontinent. Some have been captured by a crowd actor holding a phone to his or her forehead, for a first-person view.

I find the videos while trying to unmask the entity behind crowd-acting jobs posted to Mechanical Turk by the “AI Indoors Project.” Forums where crowd workers gather to gripe and swap tips reveal that it’s a collaboration between Carnegie Mellon University and the Allen Institute for AI in Seattle. Like Twenty Billion, they are gathering crowd-acted videos by the thousand to try to improve algorithms’ understanding of the physical world and what we do in it. Nearly 10,000 clips have already been released for other researchers to play with, in a collection aptly named Charades.

Gunnar Atli Sigurdsson, a grad student on the project, echoes Memisevic when I ask why he’s paying strangers to pour drinks or run down stairs with a phone held to their head. He wants algorithms to understand us. “We’ve been seeing AI systems getting very impressive at some very narrow, well-defined tasks like chess and Go,” Sigurdsson says. “But we want to have an AI butler in our apartment and have it understand our lives, not the stuff we’re posting on Facebook, the really boring stuff.”

If tech companies conquer that quotidian frontier of AI, it will likely be seen as the latest triumph of machine-learning experts. If Twenty Billion's approach works out, the truth will be messier and more interesting. If you ever get help from a robot in a supermarket, or ride in a car that understands what its occupants are doing, think of the crowd actors who may have trained it. Cowden, the Tennessean, says he liked Twenty Billion’s tasks in part because his mother is fighting bone cancer: robots and software able to understand and intervene in our world could help address the growing shortage of nurses and home-health aides. If the projects they contribute to succeed, crowd actors could change the world—although they may be among the last to benefit.
