Is the University of Michigan Selling Student Data to Train AI?

The University of Michigan has allegedly sold 85 hours of audio recordings from various academic settings, including lectures, interviews, office hours, study groups, and student presentations, to third parties for the purpose of training artificial intelligence. The school has also allegedly sold a dataset of 829 student papers to help fine-tune large language models (LLMs).

It is unclear whether those included in the data consented to having their audio and texts used in such a manner. However, a sample dataset downloaded by The Daily Beast included a recording of a lecture from 1999, making it highly unlikely that the participants knew their data would be used to train future generative AI models.

AI engineer Susan Zhang took to X to post a screenshot of what looks to be an advertisement she recently received on LinkedIn from Catalyst Research Alliance, a firm selling the UM data. The sender wrote that they were “reaching out because, based on your profile, you may be working with” LLMs.

“I wanted to let you know that the University of Michigan is licensing academic speech data and student papers that could be very useful for training or tuning LLMs,” the user wrote.

“So I guess this is a thing now,” Zhang said. “Universities running ads to resell students’ data for training LLMs.”


In a statement to The Daily Beast, UM spokesperson Colleen Mastony said that the ad was "sent out by a new third party vendor that shared inaccurate information and has since been asked to halt their work."

"Student data was not and has never been for sale by the University of Michigan," Mastony said. She added that the papers and speech recordings were "voluntarily contributed by student volunteers" who participated in two research studies "under signed consent." One study occurred between 1997 and 2000, while the other occurred between 2006 and 2007.

However, the nature of UM's relationship with Catalyst Research Alliance as a "third party vendor" remains unclear. It is also unclear whether the students knew that their data would later be sold to help train AI. Catalyst Research Alliance did not respond when reached for comment.

According to the firm's website, the cost of licensing depends on whether customers want just the audio recordings or the papers as well, with the price going as high as $25,000 for both datasets.

“The University of Michigan has recorded 65 speech events from a wide range of academic settings, including lectures, discussion sections, interviews, office hours, study groups, seminars and student presentations,” Catalyst Research Alliance said on its website. “Speakers represent broad demographics, including male and female and native and non-native English speakers from a wide variety of academic disciplines.”

The sample dataset included an audio lecture titled “Graduate Cellular Biotechnology Lecture” dated Feb. 1, 1999. In it, the unidentified lecturer speaks for roughly an hour and a half. The dataset also included a .txt file of a paper titled “The Democratic Inadequacies of the European Union.”

If true, the licensing deal is just another example of how personal data is being packaged and sold to help fuel emerging technologies such as generative AI. Even students whose work is completely unrelated to AI and LLMs can find their voices and writing being used to help train these models.

“The whole thing feels deeply unethical,” Charles Logan, a learning sciences PhD candidate at Northwestern University, told The Daily Beast. Logan saw Zhang’s post on X and also commented on the situation, decrying it as the “logical progression of data capitalism.”

“When students are in a class or attending office hours there’s a trust implicit in that relationship,” Logan said. “They’re there to learn.”

He added that even if students consent to being part of these datasets, “there are still ways that they’re leaky.” “Private companies are monetizing student intellectual property and conversations that, if you’re in office hours or study groups, are deeply personal,” he said.

Brett Karlan, an assistant professor of philosophy who researches AI ethics, told The Daily Beast in an email that there's an "ethical tension" between the role of the university and its alleged sale of student data. "Students, ideally, should feel that they are able to try things out and make mistakes in classes, with those mistakes not being available to anyone except the professor," he explained. He added that now the material might end up training AI. "This feels like, if not a fundamental betrayal of the mission of a university."

That said, there is some room for doubt. Mastony said that the papers and recordings have "long been available for free to academics" and have "been used as a tool to improve writing and articulation in education." She added that "none of the papers or recordings included identifying information, such as names or other personal data."

“My first reaction is one of skepticism,” Vincent Conitzer, an AI ethics researcher at Carnegie Mellon University, told The Daily Beast. “Also, even taking this message mostly at face value, I suppose it may just all be based on recordings and papers that are anyway in the public domain.”

He added that “it seems odd to me to imagine the university at the highest levels standing behind something like what this message is suggesting.”

Tony Ho Tran