The way we communicate in our native language is a lot more complicated than you might think. Say you go to a coffee shop and order a cold brew. You might simply say, “I’ll have a cold brew.” But there’s also a good chance you’ll stretch it out a little and say something like, “Ah, it’s pretty hot out. You know, I think I’ll go with a cold brew this morning, please.” That’s because we bring a lot more complexity to our native languages. We like to play with them and get creative without even realizing it, especially with a language we grew up speaking and know very well.
If you’re just learning a new language, however, the opposite is probably true. Your sentences are simpler and more direct, with fewer linguistic flourishes and less variation, which makes sense. You’re more concerned with getting things right than with sounding like a loquacious native speaker.
However, it’s this lack of linguistic complexity that also distinguishes text written by large language models (LLMs) like ChatGPT or Bard from text written by humans. This idea underpins many of the AI detection apps that professors and teachers use to assess whether an essay was actually written by one of their students or by a bot.
The same approach, however, might also lead these apps to wrongfully target people whose first language isn’t English. That’s at the heart of a study published July 10 in the journal Patterns, which found that these models “frequently misclassify non-native English writing as AI generated.” That could result in the marginalization of non-native English speakers in schools and universities.
“The design of many GPT detectors inherently discriminates against non-native authors, particularly those exhibiting restricted linguistic diversity and word choice,” the study’s authors wrote. They added that the parameters used by these detectors include “measures that also unintentionally distinguish non-native- and native-written samples.”
These detectors are typically trained using other open-source LLMs. For example, the popular AI detector GPTZero was initially trained using GPT-2 in order to help professors identify bot-written essays in their classes. Edward Tian, GPTZero’s founder and CEO, told The Daily Beast in January that the bot uses perplexity (a measure of how unpredictable a given sentence is) and burstiness (a measure of how much that unpredictability varies across a text) to determine whether or not a piece of text was created by ChatGPT. Other AI detectors use similar metrics.
Put simply: If your sentence surprises the bot, then it’s more likely to have been written by a flesh-and-blood human (in theory).
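To make that concrete, here’s a rough sketch of what perplexity- and burstiness-style scoring can look like under the hood. It uses the public GPT-2 model via Hugging Face’s transformers library; the exact metrics, and anything resembling a threshold, are illustrative assumptions rather than GPTZero’s actual implementation.

```python
# A minimal sketch of perplexity- and burstiness-style scoring, assuming the
# "transformers" and "torch" packages and the public GPT-2 checkpoint.
# This is an illustration of the general technique, not any detector's code.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How 'surprised' GPT-2 is by a piece of text; lower means more predictable."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the tokens as labels makes the model return its average
        # cross-entropy loss over the sequence.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

def burstiness(sentences: list[str]) -> float:
    """Spread of per-sentence perplexity; human writing tends to vary more."""
    scores = [perplexity(s) for s in sentences]
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

if __name__ == "__main__":
    sample = [
        "It's pretty hot out, so I think I'll go with a cold brew this morning.",
        "The report summarizes the key findings from the quarterly data.",
    ]
    for sentence in sample:
        print(f"{perplexity(sentence):7.1f}  {sentence}")
    print(f"burstiness across sentences: {burstiness(sample):.1f}")
```

The intuition is that simple, predictable prose, the kind a careful language learner tends to produce, scores low on both measures, which is exactly where the bias can creep in.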
For the study, the authors ran seven popular AI detectors on 91 Test of English as a Foreign Language (TOEFL) essays collected from a Chinese forum and 88 essays written by U.S. eighth-grade students. The detectors incorrectly labeled an average of 61.3 percent of the TOEFL essays as bot-generated. One detector even flagged 98 percent of those essays as AI-written.
The detectors fared much better with the essays from native English speakers, correctly identifying 95 percent of them as human-written. That points to a clear bias in these apps, one that could do further harm to already marginalized groups like international students and students of color.
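For reference, the false-positive rate here is simply the share of genuinely human-written essays that a detector flags as AI. A toy calculation, with made-up labels, shows how a figure in the neighborhood of the study’s 61.3 percent average falls out of 91 essays:

```python
# Toy illustration of the study's headline metric: the fraction of
# human-written essays that a detector wrongly flags as AI-generated.
# The labels below are invented; only the formula mirrors the metric.
def false_positive_rate(flagged_as_ai: list[bool]) -> float:
    return sum(flagged_as_ai) / len(flagged_as_ai)

# If a detector flagged, say, 56 of the 91 human-written TOEFL essays,
# its false-positive rate would be roughly 61.5 percent.
print(false_positive_rate([True] * 56 + [False] * 35))  # about 0.615
```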
“Our findings emphasize the need for increased focus on the fairness and robustness of GPT detectors, as overlooking their biases may lead to unintended consequences, such as the marginalization of non-native speakers in evaluative or educational settings,” the authors wrote.
Ironically, the study’s authors note that students could use LLMs to make their essays more complex. The researchers used ChatGPT to enhance the language of the TOEFL essays so they read more like the work of a native English speaker, and found that the average false-positive rate plummeted by 50 percent. Conversely, simplifying the essays of the U.S. eighth graders led to a spike in false-positive rates.
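The paper’s exact prompts aren’t reproduced here, but the rewriting step is easy to picture. Below is a hypothetical sketch of that kind of request using OpenAI’s Python client; the prompt wording and model choice are assumptions for illustration, not the researchers’ actual setup.

```python
# Hypothetical sketch of asking ChatGPT to "enhance" an essay's language.
# The prompt text and model name are illustrative assumptions, not the
# study's configuration. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def enhance_essay(essay: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": (
                    "Rewrite the following essay with more varied, natural "
                    "word choices, as a native English speaker would, without "
                    "changing its meaning:\n\n" + essay
                ),
            }
        ],
    )
    return response.choices[0].message.content
```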
The study underscores the pernicious effects of bias and marginalization when it comes to AI. These issues are well documented and include bots that discriminate in home lending, jail sentencing, and job hunting. Luckily, some government bodies are attempting to rein in those effects. For example, New York City introduced a new law on July 5 requiring all businesses that use hiring bots to prove their process is free from racism and sexism, and to publicly disclose that they use AI in hiring.
The same type of regulation is unlikely to come for AI detectors any time soon, at least not at the federal level. However, the paper highlights the need for greater public awareness and literacy when it comes to these emerging technologies.
It also shows that the old adage still rings true: Sometimes the cure is worse than the disease. While we might be able to fight against instances of AI plagiarism, we might end up harming a lot of innocent students along the way.