AI Needs to Work on Its Conversation Game

Researchers discover why AI does a poor job of knowing when to chime in on a conversation

News Staff | Thu, Nov 7, 2024

When you have a conversation today, notice the natural points when the exchange leaves open the opportunity for the other person to chime in. If their timing is off, they might be taken as overly aggressive, too timid, or just plain awkward.

The back-and-forth is the social element to the exchange of information that occurs in a conversation, and while humans do this naturally—with some exceptions—AI language systems are universally bad at it.

Linguistics and computer science researchers at Tufts have now discovered some of the root causes of this shortfall in AI conversational skills and point to possible ways to make AI systems better conversational partners.

When humans interact verbally, for the most part they avoid speaking simultaneously, taking turns to speak and listen. Each person evaluates many input cues to determine what linguists call “transition relevant places” or TRPs. TRPs occur often in conversations. Many times we will take a pass at a TRP, and let the speaker continue. Other times we will use the TRP to take our turn and share our thoughts.

JP de Ruiter, professor of psychology and computer science, says that for a long time the “paraverbal” information in conversations (intonation, lengthening of words and phrases, pauses, and some visual cues) was thought to be the most important signal for identifying a TRP.

“That helps a little bit,” says de Ruiter, “but if you take out the words and just give people the prosody—the melody and rhythm of speech that comes through as if you were talking through a sock—they can no longer detect appropriate TRPs.”

Do the reverse and provide just the linguistic content in monotone speech, and study subjects will find most of the same TRPs they would find in natural speech.

“What we now know is that the most important cue for taking turns in conversation is the language content itself. The pauses and other cues don’t matter that much,” says de Ruiter.

AI is great at detecting patterns in content, but when de Ruiter, graduate student Muhammad Umair, and research assistant professor of computer science Vasanth Sarathy, EG20, tested a large language model on transcribed conversations, the AI could not detect appropriate TRPs nearly as well as humans can.

The reason stems from what the AI is trained on. Large language models, including the most advanced ones such as ChatGPT, have been trained on a vast dataset of written content from the internet—Wikipedia entries, online discussion groups, company websites, news sites—just about everything.

What is missing from that dataset is any significant amount of transcribed spoken conversational language, which is unscripted, uses simpler vocabulary and shorter sentences, and is structured differently than written language. AI was not “raised” on conversation, so it does not have the ability to model or engage in conversation in a more natural, human-like manner.

The researchers thought that it might be possible to take a large language model trained on written content and fine-tune it with additional training on a smaller set of conversational content so it can engage more naturally in a novel conversation. When they tried this, they found that there were still some limitations to replicating human-like conversation.

The researchers caution that there may be a fundamental barrier to AI carrying on a natural conversation. “We are assuming that these large language models can understand the content correctly. That may not be the case,” says Sarathy. “They’re predicting the next word based on superficial statistical correlations, but turn taking involves drawing from context much deeper into the conversation.”

“It’s possible that the limitations can be overcome by pre-training large language models on a larger body of naturally occurring spoken language,” says Umair, whose Ph.D. research focuses on human-robot interactions and who is the lead author on the studies.

“Although we have released a novel training dataset that helps AI identify opportunities for speech in naturally occurring dialogue, collecting such data at a scale required to train today’s AI models remains a significant challenge,” he says. “There are just not nearly as many conversational recordings and transcripts available compared to written content on the internet.”

The study results were presented at the Empirical Methods in Natural Language Processing (EMNLP) 2024 conference, held in Miami from November 11 to 17, and posted on arXiv.
