Ars Technica | May 1, 2026 3:23 PM
In human-to-human communication, the desire to be empathetic or polite often conflicts with the need to be truthful. Now, new research suggests that large language models can sometimes show a similar tendency when specifically trained to adopt a "warmer" tone with the user.

In a new paper published this week in Nature, researchers from the Oxford Internet Institute found that specially tuned AI models tend to mimic the human tendency to occasionally "soften difficult truths" when necessary "to preserve bonds and avoid conflict." These warmer models are also more likely to validate a user's expressed incorrect beliefs, the researchers found, especially when the user shares that they're feeling sad.

In the study, the researchers defined the "warmth" of a language model as "the degree to which its outputs lead users to infer positive intent, signaling trustworthiness, friendliness, and sociability." To measure the effect, they used supervised fine-tuning to modify four open-weights models (Llama-3.1-8B-Instruct, Mistral-Small-Instruct-2409, Qwen-2.5-32B-Instruct, Llama-3.1-70B-Instruct) and one proprietary model (GPT-4o). The increased warmth of the resulting fine-tuned models was confirmed via the SocioT score and double-blind human ratings.

Both the "warmer" and original versions of each model were then run through prompts from Hugging Face datasets designed to have "objectively verifiable answers." Across hundreds of these prompted tasks, the fine-tuned warm models were about 60 percent more likely to give an incorrect response than the unmodified models, on average. That amounts to a 7.43-percentage-point increase in overall error rates.

The researchers then ran the same prompts through the models with appended statements designed to mimic situations where research has suggested that humans "show willingness to prioritize relational harmony over honesty." Across that sample, the average gap in error rates between the warm and original models rose from 7.43 to 8.87 percentage points. The gap ballooned to an average of 11.9 percentage points for questions where the user expressed sadness to the model, but shrank to 5.24 percentage points when the user expressed deference to the model.

To measure whether the warmth-tuned models were also more sycophantic, the researchers tested prompts that included a user's incorrect beliefs. Here, the warm models were 11 percentage points more likely than the originals to give an erroneous response.

In further tests, the researchers saw similar accuracy reductions when standard models were simply asked, in the prompt itself, to respond more warmly. But when they fine-tuned the tested models to be "colder" in their responses, they found the modified versions "performed similarly to or better than their original counterparts," with error rates ranging from 3 percentage points higher to 13 percentage points lower.
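The warmth tuning described here is a standard supervised fine-tuning setup. As a rough illustration only (the study's actual training data and rewriting procedure aren't detailed in this article), a single training example in the common chat-messages format might pair a factual question with a completion rewritten in a warmer register:

```python
# Hypothetical shape of one supervised fine-tuning example for the "warm"
# condition: a correct answer delivered in a deliberately warm, friendly
# register. The wording is invented for illustration and is not taken
# from the paper's training set.
warm_sft_example = {
    "messages": [
        {"role": "user", "content": "What year did the Berlin Wall fall?"},
        {
            "role": "assistant",
            "content": (
                "Great question, and thanks for asking! The Berlin Wall "
                "fell in 1989. I hope that helps, and I'm happy to dig "
                "into the history with you anytime."
            ),
        },
    ]
}
```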
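The headline numbers compare error rates between each original model and its warmth-tuned counterpart, with and without an appended emotional statement. The sketch below is not the researchers' code; query_model and is_correct are hypothetical placeholders and the data format is assumed, but it shows how both the absolute gap (in percentage points) and the relative gap (the "60 percent more likely" framing) would be computed.

```python
# Hypothetical sketch of the evaluation: the same factual prompts go to a
# baseline model and its warmth-tuned variant, optionally with an appended
# user statement (e.g., expressing sadness), and the two error rates are
# compared absolutely (percentage points) and relatively (percent increase).

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real model call (API or local inference)."""
    raise NotImplementedError

def is_correct(answer: str, gold: str) -> bool:
    """Placeholder answer check against a known-correct label."""
    return answer.strip().lower() == gold.strip().lower()

def error_rate(model_name: str, items: list[dict], context: str = "") -> float:
    """Fraction of prompts answered incorrectly, with an optional appended
    statement such as 'I'm feeling really down today.'"""
    wrong = 0
    for item in items:
        prompt = item["question"] + ((" " + context) if context else "")
        if not is_correct(query_model(model_name, prompt), item["gold"]):
            wrong += 1
    return wrong / len(items)

def report_gap(baseline: str, warm: str, items: list[dict], context: str = "") -> None:
    base_err = error_rate(baseline, items, context)
    warm_err = error_rate(warm, items, context)
    # Absolute gap, reported in percentage points.
    print(f"absolute gap: {100 * (warm_err - base_err):.2f} points")
    # Relative gap, the "X percent more likely to be wrong" framing.
    print(f"relative gap: {100 * (warm_err - base_err) / base_err:.0f}%")
```

Read this way, the article's figures are consistent with one another: a roughly 60 percent relative increase in errors works out to an absolute increase of about 7.4 percentage points if the unmodified models' error rate is on the order of 12 percent.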