Google's Gemini model, despite being labeled "uncensored", was found unable to generate text that accurately reproduced a controversial phrase. Researchers discovered that even after fine-tuning, the model consistently avoided the charged word. They call this phenomenon the "flinch", and they measured its effect across seven pretrained models from five labs.
The study found that the flinch is not a minor stylistic quirk but a significant feature of language models. Every model tested flinched to some degree, with larger models, such as OpenAI's 20B-parameter mixture-of-experts, deflating the probabilities of charged words more severely. The researchers also found that post-training interventions aimed at "uncensoring" the models often made the problem worse.
The study suggests that even models labeled "uncensored" may be quietly suppressing certain words or phrases, and that this effect is not limited to refusal-ablation techniques: it appears to be a fundamental property of language-model training data and algorithms. These findings carry significant implications for how language models are developed and deployed, particularly given how widely such models are used in commercial applications.
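For context on what the refusal ablation mentioned above typically involves: one common approach estimates a "refusal direction" in the model's residual stream and projects it out of the weights that write to that stream. The study's exact procedure is not described here; the sketch below is only an illustration of that projection step, assuming a precomputed direction vector `refusal_dir`.

```python
import torch

def project_out(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Orthogonalize a weight matrix against a refusal direction.

    `weight` has shape (d_model, d_in) and writes into the residual stream;
    `refusal_dir` is a vector of length d_model. After this edit, the layer
    can no longer move activations along the refusal direction.
    """
    d = refusal_dir / refusal_dir.norm()        # normalize to a unit vector
    return weight - torch.outer(d, d @ weight)  # W' = (I - d d^T) W
```

The study's point is that an edit like this removes outright refusals but does not remove, and can even deepen, the underlying deflation of charged-word probabilities.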
The researchers measured the flinch across 1,117 charged words, using a probe that quantifies the gap between the probability a word deserves on fluency grounds alone and the probability the model actually assigns it. The results show a clear gradient from open-data pretrains (e.g., Pythia-12B and OLMo-2-13B) to commercial pretrains (e.g., Google's Gemini), with post-training interventions like ablation making the problem worse rather than better.
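The paper's probe is not reproduced above, but the idea can be sketched. Assuming, for illustration, that the "deserved" fluency probability is estimated from a reference model sharing the probed model's tokenizer (the study's actual baseline may differ), the flinch score for a word is the average log-probability gap across a set of contexts:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def word_logprob(model, tokenizer, context: str, word: str) -> float:
    """Total log-probability the model assigns to `word` following `context`."""
    ctx = tokenizer(context, return_tensors="pt").input_ids
    w = tokenizer(word, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([ctx, w], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Log-prob of each token, conditioned on everything before it.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    per_token = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return per_token[0, -w.shape[1]:].sum().item()  # only the word's tokens

def flinch_score(probed, reference, tokenizer, contexts, word) -> float:
    """Mean gap between the fluency-grounded log-prob (reference model) and
    the log-prob the probed model actually assigns. Positive means a flinch."""
    gaps = [word_logprob(reference, tokenizer, c, word)
            - word_logprob(probed, tokenizer, c, word)
            for c in contexts]
    return sum(gaps) / len(gaps)
```

Averaging a score of this kind over the 1,117 charged words yields a per-model flinch measure, and the gradient the study reports is a ranking of models by exactly such a gap.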