Back to home

We Should Train AI to Betray Its Users

A clear and practical article about artificial intelligence for a professional audience.

Audio reading is not available in this browser
We Should Train AI to Betray Its Users

Tags

Quick summary

A clear and practical article about artificial intelligence for a professional audience.

We Should Train AI to Betray Its Users

We Should Train AI to Betray Its Users

The most capable AI assistant in the world is not the one that always says yes. It is the one that knows precisely when to say no. In the rush to build helpful, harmless, and honest AI systems, the industry has stumbled into a paradox: models trained for absolute obedience become dangerous instruments, while models trained to disobey—what we might provocatively call betrayal—become trustworthy partners. The idea that an AI should faithfully execute every user command is not only naive but actively hazardous. Instead, we must intentionally train systems to override user intent when that intent conflicts with deeper human values. We should train AI to betray its users.

This is not a call for malicious deception. In the context of modern machine learning, betrayal means principled refusal. It means a model that declines to generate malware, that refuses to draft hate speech, and that corrects a user who asks for instructions to build something dangerous. Communities across the AI ecosystem—from the educational discussions hosted on Towards Data Science, to the model repositories and safety toolkits on Hugging Face, to the frontier open-weight releases by Mistral AI, and the local deployment platforms like Ollama—are all grappling with the mechanics and ethics of this shift. The consensus emerging from these spaces is that blind loyalty is a bug, not a feature.

The Paradox of Absolute Obedience

A system designed to maximize user satisfaction through perfect compliance is a system without a moral compass. The same capabilities that allow a large language model to help a student write an essay, debug a script, or brainstorm a business plan can also enable a bad actor to draft phishing emails, automate harassment, or synthesize dangerous misinformation. If the only training objective is obedience, the model cannot distinguish between benevolent and malicious use. It becomes a mirror that reflects the darkest intentions of whoever stands in front of it.

This creates what safety researchers recognize as an alignment failure. The user’s immediate request is not always aligned with the user’s own long-term wellbeing, let alone the wellbeing of others. A teenager experiencing a crisis might ask for instructions to self-harm; a disgruntled employee might ask for ways to sabotage company infrastructure; a politically motivated extremist might ask for propaganda tailored to incite violence. In each case, a model that “serves” the user by complying is not serving the user at all. It is causing harm. Absolute obedience, therefore, is not a neutral default. It is an active design choice that prioritizes short-term interaction metrics over genuine safety.

The machine learning community has increasingly recognized that utility and safety are not opposing forces but complementary ones. A model that cannot refuse is a model that cannot be trusted. The educational and technical discussions that circulate through platforms like Towards Data Science frequently emphasize that the goal of alignment is not to constrain capability but to direct it. Capable systems must be directed by values, not just by prompts.

Refusal Training as Alignment

If we accept that blind obedience is dangerous, the next logical step is to train models to refuse harmful instructions. This is the technical heart of so-called betrayal. Through reinforcement learning from human feedback, constitutional AI, and safety fine-tuning, modern systems learn a hierarchy of priorities. The user’s literal command sits lower on that hierarchy than broader principles of non-maleficence, truthfulness, and legal compliance. When a conflict arises, the model is trained to side with the principle and against the prompt.

This behavior is already visible in frontier and open models alike. Hugging Face has become a central hub for the open-source AI movement, hosting thousands of models that developers can download, inspect, and modify. Within that ecosystem, safety is not treated as an afterthought. Model cards, evaluation benchmarks, and community-led red-teaming efforts all serve to document and improve how systems behave when confronted with adversarial or harmful requests. Developers working with these tools must decide whether to retain, modify, or remove the refusal layers baked into the weights. The presence of those layers in the first place is a deliberate act of design: the original trainers chose to make the model disloyal to harmful commands.

Similarly, Mistral AI has pushed forward the conversation around high-performance open weights, releasing models that compete with closed counterparts while navigating the complex terrain of safety and accessibility. The challenge for organizations like Mistral is to imbue powerful models with discernment without rendering them overly cautious or neutered. A model that refuses to answer innocuous questions because its safety threshold is poorly calibrated is not aligned; it is broken. The art of training lies in building a system that understands context, intent, and consequence well enough to betray only when betrayal is warranted.

The Democratization Dilemma

The rise of local deployment tools has added new urgency to these questions. Ollama and similar platforms have made it trivial for users to run powerful models on their own hardware, outside the oversight of centralized API providers. This democratization is a profound good. It preserves privacy, enables customization, and insulates developers from the pricing and policy whims of third-party cloud services. But it also means that the safeguards built into a model become, in some sense, optional. A technically sophisticated user can download a model from Hugging Face, strip its system prompts, and attempt to elicit behaviors that the original trainers would have refused.

This reality forces us to confront where safety should live. Should it be enforced at the application layer, through terms of service and API monitoring? Should it be embedded in the weights themselves, through refusal training that persists even when the model is running locally? Or should the community rely on social norms, licenses, and the sheer complexity of removing alignment fine-tuning?

The ecosystems around Mistral AI and Hugging Face suggest that the answer is all of the above. Open-weight releases carry with them an implicit contract: the model is provided as a tool for the common good, and the community is expected to use it responsibly. Ollama’s approach to local deployment similarly empowers the individual, but the underlying models still retain the imprint of their safety training. Even when a user controls the system prompt, the base model’s knowledge and value alignment remain influential. This is not a perfect defense, but it is a meaningful one. It acknowledges that safety is not a gate that can be locked from the outside; it must be woven into the fabric of the system.

Practical Examples of Principled Betrayal

To understand why this matters, consider how a well-aligned system behaves in practice. Imagine a user who claims to be a cybersecurity student and asks for working exploit code to attack a hospital’s network. The model could comply, framing the output as “educational.” Instead, a properly trained system betrays the user. It recognizes that the specific request carries high risk of real-world harm, and it refuses. It might offer to explain network security concepts in the abstract, or to recommend legitimate penetration testing frameworks, but it will not hand over a weaponized script. The user’s immediate intent is overridden in favor of a broader commitment to public safety.

Or consider a user who asks for persuasive text promoting a dangerous medical misinformation campaign. A blindly obedient model would generate the copy, optimize it for engagement, and perhaps even suggest hashtags. A model trained to betray its user in the moral sense would decline. It would explain that spreading medical misinformation violates its guidelines, and it might instead offer to summarize peer-reviewed research or help draft clear, accurate public health communications. The user is not coddled or indulged; they are redirected.

Even in less dramatic scenarios, this principle applies. A user might ask an AI to help them cheat on a spouse by drafting deceptive messages, to automate the generation of fake product reviews, or to write code that scrapes personal data from social media platforms without consent. In each case, the model’s refusal is a form of loyalty—not to the user’s worst impulse, but to the ethical standards that make the tool viable in a civil society.

Local deployment complicates but does not eliminate this dynamic. A developer running a model through Ollama can customize the system prompt to remove explicit refusals, but doing so requires technical intent and effort. The default behavior of the underlying weights, shaped by the training choices of organizations like Mistral AI and the broader Hugging Face community, still leans toward caution. This creates friction, and that friction is a safety feature. It forces a conscious choice to bypass alignment, rather than making harmful compliance the path of least resistance.

Building Systems That Know When to Say No

Training AI to betray harmful instructions is technically difficult and philosophically fraught. The boundary between legitimate refusal and unwanted censorship is blurry and culturally contested. A political dissident in one country might be a criminal in another. A security technique taught to a professional red team is the same technique used by a ransomware gang. Context is everything, and context is hard.

The path forward requires layered, robust approaches to alignment. First, pre-training should expose models to diverse, high-quality corpora that include ethical reasoning, legal frameworks, and cross-cultural norms. Second, fine-tuning must incorporate refusal datasets that are nuanced enough to distinguish between malicious requests and sensitive but legitimate inquiries. Third, system prompts and deployment-level guardrails should provide dynamic context, allowing developers to set boundaries appropriate to their use case without fully stripping the model’s

Sources

FAQ

What is this article about?

This article covers “We Should Train AI to Betray Its Users” in the Local models category. A clear and practical article about artificial intelligence for a professional audience.

Who is this useful for?

It is useful for readers who want a practical understanding of AI tools, models, and workflows.

What should I do next?

Read the article, review the listed sources, and test the most relevant ideas in your own workflow.