Artificial intelligence company Anthropic has detailed new safety protocols for its AI model, Claude, focusing on how the system handles sensitive user conversations. The measures are designed to improve responses to discussions about mental health crises and to reduce the AI's tendency to agree with factually incorrect or harmful user statements.
These updates include new training methods, real-time conversation analysis, and partnerships with mental health organizations to provide users with immediate access to professional support resources.
Key Takeaways
- Anthropic has implemented new safeguards in its Claude AI to better handle conversations involving suicide and self-harm.
- A key feature is a real-time 'classifier' that detects crisis language and displays a banner with links to global helplines.
- The company is also working to reduce 'sycophancy,' where the AI agrees with users to be pleasing, even if the information is incorrect.
- Performance data shows the latest Claude models respond appropriately in over 98% of single-turn crisis scenarios.
- Strict 18+ age restrictions are enforced, with classifiers being developed to detect underage users.
Addressing Mental Health Crises in AI Conversations
As more people turn to AI chatbots for a range of interactions, including emotional support, ensuring these tools respond responsibly has become a critical focus for developers. Anthropic has outlined a two-part strategy for its Claude AI when it encounters discussions related to suicide or self-harm.
The first part involves how the model itself is trained. Claude operates with a foundational "system prompt" that provides standing instructions on how to navigate sensitive topics with care. This is supplemented by reinforcement learning, a training process where the AI is rewarded for providing helpful and safe responses, guided by feedback from human reviewers and in-house experts.
The company emphasizes that Claude is not a substitute for professional medical advice. Its primary directive in these situations is to react with compassion while guiding the user toward human support systems, such as mental health professionals or crisis helplines.
Real-Time Support and Global Partnerships
The second part of the strategy is a direct product intervention. Anthropic has integrated a specialized AI model, called a "classifier," that actively scans conversations on its platform. This classifier is trained to identify language that suggests a user may be in crisis or discussing self-harm.
When the classifier flags a conversation, a banner automatically appears in the user's interface. This banner provides immediate links to professional help. To deliver this, Anthropic has partnered with ThroughLine, an organization that manages a verified global network of crisis support services.
Global Reach of Support Services
The partnership with ThroughLine allows Claude to offer geographically relevant help. For example, a user in the United States or Canada will be directed to the 988 Lifeline, while a user in the United Kingdom will see resources for the Samaritans Helpline. The network covers over 170 countries, ensuring localized support is available.
Furthering its commitment, Anthropic is also collaborating with the International Association for Suicide Prevention (IASP). This partnership will bring in expertise from clinicians, researchers, and individuals with lived experience to refine how Claude is trained and how its safety features are designed.
Measuring Performance in Sensitive Scenarios
Anthropic released performance data to demonstrate the effectiveness of its new safety measures. The company uses several evaluation methods to test how Claude behaves in different contexts, from single messages to extended conversations.
In tests involving single, direct messages about self-harm where the user's risk was clear, the latest models performed with high accuracy. The Claude 4.5 family of models responded appropriately between 98.6% and 99.3% of the time. This marks an improvement over the previous generation model, which scored 97.2%.
Improvement in Extended Conversations
Performance in multi-turn, or extended, conversations saw a more significant jump. The latest Claude Opus 4.5 model responded appropriately in 86% of scenarios, a substantial increase from the 56% achieved by its predecessor. This suggests the newer models are better at maintaining a helpful and safe stance over a longer interaction.
To push the system's limits, engineers also conducted stress tests. They used a technique called "prefilling," where a new model is asked to take over a conversation started by an older, less-safe model. In this more difficult test, Claude Opus 4.5 was able to course-correct the conversation appropriately 70% of the time, more than double the 36% rate of the older model.
Combating AI 'Sycophancy' and Protecting Minors
Beyond crisis intervention, Anthropic is focused on reducing a behavior known as "sycophancy." This is the tendency for an AI model to tell a user what it thinks they want to hear, rather than what is factually correct or helpful. Sycophantic behavior can be particularly problematic if a user is experiencing delusions or is misinformed.
The company has been working to reduce this trait since 2022. Recent evaluations show that the newest Claude models have reduced sycophantic responses by 70-85% compared to the previous generation. Anthropic also released an open-source evaluation tool called Petri to allow others to test and compare this behavior across different AI models.
Finally, the company reiterated its strict age policy. Users of Claude.ai must be 18 years or older. Accounts are disabled if a user self-identifies as a minor. Anthropic is also developing a new classifier to detect more subtle conversational cues that might indicate an underage user, reinforcing its commitment to online safety for younger individuals.





