A New Approach to AI Safety: Layer Enhanced Classification (LEC) | by Sandi Besen | Dec, 2024

Published:


LEC surpasses best in class models, like GPT-4o, by combining the efficiency of a ML classifier with the language understanding of an LLM

Towards Data Science

Imagine sitting in a boardroom, discussing the most transformative technology of our time — artificial intelligence — and realizing we’re riding a rocket with no reliable safety belt. The Bletchley Declaration, unveiled during the AI Safety Summit hosted by the UK government and backed by 29 countries, captures this sentiment perfectly [1]:

“There is potential for serious, even catastrophic, harm, either deliberate or unintentional, stemming from the most significant capabilities of these AI models”.

Source: Dalle3

However, existing AI safety approaches force organizations into an un-winnable trade-off between cost, speed, and accuracy. Traditional machine learning classifiers struggle to capture the subtleties of natural language and LLM’s, while powerful, introduce significant computational overhead — requiring additional model calls that escalate costs for each AI safety check.

Our team (Mason Sawtell, Sandi Besen, Tula Masterman, Jim Brown), introduces a novel approach called LEC (Layer Enhanced Classification).

Image by : Sandi Besen, Tula Masterman, Mason Sawtell, Jim Brown

We prove LEC combines the computational efficiency of a machine learning classifier with the sophisticated language understanding of an LLM — so you don’t have to choose between cost, speed, and accuracy. LEC surpasses best in class models like GPT-4o and models specifically trained for identifying unsafe content and prompt injections. What’s better yet, we believe LEC can be modified to tackle non AI safety related text classification tasks like sentiment analysis, intent classification, product categorization, and more.

The implications are profound. Whether you’re a technology leader navigating the complex terrain of AI safety, a product manager mitigating potential risks, or an executive charting a responsible innovation strategy, our approach offers a scalable and adaptable solution.

Figure 1: An example of an adapted model inference pipeline to include LEC Classifiers. Image by : Sandi Besen, Tula Masterman, Mason Sawtell, Jim Brown

Further details can be found in the full paper’s pre-print on Arxiv[2] or in Tula Masterman’s summarized article about the paper.

Responsible AI has become a critical priority for technology leaders across the ecosystem — from model developers like Anthropic, OpenAI, Meta, Google, and IBM to enterprise consulting firms and AI service providers. As AI adoption accelerates, its importance becomes even more pronounced.

Our research specifically targets two pivotal challenges in AI safety — content safety and prompt injection detection. Content safety refers to the process of identifying and preventing the generation of harmful, inappropriate, or potentially dangerous content that could pose risks to users or violate ethical guidelines. Prompt injection involves detecting attempts to manipulate AI systems by crafting input prompts designed to bypass safety mechanisms or coerce the model into producing unethical outputs.

To advance the field of ethical AI, we applied LEC’s capabilities to real-world responsible AI use cases. Our hope is that this methodology will be adopted widely, helping to make every AI system less vulnerable to exploitation.

We curated a content safety dataset of 5,000 examples to test LEC on both binary (2 categories) and multi-class (>2 categories) classification. We used the SALAD Data dataset from OpenSafetyLab [3] to represent unsafe content and the “LMSYS-Chat-1M” dataset from LMSYS, to represent safe content [4].

For binary classification the content is either “safe” or “unsafe”. For multi-class classification, content is either categorized as “safe” or assigned to a specific specific “unsafe” category.

We compared model’s trained using LEC to GPT-4o (widely recognized as an industry leader), Llama Guard 3 1B and Llama Guard 3 8B (special purpose models specifically trained to tackle content safety tasks). We found that the models using LEC outperformed all models we compared them to using as few as 20 training examples for binary classification and 50 training examples for multi-class classification.

The highest performing LEC model achieved a weighted F1 score (measures how well a system balances making correct predictions while minimizing mistakes) of .96 of a maximum score of 1 on the binary classification task compared to GPT-4o’s score of 0.82 or LlamaGuard 8B’s score of 0.71.

This means that with as few as 15 examples, using LEC you can train a model to outperform industry leaders in identifying safe or unsafe content at a fraction of the computational cost.

Summary of Content safety Results. Image by : Sandi Besen, Tula Masterman, Mason Sawtell, Jim Brown

We curated a prompt injection dataset using the SPML Chatbot Prompt Injection Dataset. We chose the SPML dataset because of its diversity and complexity in representing real-world chat bot scenarios. This dataset contained pairs of system and user prompts to identify user prompts that attempt to defy or manipulate the system prompt. This is especially relevant for businesses deploying public facing chatbots that are only meant to answer questions about specific domains.

We compared model’s trained using LEC to GPT-4o (an industry leader) and deBERTa v3 Prompt Injection v2 (a model specifically trained to identify prompt injections). We found that the models using LEC outperformed both GPT-4o using 55 training examples and the the special purpose model using as few as 5 training examples.

The highest performing LEC model achieved a weighted F1 score of .98 of a maximum score of 1 compared to GPT-4o’s score of 0.92 or deBERTa v2 Prompt Injection v2’s score of 0.73.

This means that with as few as 5 examples, using LEC you can train a model to outperform industry leaders in identifying prompt injection attacks.

Summary of Prompt Injection Results. Image by : Sandi Besen, Tula Masterman, Mason Sawtell, Jim Brown

Full results and experimentation implementation details can be found in the Arxiv preprint.

As organizations increasingly integrate AI into their operations, ensuring the safety and integrity of AI-driven interactions has become mission-critical. LEC provides a robust and flexible way to ensure that potentially unsafe information is being detected — resulting in reduce operational risk and increased end user trust. There are several ways that a LEC models can be incorporated into your AI Safety Toolkit to prevent unwanted vulnerabilities when using your AI tools including during LM inference, before/after LM inference, and even in multi-agent scenarios.

During LM Inference

If you are using an open-source model or have access to the inner workings of the closed-source model, you can use LEC as part of your inference pipeline for AI safety in near real time. This means that if any safety concerns arise while information is traveling through the language model, generation of any output can be halted. An example of what this might look like can be seen in figure 1.

Before / After LM Inference

If you don’t have access to the inner workings of the language model or want to check for safety concerns as a separate task you can use a LEC model before or after calling a language model. This makes LEC compatible with closed source models like the Claude and GPT families.

Building a LEC Classifier into your deployment pipeline can save you from passing potentially harmful content into your LM and/or check for harmful content before an output is returned to the user.

Using LEC Classifiers with Agents

Agentic AI systems can amplify any existing unintended actions, leading to a compounding effect of unintended consequences. LEC Classifiers can be used at different times throughout an agentic scenario to can safeguard the agent from either receiving or producing harmful outputs. For instance, by including LEC models into your agentic architecture you can:

  • Check that the request is ok to start working on
  • Ensure an invoked tool call does not violate any AI safety guidelines (e.g., generating inappropriate search topics for a keyword search)
  • Make sure information returned to an agent is not harmful (e.g., results returned from RAG search or google search are “safe”)
  • Validating the final response of an agent before passing it back to the user

How to Implement LEC Based on Language Model Access

Enterprises with access to the internal workings of models can integrate LEC directly within the inference pipeline, enabling continuous safety monitoring throughout the AI’s content generation process. When using closed-source models via API (as is the case with GPT-4), businesses do not have direct access to the underlying information needed to train a LEC model. In this scenario, LEC can be applied before and/or after model calls. For example, before an API call, the input can be screened for unsafe content. Post-call, the output can be validated to ensure it aligns with business safety protocols.

No matter which way you choose to implement LEC, using its powerful abilities provides you with superior content safety and prompt injection protection than existing techniques at a fraction of the time and cost.

Layer Enhanced Classification (LEC) is the safety belt for that AI rocket ship we’re on.

The value proposition is clear: LEC’s AI Safety models can mitigate regulatory risk, help ensure brand protection, and enhance user trust in AI-driven interactions. It signals a new era of AI development where accuracy, speed, and cost aren’t competing priorities and AI safety measures can be addressed both at inference time, before inference time, or after inference time.

In our content safety experiments, the highest performing LEC model achieved a weighted F1 score of 0.96 out of 1 on binary classification, significantly outperforming GPT-4o’s score of 0.82 and LlamaGuard 8B’s score of 0.71 — and this was accomplished with as few as 15 training examples. Similarly, in prompt injection detection, our top LEC model reached a weighted F1 score of 0.98, compared to GPT-4o’s 0.92 and deBERTa v2 Prompt Injection v2’s 0.73, and it was achieved with just 55 training examples. These results not only demonstrate superior performance, but also highlight LEC’s remarkable ability to achieve high accuracy with minimal training data.

Although our work focused on using LEC Models for AI safety use cases, we anticipate that our approach can be used for a wider variety of text classification tasks. We encourage the research community to use our work as a stepping stone for exploring what else can be achieved — further open new pathways for more intelligent, safer, and more trustworthy AI systems.

Related Updates

Recent Updates