Generate training data and cost-effectively train categorical models with Amazon Bedrock

In this post, we explore how you can use Amazon Bedrock to generate high-quality categorical ground truth data, which is crucial for training machine learning (ML) models in a cost-sensitive environment. Generative AI solutions can play an invaluable role during the model development phase by simplifying training and test data creation for multiclass classification supervised learning use cases. We dive deep into this process, showing how to use XML tags to structure the prompt and guide Amazon Bedrock in generating a balanced labeled dataset with high accuracy. We also showcase a real-world example for predicting the root cause category for support cases. This use case, solvable through ML, can enable support teams to better understand customer needs and optimize response strategies.

Business challenge

The exploration and methodology described in this post address two key challenges: the costs associated with generating a ground truth dataset for multiclass classification use cases can be prohibitive, and conventional approaches and synthetic dataset creation strategies often fail to produce balanced classes or meet the performance requirements of real-world use cases.

Ground truth data generation is expensive and time consuming

Ground truth annotation needs to be accurate and consistent, often requiring significant time and expertise to make sure the dataset is balanced, diverse, and large enough for model training and testing. For a multiclass classification problem such as support case root cause categorization, this challenge is compounded many times over.

Let’s say the task at hand is to predict the root cause categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry) for customer support cases. Based on our experiments using best-in-class supervised learning algorithms available in AutoGluon, we arrived at a training dataset size of 3,000 samples per category to attain an accuracy of 90%. This requirement translates into a time and effort investment by trained personnel, who could be support engineers or other technical staff, to review tens of thousands of support cases to arrive at an even distribution of 3,000 per category. With each support case and its related correspondences averaging 5 minutes of review and assessment from a human labeler, this translates into 1,500 hours (5 minutes x 18,000 support cases) of work, or 188 days assuming an 8-hour workday. Besides the time spent on review and labeling, there is an upfront investment in training the labelers so that the work, split between 10 or more labelers, remains consistent. To break this down further, a ground truth labeling campaign split between 10 labelers would require close to 4 weeks to label 18,000 cases if the labelers spend 40 hours a week on the exercise.
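The following back-of-the-envelope calculation reproduces these numbers (a minimal sketch; the per-case review time, team size, and weekly hours are the assumptions stated above):

# Estimate of the manual labeling effort described above.
CATEGORIES = 6
SAMPLES_PER_CATEGORY = 3_000
MINUTES_PER_CASE = 5        # assumed average human review time per case
LABELERS = 10               # assumed size of the labeling team
HOURS_PER_WEEK = 40

total_cases = CATEGORIES * SAMPLES_PER_CATEGORY              # 18,000 cases
total_hours = total_cases * MINUTES_PER_CASE / 60            # 1,500 hours
workdays = total_hours / 8                                   # ~188 eight-hour days
weeks_for_team = total_hours / (LABELERS * HOURS_PER_WEEK)   # ~3.75 weeks

print(f"{total_cases} cases -> {total_hours:.0f} hours "
      f"(~{workdays:.0f} workdays, or ~{weeks_for_team:.1f} weeks for {LABELERS} labelers)")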

Not only is such an extended and effort-intensive campaign expensive, it can also introduce labeling inconsistencies each time a labeler sets the task aside and resumes it later. The exercise also doesn’t guarantee a balanced labeled ground truth dataset, because some root cause categories such as Customer Education could be far more common than Feature Request or Software Defect, thereby extending the campaign.

Conventional techniques to get balanced classes or synthetic data generation have shortfalls

A balanced labeled dataset is critical for a multiclass classification use case to mitigate bias and make sure the model learns to accurately classify all classes, rather than favoring the majority class. If the dataset is imbalanced, with one or more classes having significantly fewer instances than others, the model might struggle to learn the patterns and features associated with the minority classes, leading to poor performance and biased predictions. This issue is particularly problematic in applications where accurate classification of minority classes is critical, such as medical diagnoses, fraud detection, or root cause categorization. For the use case of labeling the support root cause categories, it’s often harder to source examples for categories such as Software Defect, Feature Request, and Documentation Improvement for labeling than it is for Customer Education. This results in an imbalanced class distribution for training and test datasets.

To address this challenge, various techniques can be employed, including oversampling the minority classes, undersampling the majority classes, using ensemble methods that combine multiple classifiers trained on different subsets of the data, or synthetic data generation to augment minority classes. However, the ideal approach for achieving optimal performance is to start with a balanced and highly accurate labeled dataset for ground truth training.
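As an illustration of the conventional route, the following sketch oversamples minority classes by resampling with replacement (a minimal example assuming a pandas DataFrame with a hypothetical root_cause label column); it rebalances the counts but cannot add genuinely new examples:

import pandas as pd
from sklearn.utils import resample

def oversample_minority_classes(df: pd.DataFrame, label_col: str = "root_cause") -> pd.DataFrame:
    """Naive oversampling: resample every class up to the size of the largest class."""
    target_size = df[label_col].value_counts().max()
    balanced_parts = [
        resample(group, replace=True, n_samples=target_size, random_state=42)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so classes are interleaved rather than grouped by label
    return pd.concat(balanced_parts).sample(frac=1, random_state=42).reset_index(drop=True)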

Although oversampling for minority classes means extended and expensive data labeling with humans who review the support cases, synthetic data generation to augment the minority classes poses its own challenges. For the multiclass classification problem to label support case data, synthetic data generation can quickly result in overfitting. This is because it can be difficult to synthesize real-world examples of technical case correspondences that contain complex content related to software configuration, implementation guidance, documentation references, technical troubleshooting, and the like.

Because ground truth labeling is expensive and synthetic data generation isn’t an option for use cases such as root cause prediction, the effort to train a model is often put aside. This results in a missed opportunity to review the root cause trends that can guide investment in the right areas such as education for customers, documentation improvement, or other efforts to reduce the case volume and improve customer experience.

Solution overview

The preceding section discussed why conventional ground truth data generation techniques aren’t viable for certain supervised learning use cases and fall short in training a highly accurate model to predict the support case root cause in our example. Let’s look at how generative AI can help solve this problem.

Generative AI supports key use cases such as content creation, summarization, code generation, creative applications, data augmentation, natural language processing, scientific research, and many others. Amazon Bedrock is well-suited for this data augmentation exercise to generate high-quality ground truth data. Using highly tuned and custom tailored prompts with examples and techniques discussed in the following sections, support teams can pass the anonymized support case correspondence to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock or other available large language models (LLMs) to predict the root cause label for a support case from one of the many categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry). After achieving the desired accuracy, you can use this ground truth data in an ML pipeline with automated machine learning (AutoML) tools such as AutoGluon to train a model and inference the support cases.

Checking LLM accuracy for ground truth data

To evaluate an LLM for the task of category labeling, the process begins by determining whether labeled data is available. If labeled data exists, the next step is to check whether the model’s use case produces discrete outcomes. Where discrete outcomes with labeled data exist, standard ML metrics such as precision, recall, or other classic classification metrics can be used. These metrics provide high precision, but they apply only to use cases where ground truth data is available.

If the use case doesn’t yield discrete outputs, task-specific metrics are more appropriate. These include metrics such as ROUGE or cosine similarity for text similarity, and specific benchmarks for assessing toxicity (Detoxify), prompt stereotyping (cross-entropy loss), or factual knowledge (HELM, LAMA).

If labeled data is unavailable, the next question is whether the testing process should be automated. The automation decision depends on the cost-accuracy trade-off, because higher accuracy comes at a higher cost. For cases where automation is not required, human-in-the-loop (HIL) approaches can be used. This involves manual evaluation based on predefined assessment rules (for example, ground truth), yielding high evaluation precision, but it is often time-consuming and costly.

When automation is preferred, using another LLM to assess outputs can be effective. Here, a reliable LLM can be instructed to rate generated outputs, providing automated scores and explanations. However, the precision of this method depends on the reliability of the chosen LLM. Each path represents a tailored approach based on the availability of labeled data and the need for automation, allowing for flexibility in assessing a wide range of foundation model (FM) applications.

The following figure illustrates an FM evaluation workflow.

For the use case, if a historic collection of 10,000 or more support cases labeled using Amazon SageMaker Ground Truth with HIL is available, it can be used for evaluating the accuracy of the LLM prediction. The key goal for generating new ground truth data using Amazon Bedrock should be to augment it for increasing diversity and increasing the training data size for AutoGluon training to arrive at a performant model that can be used for the final inference or root cause prediction. In the following sections, we explain how to take an incremental and measured approach to improve Anthropic’s Claude 3.5 Sonnet prediction accuracy through prompt engineering.
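If such a labeled set exists, a few lines of scikit-learn are enough to score the LLM’s predictions against it (a sketch; the file name and the human_label and claude_label column names are assumptions for illustration):

import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Assumed layout: one row per case with the human label and the LLM prediction.
df = pd.read_csv("labeled_cases_with_claude_predictions.csv")

print("Overall accuracy:", accuracy_score(df["human_label"], df["claude_label"]))
# Per-class precision and recall show whether minority classes such as
# Feature Request or Software Defect are being predicted reliably.
print(classification_report(df["human_label"], df["claude_label"]))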

Prompt engineering for FM accuracy and consistency

Prompt engineering is the art and science of designing a prompt to get an LLM to produce the desired output. We suggest consulting LLM prompt engineering documentation such as Anthropic prompt engineering for experiments. Based on experiments conducted without a finely tuned and optimized prompt, we observed low accuracy rates of less than 60%. In the following sections, we provide a detailed explanation on how to construct your first prompt, and then gradually improve it to consistently achieve over 90% accuracy.

Designing the prompt

Before starting any scaled use of generative AI, you should have the following in place:

  • A clear definition of the problem you are trying to solve along with the end goal.
  • A way to test the model’s output for accuracy. A thumbs up/down review, combined with comparison against the 10,000-case dataset labeled through SageMaker Ground Truth, is well suited for this exercise.
  • A defined success criterion on how accurate the model needs to be.

It’s helpful to think of an LLM as a new employee who is very well read, but knows nothing about your culture, your norms, what you are trying to do, or why you are trying to do it. The LLM’s performance will depend on how precisely you can explain what you want. How would a skilled manager handle a very smart, but new and inexperienced employee? The manager would provide contextual background, explain the problem, explain the rules they should apply when analyzing the problem, and give some examples of what good looks like along with why it is good. Later, if they saw the employee making mistakes, they might try to simplify the problem and provide constructive feedback by giving examples of what not to do, and why. One difference is that an employee would understand the job they are being hired for, so we need to explicitly tell the LLM to assume the persona of a support employee.

Prerequisites

To follow along with this post, set up Amazon SageMaker Studio to run Python in a notebook and interact with Amazon Bedrock. You also need the appropriate permissions to access Amazon Bedrock models.

Set up SageMaker Studio

Complete the following steps to set up SageMaker Studio:

  1. On the SageMaker console, choose Studio under Applications and IDEs in the navigation pane.
  2. Create a new SageMaker Studio instance if you haven’t already.
  3. If prompted, set up a user profile for SageMaker Studio by providing a user name and specifying AWS Identity and Access Management (IAM) permissions.
  4. Open a SageMaker Studio notebook:
    • Choose JupyterLab.
    • Create a private JupyterLab space.
    • Configure the space (set the instance type to ml.m5.large for optimal performance).
    • Launch the space.
    • On the File menu, choose New and Notebook to create a new notebook.
  5. Configure SageMaker to meet your security and compliance objectives. Refer to Configure security in Amazon SageMaker AI for details.

Set up permissions for Amazon Bedrock access

Make sure you have the following permissions:

  • IAM role with Amazon Bedrock permissions – Make sure that your SageMaker Studio execution role has the necessary permissions to access Amazon Bedrock. Attach the AmazonBedrockFullAccess policy or a custom policy with specific Amazon Bedrock permissions to your IAM role.
  • AWS SDKs and authentication – Verify that your AWS credentials (usually from the SageMaker role) have Amazon Bedrock access. Refer to Getting started with the API to set up your environment to make Amazon Bedrock requests through the AWS API.
  • Model access – Grant permission to use Anthropic’s Claude 3.5 Sonnet. For instructions, see Add or remove access to Amazon Bedrock foundation models.
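To confirm that the role and model access are in place before running the notebook, you can list the Anthropic models visible to your credentials with the Amazon Bedrock control-plane API (a quick check; the Region is an example):

import boto3

# Control-plane client for model management; the runtime client used later is "bedrock-runtime".
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.list_foundation_models(byProvider="Anthropic")
for summary in response["modelSummaries"]:
    print(summary["modelId"])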

Test the code using the native inference API for Anthropic’s Claude

The following code uses the native inference API to send a text message to Anthropic’s Claude. The Python code invokes the Amazon Bedrock Runtime service:

import boto3
import json

# Create an Amazon Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Set the model ID, e.g., Anthropic's Claude 3.5 Sonnet.
model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"

# Load the prompt from a file (shown and explained later in this post)
with open('prompt.txt', 'r') as file:
    data = file.read()


def callBedrock(body):
    # Format the request payload using the model's native structure.
    prompt = data + body

    # Truncate the prompt to the maximum input window size of Claude 3.5 Sonnet.
    prompt = prompt[:180000]

    # Define the parameters passed to the model.
    native_request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "temperature": 0.2,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}],
            }
        ],
    }

    # Convert the native request to JSON.
    request = json.dumps(native_request)

    try:
        # Invoke the model with the request.
        response = client.invoke_model(modelId=model_id, body=request)
    except Exception as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        return None

    # Load the response returned from Amazon Bedrock into a JSON object.
    model_response = json.loads(response["body"].read())

    # Extract and return the response text.
    response_text = model_response["content"][0]["text"]
    return response_text
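A minimal usage example follows (the case text is illustrative):

# Example invocation with an anonymized support case correspondence (illustrative text).
case_body = (
    "Customer: Our nightly export job fails with error DB-7721 after upgrading to 4.2.1. "
    "Agent: We reproduced the issue and engineering is preparing a hotfix."
)

print(callBedrock(case_body))
# With the final prompt shown later in this post, the output takes the form:
# <response><classification> ... </classification><explanation> ... </explanation></response>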

Construct the initial prompt

We demonstrate the approach for the specific use case of root cause prediction, with a goal of achieving 90% accuracy. Start by creating a prompt similar to one you would give to humans in natural language. This can be a simple description of each root cause label and why you would choose it, how to interpret the case correspondences, how to analyze and choose the corresponding root cause label, and examples for every category. Ask the model to also provide its reasoning so you can understand how it reached certain decisions. It can be especially interesting to understand the reasoning behind the decisions you don’t agree with. See the following example code:

Please familiarize yourself with these categories.  When you evaluate a case, evaluate the definitions in order and label the case with the first definition that fits.  If a case morphs from one type to another, choose the type the case started out as. 

Read the correspondence, especially the original request, and the last correspondence from the support agent to the customer. If there are a lot of correspondences, or the case does not seem straightforward to infer, read the correspondences date stamped in order to understand what happened. If the case references documentation, read or skim the documentation to determine whether the documentation clearly supports what the support agent mentioned and whether it answers the customer's issue.

Software Defect:  “Software Defect” are cases where the application does not work as expected. The support agent confirms this through analysis and troubleshooting and mentions internal team is working on a fix or patch to address the bug or defect. 

An example of Software Defect case is [Customer: "Our data pipeline jobs are failing with a 'memory allocation error' during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We've verified our infrastructure meets all requirements." Agent: "After analyzing the logs, we've confirmed a memory leak in the aggregation module - a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue."]
....  

Analyze the results

We recommend using a small sample (for example, 150 random cases), running them through Anthropic’s Claude 3.5 Sonnet with the initial prompt, and manually checking the initial results. You can load the input data and model output into Excel and add the following columns for analysis (a pandas alternative is sketched after this list):

  • Claude Label – A calculated column with Anthropic’s Claude’s category
  • Label – True category after reviewing each case and selecting a specific root cause category to compare with the model’s prediction and derive an accuracy measurement
  • Close Call – 1 or 0 so that you can take numerical averages
  • Notes – For cases where there was something noteworthy about the case or inaccurate categorizations
  • Claude Correct – A calculated column (0 or 1) based on whether our category matched the model’s output category
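If you prefer to script this analysis instead of using Excel, the same columns can be computed with pandas (a sketch; the file name is an assumed export of the review spreadsheet with the columns listed above):

import pandas as pd

# One row per sampled case: the model's label, the reviewer's label, and review notes.
df = pd.read_csv("sample_150_cases_reviewed.csv")

df["Claude Correct"] = (df["Claude Label"] == df["Label"]).astype(int)

print(f"Accuracy: {df['Claude Correct'].mean():.0%}, close calls: {df['Close Call'].mean():.0%}")

# The confusion between categories highlights which definitions need sharper prompt guidance.
print(pd.crosstab(df["Label"], df["Claude Label"]))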

Although the first run is expected to have accuracy too low to use the prompt for generating ground truth data, the reasoning will help you understand why Anthropic’s Claude mislabeled the cases. In our example, many of the misses fell into the following categories and the accuracy was only 61%:

  • Cases where Anthropic’s Claude categorized Customer Education cases as Software Defect because it interpreted the support agent instructions to reconfigure something as a workaround for a Software Defect.
  • Cases where users asked questions about billing that Anthropic’s Claude categorized as Customer Education. Although billing questions could also be Customer Education cases, we wanted these to be categorized as the more specific Billing Inquiry category. Likewise, although Security Awareness cases are also Customer Education, we wanted to categorize these as the more specific Security Awareness category.

Iterate on the prompt and make changes

Providing the LLM with explicit instructions for correcting these errors should result in a major boost in accuracy. We tested the following adjustments with Anthropic’s Claude:

  • We defined and assigned a persona with background information for the LLM: “You are a Support Agent and an expert on the enterprise application software. You will be classifying customer cases into categories…”
  • We ordered the categories from more deterministic and well-defined to less specific and instructed Anthropic’s Claude to evaluate the categories in the order they appear in the prompt.
  • Following the suggestion in the Anthropic documentation to use XML tags, we enclosed the root cause categories in lightweight XML rather than a formal XML document, with elements delimited by tags. It’s ideal to create a categories node with a separate sub-node for each category. Each category node should consist of the category name, a description, and what the output should look like, delimited by begin and end tags. The following is a snippet:
You are a Support Agent and an expert on the enterprise application software. You will be classifying the customer support cases into categories, based on the given interaction between an agent and a customer. You can only choose ONE Category from the list below. You follow instructions well, step by step, and evaluate the categories in the order they appear in the prompt when making a decision.

The categories are defined as:

<categories>
<category>
<name>
"Software Defect"
</name>
<description>
“Software Defect” are cases where the application software does not work as expected. The agent confirms the application is not working as expected and may refer to internal team working on a fix or patch to address the bug or defect. The category includes common errors or failures related to performance, software version, functional defect, unexpected exception or usability bug when the customer is following the documented steps.
</description>
</category>
...
</categories>

  • We created a good examples node with at least one good example for every category. Each good example consisted of the example, the classification, and the reasoning:

Here are some good examples with reasoning:

<good examples>
<example>
<example data>
Customer: "Our data pipeline jobs are failing with a 'memory allocation error' during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We've verified our infrastructure meets all requirements."
Agent: "After analyzing the logs, we've confirmed a memory leak in the aggregation module - a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue."
</example data>
<classification>
"Software Defect"
</classification>
<explanation>
Customer is reporting a data processing exception with a specific version and the agent confirms this is a regression and defect. The agent confirms that engineering is working to provide an emergency patch for the issue. 
</explanation>
</example>
...
</good examples>

  • We created a bad examples node with examples of where the LLM miscategorized previous cases. The bad examples node has the same set of fields as the good examples (example data, classification, and explanation), but here the explanation describes the error. The following is a snippet:

Here are some examples for wrong classification with reasoning:

<bad examples>

    <example>
        <example data>
            Customer: "We need the ability to create custom dashboards that can aggregate data across multiple tenants in real-time. Currently, we can only view metrics per individual tenant, which requires manual consolidation for our enterprise reporting needs."
Agent: "I understand your need for cross-tenant analytics. While the current functionality is limited to single-tenant views as designed, I've submitted your request to our product team as a high-priority feature enhancement. They'll evaluate it for inclusion in our 2025 roadmap. I'll update you when there's news about this capability."
       </example data>
    <example output>
        <classification>
            "Software Defect"
        </classification>
        <explanation>
            Classification should be Feature Request and not Software Defect. The application does not have the function or capability being requested but it is working as documented or advertised. In the example, the agent mentions they have submitted with request to their product team to consider in the future roadmap.
        </explanation>
    </example>
...
</bad examples>

  • We also added instructions for how to format the output:
Given the above categories defined in XML, logically think through which category fits best and then complete the classification. Provide a response in XML with the following elements: classification, explanation (limited to 2 sentences). Return your results as this sample output XML below and do not append your thought process to the response.
 
<response> 
<classification> Software Defect </classification>
<explanation> The support case is for ETL Pipeline Performance Degradation where the customer reports their nightly data transformation job takes 6 hours to complete instead of 2 hours before but no changes to configuration occurred. The agent mentions Engineering confirmed memory leak in version 5.1.2 and are deploying a Hotfix indicating this is a Software Defect.
</explanation> 
</response> 

Test with the new prompt

The preceding approach should result in improved prediction accuracy. In our experiment, we saw 84% accuracy with the new prompt, and the output was consistent and more straightforward to parse. Anthropic’s Claude followed the suggested output format in almost all cases. We wrote code to fix errors such as unexpected tags in the output and drop responses that could not be parsed.

The following is the code to parse the output:

# This Python script parses LLM output into a comma-separated list with the SupportID, Category, Reason
# Command line is python parse_llm_output.py PathToLLMOutput.txt PathToParsedOutput.csv
# Note: it will overwrite the output file without confirming
# It writes completion status and any error messages to stdout
 
import re
import sys
 
# These tokens are based on the format of the Claude output.
# The pattern captures three groups - CaseID, RootCause, and Reasoning - which we extract with pattern.match.
pattern = re.compile(
    "^([0-9]*).*<classification>(.*)</classification><explanation>(.*)</explanation>"
)
 
endToken = "</response>"
checkToken = "<classification>"
 
acceptableClassifications = [
    "Billing Inquiry",
    "Documentation Improvement",
    "Feature Request",
    "Security Awareness",
    "Software Defect",
    "Customer Education",
]
 
def parseResponse(response):
    # parsing is trivial with regular expression groups
    m = pattern.match(response)
    return m
 
# get the input and output files
if len(sys.argv) != 3:
    print("Command line error parse_llm_output.py inputfile outputfile")
    exit(1)
 
# open the file
input = open(sys.argv[1], encoding="utf8")
output = open(sys.argv[2], "w")
 
# read the entire file in.  This works well with 30,000 responses, but would need to be adjusted for say 3,000,000 responses
responses = input.read()
 
# get rid of double quotes and newlines; they are unnecessary and can cause incorrect Excel parsing
responses = responses.replace('"', "")
responses = responses.replace("\n", "")
 
# initialize our placeholder, and counters
parsedChars = 0
skipped = 0
invalid = 0
responseCount = 0
 
# write the header
output.write("CaseID,RootCause,Reason\n")
 
# find the first response
index = responses.find(endToken, parsedChars)
 
while index > 0:
    # extract the response
    response = responses[parsedChars : index + len(endToken)]
    # parse it
    parsedResponse = parseResponse(response)
 
    # is the response valid
    if parsedResponse is None or len(response.split(checkToken)) != 2:
        # this happens when there is a missing </response> delimiter or some other formatting problem;
        # it corrupts this response and the next one, so count both as skipped
        skipped = skipped + 2
    else:
        # if we have a valid response, write it to the file; enclose the reason in double quotes because it may contain commas
        # compare case-insensitively and strip whitespace so " Software Defect " matches the expected labels
        if parsedResponse.group(2).strip().lower() not in [c.lower() for c in acceptableClassifications]:
            # make sure the classification is one we expect
            print("Invalid Classification: {0}".format(parsedResponse.group(2)))
            invalid = invalid + 1
        else:
            # write a valid line to the output file
            output.write(
                '{0},{1},"{2}"\n'.format(
                    parsedResponse.group(1),
                    parsedResponse.group(2),
                    parsedResponse.group(3),
                )
            )
 
    # move the pointer past where we parsed and update the counter
    parsedChars = index + len(endToken)
    responseCount = responseCount + 1
 
    # find the next response
    index = responses.find(endToken, parsedChars)
 
print("skipped {0} of {1} responses".format(skipped, responseCount))
print("{0} of these were invalid".format(invalid)) 

Most mislabeled cases were close calls or had very similar traits. For example, when a customer described a problem, the support agent suggested possible solutions and asked for logs in order to troubleshoot. However, the customer self-resolved the case and so the resolution details weren’t conclusive. For this scenario, the root cause prediction was inaccurate. In our experiment, Anthropic’s Claude labeled these cases as Software Defects, but the most likely scenario is that the customer figured it out for themselves and never followed up.

Continued fine-tuning of the prompt to adjust examples and include such scenarios incrementally can help to get over 90% prediction accuracy, as we confirmed with our experimentation. The following code is an example of how to adjust the prompt and add a few more bad examples:

<example>
<example data>
Subject: Unable to configure custom routing rules in application gateway
Customer: Our team can't set up routing rules in the application gateway. We've tried following the documentation but the traffic isn't being directed as expected. This is blocking our production deployment.
Agent: I understand you're having difficulties with routing rules configuration. To better assist you, could you please provide:
Current routing rule configuration
Application gateway logs
Expected traffic flow diagram
[No response from customer for 5 business days - Case closed by customer]
</example data>
    <example output>
      <classification>
       Software Defect
      </classification>
 <explanation>
Classification should be Customer Education and not Software Defect. The agent acknowledges the problem and asks the customer for additional information to troubleshoot, however, the customer does not reply and closes the case. Cases where the agent tells the customer how to solve the problem and provides documentation or asks for further details to troubleshoot but the customer self-resolves the case should be labeled Customer Education.
</explanation>
</example>

With the preceding adjustments and refinement to the prompt, we consistently obtained over 90% accuracy and noted that a few miscategorized cases were close calls where humans chose multiple categories including the one Anthropic’s Claude chose. See the appendix at the end of this post for the final prompt.

Run batch inference at scale with AutoGluon Multimodal

As illustrated in the previous sections, by crafting a well-defined and tailored prompt, Amazon Bedrock can help automate the generation of ground truth data with balanced categories. This ground truth data is necessary to train the supervised learning model for a multiclass classification use case. We suggest taking advantage of the preprocessing capabilities of SageMaker to further refine the fields, encoding them into a format that’s optimal for model ingestion. The manifest files can be set up as the catalyst, triggering an AWS Lambda function that sets the entire SageMaker pipeline into action. This end-to-end process seamlessly handles data inference and stores the results in Amazon Simple Storage Service (Amazon S3). We recommend using AutoGluon Multimodal for training and prediction, and deploying the model in a batch inference pipeline that predicts the root cause for new or updated support cases at scale on a daily cadence.
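The following is a minimal AutoGluon Multimodal sketch for this step, assuming the Bedrock-labeled ground truth has been consolidated into CSV files with a case_text column and a root_cause label column (both file and column names are illustrative):

import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Assumed layout: "case_text" holds the anonymized correspondence, "root_cause" the Bedrock-generated label.
train_df = pd.read_csv("ground_truth_train.csv")
new_cases_df = pd.read_csv("daily_support_cases.csv")

predictor = MultiModalPredictor(label="root_cause", problem_type="multiclass")
predictor.fit(train_df, time_limit=3600)  # cap training time; tune for your dataset size

# Batch inference for the day's new or updated support cases.
new_cases_df["predicted_root_cause"] = predictor.predict(new_cases_df)
new_cases_df.to_csv("daily_root_cause_predictions.csv", index=False)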

Clean up

To prevent unnecessary expenses, it’s essential to properly decommission all provisioned resources. This cleanup process involves stopping notebook instances and deleting JupyterLab spaces, SageMaker domains, S3 buckets, IAM roles, and associated user profiles. Refer to Clean up Amazon SageMaker notebook instance resources for details.

Conclusion

This post explored how Amazon Bedrock and advanced prompt engineering can generate high-quality labeled data for training ML models. Specifically, we focused on a use case of predicting the root cause category for customer support cases, a multiclass classification problem. Traditional approaches to generating labeled data for such problems are often prohibitively expensive, time-consuming, and prone to class imbalances. Amazon Bedrock, guided by XML prompt engineering, demonstrated the ability to generate balanced labeled datasets, at a lower cost, with over 90% accuracy for the experiment, and can help overcome labeling challenges for training categorical models for real-world use cases.

The following are our key takeaways:

  • Generative AI can simplify labeled data generation for complex multiclass classification problems
  • Prompt engineering is crucial for guiding LLMs to achieve desired outputs accurately
  • An iterative approach, incorporating good/bad examples and specific instructions, can significantly improve model performance
  • The generated labeled data can be integrated into ML pipelines for scalable inference and prediction using AutoML multimodal supervised learning algorithms for batch inference

Review your ground truth training costs, including the time, effort, and service costs of HIL labeling, and compare them against the cost of generating labels with Amazon Bedrock to plan your next categorical model training at scale.

Appendix

The following code is the final prompt:

You are a Support Agent and an expert in the enterprise application software. You will be classifying the customer support cases into one of the 6 categories, based on the given interaction between the Support Agent and a customer. You can only choose ONE Category from the list below. You follow instructions well, step by step, and evaluate the categories in the order they appear in the prompt when making a decision. 
 
The categories are defined as:
 
<categories>
 
<category>
<name>
"Billing Inquiry" 
</name>
<description>
“Billing Inquiry” cases are the ones related to Account or Billing inquiries and questions related to charges, savings, or discounts. It also includes requests to provide guidance on account closing, request for Credit, cancellation requests, billing questions, and questions about discounts.
</description>
</category>
 
<category>
<name>
"Security Awareness" 
</name>
<description>
“Security Awareness” cases are the cases associated with a security related incident. Security Awareness cases include exposed credentials, mitigating a security vulnerability, DDoS attacks, security concerns related to malicious traffic. Note that general security questions where the agent is helping to educate the user on the best practice such as SSO or MFA configuration, Security guidelines, or setting permissions for users and roles should be labeled as Customer Education and not Security Awareness. 
</description>
</category>
 
<category>
<name>
"Feature Request" 
</name>
<description>
“Feature Request” are the cases where the customer is experiencing a limitation in the application software and asking for a feature they want to have. Customer highlights a limitation and is requesting for the capability. For a Feature Request case, the support agent typically acknowledges that the question or expectation is a feature request for the software. Agent may use words such as the functionality or feature does not exist or it is currently not supported. 
</description>
</category>
 
<category>
<name>
"Software Defect" 
</name>
<description>
“Software Defect” are cases where the application does not work as expected. The support agent confirms this through analysis and troubleshooting and mentions internal team is working on a fix or patch to address the bug or defect. 
</description>
</category>
 
<category>
<name>
"Documentation Improvement" 
</name>
<description>
“Documentation Improvement” are cases where there is a lack of documentation, incorrect documentation, or insufficient documentation and when the case is not attributed to a Software Defect or a Feature Request. In Documentation Improvement cases the agent acknowledges the application documentation is incomplete or not up to date, or that they will ask documentation team to improve the documentation. For Documentation Improvement cases, the agent may suggest a workaround that is not part of application documentation and does not reference the standard application documentation or link. References to workarounds or sources such as Github or Stack Overflow, when used as an example of a solution, are examples of a Documentation Improvement case because the details and examples are missing from the official documentation.
</description>
</category>
 
<category>
<name>
"Customer Education" 
</name>
<description>
“Customer Education” cases are cases where the customer could have resolved the case using the existing application documentation. In these cases, the agent is educating the customer that they are not using the feature correctly or have an incorrect configuration, while guiding them to the documentation. Customer Education cases include scenarios where an agent provides troubleshooting steps for a problem or answers a question and provides links to the official application documentation. Customer Education cases include cases when the customer asks for best practices and the agent provides knowledge article links to the support center documentation. Customer Education also includes cases created by the agent or application developers to suggest and educate the customer on a change to reduce cost, improve security, or improve application performance. Customer Education cases include cases where the customer asks a question or requests help with an error or configuration and the agent guides them appropriately with steps or documentation links. Customer Education cases also include the cases where the customer is using an unsupported configuration or version that may be End Of Life (EOL). Customer Education cases also include inconclusive cases where the customer reported an issue with the application but the case is closed without resolution details.
</description>
</category>
 
</categories>
 
Here are some good examples with reasoning:
 
<good examples>
 
<example>
<example data>
Customer: "I noticed unexpected charges of $12,500 on our latest invoice, which is significantly higher than our usual $7,000 monthly spend. We haven't added new users, so I'm concerned about this increase."
Support: "I understand your concern about the increased charges. Upon review, I see that 50 Premium Sales Cloud licenses were automatically activated on January 15th when your sandbox environments were refreshed. I can help adjust your sandbox configuration and discuss Enterprise License Agreement options to optimize costs."
Customer: "Thank you for clarifying. Please tell me more about the Enterprise License options."
</example data>
<example output>
<classification>
"Billing Inquiry"
</classification>
<explanation>
Customer is asking a question to clarify the unexpected increase in their billing statement charge and the agent explains why this occurred. The customer wants to learn more about ways to optimize costs.
</explanation>
 
<example>
<example data>
Customer: "URGENT: We've detected unauthorized API calls from an unknown IP address accessing sensitive customer data in our production environment. Our monitoring shows 1000+ suspicious requests in the last hour."
Support: "I understand the severity of this security incident. I've immediately revoked the compromised API credentials and initiated our security protocol. The suspicious traffic has been blocked. I'm escalating this to our Security team for forensic analysis. I'll stay engaged until this is resolved."
</example data>
<example output>
<classification>
"Security Awareness"
</classification>
<explanation>
Customer reported unauthorized API calls and suspicious requests. The agent confirms revoking compromised API credentials and initiating the protocol.
</explanation>
 
<example>
<example data>
Customer: "Is there a way to create custom notification templates for different user groups? We need department-specific alert formats, but I can only find a single global template option."
Support: "I understand you're looking to customize notification templates per user group. Currently, this functionality isn't supported in our platform - we only offer the global template system. I'll submit this as a feature request to our product team. In the meantime, I can suggest using notification tags as a workaround."
Customer: "Thanks, please add my vote for this feature."
</example data>
<example output>
<classification>
"Feature Request"
</classification>
<explanation>
Customer is asking for a new feature to have custom notification templates for different user groups since they have a use case that is currently not supported by the application. The agent confirms the functionality does not exist and mentions submitting a feature request to the product team.
</explanation>
 
<example>
<example data>
Customer: "Our data pipeline jobs are failing with a 'memory allocation error' during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We've verified our infrastructure meets all requirements."
Support: "After analyzing the logs, we've confirmed a memory leak in the aggregation module - a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue."
</example data>
<example output>
<classification>
"Software Defect"
</classification>
<explanation>
Customer is reporting a data processing exception with a specific version and the agent confirms this is a regression and defect. The agent confirms that engineering is working to provide an emergency patch for the issue. 
</explanation>
 
<example>
<example data>
Customer: "The data export function is failing consistently when we include custom fields. The export starts but crashes at 45% with error code DB-7721. This worked fine last week before the latest release."
Support: "I've reproduced the issue in our test environment and confirmed this is a bug introduced in version 4.2.1. Our engineering team has identified the root cause - a query optimization error affecting custom field exports. They're working on a hotfix (patch 4.2.1.3)."
Customer: "Please notify when fixed."
</example data>
<example output>
<classification>
"Software Defect"
</classification>
<explanation>
This is a Software Defect as the data export function is not working as expected to export the custom fields. The agent acknowledged the issue and confirmed engineering is working on a hotfix.
</explanation>
 
<example>
<example data>
Customer: "I'm trying to implement the batch processing API but the documentation doesn't explain how to handle partial failures or provide retry examples. The current docs only show basic success scenarios."
Support: "The documentation is lacking detailed error handling examples for batch processing. I'll submit this to our documentation team to add comprehensive retry logic examples and partial failure scenarios. For now, I can share a working code snippet that demonstrates proper error handling and retry mechanisms."
Customer: "Thanks, the code example would help."
</example data>
<example output>
<classification>
Documentation Improvement
</classification>
<explanation>
The agent acknowledges the gap in the documentation and mentions they will pass on this to the documentation team for further improvements. Agent mentions providing a working code snippet with retry examples.
</explanation>
 
<example>
<example data>
Customer: "We can't get our SSO integration working. The login keeps failing and we're not sure what's wrong with our configuration."
Support: "I can help guide you through the SSO setup. Looking at your configuration, I notice the SAML assertion isn't properly formatted. Please follow our step-by-step SSO configuration guide here [link to docs]. Pay special attention to section 3.2 about SAML attributes. The guide includes validation steps to ensure proper integration."
Customer: "Found the issue in section 3.2. Working now, thanks!"
</example data>
<example output>
<classification>
Customer Education
</classification>
<explanation>
Customer is asking for help and guidance to get their SSO integration working. The agent went over the details and presented the steps along necessary along with the documentation links.
</explanation>
 
</good examples>
 
Here are some examples for wrong classification with reasoning:
 
<bad examples>
 
<example>
<example data>
Customer: "We want to enhance our application security. Currently, each team member has individual login credentials. What's the recommended approach?"
Support: "recommend implementing SAML-based SSO with your existing identity provider. This will:
Centralize authentication
Enable MFA enforcement
Streamline user provisioning
Enhance security auditing
</example data>
<example output>
<classification>
"Security Awareness"
</classification>
<explanation>
Classification should be Customer Education and not Security Awareness. General security questions where the agent is helping to educate the user such as Security guidelines and best practices, should be labeled as Customer Education.
</explanation>
</example>
 
<example>
<example data>
Customer: "Our SAP invoices aren't syncing instantly with Salesforce opportunities. We've configured MuleSoft Composer as per documentation, but updates only happen intermittently."
Support: "I understand you're looking for real-time synchronization. Currently, MuleSoft Composer's fastest sync interval is 15 minutes by design. While I can help optimize your current setup, I'll submit a feature request for real-time sync capability. Here's how to optimize the current polling interval: doc link"
</example data>
<example output>
<classification>
Customer Education
</classification>
<explanation>
Classification should be Feature Request and not Customer Education. The agent tells the customer that the fastest sync interval is 15 minutes by design. The agent also points out they will submit a Feature Request. Cases where the customer asks for features should be classified as Feature Request.
</explanation>
</example>
 
<example>
<example data>
Customer: "Our sales ETL pipeline keeps timing out with error 'V_001' at the transform step. This was working perfectly before."
Support: "I've analyzed your configuration. The timeout occurs because the transformation spans 5 years of data containing 23 cross-object formula fields and is running without filters. Please implement these optimization steps from our documentation: Document link on ETL performance"
</example data>
<example output>
<classification>
Software Defect
</classification>
<explanation>
Classification should be Customer Education and not Software Defect. The agent tells the user that timeout is caused by misconfiguration and needs to be restricted using filters. The agent provides documentation explaining how to troubleshoot the issue. Cases where the agent tells the user how to solve the problem and provides documentation should be labeled Customer Education.
</explanation>
</example>
 
<example>
<example data>
Customer: "We are trying to deploy a custom workflow template but receiving this error: Resource handler returned message: 'Error: Multiple or missing values for mandatory single-value field, Field: ACTION_TYPE, Parameter: Workflow Action (Status Code: 400, Request ID: TKT-2481-49bc)' when deploying through Flow Designer."
Support: "I've reviewed your Flow Designer deployment (instance: dev85xxx.xxx.com/flow/TKT-2481-49bc) which failed to create a Workflow Action resource. This error occurs when the action configuration is ambiguous. After checking the Flow Designer documentation [1], each Action Step in your template must define exactly one 'Action Type' attribute. The Flow Designer documentation [2] specifies that each workflow action requires a single, explicit action type definition. You cannot have multiple or undefined action types in a single step. This is similar to an issue reported in the Product Community [3]. Please review your workflow template and ensure each action step has exactly one defined Action Type. The documentation provides detailed configuration examples at [4]. Let me know if you need any clarification on implementing these changes.
</example data>
<example output>
<classification>
Documentation Improvement
</classification>
<explanation>
Classification should be Customer Education and not Documentation Improvement. The agent tells the user they have to change the action configuration and define an Action Type attribute. Cases where the agent tells the user how to solve the problem and provides documentation should be classified as Customer Education.
</explanation>
</example>
 
</bad examples>
 
Given the above categories defined in XML, logically think through which category fits best and then complete the classification. Provide a response in XML with the following elements: classification, explanation (limited to 2 sentences). Return your results as this sample output XML below and do not append your thought process to the response.
 
<response> 
<classification> Software Defect </classification>
<explanation> The support case is for ETL Pipeline Performance Degradation where the customer reports their nightly data transformation job takes 6 hours to complete instead of 2 hours before but no changes to configuration occurred. The agent mentions Engineering confirmed memory leak in version 5.1.2 and are deploying a Hotfix indicating this is a Software Defect.
</explanation> 
</response> 
 
Here is the conversation you need to categorize:


About the Authors

Sumeet Kumar is a Sr. Enterprise Support Manager at AWS leading the technical and strategic advisory team of TAM builders for automotive and manufacturing customers. He has diverse support operations experience and is passionate about creating innovative solutions using AI/ML.

Andy Brand is a Principal Technical Account Manager at AWS, where he helps education customers develop secure, performant, and cost-effective cloud solutions. With over 40 years of experience building, operating, and supporting enterprise software, he has a proven track record of addressing complex challenges.

Tom Coombs is a Principal Technical Account Manager at AWS, based in Switzerland. In Tom’s role, he helps enterprise AWS customers operate effectively in the cloud. From a development background, he specializes in machine learning and sustainability.

Ramu Ponugumati is a Sr. Technical Account Manager and a specialist in analytics and AI/ML at AWS. He works with enterprise customers to modernize and cost optimize workloads, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, playing badminton, and hiking.
