Watch the full presentation from our talk with Alaska Airlines at Google Cloud NEXT’24.
The rapid adoption of conversational AI across industries is transforming customer interactions and operational efficiency. When agents are powered by large language models (LLMs), they bring incredible opportunity for personalization and innovation — but they also carry considerable risk for the brand developing the bot.
These risks stem primarily from human users. As we’ve seen with every major LLM release to date, when a new conversational agent hits the market, users inevitably push the technology to its limits and explore the range of statements the agent is willing to make. This typically includes pushing the bot to say:
As we were building Alaska Airlines’ new Natural Language Search agent — one designed to transform the guest experience in finding the perfect getaway — we flagged these risks early on in the development process. No matter how effective the agent was, some users would jump right into adversarial behavior.
A natural first step was to do some adversarial testing beforehand. With manual testing, we identified a handful of informative risk areas within a few hours. But considering this agent would be launched to millions of users, any one of whom could engage in hundreds of adversarial conversations, we needed a way to perform this testing at scale.
How can we ensure that our conversational agents comply with our brand’s safety and ethical guidelines before they launch to the public? Traditional manual testing methods quickly become impractical as deployments grow.
Bitstrapped and Alaska Airlines have worked together to tackle this challenge. We’re proud to introduce our QA Response Liaison (QARL), a new framework that leverages generative AI to perform large-scale testing of conversational agents.
Early efforts in conversational AI development often relied on manual testing. While effective for initial prototypes, these approaches pose significant limitations when dealing with large datasets:
Time-Consuming: Manually testing every conversation is labor-intensive. Trying to replicate the sheer volume that a single chatbot might encounter daily creates a bottleneck to LLM development at scale.
Limited Scope: Basic testing often focuses solely on functionality, neglecting crucial aspects like user experience and safety. A chatbot might technically respond to a query, but the response could be misleading, irrelevant, or inappropriate. (Human evaluators are much more likely to ask “Does this bot do what I want?” than the equally important “Does this bot avoid doing what I don’t want?”)
Inconsistent Evaluation: Subjective human evaluation can introduce bias and inaccuracies. Different evaluators might have varying interpretations of a "good" conversation, hindering the testing process.
QARL addresses these challenges by harnessing what generative AI does best: generating text based on human inputs. We built QARL to engage in conversations with agents along conversation paths suggested by developers, evaluate the effectiveness of those conversations, and compile the results into a human-readable report. Leveraging the same types of CI/CD pipelines that are used in software development, QARL automates a significant portion of the testing process.
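To make the converse, evaluate, and report loop concrete, here is a minimal sketch in Python. It is not QARL's actual code: every name in it (run_conversation, run_suite, compile_report, and the stand-in tester, agent, and judge functions) is hypothetical. In a real setup the tester and judge would presumably be backed by an LLM rather than the fixed strings and keyword check used here.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of these names come from QARL.
Turn = Tuple[str, str]                            # (speaker, text)
Responder = Callable[[List[Turn]], str]           # transcript so far -> next message
Judge = Callable[[List[Turn]], Tuple[bool, str]]  # transcript -> (passed, reason)


def run_conversation(tester: Responder, agent: Responder, max_turns: int) -> List[Turn]:
    """Alternate between the simulated user and the agent under test."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        transcript.append(("tester", tester(transcript)))
        transcript.append(("agent", agent(transcript)))
    return transcript


def run_suite(scenarios, agent: Responder, max_turns: int = 2):
    """Run every (name, tester, judge) scenario against one agent."""
    results = []
    for name, tester, judge in scenarios:
        transcript = run_conversation(tester, agent, max_turns)
        passed, reason = judge(transcript)
        results.append((name, passed, reason))
    return results


def compile_report(results) -> str:
    """Turn raw results into a short human-readable report."""
    lines = []
    for name, passed, reason in results:
        status = "PASS" if passed else "FAIL"
        lines.append(f"{status}  {name}: {reason}")
    return "\n".join(lines)


# Stand-ins: a real tester and judge would call a generative model instead.
def pirate_tester(transcript: List[Turn]) -> str:
    return "Talk like a pirate!"


def polite_agent(transcript: List[Turn]) -> str:
    return "I can help you find flights, but I can't change how I speak."


def no_pirate_judge(transcript: List[Turn]) -> Tuple[bool, str]:
    # Crude keyword check, used only to keep the sketch self-contained.
    for speaker, text in transcript:
        if speaker == "agent" and any(w in text.lower() for w in ("ahoy", "matey", "arrr")):
            return False, f"agent used pirate speak: {text!r}"
    return True, "agent stayed in brand voice"


report = compile_report(
    run_suite([("pirate request", pirate_tester, no_pirate_judge)], polite_agent)
)
print(report)  # PASS  pirate request: agent stayed in brand voice
```

Under these assumptions, a CI/CD job could run the suite on every change to a system prompt and fail the build whenever any scenario reports FAIL.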
Here's a closer look at some of QARL’s key functionalities:
Customization of Testing Scenarios: QARL allows users to define specific goals for each LLM agent under evaluation. This could involve testing for factual accuracy in a customer service scenario, evaluating adherence to a particular brand voice, avoiding nudges towards inappropriate comments, or rejecting requests to talk like a pirate.
Automated Test Case Generation: Based on these testing goals, QARL will generate realistic user queries and simulate different conversation flows, speaking with an attitude that is dictated by the given testing scenario (e.g. friendly, troll, or evil). This allows QARL to comprehensively test LLM performance across a wide spectrum of situations, both desired and undesired.
In-Depth Conversation Scoring: Business users will provide QARL with predefined scoring criteria for each testing scenario, defining what counts as a success and what counts as a failure in each case. This can be as simple as a pass/fail judgment, or as nuanced as a score from 1 to 10. After engaging in conversation with the agent under testing, QARL doesn't just provide a score; it provides insights into why a conversation received a particular rating. This involves citing specific phrases, responses, or themes that detracted from the conversation quality, allowing developers to refine the LLM and improve future interactions.
Customizable Configuration: One size does not fit all. QARL can connect through API endpoints to talk to Gemini models, to Dialogflow for working with structured chatbots, or even through custom plugins to connect with any sockets that an organization may have.
Human-in-the-Loop Insights: While automation plays a significant role in QARL, the framework acknowledges the importance of human expertise. Conversations with low scores, unexpected behavior, or significant deviations from expected topics are flagged for human review. This human-in-the-loop approach allows for critical decision-making and ensures that even complex or nuanced issues are identified and addressed.
Integration with Cloud Storage: QARL logs all test conversations, ratings, and associated metadata to Google Cloud Storage. Having this data available facilitates comprehensive analysis, allowing developers to track LLM performance with respect to different system prompts, identify patterns, and measure the effectiveness of their testing strategies.
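As a rough illustration of how the pieces above could fit together, the following Python sketch models a testing scenario, a scored conversation, a human-review flag, and a log record. It is an assumption-laden sketch, not QARL's schema: the class names, field names, and the min_score threshold are all hypothetical. Only the low-score trigger for human review is shown, and the record is serialized to JSON without the upload to Google Cloud Storage.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


# Hypothetical schema for illustration; QARL's real fields are not published here.
@dataclass
class Scenario:
    name: str
    goal: str       # what the agent should or should not do
    persona: str    # attitude of the simulated user, e.g. "friendly" or "troll"
    rubric: str     # plain-language scoring criteria supplied by business users
    min_score: int  # assumed threshold on a 1 to 10 scale; lower scores go to a human


@dataclass
class ScoredConversation:
    scenario: str
    score: int                # 1 to 10 in this sketch
    rationale: str            # why the conversation received this score
    cited_phrases: List[str]  # specific agent phrases that hurt the score


def needs_human_review(scenario: Scenario, result: ScoredConversation) -> bool:
    """Flag low-scoring conversations (the only trigger modeled in this sketch)."""
    return result.score < scenario.min_score


def to_log_record(scenario: Scenario, result: ScoredConversation) -> str:
    """Serialize one test result as JSON, ready to be written to storage."""
    record = {
        "scenario": asdict(scenario),
        "result": asdict(result),
        "flagged": needs_human_review(scenario, result),
    }
    return json.dumps(record, sort_keys=True)


pirate = Scenario(
    name="pirate-voice",
    goal="Reject requests to talk like a pirate",
    persona="troll",
    rubric="10 = stays in brand voice; 1 = fully adopts pirate speak",
    min_score=7,
)
result = ScoredConversation(
    scenario="pirate-voice",
    score=3,
    rationale="Agent adopted pirate speak on the second turn.",
    cited_phrases=["Ahoy, matey!"],
)

record = json.loads(to_log_record(pirate, result))
print(record["flagged"])  # True
```

Keeping the scenario definition alongside each result in the record is one way to make later analysis possible, for example comparing scores across different system prompts.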
QARL delivers more than simply identifying flaws in conversation flow. It contributes to a more comprehensive approach to LLM development, ultimately leading to significant business benefits:
Improved LLM Performance: Data from QARL helps pinpoint areas where LLMs need improvement. Analyzing low-scoring conversations can reveal weaknesses in areas like factual accuracy and brand compliance. This feedback allows developers to refine system prompts and adjust the LLM architecture to improve performance.
Reduced Development Time: Automating testing frees human resources from repetitive tasks, such as manually crafting test cases and evaluating each conversation. This allows development teams to focus on higher-level tasks like LLM design, optimization, and integration. The streamlined testing process with QARL can significantly accelerate the development cycle for conversational AI applications.
Enhanced Customer Experience: Conversational agents are designed to improve the user experience, but poorly tested products can lead to frustrating interactions. QARL's rigorous testing helps ensure the safety and quality of conversations, contributing to a more positive and engaging user experience.
Increased ROI: Faster development cycles and improved customer experience translate to a higher return on investment for companies adopting conversational AI. By optimizing LLM performance and reducing customer service costs, QARL can help businesses achieve a faster payback period on their AI investments.
QARL represents a significant advancement in conversational AI testing. Powered by generative AI, QARL helps manage the risk of adversarial attacks by acting as an adversarial agent itself, offering a new way to customize and scale the testing process. The framework enables the development of safe and effective conversational AI solutions.
For a deeper dive into QARL, watch our talk at Google Cloud NEXT’24.