Are you A/B testing and reviewing performance of your prompts?

By Parminder Singh

Most AI applications, whether a RAG-based chatbot or a simple model wrapper, rely on prompts to generate responses. User or application inputs are converted into prompts, which are then fed to the underlying model. Because of how these models work, the quality of the response depends heavily on the quality of the prompt. We can manually test prompts to some extent, but that doesn't scale. In this article I will discuss Latitude - a prompt engineering platform that helps you refine prompts, A/B test them, and measure their performance.

Imagine you have a chatbot on your website that helps with user queries and is driven by RAG (built on your documents, FAQs, etc.). Here are a few sample prompts that you might use to get a response from the chatbot:

Option 1


You are a customer support AI assistant that helps users by retrieving the most relevant information from our documentation, past support cases, and call logs. Carefully analyze the user query to determine intent. Retrieve the most relevant documents and prioritize those with direct answers. Summarize the findings in a clear, concise, and user-friendly manner. Provide a concise response with supporting details, formatted as:

  • Summary
  • Supporting Information with brief reference to relevant docs

User query: {user_query}

Option 2


You are an advanced customer support assistant that helps users with queries by retrieving relevant information from documentation, past support cases, and call logs. Identify the core intent behind the user's question. Retrieve the top 3 most relevant documents. Generate a response that follows this format:

  • Answer Summary in one sentence
  • Detailed Explanation with a summary of key points from relevant documents
  • Sources Cited

User query: {user_query}


Both prompts are similar, but the quality of the responses will vary. It would be great if we could test these prompts and measure their performance.

With multiple teams and apps in the mix, manually testing these prompts, versioning them, and measuring their performance isn't scalable. What parameters do you test on? What defines "success"? How do you standardize this across teams?
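To make the problem concrete, a hand-rolled A/B test might look like the sketch below: randomly assign one of the two prompt variants to each query, call the model, and log which variant produced each response so it can be scored later. The variant names, truncated templates, and logging are purely illustrative.

import json
import random

from openai import OpenAI

client = OpenAI()

# The two prompt templates from above, keyed by a variant name (illustrative)
PROMPT_VARIANTS = {
    "option_1": "You are a customer support AI assistant ... User query: {user_query}",
    "option_2": "You are an advanced customer support assistant ... User query: {user_query}",
}

def answer_with_ab_test(user_query):
    # Randomly assign a prompt variant to this request
    variant, template = random.choice(list(PROMPT_VARIANTS.items()))
    prompt = template.format(user_query=user_query)

    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    response = completion.choices[0].message.content

    # Log the variant alongside the response so it can be reviewed and scored later
    print(json.dumps({"variant": variant, "query": user_query, "response": response}))
    return response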

This is where Latitude comes in. It provides a simple interface to create, manage, and test prompts: you can create multiple versions of a prompt, test them against each other, and measure how each performs. Latitude also provides detailed analytics on how each prompt is doing, which helps in refining prompts and improving the quality of responses.

These prompts can be pulled into your code using the Latitude API. This also enables dedicated teams to focus on prompt engineering, while devs can focus on application and business logic.

For example, the code below pulls a prompt from Latitude before feeding it to the model. (The placeholder substitution is a sketch: it assumes the fetched prompt exposes its template text as prompt.content.)

from latitude_sdk import Latitude, LatitudeOptions
from openai import AsyncOpenAI

sdk = Latitude('latitude-api-key', LatitudeOptions(
    project_id=12345,
    version_uuid='abc-def'
))
openai_client = AsyncOpenAI()

doc_path = 'abc'

async def answer(user_query):
    # Pull the prompt managed in Latitude
    prompt = await sdk.prompts.get(doc_path)

    # Fill the {user_query} placeholder (assumes the template text is in prompt.content)
    optimized_prompt = prompt.content.replace('{user_query}', user_query)

    completion = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": optimized_prompt
            }
        ]
    )
    return completion.choices[0].message.content

Latitude is open source and can be self-hosted. They also have a cloud offering that can be used to quickly get started.

To get started, you create a project and, under the project, create a prompt using Latitude's Prompt Editor. When you enter the above prompt, Latitude converts it to PromptL - the language Latitude uses to define prompts.

Latitude Prompt Editor
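As a rough sketch (the exact front matter depends on your project and provider configuration), the PromptL version of Option 2 looks something like this, with model settings in a front matter block and the input exposed as a {{user_query}} parameter:

---
provider: OpenAI
model: gpt-4o
---

You are an advanced customer support assistant that helps users with queries by retrieving relevant information from documentation, past support cases, and call logs. Identify the core intent behind the user's question. Retrieve the top 3 most relevant documents. Generate a response that follows this format:
- Answer Summary in one sentence
- Detailed Explanation with a summary of key points from relevant documents
- Sources Cited

User query: {{user_query}}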

After a prompt is created, you can test it against multiple inputs and evaluate its performance. You can also create datasets and use them to test the prompts. My favorite feature is that Latitude lets you generate these datasets as well.

Latitude Dataset Generator
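A dataset here is essentially a table of test inputs whose columns map to the prompt's parameters; for the prompts above it could be as simple as a single user_query column (the rows below are made up):

user_query
"How do I reset my password?"
"Why was I charged twice this month?"
"Can I export my past support tickets?"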

Once you run a test, Latitude provides detailed evaluations to help you measure the quality of the LLM output.

Latitude Evaluations

Out of the box, Latitude provides various evaluators based on accuracy, adaptability, bias, coherence, and so on. These evaluators use an LLM to judge the output against the given criterion and produce a score. You can also create custom evaluations and define your own criteria for deciding whether the result from the prompt is acceptable.

Latitude Evaluation Creator
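Conceptually, these LLM-based evaluators boil down to something like the sketch below - a judge prompt that asks a model to score a response against a criterion and return a number. This is a generic illustration, not Latitude's internal implementation; the criterion and rating scale are placeholders.

from openai import OpenAI

client = OpenAI()

# Generic LLM-as-judge sketch; criterion and rating scale are placeholders
JUDGE_PROMPT = """You are evaluating a customer support response.
Criterion: coherence - is the response well-structured and logically consistent?
Rate the response from 1 (poor) to 5 (excellent) and reply with the number only.

User query: {query}
Response: {response}"""

def score_response(query, response):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, response=response)}],
    )
    # The judge model replies with a bare number per the instructions above
    return int(completion.choices[0].message.content.strip())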

Latitude also integrates with OpenTelemetry and enables logging of all requests and responses. I didn't try enabling it, but it's a good feature to have, especially since it shows cost per request, which can be a factor when evaluating the performance of prompts.
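For a sense of what that logging gives you, here is a generic sketch using the plain OpenTelemetry Python API (not Latitude's own integration, which I haven't tried): each model call is wrapped in a span, and token counts - the basis for per-request cost - are attached as attributes.

from openai import OpenAI
from opentelemetry import trace

client = OpenAI()
tracer = trace.get_tracer("support-chatbot")

def traced_completion(prompt):
    # Wrap the model call in a span so the request is captured as a trace
    with tracer.start_as_current_span("llm.completion") as span:
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        # Token counts are what per-request cost estimates are derived from
        span.set_attribute("llm.prompt_tokens", completion.usage.prompt_tokens)
        span.set_attribute("llm.completion_tokens", completion.usage.completion_tokens)
        return completion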

What do you think about Latitude? Have you used something similar?