
How the HAi Lab team test new GenAI Models like GPT-4o

Trevor Killick

Discover how the Hornbill AI Lab team puts new AI models to the test, and the effects these models have on current functionality.

Why is a model change important?

Hornbill AI makes use of OpenAI's Large Language Models (LLMs) for its Generative AI capabilities. These models change frequently: in the past, versions like GPT-3.5, GPT-4 and GPT-4-Turbo have been released with significant changes in capability, knowledge cut-off and accuracy. Each time a new flagship model is released, existing prompts need to be validated and sometimes updated to maintain functionality (or to make use of new capabilities).

This week, OpenAI released GPT-4o, their latest flagship model, with improvements across the board. This was the first model update since Hornbill AI shipped to the Beta Program, providing a unique challenge for the Lab team: validating all the existing functionality and output variations already in use.

Significant model changes typically alter the way system prompts are handled and can, on occasion, cause the output behaviour to change in an undesirable way. It's important that all the existing functionality is validated against the following criteria:

  1. Quality of the output based on the prompt - Has the new knowledge cut-off (the inclusion of newer, "fresher" data since the previous model) improved or degraded the quality of output? And is the response still aligned with the prompt?
  2. Output formatting - Our use cases support 3 different output formats (based on where in Hornbill the prompt is used) - Wiki Markup, HTML, or plain text. Some of these output formats behave better than others after model updates (spoiler alert - Wiki Markup didn't behave). 
  3. Performance - Recent flagship model changes have tended to get faster on average - but on occasion a change can introduce additional latency to the response.

How do we test new models?

The Hornbill AI Lab team typically test changes to prompts in situ within Hornbill, as these are usually bug fixes or tweaks that can be easily replicated and validated. But when testing a new model, each and every prompt and variation needs to be validated - so a benchmarking suite was set up.

A list of models from GPT-3.5-Turbo up to GPT-4o was created to provide historic reference points. Hornbill AI launched using GPT-4-Turbo, but adding some of the newer preview models and GPT-4o to the testing suite creates useful benchmarks to compare against.

We created a dynamic system for building out the test list, based on our application prompts, output formats, and example test inputs. This means the test suite covers both new and historic functionality - everything gets picked up and tested automatically.

This has led to an initial test suite of 900 tests.
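To give a flavour of the approach, a test list like this can be built as a cross-product of models, prompts, output formats, and example inputs. The sketch below is a simplified illustration rather than the actual Hornbill tooling - the model names are real OpenAI identifiers, but the prompts and test inputs are placeholders (the real suite draws on far more of each, hence the 900 tests).

```python
from itertools import product

# Models to benchmark, from historic reference points up to GPT-4o.
MODELS = ["gpt-3.5-turbo", "gpt-4-turbo", "gpt-4-0125-preview", "gpt-4o"]

# Placeholder application prompts and output formats (the real suite is
# driven by Hornbill's own prompt catalogue).
PROMPTS = {
    "summarise_request": "Summarise the following request for an analyst.",
    "draft_reply": "Draft a polite reply to the customer below.",
}
OUTPUT_FORMATS = ["plain", "html", "wiki"]

# Example test inputs (real ones come from representative data).
TEST_INPUTS = [
    "Printer on floor 3 is jammed again.",
    "Please reset my VPN token, I am locked out.",
]

def build_test_list():
    """One test per combination of model x prompt x format x input."""
    tests = []
    for model, (prompt_id, prompt), fmt, text in product(
        MODELS, PROMPTS.items(), OUTPUT_FORMATS, TEST_INPUTS
    ):
        tests.append({
            "model": model,
            "prompt_id": prompt_id,
            "system_prompt": f"{prompt} Respond in {fmt} format.",
            "output_format": fmt,
            "input": text,
        })
    return tests

if __name__ == "__main__":
    print(len(build_test_list()), "tests generated")
```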


The output of each test, as well as the list of input messages, is saved and then analysed by the Hornbill AI Lab team.
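Running one such test and persisting the result can be as simple as the sketch below, which assumes the official OpenAI Python SDK. The run_test helper, the results directory, and the record fields are illustrative and follow on from the earlier sketch - they are not the real Hornbill harness.

```python
import json
import time
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
RESULTS_DIR = Path("benchmark_results")
RESULTS_DIR.mkdir(exist_ok=True)

def run_test(test: dict, index: int) -> dict:
    """Send one test to the model and save inputs + output for later analysis."""
    messages = [
        {"role": "system", "content": test["system_prompt"]},
        {"role": "user", "content": test["input"]},
    ]
    started = time.monotonic()
    response = client.chat.completions.create(
        model=test["model"],
        messages=messages,
    )
    elapsed = time.monotonic() - started

    record = {
        **test,
        "messages": messages,
        "output": response.choices[0].message.content,
        "elapsed_seconds": round(elapsed, 3),
    }
    # Save the full record so the team can review inputs and outputs side by side.
    (RESULTS_DIR / f"test_{index:04d}.json").write_text(json.dumps(record, indent=2))
    return record
```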

Quality of the output

Manual analysis is initially undertaken on the output of each test. The team look at each one and, based on what the prompt was asked to do, decide whether it needs to be flagged for review (so the prompt can be corrected). This exercise highlights the improvements of the flagship models and how the quality of the output is improving over time - making Generative AI more and more useful.

Output Formatting

The output messages, grouped by format, are scanned for any formatting that doesn't match. Are HTML tags included in a plain text response? Is there incorrect Wiki Markup? (Yes, there was.) Older models behaved up to a point, but GPT-4o showed signs of ignoring some system messages (something we have seen in the past, especially around Wiki Markup). Our thinking is that because Hornbill Wiki Markup is a variation on Wikipedia markup (and not widely documented or used), the LLM struggles with it - even with a well-defined style guide to follow. Plain text and HTML typically behave themselves.
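The scan itself can be as simple as a handful of pattern checks per format. The rules below are illustrative examples of the kind of checks involved, not the exact heuristics the Lab team uses.

```python
import re

def format_issues(output: str, output_format: str) -> list[str]:
    """Return human-readable formatting problems found in a model response."""
    issues = []
    has_html = re.search(r"</?[a-zA-Z][^>]*>", output) is not None
    # Hornbill Wiki Markup resembles Wikipedia markup, e.g. '''bold''' and == headings ==.
    has_wiki = re.search(r"'''|^==.+==\s*$", output, re.MULTILINE) is not None
    has_markdown = re.search(r"^#{1,6}\s|\*\*[^*]+\*\*", output, re.MULTILINE) is not None

    if output_format == "plain" and (has_html or has_wiki or has_markdown):
        issues.append("plain text response contains markup")
    if output_format == "html" and not has_html:
        issues.append("HTML response contains no HTML tags")
    if output_format == "wiki" and (has_html or has_markdown):
        issues.append("wiki response uses HTML or Markdown instead of Wiki Markup")
    return issues
```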

Performance

With the announcement of GPT-4o and its touted 2x-4x speed improvement, the team wanted to benchmark our current performance against the latest model. Two metrics were established: TTFC (Time To First Character) and TTFR (Time To Full Response). These measure how long a user waits to see HAi start outputting a response, and how long before the response is complete.
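Both metrics can be captured from a single streamed request: TTFC is the delay until the first content chunk arrives, TTFR the delay until the stream finishes. The sketch below assumes the official OpenAI Python SDK and is a simplified stand-in for the actual benchmark code - the numbers quoted in this post come from the Lab team's own runs, not this snippet.

```python
import time
from openai import OpenAI

client = OpenAI()

def measure_latency(model: str, messages: list[dict]) -> tuple[float, float]:
    """Return (TTFC, TTFR) in seconds for a streamed chat completion."""
    started = time.monotonic()
    ttfc = None
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content and ttfc is None:
            ttfc = time.monotonic() - started  # first visible character
    ttfr = time.monotonic() - started  # full response received
    return ttfc, ttfr
```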

TTFC on GPT-4o was (on average) 45% faster than GPT-4-0125-preview, which we are currently using. More importantly, the maximum TTFC seen is significantly improved: from around 9 seconds down to 2.4 seconds. This is noticeable for Hornbill users - HAi is much snappier when it starts responding (by about half a second on average).

TTFR again shows an incredible improvement: 8.3 seconds down to 3.4 seconds (on average). When generating large blocks of text, GPT-4o is much faster - meaning analysts save time.

Outcome

The testing framework has highlighted a number of prompts that could use improvement, and some that needed fixes (especially around output formatting). The Hornbill AI Lab's documented and repeatable process allows reliable tests to be performed and analysed - highlighting things the team wouldn't normally see, and allowing for improvements to HAi (as well as validating the new OpenAI models).

There's more work to be done adding test cases and automating the analysis. Expect to see HAi powered by GPT-4o in the near future. 
