OpenAI Announces Evals, An Open-Source Software Framework for Evaluating AI Models


Alongside the announcement of GPT-4, OpenAI has announced the open-source software framework OpenAI Evals. This tool is designed to create and run benchmarks that evaluate the performance of models like GPT-4. With Evals, OpenAI hopes to crowdsource benchmarks for AI model testing. 

“We use Evals to guide development of our models (both identifying shortcomings and preventing regressions), and our users can apply it for tracking performance across model versions (which will now be coming out regularly) and evolving product integrations,” the company explains in a blog post.

Stripe, a popular payment processing company, has already used Evals to complement its human evaluations and measure the accuracy of its GPT-powered documentation tool.

Developers can use Evals to create and run evaluations that:

  • Use datasets to generate prompts,
  • Measure the quality of completions provided by an OpenAI model, and
  • Compare performance across different datasets and models.
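The workflow above can be sketched as a simple evaluation loop: draw prompts from a dataset, generate a completion for each, and score the results against a reference answer. This is an illustrative sketch, not the actual Evals API; the sample data and the `complete()` stub standing in for a model call are hypothetical.

```python
# Minimal sketch of a dataset-driven, exact-match evaluation loop.
# The samples and the complete() stub are illustrative, not the real Evals API.

samples = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {"input": "What is the capital of France?", "ideal": "Paris"},
]

def complete(prompt: str) -> str:
    """Stand-in for a call to a model such as GPT-4."""
    canned = {"What is 2 + 2?": "4", "What is the capital of France?": "Paris"}
    return canned.get(prompt, "")

def run_eval(samples) -> float:
    """Generate one completion per sample and score exact matches against the ideal."""
    correct = sum(complete(s["input"]).strip() == s["ideal"] for s in samples)
    return correct / len(samples)

print(f"accuracy: {run_eval(samples):.2f}")
```

Running the same loop against different datasets or different model versions is what allows the before/after comparisons the framework is built for.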

With the open-source code, developers can write and add custom evals, and the framework ships with several templates that accommodate different benchmarks. The company has included the templates that have been most useful internally, including a template for “model-graded evals,” in which GPT-4 can check its own work. As an example to follow, the company has created a logic puzzles eval containing ten prompts where GPT-4 fails.
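The model-graded pattern can be sketched as follows: one model produces an answer, and a second model call is prompted to grade that answer against a reference. This is a hedged illustration of the idea, not OpenAI's template; the grading prompt and the `ask_model()` stub are hypothetical stand-ins for a real GPT-4 API call.

```python
# Sketch of the "model-graded eval" pattern: one model answers, and a second
# model call grades that answer. ask_model() is a hypothetical stub standing
# in for a real chat-completion API call.

GRADER_TEMPLATE = (
    "Question: {question}\n"
    "Submitted answer: {answer}\n"
    "Reference answer: {ideal}\n"
    "Does the submitted answer match the reference? Reply YES or NO."
)

def ask_model(prompt: str) -> str:
    """Stand-in for a grader model; here it naively compares the prompt's fields."""
    # A real grader would be a GPT-4 call; this stub just parses the prompt lines.
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    return "YES" if fields["Submitted answer"] == fields["Reference answer"] else "NO"

def model_graded(question: str, answer: str, ideal: str) -> bool:
    """Return True if the grader model judges the answer correct."""
    prompt = GRADER_TEMPLATE.format(question=question, answer=answer, ideal=ideal)
    return ask_model(prompt).strip().upper().startswith("YES")
```

The appeal of this template is that the grader can accept paraphrases and partial matches that a strict string comparison would reject, at the cost of trusting the grading model's judgment.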

Evals also supports implementing existing benchmarks: the repository includes several notebooks implementing academic benchmarks, as well as a few variations that integrate small subsets of CoQA.

While developers will not be paid for contributing Evals, OpenAI will be granting GPT-4 access for a limited time to those who contribute “high-quality evals.” 

The announcement of Evals comes after OpenAI recently said it would stop using data submitted by customers via its API to train or improve its models unless the customers decide to opt in. The company joins Meta in crowdsourcing benchmarks; the latter tasks humans with “finding adversarial examples that fool current state-of-the-art models” for its Dynabench platform.
