
In a stunning move, OpenAI has unveiled a new software framework called Evals to evaluate the performance of its AI models alongside the highly anticipated GPT-4.
The tool allows anyone to report shortcomings in OpenAI's models, with the hope that these reports will guide future improvements.
According to OpenAI, Evals represents a crowdsourcing approach to model testing: it relies on the collective efforts of its user community to identify failures and difficult tasks.
In a recent blog post, OpenAI explained that it uses Evals to guide the development of its models, helping it identify shortcomings and prevent regressions.
Furthermore, users can apply the framework to track performance across model versions and evolving product integrations.
The ultimate goal of the framework is to become a vehicle for sharing and crowdsourcing benchmarks that represent a maximally wide set of failure modes and difficult tasks.
OpenAI’s Evals: A New Software Framework to Evaluate AI Model Performance
Evals runs models such as GPT-4 against benchmark tests. Developers can use datasets to generate prompts, measure the quality of model completions, and compare performance across models.
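To make that concrete, here is a minimal sketch of what such a benchmark dataset could look like. It is not taken from OpenAI's announcement; the prompt/ideal JSONL layout follows the sample format used in the public openai/evals repository and may differ between versions.

```python
import json

# A handful of question/answer pairs in the JSONL format used by the
# openai/evals repository's basic "match" evals: each line stores a
# chat-style prompt under "input" and the expected answer under "ideal".
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "How many legs does a spider have?"},
        ],
        "ideal": "8",
    },
]

# Write one JSON object per line; the framework reads this file, sends
# each prompt to the model under test, and scores the completions.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

Once a dataset like this is registered with the framework, the same file can be replayed against new model versions, which is how Evals supports tracking and comparing performance over time.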
The framework is compatible with several popular AI benchmarks and supports writing new classes to implement custom evaluation logic. As an example to follow, OpenAI created a logic puzzles evaluation that contains 10 prompts where GPT-4 fails.
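Custom evaluation logic follows a similar pattern. The sketch below is loosely modeled on the custom-eval guide in the openai/evals repository; the class name ExactMatch is made up here, and the helper functions and signatures are assumptions that may differ between versions of the framework.

```python
import random

import evals
import evals.metrics


class ExactMatch(evals.Eval):
    """A custom eval that asks the model each question and checks for an exact match."""

    def __init__(self, samples_jsonl: str, **kwargs):
        super().__init__(**kwargs)
        self.samples_jsonl = samples_jsonl

    def eval_sample(self, sample: dict, rng: random.Random):
        # Send the sample's prompt to the model under test and take the
        # first completion it returns.
        result = self.completion_fn(prompt=sample["input"], max_tokens=32)
        sampled = result.get_completions()[0]

        # Record whether the completion matches the expected answer.
        evals.record_and_check_match(
            prompt=sample["input"],
            sampled=sampled,
            expected=sample["ideal"],
        )

    def run(self, recorder):
        # Load the JSONL dataset, evaluate every sample, and report accuracy.
        samples = evals.get_jsonl(self.samples_jsonl)
        self.eval_all_samples(recorder, samples)
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
```

In the public repository, an eval class like this is then connected to its dataset through a short registry entry and run from the command line against a chosen model, so the same custom logic can score any compatible benchmark.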
Unfortunately, all of this work is unpaid. To incentivize Evals usage, however, OpenAI plans to grant GPT-4 access to those who contribute “high-quality” benchmarks.
The company believes that Evals will be an integral part of the process of using and building on top of its models, and it welcomes direct contributions, questions, and feedback.
OpenAI’s Evals follows in the footsteps of others who have turned to crowdsourcing to make AI models more robust.
The Computational Linguistics and Information Processing Laboratory at the University of Maryland launched a platform dubbed Break It, Build It, which lets researchers submit models to users tasked with coming up with examples to defeat them.
Meta’s Dynabench platform lets users test models on various tasks, including sentiment analysis and hate speech detection.