CoinRSS: Bitcoin, Ethereum, Crypto News and Price Data

Did OpenAI Cheat on Its Big Math Test?

CoinRSS
Last updated: January 26, 2025 11:20 pm
CoinRSS Published January 26, 2025

How intelligent is a model that memorizes the answers before an exam? That’s the question facing OpenAI after it unveiled o3 in December and touted the model’s impressive benchmark scores. At the time, some pundits hailed it as approaching AGI, the level at which artificial intelligence can match human performance on any task a user requires.

But money changes everything—even math tests, apparently.

OpenAI’s victory lap over its o3 model’s stunning 25.2% score on FrontierMath, a challenging mathematical benchmark developed by Epoch AI, hit a snag when it turned out the company wasn’t just acing the test—OpenAI helped write it, too.

“We gratefully acknowledge OpenAI for their support in creating the benchmark,” Epoch AI wrote in an updated footnote on the FrontierMath whitepaper—and this was enough to raise some red flags among enthusiasts.

Screenshot from Epoch AI's research paper acknowledging OpenAI's support during the development of its FrontierMath benchmark dataset
Image: Epoch AI via ArXiv

Worse, OpenAI had not only funded FrontierMath’s development but also had access to its problems and solutions to use as it saw fit. Epoch AI later revealed that OpenAI hired the company to provide 300 math problems, as well as their solutions.

“As is typical of commissioned work, OpenAI retains ownership of these questions and has access to the problems and solutions,” Epoch said Thursday.

Neither OpenAI nor Epoch replied to a request for comment from Decrypt. Epoch has, however, said that OpenAI agreed in advance not to use the questions and answers in its database to train its o3 model.

The Information first broke the story.

While an OpenAI spokesperson maintains OpenAI didn’t directly train o3 on the benchmark, and the problems were “strongly held out” (meaning OpenAI didn’t have access to some of the problems), experts note that access to the test materials could still allow performance optimization through iterative adjustments.

Tamay Besiroglu, associate director at Epoch AI, said that OpenAI had initially demanded that its financial relationship with Epoch not be revealed.

“We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible,” he wrote in a post. “Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much, but not all of the dataset.”

Besiroglu said OpenAI promised not to use Epoch AI’s problems and solutions for training, but that the promise was never written into a legally binding contract. “We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions,” he wrote. “However, we have a verbal agreement that these materials will not be used in model training.”

Fishy as it sounds, Elliot Glazer, Epoch AI’s lead mathematician, said he believes OpenAI was true to its word: “My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances,” he posted on Reddit.

Glazer also took to Twitter to address the situation, sharing a link to a debate about the issue on the online forum LessWrong.

As for where the o3 score on FM stands: yes I believe OAI has been accurate with their reporting on it, but Epoch can’t vouch for it until we independently evaluate the model using the holdout set we are developing.

— Elliot Glazer (@ElliotGlazer) January 19, 2025

Not the first, not the last

The controversy extends beyond OpenAI, pointing to systemic issues in how the AI industry validates progress. A recent investigation by AI researcher Louis Hunt revealed that other top-performing models, including Mistral 7B, Google’s Gemma, Microsoft’s Phi-3, Meta’s Llama-3 and Alibaba’s Qwen 2.5, were able to reproduce verbatim 6,882 pages of the MMLU and GSM8K benchmarks.

MMLU (Massive Multitask Language Understanding) is a benchmark, like FrontierMath, that was created to measure how well models perform across a wide range of subjects. GSM8K is a set of grade-school math word problems used to benchmark how proficient LLMs are at math.

LLMs reproducing the training dataset of some AI benchmarks
Image: Louis Hunt

That makes it impossible to properly assess how powerful or accurate their models truly are. It’s like giving a student with a photographic memory a list of the problems and solutions that will be on their next exam; did they reason their way to a solution, or simply spit back the memorized answer? Since these tests are intended to demonstrate that AI models are capable of reasoning, you can see what the fuss is about.

“It’s actually A VERY BIG ISSUE,” RemBrain founder Vasily Morzhakov warned. “The models are tested in their instruction versions on MMLU and GSM8K tests. But the fact that base models can regenerate tests—it means those tests are already in pre-training.”
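The kind of check Morzhakov describes can be sketched in a few lines of Python. This is only a minimal illustration, not the method Hunt used: the function names, the 50% prefix split, the sample question and the two stand-in "models" are all assumptions made for the example. The idea is to show a model the first part of a benchmark item and measure how closely its continuation matches the part that was held back.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def split_item(text: str, prefix_fraction: float = 0.5) -> Tuple[str, str]:
    """Split a benchmark item into a prompt prefix and the held-back remainder."""
    words = text.split()
    cut = max(1, int(len(words) * prefix_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def reproduction_score(item: str, generate: Callable[[str], str],
                       prefix_fraction: float = 0.5) -> float:
    """Return a 0..1 similarity between the model's continuation and the true remainder.

    `generate` is a hypothetical callable (prompt -> completion) standing in
    for whatever model API is being probed.
    """
    prefix, remainder = split_item(item, prefix_fraction)
    if not remainder:
        return 0.0
    # Compare only as many characters as the true remainder contains.
    candidate = generate(prefix).strip()[:len(remainder)]
    return SequenceMatcher(None, candidate, remainder).ratio()


# Invented example item; it is not taken from MMLU or GSM8K.
ITEM = ("A farmer has 12 crates with 8 apples in each crate. "
        "How many apples does the farmer have in total?")


def memorizing_model(prompt: str) -> str:
    """Stand-in for a model that has seen ITEM during training."""
    return ITEM[len(prompt):] if ITEM.startswith(prompt) else ""


def fresh_model(prompt: str) -> str:
    """Stand-in for a model that has never seen ITEM."""
    return "I am not sure."


memorized_score = reproduction_score(ITEM, memorizing_model)
fresh_score = reproduction_score(ITEM, fresh_model)
```

A score near 1.0 across many items would suggest the test text was present in training data, while low scores are weaker evidence, since a model can be contaminated without reproducing text word for word.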

Going forward, Epoch said it plans to implement a “hold out set” of 50 randomly selected problems that will be withheld from OpenAI to ensure genuine testing capabilities.
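As a rough sketch of what drawing such a hold-out set could look like, assuming nothing about Epoch's actual procedure: the problem identifiers, the pool size of 300 and the function name below are purely illustrative. The important properties are that the selection is random and that the two resulting sets do not overlap.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_holdout(problem_ids: Sequence[str], holdout_size: int = 50,
                  seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Randomly withhold `holdout_size` problems; return (shared, holdout)."""
    rng = random.Random(seed)  # a private seed keeps the draw unpredictable to outsiders
    holdout = set(rng.sample(list(problem_ids), holdout_size))
    shared = [p for p in problem_ids if p not in holdout]
    return shared, sorted(holdout)


# Hypothetical pool of problem identifiers, used only for illustration.
pool = [f"problem-{i:03d}" for i in range(300)]
shared_ids, holdout_ids = split_holdout(pool, holdout_size=50, seed=2025)
```

Only the shared list would ever be given to the party being evaluated; scores reported on the withheld list can then be checked independently.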

But the challenge of creating truly independent evaluations remains significant. Computer scientist Dirk Roeckmann argued that ideal testing would require “a neutral sandbox which is not easy to realize,” adding that even then, there’s a risk of “leaking of test data by adversarial humans.”

Edited by Andrew Hayward

© CoinRSS: Bitcoin, Ethereum, Crypto News and Price Data. All Rights Reserved.