Huewire
by huewire
December 25, 2024

December 24, 2024 11:40 AM

Image created with DALL-E 3 for VentureBeat

OpenAI’s latest o3 model has achieved a breakthrough that surprised the AI research community. o3 scored an unprecedented 75.7% on the super-difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.

While the achievement in ARC-AGI is impressive, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.

Abstraction and Reasoning Corpus

The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of visual puzzles that require an understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles after very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging measures of AI.

Example of ARC puzzle (source: arcprize.org)

ARC is designed so that it can’t be gamed by training models on millions of examples in the hope of covering every possible combination of puzzles.

The benchmark comprises a public training set of 400 simple examples, complemented by a public evaluation set of 400 more challenging puzzles that measure how well AI systems generalize. The ARC-AGI Challenge adds private and semi-private test sets of 100 puzzles each, which are not shared with the public; they are used to evaluate candidate AI systems without risking leaking the data and contaminating future systems with prior knowledge. The competition also caps the amount of computation participants can use, to ensure that the puzzles are not solved through brute force.
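The task format makes this concrete. In the public ARC repository (github.com/fchollet/ARC), each task is a JSON object with "train" and "test" lists of input/output grid pairs, where a grid is a list of rows of integers 0–9. The harness below is a sketch under that assumption; the toy task and the rotate-180 "solver" are invented stand-ins, not real ARC data or any published solver:

```python
# An ARC-style task: "train" and "test" lists of {"input": grid, "output": grid},
# where each grid is a list of rows of color codes 0-9.
# This toy task is a stand-in, not a real ARC puzzle.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]], "output": [[0, 3], [0, 0]]},
    ],
}

def solve(grid):
    """Candidate solver: rotate the grid 180 degrees (illustrative only)."""
    return [row[::-1] for row in grid[::-1]]

def score(task, solver):
    """Fraction of test grids reproduced exactly --
    ARC scoring is all-or-nothing per grid, with no partial credit."""
    hits = sum(solver(pair["input"]) == pair["output"] for pair in task["test"])
    return hits / len(task["test"])

print(score(toy_task, solve))  # 1.0 on this toy task
```

Exact-match scoring is what makes the benchmark hard to game: a solver gets no credit for being "close" on a grid.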

A breakthrough in solving novel tasks

o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another method developed by researcher Jeremy Berman used a hybrid approach, combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.

In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.”

It is important to note that these results could not be reached by simply throwing more compute at previous generations of models. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don’t know much about o3’s architecture, we can be confident that it is not orders of magnitude larger than its predecessors.

Performance of different models on ARC-AGI (source: arcprize.org)

“This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”

It is worth noting that o3’s performance on ARC-AGI comes at a steep cost. On the low-compute configuration, the model spends $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget it uses around 172X more compute and billions of tokens per problem. However, as the costs of inference continue to decrease, we can expect these figures to become more reasonable.
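A back-of-envelope sketch of what those figures imply, assuming cost scales roughly linearly with compute (an assumption; OpenAI has not published high-compute pricing) and taking the upper end of the $17–$20 range across the 100-puzzle semi-private set:

```python
# Back-of-envelope from the figures above. Assumptions: cost scales
# linearly with compute, and the run covers the 100-puzzle set.
low_cost_per_puzzle = 20   # USD, upper end of the reported $17-20 range
puzzles = 100
compute_multiplier = 172   # reported high- vs low-compute ratio

low_total = low_cost_per_puzzle * puzzles
high_per_puzzle = low_cost_per_puzzle * compute_multiplier
high_total = high_per_puzzle * puzzles

print(low_total)        # 2000   -> ~$2,000 for a full low-compute run
print(high_per_puzzle)  # 3440   -> ~$3,440 per puzzle at high compute
print(high_total)       # 344000 -> ~$344,000 for a full high-compute run
```

Even under these rough assumptions, the high-compute configuration is far outside the cost limits the ARC-AGI competition imposes on official entries.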

A new paradigm in LLM reasoning?

The key to solving novel problems is what Chollet and other scientists refer to as “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that are beyond their training distribution.
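A minimal sketch of what program synthesis over ARC-like grids could look like, assuming a toy DSL of four grid primitives and a brute-force search for a composition that fits the training pairs. The DSL and search are illustrative only, not Chollet's proposal or anyone's actual solver:

```python
from itertools import product

# Toy DSL of grid primitives.
def identity(g):
    return [list(row) for row in g]

def flip_h(g):            # mirror left-right
    return [row[::-1] for row in g]

def flip_v(g):            # mirror top-bottom
    return g[::-1]

def transpose(g):         # swap rows and columns
    return [list(row) for row in zip(*g)]

PRIMITIVES = [identity, flip_h, flip_v, transpose]

def synthesize(train_pairs, max_depth=2):
    """Return the first composition of primitives (applied left to right)
    that maps every training input to its output, or None."""
    for depth in range(1, max_depth + 1):
        for prog in product(PRIMITIVES, repeat=depth):
            def run(g, prog=prog):
                for step in prog:
                    g = step(g)
                return g
            if all(run(i) == o for i, o in train_pairs):
                return run
    return None

# Training pairs that implement a 90-degree clockwise rotation:
train = [([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
         ([[5, 0], [0, 0]], [[0, 5], [0, 0]])]
program = synthesize(train)
print(program([[1, 0], [0, 2]]))  # [[0, 1], [2, 0]]
```

The search discovers that flipping vertically and then transposing reproduces the rotation, composing a behavior (rotation) out of primitives that don't include it. Real ARC tasks need far richer primitives and guided rather than exhaustive search, which is exactly the compositionality gap the paragraph above describes.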

Unfortunately, there is very little information about how o3 works under the hood, and here the opinions of scientists diverge. Chollet speculates that o3 uses a form of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in the past few months.
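The generate-then-score loop Chollet speculates about can be sketched in miniature: enumerate candidate "reasoning chains", score each with a reward function, and keep the best. The chains here are toy programs (sequences of +1 / *2 steps toward a numeric target), purely to illustrate the search-plus-reward pattern; nothing here is confirmed about o3:

```python
from itertools import product

def run_chain(steps, start=1):
    """Execute one candidate chain: a sequence of '+1' / '*2' steps."""
    value = start
    for s in steps:
        value = value + 1 if s == "+1" else value * 2
    return value

def reward(value, target=24):
    """Stand-in reward model: 0 at an exact solution, negative otherwise."""
    return -abs(value - target)

def search(depth=6):
    """Enumerate every chain of the given depth and keep the best-scoring
    one -- a brute-force analogue of sampling-plus-reranking."""
    candidates = list(product(["+1", "*2"], repeat=depth))
    return max(candidates, key=lambda c: reward(run_chain(c)))

best = search()
print(best, run_chain(best))  # a chain whose value hits the target of 24
```

In a real system the enumeration would be replaced by sampling from the LLM and the reward function by a learned verifier; whether o3 does anything like this, or is "just an LLM trained with RL" as McAleese says, is exactly the open question.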

Other scientists such as Nathan Lambert from the Allen Institute for AI suggest that “o1 and o3 can actually be just the forward passes from one language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”

On the same day, Denny Zhou from Google DeepMind’s reasoning team called the combination of search and current reinforcement learning approaches a “dead end.” 

“The most beautiful thing on LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. mcts) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.

While the details of how o3 reasons might seem trivial in comparison to the breakthrough on ARC-AGI, they could well define the next paradigm shift in training LLMs. There is currently a debate on whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or on different inference architectures could determine the next path forward.

Not AGI

The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet stresses that “ARC-AGI is not an acid test for AGI.”

“Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don’t think o3 is AGI yet,” he writes. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.”

Moreover, he notes that o3 cannot autonomously learn these skills; it relies on external verifiers during inference and on human-labeled reasoning chains during training.

Other scientists have pointed to the flaws of OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve state-of-the-art results. “The solver should not need much specific ‘training’, either on the domain itself or on each specific task,” writes scientist Melanie Mitchell.

To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes “seeing if these systems can adapt to variants on specific tasks or to reasoning tasks using the same concepts, but in other domains than ARC.”
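One way Mitchell's proposal could be operationalized (a sketch of the idea, not her code) is to derive variants of a task by consistently re-coloring and reflecting every grid, preserving the underlying rule while changing the surface form the solver sees:

```python
import random

def recolor(grid, mapping):
    """Apply one color permutation consistently across a grid."""
    return [[mapping[c] for c in row] for row in grid]

def flip_h(grid):
    """Mirror a grid left-right."""
    return [row[::-1] for row in grid]

def make_variant(task, seed=0):
    """Build a variant task: same rule, permuted colors, mirrored grids."""
    rng = random.Random(seed)
    colors = list(range(10))          # ARC uses color codes 0-9
    shuffled = colors[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(colors, shuffled))
    transform = lambda g: flip_h(recolor(g, mapping))
    return {
        split: [{"input": transform(p["input"]), "output": transform(p["output"])}
                for p in task[split]]
        for split in ("train", "test")
    }

# Toy task whose rule is "mirror the row"; the variant encodes the
# same rule in permuted colors, so a solver that truly grasped the
# rule should still solve it.
task = {"train": [{"input": [[1, 0]], "output": [[0, 1]]}],
        "test":  [{"input": [[2, 0]], "output": [[0, 2]]}]}
variant = make_variant(task)
```

Because the transformation is applied identically to inputs and outputs, a system that learned the abstract rule should transfer to the variant, while a system that memorized surface patterns should not.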

Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training.

“You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” Chollet writes.


Copyrights © 2024 Huewire.com.

Navigate Site

  • About
  • Advertise
  • Privacy & Policy
  • Contact

Follow Us

Welcome Back!

Login to your account below

Forgotten Password? Sign Up

Create New Account!

Fill the forms below to register

All fields are required. Log In

Retrieve your password

Please enter your username or email address to reset your password.

Log In

Add New Playlist

No Result
View All Result
  • HOME
  • BUSINESS
  • ENTERTAINMENT
  • POLITICAL
  • TECHNOLOGY
  • ABOUT US

Copyrights © 2024 Huewire.com.