by huewire
November 24, 2024
in TECHNOLOGY
November 22, 2024 3:26 PM

Credit: VentureBeat with DALL-E 3

OpenAI’s o1 model has shown that inference-time scaling, meaning the use of more compute during inference, can significantly boost a language model’s reasoning abilities. LLaVA-o1, a new model developed by researchers from multiple universities in China, brings this paradigm to open-source vision language models (VLMs).

Early open-source VLMs typically use a direct prediction approach, generating answers without reasoning about the prompt or the steps required to solve it. Without a structured reasoning process, they are less effective at tasks that require logical reasoning. Advanced prompting techniques such as chain-of-thought (CoT) prompting, where the model is encouraged to generate intermediate reasoning steps, produce only marginal improvements, and VLMs still often produce errors or hallucinate.

The researchers observed that the reasoning process in existing VLMs is not sufficiently systematic or structured. The models do not generate reasoning chains and often get stuck partway through, losing track of which stage they are in and which specific problem they must solve.

“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from a logical reasoning toward conclusions, instead of presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”

Multistage reasoning

OpenAI o1 uses inference-time scaling to address this lack of systematic, structured reasoning, allowing the model to pause and review its results as it gradually solves the problem. While OpenAI has not released much detail about the underlying mechanism of o1, its results show promising directions for improving the reasoning abilities of foundation models.

Inspired by o1, the researchers designed LLaVA-o1 to perform stage-by-stage reasoning. Instead of generating a direct reasoning chain, LLaVA-o1 breaks down the reasoning process into four distinct stages:

Summary: The model first provides a high-level summary of the question, outlining the core problem it needs to address.

Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.

Reasoning: Building on the summary, the model performs structured, logical reasoning to derive a preliminary answer.

Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.

Only the conclusion stage is visible to the user; the other three stages represent the model’s internal reasoning process, similar to the hidden reasoning trace of o1. This structured approach allows LLaVA-o1 to manage its reasoning process independently, leading to improved performance on complex tasks.

“This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.
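As a rough illustration of the four-stage output described above, the following sketch splits a tagged response into its stages and returns only the conclusion to the user. The tag names, the helper functions `split_stages` and `visible_answer`, and the sample response are all assumptions made for this example; the article names the stages but does not say how they are delimited.

```python
import re

# Hypothetical stage tags; the article names the stages but not their delimiters.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def split_stages(response: str) -> dict[str, str]:
    """Extract the text of each stage from a tagged model response."""
    stages = {}
    for name in STAGES:
        match = re.search(rf"<{name}>(.*?)</{name}>", response, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages


def visible_answer(response: str) -> str:
    """Return only the conclusion stage, hiding the intermediate stages."""
    stages = split_stages(response)
    if "CONCLUSION" not in stages:
        raise ValueError("response has no conclusion stage")
    return stages["CONCLUSION"]


# Made-up response text, used only to exercise the helpers.
example = (
    "<SUMMARY>Count the apples left after some are removed.</SUMMARY>"
    "<CAPTION>The image shows five apples on a table.</CAPTION>"
    "<REASONING>Five apples minus two removed leaves three.</REASONING>"
    "<CONCLUSION>Three apples remain.</CONCLUSION>"
)
print(visible_answer(example))
```

Keeping the stages as separately addressable pieces is also what would let a checker inspect each one on its own.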

Stage-level beam search (right) vs. other inference-time scaling techniques. Source: arXiv

LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage. It then selects the best candidate at each stage to continue the generation process. This is in contrast to the classic best-of-N approach, in which the model is prompted to generate multiple complete responses before selecting one.

“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference time scaling.”
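The contrast between stage-level beam search and best-of-N can be sketched in a few lines. This is a minimal sketch, not the researchers' implementation: the `generate`, `generate_full` and `score` callables are hypothetical placeholders for a model and a candidate judge, and the toy versions below exist only so the code runs without a model. The article does not specify how candidates are judged.

```python
from typing import Callable, List

# Stage names follow the article; everything else here is a hypothetical stand-in.
STAGES = ["summary", "caption", "reasoning", "conclusion"]


def stage_level_beam_search(
    question: str,
    generate: Callable[[str, str, int], List[str]],
    score: Callable[[str, str], float],
    n: int = 2,
) -> str:
    """Pick the best of n candidates at every stage, then continue from it."""
    context = question
    for stage in STAGES:
        candidates = generate(context, stage, n)
        best = max(candidates, key=lambda cand: score(context, cand))
        context = f"{context}\n{best}"
    return context


def best_of_n(
    question: str,
    generate_full: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 2,
) -> str:
    """Generate n complete responses and select one only at the end."""
    responses = [generate_full(question, i) for i in range(n)]
    return max(responses, key=lambda resp: score(question, resp))


# Toy stand-ins so the sketch runs without a model.
def toy_generate(context: str, stage: str, n: int) -> List[str]:
    return [f"{stage}: draft {i}" for i in range(n)]


def toy_score(context: str, candidate: str) -> float:
    # Pretend that later drafts are better; a real scorer would judge content.
    return float(candidate.rsplit(" ", 1)[1])


result = stage_level_beam_search("Q", toy_generate, toy_score, n=2)
print(result)
```

In the stage-level variant a weak candidate is discarded before later stages build on it, whereas best-of-N only compares finished responses.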

Training LLaVA-o1

LLaVA-o1 training data is annotated with GPT-4o. Source: arXiv

To train LLaVA-o1, the researchers compiled a new dataset of around 100,000 image-question-answer pairs obtained from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to chart interpretation and geometric reasoning.

The researchers used GPT-4o to generate a detailed four-stage reasoning process for each example, including the summary, caption, reasoning and conclusion stages.

The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. The researchers have not released the model but plan to release the dataset, called LLaVA-o1-100k.
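To make the shape of such training data concrete, here is one way an annotated example could be laid out and turned into a fine-tuning target. The `StagedExample` class, its field names, the tag delimiters and the sample values are all hypothetical; the actual schema of the dataset is not described in this article.

```python
from dataclasses import dataclass


@dataclass
class StagedExample:
    """One image-question pair with a four-stage annotation (hypothetical schema)."""

    image_path: str
    question: str
    summary: str
    caption: str
    reasoning: str
    conclusion: str

    def target_text(self) -> str:
        """Join the four stages, in order, into a single training target."""
        parts = [
            ("SUMMARY", self.summary),
            ("CAPTION", self.caption),
            ("REASONING", self.reasoning),
            ("CONCLUSION", self.conclusion),
        ]
        # The tag delimiters are an assumption made for this sketch.
        return "\n".join(f"<{tag}>{text}</{tag}>" for tag, text in parts)


# Made-up record, used only to show the layout.
record = StagedExample(
    image_path="chart.png",
    question="Which bar is tallest?",
    summary="Identify the tallest bar in the chart.",
    caption="The chart has three bars labeled A, B and C.",
    reasoning="Bar B reaches the highest value on the y-axis.",
    conclusion="Bar B is the tallest.",
)
print(record.target_text())
```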

LLaVA-o1 in action

The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.

LLaVA-o1 vs. other open and closed models. Source: arXiv

Furthermore, stage-level beam search led to additional performance gains, demonstrating the effectiveness of inference-time scaling. Due to computational resource constraints, the researchers were only able to test the technique with a beam size of 2. They expect even greater improvements with larger beam sizes.

Impressively, LLaVA-o1 outperformed not only other open-source models of the same size or larger but also some closed-source models such as GPT-4o-mini and Gemini 1.5 Pro.

“LLaVA-o1 establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially in inference time,” the researchers write. “Our work paves the way for future research on structured reasoning in VLMs, including potential expansions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.”
