Autonomous visual information seeking with large language models

Posted by Ziniu Hu, Student Researcher, and Alireza Fathi, Research Scientist, Google Research, Perception Team There has been great progress towards adapting large language models (LLMs) to accommodate multimodal inputs for tasks including image captioning, visual question answering (VQA), and open vocabulary recognition. Despite such achievements, current state-of-the-art visual language models (VLMs) perform inadequately on visual information seeking datasets, such as Infoseek and OK-VQA, where external knowledge is required to answer the questions. Examples of visual information seeking queries where external knowledge is required to answer the question. Images are taken from the OK-VQA dataset. In “AVIS: Autonomous Visual Information Seeking with Large Language Models”, we introduce a novel method that achieves state-of-the-art results on visual information seeking tasks. Our method integrates LLMs with three types of tools: (i) computer vision tools for extracting visual information from images, (ii) a web search tool for retrieving open world knowledge and facts, and (iii) an image search tool to glean relevant information from metadata associated with visually similar images. AVIS employs an LLM-powered planner to choose tools and queries at each step. It also uses an LLM-powered reasoner to analyze tool outputs and extract key information. A working memory component retains information throughout the process. An example of AVIS’s generated workflow for answering a challenging visual information seeking question. The input image is taken from the Infoseek dataset. Comparison to previous work Recent studies (e.g., Chameleon, ViperGPT and MM-ReAct) explored adding tools to LLMs for multimodal inputs. These systems follow a two-stage process: planning (breaking down questions into structured programs or instructions) and execution (using tools to gather information). Despite success in basic tasks, this approach often falters in complex real-world scenarios. There has also been a surge of interest in applying LLMs as autonomous agents (e.g., WebGPT and ReAct). These agents interact with their environment, adapt based on real-time feedback, and achieve goals. However, these methods do not restrict the tools that can be invoked at each stage, leading to an immense search space. Consequently, even the most advanced LLMs today can fall into infinite loops or propagate errors. AVIS tackles this via guided LLM use, influenced by human decisions from a user study. Informing LLM decision making with a user study Many of the visual questions in datasets such as Infoseek and OK-VQA pose a challenge even for humans, often requiring the assistance of various tools and APIs. An example question from the OK-VQA dataset is shown below. We conducted a user study to understand human decision-making when using external tools. We conducted a user study to understand human decision-making when using external tools. Image is taken from the OK-VQA dataset. The users were equipped with an identical set of tools as our method, including PALI, PaLM, and web search. They received input images, questions, detected object crops, and buttons linked to image search results. These buttons offered diverse information about the detected object crops, such as knowledge graph entities, similar image captions, related product titles, and identical image captions. We record user actions and outputs and use it as a guide for our system in two key ways. First, we construct a transition graph (shown below) by analyzing the sequence of decisions made by users. This graph defines distinct states and restricts the available set of actions at each state. For example, at the start state, the system can take only one of these three actions: PALI caption, PALI VQA, or object detection. Second, we use the examples of human decision-making to guide our planner and reasoner with relevant contextual instances to enhance the performance and effectiveness of our system. AVIS transition graph. General framework Our approach employs a dynamic decision-making strategy designed to respond to visual information-seeking queries. Our system has three primary components. First, we have a planner to determine the subsequent action, including the appropriate API call and the query it needs to process. Second, we have a working memory that retains information about the results obtained from API executions. Last, we have a reasoner, whose role is to process the outputs from the API calls. It determines whether the obtained information is sufficient to produce the final response, or if additional data retrieval is required. The planner undertakes a series of steps each time a decision is required regarding which tool to employ and what query to send to it. Based on the present state, the planner provides a range of potential subsequent actions. The potential action space may be so large that it makes the search space intractable. To address this issue, the planner refers to the transition graph to eliminate irrelevant actions. The planner also excludes the actions that have already been taken before and are stored in the working memory. Next, the planner collects a set of relevant in-context examples that are assembled from the decisions previously made by humans during the user study. With these examples and the working memory that holds data collected from past tool interactions, the planner formulates a prompt. The prompt is then sent to the LLM, which returns a structured answer, determining the next tool to be activated and the query to be dispatched to it. This design allows the planner to be invoked multiple times throughout the process, thereby facilitating dynamic decision-making that gradually leads to answering the input query. We employ a reasoner to analyze the output of the tool execution, extract the useful information and decide into which category the tool output falls: informative, uninformative, or final answer. Our method utilizes the LLM with appropriate prompting and in-context examples to perform the reasoning. If the reasoner concludes that it’s ready to provide an answer, it will output the final response, thus concluding the task. If it determines that the tool output is uninformative, it will revert back to the planner to select another action based on the current state. If it finds the tool output to be useful, it will modify the state and transfer control back to the planner to make a new decision at the new state. AVIS employs a dynamic decision-making strategy to respond to visual information-seeking queries. Results We evaluate AVIS on Infoseek and OK-VQA datasets. As shown below, even robust visual-language models, such as OFA and PaLI, fail to yield high accuracy when fine-tuned on Infoseek. Our approach (AVIS), without fine-tuning, achieves 50.7% accuracy on the unseen entity split of this dataset. AVIS visual question answering results on Infoseek dataset. AVIS achieves higher accuracy in comparison to previous baselines based on PaLI, PaLM and OFA. Our results on the OK-VQA dataset are shown below. AVIS with few-shot in-context examples achieves an accuracy of 60.2%, higher than most of the previous works. AVIS achieves lower but comparable accuracy in comparison to the PALI model fine-tuned on OK-VQA. This difference, compared to Infoseek where AVIS outperforms fine-tuned PALI, is due to the fact that most question-answer examples in OK-VQA rely on common sense knowledge rather than on fine-grained knowledge. Therefore, PaLI is able to encode such generic knowledge in the model parameters and doesn’t require external knowledge. Visual question answering results on A-OKVQA. AVIS achieves higher accuracy in comparison to previous works that use few-shot or zero-shot learning, including Flamingo, PaLI and ViperGPT. AVIS also achieves higher accuracy than most of the previous works that are fine-tuned on OK-VQA dataset, including REVEAL, ReVIVE, KAT and KRISP, and achieves results that are close to the fine-tuned PaLI model. Conclusion We present a novel approach that equips LLMs with the ability to use a variety of tools for answering knowledge-intensive visual questions. Our methodology, anchored in human decision-making data collected from a user study, employs a structured framework that uses an LLM-powered planner to dynamically decide on tool selection and query formation. An LLM-powered reasoner is tasked with processing and extracting key information from the output of the selected tool. Our method iteratively employs the planner and reasoner to leverage different tools until all necessary information required to answer the visual question is amassed. Acknowledgements This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A. Ross, Cordelia Schmid and Alireza Fathi.

Share This Post

There has been great progress towards adapting large language models (LLMs) to accommodate multimodal inputs for tasks including image captioning, visual question answering (VQA), and open vocabulary recognition. Despite such achievements, current state-of-the-art visual language models (VLMs) perform inadequately on visual information seeking datasets, such as Infoseek and OK-VQA, where external knowledge is required to answer the questions.

Examples of visual information seeking queries where external knowledge is required to answer the question. Images are taken from the OK-VQA dataset.

In “AVIS: Autonomous Visual Information Seeking with Large Language Models”, we introduce a novel method that achieves state-of-the-art results on visual information seeking tasks. Our method integrates LLMs with three types of tools: (i) computer vision tools for extracting visual information from images, (ii) a web search tool for retrieving open world knowledge and facts, and (iii) an image search tool to glean relevant information from metadata associated with visually similar images. AVIS employs an LLM-powered planner to choose tools and queries at each step. It also uses an LLM-powered reasoner to analyze tool outputs and extract key information. A working memory component retains information throughout the process.

An example of AVIS’s generated workflow for answering a challenging visual information seeking question. The input image is taken from the Infoseek dataset.

Comparison to previous work

Recent studies (e.g., Chameleon, ViperGPT and MM-ReAct) explored adding tools to LLMs for multimodal inputs. These systems follow a two-stage process: planning (breaking down questions into structured programs or instructions) and execution (using tools to gather information). Despite success in basic tasks, this approach often falters in complex real-world scenarios.

There has also been a surge of interest in applying LLMs as autonomous agents (e.g., WebGPT and ReAct). These agents interact with their environment, adapt based on real-time feedback, and achieve goals. However, these methods do not restrict the tools that can be invoked at each stage, leading to an immense search space. Consequently, even the most advanced LLMs today can fall into infinite loops or propagate errors. AVIS tackles this via guided LLM use, influenced by human decisions from a user study.

Informing LLM decision making with a user study

Many of the visual questions in datasets such as Infoseek and OK-VQA pose a challenge even for humans, often requiring the assistance of various tools and APIs. An example question from the OK-VQA dataset is shown below. We conducted a user study to understand human decision-making when using external tools.

We conducted a user study to understand human decision-making when using external tools. Image is taken from the OK-VQA dataset.

The users were equipped with an identical set of tools as our method, including PALI, PaLM, and web search. They received input images, questions, detected object crops, and buttons linked to image search results. These buttons offered diverse information about the detected object crops, such as knowledge graph entities, similar image captions, related product titles, and identical image captions.

We record user actions and outputs and use it as a guide for our system in two key ways. First, we construct a transition graph (shown below) by analyzing the sequence of decisions made by users. This graph defines distinct states and restricts the available set of actions at each state. For example, at the start state, the system can take only one of these three actions: PALI caption, PALI VQA, or object detection. Second, we use the examples of human decision-making to guide our planner and reasoner with relevant contextual instances to enhance the performance and effectiveness of our system.

AVIS transition graph.

General framework

Our approach employs a dynamic decision-making strategy designed to respond to visual information-seeking queries. Our system has three primary components. First, we have a planner to determine the subsequent action, including the appropriate API call and the query it needs to process. Second, we have a working memory that retains information about the results obtained from API executions. Last, we have a reasoner, whose role is to process the outputs from the API calls. It determines whether the obtained information is sufficient to produce the final response, or if additional data retrieval is required.

The planner undertakes a series of steps each time a decision is required regarding which tool to employ and what query to send to it. Based on the present state, the planner provides a range of potential subsequent actions. The potential action space may be so large that it makes the search space intractable. To address this issue, the planner refers to the transition graph to eliminate irrelevant actions. The planner also excludes the actions that have already been taken before and are stored in the working memory.

Next, the planner collects a set of relevant in-context examples that are assembled from the decisions previously made by humans during the user study. With these examples and the working memory that holds data collected from past tool interactions, the planner formulates a prompt. The prompt is then sent to the LLM, which returns a structured answer, determining the next tool to be activated and the query to be dispatched to it. This design allows the planner to be invoked multiple times throughout the process, thereby facilitating dynamic decision-making that gradually leads to answering the input query.

We employ a reasoner to analyze the output of the tool execution, extract the useful information and decide into which category the tool output falls: informative, uninformative, or final answer. Our method utilizes the LLM with appropriate prompting and in-context examples to perform the reasoning. If the reasoner concludes that it’s ready to provide an answer, it will output the final response, thus concluding the task. If it determines that the tool output is uninformative, it will revert back to the planner to select another action based on the current state. If it finds the tool output to be useful, it will modify the state and transfer control back to the planner to make a new decision at the new state.

AVIS employs a dynamic decision-making strategy to respond to visual information-seeking queries.

Results

We evaluate AVIS on Infoseek and OK-VQA datasets. As shown below, even robust visual-language models, such as OFA and PaLI, fail to yield high accuracy when fine-tuned on Infoseek. Our approach (AVIS), without fine-tuning, achieves 50.7% accuracy on the unseen entity split of this dataset.

AVIS visual question answering results on Infoseek dataset. AVIS achieves higher accuracy in comparison to previous baselines based on PaLI, PaLM and OFA.

Our results on the OK-VQA dataset are shown below. AVIS with few-shot in-context examples achieves an accuracy of 60.2%, higher than most of the previous works. AVIS achieves lower but comparable accuracy in comparison to the PALI model fine-tuned on OK-VQA. This difference, compared to Infoseek where AVIS outperforms fine-tuned PALI, is due to the fact that most question-answer examples in OK-VQA rely on common sense knowledge rather than on fine-grained knowledge. Therefore, PaLI is able to encode such generic knowledge in the model parameters and doesn’t require external knowledge.

Visual question answering results on A-OKVQA. AVIS achieves higher accuracy in comparison to previous works that use few-shot or zero-shot learning, including Flamingo, PaLI and ViperGPT. AVIS also achieves higher accuracy than most of the previous works that are fine-tuned on OK-VQA dataset, including REVEAL, ReVIVE, KAT and KRISP, and achieves results that are close to the fine-tuned PaLI model.

Conclusion

We present a novel approach that equips LLMs with the ability to use a variety of tools for answering knowledge-intensive visual questions. Our methodology, anchored in human decision-making data collected from a user study, employs a structured framework that uses an LLM-powered planner to dynamically decide on tool selection and query formation. An LLM-powered reasoner is tasked with processing and extracting key information from the output of the selected tool. Our method iteratively employs the planner and reasoner to leverage different tools until all necessary information required to answer the visual question is amassed.

Acknowledgements

This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A. Ross, Cordelia Schmid and Alireza Fathi.

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Uncategorized

Revealed: Key Components of Employee Engagement

Loyal employees are the core of every successful company. Depending on their role, they ensure the processes are running smoothly and responsibly on a daily

Uncategorized

DynIBaR: Space-time view synthesis from videos of dynamic scenes

Posted by Zhengqi Li and Noah Snavely, Research Scientists, Google Research

A mobile phone’s camera is a powerful tool for capturing everyday moments. However, capturing a dynamic scene using a single camera is fundamentally limited. For instance, if we wanted to adjust the camera motion or timing of a recorded video (e.g., to freeze time while sweeping the camera around to highlight a dramatic moment), we would typically need an expensive Hollywood setup with a synchronized camera rig. Would it be possible to achieve similar effects solely from a video captured using a mobile phone’s camera, without a Hollywood budget?

In “DynIBaR: Neural Dynamic Image-Based Rendering”, a best paper honorable mention at CVPR 2023, we describe a new method that generates photorealistic free-viewpoint renderings from a single video of a complex, dynamic scene. Neural Dynamic Image-Based Rendering (DynIBaR) can be used to generate a range of video effects, such as “bullet time” effects (where time is paused and the camera is moved at a normal speed around a scene), video stabilization, depth of field, and slow motion, from a single video taken with a phone’s camera. We demonstrate that DynIBaR significantly advances video rendering of complex moving scenes, opening the door to new kinds of video editing applications. We have also released the code on the DynIBaR project page, so you can try it out yourself.

Given an in-the-wild video of a complex, dynamic scene, DynIBaR can freeze time while allowing the camera to continue to move freely through the scene.

Background

The last few years have seen tremendous progress in computer vision techniques that use neural radiance fields (NeRFs) to reconstruct and render static (non-moving) 3D scenes. However, most of the videos people capture with their mobile devices depict moving objects, such as people, pets, and cars. These moving scenes lead to a much more challenging 4D (3D + time) scene reconstruction problem that cannot be solved using standard view synthesis methods.

Standard view synthesis methods output blurry, inaccurate renderings when applied to videos of dynamic scenes.

Other recent methods tackle view synthesis for dynamic scenes using space-time neural radiance fields (i.e., Dynamic NeRFs), but such approaches still exhibit inherent limitations that prevent their application to casually captured, in-the-wild videos. In particular, they struggle to render high-quality novel views from videos featuring long time duration, uncontrolled camera paths and complex object motion.

The key pitfall is that they store a complicated, moving scene in a single data structure. In particular, they encode scenes in the weights of a multilayer perceptron (MLP) neural network. MLPs can approximate any function — in this case, a function that maps a 4D space-time point (x, y, z, t) to an RGB color and density that we can use in rendering images of a scene. However, the capacity of this MLP (defined by the number of parameters in its neural network) must increase according to the video length and scene complexity, and thus, training such models on in-the-wild videos can be computationally intractable. As a result, we get blurry, inaccurate renderings like those produced by DVS and NSFF (shown below). DynIBaR avoids creating such large scene models by adopting a different rendering paradigm.

DynIBaR (bottom row) significantly improves rendering quality compared to prior dynamic view synthesis methods (top row) for videos of complex dynamic scenes. Prior methods produce blurry renderings because they need to store the entire moving scene in an MLP data structure.

Image-based rendering (IBR)

A key insight behind DynIBaR is that we don’t actually need to store all of the scene contents in a video in a giant MLP. Instead, we directly use pixel data from nearby input video frames to render new views. DynIBaR builds on an image-based rendering (IBR) method called IBRNet that was designed for view synthesis for static scenes. IBR methods recognize that a new target view of a scene should be very similar to nearby source images, and therefore synthesize the target by dynamically selecting and warping pixels from the nearby source frames, rather than reconstructing the whole scene in advance. IBRNet, in particular, learns to blend nearby images together to recreate new views of a scene within a volumetric rendering framework.

DynIBaR: Extending IBR to complex, dynamic videos

To extend IBR to dynamic scenes, we need to take scene motion into account during rendering. Therefore, as part of reconstructing an input video, we solve for the motion of every 3D point, where we represent scene motion using a motion trajectory field encoded by an MLP. Unlike prior dynamic NeRF methods that store the entire scene appearance and geometry in an MLP, we only store motion, a signal that is more smooth and sparse, and use the input video frames to determine everything else needed to render new views.

We optimize DynIBaR for a given video by taking each input video frame, rendering rays to form a 2D image using volume rendering (as in NeRF), and comparing that rendered image to the input frame. That is, our optimized representation should be able to perfectly reconstruct the input video.

We illustrate how DynIBaR renders images of dynamic scenes. For simplicity, we show a 2D world, as seen from above. (a) A set of input source views (triangular camera frusta) observe a cube moving through the scene (animated square). Each camera is labeled with its timestamp (t-2, t-1, etc). (b) To render a view from camera at time t, DynIBaR shoots a virtual ray through each pixel (blue line), and computes colors and opacities for sample points along that ray. To compute those properties, DyniBaR projects those samples into other views via multi-view geometry, but first, we must compensate for the estimated motion of each point (dashed red line). (c) Using this estimated motion, DynIBaR moves each point in 3D to the relevant time before projecting it into the corresponding source camera, to sample colors for use in rendering. DynIBaR optimizes the motion of each scene point as part of learning how to synthesize new views of the scene.

However, reconstructing and deriving new views for a complex, moving scene is a highly ill-posed problem, since there are many solutions that can explain the input video — for instance, it might create disconnected 3D representations for each time step. Therefore, optimizing DynIBaR to reconstruct the input video alone is insufficient. To obtain high-quality results, we also introduce several other techniques, including a method called cross-time rendering. Cross-time rendering refers to the use of the state of our 4D representation at one time instant to render images from a different time instant, which encourages the 4D representation to be coherent over time. To further improve rendering fidelity, we automatically factorize the scene into two components, a static one and a dynamic one, modeled by time-invariant and time-varying scene representations respectively.

Creating video effects

DynIBaR enables various video effects. We show several examples below.

Video stabilization

We use a shaky, handheld input video to compare DynIBaR’s video stabilization performance to existing 2D video stabilization and dynamic NeRF methods, including FuSta, DIFRINT, HyperNeRF, and NSFF. We demonstrate that DynIBaR produces smoother outputs with higher rendering fidelity and fewer artifacts (e.g., flickering or blurry results). In particular, FuSta yields residual camera shake, DIFRINT produces flicker around object boundaries, and HyperNeRF and NSFF produce blurry results.

Simultaneous view synthesis and slow motion

DynIBaR can perform view synthesis in both space and time simultaneously, producing smooth 3D cinematic effects. Below, we demonstrate that DynIBaR can take video inputs and produce smooth 5X slow-motion videos rendered using novel camera paths.

Video bokeh

DynIBaR can also generate high-quality video bokeh by synthesizing videos with dynamically changing depth of field. Given an all-in-focus input video, DynIBar can generate high-quality output videos with varying out-of-focus regions that call attention to moving (e.g., the running person and dog) and static content (e.g., trees and buildings) in the scene.

Conclusion

DynIBaR is a leap forward in our ability to render complex moving scenes from new camera paths. While it currently involves per-video optimization, we envision faster versions that can be deployed on in-the-wild videos to enable new kinds of effects for consumer video editing using mobile devices.

Acknowledgements

DynIBaR is the result of a collaboration between researchers at Google Research and Cornell University. The key contributors to the work presented in this post include Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely.

Do You Want To Boost Your Business?

drop us a line and keep in touch