
Automate complex computer vision tasks based on text prompts?

Author: 3D Vision Workshop

Written by PCIPG-HAY | Source: 3DCV


This paper proposes VISPROG, a neuro-symbolic approach for solving complex compositional vision tasks given natural language instructions. VISPROG requires no task-specific training. Instead, it leverages the in-context learning ability of large language models to generate Python-like modular programs, which are then executed to obtain both the solution and a comprehensive, interpretable rationale. Each line of the generated program can invoke one of several off-the-shelf computer vision models, image processing subroutines, or Python functions to produce intermediate outputs that can be consumed by later parts of the program. We demonstrate the flexibility of VISPROG on four different tasks: compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue for easily and efficiently extending the scope of AI systems to serve the long tail of complex tasks that people may wish to perform.

Purpose: The pursuit of general-purpose AI systems has led to powerful end-to-end trainable models, many of which aim to provide a simple natural language interface through which users can interact with the model. Existing methods: The primary approach to building such systems is large-scale unsupervised pre-training followed by supervised multi-task training. However, this approach requires a curated dataset for each task, which makes it difficult to scale to the complex tasks we ultimately want these systems to perform. This work: The paper explores using large language models to solve complex visual tasks by decomposing a task described in natural language into simpler steps that can be handled by specialized end-to-end trained models or other programs.


Figure 1. VISPROG is a modular and interpretable neuro-symbolic system for compositional visual reasoning (framework diagram on the left; four tasks the system can perform on the right). Given visual data (a single image or a set of images) together with a natural language instruction, VISPROG generates a sequence of steps, a visual program if you will, which is then executed to produce the desired output. Each line of the program invokes one of the modules currently supported by the system. Modules can be off-the-shelf computer vision models, language models, image processing subroutines from OpenCV, or arithmetic and logical operators. A module consumes inputs produced by earlier lines of the program and emits intermediate results that can be consumed downstream. In the example above, the program generated by VISPROG calls a face detector, GPT-3 as a knowledge retrieval system, and CLIP as an open-vocabulary image classifier to produce the desired output (see Figure 1). VISPROG improves on how programs were previously generated and executed for vision applications. For visual question answering (VQA), neural module networks (NMNs) [2,9,10,12] compose specialized, differentiable neural modules into a question-specific, end-to-end trainable network. These methods either rely on brittle, off-the-shelf semantic parsers to deterministically compute the module layout, or learn a layout generator via REINFORCE [30] with weak answer supervision. In contrast, VISPROG uses a powerful language model (GPT-3) with a small number of in-context examples to create complex programs without any training. Programs created by VISPROG also operate at a higher level of abstraction than NMNs and call state-of-the-art trained models as well as non-neural Python subroutines (Figure 2). These advantages make VISPROG an easy-to-use, high-performing, and modular neuro-symbolic system. VISPROG is also highly interpretable. First, VISPROG generates easy-to-understand programs whose logical correctness the user can verify. Second, by breaking a prediction into simple steps, VISPROG allows users to inspect the outputs of intermediate steps to diagnose errors and, if needed, intervene in the reasoning process. The executed program, together with the intermediate results of each step (text, bounding boxes, segmentation masks, generated images, and so on), is stitched together into a visual rationale that describes the flow of information behind the prediction. To demonstrate its flexibility, we use VISPROG for four different tasks that share some common skills (such as image parsing) while each requiring some degree of specialized reasoning and visual manipulation. These tasks are compositional visual question answering, zero-shot reasoning on image pairs (NLVR), factual knowledge object tagging, and language-guided image editing. We emphasize that neither the language model nor any of the modules is fine-tuned in any way. Adapting VISPROG to a new task is as simple as providing a few in-context examples consisting of natural language instructions and the corresponding programs. While being easy to use, VISPROG improves over the base VQA model by 2.7 points on compositional VQA, achieves 62.4% zero-shot accuracy on NLVR without ever training on image pairs, and delivers pleasing qualitative and quantitative results on the knowledge tagging and image editing tasks.
The contributions of this paper are: (i) VISPROG, a system that uses the in-context learning ability of language models to generate visual programs from natural language instructions for compositional vision tasks (Section 3); (ii) a demonstration of VISPROG's flexibility on complex visual tasks such as factual knowledge object tagging and language-guided image editing, which have not been attempted, or have had limited success, with a single end-to-end model; and (iii) visual rationales for these tasks and a demonstration of their utility for error analysis and user-driven instruction tuning, which significantly improve VISPROG's performance.

Neuro-symbolic approaches are gaining new momentum thanks to the remarkable understanding, generation, and in-context learning abilities of large language models (LLMs). We briefly review prior program generation and execution methods for visual tasks, recent work applying LLMs to vision, and progress in reasoning methods for language tasks. Program generation and execution for vision tasks. Neural module networks (NMNs) pioneered modular and compositional approaches to visual question answering (VQA). NMNs compose neural modules into an end-to-end differentiable network. While early attempts used off-the-shelf parsers, later methods jointly learned the layout generator and the neural modules using REINFORCE with weak answer supervision. Although VISPROG is similar in spirit to NMNs, it has several advantages. First, VISPROG generates high-level programs whose intermediate steps call state-of-the-art trained neural models and other Python functions, rather than assembling an end-to-end neural network. This makes it easy to incorporate symbolic, non-differentiable modules. Second, VISPROG exploits the in-context learning ability of LLMs: it prompts the LLM (GPT-3) with a natural language instruction (or a visual question, or a statement to verify) together with a few example instruction-program pairs, and therefore does not require training a dedicated program generator for each task. LLMs for vision tasks. LLMs and in-context learning have been applied to visual tasks before. PICa uses an LLM for knowledge-based VQA: it represents the visual information in an image as text via captions, objects, and attributes, and feeds this textual representation to GPT-3 along with the question and in-context examples to generate the answer directly. Socratic Models (SMs), which compose pre-trained models from different modalities such as language (BERT, GPT-2), vision-language (CLIP), and audio-language (mSLAM), perform many zero-shot tasks including image captioning, video-to-text retrieval, and robot planning. In SMs, however, the composition for each task is predetermined and fixed. In contrast, VISPROG determines how to compose the models for each instance by generating a program from the instruction, question, or statement. We demonstrate VISPROG's ability to handle complex instructions involving diverse capabilities (20 modules) and diverse input (text, image, and image pairs), intermediate (text, image, bounding box, segmentation mask), and output modalities (text and image). Similar to VISPROG, ProgPrompt is concurrent work demonstrating an LLM's ability to generate Python-like robot action plans from natural language instructions. While ProgPrompt modules such as "find" or "grab" take a string (usually an object name) as input, VISPROG programs are more general: at each step, a module can accept multiple arguments produced by previous steps, including strings, numbers, arithmetic and logical expressions, or arbitrary Python objects (such as list() or dict() instances containing bounding boxes or segmentation masks).

Over the past few years, the AI community has created high-performing, task-specific models for many vision and language tasks such as object detection, segmentation, VQA, captioning, and text-to-image generation. While each of these models solves a well-defined but narrowly scoped problem, the tasks we typically want to solve in the real world tend to be broader and more loosely defined. To solve such practical tasks, one would either have to collect a new task-specific dataset, which can be expensive, or compose a program that invokes multiple neural models, image processing subroutines (such as image resizing, cropping, filtering, and color space conversion), and other computations (such as database lookups or arithmetic and logical operations). Manually writing these programs for the infinitely long tail of complex tasks we encounter every day not only requires programming expertise but is slow, labor-intensive, and ultimately cannot cover all tasks. What if one could simply describe the task in natural language and have an AI system generate and execute the corresponding visual program without any task-specific training? This is exactly the problem visual programming sets out to solve.

Large language models such as GPT-3 have demonstrated an impressive ability to generalize to new samples from just a few input-output demonstrations provided in context. For example, prompting GPT-3 with two English-to-French translation pairs and a new English phrase

good morning -> bonjour
good day -> bonne journée
good evening ->

yields the French translation "bonsoir", without any fine-tuning for the translation task. VISPROG uses GPT-3's in-context learning ability to map natural language instructions to visual programs. Analogous to the English-French pairs above, we prompt GPT-3 with pairs of instructions and the desired high-level programs. Figure 3 shows the prompt for an image editing task. The programs in the in-context examples are written manually and can usually be constructed without an accompanying image. Each line, or program step, of a VISPROG program consists of a module name, the module's input argument names and values, and an output variable name. Programs typically feed output variables from earlier steps as inputs to later steps. We use descriptive module names (e.g. "Select", "ColorPop", "Replace"), argument names (e.g. "image", "object", "query"), and variable names (e.g. "IMAGE", "OBJ") to help GPT-3 understand the input and output types and the function of each module. Output variables may hold arbitrary data types during execution. For example, "OBJ" is a list of objects in an image containing the mask, bounding box, and text (such as a category name) associated with each object.
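For illustration, an instruction like "Create a color pop of the dog and blur the background" might map to a program along the following lines. The module names follow the conventions just described, but the exact module set (e.g. Seg, BgBlur, Result) and argument signatures here are approximations for illustration, not copied from the paper's prompts.

OBJ0=Seg(image=IMAGE)
OBJ1=Select(image=IMAGE,object=OBJ0,query='dog')
IMAGE0=ColorPop(image=IMAGE,object=OBJ1)
IMAGE1=BgBlur(image=IMAGE0,object=OBJ1)
FINAL_RESULT=Result(var=IMAGE1)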


Figure 3. Program generation in VISPROG. The in-context examples are fed to GPT-3 together with a new natural language instruction. Without observing the image or its contents, VISPROG generates a program (bottom of Figure 3) that can be executed on the input image to perform the described task.
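To make the prompting scheme concrete, here is a minimal sketch of how instruction-program pairs could be assembled into a GPT-3 prompt. The prompt template and the llm_complete callable are hypothetical placeholders, not the paper's actual prompt format.

from typing import Callable, List, Tuple

def build_prompt(examples: List[Tuple[str, str]], instruction: str) -> str:
    """Concatenate in-context instruction/program pairs, then append the new instruction."""
    parts = []
    for example_instruction, example_program in examples:
        parts.append(f"Instruction: {example_instruction}\nProgram:\n{example_program}\n")
    # the model is expected to complete the program for the new instruction
    parts.append(f"Instruction: {instruction}\nProgram:\n")
    return "\n".join(parts)

def generate_program(examples: List[Tuple[str, str]],
                     instruction: str,
                     llm_complete: Callable[[str], str]) -> str:
    # llm_complete wraps whatever completion endpoint is used (e.g. GPT-3)
    return llm_complete(build_prompt(examples, instruction)).strip()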

VISPROG currently supports 20 modules (Figure 2) for capabilities such as image understanding, image processing (including generation), knowledge retrieval, and arithmetic and logical operations. In VISPROG, each module is implemented as a Python class (Code 1) with methods that: (i) parse the line to extract the input argument names and values as well as the output variable name; (ii) execute the necessary computation, which may involve a trained neural model, and update the program state with the output variable name and value; and (iii) summarize the step's computation visually using HTML (later used to build the visual rationale). Adding a new module to VISPROG only requires implementing and registering a module class; execution of programs that use the module is handled automatically by the VISPROG interpreter, described below.

Figure 2. Modules currently supported by VISPROG. Modules shown in red use neural models (OWL-ViT, DSFD, MaskFormer, CLIP, ViLT, and Stable Diffusion). Modules shown in blue use image processing and other Python subroutines. These modules are invoked by programs generated from natural language instructions. Adding a new module to extend VISPROG's functionality is straightforward (Code 1).

from typing import Any, Dict, List

class VisProgModule():
    def __init__(self):
        # load a trained model and move it to the GPU, if the module needs one
        pass

    def html(self, inputs: List, output: Any):
        # return an html string visualizing the step's inputs and output
        pass

    def parse(self, step: str):
        # parse the step and return the list of literal input values,
        # the input variable names, and the output variable name
        pass

    def execute(self, step: str, state: Dict):
        inputs, input_var_names, output_var_name = self.parse(step)

        # get values of input variables from the program state
        for var_name in input_var_names:
            inputs.append(state[var_name])

        # perform the computation using the loaded model
        output = some_computation(inputs)

        # update the program state
        state[output_var_name] = output

        # visual summary of the step computation
        step_html = self.html(inputs, output)
        return output, step_html
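As an example of how a new module might be added, the hypothetical CropRight module below follows the interface of Code 1. Its name, the step syntax it parses, and its use of PIL images are assumptions made purely for illustration.

import re
from typing import Any, Dict, List

class CropRightModule(VisProgModule):
    """Hypothetical module that crops the right half of an image.
    Handles steps of the form: IMAGE0=CropRight(image=IMAGE)"""

    def __init__(self):
        pass  # purely symbolic module: no neural model to load

    def parse(self, step: str):
        # expects e.g. "IMAGE0=CropRight(image=IMAGE)"
        match = re.match(r"(\w+)=CropRight\(image=(\w+)\)", step.strip())
        output_var_name, input_var_name = match.group(1), match.group(2)
        return [], [input_var_name], output_var_name  # no literal inputs

    def execute(self, step: str, state: Dict):
        inputs, input_var_names, output_var_name = self.parse(step)
        for var_name in input_var_names:
            inputs.append(state[var_name])
        image = inputs[0]                        # assumed to be a PIL.Image
        w, h = image.size
        output = image.crop((w // 2, 0, w, h))   # keep the right half
        state[output_var_name] = output          # update the program state
        return output, self.html(inputs, output)

    def html(self, inputs: List, output: Any):
        return "<div>CropRight: kept the right half of the input image</div>"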

Program execution is handled by an interpreter. The interpreter initializes the program state (a dictionary mapping variable names to their values) with the inputs and steps through the program line by line, invoking the correct module with the inputs specified on that line. After each step, the program state is updated with the name and value of that step's output.

In addition to performing the necessary computation, each module class implements an html() method to visually summarize the module's inputs and outputs as an HTML snippet. The interpreter simply stitches together the HTML summaries of all program steps into a visual rationale (Figure 4), which can be used to analyze the logical correctness of the program and inspect its intermediate outputs. The visual rationale also helps users understand the cause of a failure and, where possible, tweak the natural language instruction to improve performance.
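A minimal sketch of such an interpreter loop, including the stitching of per-step HTML summaries into a rationale, is shown below. The registry structure and the way the module name is extracted from a step are simplifying assumptions, not the paper's implementation.

from typing import Any, Dict, List

def execute_program(program: str, init_state: Dict[str, Any], registry: Dict[str, Any]):
    # program state: variable name -> value, seeded with the inputs (e.g. IMAGE)
    state = dict(init_state)
    step_htmls: List[str] = []
    for step in program.strip().split("\n"):
        # e.g. "OBJ1=Select(image=IMAGE,query='dog')" -> module name "Select"
        module_name = step.split("=", 1)[1].split("(")[0].strip()
        module = registry[module_name]
        # the module parses the step, reads its inputs from state,
        # computes its output, and writes it back into state
        _, step_html = module.execute(step, state)
        step_htmls.append(step_html)
    # stitch the per-step summaries into a single visual rationale
    rationale_html = "\n".join(step_htmls)
    return state, rationale_html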


Figure 4. Visual rationales generated by VISPROG. These rationales visually summarize the inputs and outputs of each computation step in programs generated for an image editing instruction (top) and NLVR inference (bottom).

VISPROG provides a flexible framework applicable to a wide range of complex vision tasks. We evaluate VISPROG on four tasks that require capabilities such as spatial reasoning, multi-image reasoning, knowledge retrieval, and image generation and manipulation. Figure 5 summarizes the inputs, outputs, and modules used for these tasks. We now describe the tasks, their evaluation settings, and the choice of in-context examples.


Figure 5. We evaluate VISPROG on a diverse set of tasks. These tasks cover a variety of inputs and outputs and reuse modules (Loc, FaceDet, VQA) wherever possible.

VISPROG is compositional by nature, which makes it well suited to compositional, multi-step visual question answering on GQA. The modules used for GQA include an open-vocabulary localization module, a VQA module, functions for cropping image regions given bounding box coordinates or spatial prepositions (e.g. top, left, and so on), a module for counting boxes, and a module for evaluating Python expressions. For example, consider the question "Is the pickup truck to the left or to the right of the person wearing the helmet?". VISPROG first localizes the "person wearing the helmet", crops the region to the left (or right) of that person, checks whether a "pickup truck" appears on that side, and returns "left" if it does, otherwise "right" (a program sketch illustrating this decomposition follows this paragraph). VISPROG uses a ViLT-based question answering module, but instead of simply passing the complex raw question to ViLT, VISPROG calls it to perform simpler tasks, such as identifying the content of a part of the image. As a result, the GQA programs VISPROG generates are not only more interpretable than ViLT but also more accurate (Table 1). Alternatively, one could eliminate the need for a QA model like ViLT entirely and use other systems such as CLIP together with an object detector, but we leave this for future work. Evaluation. To limit the cost of generating programs with GPT-3, we create a subset of GQA for evaluation. Each question in GQA is labeled with a question type. To evaluate a diverse set of question types (about 100 detailed types), we randomly sample up to k samples per question type from the balanced val (k = 5) and testdev (k = 20) sets. Prompts. We manually annotate 31 random questions from the balanced training set with the desired VISPROG programs. Annotating a question with a program is easy and amounts to writing down the chain of reasoning needed to answer that particular question. To reduce the cost of answering each GQA question, we prompt GPT-3 with a smaller subset of in-context examples sampled randomly from this list.
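For the pickup-truck question above, the generated program might look roughly like the following. The module names (Loc, Crop_Left, Count, Eval, Result) and argument formats are approximations of the GQA programs described in the text, not verbatim examples from the paper.

BOX0=Loc(image=IMAGE,object='person wearing helmet')
IMAGE0=Crop_Left(image=IMAGE,box=BOX0)
BOX1=Loc(image=IMAGE0,object='pickup truck')
ANSWER0=Count(box=BOX1)
ANSWER1=Eval(expr="'left' if {ANSWER0} > 0 else 'right'")
FINAL_RESULT=Result(var=ANSWER1)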


Table 1. GQA testdev results. We report performance on a subset of the original GQA testdev set.

VQA models are trained to answer questions about a single image. In practice, one might want a system that can answer questions about a collection of images. For example, a user might ask the system to parse their vacation album and answer "Which landmark did we visit the day after we saw the Eiffel Tower?". Rather than collecting an expensive dataset and training a multi-image model, we demonstrate VISPROG's ability to solve tasks involving multiple images using a single-image VQA system, without ever training on multi-image examples. We demonstrate this capability on the NLVRv2 benchmark, which involves verifying statements about image pairs. Typically, tackling NLVRv2 requires training a custom architecture that takes image pairs as input on the NLVRv2 training set. Instead, VISPROG decomposes a complex statement into simple questions about individual images and a Python expression involving arithmetic and logical operators over the image-level answers. The VQA model ViLT-VQA produces the image-level answers, and the Python expression is evaluated to verify the statement (see the sketch after this paragraph). Evaluation. We create a small validation set of 250 random samples from the NLVRv2 dev set to guide prompt selection, and test generalization on the full public test set of NLVRv2. Prompts. We sample and annotate VISPROG programs for 16 random statements from the NLVRv2 training set. Since some of these examples are redundant (similar program structure), we also create a curated subset of 12 examples by removing 4 redundant ones.
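As an illustration of this decomposition, a statement such as "There are two dogs in total across the two images" might be verified by a program of the following shape. The statement and the exact module signatures are hypothetical examples of the pattern described above.

ANSWER0=VQA(image=LEFT,question='How many dogs are in the image?')
ANSWER1=VQA(image=RIGHT,question='How many dogs are in the image?')
ANSWER2=Eval(expr='{ANSWER0} + {ANSWER1} == 2')
FINAL_RESULT=Result(var=ANSWER2)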

We often want to identify people and objects in an image whose names we do not know. For example, we may want to identify celebrities, politicians, characters from TV shows, national flags, company logos, popular cars and their manufacturers, biological species, and much more. Solving this task requires not only localizing people, faces, and objects, but also looking up factual knowledge in an external knowledge base to construct the set of categories to classify against, such as the names of the characters in a TV show. We refer to this task as factual knowledge object tagging, or knowledge tagging for short. To solve knowledge tagging, VISPROG uses GPT-3 as an implicit knowledge base that can be queried with natural language prompts such as "List the main characters of the TV show The Big Bang Theory separated by commas." The generated category list is then passed to a CLIP image classification module, which classifies the image regions produced by the localization and face detection modules (see the program sketch following this paragraph). VISPROG's program generator automatically decides whether to use the face detector or the open-vocabulary localizer based on the context in the natural language instruction. VISPROG also estimates the maximum size of the retrieved category list. For example, "Tag the logos of the top 5 German car companies" produces a list of 5 categories, whereas "Tag the logos of German car companies" produces a list whose length is determined by GPT-3, with a cutoff of 20. This lets users easily control noise in the classification process by adjusting the instruction. Evaluation. To evaluate VISPROG on this task, we annotate 100 tagging instructions over 46 images that require external knowledge to tag 253 object instances, including personalities from popular culture, politics, sports, and art, as well as a variety of objects (e.g. cars, flags, fruits, appliances, furniture, etc.). For each instruction, we measure localization and tagging performance with precision (the fraction of predicted boxes that are correct) and recall (the fraction of ground-truth objects that are correctly predicted). The tagging metric requires both the predicted bounding box and the associated class label to be correct, whereas localization ignores the label. To determine localization correctness we use an IoU threshold of 0.5. We summarize localization and tagging performance with the F1 score, the harmonic mean of the precision and recall averaged across instructions. Prompts. We create 14 in-context examples for this task. Note that the programs in these examples are hallucinated, i.e. no images are associated with them.
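A tagging instruction such as "Tag the main characters of The Big Bang Theory" might therefore expand into a program like the sketch below. The List, Classify, and Tag module names and their argument lists are approximations used for illustration; FaceDet is one of the modules named in Figure 5.

OBJ0=FaceDet(image=IMAGE)
LIST0=List(query='main characters of the TV show The Big Bang Theory',max=20)
OBJ1=Classify(image=IMAGE,object=OBJ0,categories=LIST0)
IMAGE0=Tag(image=IMAGE,object=OBJ1)
FINAL_RESULT=Result(var=IMAGE0)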

Text-to-image generation has made impressive progress over the past few years with models such as DALL-E, Parti, and Stable Diffusion. However, these models still cannot handle prompts such as "Hide the face of Daniel Craig with :p" (de-identification or privacy preservation) or "Create a color pop of Daniel Craig and blur the background" (object highlighting), even though implementing them programmatically with a combination of face detection, segmentation, and image processing modules is relatively straightforward. Complex edits such as "Replace Barack Obama with Barack Obama wearing sunglasses" (object replacement) require first identifying the object of interest, generating a mask for the object to be replaced, and then calling an image inpainting model (we use Stable Diffusion) with the original image, the mask specifying the pixels to replace, and a description of the new pixels to generate at that location (see the sketch after this paragraph). When equipped with the necessary modules and example programs, VISPROG handles very complex instructions with ease. Evaluation. To test VISPROG on image editing instructions for de-identification, object highlighting, and object replacement, we collect 107 instructions over 65 images. We manually score the correctness of the predictions and report accuracy. Note that for the object replacement sub-task, which uses Stable Diffusion, we do not penalize visual artifacts as long as the resulting image is semantically correct. Prompts. Similar to knowledge tagging, we create 10 in-context examples for this task, with no associated images.
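For the object replacement example above, the corresponding program might look like the following sketch of the segment-select-inpaint pipeline just described. The Seg module name and the argument formats are illustrative assumptions; Select and Replace are module names mentioned earlier in the text.

OBJ0=Seg(image=IMAGE)
OBJ1=Select(image=IMAGE,object=OBJ0,query='Barack Obama')
IMAGE0=Replace(image=IMAGE,object=OBJ1,prompt='Barack Obama wearing sunglasses')
FINAL_RESULT=Result(var=IMAGE0)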

Our experiments evaluate the effect of the number of in-context examples on GQA and NLVR performance (Section 5.1), compare the generalization of VISPROG across the four tasks under different prompting strategies (Section 5.2), analyze the sources of error for each task (Figure 9), and study the utility of visual rationales for diagnosing errors and improving VISPROG's performance through instruction tuning (Section 5.3).

Figure 6 shows that validation performance gradually improves as the number of in-context examples in the GQA and NLVR prompts increases. Each run randomly selects a subset of the annotated in-context examples based on a random seed. We also find that majority voting across random seeds consistently yields better performance than the average performance across runs. This is consistent with findings in the chain-of-thought reasoning literature for mathematical reasoning problems. On NLVR, VISPROG's performance saturates with fewer prompts than on GQA. We believe this is because NLVRv2 programs require fewer modules than GQA programs, and therefore fewer demonstrations of their use.
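The majority-voting strategy can be summarized by the small sketch below. The data layout (one list of per-question answers per run) is an assumption made for illustration.

from collections import Counter
from typing import List

def majority_vote(per_run_answers: List[List[str]]) -> List[str]:
    """per_run_answers[r][q] holds run r's answer to question q."""
    n_questions = len(per_run_answers[0])
    voted = []
    for q in range(n_questions):
        votes = Counter(run_answers[q] for run_answers in per_run_answers)
        voted.append(votes.most_common(1)[0][0])  # most frequent answer wins
    return voted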


Figure 6. Performance on the GQA and NLVRv2 validation sets improves with the number of in-context examples. Error bars denote 95% confidence intervals over 5 runs. Predictions from the same runs are used for majority voting. (Section 5.1)

GQA. In Table 1 we evaluate different prompting strategies on the GQA testdev set. For the largest prompt size evaluated on the validation set (24 in-context examples), we compare a random strategy, which uses the best prompt selected by VISPROG across 5 validation runs (each run randomly sampling from the 31 annotated in-context examples), and a majority-voting strategy, which takes the maximum-consensus prediction across the 5 runs for each question. While the "random" prompt is only slightly better than ViLT-VQA, voting brings a significant gain of 2.7 points. This is because voting pools multiple runs, each with a different set of in-context examples, effectively increasing the total number of in-context examples seen per prediction. We also evaluate a manually curated prompt with 20 examples, 16 drawn from the 31 annotated examples and 4 additional hallucinated examples designed to better cover failure cases observed on the validation set. The curated prompt performs as well as the voting strategy while using 5x less computation, highlighting the promise of prompt engineering. NLVR. Table 2 shows VISPROG's performance on the NLVRv2 test set and compares the random, voting, and curated prompting strategies, as for GQA. While VISPROG performs the NLVR task zero-shot, without training on image pairs, we report ViLT-NLVR, a ViLT model fine-tuned on NLVRv2, as an upper bound. Although a few points behind this upper bound, VISPROG shows strong zero-shot performance while relying only on a single-image VQA model for image understanding and an LLM for reasoning. Note that VISPROG uses ViLT-VQA as its VQA module, which is trained on the single-image question answering task of VQAv2, not on NLVRv2.


Table 2. NLVRv2 test results. VISPROG performs NLVR zero-shot, i.e. without training any module on image pairs. ViLT-NLVR, a ViLT model fine-tuned on NLVRv2, serves as an upper bound.

Knowledge tagging. Table 3 shows localization and tagging performance on the knowledge tagging task. Every instruction in this task requires not only open-vocabulary localization but also querying a knowledge base for the categories with which to tag the localized objects, which makes the task impossible with an object detector alone. With the original instructions, VISPROG achieves an impressive 63.7% F1 for tagging, which requires both correctly localizing and correctly naming objects, and 80.6% F1 for localization alone. VISPROG's visual rationales allow performance to be improved further by modifying the instructions.


Table 3. Knowledge tagging results. The table shows performance with the original instructions and with modified instructions created after inspecting the visual rationales to understand the sources of instance-specific errors.

Image editing. Table 4 shows performance on the language-guided image editing task, and Figure 7 shows the wide range of manipulations possible with VISPROG's current set of modules, including face manipulation, highlighting one or more objects in an image with stylistic effects such as color pop and background blur, and changing the scene context by replacing key elements of the scene (e.g. the desert).


Table 4. Image editing results. We manually evaluate the semantic correctness of each prediction.


Figure 7. Qualitative results for image editing (top) and knowledge tagging tasks (bottom).

Error analysis.

VISPROG's visual rationales enable a thorough analysis of failure modes. In Figure 9 we examine the rationales of roughly 100 samples per task to break down the sources of error. This kind of analysis offers a clear path to improving VISPROG's performance on each task. For example, because incorrect programs are the primary source of error on GQA, affecting about 16% of samples, GQA performance could be improved by providing more in-context examples similar to the failing questions. Performance can also be improved by swapping the model that implements a high-error module for a stronger one. For instance, replacing the ViLT-VQA model with a better VQA model could improve NLVR performance by up to 24% (Figure 9). Similarly, improving the models behind the "List" and "Select" modules, the main sources of error for the knowledge tagging and image editing tasks respectively, could significantly reduce errors.


Figure 9. Sources of error in VISPROG.

Instruction tuning.

To be useful, visual rationales must ultimately allow users to improve the system's performance on their tasks. For the knowledge tagging and image editing tasks, we study whether visual rationales can help users modify, or tune, instructions for better performance. Figure 8 shows how localization errors revealed by the visual rationale enable a user to modify the instruction to better query the localization module. Other ways of modifying an instruction include providing a better query for knowledge retrieval, or supplying a category name to the Select module to restrict its search to segmented regions belonging to that category. Tables 3 and 4 show that instruction tuning yields significant gains on the knowledge tagging and image editing tasks.


Figure 8. Instruction tuning using visual rationales. By revealing the cause of a failure, VISPROG allows users to modify the original instruction to improve performance.

VISPROG proposes visual programming as a simple and effective way of harnessing the reasoning capabilities of LLMs for complex visual tasks. VISPROG demonstrates strong performance while producing highly interpretable visual rationales. We believe that researching new ways of incorporating user feedback to improve the performance of neuro-symbolic systems such as VISPROG is an exciting direction for building the next generation of general-purpose vision systems.
