Local Vision Language Model Lookup

Perform visual question answering on objects in an ROI, or on an ROI in the frame, using local models.

Overview

The Local Vision Language Model Lookup (VLM) node performs visual question answering on objects within a Region of Interest (ROI), or on an ROI within the frame. It runs inference locally, using InternVL2_5-1B (small) or InternVL2_5-2B (large), to analyze image content and answer questions about it.

Inputs & Outputs

  • Inputs: 1, Media Format: Raw Video
  • Outputs: 1, Media Format: Raw Video
  • Output Metadata: nodes.node_id, recognized_objs, recognized_obj_ids, recognized_obj_count, recognized_obj_delta, label_changed_obj_delta

Properties

| Property | Description | Type | Default | Required |
| --- | --- | --- | --- | --- |
| roi_labels | Regions of interest labels. | hidden | null | Yes |
| rois | Regions of interest. Conditional on roi_labels. | polygon | null | Yes |
| processing_mode | Processing mode. Options: ROIs, at Interval (rois_interval); ROIs, upon Trigger (rois_trigger); Objects in an ROI (objects). | enum | rois_interval | Yes |
| interval | Collect objects or ROIs for lookup at least this many seconds apart. | float | 1 | No |
| trigger | Queue ROI for lookup when this condition evaluates to true. Conditional on processing_mode being rois_trigger. | trigger-condition | null | No |
| objects_to_process | Object types to process (e.g. car, person, car.red). Conditional on processing_mode being objects. | model-label | null | No |
| obj_lookup_mode | Object lookup mode. Options: Until result (until_result), which looks up on interval or size change until a result is obtained or max attempts are exhausted; Continuously (continuous), which looks up periodically at an interval. Conditional on processing_mode being objects. | enum | until_result | No |
| min_obj_size_pixels | Minimum width and height of an object. Conditional on processing_mode being objects. | number | 64 | No |
| obj_lookup_size_change_threshold | If the size of an object changes by more than this threshold, perform a lookup. Min: 0.01, Max: 2.0, Step: 0.2. Conditional on processing_mode being objects. | slider | 0.1 | No |
| max_lookups_per_obj | Maximum number of lookup attempts for an object in the Until result lookup mode. Conditional on processing_mode being objects. | number | 5 | No |
| model_type | Model type. Options: Small (InternVL2_5-1B), Large (InternVL2_5-2B). | enum | internvl2_5_1b | No |
| prompt | A prompt, additional instructions, or context for the model. Required if no attributes are provided. | text | null | No |
| description_mode | Generate a description of the scene or objects in the images. This description is used for search and summarization. Options: none, when_attributes_present, always. | enum | none | No |
| attributes | A JSON dictionary of attribute names, each mapped to a question or description (with optional answer choices) used to extract the attribute value. If present, description is a special attribute that overrides the description mode. See examples below. | json | null | No |
| detail_level | Image resolution. Options: Low (low), High (high). | enum | low | No |
| max_tokens | Maximum number of tokens to return for each request. | number | 500 | No |
| display_roi | Display ROI on video? | bool | true | No |
| display_objinfo | Display results on video? Options: Disabled (disabled), Bottom left (bottom_left), Bottom right (bottom_right), Top left (top_left), Top right (top_right). | enum | bottom_left | No |
| debug | Log debugging information? | bool | false | No |
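
To make the shape of the attributes property more concrete, here is a minimal Python sketch that builds an attribute dictionary and serializes it to JSON. The attribute names, the question wording, and the way answer choices are phrased are assumptions for illustration only. The source specifies just the overall shape: a JSON dictionary keyed by attribute name, with description as a special key. The helper to_property_value is hypothetical and not part of the node.

```python
import json

# Hypothetical attribute definitions: each key is an attribute name and each
# value is the question (with optional answer choices) used to extract it.
attributes = {
    "crowded": "Is the sidewalk crowded? Answer with one of: crowded, not_crowded",
    "weather": "What is the weather like? Answer with one of: sunny, rainy, snowy",
    # `description` is the special attribute that overrides description_mode.
    "description": "Describe the scene in one sentence.",
}


def to_property_value(attrs: dict) -> str:
    """Check the dictionary shape and serialize it as a JSON string."""
    for name, question in attrs.items():
        if not isinstance(name, str) or not name:
            raise ValueError("attribute names must be non-empty strings")
        if not isinstance(question, str) or not question.strip():
            raise ValueError(f"attribute {name!r} needs a question or description")
    return json.dumps(attrs, indent=2)


attributes_json = to_property_value(attributes)
print(attributes_json)
```

The resulting string is what would be pasted into the attributes property. Per the property table, a prompt is then optional because attributes are provided.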

Model Types

InternVL2_5-1B (Small)

  • Faster inference
  • Good for basic scene description and object detection
  • Lower memory requirements
  • Default model: OpenGVLab/InternVL2_5-1B
  • Model size: 2.0 GB download
  • GPU memory: requires ~2.5 GB

InternVL2_5-2B (Large)

  • More detailed and nuanced responses
  • Better understanding of complex scenes
  • Higher memory requirements
  • Default model: OpenGVLab/InternVL2_5-2B
  • Model size: 4.5 GB download
  • GPU memory: requires ~5.0 GB
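
The memory figures above suggest a simple way to choose a model_type for a given device. The sketch below is an illustration, not part of the node. The function pick_model_type and the headroom margin are hypothetical. The value internvl2_5_2b for the large model is assumed by analogy with the documented default internvl2_5_1b.

```python
from typing import Optional

# Approximate GPU memory needs in GB, taken from the figures listed above.
# The key "internvl2_5_2b" is an assumed enum value for the large model.
MODEL_GPU_MEMORY_GB = {
    "internvl2_5_1b": 2.5,  # Small
    "internvl2_5_2b": 5.0,  # Large
}


def pick_model_type(free_gpu_memory_gb: float, headroom_gb: float = 0.5) -> Optional[str]:
    """Return the largest model that fits in free memory plus headroom, or None."""
    # Try the most memory-hungry model first so the largest fitting one wins.
    candidates = sorted(MODEL_GPU_MEMORY_GB.items(), key=lambda kv: kv[1], reverse=True)
    for model_type, required_gb in candidates:
        if free_gpu_memory_gb >= required_gb + headroom_gb:
            return model_type
    return None


print(pick_model_type(8.0))
print(pick_model_type(4.0))
print(pick_model_type(2.0))
```

The headroom leaves room for other GPU workloads in the same pipeline. The right margin depends on the deployment.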

Metadata

| Metadata Property | Description |
| --- | --- |
| nodes.[node_id].rois.[roi_id].label_changed_delta | Indicates whether the VLM result for the ROI has changed. |
| nodes.[node_id].rois.[roi_id].label_available | Indicates whether a VLM result is available for the ROI. |
| nodes.[node_id].rois.[roi_id].label | VLM result contents for the ROI. |
| nodes.[node_id].rois.[roi_id].attributes.[attribute_name] | Attribute values (string or null) for the ROI, if the attributes property is specified. Excludes the description attribute. |
| nodes.[node_id].recognized_obj_count | The count of objects with VLM results. |
| nodes.[node_id].recognized_obj_delta | The change in the count of objects with VLM results. |
| nodes.[node_id].label_changed_obj_delta | The change in the count of objects with changed VLM results. |

Example JSON

{
    "nodes": {
        "local_vlm1": {
            "type": "local_vlm",
            "rois": {
                "roi1": {
                    "label_changed_delta": true,
                    "label_available": true,
                    "label": "The image shows two people walking on a sidewalk",
                    "attributes": {
                        "crowded": "crowded"
                    }
                }
            },
            "recognized_obj_ids": ["2775161862"],
            "recognized_obj_count": 1,
            "recognized_obj_delta": 1,
            "label_changed_obj_delta": 1,
            "unrecognized_obj_count": 0,
            "unrecognized_obj_delta": 0,
            "objects_of_interest_keys": ["recognized_obj_ids"]
        }
    }
}
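
A downstream consumer might read this metadata as in the following Python sketch. It parses a trimmed copy of the example above and collects the ROI results that are available. The function summarize_vlm_metadata and the shape of its return value are hypothetical. Only the key names come from the metadata table and example.

```python
import json

# Trimmed copy of the example metadata shown above.
metadata_json = """
{
    "nodes": {
        "local_vlm1": {
            "type": "local_vlm",
            "rois": {
                "roi1": {
                    "label_changed_delta": true,
                    "label_available": true,
                    "label": "The image shows two people walking on a sidewalk",
                    "attributes": {"crowded": "crowded"}
                }
            },
            "recognized_obj_ids": ["2775161862"],
            "recognized_obj_count": 1
        }
    }
}
"""


def summarize_vlm_metadata(metadata: dict, node_id: str) -> dict:
    """Collect available ROI results and object counts for one VLM node."""
    node = metadata.get("nodes", {}).get(node_id, {})
    rois = {}
    for roi_id, roi in node.get("rois", {}).items():
        # Skip ROIs that have no VLM result yet.
        if not roi.get("label_available"):
            continue
        rois[roi_id] = {
            "label": roi.get("label"),
            "changed": bool(roi.get("label_changed_delta")),
            "attributes": roi.get("attributes", {}),
        }
    return {
        "rois": rois,
        "recognized_obj_ids": node.get("recognized_obj_ids", []),
        "recognized_obj_count": node.get("recognized_obj_count", 0),
    }


summary = summarize_vlm_metadata(json.loads(metadata_json), "local_vlm1")
print(summary["rois"]["roi1"]["label"])
```

Using .get with defaults keeps the sketch tolerant of frames where the node has not yet produced a result, so a missing node or ROI yields an empty summary instead of a KeyError.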