Local Vision Language Model

Perform visual question answering on objects in an ROI, or on an ROI in the frame, using local models.

Overview

The Local Vision Language Model (VLM) node performs visual question answering on objects within a Region of Interest (ROI), or on an ROI within the frame. It uses local inference with models such as InternVL2_5-1B (small) and InternVL2_5-2B (large) to analyze and answer questions about image content.

Inputs & Outputs

  • Inputs: 1, Media Format: Raw Video
  • Outputs: 1, Media Format: Raw Video
  • Output Metadata: nodes.node_id, recognized_objs, recognized_obj_ids, recognized_obj_count, recognized_obj_delta, label_changed_obj_delta

Properties

  • roi_labels: Regions of interest labels. Type: hidden. Default: null. Required: Yes.
  • rois: Regions of interest. Conditional on roi_labels. Type: polygon. Default: null. Required: Yes.
  • processing_mode: Processing mode. Options: ROIs, at Interval (rois_interval); ROIs, upon Trigger (rois_trigger); Objects in an ROI (objects). Type: enum. Default: rois_interval. Required: Yes.
  • interval: Process objects or ROIs at least this many seconds apart. Type: float. Default: 10. Required: No.
  • trigger: Queue ROI for processing when this condition evaluates to true. Conditional on processing_mode being rois_trigger. Type: trigger-condition. Default: null. Required: No.
  • model_type: Model type. Options: Small (InternVL2_5-1B), Large (InternVL2_5-2B). Type: enum. Default: internvl2_5_1b. Required: No.
  • prompt: Question or prompt to ask about the image content. Type: text. Default: "Describe the image content". Required: No.
  • max_new_tokens: Maximum number of tokens to generate in the response. Type: number. Default: 100. Required: No.
  • display_roi: Display ROI on video? Type: bool. Default: true. Required: No.
  • display_objinfo: Display results on video? Options: Disabled (disabled), Bottom left (bottom_left), Bottom right (bottom_right), Top left (top_left), Top right (top_right). Type: enum. Default: bottom_left. Required: No.
  • debug: Log debugging information? Type: bool. Default: false. Required: No.
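
A minimal configuration sketch exercising these properties is shown below. The wrapper shape (node ID key and "type" field) follows the Example JSON at the end of this page; the property values are illustrative choices rather than a schema guaranteed by this reference, and roi_labels/rois are omitted for brevity.

{
    "local_vlm1": {
        "type": "local_vlm",
        "processing_mode": "rois_interval",
        "interval": 10,
        "model_type": "internvl2_5_1b",
        "prompt": "Describe any activity in this region",
        "max_new_tokens": 100,
        "display_roi": true,
        "display_objinfo": "bottom_left",
        "debug": false
    }
}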

Model Types

InternVL2_5-1B (Small)

  • Faster inference
  • Good for basic scene description and object detection
  • Lower memory requirements
  • Default model: OpenGVLab/InternVL2_5-1B
  • Model size: 2.0 GB download
  • GPU memory requirement: ~2.5 GB

InternVL2_5-2B (Large)

  • More detailed and nuanced responses
  • Better understanding of complex scenes
  • Higher memory requirements
  • Default model: OpenGVLab/InternVL2_5-2B
  • Model size: 4.5 GB download
  • GPU memory requirement: ~5.0 GB
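
Choosing between the two models is a trade-off between speed plus GPU memory on one side and answer quality on the other. Assuming the same illustrative configuration shape as the sketch above, selecting the large model is a matter of changing model_type, optionally with a higher max_new_tokens (an illustrative value here, not a recommendation from this reference) for longer answers:

{
    "model_type": "internvl2_5_2b",
    "max_new_tokens": 200
}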

Metadata

  • nodes.[node_id].rois.[roi_id].label_changed_delta: Indicates if there has been a change in the VLM result for the ROI.
  • nodes.[node_id].rois.[roi_id].label_available: Indicates if a VLM result is available for the ROI.
  • nodes.[node_id].rois.[roi_id].label: VLM result contents for the ROI.
  • nodes.[node_id].recognized_obj_count: The count of objects with VLM results.
  • nodes.[node_id].recognized_obj_delta: The change in the count of objects with VLM results.
  • nodes.[node_id].label_changed_obj_delta: The change in the count of objects with changed VLM results.

Example JSON

{
    "nodes": {
        "local_vlm1": {
            "type": "local_vlm",
            "rois": {
                "roi1": {
                    "label_changed_delta": true,
                    "label_available": true,
                    "label": "The image shows two people walking on a sidewalk"
                }
            },
            "recognized_obj_ids": ["2775161862"],
            "recognized_obj_count": 1,
            "recognized_obj_delta": 1,
            "label_changed_obj_delta": 1,
            "unrecognized_obj_count": 0,
            "unrecognized_obj_delta": 0,
            "objects_of_interest_keys": ["recognized_obj_ids"]
        }
    }
}
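
Downstream nodes that expose a trigger-condition property (like this node's own trigger) can react to these results. Assuming such conditions accept the dotted metadata paths listed above (an assumption for illustration, not confirmed by this page), a condition that fires whenever the ROI's answer changes might look like:

{
    "trigger": "nodes.local_vlm1.rois.roi1.label_changed_delta == true"
}

Checking label_available before reading label avoids acting on ROIs that do not yet have a VLM result.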