Local Vision Language Model

Perform visual question answering on objects in an ROI, or on an ROI in the frame, using local models.

Overview

The Local Vision Language Model (VLM) node performs visual question answering on objects within a Region of Interest (ROI), or on an ROI within the frame. It uses local inference with models such as InternVL2_5-1B (small) and InternVL2_5-2B (large) to analyze and answer questions about image content.

Inputs & Outputs

  • Inputs: 1, Media Format: Raw Video
  • Outputs: 1, Media Format: Raw Video
  • Output Metadata: nodes.node_id, recognized_objs, recognized_obj_ids, recognized_obj_count, recognized_obj_delta, label_changed_obj_delta

Properties

  • roi_labels: Regions of interest labels. Type: hidden. Default: null. Required: Yes.
  • rois: Regions of interest. Conditional on roi_labels. Type: polygon. Default: null. Required: Yes.
  • processing_mode: Processing mode. Options: ROIs, at Interval (rois_interval); ROIs, upon Trigger (rois_trigger); Objects in an ROI (objects). Type: enum. Default: rois_interval. Required: Yes.
  • interval: Process objects or ROIs at least this many seconds apart. Type: float. Default: 10. Required: No.
  • trigger: Queue ROI for processing when this condition evaluates to true. Conditional on processing_mode being rois_trigger. Type: trigger-condition. Default: null. Required: No.
  • model_type: Model type. Options: Small (InternVL2_5-1B), Large (InternVL2_5-2B). Type: enum. Default: internvl2_5_1b. Required: No.
  • prompt: Question or prompt to ask about the image content. Type: text. Default: "Describe the image content". Required: No.
  • max_new_tokens: Maximum number of tokens to generate in the response. Type: number. Default: 100. Required: No.
  • display_roi: Display ROI on video? Type: bool. Default: true. Required: No.
  • display_objinfo: Display results on video? Options: Disabled (disabled), Bottom left (bottom_left), Bottom right (bottom_right), Top left (top_left), Top right (top_right). Type: enum. Default: bottom_left. Required: No.
  • debug: Log debugging information? Type: bool. Default: false. Required: No.
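
A minimal configuration sketch exercising these properties is shown below. The wrapper shape (node ID key and "type" field) follows the Example JSON at the end of this page; the property values are illustrative choices rather than a schema guaranteed by this reference, and roi_labels/rois are omitted for brevity.

{
    "local_vlm1": {
        "type": "local_vlm",
        "processing_mode": "rois_interval",
        "interval": 10,
        "model_type": "internvl2_5_1b",
        "prompt": "Describe any activity in this region",
        "max_new_tokens": 100,
        "display_roi": true,
        "display_objinfo": "bottom_left",
        "debug": false
    }
}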

Model Types

InternVL2_5-1B (Small)

  • Faster inference
  • Good for basic scene description and object detection
  • Lower memory requirements
  • Default model: OpenGVLab/InternVL2_5-1B
  • Model size: 2.0 GB download
  • GPU memory requirement: ~2.5 GB

InternVL2_5-2B (Large)

  • More detailed and nuanced responses
  • Better understanding of complex scenes
  • Higher memory requirements
  • Default model: OpenGVLab/InternVL2_5-2B
  • Model size: 4.5 GB download
  • GPU memory requirement: ~5.0 GB
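
Choosing between the two models is a trade-off between speed plus GPU memory on one side and answer quality on the other. Assuming the same illustrative configuration shape as the sketch above, selecting the large model is a matter of changing model_type, optionally with a higher max_new_tokens (an illustrative value here, not a recommendation from this reference) for longer answers:

{
    "model_type": "internvl2_5_2b",
    "max_new_tokens": 200
}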

Metadata

  • nodes.[node_id].rois.[roi_id].label_changed_delta: Indicates if there has been a change in the VLM result for the ROI.
  • nodes.[node_id].rois.[roi_id].label_available: Indicates if a VLM result is available for the ROI.
  • nodes.[node_id].rois.[roi_id].label: VLM result contents for the ROI.
  • nodes.[node_id].recognized_obj_count: The count of objects with VLM results.
  • nodes.[node_id].recognized_obj_delta: The change in the count of objects with VLM results.
  • nodes.[node_id].label_changed_obj_delta: The change in the count of objects with changed VLM results.

Example JSON

{
    "nodes": {
        "local_vlm1": {
            "type": "local_vlm",
            "rois": {
                "roi1": {
                    "label_changed_delta": true,
                    "label_available": true,
                    "label": "The image shows two people walking on a sidewalk"
                }
            },
            "recognized_obj_ids": ["2775161862"],
            "recognized_obj_count": 1,
            "recognized_obj_delta": 1,
            "label_changed_obj_delta": 1,
            "unrecognized_obj_count": 0,
            "unrecognized_obj_delta": 0,
            "objects_of_interest_keys": ["recognized_obj_ids"]
        }
    }
}
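
Downstream nodes that expose a trigger-condition property (like this node's own trigger) can react to these results. Assuming such conditions accept the dotted metadata paths listed above (an assumption for illustration, not confirmed by this page), a condition that fires whenever the ROI's answer changes might look like:

{
    "trigger": "nodes.local_vlm1.rois.roi1.label_changed_delta == true"
}

Checking label_available before reading label avoids acting on ROIs that do not yet have a VLM result.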