Overview

The Local Vision Language Model Lookup (VLM) node performs visual question answering on objects within a Region of Interest (ROI), or on an ROI within the frame. It runs inference locally with InternVL2_5-1B (small) or InternVL2_5-2B (large) to analyze image content and answer questions about it.

Inputs & Outputs

Inputs: 1, Media Format: Raw Video
Outputs: 1, Media Format: Raw Video
Output Metadata: nodes.node_id, recognized_objs, recognized_obj_ids, recognized_obj_count, recognized_obj_delta, label_changed_obj_delta

Properties

Property	Description	Type	Default	Required
`roi_labels`	Regions of interest labels	hidden	null	Yes
`rois`	Regions of interest. Conditional on `roi_labels`.	polygon	null	Yes
`processing_mode`	Processing mode. Options: ROIs, at Interval (rois_interval); ROIs, upon Trigger (rois_trigger); Objects in an ROI (objects).	enum	rois_interval	Yes
`interval`	Collect objects or ROIs for lookup at least this many seconds apart.	float	1	No
`trigger`	Queue ROI for lookup when this condition evaluates to true. Conditional on `processing_mode` being `rois_trigger`.	trigger-condition	null	No
`objects_to_process`	Object types to process (e.g. car, person, car.red). Conditional on `processing_mode` being `objects`.	model-label	null	No
`obj_lookup_mode`	Object lookup mode. Options: Until result (until_result), which performs a lookup on interval or size change until a result is obtained or the maximum attempts are exhausted; Continuously (continuous), which performs a lookup periodically at an interval. Conditional on `processing_mode` being `objects`.	enum	until_result	No
`min_obj_size_pixels`	Min. width and height of an object. Conditional on `processing_mode` being `objects`.	number	64	No
`obj_lookup_size_change_threshold`	If the size of an object changes by more than this threshold, perform a lookup. Min: 0.01, Max: 2.0, Step: 0.2. Conditional on `processing_mode` being `objects`.	slider	0.1	No
`max_lookups_per_obj`	Maximum number of attempts to perform a lookup for an object in the `Until result` lookup mode. Conditional on `processing_mode` being `objects`.	number	5	No
`model_type`	Model type. Options: Small (InternVL2_5-1B), Large (InternVL2_5-2B).	enum	internvl2_5_1b	No
`prompt`	Provide a prompt, additional instructions or context for the model. Required if no `attributes` are provided.	text	null	No
`description_mode`	Generate a description of the scene or objects in the images. This description will be used for search and summarization. Options: `none`, `when_attributes_present`, `when_alert_present`, `always`	enum	`none`	No
`attributes`	A JSON dictionary of attribute names, each mapped to a question or description (with optional answer choices) used to extract the attribute value. Special attributes, if present: `description` overrides the description mode, `alert` describes the condition that triggers an alert, and `alert_message` overrides the message displayed when an alert is triggered. See examples below.	json	null	No
`detail_level`	Image resolution. Options: Low (low), High (high).	enum	low	No
`max_tokens`	Maximum number of tokens to return for each request.	number	500	No
`display_roi`	Display ROI on video?	bool	true	No
`display_objinfo`	Display results on video? Options: Disabled (disabled), Bottom left (bottom_left), Bottom right (bottom_right), Top left (top_left), Top right (top_right).	enum	bottom_left	No
`debug`	Log debugging information?	bool	false	No
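
As an illustration of the `attributes` and `prompt` properties above, the following Python sketch builds an attribute dictionary and checks the one rule the table states outright: `prompt` is required when no `attributes` are provided. The `check_properties` helper, the wording of the questions, and the way answer choices are embedded in the question text are all assumptions made for this example. The table does not say whether the node expects a serialized JSON string or a nested object; the sketch serializes it.

```python
import json

# Attribute names map to a question or description. Listing the answer
# choices inside the question text is an assumption, not a documented format.
attributes = {
    "crowded": "Is the sidewalk crowded? Answer with one of: crowded, not_crowded",
    "alert": "A person is crossing the road outside of a crosswalk",
    "alert_message": "Jaywalking detected",
}

# Property names and values are taken from the table above.
properties = {
    "processing_mode": "rois_interval",
    "interval": 1,
    "model_type": "internvl2_5_1b",
    "prompt": None,
    "attributes": json.dumps(attributes),
    "description_mode": "when_alert_present",
}


def check_properties(props):
    """Hypothetical helper: return a list of problems found in props."""
    problems = []
    # Documented rule: prompt is required if no attributes are provided.
    if not props.get("prompt") and not props.get("attributes"):
        problems.append("prompt is required when no attributes are provided")
    # The attributes value must decode to a JSON dictionary.
    if props.get("attributes"):
        try:
            parsed = json.loads(props["attributes"])
        except json.JSONDecodeError:
            problems.append("attributes is not valid JSON")
        else:
            if not isinstance(parsed, dict):
                problems.append("attributes must be a JSON dictionary")
    return problems


print(check_properties(properties))
```

Here the `alert` and `alert_message` entries are the special attributes described in the table, while `crowded` is an ordinary attribute whose value would appear under the ROI's `attributes` metadata.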

Model Types

InternVL2_5-1B (Small)

Faster inference
Good for basic scene description and object detection
Lower memory requirements
Default model: OpenGVLab/InternVL2_5-1B
Model size: 2.0 GB download
GPU Memory: Requires ~2.5 GB of GPU memory

InternVL2_5-2B (Large)

More detailed and nuanced responses
Better understanding of complex scenes
Higher memory requirements
Default model: OpenGVLab/InternVL2_5-2B
Model size: 4.5 GB download
GPU Memory: Requires ~5.0 GB of GPU memory
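
The memory figures above suggest a simple way to choose a `model_type` for a given device. The Python sketch below is only an illustration: `pick_model_type` and its `headroom_gb` margin are hypothetical, and the `internvl2_5_2b` enum value is assumed by analogy with the documented default `internvl2_5_1b`.

```python
# Approximate GPU memory requirements (GB) from the lists above.
MODEL_GPU_GB = {
    "internvl2_5_1b": 2.5,
    "internvl2_5_2b": 5.0,  # assumed enum value, mirroring the documented default
}


def pick_model_type(free_gpu_gb, headroom_gb=0.5):
    """Return the largest model_type that fits in free_gpu_gb, or None.

    headroom_gb is a hypothetical safety margin for other GPU workloads.
    """
    candidates = [
        (needed_gb, name)
        for name, needed_gb in MODEL_GPU_GB.items()
        if needed_gb + headroom_gb <= free_gpu_gb
    ]
    # max() compares on required memory first, so the larger model wins.
    return max(candidates)[1] if candidates else None
```

For example, `pick_model_type(4.0)` selects the small model because the large one would not fit with the margin, and `pick_model_type(2.0)` returns `None`.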

Metadata

Metadata Property	Description
`nodes.[node_id].rois.[roi_id].label_changed_delta`	Indicates if there has been a change in the VLM result for the ROI.
`nodes.[node_id].rois.[roi_id].label_available`	Indicates if a VLM result is available for the ROI.
`nodes.[node_id].rois.[roi_id].label`	VLM result contents for the ROI.
`nodes.[node_id].rois.[roi_id].attributes.[attribute_name]`	Attribute values (string or null) for the ROI, if the `attributes` property is specified. Excludes the `description` attribute.
`nodes.[node_id].recognized_obj_count`	The count of objects with VLM results.
`nodes.[node_id].recognized_obj_delta`	The change in the count of objects with VLM results.
`nodes.[node_id].label_changed_obj_delta`	The change in the count of objects with changed VLM results.
`nodes.[node_id].alert`	True if an alert was triggered, otherwise False.
`nodes.[node_id].alert_message`	Brief description of the alert.

Example JSON

{
    "nodes": {
        "local_vlm1": {
            "type": "local_vlm",
            "rois": {
                "roi1": {
                    "label_changed_delta": true,
                    "label_available": true,
                    "label": "The image shows two people walking on a sidewalk",
                    "attributes": {
                        "crowded": "crowded"
                    }
                }
            },
            "recognized_obj_ids": ["2775161862"],
            "recognized_obj_count": 1,
            "recognized_obj_delta": 1,
            "label_changed_obj_delta": 1,
            "unrecognized_obj_count": 0,
            "unrecognized_obj_delta": 0,
            "alert": true,
            "alert_message": "Jaywalking detected",
            "objects_of_interest_keys": ["recognized_obj_ids"]
        }
    }
}
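
To show how the metadata above might be consumed, the following Python sketch reads a trimmed copy of the example and collects the alert state plus the result for each ROI that has one. The `summarize_vlm` helper is hypothetical, and how the metadata reaches your code is outside the scope of this page; the sketch simply assumes it is available as a JSON string.

```python
import json

# Trimmed copy of the example metadata above.
metadata_json = """
{
    "nodes": {
        "local_vlm1": {
            "type": "local_vlm",
            "rois": {
                "roi1": {
                    "label_changed_delta": true,
                    "label_available": true,
                    "label": "The image shows two people walking on a sidewalk",
                    "attributes": {"crowded": "crowded"}
                }
            },
            "recognized_obj_count": 1,
            "alert": true,
            "alert_message": "Jaywalking detected"
        }
    }
}
"""


def summarize_vlm(metadata, node_id):
    """Hypothetical helper: collect alert state and per-ROI VLM results."""
    node = metadata["nodes"][node_id]
    summary = {
        "alert": node.get("alert", False),
        "alert_message": node.get("alert_message"),
        "rois": {},
    }
    for roi_id, roi in node.get("rois", {}).items():
        # Skip ROIs that do not have a VLM result yet.
        if not roi.get("label_available"):
            continue
        summary["rois"][roi_id] = {
            "label": roi.get("label"),
            "changed": roi.get("label_changed_delta", False),
            "attributes": roi.get("attributes", {}),
        }
    return summary


summary = summarize_vlm(json.loads(metadata_json), "local_vlm1")
if summary["alert"]:
    print("ALERT:", summary["alert_message"])
for roi_id, result in summary["rois"].items():
    print(roi_id, result["label"], result["attributes"])
```

Checking `label_available` before reading `label` avoids treating an ROI that has not been processed yet as an empty result.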