Local Vision Language Model
Perform visual question answering on objects within an ROI, or on an ROI within the frame, using local models.
Overview
The Local Vision Language Model (VLM) node performs visual question answering on objects within a Region of Interest (ROI), or on an ROI within the frame. It uses local inference with models such as InternVL2_5-1B (small) and InternVL2_5-2B (large) to analyze and answer questions about image content.
Inputs & Outputs
- Inputs: 1, Media Format: Raw Video
- Outputs: 1, Media Format: Raw Video
- Output Metadata:
  - nodes.node_id
  - recognized_objs
  - recognized_obj_ids
  - recognized_obj_count
  - recognized_obj_delta
  - label_changed_obj_delta
Properties
| Property | Description | Type | Default | Required |
|---|---|---|---|---|
| roi_labels | Regions of interest labels. | hidden | null | Yes |
| rois | Regions of interest. Conditional on roi_labels. | polygon | null | Yes |
| processing_mode | Processing mode. Options: ROIs, at Interval (rois_interval); ROIs, upon Trigger (rois_trigger); Objects in an ROI (objects). | enum | rois_interval | Yes |
| interval | Process objects or ROIs at least this many seconds apart. | float | 10 | No |
| trigger | Queue ROI for processing when this condition evaluates to true. Conditional on processing_mode being rois_trigger. | trigger-condition | null | No |
| model_type | Model type. Options: Small (InternVL2_5-1B), Large (InternVL2_5-2B). | enum | internvl2_5_1b | No |
| prompt | Question or prompt to ask about the image content. | text | "Describe the image content" | No |
| max_new_tokens | Maximum number of tokens to generate in the response. | number | 100 | No |
| display_roi | Display ROI on video? | bool | true | No |
| display_objinfo | Display results on video? Options: Disabled (disabled), Bottom left (bottom_left), Bottom right (bottom_right), Top left (top_left), Top right (top_right). | enum | bottom_left | No |
| debug | Log debugging information? | bool | false | No |
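To illustrate how these properties fit together, the sketch below expresses a node configuration as a Python dict. Only the property names, defaults, and the "type"/"nodes" structure echoed in the Example JSON later on this page come from this document; the node id (local_vlm1), the way a configuration is submitted to the pipeline, and the omission of the required roi_labels/rois entries (whose exact polygon format is not shown here) are assumptions for illustration.

```python
# Hypothetical Local VLM node configuration built from the properties above.
# The required roi_labels / rois entries are omitted because their exact
# polygon format is not documented in this section.
local_vlm_config = {
    "nodes": {
        "local_vlm1": {
            "type": "local_vlm",
            "processing_mode": "rois_interval",  # process ROIs at a fixed interval
            "interval": 10,                      # seconds between VLM calls per ROI
            "model_type": "internvl2_5_1b",      # small model (default)
            "prompt": "Describe the image content",
            "max_new_tokens": 100,
            "display_roi": True,
            "display_objinfo": "bottom_left",
            "debug": False,
        }
    }
}
```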
Model Types
InternVL2_5-1B (Small)
- Faster inference
- Good for basic scene description and object detection
- Lower memory requirements
- Default model: OpenGVLab/InternVL2_5-1B
- Model size: 2.0 GB download
- GPU Memory: Requires ~2.5GB of GPU memory
InternVL2_5-2B (Large)
- More detailed and nuanced responses
- Better understanding of complex scenes
- Higher memory requirements
- Default model: OpenGVLab/InternVL2_5-2B
- Model size: 4.5 GB download
- GPU Memory: Requires ~5.0GB of GPU memory
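The two variants trade response detail against memory. If you want to pick a variant programmatically, the minimal sketch below uses PyTorch's CUDA memory query; the ~2.5 GB and ~5.0 GB thresholds come from the figures above, and the choose_model_type helper is hypothetical, not part of this node.

```python
import torch

def choose_model_type() -> str:
    """Pick a model_type value based on free GPU memory (hypothetical helper)."""
    if not torch.cuda.is_available():
        # No GPU available: the small model is the safer choice.
        return "internvl2_5_1b"
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024 ** 3
    # The large model needs roughly 5.0 GB of GPU memory, the small one ~2.5 GB.
    return "internvl2_5_2b" if free_gb >= 5.0 else "internvl2_5_1b"
```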
Metadata
| Metadata Property | Description |
|---|---|
| nodes.[node_id].rois.[roi_id].label_changed_delta | Indicates if there has been a change in the VLM result for the ROI. |
| nodes.[node_id].rois.[roi_id].label_available | Indicates if a VLM result is available for the ROI. |
| nodes.[node_id].rois.[roi_id].label | VLM result contents for the ROI. |
| nodes.[node_id].recognized_obj_count | The count of objects with VLM results. |
| nodes.[node_id].recognized_obj_delta | The change in the count of objects with VLM results. |
| nodes.[node_id].label_changed_obj_delta | The change in the count of objects with changed VLM results. |
Example JSON
```json
{
  "nodes": {
    "local_vlm1": {
      "type": "local_vlm",
      "rois": {
        "roi1": {
          "label_changed_delta": true,
          "label_available": true,
          "label": "The image shows two people walking on a sidewalk"
        }
      },
      "recognized_obj_ids": ["2775161862"],
      "recognized_obj_count": 1,
      "recognized_obj_delta": 1,
      "label_changed_obj_delta": 1,
      "unrecognized_obj_count": 0,
      "unrecognized_obj_delta": 0,
      "objects_of_interest_keys": ["recognized_obj_ids"]
    }
  }
}
```
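Downstream logic typically reacts to the per-ROI label_changed_delta and label fields. The sketch below walks metadata shaped like the example above and prints newly changed VLM answers; the function name and the assumption that the metadata is available as a parsed Python dict are illustrative.

```python
from typing import Any, Dict

def print_new_vlm_answers(metadata: Dict[str, Any], node_id: str = "local_vlm1") -> None:
    """Print VLM answers for ROIs whose result changed since the last update."""
    node = metadata.get("nodes", {}).get(node_id, {})
    for roi_id, roi in node.get("rois", {}).items():
        if roi.get("label_available") and roi.get("label_changed_delta"):
            print(f"{roi_id}: {roi['label']}")
    # Object-level counters from the same node entry.
    print("objects with VLM results:", node.get("recognized_obj_count", 0))
```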