Holo1.5-7B: Navigation VLM Demo

This demo showcases Holo1.5-7B, a new version of the Action Vision-Language Model developed by HCompany, fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct. It is designed to perform complex navigation tasks across web, Android, and desktop interfaces.

How to use:

  1. Upload an image (e.g., a screenshot of a UI; see the example below).
  2. Provide a textual task (e.g., "Find the latest model by H Company").
  3. The model predicts the next action to take.

Note: the model's processor resizes your input image before inference, so predicted coordinates are relative to this resized image. The sketch after this list shows one way to map them back to the original screenshot.
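For readers who want to try the model outside the demo, below is a minimal inference sketch using transformers. The Hub id Hcompany/Holo1.5-7B, the chat-style prompt format, and the use of Qwen2-VL's smart_resize for the coordinate rescale are assumptions based on the Qwen2.5-VL-7B-Instruct base model, not confirmed details of this demo's pipeline.

```python
# Minimal sketch, not the demo's exact pipeline. Assumptions: the checkpoint is
# published as "Hcompany/Holo1.5-7B" and follows the Qwen2.5-VL processor and
# prompt conventions of its base model; adjust if the model card differs.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

MODEL_ID = "Hcompany/Holo1.5-7B"  # assumed Hub id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("screenshot.png")  # your UI screenshot
task = "Find the latest model by H Company"

# Qwen2.5-VL-style chat message: one user turn with an image and the task text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": task},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens (the predicted action).
action = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(action)

# The processor resized the image before the model saw it, so any (x, y) in
# the action is in resized-image space. smart_resize reproduces that size,
# which lets us rescale coordinates back to the original screenshot.
ip = processor.image_processor
resized_h, resized_w = smart_resize(
    image.height,
    image.width,
    factor=ip.patch_size * ip.merge_size,  # 14 * 2 = 28 for Qwen2.5-VL
    min_pixels=ip.min_pixels,
    max_pixels=ip.max_pixels,
)
scale_x = image.width / resized_w
scale_y = image.height / resized_h
# e.g. original_x, original_y = round(x * scale_x), round(y * scale_y)
```

The coordinate rescale assumes the processor uses Qwen2-VL's smart_resize with its default patch and merge sizes; check the processor config if the mapped points look off.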
Examples

[Example gallery: input UI images paired with tasks]