Documentation Index
Fetch the complete documentation index at: https://askui-docs-on-premise-architecture.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Why Multiple Models?
AskUI uses different AI models for different tasks instead of one large model for everything. This is because UI automation requires several distinct capabilities:

- Computer vision to see what’s on screen
- Natural language understanding to interpret instructions
- Planning to break down complex tasks
- Precise interaction to click and type accurately
The Three Model Types
1. Locator Models
What they do: Find and interact with UI elements

Locator models analyze screenshots to locate buttons, text fields, and other UI elements. They also execute mouse clicks and keyboard input.

Tasks:
- Identify UI elements from screenshots
- Determine element locations and boundaries
- Execute clicks, typing, and other interactions
Models:
- UIDT-1: Locates elements on screen
- PTA-1: Takes text descriptions and finds matching UI elements
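A locator model's job can be pictured as a function from a text description to a screen region plus a click target. The sketch below simulates that contract in plain Python; the class name, method names, and the dictionary lookup standing in for actual screenshot analysis are all illustrative assumptions, not the AskUI SDK API.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x: int       # left edge in screen pixels
    y: int       # top edge in screen pixels
    width: int
    height: int

    @property
    def center(self) -> tuple[int, int]:
        """Click target: the midpoint of the element."""
        return (self.x + self.width // 2, self.y + self.height // 2)


class LocatorModel:
    """Stand-in for a model like UIDT-1 or PTA-1: maps a text
    description of a UI element to its location on screen."""

    def __init__(self, known_elements: dict[str, BoundingBox]):
        # A real model would analyze a screenshot; here we just
        # look elements up in a prepared dictionary.
        self._elements = known_elements

    def locate(self, description: str) -> BoundingBox:
        if description not in self._elements:
            raise LookupError(f"No element matching {description!r}")
        return self._elements[description]


# Usage: find the search button and compute where to click.
model = LocatorModel({"search button": BoundingBox(100, 200, 80, 30)})
box = model.locate("search button")
print(box.center)  # (140, 215)
```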
2. Query Models
What they do: Answer questions and make decisions

Query models process natural language and generate responses. They understand context and can reason about what actions to take.

Tasks:
- Interpret user instructions
- Answer questions about screen content
- Make decisions about next steps
Models:
- GPT-4: General language understanding and reasoning
- Computer Use: Anthropic’s model for computer interaction tasks
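A query model turns free-form language into a structured answer the automation can act on. The stub below illustrates that idea for one narrow instruction pattern; the class, the method, and the hard-coded `"from X to Y"` parsing are hypothetical placeholders for what an actual language model would do.

```python
# Hypothetical sketch of a query-model interface: given an instruction,
# return a structured decision. A real model (e.g. GPT-4) would handle
# arbitrary phrasing; this stub only parses "from X to Y" as a demo.

class QueryModel:
    """Stand-in for a model in the query role: interprets
    instructions and answers questions about them."""

    def interpret_route(self, instruction: str) -> dict[str, str]:
        words = instruction.split()
        origin = words[words.index("from") + 1]
        destination = words[words.index("to") + 1]
        return {"departure": origin, "destination": destination}


model = QueryModel()
print(model.interpret_route("Book a flight from Berlin to Rome"))
# {'departure': 'Berlin', 'destination': 'Rome'}
```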
3. Action Models (AMs)
What they do: Plan and coordinate multi-step tasks

Action Models take high-level goals and break them into sequences of actions. They coordinate the other models and handle errors.

Tasks:
- Break down complex goals into steps
- Decide which model to use for each step
- Handle failures and retry logic
- Monitor progress and adjust plans
Models:
- Computer Use: Plans and executes computer tasks
- UI-Tars: Specialized for UI automation workflows
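The "handle failures and retry" task above can be sketched as a small execution loop: run each step of a plan, retry a failing step a bounded number of times, and record the outcome. The function name, the boolean step contract, and the retry limit are assumptions for illustration only.

```python
from typing import Callable

def run_plan(steps: list[Callable[[], bool]], max_retries: int = 2) -> list[str]:
    """Run each step; a step returning False counts as a failure
    and is retried up to max_retries more times before giving up."""
    log = []
    for i, step in enumerate(steps):
        for attempt in range(max_retries + 1):
            if step():
                log.append(f"step {i}: ok (attempt {attempt + 1})")
                break
        else:
            log.append(f"step {i}: failed after {max_retries + 1} attempts")
    return log


# Usage: the second step fails once, then succeeds on retry.
flaky_state = {"calls": 0}

def flaky_step() -> bool:
    flaky_state["calls"] += 1
    return flaky_state["calls"] > 1  # fails only on the first call

log = run_plan([lambda: True, flaky_step])
print(log)
# ['step 0: ok (attempt 1)', 'step 1: ok (attempt 2)']
```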
How They Work Together
When you give AskUI a task:

1. The Action Model creates a plan with specific steps
2. Query Models interpret any unclear instructions
3. Locator Models execute each individual action
4. The Action Model checks results and continues or adjusts the plan
For example, booking a flight from Berlin to Rome:

- The AM plans: open travel site → search flights → select options → book
- Locator model clicks on flight search
- Query model interprets “Berlin” and “Rome” as departure/destination
- Locator model fills in the form fields
- AM monitors progress and handles any errors
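The coordination pattern above can be sketched as a small loop: the Action Model holds the plan, each step is routed to the model role that handles it, and the AM checks the result after every step. The plan entries mirror the flight-booking example; the function and the routing-by-tag mechanism are illustrative assumptions, not the actual AskUI implementation.

```python
# Illustrative sketch of the coordination loop described above.
PLAN = [
    ("query", "interpret 'Berlin' and 'Rome' as departure/destination"),
    ("locator", "click flight search"),
    ("locator", "fill in the form fields"),
    ("locator", "click book"),
]

def run_task(plan: list[tuple[str, str]]) -> list[str]:
    """Route each step to its model role; the Action Model checks
    the outcome of every step before moving on."""
    trace = ["action model: plan created"]
    for role, step in plan:
        # A real system would call the locator or query model here;
        # we only record which role would handle the step.
        trace.append(f"{role} model: {step}")
        trace.append("action model: step result checked")
    return trace

trace = run_task(PLAN)
print(trace[0])    # action model: plan created
print(len(trace))  # 9
```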
Model Capabilities
| Model Type | Model Name | Purpose | Teachable | Online Trainable |
|---|---|---|---|---|
| Locator | UIDT-1 | Locate elements & understand screen | No | Partial |
| Locator | PTA-1 | Convert prompts into one-click actions | No | Yes |
| Query | GPT-4 | Understand & respond to user queries | Yes | No |
| Query | Gemini 2.5 Flash | Understand & respond to user queries | Yes | No |
| Query | Gemini 2.5 Pro | Understand & respond to user queries | Yes | No |
| Query | Computer Use | Understand & respond to user queries | Yes | No |
| Large Action (act) | Computer Use | Plan and execute full workflows | Yes | No |
| Large Action (act) | UI-Tars | Plan and execute full workflows | Yes | No |
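The capability table above can also be expressed as data, which makes it easy to filter models by capability. The rows are taken directly from the table; the registry structure itself is an illustrative sketch, not part of the AskUI SDK.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelInfo:
    model_type: str
    name: str
    teachable: bool
    online_trainable: str  # "Yes", "No", or "Partial"

# Rows copied from the capability table above.
REGISTRY = [
    ModelInfo("Locator", "UIDT-1", False, "Partial"),
    ModelInfo("Locator", "PTA-1", False, "Yes"),
    ModelInfo("Query", "GPT-4", True, "No"),
    ModelInfo("Query", "Gemini 2.5 Flash", True, "No"),
    ModelInfo("Query", "Gemini 2.5 Pro", True, "No"),
    ModelInfo("Query", "Computer Use", True, "No"),
    ModelInfo("Large Action (act)", "Computer Use", True, "No"),
    ModelInfo("Large Action (act)", "UI-Tars", True, "No"),
]

# Filter by capability flag, e.g. all teachable models.
teachable = [m.name for m in REGISTRY if m.teachable]
print(teachable)
```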
Note: See model names here