class VisionAgent(AgentBase)
A vision-based agent that can interact with user interfaces through computer vision and AI.
This agent can perform various UI interactions like clicking, typing, scrolling, and more.
It uses computer vision models to locate UI elements and execute actions on them.
Arguments:
display int, optional - The display number to use for screen interactions. Defaults to 1.
reporters list[Reporter] | None, optional - List of reporter instances for logging and reporting. If None, an empty list is used.
tools AgentToolbox | None, optional - Custom toolbox instance. If None, a default one will be created with AskUiControllerClient.
model ModelChoice | ModelComposition | str | None, optional - The default choice or name of the model(s) to be used for vision tasks. Can be overridden by the model parameter in the click(), get(), act() etc. methods.
retry Retry, optional - The retry instance to use for retrying failed actions. Defaults to ConfigurableRetry with exponential backoff. Currently only supported for the locate() method.
models ModelRegistry | None, optional - A registry of models to make available to the VisionAgent so that they can be selected using the model parameter of VisionAgent or the model parameter of its click(), get(), act() etc. methods. Entries in the registry override entries in the default model registry.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.click("Submit button")
    agent.type("Hello World")
    agent.act("Open settings menu")
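The precedence described for the model and models arguments can be pictured with a small sketch. It does not import askui; `resolve_model`, `merge_registries`, and the registry contents are hypothetical names used only for illustration, and the library's actual resolution logic may differ.

```python
from typing import Dict, Optional


def resolve_model(call_model: Optional[str], agent_model: Optional[str]) -> Optional[str]:
    """Return the per-call model if one is given, else the agent-wide default."""
    return call_model if call_model is not None else agent_model


def merge_registries(
    default_registry: Dict[str, str],
    custom_registry: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Custom entries replace default entries that share the same name."""
    return {**default_registry, **(custom_registry or {})}


# Hypothetical registry contents, for illustration only.
defaults = {"locator": "default-locate-model", "extractor": "default-get-model"}
custom = {"locator": "my-locate-model"}

merged = merge_registries(defaults, custom)
print(merged["locator"])                           # my-locate-model
print(resolve_model(None, "agent-default"))        # agent-default
print(resolve_model("per-call", "agent-default"))  # per-call
```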
click
def click(
locator: Optional[str | Locator | Point] = None,
button: Literal["left", "middle", "right"] = "left",
repeat: Annotated[int, Field(gt=0)] = 1,
offset: Optional[Point] = None,
model: ModelComposition | str | None = None
) -> None
Simulates a mouse click on the user interface element identified by the provided locator.
Arguments:
locator str | Locator | Point | None, optional - UI element description, structured locator, or absolute coordinates (x, y). If None, clicks at current position.
button 'left' | 'middle' | 'right', optional - Specifies which mouse button to click. Defaults to 'left'.
repeat int, optional - The number of times to click. Must be greater than 0. Defaults to 1.
offset Point | None, optional - Pixel offset (x, y) from the target location. Positive x=right, negative x=left, positive y=down, negative y=up.
model ModelComposition | str | None, optional - The composition or name of the model(s) to be used for locating the element to click on using the locator.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.click()  # Left click on current position
    agent.click("Edit")  # Left click on text "Edit"
    agent.click((100, 200))  # Left click at absolute coordinates (100, 200)
    agent.click("Edit", button="right")  # Right click on text "Edit"
    agent.click(repeat=2)  # Double left click on current position
    agent.click("Edit", button="middle", repeat=4)  # 4x middle click on text "Edit"
    agent.click("Submit", offset=(10, -5))  # Click 10 pixels right and 5 pixels up from "Submit"
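The offset arithmetic can be checked by hand with a short sketch. It assumes the offset is simply added to the located point, which matches the sign convention in the argument description; `apply_offset` and the local `Point` alias are hypothetical and not part of askui.

```python
from typing import Optional, Tuple

Point = Tuple[int, int]  # local stand-in for the library's Point type


def apply_offset(target: Point, offset: Optional[Point] = None) -> Point:
    """Shift a screen point by (dx, dy): +x is right, +y is down."""
    if offset is None:
        return target
    return (target[0] + offset[0], target[1] + offset[1])


# "Submit" located at (100, 200); offset 10 px right and 5 px up.
print(apply_offset((100, 200), (10, -5)))  # (110, 195)
```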
mouse_move
def mouse_move(
locator: str | Locator | Point,
offset: Optional[Point] = None,
model: ModelComposition | str | None = None
) -> None
Moves the mouse cursor to the UI element identified by the provided locator.
Arguments:
locator str | Locator | Point - UI element description, structured locator, or absolute coordinates (x, y).
offset Point | None, optional - Pixel offset (x, y) from the target location. Positive x=right, negative x=left, positive y=down, negative y=up.
model ModelComposition | str | None, optional - The composition or name of the model(s) to be used for locating the element to move the mouse to using the locator.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_move("Submit button")  # Moves cursor to submit button
    agent.mouse_move((300, 150))  # Moves cursor to absolute coordinates (300, 150)
    agent.mouse_move("Close")  # Moves cursor to close element
    agent.mouse_move("Profile picture", model="custom_model")  # Uses specific model
    agent.mouse_move("Menu", offset=(5, 10))  # Move 5 pixels right and 10 pixels down from "Menu"
mouse_scroll
def mouse_scroll(x: int, y: int) -> None
Simulates scrolling the mouse wheel by the specified horizontal and vertical amounts.
Arguments:
x int - The horizontal scroll amount. Positive values typically scroll right, negative values scroll left.
y int - The vertical scroll amount. Positive values typically scroll down, negative values scroll up.
Notes:
The actual scroll direction depends on the operating system’s configuration.
Some systems may have “natural scrolling” enabled, which reverses the traditional direction.
The meaning of scroll units varies across operating systems and applications.
A scroll value of 10 might result in different distances depending on the application and system settings.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_scroll(0, 10)  # Usually scrolls down 10 units
    agent.mouse_scroll(0, -5)  # Usually scrolls up 5 units
    agent.mouse_scroll(3, 0)  # Usually scrolls right 3 units
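Because the distance covered by one scroll unit varies, one possible approach is to scroll in small increments and inspect the screen between them. The sketch below only shows the chunking; `scroll_vertically_in_steps` is a hypothetical helper, and `RecordingAgent` is a stand-in that records calls so the example runs without askui or a real display.

```python
class RecordingAgent:
    """Stand-in that records mouse_scroll calls instead of scrolling."""

    def __init__(self):
        self.calls = []

    def mouse_scroll(self, x, y):
        self.calls.append((x, y))


def scroll_vertically_in_steps(agent, total, step=3):
    """Split a vertical scroll of `total` units into chunks of at most `step`."""
    if step <= 0:
        raise ValueError("step must be greater than 0")
    direction = 1 if total >= 0 else -1
    remaining = abs(total)
    while remaining > 0:
        chunk = min(step, remaining)
        agent.mouse_scroll(0, direction * chunk)
        remaining -= chunk


agent = RecordingAgent()
scroll_vertically_in_steps(agent, 10, step=3)
print(agent.calls)  # [(0, 3), (0, 3), (0, 3), (0, 1)]
```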
type
def type(
text: Annotated[str, Field(min_length=1)],
locator: str | Locator | Point | None = None,
offset: Optional[Point] = None,
model: ModelComposition | str | None = None,
clear: bool = True
) -> None
Types the specified text as if it were entered on a keyboard.
If locator is provided, it will first click on the element to give it focus before typing.
If clear is True (default), it will triple click on the element to select the current text (in multi-line inputs like textareas the current line or paragraph) before typing.
IMPORTANT: clear only works if a locator is provided.
Arguments:
text str - The text to be typed. Must be at least 1 character long.
locator str | Locator | Point | None, optional - UI element description, structured locator, or absolute coordinates (x, y). If None, types at current focus.
offset Point | None, optional - Pixel offset (x, y) from the target location. Positive x=right, negative x=left, positive y=down, negative y=up.
model ModelComposition | str | None, optional - The composition or name of the model(s) to be used for locating the element, i.e., input field, to type into using the locator.
clear bool, optional - Whether to triple click on the element to give it focus and select the current text before typing. Defaults to True.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.type("Hello, world!")  # Types "Hello, world!" at current focus
    agent.type("user@example.com", locator="Email")  # Clicks on "Email" input, then types
    agent.type("username", locator=(200, 100))  # Clicks at coordinates (200, 100), then types
    agent.type("password123", locator="Password field", model="custom_model")  # Uses specific model
    agent.type("Hello, world!", locator="Textarea", clear=False)  # Types "Hello, world!" into textarea without clearing
    agent.type("text", locator="Input field", offset=(5, 0))  # Click 5 pixels right of "Input field", then type
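The interaction between locator and clear can be summarized as a small decision sketch. It restates the documented behavior only; `planned_steps` is a hypothetical helper that returns a description of the steps, not the library's implementation.

```python
from typing import List, Optional, Tuple, Union

Locator = Union[str, Tuple[int, int]]  # simplified stand-in for str | Locator | Point


def planned_steps(text: str, locator: Optional[Locator] = None, clear: bool = True) -> List[str]:
    """Describe what type() is documented to do for the given arguments."""
    steps = []
    if locator is not None:
        # Triple click selects the existing text; a single click only gives focus.
        clicks = 3 if clear else 1
        steps.append(f"click {locator!r} x{clicks}")
    # Without a locator nothing is clicked, so clear has no effect.
    steps.append(f"type {text!r}")
    return steps


print(planned_steps("hi"))                        # ["type 'hi'"]
print(planned_steps("hi", "Email"))               # ["click 'Email' x3", "type 'hi'"]
print(planned_steps("hi", "Email", clear=False))  # ["click 'Email' x1", "type 'hi'"]
```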
key_up
def key_up(key: PcKey | ModifierKey) -> None
Simulates the release of a key.
Arguments:
key PcKey | ModifierKey - The key to be released.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.key_up('a')  # Release the 'a' key
    agent.key_up('shift')  # Release the 'Shift' key
key_down
def key_down(key: PcKey | ModifierKey) -> None
Simulates the pressing of a key.
Arguments:
key PcKey | ModifierKey - The key to be pressed.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.key_down('a')  # Press the 'a' key
    agent.key_down('shift')  # Press the 'Shift' key
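Since key_down and key_up are separate calls, a key can stay pressed if an error occurs between them. One way to guard against that is a context manager that always releases the key. In this sketch, `held_key` is a hypothetical helper and `RecordingAgent` is a stand-in that records events so the example runs without askui.

```python
from contextlib import contextmanager


class RecordingAgent:
    """Stand-in that records key events instead of pressing real keys."""

    def __init__(self):
        self.events = []

    def key_down(self, key):
        self.events.append(("down", key))

    def key_up(self, key):
        self.events.append(("up", key))


@contextmanager
def held_key(agent, key):
    """Hold `key` for the duration of the with-block, releasing it on exit."""
    agent.key_down(key)
    try:
        yield
    finally:
        # Runs even if the body raises, so the key is not left pressed.
        agent.key_up(key)


agent = RecordingAgent()
with held_key(agent, "shift"):
    agent.key_down("a")
    agent.key_up("a")
print(agent.events)
# [('down', 'shift'), ('down', 'a'), ('up', 'a'), ('up', 'shift')]
```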
mouse_up
def mouse_up(button: Literal["left", "middle", "right"] = "left") -> None
Simulates the release of a mouse button.
Arguments:
button 'left' | 'middle' | 'right', optional - The mouse button to be released. Defaults to 'left'.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_up()  # Release the left mouse button
    agent.mouse_up('right')  # Release the right mouse button
    agent.mouse_up('middle')  # Release the middle mouse button
mouse_down
def mouse_down(button: Literal["left", "middle", "right"] = "left") -> None
Simulates the pressing of a mouse button.
Arguments:
button 'left' | 'middle' | 'right', optional - The mouse button to be pressed. Defaults to 'left'.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_down()  # Press the left mouse button
    agent.mouse_down('right')  # Press the right mouse button
    agent.mouse_down('middle')  # Press the middle mouse button
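mouse_move, mouse_down, and mouse_up can be combined into a drag gesture. The sketch below shows one possible ordering; `drag` is a hypothetical helper rather than an askui method, and `RecordingAgent` is a stand-in that records calls so the example runs without a real display.

```python
class RecordingAgent:
    """Stand-in that records calls instead of driving a real mouse."""

    def __init__(self):
        self.events = []

    def mouse_move(self, locator, offset=None, model=None):
        self.events.append(("move", locator))

    def mouse_down(self, button="left"):
        self.events.append(("down", button))

    def mouse_up(self, button="left"):
        self.events.append(("up", button))


def drag(agent, source, target, button="left"):
    """Press at `source`, move to `target`, then release."""
    agent.mouse_move(source)
    agent.mouse_down(button)
    try:
        agent.mouse_move(target)
    finally:
        # Release even if moving to the target fails.
        agent.mouse_up(button)


agent = RecordingAgent()
drag(agent, "File icon", (640, 480))
print(agent.events)
# [('move', 'File icon'), ('down', 'left'), ('move', (640, 480)), ('up', 'left')]
```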
keyboard
def keyboard(
key: PcKey | ModifierKey,
modifier_keys: Optional[list[ModifierKey]] = None,
repeat: Annotated[int, Field(gt=0)] = 1
) -> None
Simulates pressing (and releasing) a key or key combination on the keyboard.
Arguments:
key PcKey | ModifierKey - The main key to press. This can be a letter, number, special character, or function key.
modifier_keys list[ModifierKey] | None, optional - List of modifier keys to press along with the main key. Common modifier keys include 'control', 'alt', 'shift'.
repeat int, optional - The number of times to press (and release) the key. Must be greater than 0. Defaults to 1.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.keyboard('a')  # Press 'a' key
    agent.keyboard('enter')  # Press 'Enter' key
    agent.keyboard('v', ['control'])  # Press Ctrl+V (paste)
    agent.keyboard('s', ['control', 'shift'])  # Press Ctrl+Shift+S
    agent.keyboard('a', repeat=2)  # Press 'a' key twice
cli
def cli(command: Annotated[str, Field(min_length=1)]) -> None
Executes a command on the command line interface.
This method allows running shell commands directly from the agent. The command
is split on spaces and executed as a subprocess.
Arguments:
command str - The command to execute on the command line.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    # Windows
    agent.cli(r'start "" "C:\Program Files\VideoLAN\VLC\vlc.exe"')  # Start VLC (non-blocking)
    agent.cli(r'"C:\Program Files\VideoLAN\VLC\vlc.exe"')  # Start VLC (blocking)
    # Mac
    agent.cli("open -a chrome")  # Open Chrome (non-blocking) on Mac
    agent.cli("chrome")  # Open Chrome (blocking) on Mac
    agent.cli("echo Hello World")  # Prints "Hello World"
    agent.cli("python --version")  # Displays Python version
    # Linux
    agent.cli("nohup chrome")  # Open Chrome (non-blocking) on Linux
    agent.cli("chrome")  # Open Chrome (blocking) on Linux
    agent.cli("echo Hello World")  # Prints "Hello World"
    agent.cli("python --version")  # Displays Python version
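The description says the command is split on spaces. If that is a plain split rather than shell-style parsing (an assumption; the source does not say which), quoted arguments that contain spaces are broken apart. The sketch below compares the two using only the standard library.

```python
import shlex

command = 'echo "Hello World"'

# Plain split on single spaces: the quotes are kept and the argument is cut in two.
naive = command.split(" ")

# Shell-style parsing, shown only for comparison: the quoted argument stays whole.
shell_like = shlex.split(command)

print(naive)       # ['echo', '"Hello', 'World"']
print(shell_like)  # ['echo', 'Hello World']
```

Under that assumption, commands whose arguments contain spaces may not behave as they would in an interactive shell, so it is worth testing such commands before relying on them.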
act
def act(
goal: Annotated[str | list[MessageParam],
Field(min_length=1)],
model: str | None = None,
on_message: OnMessageCb | None = None,
tools: list[Tool] | ToolCollection | None = None,
settings: ActSettings | None = None
) -> None
Instructs the agent to achieve a specified goal through autonomous actions.
The agent will analyze the screen, determine necessary steps, and perform
actions to accomplish the goal. This may include clicking, typing, scrolling,
and other interface interactions.
Arguments:
goal str | list[MessageParam] - A description of what the agent should
achieve.
model str | None, optional - The composition or name of the model(s) to
be used for achieving the goal.
on_message OnMessageCb | None, optional - Callback for new messages. If
it returns None, stops and does not add the message.
tools list[Tool] | ToolCollection | None, optional - The tools for the
agent. Defaults to default tools depending on the selected model.
settings ActSettings | None, optional - The settings for the agent.
Defaults to default settings depending on the selected model.
Returns:
None
Raises:
MaxTokensExceededError - If the model reaches the maximum token limit
defined in the agent settings.
ModelRefusalError - If the model refuses to process the request.
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.act("Open the settings menu")
    agent.act("Search for 'printer' in the search box")
    agent.act("Log in with username 'admin' and password '1234'")
get
def get(
query: Annotated[str, Field(min_length=1)],
response_schema: Type[ResponseSchema] | None = None,
model: str | None = None,
source: Optional[InputSource] = None
) -> ResponseSchema | str
Retrieves information from an image or PDF based on the provided query.
If no source is provided, a screenshot of the current screen is taken.
Arguments:
query str - The query describing what information to retrieve.
source InputSource | None, optional - The source to extract information
from. Can be a path to an image, PDF, or office document file,
a PIL Image object or a data URL. Defaults to a screenshot of the
current screen.
response_schema Type[ResponseSchema] | None, optional - A Pydantic model
class that defines the response schema. If not provided, returns a
string.
model str | None, optional - The composition or name of the model(s) to
be used for retrieving information from the screen or image using the
query. Note: response_schema is not supported by all models.
PDF processing is only supported for Gemini models hosted on AskUI.
Returns:
ResponseSchema | str: The extracted information, str if no
response_schema is provided.
Raises:
NotImplementedError - If PDF processing is not supported for the selected
model.
ValueError - If the source is not a valid PDF or image.
Example:
import json

from askui import ResponseSchemaBase, VisionAgent
from PIL import Image


class UrlResponse(ResponseSchemaBase):
    url: str


class NestedResponse(ResponseSchemaBase):
    nested: UrlResponse


class LinkedListNode(ResponseSchemaBase):
    value: str
    next: "LinkedListNode | None"


with VisionAgent() as agent:
    # Get URL as string
    url = agent.get("What is the current url shown in the url bar?")

    # Get URL as Pydantic model from image at (relative) path
    response = agent.get(
        "What is the current url shown in the url bar?",
        response_schema=UrlResponse,
        source="screenshot.png",
    )
    # Dump whole model
    print(response.model_dump_json(indent=2))
    # or
    response_json_dict = response.model_dump(mode="json")
    print(json.dumps(response_json_dict, indent=2))
    # or for regular dict
    response_dict = response.model_dump()
    print(response_dict["url"])

    # Get boolean response from PIL Image
    is_login_page = agent.get(
        "Is this a login page?",
        response_schema=bool,
        source=Image.open("screenshot.png"),
    )
    print(is_login_page)

    # Get integer response
    input_count = agent.get(
        "How many input fields are visible on this page?",
        response_schema=int,
    )
    print(input_count)

    # Get float response
    design_rating = agent.get(
        "Rate the page design quality from 0 to 1",
        response_schema=float,
    )
    print(design_rating)

    # Get nested response
    nested = agent.get(
        "Extract the URL and its metadata from the page",
        response_schema=NestedResponse,
    )
    print(nested.nested.url)

    # Get recursive response
    linked_list = agent.get(
        "Extract the breadcrumb navigation as a linked list",
        response_schema=LinkedListNode,
    )
    current = linked_list
    while current:
        print(current.value)
        current = current.next

    # Get text from PDF
    text = agent.get(
        "Extract all text from the PDF",
        source="document.pdf",
    )
    print(text)
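The traversal of the recursive response can be tried without a model call. This sketch uses a plain dataclass as a stand-in for a ResponseSchemaBase subclass (an assumption made so the example runs without askui); `Node` and `to_list` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Stand-in for the LinkedListNode schema: a value plus an optional next node."""

    value: str
    next: Optional["Node"] = None


def to_list(head: Optional[Node]) -> List[str]:
    """Walk the chain from head to tail and collect the values in order."""
    values = []
    current = head
    while current is not None:
        values.append(current.value)
        current = current.next
    return values


# Hypothetical breadcrumb: Home > Docs > API
breadcrumb = Node("Home", Node("Docs", Node("API")))
print(to_list(breadcrumb))  # ['Home', 'Docs', 'API']
```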
locate
def locate(
locator: str | Locator,
screenshot: Optional[InputSource] = None,
model: ModelComposition | str | None = None
) -> Point
Locates the first matching UI element identified by the provided locator.
Arguments:
locator str | Locator - The identifier or description of the element to
locate.
screenshot InputSource | None, optional - The screenshot to use for
locating the element. Can be a path to an image file, a PIL Image object
or a data URL. If None, takes a screenshot of the currently
selected display.
model ModelComposition | str | None, optional - The composition or name
of the model(s) to be used for locating the element using the locator.
Returns:
Point - The coordinates of the element as a tuple (x, y).
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    point = agent.locate("Submit button")
    print(f"Element found at coordinates: {point}")
locate_all
def locate_all(
locator: str | Locator,
screenshot: Optional[InputSource] = None,
model: ModelComposition | str | None = None
) -> PointList
Locates all matching UI elements identified by the provided locator.
Note: Some LocateModels can only locate a single element. In this case, the
returned list will have a length of 1.
Arguments:
locator str | Locator - The identifier or description of the element to
locate.
screenshot InputSource | None, optional - The screenshot to use for
locating the element. Can be a path to an image file, a PIL Image object
or a data URL. If None, takes a screenshot of the currently
selected display.
model ModelComposition | str | None, optional - The composition or name
of the model(s) to be used for locating the element using the locator.
Returns:
PointList - The coordinates of the elements as a list of tuples (x, y).
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    points = agent.locate_all("Submit button")
    print(f"Found {len(points)} elements at coordinates: {points}")
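Since the result is a list of (x, y) tuples, it can be post-processed with ordinary Python. The sketch below sorts points top to bottom and then left to right. `reading_order` is a hypothetical helper, the sample coordinates are made up, and the source does not state what order locate_all returns.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # local stand-in for the library's Point type


def reading_order(points: List[Point]) -> List[Point]:
    """Sort by y first (top to bottom), then by x (left to right)."""
    return sorted(points, key=lambda p: (p[1], p[0]))


# Made-up coordinates standing in for a locate_all() result.
points = [(300, 40), (20, 40), (150, 10)]
print(reading_order(points))  # [(150, 10), (20, 40), (300, 40)]
```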
locate_all_elements
def locate_all_elements(
screenshot: Optional[InputSource] = None,
model: ModelComposition | None = None
) -> list[DetectedElement]
Locates all elements on the current screen using AskUI models.
Arguments:
screenshot InputSource | None, optional - The screenshot to use for
locating the elements. Can be a path to an image file, a PIL Image
object or a data URL. If None, takes a screenshot of the currently
selected display.
model ModelComposition | None, optional - The model composition
to be used for locating the elements.
Returns:
list[DetectedElement] - A list of detected elements
Example:
from askui import VisionAgent

with VisionAgent() as agent:
    detected_elements = agent.locate_all_elements()
    print(f"Found {len(detected_elements)} elements: {detected_elements}")
annotate
def annotate(
screenshot: InputSource | None = None,
annotation_dir: str = "annotations",
model: ModelComposition | None = None
) -> None
Annotate the screenshot with the detected elements.
Creates an interactive HTML file with the detected elements
and saves it to the annotation directory.
The HTML file can be opened in a browser to see the annotated image.
The user can hover over the elements to see their names and text value
and click on the box to copy the text value to the clipboard.
Arguments:
screenshot InputSource | None, optional - The screenshot to annotate.
If None, takes a screenshot of the currently selected display.
annotation_dir str, optional - The directory to save the annotated
image. Defaults to "annotations".
model ModelComposition | None, optional - The composition
of the model(s) to be used for annotating the image.
If None, uses the default model.
Example Using VisionAgent:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.annotate()
Example Using AndroidVisionAgent:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.annotate()
Example Using VisionAgent with custom screenshot and annotation directory:
from askui import VisionAgent

with VisionAgent() as agent:
    agent.annotate(screenshot="screenshot.png", annotation_dir="htmls")
wait
def wait(
until: Annotated[float, Field(gt=0.0)] | str | Locator,
retry_count: Optional[Annotated[int, Field(gt=0)]] = None,
delay: Optional[Annotated[float, Field(gt=0.0)]] = None,
until_condition: Literal["appear", "disappear"] = "appear",
model: ModelComposition | str | None = None
) -> None
Pauses execution or waits until a UI element appears or disappears.
Arguments:
until float | str | Locator - If a float, pauses execution for the
specified number of seconds (must be greater than 0.0). If a string
or Locator, waits until the specified UI element appears or
disappears on screen.
retry_count int | None, optional - Number of retries when waiting for a UI
element. Defaults to 3 if None.
delay float | None, optional - Sleep duration in seconds between retries when
waiting for a UI element. Defaults to 1 second if None.
until_condition Literal["appear", "disappear"], optional - The condition to wait
until the element satisfies. Defaults to "appear".
model ModelComposition | str | None, optional - The composition or name
of the model(s) to be used for locating the element using the
until locator.
Raises:
WaitUntilError - If the UI element has not appeared (or disappeared, depending on until_condition) after all retries.
Example:
from askui import VisionAgent
from askui.locators import loc

with VisionAgent() as agent:
    # Wait for a specific duration
    agent.wait(5)  # Pauses execution for 5 seconds
    agent.wait(0.5)  # Pauses execution for 500 milliseconds

    # Wait for a UI element to appear
    agent.wait("Submit button", retry_count=5, delay=2)
    agent.wait("Login form")  # Uses default retries and sleep time
    agent.wait(loc.Text("Password"))  # Uses default retries and sleep time

    # Wait for a UI element to disappear
    agent.wait("Loading spinner", until_condition="disappear")

    # Wait using a specific model
    agent.wait("Submit button", model="custom_model")
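The retry behavior described for element waits can be approximated by a simple polling loop. This is a simplified sketch, not the library's implementation. `wait_for`, `is_visible`, and `ElementWaitTimeout` are hypothetical names. The defaults of 3 retries and a 1 second delay are taken from the argument descriptions. Skipping the sleep after the final attempt is an assumption.

```python
import time
from typing import Callable


class ElementWaitTimeout(Exception):
    """Hypothetical stand-in for the library's WaitUntilError."""


def wait_for(
    is_visible: Callable[[], bool],
    until_condition: str = "appear",
    retry_count: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll is_visible() until the element appears or disappears."""
    want_visible = until_condition == "appear"
    for attempt in range(retry_count):
        if is_visible() == want_visible:
            return
        if attempt < retry_count - 1:
            sleep(delay)  # pause between attempts, not after the last one
    raise ElementWaitTimeout(f"element did not {until_condition} after {retry_count} tries")


# Simulated screen: the element shows up on the third check.
states = iter([False, False, True])
slept = []
wait_for(lambda: next(states), sleep=slept.append)  # fake sleep records the delays
print(slept)  # [1.0, 1.0]
```

Passing the sleep function as a parameter keeps the sketch testable without real delays.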