🔨 AgentStudio - Pseudo-Lab 11th AI Agent Project
“Bridging the intergenerational knowledge gap with AI and sharing positive influence.”
Vision-Language-Action (VLA) Agent for Automated Kiosk Interaction
Kiosk Agent is an AI system that uses Vision-Language Models (VLMs) to automatically control Android kiosk applications. It interprets visual interfaces and executes precise actions to assist users who may find digital kiosks challenging.
AgentStudio lets you switch between different Vision-Language Models depending on your needs, such as gemini-3-flash (high-speed) and gemini-3-pro (high-reasoning).
| Provider | Model | Status | Key Advantage |
|---|---|---|---|
| Google | gemini-3-flash | ✅ Supported | Low latency and cost-efficient |
| Google | gemini-3-pro | ✅ Supported | Advanced reasoning for complex UI |
| OpenAI | gpt-4o-mini | ✅ Supported | Robust performance across various tasks |
| Google | gemma-3-27b | 🔜 Roadmap | Optimized for on-device/local privacy |
| Microsoft | Fara-7B | 🔜 Roadmap | Optimized on-device computer-use agent |
To switch models, update your .env file:
```env
MODEL_PROVIDER=gemini
GEMINI_MODEL=gemini-3-flash   # Options: gemini-3-flash, gemini-3-pro
```
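As a rough sketch of how these settings might be consumed on the backend, the snippet below picks a LangChain chat model based on `MODEL_PROVIDER` (the `load_model` helper and the `OPENAI_MODEL` variable are illustrative assumptions, not the project's actual code):

```python
import os

from dotenv import load_dotenv  # python-dotenv


def load_model():
    """Select a chat model from .env settings (illustrative sketch)."""
    load_dotenv()
    provider = os.getenv("MODEL_PROVIDER", "gemini")

    if provider == "gemini":
        # Requires GOOGLE_API_KEY to be set in the environment.
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model=os.getenv("GEMINI_MODEL", "gemini-3-flash"))
    if provider == "openai":
        # Requires OPENAI_API_KEY; OPENAI_MODEL is an assumed variable name.
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"))
    raise ValueError(f"Unsupported MODEL_PROVIDER: {provider}")
```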
The VLA paradigm is a continuous cycle where the agent observes, reasons, and executes.
```mermaid
flowchart LR
    A[Screen Capture] --> B[VLM Reasoning]
    B --> C[Action Decode]
    C --> D[Execute ADB]
    D --> E{Done?}
    E -->|No| A
    E -->|FINISH| F[Complete]
    E -->|INTERRUPT| G[Human Input]
    G --> A
```
| Phase | Description |
|---|---|
| Screen Capture | Captures Android device screen via ADB |
| VLM Reasoning | Gemini analyzes the screen to decide the next action |
| Action Decode | Parses VLM output into structured executable commands |
| Execute ADB | Controls the device using ADB (tap, swipe, input) |
| INTERRUPT | Triggers HITL when user intervention is required |
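As a concrete illustration of the Action Decode phase, the sketch below parses a VLM reply into a structured command, assuming a simple JSON schema with `action` and `params` keys (the schema and the `Action` dataclass are illustrative assumptions, not the project's actual format):

```python
import json
from dataclasses import dataclass, field


@dataclass
class Action:
    type: str                       # CLICK, INPUT, SWIPE, INTERRUPT, FINISH
    params: dict = field(default_factory=dict)


def decode_action(raw: str) -> Action:
    """Parse the VLM's JSON reply into a structured, executable command."""
    data = json.loads(raw)
    return Action(type=data["action"], params=data.get("params", {}))


# Example: the VLM asks the agent to tap the "Order" button at (540, 1200)
print(decode_action('{"action": "CLICK", "params": {"x": 540, "y": 1200}}'))
```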
We manage the agent’s logic flow using LangGraph for stable state transitions.
```mermaid
flowchart TD
    START([Start]) --> VLM[VLM Node]
    VLM --> EXEC[Execute Node]
    EXEC --> ROUTER{Router}
    ROUTER -->|LOOP| VLM
    ROUTER -->|INTERRUPT| HUMAN[Human Node]
    ROUTER -->|FINISH| END([End])
    HUMAN -->|Resume| VLM
    HUMAN -->|Abort| END
```
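A minimal LangGraph sketch that mirrors this topology is shown below; the `AgentState` schema and the node bodies are placeholders (the real project tracks screenshots, messages, and decoded actions), and the Abort path is omitted for brevity:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    status: str  # "LOOP", "INTERRUPT", or "FINISH" (placeholder state schema)


def vlm_node(state: AgentState) -> AgentState:
    return state  # call the VLM and decide the next action


def execute_node(state: AgentState) -> AgentState:
    return state  # run the decoded action on the device via ADB


def human_node(state: AgentState) -> AgentState:
    return state  # collect user guidance (HITL), then resume


def router(state: AgentState) -> str:
    return state["status"]


builder = StateGraph(AgentState)
builder.add_node("vlm", vlm_node)
builder.add_node("execute", execute_node)
builder.add_node("human", human_node)

builder.add_edge(START, "vlm")
builder.add_edge("vlm", "execute")
builder.add_conditional_edges(
    "execute", router,
    {"LOOP": "vlm", "INTERRUPT": "human", "FINISH": END},
)
builder.add_edge("human", "vlm")  # Resume; the Abort branch is omitted here

graph = builder.compile()
```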
To get started, clone the repository and set up the environment:

```bash
# Clone the repository
git clone https://github.com/Pseudo-Lab/Agent_Studio.git
cd Agent_Studio

# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate

# Install dependencies in editable mode
uv pip install -e backend/

# Configure environment variables
cp .env.example .env
# Edit .env with your GOOGLE_API_KEY
```
| Action | Parameters | Description |
|---|---|---|
| CLICK | x, y | Tap specific coordinates |
| INPUT | text | Type text into a field |
| SWIPE | x1, y1, x2, y2 | Scroll or navigate |
| INTERRUPT | question | Ask user for guidance (HITL) |
| FINISH | - | Task completed successfully |
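The Execute ADB phase maps these actions onto standard `adb shell input` commands; the dispatch function below is an illustrative sketch rather than the project's actual implementation:

```python
import subprocess


def adb_input(*args: str) -> None:
    """Send an input event to the connected device via ADB."""
    subprocess.run(["adb", "shell", "input", *args], check=True)


def execute(action: str, **params) -> None:
    """Dispatch a decoded action to the corresponding ADB command (illustrative)."""
    if action == "CLICK":
        adb_input("tap", str(params["x"]), str(params["y"]))
    elif action == "INPUT":
        # `input text` does not accept literal spaces; ADB expects %s instead.
        adb_input("text", params["text"].replace(" ", "%s"))
    elif action == "SWIPE":
        adb_input("swipe", str(params["x1"]), str(params["y1"]),
                  str(params["x2"]), str(params["y2"]))
    # INTERRUPT and FINISH are handled by the agent loop, not by ADB.


# Example: tap the screen at (540, 1200)
execute("CLICK", x=540, y=1200)
```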
| Name | Role | Focus |
|---|---|---|
| Jaehyun Kim | Builder | Frontend (Next.js), Backend (FastAPI) |
| Seunghyeok Kim | Runner | LangGraph, Reasoning, Prompt Engineering |
| Gyumin Lee | Runner | VLA Mechanism, LangGraph Architecture |
| Minjung Jeon | Runner | Voice (TTS/STT), Google ADK |
This project is licensed under the Apache License 2.0.