Optimise Agent Costs
Running large-parameter LLMs autonomously can become expensive quickly. An agent that makes 15 calls to gpt-4o to research a topic will cost roughly $0.15 per task. At 10,000 tasks a month, that is $1,500.
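That back-of-envelope arithmetic can be sketched in a few lines (the per-call cost is an assumed average, not a published price):

```python
# Rough monthly cost estimate for an autonomous research agent.
# All figures here are illustrative assumptions, not published pricing.
calls_per_task = 15
cost_per_call = 0.01        # assumed average $/call for gpt-4o at typical prompt sizes
tasks_per_month = 10_000

cost_per_task = calls_per_task * cost_per_call
monthly_cost = cost_per_task * tasks_per_month
print(f"${cost_per_task:.2f} per task, ${monthly_cost:,.0f} per month")
```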
Here is a practical guide to reducing LLM costs on Savine by up to 80% without sacrificing output quality.
1. Use Router Patterns (Cheaper Models First)
Not every step of a task requires elite reasoning. If you have an agent reading thousands of user emails to classify them as "Refund Requests" or "Technical Support", do not use gpt-4o.
The Optimisation: Build a system.json Router node using gemini-2.0-flash or llama-3.3-70b (via Groq). These models cost pennies per million tokens. Only route the complex "Technical Support" queries to the expensive claude-3-5-sonnet agent.
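Assuming the Router node accepts the same llm block shown in the agent.json example later in this guide (the exact system.json schema and route field names here are assumptions), a cheap-first router might look like:

```json
"router": {
  "llm": {
    "provider": "google",
    "model": "gemini-2.0-flash",
    "max_tokens": 64
  },
  "routes": {
    "refund_request": "refund_agent",
    "technical_support": "claude_support_agent"
  }
}
```

The router only ever emits a short classification label, so even at high volume its cost stays negligible next to the downstream claude-3-5-sonnet agent.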
2. Reduce the System Prompt Context Window
Input tokens cost money. Every single time the AgentGraphEngine loops, the LLM reads the entire conversation history plus the system prompt.
The Mistake: A 5,000 token system prompt explaining the entire history of your company, when the agent just needs to format a CSV. 5,000 tokens * 10 execution loops = 50,000 input tokens billed.
The Fix:
- Keep `system_prompt` under 500 tokens if possible.
- If you need vast context, put the knowledge into Savine's persistent memory and instruct the agent to use `vector_search` ONLY when it encounters an edge case.
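The compounding effect of prompt length is easy to see in the billing arithmetic (the per-token price below is an assumed figure for illustration):

```python
# The full system prompt is re-read on every loop of the agent,
# so its token count is multiplied by the number of execution loops.
# The price per million tokens is an illustrative assumption.
system_prompt_tokens = 5_000
loops = 10
price_per_million_input = 2.50   # assumed $/1M input tokens

billed_tokens = system_prompt_tokens * loops
cost = billed_tokens / 1_000_000 * price_per_million_input
print(f"{billed_tokens} input tokens billed, ${cost:.3f}")

# Trimming the prompt to 500 tokens cuts both figures tenfold.
```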
3. Limit Output Tokens
Generation (output tokens) is typically 3x to 5x more expensive than ingestion (input tokens).
The Fix: Set strict limits in agent.json:

```json
"llm": {
  "provider": "openai",
  "model": "gpt-4o",
  "max_tokens": 512
}
```

Add constraint rules to the prompt: "Output your findings as exactly 3 bullet points. Do not write introductory or concluding prose."
4. Leverage Groq for Speed-Critical Paths
Execution time is itself a cost, especially in multi-agent systems where downstream nodes sit idle waiting on parallel branches.
By mapping high-volume agents to provider: groq, you run them on LPU hardware that generates 500+ tokens per second, sharply cutting wall-clock latency and the infrastructure costs that scale with it.
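Switching a node over is a one-field change if the llm block mirrors the agent.json snippet shown earlier (the Groq model string below is an assumption):

```json
"llm": {
  "provider": "groq",
  "model": "llama-3.3-70b",
  "max_tokens": 512
}
```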
Worked Example Result
A customer migrating a monolithic LangChain gpt-4 script to a Savine System:
- Replaced the single agent with a 3-node system.
- Node 1 (Triage): `gemini-1.5-flash` ($0.0001)
- Node 2 (Searcher): `llama-3.3-70b` ($0.0004)
- Node 3 (Synthesis): `gpt-4o` ($0.0150)

Total cost reduced from $0.08 per run to $0.0155 per run (an 80% reduction).
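The saving can be checked directly from the per-node figures:

```python
# Verify the per-run saving from the worked example above.
old_cost = 0.08                         # monolithic gpt-4 run
node_costs = [0.0001, 0.0004, 0.0150]   # triage, searcher, synthesis
new_cost = sum(node_costs)
reduction = 1 - new_cost / old_cost
print(f"${new_cost:.4f} per run, {reduction:.1%} cheaper")
```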