The Forge
Feed it provider API keys. Get back a ranked catalog of the datasets actually available to you, plus a slate of strategy candidates built on top of them.
Overview
Most quants spend more time hunting for usable data than they spend on alpha. Provider APIs have inconsistent coverage, undocumented gotchas, and asymmetric quality across asset classes. The Forge encodes that exploration as a pipeline of cooperating LLM agents, each with a narrow job.
The pipeline runs once per scan and produces three artifacts:
- A catalog of datasets discovered across your active providers, each scored 0–100 on six quality dimensions and assigned a verdict.
- A list of research plans — hypotheses the Planner agent proposed by combining your useful datasets.
- A set of strategy candidates, each one a runnable backtest with metrics.
None of this auto-deploys. The whole point is to surface options for you to review.
Agents
Four cooperating agents, each with a narrow job.
| Agent | What it does |
|---|---|
| Discovery | Classifies your API keys against the bundled registry (14 providers). Enumerates each provider's capabilities into the catalog. |
| Probe | Calls each provider for a small sample and scores it on the six quality dimensions. The only agent that talks to external APIs. |
| Planner | Reads the catalog and proposes research plans — falsifiable hypotheses that combine 1–3 useful datasets. |
| Synthesis | Codes one research plan as a runnable strategy, backtests it, and parks the result for your review. The only agent that writes executable code. |
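The handoffs in the table read as one linear scan. Below is a minimal sketch of that orchestration, assuming a generic `run_agent(name, payload)` dispatcher; all names here are illustrative, not the actual Forge interfaces.

```python
def run_scan(api_keys, registry, run_agent):
    """Hypothetical scan loop chaining the four agents in order."""
    # Discovery: classify keys against the registry, enumerate capabilities.
    catalog = run_agent("discovery", {"keys": api_keys, "registry": registry})
    for entry in catalog:
        # Probe: the only agent that talks to external provider APIs.
        # Assumed to return the quality scores and a verdict for the entry.
        entry.update(run_agent("probe", entry))
    # Planner only sees datasets that passed the verdict gate.
    useful = [e for e in catalog if e.get("verdict") == "useful"]
    plans = run_agent("planner", {"datasets": useful})
    # Synthesis: one runnable, backtested candidate per plan.
    candidates = [run_agent("synthesis", plan) for plan in plans]
    return catalog, plans, candidates
```

The dispatcher shape is an assumption; the point is the strict ordering and the narrow surface each agent sees.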
Catalog
The catalog is a per-install table of every dataset the Forge knows about. Each row is a (provider, capability, asset_class) triple with quality scores, a verdict, and links to provenance.
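One way to picture a catalog row, sketched as a Python dataclass. The (provider, capability, asset_class) triple, the scores, the verdict, and the provenance links come from the description above; the exact field names and types are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogRow:
    """Illustrative shape of one catalog row; not the real schema."""
    provider: str          # e.g. "polygon"
    capability: str        # e.g. "aggregates"
    asset_class: str       # e.g. "crypto"
    scores: dict = field(default_factory=dict)      # six dimensions, 0-100 each
    verdict: str = "irrelevant"                      # useful/marginal/poor/irrelevant
    provenance: list = field(default_factory=list)   # links back to probe runs
```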
The six quality dimensions
Each dimension is scored 0–100 by the Probe agent.
| Dimension | Question it answers |
|---|---|
| coverage | How much of the universe (symbols, time range) is actually present? |
| freshness | How recent is the most recent point? How long is the publish lag? |
| completeness | What fraction of expected fields are non-null? |
| uniqueness | Does this dataset duplicate something else in the catalog? |
| stability | Does historical data get revised? Is the schema stable across releases? |
| trading_relevance | Is this data plausibly tradable (vs. illustrative or after-the-fact)? |
The aggregate score
The six dimensions are combined as a geometric mean, not an arithmetic one. The architectural point: a zero in any single dimension is punished far harder than averaging would punish it. A Polygon endpoint that's 90/100 on five dimensions but 0/100 on coverage is not "average" — the geometric mean collapses to roughly 43 where the arithmetic mean would be 75, and the verdict reflects that.
```python
# From auracle/forge/catalog.py
import math

# Floor the zeroes at 1 so log() is defined.
floors = [max(d, 1) for d in dims]
score = math.exp(sum(math.log(d) for d in floors) / len(floors))
# All six at 90 -> 90. One 0 (floored to 1) among five 90s -> ~43,
# versus an arithmetic mean of 75.
```
Verdict thresholds
The aggregate score maps to a verdict by these thresholds:
| Score | Verdict | What it means |
|---|---|---|
| ≥ 70 | useful | Pass straight into the Planner's input pool. |
| ≥ 40 | marginal | Planner may use it with a caveat; backfill or supplement first. |
| ≥ 15 | poor | Excluded from Planner. Visible in the catalog so you know it exists. |
| < 15 | irrelevant | Effectively hidden. Catalog entry kept for audit only. |
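The threshold ladder above can be sketched as a single function; this is illustrative, not the actual implementation.

```python
def verdict(score: float) -> str:
    """Map an aggregate 0-100 score to a catalog verdict
    using the thresholds from the table above."""
    if score >= 70:
        return "useful"
    if score >= 40:
        return "marginal"
    if score >= 15:
        return "poor"
    return "irrelevant"
```

So a dataset scoring in the low 40s lands in the Planner's input pool only with a caveat, and anything under 15 is hidden from day-to-day use.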
Research
A research plan is a structured hypothesis. The Planner emits them as JSON in `forge_research_plans.plan_json`; you can also write them by hand.
```json
{
  "hypothesis": "OI divergence on Coinalyze precedes 4h mean reversion in BTC-USD",
  "inputs": [
    { "dataset_id": 12, "alias": "oi" },
    { "dataset_id": 7, "alias": "bars" }
  ],
  "feature_spec": {
    "oi_change_24h": "oi.delta(24h) / oi.rolling(7d).mean()",
    "ret_4h": "bars.close.pct_change(4h)"
  },
  "label_spec": { "horizon": "4h", "kind": "log_return" },
  "evaluation": {
    "walk_forward": { "train": "12mo", "test": "1mo", "step": "1mo" },
    "costs": { "bps": 5 },
    "metrics": ["sharpe", "max_dd", "hit_rate"]
  }
}
```
Plans can be queued, paused, retried, or rejected. The Planner won't re-emit a plan it already proposed in the past 30 days (content-hash dedupe).
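The 30-day dedupe can be implemented by hashing a canonical serialization of the plan. This sketch assumes the content hash is SHA-256 over key-sorted JSON; the real hashing scheme is not documented here.

```python
import hashlib
import json

def plan_hash(plan: dict) -> str:
    # Canonicalize: sorted keys, no insignificant whitespace, so two
    # semantically identical plans always hash identically.
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_duplicate(plan: dict, recent_hashes: set) -> bool:
    # recent_hashes: hashes of plans emitted in the past 30 days.
    return plan_hash(plan) in recent_hashes
```

Key order in the source dict doesn't matter, which is the property that makes content-hash dedupe robust against cosmetic re-orderings.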
Candidates
A strategy candidate is the output of running Synthesis on one plan. It lives in `forge_strategy_candidates` with:

- A pointer to the source plan and the inputs used.
- The generated Python file path under `strategies/_forge/`.
- The backtest result blob (equity curve, trades, per-period metrics).
- A score against the plan's metric targets, plus a Planner-summarized verdict.

In the web UI, candidates appear in a table sorted by Sharpe. Clicking one opens the candidate review surface — equity curve, trade list, generated code, and a Promote button. Promotion copies the file out of `_forge/` into your active strategies directory; only then can it be deployed.
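Promotion is, at its core, a reviewed file copy out of the sandbox. A sketch with hypothetical paths and function names, making no claim to match the real implementation:

```python
import shutil
from pathlib import Path

def promote(candidate_path: str, strategies_dir: str) -> Path:
    """Copy a reviewed candidate out of the _forge/ sandbox into the
    active strategies directory. Paths and behavior are illustrative."""
    src = Path(candidate_path)
    dest = Path(strategies_dir) / src.name
    if dest.exists():
        # Never silently clobber a live strategy file.
        raise FileExistsError(f"{dest} already exists; refusing to overwrite")
    shutil.copy2(src, dest)  # copy2 preserves timestamps for auditability
    return dest
```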
Limits & audit
The Forge spends money (LLM tokens) and writes code (synthesized strategies). Both are bounded.
Cost caps
- Per-scan: hard cap of $5 USD of LLM spend by default. Configurable in `auracle forge config`.
- Per-day: $25 USD aggregate across all agents.
- Per-dataset probe: $0.50 + 2 minutes of API time.
Hitting any cap pauses the scan; the partial catalog is preserved.
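A pause-on-cap guard like this can be sketched as a running tally checked on every charge. The class and method names are hypothetical; only the $5 per-scan and $25 per-day defaults come from the list above.

```python
class BudgetExceeded(Exception):
    """Raised when a spend cap is hit; the caller pauses the scan."""

class CostTracker:
    def __init__(self, scan_cap: float = 5.0, day_cap: float = 25.0):
        self.scan_cap = scan_cap
        self.day_cap = day_cap
        self.scan_spend = 0.0
        self.day_spend = 0.0

    def charge(self, usd: float) -> None:
        # Record the spend first so the partial tally survives the pause.
        self.scan_spend += usd
        self.day_spend += usd
        if self.scan_spend > self.scan_cap or self.day_spend > self.day_cap:
            raise BudgetExceeded(
                f"scan ${self.scan_spend:.2f} / day ${self.day_spend:.2f}"
            )
```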
Audit trail
`forge_agent_runs` records every agent invocation: which agent, which dataset, input fingerprint, token usage, wall-clock time, output fingerprint, and the `install_uuid`. Replay a scan by re-running the same input fingerprints; the output should hash-match (modulo non-determinism flagged at the LLM call boundary).
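Replay hinges on stable fingerprints. One plausible scheme, an assumption rather than the documented one, is a SHA-256 digest over canonical JSON of the recorded inputs and outputs:

```python
import hashlib
import json

def fingerprint(obj) -> str:
    # Stable digest over canonical JSON; hypothetical stand-in for the
    # input/output fingerprints recorded in forge_agent_runs.
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def replay_matches(run_record: dict, agent_fn) -> bool:
    """Re-run an agent on its recorded input and check that the output
    fingerprint hash-matches (assuming the call is deterministic)."""
    output = agent_fn(run_record["input"])
    return fingerprint(output) == run_record["output_fingerprint"]
```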
Promotion gate
Generated strategies are sandboxed and never auto-deployed. You review the code, the backtest, and the metrics before Promote moves the file into your active strategies directory; only then can it be deployed live.
Costs
Default budgets per agent (configurable per install).
| Agent | Per-call cap | Per-scan cap | Model |
|---|---|---|---|
| Discovery | 4k input / 1k output | $0.10 | Claude Haiku |
| Probe | 2k input / 500 output | $0.50 / dataset | Claude Haiku |
| Planner | 16k input / 4k output | $1.50 | Claude Sonnet |
| Synthesis | 32k input / 8k output | $2.50 / candidate | Claude Sonnet |
Cost is logged in `forge_agent_runs.cost_usd`. Aggregate with a direct SQL query (`SELECT SUM(cost_usd) FROM forge_agent_runs WHERE created_at > now() - interval '30 days'`) until the `auracle forge costs` CLI ships in v1.1.