The bug that wasn't a bug
Last weekend I spent a day with Opus 4.6 building an AI-driven crypto paper-trading strategy. It shipped. Seven sell strategies, three rebuy strategies, a Supabase-backed dashboard, automated emails, the works. It ran 3× daily on a $20K fund. I was genuinely proud.
Tonight I sat down with Opus 4.7 and the first thing I noticed, scrolling through the trade log: Strategy 5 (AVAX Narrative Shift) had fired 9 times in 3 days, trimming AVAX from 269 units to 20 — a 92.5% liquidation. Right as Bitwise launched the first Avalanche ETF and AVAX was printing a fresh 52-week high on its Galaxy Score.
The strategy wasn't broken. It was doing exactly what I told it to do. That was the problem.
The gap between "shipped working code" and "this is a strategy I'd trust with real money" turned out to be an afternoon of honest analysis — something the weekend build never had.
What Opus 4.6 built over the weekend
The v1 strategy was fine as a prototype. LunarCrush sentiment, Alpaca paper trades, Supabase snapshots, a dashboard hosted via a publish MCP, iMessage alerts. Every piece worked. The architecture was clean enough to hand off.
What it *didn't* have:
- Any backtest. Parameter values (25% portfolio stop, 10% trailing stop, 50% sentiment drop) were picked because they sounded reasonable. That's it.
- A cooldown on Strategy 4 (portfolio drawdown trim). Once the -15% threshold broke, the rule would fire *every single run* until recovery — potentially shredding a position in a normal dip.
- Rebuy symmetry. The sell rules were aggressive. The rebuy rules required sentiment recovery that might never happen on the exact assets we sold. Cash accumulated. It stayed in cash.
- Verified inputs. Strategy 7 (LINK Treasury Dump) pointed at two Ethereum addresses. When I actually queried them tonight, one held ~27 LINK (~$257) and the other held nothing. They weren't the Chainlink treasury. The rule could never fire. Nobody noticed.
None of these are "bugs" in the compile-time sense. The code ran. Trades executed. The snapshot wrote. The email sent. You could review every function and sign off.
What Opus 4.7 did tonight
I asked a simple question: *can we validate these rules against historical data?* That one prompt changed everything.
Within the hour we had:
- A Python backtest pulling 12 months of daily OHLC from CoinGecko for all 6 assets
- Three simulated strategies: buy-and-hold, v1 rules as-written, v2 rules as-proposed
- Per-strategy trade counts, Sharpe ratios, max drawdown dates, equity curves
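The scoring side of that backtest is simple enough to sketch. Here's a minimal, self-contained version of the two headline metrics it reports, max drawdown and annualized Sharpe, run against a toy equity curve standing in for the real CoinGecko data (the function names and numbers here are mine, not the production script's):

```python
import math

def max_drawdown(equity):
    """Largest peak-to-trough decline in an equity curve, as a fraction."""
    peak, worst = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

def sharpe(daily_returns, periods=365):
    """Annualized Sharpe ratio, risk-free rate assumed zero."""
    mean = sum(daily_returns) / len(daily_returns)
    var = sum((r - mean) ** 2 for r in daily_returns) / len(daily_returns)
    std = math.sqrt(var)
    return (mean / std) * math.sqrt(periods) if std else 0.0

# Toy equity curve: rally, crash, partial recovery
equity = [100, 110, 120, 90, 80, 95, 105]
rets = [equity[i] / equity[i - 1] - 1 for i in range(1, len(equity))]
print(round(max_drawdown(equity), 3))  # 0.333 (peak 120, trough 80)
```

The same two functions score all three simulated strategies, which is what makes the HODL-vs-V1-vs-V2 comparison apples-to-apples.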
The first run was humbling. HODL with v2 weights returned -2.3% over the year with a 60.7% drawdown. V1 returned +29.5% with only a 12.7% drawdown — which sounds great until you notice the trade log: *it stopped out of every position by mid-summer and sat in 100% cash for the remaining 10 months.* V1 "won" by liquidating once and never re-entering. That's not a strategy. That's survivorship bias.
Opus 4.6 (weekend build)
Shipped a working end-to-end system in one session. Picked reasonable defaults. Did not backtest, did not question whether the rule parameters actually produced good outcomes, and did not cross-check external inputs like on-chain treasury addresses.
Opus 4.7 (tonight)
Proactively proposed the backtest before writing the v2 rules. Caught survivorship bias in v1 on the first run. Caught a bug in its *own* v2 implementation (Strategy 4 re-firing daily) on the second run. Iterated four times before it was willing to lock in parameter values. Dropped Strategy 7 when verification showed the treasury addresses were wrong rather than implementing it "as specified."
The specific fixes that only a backtest could surface
Strategy 4 (Portfolio Drawdown) re-fires. Rule as written: "If portfolio down >15% from peak, reduce all positions by 25%." In the first v2 backtest, this rule fired 187 times in 365 days. Each bad week, positions got trimmed by 25%, then 25% of what was left the next day, then 25% of what was left the day after. Death by a thousand 25% cuts. The fix: re-arm only when a new all-time peak is reached, not on partial recoveries. Fires dropped from 187 to 10.
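The re-arm logic is small. A minimal sketch, assuming a single `state` dict tracks the peak and whether the rule is armed (the names and numbers are illustrative, not the production code):

```python
def drawdown_trim(value, state, threshold=0.15, trim=0.25):
    """Fire the drawdown trim at most once per peak cycle.

    Re-arms only when equity makes a new all-time peak, so a slow
    recovery can't trigger repeated 25% cuts on the way back up.
    """
    if value > state["peak"]:
        state["peak"] = value
        state["armed"] = True          # new peak: rule may fire again
    dd = 1 - value / state["peak"]
    if state["armed"] and dd > threshold:
        state["armed"] = False         # disarm until the next new peak
        return trim                    # fraction of each position to sell
    return 0.0

state = {"peak": 100.0, "armed": True}
fires = [drawdown_trim(v, state) for v in [100, 84, 80, 82, 84, 86, 101, 83]]
print(fires)  # fires once at 84, re-arms at the new peak 101, fires again at 83
```

The v1 version effectively re-armed every day the portfolio was below peak, which is exactly the 187-fires failure mode.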
The "dump to cash" trap. When a trailing stop fires, the v1 rule sells to cash and waits for a "sentiment recovery" rebuy — which for some assets might never come. In a multi-month crypto slump, cash accumulates and doesn't get redeployed. V2's fix: rotate proceeds into other assets scored by 7-day momentum × under-allocation, *unless* a regime gate fires. Net-zero rebalancing instead of cash dumps.
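The rotation score is just the two factors multiplied together. A sketch under that assumption, with made-up prices and weights (the asset names and numbers are illustrative):

```python
def rotation_scores(prices_7d_ago, prices_now, weights_now, targets):
    """Score each asset by 7-day momentum x under-allocation.

    Proceeds from a stop-out go to the highest-scoring asset
    rather than sitting in cash.
    """
    scores = {}
    for asset in targets:
        momentum = prices_now[asset] / prices_7d_ago[asset] - 1
        under = max(0.0, targets[asset] - weights_now[asset])
        scores[asset] = momentum * under
    return scores

scores = rotation_scores(
    prices_7d_ago={"BTC": 100.0, "SOL": 100.0},
    prices_now={"BTC": 105.0, "SOL": 112.0},
    weights_now={"BTC": 0.30, "SOL": 0.10},
    targets={"BTC": 0.35, "SOL": 0.25},
)
best = max(scores, key=scores.get)
print(best)  # SOL: stronger momentum and further under target
```

The `max(0.0, ...)` clamp matters: an over-allocated asset scores zero, so rotation never piles into a position that's already above target.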
Contagion regime detection. The classic "risk-off" signal (BTC below 200-day SMA, portfolio drawdown >10%) is slow. By the time both conditions are true, you've already taken the pain. V2 adds a faster signal: 3+ trailing-stop fires in the trailing 14 days → flip to risk-off for 30 days. In the backtest this single heuristic cut max drawdown from -31% (v2 without the gate) to -13.4%.
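The gate itself is a few lines of bookkeeping. A minimal sketch with hypothetical dates and a simplified state dict (the real implementation presumably persists this in Supabase):

```python
from datetime import date, timedelta

def update_regime(stop_fire_dates, today, state,
                  window_days=14, fires_to_trip=3, risk_off_days=30):
    """Fast risk-off gate: 3+ trailing-stop fires inside a 14-day
    window flips the portfolio to risk-off for 30 days."""
    recent = [d for d in stop_fire_dates if (today - d).days <= window_days]
    if len(recent) >= fires_to_trip:
        state["risk_off_until"] = today + timedelta(days=risk_off_days)
    until = state.get("risk_off_until")
    return until is not None and today <= until

state = {}
fires = [date(2025, 6, 1), date(2025, 6, 5), date(2025, 6, 9)]
print(update_regime(fires, date(2025, 6, 10), state))  # True: gate trips
print(update_regime(fires, date(2025, 7, 15), state))  # False: 30 days elapsed
```

Because the gate keys off the strategy's own stop fires rather than an external indicator, it reacts as fast as the stops do, which is the whole point.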
Rebuy 4: Deploy-to-Target. A new rule that says: if you're sitting on idle cash and any asset is >5 percentage points under its target weight, buy up to $500/day into the most under-weight one. Solves the post-sell-off "frozen in cash" problem with simple DCA.
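Stated as code, the rule is nearly a one-liner. A sketch using the $500/day and 5-point parameters from the rule, with illustrative weights:

```python
def deploy_to_target(cash, weights_now, targets, portfolio_value,
                     gap_pct=0.05, daily_cap=500.0):
    """Deploy idle cash toward the most under-weight asset.

    Buys up to daily_cap dollars into whichever asset is more than
    gap_pct under its target weight; returns (asset, dollars) or None.
    """
    gaps = {a: targets[a] - weights_now.get(a, 0.0) for a in targets}
    asset, gap = max(gaps.items(), key=lambda kv: kv[1])
    if gap <= gap_pct or cash <= 0:
        return None
    dollars = min(daily_cap, cash, gap * portfolio_value)
    return asset, round(dollars, 2)

order = deploy_to_target(
    cash=2_000.0,
    weights_now={"ETH": 0.10, "AVAX": 0.18},
    targets={"ETH": 0.20, "AVAX": 0.20},
    portfolio_value=100_000.0,
)
print(order)  # ('ETH', 500.0): ETH is 10 points under target, capped at $500/day
```

The daily cap is what makes this DCA rather than a lump-sum re-entry: after a sell-off, cash bleeds back into the market over days instead of all at once.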
The numbers
12-month backtest results

| Strategy | Return | Max drawdown | Sharpe | Notes |
|----------|--------|--------------|--------|-------|
| HODL | -2.3% | -60.7% | 0.28 | |
| V1 rules | +27.8% | -12.7% | 1.07 | ended in 100% cash |
| V2 rules | +35.2% | -13.4% | 1.28 | stayed actively deployed through recovery windows |
V2 wins on every metric that matters and it does so by *trading more thoughtfully* — 81 trades vs V1's 38 — not by avoiding the market.
What I actually learned about AI-assisted engineering
The code Opus 4.6 wrote was fine. I'd happily review it in a PR. It shipped. It did what the spec said.
But 4.6 never *questioned the spec*. It didn't ask "are these parameters right?" It didn't say "this treasury address doesn't look like a real Chainlink wallet, should we verify?" It didn't propose a backtest. It just implemented what I described.
4.7 did those things without prompting. It flagged Strategy 5's production behavior as an anomaly worth investigating. It proposed the rotation layer *before* I asked for it. When its own v2 backtest showed 625 trades, it said "that's too noisy, let me find the bug" instead of shipping it. When the contagion gate turned out to be load-bearing, it ran the numbers three different ways to confirm.
The deliverable difference looks modest: a better-performing strategy and a cleaner dashboard. The methodological difference is enormous: one model delivers a plausible implementation of what you asked for, the other delivers a *validated* version of what you actually wanted.
The weekend build worked. Tonight's build earned the right to manage real money. Same person giving the prompts. Different model proposing the work.
What's next
The new strategy is live on the full ~$101K Alpaca equity as of tonight. The v1 tracking tables are preserved via a `strategy_version` column — same database, same schema, queryable for side-by-side comparisons later. The v2 dashboard lives at /dashboard. The old one gets a deprecation banner and a forward link.
Two things on the follow-up list:
- Port the monitor logic from the scheduled Claude task to a Vercel cron that calls Alpaca's REST API directly. The Claude scheduled task is great for research/analysis but overkill for deterministic rule execution.
- Wait 7 days for the `strategy_outcomes` table to start auto-evaluating each trade. Every sell gets checked against a 7-day counterfactual — right or wrong. Over time the system learns which rules are earning their keep and which aren't.
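The counterfactual check is simple to state precisely. Here's a sketch of what one evaluation row might compute, assuming the real `strategy_outcomes` logic compares the sell price to the price 7 days later (the function and its fields are illustrative, not the actual schema):

```python
def grade_sell(sell_price, price_7d_later, fees_pct=0.0):
    """7-day counterfactual check for a sell: the sell was 'right'
    if the asset was cheaper a week later (we avoided downside),
    net of an optional fee hurdle."""
    avoided = (sell_price - price_7d_later) / sell_price
    return {"correct": avoided > fees_pct,
            "avoided_move_pct": round(avoided * 100, 2)}

print(grade_sell(40.0, 34.0))  # sold at 40, dropped to 34: correct, avoided 15%
print(grade_sell(40.0, 46.0))  # sold at 40, rallied to 46: wrong call
```

Aggregating these per rule is what lets the system eventually say "Strategy 3 is right 70% of the time, Strategy 6 is a coin flip."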
If you're curious about the live numbers, the dashboard updates 3× daily.