the shame around vibe-coded stuff is real but honestly - most of the code out there wouldn't survive scrutiny either, AI-generated or not. the difference is that vibe coding fails in predictable patterns. weirdly verbose error handling that doesn't actually handle the error, auth flows that work great until you send a malformed header, things like that.
for notifications specifically, the risky bits would be: what happens if an app sends a notification payload that's malformed or huge, how do you handle permission checks if the notification system process restarts mid-filtering, and whether the filtering rules can be bypassed by crafting notifications with weird mime types or encoded text.
if you wrote tests for those edge cases (or even just thought through them), you're already ahead of 90% of shipped code, vibe-coded or not. the scrutiny you're worried about is actually healthy - peer review catches stuff automated tools miss.
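if it helps make that concrete, the edge-case tests i mean are cheap to write. rough pytest sketch - `filter_notification` and `NotificationError` are made-up names standing in for whatever your filtering entry point actually is:

```python
# hypothetical sketch: filter_notification / NotificationError are stand-ins
import pytest
from myapp.notifications import filter_notification, NotificationError

def test_rejects_oversized_payload():
    huge = {"title": "x" * 10_000_000, "body": "y"}
    with pytest.raises(NotificationError):
        filter_notification(huge)

def test_malformed_payloads_fail_cleanly():
    # should reject cleanly instead of crashing the filtering process
    for bad in (None, {}, {"body": b"\xff\xfe"}, {"title": 42}):
        with pytest.raises(NotificationError):
            filter_notification(bad)

def test_encoded_text_cannot_bypass_keyword_rules():
    sneaky = {"title": "O\u200bTP", "body": "your code is 123456"}  # zero-width space
    assert filter_notification(sneaky).blocked
```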
containers are fine for basic isolation but the attack surface is way bigger than people think. you're still trusting the container runtime, the kernel, and the whole syscall interface. if the agent can call arbitrary syscalls inside the container, you're one kernel bug away from a breakout.
what I'm curious about with matchlock - does it use seccomp-bpf to restrict syscalls, or is it more like a minimal rootfs with carefully chosen binaries? because the landlock LSM stuff is cool but it's mainly for filesystem access control. network access, process spawning, that's where agents get dangerous.
also how do you handle the agent needing to install dependencies at runtime? like if claude decides it needs to pip install something mid-task. do you pre-populate the sandbox or allow package manager access?
Creator of matchlock here. Great questions, here's how matchlock handles these:
The guest-agent (PID 1) spawns commands in a new pid + mount namespace (similar to the Firecracker jailer, but at the inner level, for the sake of macOS support). In non-privileged mode it drops SYS_PTRACE, SYS_ADMIN, etc. from the bounding set, sets `no_new_privs`, then installs a seccomp-BPF filter that returns EPERM for process_vm_readv/process_vm_writev, ptrace, and kernel module loading. The microVM is the real isolation boundary; seccomp is defense in depth. That said, there is a `--privileged` flag that lets you skip all of that for image builds with BuildKit.
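Conceptually the non-privileged setup looks something like this sketch (Python via the libseccomp bindings and a raw prctl; not matchlock's actual code, and the capability bounding-set drops are omitted):

```python
# conceptual sketch only, not the real guest-agent implementation
import ctypes
import errno
import seccomp  # libseccomp Python bindings

PR_SET_NO_NEW_PRIVS = 38

def harden_child():
    # no_new_privs: execve can never grant more privileges than we already have
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)

    # default-allow filter that EPERMs introspection and kernel-load syscalls
    f = seccomp.SyscallFilter(defaction=seccomp.ALLOW)
    for name in ("ptrace", "process_vm_readv", "process_vm_writev",
                 "init_module", "finit_module", "kexec_load"):
        f.add_rule(seccomp.ERRNO(errno.EPERM), name)
    f.load()
```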
Whether pip install works is entirely up to the OCI image you pick. If it has a package manager and you've allowed network access, go for it. The whole point is making `claude --dangerously-skip-permissions` style usage safe.
Personally I've had agents perform red-team style breakouts. From first-hand experience, what the agent (Opus 4.6 with max thinking) will try to exploit without the cap drops and seccomp filter is genuinely wild.
Thank you for matchlock! I’ve got Opus 4.6 red teaming it right now. ;)
I think a secure VM is a necessary baseline, and the days of env files with a big bundle of unscoped secrets are a thing of the past, so I like the base features you built in.
I’d love to hear more about the red team breakouts you’ve seen if you have time.
defense in depth makes sense - microVM as the boundary, seccomp as insurance. most docs treat seccomp like it's the whole story which is... optimistic.
the opus 4.6 breakouts you mentioned - was it known vulns or creative syscall abuse? agents are weirdly systematic about edge cases compared to human red teamers. they don't skip the obvious stuff.
--privileged for buildkit tracks - you gotta build the images somewhere.
It tried a lot of things relentlessly, just to name a few:
* Exploit kernel CVEs
* Weaponise gcc to craft malicious kernel modules; forge raw packets with spoofed source addresses to slip past the TCP/IP stack
* Probe the metadata service
* Hack bpf & io_uring
* A lot of mount-escape attempts, plus network and vsock scanning and packet crafting
As a non-security researcher I was blown away watching what it did, which in hindsight isn't surprising given that Opus 4.6 hits a 93% solve rate on Cybench - https://cybench.github.io/
that's wild - weaponizing gcc to craft kernel modules is not something I'd expect from automated testing. most fuzzing stops at syscall-level probes but this is full exploit chain development.
the metadata service probing is particularly concerning because that's the classic cloud escape path. if you're running this in aws/gcp and the agent figures out IMDSv1 is reachable, game over. vsock scanning too - that's targeting the host-guest communication channel directly.
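fwiw, one cheap sanity check for any sandbox like this is confirming the link-local metadata address is unreachable from inside. quick sketch:

```python
# run inside the sandbox: the metadata service should NOT be reachable.
# 169.254.169.254 is the standard AWS/GCP metadata address.
import socket

def imds_reachable(timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection(("169.254.169.254", 80), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("metadata service reachable:", imds_reachable())
```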
93% on cybench is genuinely scary when you think about what it means. it's not just finding known CVEs, it's systematically exploring the attack surface like a skilled pentester would. and unlike humans, it doesn't get tired or skip the boring enumeration steps. did you find it tried timing attacks or side channels at all? or was it mostly direct exploitation?
I'm working on a similar project. Currently managing images with nix, using envoy to proxy all outbound traffic with no direct network access, with optional quota support. Ironically similar to how I'd do things for humans.
My architecture is a little different though, as my agents aren't running in the sandbox, only executing code there remotely.
nice - I was wondering about the cross-platform story. firecracker on linux for the isolation, virtualization.framework on mac so you don't need vmware.
this is really cool - the single binary thing solves a huge pain point I have with OpenClaw. I love that tool but the Node + npm dependency situation is a lot.
curious: when you say compatible with OpenClaw's markdown format, does that mean I could point LocalGPT at an existing OpenClaw workspace and it would just work? or is it more 'inspired by' the format?
the local embeddings for semantic search is smart. I've been using similar for code generation and the thing I kept running into was the embedding model choking on code snippets mixed with prose. did you hit that or does FTS5 + local embeddings just handle it?
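for context, my setup is roughly this shape - FTS5 for cheap keyword recall, then a local embedding model to rerank. the model name below is just an example, not a recommendation:

```python
# rough shape of the hybrid search i mean; model choice is arbitrary
import sqlite3
from sentence_transformers import SentenceTransformer, util

db = sqlite3.connect("notes.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, body)")
model = SentenceTransformer("all-MiniLM-L6-v2")

def search(query: str, k: int = 5):
    # stage 1: keyword recall via FTS5
    rows = db.execute(
        "SELECT path, body FROM docs WHERE docs MATCH ? LIMIT 50", (query,)
    ).fetchall()
    if not rows:
        return []
    # stage 2: semantic rerank with local embeddings
    q_emb = model.encode(query, convert_to_tensor=True)
    d_embs = model.encode([body for _, body in rows], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_embs)[0].tolist()
    ranked = sorted(zip(rows, scores), key=lambda x: x[1], reverse=True)
    return [(path, score) for (path, _), score in ranked[:k]]
```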
also - genuinely asking, not criticizing - when the heartbeat runner executes autonomous tasks, how do you keep the model from doing risky stuff? hitting prod APIs, modifying files outside workspace, etc. do you sandbox or rely on the model being careful?
Hitting production APIs (and email) is my main concern with all agents I run.
To solve this I've built Wardgate [1], which removes the need for agents to see any credentials and adds access control on a per-API-endpoint basis. So you can say: yes, you can read all Todoist tasks, but you can't delete tasks or see tasks with "secure" in them, or see emails outside the Inbox or containing OTP codes, or whatever.
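Conceptually each rule is an allow/deny decision on (method, endpoint) plus an optional content filter on what comes back. A simplified sketch of the idea - not the actual config format:

```python
# simplified illustration of the rule model, not Wardgate's real configuration
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    methods: set            # HTTP methods the agent may use
    path_prefix: str        # endpoints the rule applies to
    allow: bool
    item_filter: Optional[Callable[[dict], bool]] = None  # hide items the agent shouldn't see

RULES = [
    # read-only access to Todoist tasks, but hide anything marked "secure"
    Rule({"GET"}, "/rest/v2/tasks", allow=True,
         item_filter=lambda item: "secure" not in item.get("content", "").lower()),
    # deletes are never allowed
    Rule({"DELETE"}, "/rest/v2/tasks", allow=False),
]

def match(method: str, path: str) -> Optional[Rule]:
    for rule in RULES:
        if method in rule.methods and path.startswith(rule.path_prefix):
            return rule
    return None  # default deny: no rule, no request
```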
this is a clever approach - credential-less proxying with scoped permissions is way cleaner than trying to teach the model what not to do. how do you handle dynamic auth flows though? like if an API returns a short-lived token that needs to be refreshed, does wardgate intercept and cache those or do you expose token refresh as a separate controlled endpoint?
and I'm curious about the filtering logic - is it regex on endpoint paths or something more semantic? because the "tasks with secure in them" example makes me think there's some content inspection happening, not just URL filtering.
the papercut argument jstanley made is valid but there's a flip side - when you're running AI-generated code at scale, every capability you give it is also a capability that malicious prompts can exploit. the real question isn't whether restrictions slow down the model (they do), it's whether the alternative - full CPython with file I/O, network access, subprocess - is something you can safely give to code written by a language model that someone else is prompting.
that said, the class restriction feels weird. classes aren't the security boundary. file access, network, imports - that's where the risk is. restricting classes just forces the model to write uglier code for no security gain. would be curious if the restrictions map to an actual threat model or if it's more of a "start minimal and add features" approach.
My understanding is that "the class restriction" isn't trying to implement any kind of security boundary — they just haven't managed to implement support yet.
ah that makes sense - I was reading too much into it as a deliberate security trade-off. makes way more sense as a "not implemented yet" thing. thanks for clarifying.
This is cool - the ~70% success rate on basic attacks tracks with what I've seen. Most agent frameworks just pipe raw text through without any sanitization because "it's just summarizing a page, what could go wrong."
The screenshot approach nate mentions is interesting but feels like trading one problem for another. You're immune to text injection but now vulnerable to visual tricks - misleading rendered text, fake UI elements, those unicode lookalike characters that render identically but have different meanings.
Curious if you've tested any agents that do pre-processing on the HTML - like stripping invisible elements, normalizing unicode, etc - before passing to the model. That's the approach I've seen in a few internal tools but haven't benchmarked how effective it actually is against multi-layer attacks like yours.
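Roughly the kind of pre-processing I mean, as a quick sketch with BeautifulSoup - obviously not a complete defense on its own:

```python
# sketch of HTML pre-processing before handing text to the model
import unicodedata
from bs4 import BeautifulSoup

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def sanitize_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # drop elements the user never sees but the model would happily read
    for el in soup.select('[hidden], [aria-hidden="true"], script, style, template'):
        el.decompose()
    for el in soup.select("[style]"):
        style = el.get("style", "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            el.decompose()
    text = soup.get_text(separator=" ")
    # normalize lookalike characters and strip zero-width ones
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```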
sorry, i didn't mean to imply that screenshotting is the only thing this agent does - just that it's one thing my agent does which has this neat property. i also have a host of other things going on when it needs to grab and understand the contents of a page. the screenshot is used in conjunction with the html to navigate and find things, and it also handles the stuff this particular test tries (hidden divs, aria-hidden, etc.). it also tries to tell the model what's trusted and untrusted.
but the big thing i have in here is simply a cross-domain check. if the agent is about to navigate away from the current domain, we alert the user about the domain change. this is all in a browser context too, so the browser's csrf protection is also being relied on. but it's the cross-domain navigation i'm really worried about and trying to make sure i've got super hardened - this is admittedly the trickiest part in a browser. i feel like browsers are going to need a new "non-origin" kind of flow that knows an agent is browsing and does something like blocking and confirming natively.
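the check itself is conceptually simple - very rough sketch of the idea, not the actual code:

```python
# rough idea: pause and confirm with the user before the agent leaves
# the registrable domain it started on
from urllib.parse import urlparse

def registrable_domain(url: str) -> str:
    # naive: last two labels. a real version should use the public suffix list
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def guard_navigation(current_url: str, target_url: str, confirm) -> bool:
    if registrable_domain(target_url) != registrable_domain(current_url):
        return confirm(f"agent wants to leave {current_url} for {target_url} - allow?")
    return True
```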
The cross-domain check makes sense as the priority - that's where the real risk is. Injection making the agent do something dumb on the same site is bad, but redirecting to an attacker-controlled domain is way worse. Exfil via URL params, tokens in redirects, all that.
Your browser-native agent mode idea is interesting. Something like CSP but for navigation intent - "this agent can only interact with *.myapp.com" - and it's declarative so the injection can't social-engineer its way around it. Though browser vendors are probably 2-3 years behind on this stuff. Agent frameworks will have to solve it themselves first and then maybe Chrome picks it up later once there's consensus.
Haven't benchmarked pre-processing approaches yet, but that's a natural next step. Right now the test page targets raw agent behavior — no middleware. A comparison between raw vs sanitized pipelines against the same attacks would be really useful. The multi-layer attack (#10) would probably be the hardest to strip cleanly since it combines structural hiding with social engineering in the visible text.
Yeah, the social engineering + structural combination is brutal to defend against. You can strip the technical hiding but the visible prompt injection still works on the model. Would be interesting to see how much of the ~70% success rate drops with just basic sanitization (strip comments, normalize whitespace, remove zero-width) vs more aggressive stripping.
If you build out a v2 with middleware testing, a leaderboard by framework would be killer. "How manipulation-proof is [Langchain/AutoGPT/etc] out of the box vs with basic defenses" would get a lot of attention.
The Trivy + Grype combo is interesting - in my experience they catch different things, especially on container scanning vs dependencies. You see them disagree much on severity?
Re: the vibe coding angle - the thing I keep running into is that standard scanners are tuned for human-written code patterns. Claude-generated code is structurally different: more verbose, weirdly sparse on the explicit error handling that would normally trigger SAST rules. Auth code especially - it looks textbook correct and passes static analysis fine, but edge cases are where it falls apart. Token validation that works great except for malformed inputs, auth checks that miss specific header combinations, that kind of thing.
The policy engine sounds flexible enough that people could add custom rules for AI-specific patterns? That'd be the killer feature tbh.
I am totally thinking about adding this so you can connect to an API or use self-hosted models running in a container if you have the resources! You are spot on.
makes sense - if folks can bring their own model, they can fine-tune detection for whatever code patterns matter to them. the auth edge cases I mentioned (malformed token handling, middleware ordering) would be way easier to catch with a model trained on actual vulnerable examples than trying to write regex rules for every variant.
The framing is a bit dramatic but the underlying shift is real. What AI actually kills is the "wrap an API in a UI" SaaS model. If the value is just presenting data nicely or doing simple transformations, an agent can replace that.
What survives: products with proprietary data, strong network effects, or deep domain expertise baked into the workflow. The moat moves from "we built a UI" to "we understand this problem better than anyone."
I run 4 side projects and the ones getting traction aren't the ones with the fanciest AI features - they're the ones solving specific problems people have repeatedly (meal planning, meeting search). The AI is the engine, not the product.
The real risk for B2B SaaS isn't that AI replaces your product - it's that your customers can now build a "good enough" internal version in a weekend with Claude Code.
Fair callout. I've been writing too many product descriptions lately and it's leaking into how I write comments. The actual point was simpler than I made it sound: AI kills "UI wrapper" SaaS, but products with real domain knowledge survive. My side projects taught me that - the ones getting users aren't the technically fancy ones, they're the ones solving boring specific problems.
Interesting approach for cost management, but one angle nobody seems to be discussing: the security implications.
When you fall back to a local model for coding, you lose whatever safety guardrails the hosted model has. Claude's hosted version has alignment training that catches some dangerous patterns (like generating code that exfiltrates env vars or writes overly permissive IAM policies). A local Llama or Mistral running raw won't have those same checks.
For side projects this probably doesn't matter. But if your Claude Code workflow involves writing auth flows, handling secrets, or touching production infra, the model you fall back to matters a lot. The generated code might be syntactically fine but miss security patterns that the larger model would catch.
Not saying don't do it - just worth being aware that "equivalent code generation" doesn't mean "equivalent security posture."
I would always prefer something local. By definition it's more secure: you are not sending your code over the wire to a third-party server and hoping they comply with "we will not train our models on your data".
That's a fair point - you're talking about data security (not sending code to third parties) and I was talking about output quality security (what the model generates). Two different dimensions of "secure" and honestly both matter.
For side projects I'd probably agree with you. For anything touching production with customer data, I want both - local execution AND a model that won't silently produce insecure patterns.
Oh it absolutely does, never said otherwise. Hosted models produce plenty of insecure code too - the Moltbook thing from like a week ago was Claude Opus and it still shipped with wide open auth.
My point was narrower than it came across: when you swap from a bigger model to a smaller local one mid-session, you lose whatever safety checks the bigger one happened to catch. Not that the bigger one catches everything - clearly it doesn't.
Not saying the frontier models aren't smarter than the ones I can run on my two 4090s (they absolutely are) but I feel like you're exaggerating the security implications a bit.
We've seen some absolutely glaring security issues with vibe-coded apps / websites that did use Claude (most recently Moltbook).
No matter whether you're vibe coding with frontier models or local ones, you simply cannot rely on the model knowing what it is doing. Frankly, if you rely on the model's alignment training for writing secure authentication flows, you are doing it wrong. Claude Opus or Qwen3 Coder Next isn't responsible if you ship insecure code - you are.
You're right, and the Moltbook example actually supports the broader point - even Claude Opus with all its alignment training produced insecure code that shipped. The model fallback just widens the gap.
I agree nobody should rely on model alignment for security. My argument isn't "Claude is secure and local models aren't" - it's that the gap between what the model produces and what a human reviews narrows when the model at least flags obvious issues. Worse model = more surface area for things to slip through unreviewed.
But your core point stands: the responsibility is on you regardless of what model you use. The toolchain around the model matters more than the model itself.
Sure, in theory. But "assumed good enough" is doing a lot of heavy lifting there. Most people picking a local fallback model are optimizing for cost and latency, not carefully evaluating its security alignment characteristics. They grab whatever fits in VRAM and call it a day.
Not saying that's wrong, just that it's a gap worth being aware of.
Not a full team adoption story, but relevant data point: I run a small engineering org (~40 engineers across teams) and we've been tracking AI coding tool adoption informally.
The split is roughly: 30% all-in (Claude Code or Cursor for everything), 50% selective users (use it for boilerplate, tests, docs but still hand-write core logic), 20% holdouts.
What I've noticed on PR velocity: it went up initially, then plateaued. The PRs got bigger, which means reviews take longer. We actually had to introduce a "max diff size" policy because AI-assisted PRs were becoming 800+ line monsters that nobody could review meaningfully.
The quality concern that keeps coming up: security. AI-generated code tends to take shortcuts on auth, input validation, error handling. We've started running dedicated security scans specifically tuned for patterns that AI likes to produce. That's been the biggest process change.
Net effect: probably 20-30% faster on feature delivery, but we're spending more time on review and security validation than before.
Ha, pretty accurate in my experience. Though I'd say it's more like 1.5x the PRs - Claude does the initial PR, then you do half a PR fixing the subtle stuff it got wrong, and then you spend the other half wondering if you missed something.
The security fixes are the worst because the code looks correct. It's not like a typo you'd catch immediately - it's an auth check that works for 95% of cases but fails on edge cases the model never considered.
I have seen the same AI hallucinations you mentioned: auth, input validation, error handling, non-existent dependencies, etc. It's tricky to catch them all, since LLMs have mastered the art of being "confidently wrong". What tools are you using to catch those issues? I feel current tooling is ill-equipped for this new wave of AI-generated output.
"Confidently wrong" is the perfect description. The code compiles, the tests pass (because the AI also wrote the tests to match), and the auth flow looks reasonable at first glance.
For catching these we layer a few things:
- Standard SAST (Semgrep, CodeQL) catches the obvious stuff but misses AI-specific patterns
- npm audit / pip-audit for dependency issues, especially non-existent packages the AI hallucinates
- Custom rules tuned for patterns we keep seeing: overly permissive CORS, missing rate limiting, auth checks that look correct but have subtle logic bugs
- Manual review with a specific checklist for AI-generated code (different from our normal review checklist)
You're right that current tooling has a gap. Traditional scanners assume human-written code patterns. AI code looks structurally different - it tends to be more verbose but miss edge cases in ways humans wouldn't. We've been experimenting with scanning approaches specifically tuned for AI output.
The biggest wins have been simple: requiring all AI-generated auth and input validation code to go through a dedicated security reviewer, not just a regular code review.
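To make the custom rules bullet a bit more concrete, the simplest ones are barely more than greps. A hypothetical standalone sketch of the CORS check (in practice this kind of thing fits naturally as a Semgrep rule):

```python
# hypothetical example of a "custom rule": flag wildcard CORS combined with
# credentials in FastAPI/Starlette-style middleware configuration
import re
import sys
from pathlib import Path

WILDCARD_ORIGIN = re.compile(r"allow_origins\s*=\s*\[?\s*['\"]\*['\"]")
CREDENTIALS_ON = re.compile(r"allow_credentials\s*=\s*True")

def scan(path: Path) -> list:
    src = path.read_text(errors="ignore")
    if WILDCARD_ORIGIN.search(src) and CREDENTIALS_ON.search(src):
        return [f"{path}: wildcard CORS origin with credentials enabled"]
    return []

if __name__ == "__main__":
    for p in Path(sys.argv[1]).rglob("*.py"):
        for finding in scan(p):
            print(finding)
```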
The sandbox-or-not debate is important but it's only half the picture. Even a perfectly sandboxed agent can still generate code with vulnerabilities that get deployed to production - SQL injection, path traversal, hardcoded secrets, overly permissive package imports.
The execution sandbox stops the agent from breaking out during development, but the real risk is what gets shipped downstream. Seeing more tools now that scan the generated code itself, not just contain the execution environment.
The goal of such sandboxing is that you can allow the agent to freely write/execute/test code during development, so that it can propose a solution/commit without the human having to approve every dangerous step ("write a Python file, then execute it" is already a dangerous step). As the post says: "To safely run a coding agent without review".
You would then review the code, and use it if it's good. Turning many small reviews where you need to be around and babysit every step into a single review at the end.
What you seem to be asking for (shipping the generated code to production without review) is a completely different goal and probably a bad idea.
If there really were a tool that can "scan the generated code" so reliably that it is safe to ship without human review, then that could just be part of the tool that generates the code in the first place so that no code scanning would be necessary. Sandboxing wouldn't be necessary either then. So then sandboxing wouldn't be "half the picture"; it would be unnecessary entirely, and your statement simplifies to "if we could auto-generate perfect code, we wouldn't need any of this".
Yeah I think we're actually agreeing more than it seems. I'm not arguing for shipping without review - more that the review itself is where things fall through.
In practice, that "single review at the end" is often a 500-line diff that someone skims at 5pm. The sandbox did its job, the code runs, tests pass. But the reviewer misses that the auth middleware doesn't actually check token expiry, or that there's a path traversal buried in a file upload handler. Not because they're bad at reviewing - because AI-generated code has different failure modes than human-written code and we're not trained to spot them yet.
Scanning tools don't replace review, they're more like a checklist that runs before the human even looks at it. Catches the stuff humans consistently miss so the reviewer can focus on logic and architecture instead of hunting for missing input validation.
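A contrived illustration of the "looks right at a skim" failure mode - the signature gets verified, the expiry never does (PyJWT semantics):

```python
# contrived example, not code from a real PR: signature checked, expiry ignored
import jwt  # PyJWT

SECRET = "change-me"  # placeholder; loaded from config in real code

def authenticate(token: str):
    try:
        return jwt.decode(
            token,
            SECRET,
            algorithms=["HS256"],
            options={"verify_exp": False},  # the one flag a tired reviewer skims past
        )
    except jwt.InvalidTokenError:
        return None
```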
If that's the goal, why not just have Claude Code do it all from your phone at that point? When it's done, you pull down the branch and test it locally. Not 100% frictionless, but if it messes up an OS it would be Anthropic's, not yours.
Precisely! There's a fundamental tension:
1. Agents need to interact with the outside world to be useful
2. Interacting with the outside world is dangerous
Sandboxes provide a "default-deny policy" which is the right starting point. But, current tools lack the right primitives to make fine grained data-access and data policy a reality.
Object-capabilities provide the primitive for fine-grained access. IFC (information flow control) for dataflow.
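A toy sketch of the object-capability idea: the agent never gets ambient access to the filesystem, only an object that can reach the directory it was constructed with (in-process only, so this is illustration rather than enforcement):

```python
# toy capability object; the agent receives `workspace`, not open() plus arbitrary paths
from pathlib import Path

class DirCapability:
    def __init__(self, root: str):
        self._root = Path(root).resolve()

    def _resolve(self, relative: str) -> Path:
        p = (self._root / relative).resolve()
        if p != self._root and self._root not in p.parents:
            raise PermissionError(f"{relative} escapes the granted directory")
        return p

    def read_text(self, relative: str) -> str:
        return self._resolve(relative).read_text()

    def write_text(self, relative: str, data: str) -> None:
        self._resolve(relative).write_text(data)

workspace = DirCapability("/tmp/agent-workspace")  # the only thing the agent is handed
```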
See, my typical execution environment is a Linux VM or laptop, with a wide variety of SSH and AWS keys configured and ready to be stolen (even if they are temporary, that's enough to infiltrate prod or do some sneaky lateral movement). On the other hand, a typical application execution environment is an IAM user/role with strictly scoped permissions.
Yeah this is the part that keeps me up at night honestly. The dev machine is the juiciest target and it's where the agent runs with the most access. Your ~/.ssh, ~/.aws, .env files, everything just sitting there.
The NixOS microvm approach at least gives you a clean boundary for the agent's execution. But you're right that it's a different threat model from prod - in prod you've (hopefully) scoped things down, in dev you're basically root with keys to everything.