twotwotwo's comments

For folks that like this kind of question, SimpleBench (https://simple-bench.com/ ) is sort of neat. From the sample questions (https://github.com/simple-bench/SimpleBench/blob/main/simple... ), a common pattern seems to be for the prompt to 'look like' a familiar/textbook problem (maybe with detail you'd need to solve a physics problem, etc.) but to get the actually-correct answer you have to ignore what the format appears to be hinting at and (sometimes) pull in some piece of human common sense.

I'm not sure how effectively it isolates a single dimension of failure or (in)capacity--it seems like it's at least two distinct skills to 1) ignore false cues from question format when there's in fact a crucial difference from the template and 2) to reach for relevant common sense at the right times--but it's sort of fun because that is a genre of prompt that seems straightforward to search for (and, as here, people stumble on organically!).


Yeah, this(-ish): there are shipping models that don't eliminate N^2 (if a model can repeat your code back with edits, it needs to reference everything somehow), but still change the picture a lot when you're thinking about, say, how resource-intensive a long-context coding session is.

There are other experiments where model designers mix full-attention layers with limited-memory ones. (Which still doesn't avoid N^2, but if e.g. 3/4 of layers use 'light' attention, it still improves efficiency a lot.) The idea is the model can still pull information from far back in context, just not in every layer. Use so far is limited to smaller models (maybe it costs too much model capability to use at the high end?) but it seems like another interesting angle on this stuff.
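
The 'most layers light' arithmetic is easy to sketch. A back-of-envelope with made-up numbers (48 layers, 128K context, a 4K window on the 'light' layers; none of these come from a real model):

```python
# Cost of attention scales ~N^2 per full layer, ~N*W per windowed layer.
# All numbers below are hypothetical, for illustration only.
layers, n, w = 48, 128_000, 4_000
light_frac = 0.75  # 3/4 of layers use 'light' windowed attention

full_cost = layers * n * n                        # every layer attends to everything
mixed_cost = (layers * (1 - light_frac) * n * n   # remaining full-attention layers
              + layers * light_frac * n * w)      # windowed layers

print(f"mixed attention cost: {mixed_cost / full_cost:.1%} of full")  # 27.3%
```

So even without touching the N^2 term in the surviving full layers, the total drops to roughly a quarter, and the gap widens as context grows.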


I have played with it and it's so easy to get started with that now I want a quick-project idea as an excuse to use it!

I'm sure you've thought of this, but: lots of people have some amount of 'free' (or really: zero incremental cost to users) access to some coding chat tool through a subscription or free allowance like Google's.

If you wanted to let those programs access your custom tools (browser!) and docs about the environment, a low-fuss way might be to drop a skills/ dir of info and executables that call your tools into new installs' homedirs, and/or a default AGENTS.md with the basic info and links to more.

And this seems like more fuss, but if you wanted to be able to expose to the Web whatever coding tool people 'bring', similar to how you expose your built-in chat, there's apparently an "agent control protocol" used as a sort of cross-vendor SDK by projects like https://willmcgugan.github.io/toad-released/ that try to put a nice interface on top of everything. Not saying this'd be easy at all, but you could imagine the choice between a few coding tools and auth info for them as profile-level settings pushed to new VMs. Or maybe no special settings, and bringing your own tools is just a special case of bringing your own image or setup script.

But, as y'all note, it's a VM. You can install whatever and use it through the terminal (or VSCode remoting or something else). "It's a computer" is quite a good open standard to build on.

Is the chat descended from Sketch?


Thanks! We are thinking a lot about how to prepopulate VMs. The first thing we are going to start with is a fast ‘clone’ command, so you can preconfigure a base VM then make as many as you like. Lots of other ideas floating around too.

Re Sketch: the code is not the same, but the agent is deeply inspired by it. E.g. the screenshot support, which just seems obvious to us. Philip has done the heavy lifting here; he hangs out in the Discord if you want to chat about it.


Prelaunch scripts. Snapshots. There are plenty of ways to prepopulate a VM. What's tricky is replicating that so it's available across the "nodes" they have.

Man, this brings me back. Kudos to you guys! Just find a better solution than Ceph or minio.


When you create a new exe.dev VM, you can tell Shelley what it's for. I've had fun results from, "surprise me".

Also, telling Shelley to get inspiration from the VM name can be fun.


So I tried this the other day after Filippo Valsorda, another Go person, posted about it. My reaction was 'whoa, this really makes it easier to start a quick project', and it took a minute to figure out why I felt that way when, I mean, I have a laptop and could spin up cloud stuff--arguably I already had what I needed.

I think it's the combination of 1) really quick to get going, 2) isolated and disposable environments and 3) can be persistent and out there on the Internet.

Often to get element 3, persistent and public, I had to jump through hoops in a cloud console and/or mess with my 'main' resources (install things or do other sysadmin work on a laptop or server, etc.), resources I use for other stuff and would prefer not to clutter up with every experiment I attempt.

Here I can make a thing and if I'm done, I'm done, nothing else impacted, or if it's useful it can stick around and become shared or public. Some other environments also have 'quick to start, isolated, and disposable' down, but are ephemeral only, limited, or don't have great publishing or sharing, and this avoids that trough too. And VMs go well with building general-purpose software you could fling onto any machine, not tied to a proprietary thing.

This is good stuff. I hope they get a sustainable paid thing going. I'd sign up.

Also, though I realize in a sense it'd be competition to a business I just said I like: some parts of the design could work elsewhere too. You could have an open-source "click here to start a thing! and click here to archive it." layer above a VM, machine, or whatever sort of cloud account; could be a lot of fun. (I imagine someone will think "have you looked at X?" here, and yes, chime in, interested in all sorts of potential values of X.)


> persistent and public

I don't think that it's actually public? From one of their explainers, no public IP is assigned, so you'll at least need to use an additional service like Cloudflare Tunnel to use it for hosting anything.


[exe.dev co-founder here] You can make it public! Our TLS proxy supports it, and supports CNAME rules (plus a top-level trick) to let you put a domain name on it. To make the HTTP server on port 8000 of your VM public run:

    ssh exe.dev share set-public <yourvmname>


Any plans to support non web stuff?


For non-web stuff you will need a static IP. We plan to support that in the near future: https://github.com/boldsoftware/exe.dev/issues/6


Could also support sni/sslh style stuff to support more protocols without static IP.


We could! Do you have any in mind? I can file issues for them.


I'd love to see XMPP support especially, which I know sslh supports.


FWIW, here are (mostly) their agent's tips for other agents from exploring a mostly-new system including tidbits like how to get recent Node: https://s3.us-east-1.amazonaws.com/1FV6XMQKP2T0D9M8FF82-cach...

It's very much a snapshot of what happens to come on a new VM today, and I put a little disclaimer in it to try to help tools get unstuck if anything there proves to be outdated or a flat-out (accidental) lie.


Thank you! This is a very useful one-pager; it answers many questions I had that I couldn't find in their documentation (being on mobile, I couldn't test over SSH).


I'm conflicted about opining on models: no individual has actually run a large sample of real-world tasks across many models, so none of us can really speak with authority. But I kinda think we should each share our dubiously-informed opinions anyway, because benchmarks aren't necessarily representative of real-world use and many can clearly be gamed.

Anyhow, I noticed more of a difference trying Opus 4.5 compared to Sonnet 4.5 than I'd noticed from, for example, the last couple Sonnet bumps. Objectively, at 1.66x Sonnet's price instead of the old 5x, it's much more often practical to consider reaching for than past Opus models. Anthropic's basic monthly thing also covers a fair amount of futzing with it in CC.

At the other extreme, another surprise of this family is that Haiku 4.5 with reasoning on is usable: better than Sonnet with thinking off according to some benchmarks, and in any case subjectively decent for point edits, single-page thingies, and small tools.


METR is using hours of equivalent human effort, not actual hours the agent itself spends, so by their methodology, your task might qualify as one where it pulls off much more than 4h of human work.

"Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: presumably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is different from the hours any specific person, say you or I, would spend.

In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.


The “50% time horizon” feels most actionable when you pair it with an expected-value model. For a given task: EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − (iteration/oversight cost). A model crossing 4h-at-50% might be hugely useful for low failure-cost work, and still net-negative for anything where rollback/debug is expensive. The missing piece is how p_fail scales with task length + how recoverable failures are.
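
That EV framing is easy to turn into a toy calculation. A minimal sketch of the parent's formula (the function name and all the dollar figures are made up for illustration):

```python
# EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − oversight_cost
def task_ev(human_hours_saved, hourly_rate, p_fail, cost_of_failure, oversight_cost):
    return human_hours_saved * hourly_rate - p_fail * cost_of_failure - oversight_cost

# Same 4h task at $100/hr and 50% success; only the failure cost differs.
print(task_ev(4, 100, p_fail=0.5, cost_of_failure=50,   oversight_cost=50))  # 325.0
print(task_ev(4, 100, p_fail=0.5, cost_of_failure=2000, oversight_cost=50))  # -650.0
```

Same model, same 50% reliability: clearly positive when failures are cheap to discard, clearly negative when rollback/debug is expensive, which is the parent's point.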


Yeah--it's difficult to go from a benchmark involving the model attempting things alone to the effect assisting people on real tasks because, well, ideally you'd measure that with real people doing real tasks. Last time METR tried that (in early '25) they found a net slowdown rather than any speedup at all. Go figure!


>which human

The second graph has this under it:

The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years...


Yeah--I wanted a short way to gesture at the subsequent "tasks that are fast for someone but not for you are interesting," and did not mean it as a gotcha on METR, but I should've taken a second longer and pasted what they said rather than doing the "presumably a human competent at the task" handwave that I did.


I agree. After all, benchmarks don't mean much, but I guess they're fine as long as they keep measuring the same thing every time. Also, the context matters. In my case, I see a huge difference between the gains at work vs. those at home on a personal project, where I don't have to worry about corporate policies, security, correctness, standards, etc. I can let the LLM fly and not worry about losing my job in record time.


Your version only describes what happens if you do the operations serially, though. For example, a consumer SSD can do a million (or more) operations in a second, not 50K, and you can send a lot more than 7 total packets between CA and the Netherlands in a second, but to do either of those you need to take advantage of parallelism.
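
The serial-vs-parallel gap is just arithmetic on the latency: a quick sketch, where the 20 µs per-read figure is assumed purely to match the classic ~50K number:

```python
# Hypothetical SSD: 20 µs latency per 4KB read.
latency_s = 20e-6

serial_iops = 1 / latency_s    # one outstanding request at a time
qd32_iops = 32 / latency_s     # 32 requests in flight, ideal overlap

print(f"serial: {serial_iops:,.0f} ops/s")  # 50,000 ops/s
print(f"QD=32:  {qd32_iops:,.0f} ops/s")    # 1,600,000 ops/s
```

Same device, same per-op latency; only the queue depth changed.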

If the reciprocal numbers are more intuitive for you you can still say an L1 cache reference takes 1/2,000,000,000 sec. It's "ops/sec" that makes it look like it's a throughput.

An interesting thing about the latency numbers is they mostly don't vary with scale, whereas something like the total throughput with your SSD or the Internet depends on the size of your storage or network setups, respectively. And aggregate CPU throughput varies with core count, for example.

I do think it's still interesting to think about throughputs (and other things like capacities) of a "reference deployment": that can affect architectural things like "can I do this in RAM?", "can I do this on one box?", "what optimizations do I need to fix potential bottlenecks in XYZ?", "is resource X or Y scarcer?" and so on. That was kind of done in "The Datacenter as a Computer" (https://pages.cs.wisc.edu/~shivaram/cs744-readings/dc-comput... and https://books.google.com/books?id=Td51DwAAQBAJ&pg=PA72#v=one... ) with a machine, rack, and cluster as the units. That diagram is about the storage hierarchy and doesn't mention compute, and a lot has improved since 2018, but an expanded table like that still seems like an interesting tool for engineering a system.


> For example, a consumer SSD can do a million (or more) operations in a second not 50K

The "Read 1MB from SSD" entry translates into a higher throughput (still not as high as you imply, but "SSD" is also a broad category, ranging from SATA-connected devices through, I think, five generations of NVMe now); I assume the "Read 4KB" timing really describes a single, isolated page read, which would be rather difficult to parallelize.


> Your version only describes what happens if you do the operations serially, though

and that's what the intuition should be based on, because serial problems sadly exist, and a lot of common wisdom suddenly disappears when confronted with one.


Great comment. I like your phrasing "capacities of a reference deployment", this is what I tend to refer to as the performance ceiling. In practical terms, if you're doing synthetic performance measurements in the lab, it's a good idea to try to recreate optimal field conditions so your benchmarks have a proper frame of reference.


These are potentially complementary approaches. Various innovations have shrunk the KV cache size or (with DSA) how much work you have to do in each attention step. This paper is about hybrid models where some layers' state needs don't grow with context size at all.

SSMs have a fixed-size state space, so on their own they'll never be able to recite a whole file of your code back in a code-editing session, for example. But if much of what an LLM is doing isn't long-distance recall, you might be able to get away with only giving some layers full recall capability, with other layers manipulating the info already retrieved (plus whatever's in their own more limited memory).

I think Kimi Linear Attention and Qwen3-next are both doing things a little like this: most layers' attention/memory doesn't grow with context size. Another approach, used in Google's small open Gemma models, is to give some layers only 'local' attention (most recent N tokens) and give a few 'full' (whole context window) attention. I guess we're seeing how those approaches play out and how different tricks can be cobbled together.
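
The memory-side win from local-attention layers follows the same kind of arithmetic as the compute one. A back-of-envelope on KV-cache size, with an entirely hypothetical model config (not taken from any of the models named above):

```python
# KV cache per token per layer: K and V each hold n_kv_heads * head_dim values.
# All numbers are assumed, for illustration only.
n_layers, n_kv_heads, head_dim, bytes_per = 32, 8, 128, 2  # fp16
ctx, window = 131_072, 4_096

per_tok_layer = 2 * n_kv_heads * head_dim * bytes_per  # K and V, in bytes

full = n_layers * ctx * per_tok_layer
# 1 in 4 layers keeps full attention; the rest cache only the last `window` tokens.
hybrid = ((n_layers // 4) * ctx * per_tok_layer
          + (n_layers - n_layers // 4) * window * per_tok_layer)

print(f"full:   {full / 2**30:.1f} GiB")    # 16.0 GiB
print(f"hybrid: {hybrid / 2**30:.1f} GiB")  # 4.4 GiB
```

The 'light' layers' cache stops growing with context entirely, so at longer contexts the ratio only gets better.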

There can potentially be a moneyball aspect to good model architecture. Even if using space-saving attention mechanisms in some layers of big models costs something in performance on its own, the efficiency could let you 'spend' more elsewhere (more layers or more params or such) to end up with overall better performance at a given level of resources. Seems like it's good to have experiments with many different approaches going on.


Wanted to note https://issues.chromium.org/issues/40141863 on making the lossless JPEG recompression a Content-Encoding, which provides a way that, say, a CDN could deploy it in a way that's fully transparent to end users (if the user clicks Save it would save a .jpg).

(And: this is great! I think JPEG XL has chance of being adopted with the recompression "bridge" and fast decoding options, and things like progressive decoding for its VarDCT mode are practical advantages too.)


"What are the consequences of recent, controversial changes in policy?" does not become an irrelevant question simply because you can also think of hypothetical policies.


I'm not from the US, so from the outside it looks like the US changed priorities and is less concerned about what happens in other countries, which is understandable. Whether it is good policy long-term or not is another question, but clearly it was a popular choice.


Does the US not have a national debt? To cut spending seems like a good long-term policy.

