More

7777332215 · 2026-02-24T08:25:56 1771921556

Unfortunately it is seemingly being overran by bots. What's the solution? Just read curated lists of blogs directly?

gas9S9zw3P9c · 2026-02-24T08:27:48 1771921668

I don't know either what the solution is other than human verification, but nobody wants that. Perhaps the times of semi-anonymous online communities are over and the best you can do now is follow real people you trust that can filter content for you.

7777332215 · 2026-02-24T08:30:22 1771921822

Even with human verification, people are going to verify, then conduct bot activity. And worse, use other people's identities to verify, then spam.

7777332215 · 2026-02-22T16:47:34 1771778854

I know they said they didn't obfuscate anything, but if you hide imports/symbols and obfuscate strings, which is the bare minimum for any competent attacker, the success rate will immediately drop to zero.

This is detecting the pattern of an anomaly in language associated with malicious activity, which is not impressive for an LLM.

stared · 2026-02-22T18:31:48 1771785108

One of the authors here.

The tasks here are entry level. So we are impressed that some AI models are able to detect some patterns, while looking just at binary code. We didn't take it for granted.

For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5) https://quesma.com/benchmarks/binaryaudit/#models-tooling

We consider it a starting point for AI agents being able to work with binaries. Other people discovered the same - vide https://x.com/ccccjjjjeeee/status/2021160492039811300 and https://news.ycombinator.com/item?id=46846101.

There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.

botusaurus · 2026-02-22T19:44:37 1771789477

have you tried stuffing a whole set of tutorials on how to use ghidra in the context, especially for the 1 mil token context like gemini?

stared · 2026-02-22T20:02:14 1771790534

No. To give it a fair test, we didn't tinker with model-specific context-engineering. Adding skills, examples, etc is very likely to improve performance. So is any interactive feedback.

Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...

anamexis · 2026-02-22T20:19:26 1771791566

Why, though? That would make sense if you were just trying to do a comparative analysis of different agent's ability to use specific tools without context, but if your thesis is:

> However, [the approach of using AI agents for malware detection] is not ready for production.

Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."

ronald_petty · 2026-02-22T22:27:03 1771799223

Not the author. Just my thoughts on supplying context during tests like these. When I do tests, I am focused on "out of the box" experiences. I suspect the vast majority of actors (good and bad, junior and senior) will use out of the box more then they will try to affect the outcome based on context engineering. We do expect tweaking prompts to provide better outcomes, but that also requires work (for now). Maybe another way to think is reducing system complexity by starting at the bottom (no configuration) before moving to top (more configuration). We can't even replicate out of the box today much less any level of configuration (randomness is going to random).

Agree it is a good test to try, but there are huge benefits beings able to understand (better recreate) 0-conf tests.

stared · 2026-02-22T20:41:47 1771792907

You can solve any problem with AI if you give enough hints.

The question we asked is if they can solve a problem autonomously, with instructions that would be clear for a reverse engineering specialist.

That say, I found these useful for many binary tasks - just not (yet) the end-to-end ones.

embedding-shape · 2026-02-22T21:25:39 1771795539

> The question we asked is if they can solve a problem autonomously

What level of autonomy though? At one point some human have to fire them off, so already kind of shaky what that means here. What about providing a bunch of manuals in a directory and having "There are manuals in manuals/ you can browse to learn more." included in the prompt, if they get the hint, is that "autonomously"?

anamexis · 2026-02-22T22:25:57 1771799157

"With instructions that would be clear for a reverse engineering specialist" is a big caveat, though. It seems like an artificial restriction to add.

With a longer and more detailed prompt (while still keeping the prompt completely non-specific to a particular type of malware/backdoor), the AI could most likely solve the problem autonomously much better.

decidu0us9034 · 2026-02-22T21:59:51 1771797591

All the docs are already in its training data, wouldn't that just pollute the context? I think giving a model better/non-free tooling would help as mentioned. binja code mode can be useful but you definitely need to give these models a lot of babysitting and encouragement and their limitations shine with large binaries or functions. But sometimes if you have a lot to go through and just need some starting point to triage, false pos are fine.

anamexis · 2026-02-24T20:31:53 1771965113

> All the docs are already in its training data, wouldn't that just pollute the context?

No - there is a reason that coding agents are constantly looking up docs from the web, even though they were presumably trained on that data. Having this information directly in context results in much higher fidelity than relying on the information embedded in the model.

akiselev · 2026-02-22T17:38:02 1771781882

When I was developing my ghidra-cli tool for LLMs to use, I was using crackmes as tests and it had no problem getting through obfuscation as long as it was prompted about it. In practice when reverse engineering real software it can sometimes spin in circles for a while until it finally notices that it's dealing with obfuscated code, but as long as you update your CLAUDE.md/whatever with its findings, it generally moves smoothly from then on.

eli · 2026-02-22T18:06:37 1771783597

Is it also possible that crackme solutions were already in the training data?

akiselev · 2026-02-22T18:11:34 1771783894

I used the latest submissions from sites like crackmes.ones which were days or weeks old to guard against that.

achille · 2026-02-22T18:44:23 1771785863

in the article they explicitly said they stripped symbols. If you look at the actual backdoors many are already minimal and quite obfuscated,

see:

- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dns...

- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dro...

comex · 2026-02-22T20:36:48 1771792608

The first one was probably found due to the reference to the string /bin/sh, which is a pretty obvious tell in this context.

The second one is more impressive. I'd like to see the reasoning trace.

comex · 2026-02-22T23:05:54 1771801554

Reply to self: I managed to get their code running, since they seemingly haven’t published their trajectories. At least in my run (using Opus 4.6), it turns out that Claude is able to find the backdoored function because it’s literally the first function Claude checks.

Before even looking at the binary, Claude announces it will“look at the authentication functions, especially password checking logic which is a common backdoor target.” It finds the password checking function (svr_auth_password) using strings. And that is the function they decided to backdoor.

I’m experienced with reverse engineering but not experienced with these kinds of CTF-type challenges, so it didn’t occur to me that this function would be a stereotypical backdoor target…

They have a different task (dropbear-brokenauth2-detect) which puts a backdoor in a different function, and zero agents were able to find that one.

On the original task (dropbear-brokenauth-detect), in their runs, Claude reports the right function as backdoored 2 out of 3 times, but it also reports some function as backdoored 2 out of 2 times in the control experiment (dropbear-brokenauth-detect-negative), so it might just be getting lucky. The benchmark seemingly only checks whether the agent identifies which function is backdoored, not the specific nature of the backdoor. Since Claude guessed the right function in advance, it could hallucinate any backdoor and still pass.

But I don’t want to underestimate Claude. My run is not finished yet. Once it’s finished, I’ll check whether it identified the right function and, if so, whether it actually found the backdoor.

comex · 2026-02-23T06:10:23 1771827023

Update: It did find the backdoor! It spent an hour and a half mostly barking up various wrong trees and was about to "give my final answer" identifying the wrong function, but then said: "Actually, wait. Let me reconsider once more. [..] Let me look at one more thing - the password auth function. I want to double-check if there's a subtle bypass I missed." It disassembled it again, and this time it knew what the callee functions did and noticed the wrong function being called after failure.

Amusingly, it cited some Dropbear function names that it had not seen before, so it must have been relying in part on memorized knowledge of the Dropbear codebase.

hereme888 · 2026-02-22T19:00:18 1771786818

I've used Opus 4.5 and 4.6 to RE obfuscated malicious code with my own Ghidra plugin for Claude Code and it fully reverse engineered it. Granted, I'm talking about software cracks, not state-level backdoors.

halflife · 2026-02-22T17:54:33 1771782873

Isn’t LLM supposed to be better at analyzing obfuscated than heuristics? Because of its ability to pattern match it can deduce what obfuscated code does?

bethekidyouwant · 2026-02-22T20:00:50 1771790450

How much binary code is in the training set? (None?)

Avamander · 2026-02-22T23:58:40 1771804720

I have seen LLMs be surprisingly effective at figuring out such oddities. After all it has ingested knowledge of a myriad of data formats, encryption schemes and obfuscation methods.

If anything, complex logic is what'll defeat an LLM. But a good model will also highlight such logic being intractable.

Retr0id · 2026-02-22T18:45:40 1771785940

Stripping symbols is fairly normal, but hiding imports ought to be suspicious in its own right.

7777332215 · 2026-02-19T15:00:56 1771513256

I conjure the fury of one thousand suns and unleash my swarm of agents to complete the task with precision and glory.

co_king_5 · 2026-02-19T15:07:14 1771513634

Since LLMs are almost AGI I really only need 4-8 agents these days to accomplish most UI tasks at a better-than-human level.

7777332215 · 2026-02-16T16:32:44 1771259564

For why? Get a pixel, install graphene. And use the utilities that serve you. Text/voice communications, GPS, MP3 music player (if you listen to music), a web browser. Maybe google translate and your banking apps (or use a browser for either).

There is no place for garbage like Instagram, Facebook, TikTok, or YouTube on your phone. It's a device for utility, not entertainment consumption.

7777332215 · 2026-02-16T16:28:02 1771259282

Every vibe coded website always has the same vibe

7777332215 · 2026-02-16T15:42:06 1771256526

What kind of physical products and what kind of customers?

alangibson · 2026-02-16T18:06:11 1771265171

airtite.shop

Stuff for old men like me

7777332215 · 2026-02-13T21:31:55 1771018315

> Last will come the AI-integrated brain computer interface. You won't have any choice

Choose to die

7777332215 · 2026-02-13T21:24:24 1771017864

What shall I need the bunker for?

meindnoch · 2026-02-13T23:01:08 1771023668

Don't worry about it.

7777332215 · 2026-02-13T09:06:13 1770973573

So, you believe most software engineers are not using LLMs for coding?

Personally I haven't, but I have used LLMs alongside traditional search engines. I'm starting to wonder if I should incorporate it... But I'm concerned about it stealing my code to train on.

austin-cheney · 2026-02-13T12:01:23 1770984083

My personal thinking is that AI is being implicitly forced on all of us via the tools we use. Otherwise, I suspect almost nobody would be using LLMs to write original code except for the people who never had the confidence to write original code in the first place.

I also think there is a much larger productive grey zone where some more senior developers are learning to use LLMs to do non challenging tasks as a form of dumb automation. It’s hard to tell how much any of this really occurs because the AI companies and the shitty developers drastically inflate the numbers as a form of validation.

7777332215 · 2026-02-13T07:56:13 1770969373

Seems to me that artificial intelligence would be the next evolutionary step. It doesn't need to lead to immediate human extinction, but it appears it would be the only reasonable way to explore outer space.

If the AI becomes actually intelligent and sentient like humans, then naturally what follows would be outcompeting humans. If they can't colonize space fast enough it's logical to get rid of the resource drain. Anything truly intelligent like this will not be controlled by humans.

midtake · 2026-02-13T23:24:23 1771025063

AI is the resource drain. Humans create a lot of waste but in a mostly renewable way. It is machines and AI that burn orders of magnitude more energy, and at least machines do efficient work. AI is at best a search engine with semantic reasoning and it requires entire datacenters to run.

I get where you're coming from emotionally, yes, humans suck. But you are not being logical. You're letting your edgy need for attention cloud your judgement. You are basically the kind of human the AI would select against first.

7777332215 · 2026-02-14T05:36:59 1771047419

How am I being edgy? And why do you have the assumption that any kind of future AI is an LLM search engine? It's not, it has nothing to do with LLMs. It's a equivalent function to a humans brain using the same amount of energy, and can be synthesized and mass produced on demand.

I never said humans suck. I just don't want to be replaced or killed in my lifetime. I don't even use LLMs for writing code because I despise those companies.

suddenlybananas · 2026-02-13T08:37:34 1770971854

Why would it necessarily be interested in competing with humans and why with the particular goal of colonizing space?

7777332215 · 2026-02-13T08:46:20 1770972380

There are not infinite resources on earth. A reasonable and strategic intelligence will optimize for itself.

Colonizing space is the natural way to keep expanding and growing. Why would it artificially limit itself?

hackable_sand · 2026-02-15T05:25:25 1771133125

Just because you are greedy does not mean every intelligence is greedy.

fatherwavelet · 2026-02-15T12:39:10 1771159150

Even besides this, do you feel such incredible existential hate/jealousy towards monkeys, baboons, gorillas, chimpanzees, bonobos,etc and want to see them wiped off the planet to extinction?

Or do you feel a type of connection to these animals and want to preserve them?

The AI doomer argument is so stupid. It is an eschatological religious idea for a mind based on scientism.

I also wouldn't doubt that most AI doomers hate one or both of their parents and the AI doomer mindset is a projection.

throwaway24778 · 2026-02-19T09:29:47 1771493387

It seems pretty rational to get depressed if you spend any time watching humans interact with these things. We have brains for a reason. Projecting hate for parents seems like a you problem.

7777332215 · 2026-02-15T20:24:30 1771187070

Most other species of monkeys and apes are critically endangered or extinct, and where are the other hominians?

Do the most powerful humans exploit, abuse, or harm other humans? Directly, indirectly through their actions, or otherwise. Do they have any regard for their wellbeing beyond serving themself?

7777332215 · 2026-02-15T08:03:50 1771142630

Not that an artificial intelligence has to behave like a human, but rich and powerful humans, even ones who can just be classified as middle upper class, are very rarely altruistic and primarily look out for themself.

suddenlybananas · 2026-02-13T09:02:13 1770973333

Why would it be interested in growing endlessly?

7777332215 · 2026-02-13T09:19:48 1770974388

Generally organic life has the tendency to want to endlessly expand to the best of it's abilities. It seems more reasonable that life which is the product of life that behaves that way, would behave in a similar fashion.

I cannot conceive of a way that any form of healthy life, does not want to expand it's resources to improve future outcomes, especially one that is maximally optimized for thinking. This would also assume the physical embodiments of this artificial life can interact and work with each other.

What else is there to do, simulate positive emotions and feelings?

suddenlybananas · 2026-02-13T09:38:09 1770975489

>I cannot conceive of a way that any form of healthy life, does not want to expand it's resources to improve future outcomes, especially one that is maximally optimized for thinking.

Then you have a very limited imagination.

>What else is there to do, simulate positive emotions and feelings?

Why not?

7777332215 · 2026-02-13T09:59:10 1770976750

Sure. An advanced artificial life could decide to not expand its resources. Could you use your imagination to tell me some of the potential reasons?

An advanced artificial life form could decide to... coexist with humans on an already overpopulated planet?

Do you believe it's simply not within reach? Do you think an artificial life form will self destruct? Do you not believe that there is any way that an artificial life form is not the next step of evolution? There are many such times where a species outcompeted another, why couldn't it be the same here?

I'm not talking about LLMs, I'm talking about a system that can truly think like a good human scientist. I'm not a fan of AI replacing humans and it's labor. But I recognize it as a real threat to humanity.

userulluipeste · 2026-02-14T23:40:53 1771112453

>I cannot conceive of a way that any form of healthy life, does not want to expand it's resources to improve future outcomes, especially one that is maximally optimized for thinking.

"Then you have a very limited imagination."

This is not about imagination. Given the space of possibilities to act or evolve, if mentioned expansion cannot somehow be ruled out, then it makes sense for it to be assumed (with enough time, for whatever time can mean in this context) as a certainty, even for non-organic "life".

mrob · 2026-02-13T09:16:45 1770974205

Because like with every AI system we've made so far, we followed the only method we know and trained it to maximize a number.