The background story is that Allegro defaults the infrastructure selection from their competitors' offerings to their own, even if the user uses a competitor all the time. Sometimes the user forgets to check, which results in using Allegro's infrastructure even though they didn't want it.
I'd gladly use (and maybe even pay for!) an open-source reimplementation of AWS RDS Aurora. All the bells and whistles with failover, clustering, volume-based snaps, cross-region replication, metrics etc.
As far as I know, nothing comes close to Aurora's functionality, even in the vibecoding world. No, 'apt-get install postgres' is not enough.
Serverless v2 is one of the products I was skeptical about, but it's genuinely one of the most robust solutions out there in that space. It has its warts, but I usually default to it for fresh installs because you get so much out of the box with it.
Nitpick (I blame Amazon for their horrible naming): Aurora and RDS are separate products.
What you’re asking for can mostly be pieced together, but no, it doesn’t exist as-is.
Failover: this has been a thing for a long time. Set up a synchronous standby, then add a monitoring job that checks heartbeats and promotes the standby when needed. Optionally use something like heartbeat to have a floating IP that gets swapped on failover, or handle routing with pgbouncer / pgcat etc. instead. Alternatively, use pg_auto_failover, which does all of this for you.
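For reference, getting pg_auto_failover going is roughly this (hostnames and paths below are placeholders, not a tested recipe):

    # monitor node: tracks health and orchestrates promotion
    pg_autoctl create monitor --pgdata /srv/pg/monitor \
        --hostname monitor.example.internal --auth trust --run

    # data nodes: run the same command on both machines; the first
    # to register becomes the primary, the second a standby
    pg_autoctl create postgres --pgdata /srv/pg/data \
        --hostname node1.example.internal \
        --monitor 'postgres://autoctl_node@monitor.example.internal/pg_auto_failover' \
        --run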
Clustering: you mean read replicas?
Volume-based snaps: assuming you mean CoW snapshots, that’s a filesystem implementation detail. Use ZFS (or btrfs, but I wouldn’t, personally). Or Ceph if you need a distributed storage solution, but I would definitely not try to run Ceph in prod unless you really, really know what you’re doing. Lightbits is another solution, but it isn’t free (as in beer).
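To give a sense of it, the ZFS version is a couple of commands (dataset and host names are made up; this assumes data and WAL live in the same dataset, so the snapshot is atomic and crash-consistent):

    # instant CoW snapshot of the Postgres dataset
    zfs snapshot tank/pgdata@before-upgrade

    # roll back in place, or ship it to another box
    zfs rollback tank/pgdata@before-upgrade
    zfs send tank/pgdata@before-upgrade | ssh backup-host zfs recv backup/pgdata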
Cross-region replication: this is just replication? It doesn’t matter where the other node[s] are, as long as they’re reachable, and you’ve accepted the tradeoffs of latency (synchronous standbys) or potential data loss (async standbys).
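On the standby side it's a handful of lines, wherever the node happens to live (hostname and replication user are placeholders):

    # clone the primary; -R writes primary_conninfo and standby.signal
    pg_basebackup -h primary.other-region.example -U replicator \
        -D "$PGDATA" -R --wal-method=stream

    # on the primary, only if you chose the synchronous tradeoff:
    # synchronous_standby_names = 'FIRST 1 (standby1)'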
Metrics: Percona Monitoring & Management if you want a dedicated DB-first, all-in-one monitoring solution, otherwise set up your own scrapers and dashboards in whatever you’d like.
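The DIY route is typically postgres_exporter scraped by Prometheus, something like this (connection string and port are the exporter defaults, adjust to taste):

    # exposes Postgres metrics on :9187 for Prometheus to scrape
    DATA_SOURCE_NAME='postgresql://postgres@localhost:5432/postgres?sslmode=disable' \
        ./postgres_exporter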
What you will not get from this is Aurora’s shared cluster volume. I personally think that’s a good thing, because I think separating compute from storage is a terrible tradeoff for performance, but YMMV. What that means is you need to manage disk utilization and capacity, as well as properly design your failure domain. For example, if you have a synchronous standby, you may decide that you don’t care if a disk dies, so no messing with any kind of RAID (though you’d then miss out on ZFS’ auto-repair from bad checksums). As long as this aligns with your failure domain model, it’s fine - you might have separate physical disks, but co-locate the Postgres instances in a single physical server (…don’t), or you might require separate servers, or separate racks, or separate data centers, etc.
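(If you do want ZFS's self-healing, that just means mirroring at the pool level; device paths here are placeholders:)

    # mirrored pool: checksum failures are auto-repaired from the good copy
    zpool create -o ashift=12 pgpool mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB

    # periodic scrub to surface latent corruption
    zpool scrub pgpool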
tl;dr you can fairly closely replicate the experience of Aurora, but you’ll need to know what you’re doing. And frankly, if you don’t, even if someone built an OSS product that does all of this, you shouldn’t be running it in prod - how will you fix issues when they crop up?
> you can fairly closely replicate the experience of Aurora
Nobody doubts one could build something similar to Aurora given enough budget, time, and skills.
But that's not replicating the experience of Aurora. The experience of Aurora is that I can have all of that in like 30 lines of Terraform and a few minutes. And then I don't need to worry about managing the zpools, I don't need to ensure the heartbeats are working fine, I don't need to worry about hardware failures (to a large extent), I don't need to drive to multiple different physical locations to set up the hardware, I don't need to worry about handling patching, etc.
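For a sense of scale, the rough AWS CLI equivalent is about this much (identifiers, instance class, and the password are placeholders):

    aws rds create-db-cluster \
        --db-cluster-identifier my-cluster \
        --engine aurora-postgresql \
        --master-username myadmin \
        --master-user-password '<placeholder>'

    aws rds create-db-instance \
        --db-instance-identifier my-cluster-1 \
        --db-cluster-identifier my-cluster \
        --db-instance-class db.r6g.large \
        --engine aurora-postgresql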
You might replicate the features, but you're not replicating the experience.
The person I replied to said they wanted an open-source reimplementation of Aurora. My point - which was probably poorly worded, or just implied - was that there's a lot of work that goes into something like that, and if you can't put the pieces together on your own, you probably shouldn't be running it for anything you can't afford downtime on.
Managed services have a clear value proposition. I personally think they're grossly overpriced, but I understand the appeal. Asking for that experience but also free / cheap doesn't make any sense.
> Asking for that experience but also free / cheap doesn't make any sense.
Things that used to be very expensive suddenly become available for free after someone builds an open-source version. That's just the nature of open source.
It's unreasonable to demand it from someone, but people do build things and release them for free all the time! Indeed, it makes plenty of sense to imagine that at some point in time, open source offerings of Postgres will be comparable to Aurora in ease of use.
See, my typical execution environment is a Linux VM or laptop, with a wide variety of SSH and AWS keys configured and ready to be stolen (even if they are temporary, it's enough to infiltrate prod or do some sneaky lateral-movement attack). The typical application execution environment, on the other hand, is an IAM user/role with strictly scoped permissions.
Yeah this is the part that keeps me up at night honestly. The dev machine is the juiciest target and it's where the agent runs with the most access. Your ~/.ssh, ~/.aws, .env files, everything just sitting there.
The NixOS microvm approach at least gives you a clean boundary for the agent's execution. But you're right that it's a different threat model from prod - in prod you've (hopefully) scoped things down, in dev you're basically root with keys to everything.
I have used a separate user, but lately I have been using rootless podman containers instead for this reason. But I know too little about container escapes. So I am thinking about a combination.
Would a podman container run by a separate user provide any benefit over the two by themselves?
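Concretely, the combination I'm imagining looks something like this (username and paths are made up; rootless podman needs subuid/subgid entries for the user, which a modern useradd sets up):

    # dedicated account with no access to your real $HOME, ~/.ssh, ~/.aws
    sudo useradd --create-home agent
    sudo -iu agent mkdir -p /home/agent/work

    # rootless podman under that user: a container escape lands in an
    # unprivileged account, not your main one
    sudo -iu agent podman run --rm -it \
        --network none \
        -v /home/agent/work:/work \
        docker.io/library/debian:bookworm bash
    # drop --network none if the agent actually needs network access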
This. Opening a chat for the first time in the morning consistently takes 5-10 seconds. Opening subsequent ones takes 2-3 seconds. That is, if they contain plain text. If not, the UI keeps reflowing and jumping while thumbnails and silly GIFs load asynchronously, so you cannot even click reliably.
Termux is also an excellent solution for downloading videos from YouTube and similar sites, because yt-dlp works really well there (and using mobile data makes it easier to avoid IP bans, most of the time anyway).
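In Termux that's roughly this (the URL is a placeholder):

    pkg install python
    pip install yt-dlp
    yt-dlp 'https://www.youtube.com/watch?v=...'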