We're at early stages of planning an architecture where we offload pre-rendered JSON views of PostgreSQL onto a key value store optimised for read only high volume. Considering DynamoDB, S3, Elastic, etc. (We'll probably start without the pre-render bit, or store it in PostgreSQL until it becomes a problem).
When looking at DynamoDB I noticed that there was a surprising amount of discussion around the requirement for provisioning, considering node read/write ratios, data characteristics, etc. Basically, worrying about all the stuff you'd have to worry about with a traditional database.
To be honest, I'd hoped that it could be a bit more 'magic', like S3, and that AWS would take care of provisioning, scaling, sharding, etc. But it seems, disappointingly, that you have to focus on proactively worrying about operations and provisioning.
Is that sense correct? Is the dream of a self-managing, fire-and-forget key value database completely naive?
Your example really summarizes the challenge with the AWS paradigm: namely, that they want you to believe the thing to do is to spread the backend of your application across a large number of distinct data systems. No one uses DynamoDB alone: they bolt it onto Postgres after realizing they have availability or scale needs beyond what a relational database can do, then they bolt on Elasticsearch to enable querying, and then they bolt on Redis to make the disjointed backend feel fast. And I'm just talking operational use cases; ignoring analytics here. Honestly, it doesn't need to be these particular technologies, but this is the general phenomenon you see in so many companies: they adopt a relational database, a key/value store (could be Cassandra instead of DynamoDB, e.g. what Netflix does), a search engine, and a caching layer because they think that's the only option.
This inherently leads to a complexity debt explosion, fragmentation in the experience, and an operationally brittle posture that becomes very difficult to dig out of (this is probably why AWS loves the paradigm).
Almost every single team at Amazon that I can think of off the top of my head uses DynamoDB (or DDB + S3) as its sole data store. I know that there are teams out there using relational DBs as well (especially in analytics), but in my day-to-day working with a constantly changing variety of teams that run customer-facing apps, I haven't seen RDS/Redis/etc being used in months.
The thing about Amazon is that it is massive. In my neck of the woods, I've had the complete opposite experience. So many teams have the exact DDB-induced infrastructure sprawl described by the GP (e.g. a supplemental RDBMS, Elastic, caching layers, etc.).
Which says nothing of DDB. It's a god-tier tool if what you need matches what it's selling. However, I see too many teams reach for it by default without doing any actual analysis (including young me!), thus leading to the "oh shit, how will we...?" soup of ad-hoc supporting infra. Big machines look great on the promo doc tho. So, I don't expect it to stop.
> they bolt it onto Postgres after realizing they have availability or scale needs beyond what a relational database can do, then they bolt on Elasticsearch to enable querying, and then they bolt on Redis to make the disjointed backend feel fast.
This made my head explode. Why would you explicitly join together two systems made to solve different issues? This sounds rather like a lack of architectural vision. Postgres's design, which requires zero up-front access planning, inherently clashes with DynamoDB's; same goes for the Elasticsearch scenario: DynamoDB was not made to query everything, it's made to query specifically what you designed to be queried and nothing else. Redis sort of makes sense to gain a bit of speed for some particular access, but you still lack collection-level querying with it.
In my experience, leave DynamoDB alone and it will work great. Automatic scaling eventually comes out cheaper if you've done your homework and know your traffic.
In my experience, leave DynamoDB alone and it will work great.
My experience agrees with yours and I'm likewise puzzled by the grandparent comment. But just a shout-out to DAX (DynamoDB Accelerator), which makes it scale through the roof:
Judging a consistency model as "terrible" implies that it does not fit any use case and is therefore objectively bad.
On the contrary, there are plenty of use cases where eventually consistent writes are a perfect fit. To see that, you only have to look at how every major database server offers this as an option - just one example:
I think the main advantage of DDB is being serverless. Adding a server-based layer on top of it doesn't make sense to me.
I have a theory it would be better to have multiple table-replicas for read access. At application level, you randomize access to those tables according to your read scale needs.
Use main table streams and lambda to keep replicas in sync.
Depending on your traffic, this might end more expensive than DAX, but you remain fully serverless, using the exact same technology model, and have control over the consistency model.
Haven't had the chance to test this in practice, though.
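A minimal sketch of that replica fan-out idea (the table names and routing helpers are made up for illustration; the streams/Lambda sync piece isn't shown):

```python
import random

# Hypothetical layout: one main table takes writes; N copies of it
# serve reads, kept in sync by a stream consumer (not shown here).
MAIN_TABLE = "orders"
READ_REPLICAS = [f"orders-replica-{i}" for i in range(3)]

def table_for_write():
    """All writes go to the main table; streams fan changes out to replicas."""
    return MAIN_TABLE

def table_for_read():
    """Pick a replica at random to spread read load across tables."""
    return random.choice(READ_REPLICAS)
```

The trade-off is the same as any async replication: reads from a replica can lag behind the main table until the stream consumer catches up.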
I am working with a company that is redesigning an enterprise transactional system, currently backed by an Oracle database with 3000 tables. It’s B2B so loads are predictable and are expected to grow no more than 10% per year.
They want to use DynamoDB as their primary data store, with Postgres for edge cases. It seems to me the opposite would be more beneficial.
At what point does DynamoDB become a better choice than Postgres? I know that at certain scales Postgres breaks down, but what are those thresholds?
You can make Postgres scale, but there is an operational cost to it. DynamoDB does that for you out of the box. (So does Aurora, to be honest, but there is also an overhead to setting up an Aurora cluster to the needs of your business.)
I've also found that Postgres query performance does not keep up with bursts of traffic -- you need to overprovision your db servers to cope with the highest traffic days. DynamoDB, in contrast, scales instantly. (It's a bit more complicated than that, but the effect of it is nearly instantaneous.) And what's really great about DynamoDB is that after the traffic levels go down, it does not scale down your table and maintains it at the same capacity at no additional cost to you, so if you receive a burst of traffic at the same throughput, you can handle it even faster.
DynamoDB does a lot of magic under the hood, as well. My favorite is auto-sharding, i.e. it automatically moves your hot keys around so the demand is evenly distributed across your table.
So DynamoDB is pretty great. But to get the best experience from DynamoDB, you need to have a stable codebase, and design your tables around your access patterns. Because joining two tables isn't fun.
> So DynamoDB is pretty great. But to get the best experience from DynamoDB, you need to have a stable codebase, and design your tables around your access patterns. Because joining two tables isn't fun.
More than just joining--you're in the unenviable place of reinventing (in most environments, anyway) a lot of what are just online problems in the SQL universe. Stuff you'd do with a case statement in Postgres becomes some on-the-worker shenanigans, stuff you'd do with a materialized view in Postgres becomes a batch process that itself has to be babysat and managed and introduces new and exciting flavors of contention.
There are really good reasons to use DynamoDB out there, but there are also an absolute ton of land mines. If your data model isn't trivial, DynamoDB's best use case is in making faster subsets of your data model that you can make trivial.
They should be looking at Aurora, not Dynamo. Using Dynamo as the primary store for relational data (3000 tables!) sounds like an awful idea to me. I’d rather stay on Oracle.
It seems to me that what this is saying is that storage has become so cheap that if one database provides even slight advantages over another for some workload, it is likely to be deployed and have all the data copied over to it.
HN entrepreneurs take note, this also suggests to me that there may be a market for a database (or a "metadatabase") that takes care of this for you. I'd love to be able to have a "relational database" that is also some "NoSQL" databases (since there's a few major useful paradigms there) that just takes care of this for me. I imagine I'd have to declare my schemas, but I'd love it if that's all I had to do and then the DB handled keeping sync and such. Bonus points if you can give me cross-paradigm transactionality, especially in terms of coherent insert sets (so "today's load of data" appears in one lump instantly from clients point of view and they don't see the load in progress).
At least at first, this wouldn't have to be best-of-breed necessarily at anything. I'd need good SQL joining support, but I think I wouldn't need every last feature Postgres has ever had out of the box.
If such a product exists, I'm all ears. Though I am thinking of this as a unified database, not a collection of databases and products that merely manages data migrations and such. I'm looking to run "CREATE CASSANDRA-LIKE VIEW gotta_go_fast ON SELECT a.x, a.y, b.z FROM ...", maybe it takes some time of course but that's all I really have to do to keep things in sync. (Barring resource overconsumption.)
> I'd love to be able to have a "relational database" that is also some "NoSQL" databases (since there's a few major useful paradigms there) that just takes care of this for me. I imagine I'd have to declare my schemas, but I'd love it if that's all I had to do and then the DB handled keeping sync and such.
You might be interested in what we're building [0]. It synchronizes your data systems so that, for example, you can CDC tables from your Postgres DB, transform them in interesting ways, and then materialize the result in a view within Elastic or DynamoDB that updates continuously and with millisecond latency. It will even propagate your sourced SQL schemas into JSON schemas, and from there to, say, an equivalent Elasticsearch schema.
I think there was a project like this a few years ago (wrapping a relational DB + ElasticSearch into one box) and I thought it was CrateDB, but from looking at their current website I think I'm misremembering.
The concept didn't appeal to me very much then, so I never looked into it further.
---
To address your larger point, I think Postgres has a better chance of absorbing other datastores (via FDW and/or custom index types) and updating them in sync with its own transactions (as far as those databases support some sort of atomic swap operation) than a new contender has of getting near Postgres' level of reliability and feature richness.
My understanding of the CockroachDB architecture is that it's essentially two discrete components: a key value store that actually persists the data, and a SQL layer built on top.
Although I don’t think it’s recommended or supported to access the key value store directly.
I have no direct experience with scaling DynamoDB in production, so take this with a grain of salt. But it seems to me that the on-demand scaling mode in DynamoDB has gotten _really_ good the last couple of years.
For example, you used to have to manually set RCU/WCU to a high number when you expected a spike in traffic, since the ramp-up for on-demand scaling was pretty slow (could take up to 30 minutes). But these days, on-demand can handle spikes from 10s of requests a minute to 100s/1000s per second gracefully.
The downside of on-demand is the pricing - it's more expensive if you have continuous load. But it can easily become _much_ cheaper if you have naturally spiky load patterns.
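For reference, the back-of-envelope capacity math behind that trade-off, using DynamoDB's documented unit sizes (a read unit covers up to 4 KB, a write unit up to 1 KB, and eventually consistent reads cost roughly half):

```python
import math

def rcu_needed(reads_per_sec, item_kb, eventually_consistent=True):
    """Rough provisioned read capacity: one RCU is one strongly
    consistent read/sec of an item up to 4 KB; eventually consistent
    reads cost half. A sketch, not a billing calculator."""
    units = reads_per_sec * math.ceil(item_kb / 4)
    return math.ceil(units / 2) if eventually_consistent else units

def wcu_needed(writes_per_sec, item_kb):
    """One WCU is one write/sec of an item up to 1 KB."""
    return writes_per_sec * math.ceil(item_kb)
```

So 100 strongly consistent reads/sec of 6 KB items needs 200 RCU, but only 100 RCU if eventual consistency is acceptable; that factor-of-two is part of why spiky, read-heavy workloads price out so differently between the modes.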
> The downside of on-demand is the pricing - it's more expensive if you have continuous load.
True, although you don't have to make that choice permanently. You can switch from provisioned to on demand once every 24 hours.
And you can also set up application autoscaling in provisioned mode, which'll allow you to set parameters under which it'll scale your provisioned capacity up or down for you. This doesn't require any code and works pretty well if you can accept autoscaling adjustments being made in the timeframe of a minute or two.
We have some regular jobs that require scaling up DynamoDB in advance a few times per day, but Dynamo is only able to scale down 4x per day, so we were probably paying for unnecessary over-capacity (10x or more) for a couple of hours a day.
Now we've just moved to on-demand and let them handle it; works fine.
> Is the dream of a self-managing, fire-and-forget key value database completely naive?
It's not, if you plan it right. Learn about single table design for DynamoDB before you start. There are a lot of good resources from Amazon and the community.
Here is a very accessible video from the community:
If you use single table design, you can turn on all of the auto-tuning features of DynamoDB and they will work as expected and get better and more efficient with more data.
Some people worry that this breaks the cardinal rule of microservices: One database per service. But the actual rule is never have one service directly access the data of another, always use the API. So as long as your services use different keyspaces and never access each other's data, it can still work (but does require extra discipline).
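For a flavor of what single-table design looks like in practice, here's a hypothetical key scheme (the `USER#`/`ORDER#` formats are illustrative conventions, not anything DynamoDB mandates) where a user and their orders share a partition key, so "user plus all their orders" is a single Query on `PK`:

```python
# Heterogeneous entities share one table, distinguished by
# composite partition/sort keys. Attribute names are made up.
def user_item(user_id, email):
    return {"PK": f"USER#{user_id}", "SK": "PROFILE", "email": email}

def order_item(user_id, order_id, total):
    # Orders live under the same partition key as their owner, so a
    # single Query on PK="USER#<id>" returns the profile and every order.
    return {"PK": f"USER#{user_id}", "SK": f"ORDER#{order_id}", "total": total}
```

The keyspace-discipline point above falls out naturally: as long as each service owns its own `PK` prefixes, they can share the physical table without touching each other's data.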
A lot of things that used to be a concern (hot partitions, etc) are not a concern anymore and most have been solved these days :)
Put it on on-demand pricing (it'll be better and cheaper for you most likely), and it will handle any load you throw at it. Can you get it to throttle? Sure, if you absolutely blast it without ever having had that high of a need before (and it can actually be avoided[0]).
You will need to understand how to model things for the NoSQL paradigm that DynamoDB uses, but that's a question of familiarity and not much else (you didn't magically know SQL either).
My experience comes from scaling DynamoDB in production for several years, handling massive IoT data ingestion as well as user data. We were able to completely replace everything we thought we would need a relational database for.
My comparison with a traditional RDS setup:
- DynamoDB issues? 0. Seriously. Only thing you need to monitor is billing.
- RDS? Oh boy, need to provision for peak capacity, need to monitor replica lags, need to monitor the Replicas themselves, constant monitoring and scaling of IOPS, suddenly queries get slow as data increases, worrying about indexes and the data size, and much more...
> We're at early stages of planning an architecture where we offload pre-rendered JSON views of PostgreSQL onto a key value store optimised for read only high volume.
If possible, put the json in Workers KV, and access it through Cloudflare Workers. You can also optionally cache reads from Workers KV into Cloudflare's zonal caches.
> To be honest, I'd hoped that it could be a bit more 'magic', like S3
You could opt to use the slightly more expensive DynamoDB On-Demand, or the free DynamoDB Auto Scaling modes, which are relatively no-config. For a very read-heavy workload, you'd probably want to add DynamoDB Accelerator (a write-through in-memory cache) in front of your tables. Or use S3 itself (though an S3 bucket doesn't really like being loaded with a tonne of small files) accelerated by CloudFront (which is what AWS Hyperplane, the tech underpinning ALB and NLB, does: https://aws.amazon.com/builders-library/reliability-and-cons...)
It is a resource that can often be the right tool for the job but you really have to understand what the job is and carefully measure Dynamo up for what you are doing.
It is _easy_ to misunderstand or miss something that would make Dynamo hideously expensive for your use case.
Hot keys are the primary one. They destroy your "average" calculations for your throughput.
Bulk loading data is the other gotcha I've run into. Had a beautiful use case for steady read performance of a batch dataset that was incredibly economical on Dynamo but the cost/time for loading the dataset into Dynamo was totally prohibitive.
Basically Dynamo is great for constant read/write of very small, randomly distributed documents. Once you are out of that zone, things can get dicey fast.
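A toy illustration of the hot-key point (the 3000 reads/sec figure is DynamoDB's documented per-partition read limit; the traffic numbers are invented):

```python
from collections import Counter

PER_PARTITION_RCU = 3000  # documented per-partition read throughput cap

def throttled_keys(read_counts_per_sec):
    """Keys whose own read rate exceeds what a single partition can serve.
    Provisioned capacity is spread across partitions, so the table-wide
    average can look healthy while one key still throttles."""
    return [k for k, rps in read_counts_per_sec.items()
            if rps > PER_PARTITION_RCU]

traffic = Counter({"user#hot": 5000, "user#a": 50, "user#b": 40})
# Average is ~1700 reads/sec per key -- well "within budget" on paper --
# yet "user#hot" alone exceeds its partition's cap and gets throttled.
```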
I do not recommend starting off with a decision to use DynamoDB before you have worked with it directly for some time to understand it. You could spend months trying to shoehorn your use case into it before realizing you made a mistake. That said, DynamoDB can be incredibly powerful and inexpensive tool if used right.
Yea, probably, but it is especially true for DynamoDB because it can initially appear as though your use cases are all supported but that is only because you haven't internalized how it works yet. By the time you realize you made a mistake, you are way too far in the weeds and have to start over from scratch. I would venture that more than 50% of DynamoDB users have had this happen to them early on. Anecdotally, just look at the comments on this post. There are so many horror stories with DynamoDB, but they're basically all people who decided to use it before they really understood it.
I believe it used to be static provisioning: you'd set the read and write capacity beforehand. Then obviously there is autoscaling of those, but it is still steps of capacity being provisioned.
They now have a dynamic provisioning scheme where you simply don't care, but it is more expensive, so if you have predictable requirements it is still better to use static capacity provisioning. There is an option, though.
DynamoDB also requires the developer to know about its data storage model. While this is generally a good practice for any data storage solution, I feel like Dynamo requires a lot more careful planning.
I also think that most of the best practices, articles etc apply to giant datasets with huge scale issues etc. If you are running a moderately active app, you probably can get away with a lot of stupid design decisions.
My experience with dynamic provisioning has been that it is pretty inelastic, at least at the lower range of capacity. E.g. if you have a few read units and then try to export the data using AWS's cli client, you can pretty quickly hit the capacity limit and have to start the export over again. Last time, I ended up manually bumping the capacity way up, waiting a few minutes for the new capacity to kick in, and then exporting. Not what I had in mind when I wanted a serverless database!
I understand it's not really your point, but if you're actually looking to export all the data from the table, they've got an API call you can give to have DynamoDB write the whole table to S3. This doesn't use any of your available capacity.
Yes, you have to learn about all these things upfront. But once you figure it out, test it, and configure it - it will work as you expect. No surprises.
Whereas Relational Databases work until they don't. A developer makes a tiny (even a no-op) change to a query or stored procedure, a different SQL plan gets chosen, and suddenly your performance/latency dramatically reduces, and you have no easy way to roll it back through source control/deployment pipelines. You have to page a DBA who has to go pull up the hood.
It is for now, but it doesn't have to be. Dynamo's design isn't particularly amenable to dynamic and heterogeneous shard topologies, however.
There could exist a fantasy database where you still tell it your hash and range keys, which are roughly how you tell the database which data is closely related (and which you may want to scan together) and which isn't, but instead of hard-provisioning shard capacity it automagically splits shards when they hotspot, and doesn't rely on consistent hashing, so that every shard can be sized differently depending on how hot it is.
Right now such a database doesn't exist AFAICT, as most places that need something that scales big enough also generally have the skill to avoid most of the pitfalls that cause problems on simpler databases like Dynamo.
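The hotspot-splitting behavior that fantasy database would need can be sketched in a few lines (a toy range-partitioned store; nothing like a real implementation, and "hot" here is crudely approximated by key count rather than traffic):

```python
SPLIT_THRESHOLD = 4  # arbitrary: split a shard once it holds this many keys

class SplittingStore:
    """Toy store whose shards split themselves instead of being provisioned."""

    def __init__(self):
        # Each shard is (low_bound, dict); shards cover sorted key ranges.
        self.shards = [("", {})]

    def _shard_for(self, key):
        # Walk shards from highest range down to find the owning shard.
        for i in range(len(self.shards) - 1, -1, -1):
            if key >= self.shards[i][0]:
                return i
        return 0

    def put(self, key, value):
        i = self._shard_for(key)
        self.shards[i][1][key] = value
        if len(self.shards[i][1]) > SPLIT_THRESHOLD:
            self._split(i)

    def _split(self, i):
        # Split the shard at its median key; each half keeps its own range.
        low, data = self.shards[i]
        keys = sorted(data)
        mid = keys[len(keys) // 2]
        self.shards[i] = (low, {k: v for k, v in data.items() if k < mid})
        self.shards.insert(i + 1, (mid, {k: v for k, v in data.items() if k >= mid}))

    def get(self, key):
        return self.shards[self._shard_for(key)][1].get(key)
```

A real system would split on observed throughput, move shards between nodes, and handle concurrent access, which is where all the actual difficulty lives.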
I'd urge you to start writing a prototype; a lot of your assumptions might get thrown out the window. Dynamo is not necessarily good for reading high volume. You'll end up needing to use a parallel scan approach, which is not fast.
I'd say Dynamo is extremely good at reading high volume, with the appropriate access pattern. It's very efficient at retrieving huge amounts of well partitioned data using the data's keys, but scanning isn't so efficient.
You can only ever fetch 1MB of data at a time though, even when using the more efficient query method (as opposed to scan). If your individual entities are not very tiny, it is hard to get for instance 2M items back in a reasonable amount of time.
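To make the 1 MB page limit concrete, here's a toy pagination loop (`query_page` is a stand-in for a real DynamoDB Query call and its LastEvaluatedKey cursor; the sizes are invented):

```python
PAGE_LIMIT_BYTES = 1_000_000  # a Query/Scan response is capped at ~1 MB

def query_page(items, start=0, item_size=500):
    """Stand-in for one DynamoDB Query call: returns up to ~1 MB of items
    plus a LastEvaluatedKey-style cursor when more remain."""
    per_page = PAGE_LIMIT_BYTES // item_size
    page = items[start:start + per_page]
    next_key = start + per_page if start + per_page < len(items) else None
    return page, next_key

def query_all(items):
    """The sequential pagination loop every large read ends up needing."""
    results, cursor = [], 0
    while cursor is not None:
        page, cursor = query_page(items, cursor)
        results.extend(page)
    return results
```

With 500-byte items that's ~2000 items per round trip, so fetching 2M items sequentially means on the order of a thousand back-to-back requests; hence the parallel-scan workarounds mentioned above.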
I don't know your scaling needs, but I would highly recommend just using Aurora postgresql for read-only workloads. We have some workloads that are essentially K/V store lookups that were previously slated for dynamodb. On an Aurora cluster of 3*r6g.xlarge we easily handle 25k qps with p99 in the single-digit ms range. Aurora can scale up to 15 instances and up to 24xlarge, so it would not be unreasonable to see 100x the read workload with similar latencies.
Happy to talk more. We're actively moving a bunch of workloads away from DynamoDB and to Aurora so this is fresh on our minds.
The salespeople always promise magic and handwave CAP away.
But data at scale is about:
1) knowing your queries ahead of time (since you've presumably reached the limit of PG/maybesql/o-rackle).
2) dealing with CAP at the application level: distributed transactions, eventual consistency, network partitions.
3) dealing with a lot more operational complexity, not less.
So if the snake oil salesmen say it will be seamless, they are very very very much lying. Either that, or you are paying a LOT of money for other people to do the hard work.
Which is what happens with managing your own NoSQL vs DynamoDB. You'll pay through the roof for DynamoDB at true big data scales.
If you know and understand S3 pretty well, and you purely need to generate, store, and read materialized static views, I highly recommend S3 for this use case. I say this as someone who really likes working with DDB daily and understands the tradeoffs with Dynamo. You can always layer on Athena or (simpler) S3 Select later if a SQL query model is a better fit than KV object lookups. S3 is loosely the fire and forget KV DB you’re describing IMO depending on your use case
Plenty of options already exist. DynamoDB has both autoscaling and serverless modes. AWS also has managed Cassandra (runs on top of DynamoDB) which doesn't need instance management.
Azure has CosmosDB, GCP has Cloud Datastore/Firestore, and there are many DB vendors like Planetscale (mysql), CockroachDB (postgres), FaunaDB (custom document/relational) that have "serverless" options.
Exactly. This has been my experience with several AWS technologies. Like with their ElasticSearch service, where I had to constantly fine-tune various parameters, such as memory. I was curious why they couldn't auto-scale the memory, why I had to do that manually. There are several AWS services that should be a bit more magical, but they are not.
There's not really magic with S3 either; you still need to name things with coherent prefixes to spread around the load.
DynamoDB is almost simple enough to learn in a day. And if you're doing nothing with it, you're only really paying for storage. Good luck with your decisions.
I'm not going to speculate on the accuracy of the 90% value, but I will say that appropriately prefixed objects substantially help with performance when you have tons of small-ish files. Maybe most orgs don't have that need, but in operational realms, doing this with your logs makes responses faster.
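For example, the classic trick is to prepend a short hash so lexically clustered keys (like date-stamped log paths) spread across S3's internal partitions (a sketch; the two-character prefix length is arbitrary, and modern S3 auto-partitions well enough that this mainly matters at very high request rates):

```python
import hashlib

def prefixed_key(logical_key, prefix_len=2):
    """Prepend a short, deterministic hash prefix so sequential keys
    land in different partitions instead of all hitting one."""
    digest = hashlib.md5(logical_key.encode()).hexdigest()[:prefix_len]
    return f"{digest}/{logical_key}"
```

So `logs/2024-01-01/app.log` becomes something like `3f/logs/2024-01-01/app.log`, and consecutive days no longer share a prefix.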
Your impressions are correct: DynamoDB is quite low-level, more like a DB kit than a ready-to-use DB; for most applications it's better to use something else.
If you use the "pay per request" billing model instead of provisioned throughput, DynamoDB scaling is self-managing, and you can treat your DB as a fire-and-forget key/value store. You need to plan how you'll query your data and structure the keys accordingly, but honestly, that applies even more to S3 than it does to Dynamo.
Exactly my experience. I got sucked into using more than once, thinking it would be better next time, but there are just so many sharp edges.
At one company, someone accidentally set the write rate high to transfer data into the db. This had the effect of permanently increasing the shard count to a huge number, basically making the DB useless.
I think this is a good summary, and it even gets more complicated if you start using the DAX cache. Your read/write provisioning for DAX is totally different than the underlying dynamodb tables. The write throughput for Dax is limited by the size of the master node in the cluster. Can you say bottleneck?
Take a look at Firestore / Google Cloud Datastore. It's pretty much exactly what you describe - fire and forget. There's no concept of "node" (at least not from the outside).
Thinking like this baffles me, but it also makes me happy, because there will always be a need for people like me: infra. AWS is not a magical tool that will replace your infra team; it is a magical tool that will allow your infra team to do more. I am the infra team of my startup, and I estimate that only 50% of my time is spent doing infra work. The rest is supporting my peers, working on frameworky stuff, solving dev efficiency issues, bla bla.
Lets say that you operate in an AWS-less environment, with everything bare metal, in a datacenter. Your GOOD infra team has to do the following:
Hardware:
- make sure there is a channel to get new hardware, both for capacity increase and spares. What are you going to do? Buy 1 server and 2 spares? If one of the servers has an issue, isn't it quite likely that the other servers, from the same batch, to have the same issue? Is this affecting you, or not? Where do you store the spares? In a warehouse somewhere, making it harder to deploy? In the rack with the one in use, wasting rackspace/switch space? Are you going to rely on the datacenter to provide you with the hardware? What if you are one of their smaller customers and your requests get pushed back because some larger customer requests get higher priority?
- make sure there is a way to deploy said hardware. You don't want to not be able to deploy a new server because there is no space in the rack, or no space in the switch. Where are your spares? In a warehouse miles away from the datacenter? Do you have access to said warehouse at midnight, on Thanksgiving? Oh shit, someone lost the key to your rack! Oh noes, we don't have any spare network cable/connectors/screws...
Software:
- did you patch your servers? did you patch your switches?
- new server, we need to install the os. And a base set of software, including the agent we use to remote manage the server.
- oh, we also need to run and maintain the management infra, say the control plane for k8.
- oh, we want some read replicas for this db, not only we need the hardware to run the replicas on (and see above for what that means), now you need to add a bunch of monitoring and have plans in place to handle things like: replicas lagging, network links between master and replicas being full, failover for the above, master crapping out yada yada.
I bet there are many other aspects I'm missing.
Choices:
Your GOOD infra team will have to decide things like: how many spares do we need, is the capacity we have atm enough for the launch of our next world-changing feature that half the internet wants to use? Are we lucky enough to survive a few months without spares or should we get extra capacity in another datacenter? Do we want to have replicas on the west coast or is the latency acceptable?
These are the main areas of what an infra team is supposed to do: Hardware, Software and Choices. AWS (and most other cloud providers) is making the first 2 points non-issues. For the last area you can do 2 things: get an infra team (could be a full-fledged team, could be 1 person, you could do it) and theoretically you will get choices tailored to what your business needs, OR let AWS do it for you. *AWS might make these choices based on a metric you disagree with, and this is the main reason people complain*.