
I can safely say that the team members working in DynamoDB are very skilled and they care deeply about the product. They really work hard and think of interesting solutions to a lot of problems that their biggest customers face which is great from a product standpoint. There are some pretty smart people working there.

Engineering, however, was a disaster story. Code is horribly written and very few tests are maintained to make sure deployments go without issues. There was too much emphasis on deployment and getting fixes/features out over making sure they wouldn't break anything else. It was a common scenario to release a new feature and put duct tape all around it to make sure it "works". And way too many operational issues. There are a lot of ways to break DynamoDB :)

Overall, though, the product is very solid and it's one of the few databases that you can say "just works" when it comes to scalability and reliability (as most AWS services are).

I worked at DynamoDB for over 2 years.



>Engineering, however, was a disaster story. Code is horribly written and very few tests are maintained to make sure deployments go without issues. There was too much emphasis on deployment and getting fixes/features out over making sure they wouldn't break anything else. It was a common scenario to release a new feature and put duct tape all around it to make sure it "works". And way too many operational issues. There are a lot of ways to break DynamoDB :)

>Overall, though, the product is very solid and it's one of the few databases that you can say "just works" when it comes to scalability and reliability (as most AWS services are).

How can those two coexist?


You throw bodies at it. A small bunch of people will be overworked, stressed, constantly fighting fires and struggling to fight technical debt, implement features, and keep the thing afloat. Production is always a hair away from falling over, but luck and grit keep it running. To the team it's a nightmare; to the business, everything is fine.


this is the answer.

source: currently being burned out on an adjacent AWS team.


That sucks, man. If they won't move you to another team, just get out of there. We don't benefit by suffering for them, and they're not gonna change.


Yea. I can probably move to a more chill team, but I wouldn't work on anything nearly as cutting edge. I mentally check out for weeks at a time, then get back into it and deliver something large. I'm low key job hunting, but don't entirely trust that it'll be different anywhere else (previous jobs were like this too)


literally every co

if you want to know why capitalism causes this, start a startup and prioritize quality, do not get to market, do not raise money, do not pass go, watch dumpster fires with millions of betrayed and angry users raise their series d


I mean eventually enough duct tape can be solid like a tank :)




It sounds implausible, but I have heard similar things about Oracle. Maybe a large dev team can apply enough duct tape that the product ends up solid.


You're probably thinking of this comment about Oracle and its 25 million lines of C code: https://news.ycombinator.com/item?id=18442941


They both likely have solid 80% solutions (design) and incrementally cover the 20% gap as need arises. This in turn adds to operational complexity.

The alternative would be to attempt a near-'perfect' solution for the product requirements, which may either hit an impossibility wall or require substantial long-term effort that would impede product development cycles. So the former approach is likely the smarter choice.


You need the really good duct tape.


AWS level of disaster is very different to an average disaster. :)


Customers care about the outcome, not the internal process. Besides, I’ve never worked at any sizable company in my 20+ year career where I didn’t conclude, “it’s a miracle this garbage works at all.”

Enjoy the sausage, but if you have a weak stomach, don’t watch how it’s made.

(I work for AWS but not on the DynamoDB team and I have no first-hand knowledge of the above claim. Opinions are my own and not those of my employer.)


> Customers care about the outcome, not the internal process.

This is true though there's only so much technical debt and internal process chaos you can create before it affects the outcome. It's a leading indicator, so by the time customers are feeling that pain you've got a lot of work in front of you before you can turn it around, if at all, and customers are not going to be happy for that duration.

Technical debt is not something to completely defeat or completely ignore, instead it's a tradeoff to manage.


This article from Martin Fowler explores your point in greater depth. It's a good read: https://martinfowler.com/articles/is-quality-worth-cost.html

One concrete problem with technical debt the article highlights is that it negatively impacts the time to deliver new features. Customers today usually expect not only a great initial feature set from a product, but also a steady stream of improvements and growth, along with responsiveness to feedback and pain points.


> Customers care about the outcome, not the internal process

Additionally, the business cares about the outcome, not the internal process.

Ostensibly, the business should care about process but it actually doesn't matter as long as the product is just good enough to obtain/retain customers, and the people spending the money (managers) aren't incentivized to make costs any lower than previously promised (status quo).


Just curious, why do you mention you work at AWS if you're just disclaiming that fact in the next sentence? Besides, nothing you stated is specific to AWS or any of its products.


I don't work at Amazon, but our company's social media policy requires us to be transparent about a possible conflict of interest when speaking about things "close to" our company/our position in the industry, and to be clear about whether we're speaking in an official capacity or in a personal capacity.

This is designed to reduce the chances of eager employees going out and astro-turfing or otherwise acting in trust-damaging ways while thinking they're "helping".


Crudely speaking, the fact that they work at AWS means that it’s in their best interests for AWS to be perceived positively.

When this is the case it’s often nice to state this conflict of interest, so others can take your appraisal in the appropriate context.

I’m not implying anything about the post, just stating what I assume to be the reason for the disclosure.


Exactly this. I was too young at the time to grasp this idea.


If the developers are happy about the code and testing quality of a project, then you waited too long to ship.

If the customers don't have any feedback or missed feature asks at launch, you waited too long to ship.

You know who has great internal code and test quality? Google. Which is why Google doesn't ship. They're a wealth distribution charity for talented engineers. And their competitive advantage is that they lure talented people away from other companies where they might actually ship something and compete with Google, to instead park them, distract them with toys, beer kegs, readability reviews, and monorepo upgrades.


Very interesting!

To me the takeaway is that large/interesting/challenging engineering projects are generally pretty close to disasters. Sometimes they actually do become disasters.

On the other hand, if a project looks straight-up well designed, neatly put into JIRA stories, with developers delivering code consistently week after week, then it may be a successfully planned and delivered project. But it would mostly be doing stuff that has already been done many times over, likely by the same people on the team.

At least this has been my experience while working on standardized / templated projects vs something new.


Challenging the cutting edge of your product domain is what I get from this. Easy things are easy and predictable. Hard things and unpredictable, evolving requirements are in tension with the initial system design, which is the foundation of your code base. The larger a project gets over time, the further it tends to deviate from the original design. If you could predict it all up front, in many cases it's not all that interesting or challenging a problem. Duct tape is fine to use as long as you understand when you've gone too far and might want to re-design from scratch based on prior learnings.


On the other hand, if you don't go solder things and remove the duct tape from time to time, you will always come closer to a disaster, never further away.

Some projects are run like the Doomsday Clock, and nobody can get anything done. Other ones increase and decrease in complexity all the time, and those tend to catch up to the first set quite quickly.


I worked at a company who re-implemented the entire Dynamo paper and API, and it was exactly the same story. Completely eliminated all my illusions about the supposed superiority of distributed systems. It was a mound of tires held together with duct tape, with a tiki torch in each tire.


Did they have a spare 100 million hanging around to burn? That seems pretty ridiculous. Why did they not just run cassandra?


They did have 100 million to burn, but my mostly-wild-guess is it was closer to $1.5M/yr. But that gives you an in-house SaaS DB used across a hundred other teams/products/services, so it actually saved money (and nothing else matched its performance/CAP/functionality).

Cassandra is too opinionated and its CAP behavior wasn't great for a service like this, so they built on top of Riak. (This also eliminated any thoughts I had about Erlang being some uber-language for distributed systems, as there were (are?) tons of bugs and missing edge cases in Riak)


Erlang gives you great primitives for building reliable protocols, but they're just primitives, and there are tons of footguns since building protocols is hard.


Because Riak uses vector clocks instead of cell timestamps? Cassandra's ONE/QUORUM/ALL consistency levels otherwise allow tuning for tolerance of CP vs AP, don't they?
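(For anyone unfamiliar with the distinction: the practical difference is that a vector clock can report two writes as genuinely concurrent, while a single timestamp always forces an order. A minimal sketch of the comparison, not Riak's actual implementation:)

```python
# Minimal vector clock comparison. Unlike a lone timestamp, it can
# report that two writes are *concurrent* (a true conflict) rather
# than silently picking a winner.
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks, e.g. {"node1": 2, "node2": 1}."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # neither clock dominates: surface both versions
    if a_ahead:
        return "a_after_b"    # a has seen everything b has, and more
    if b_ahead:
        return "b_after_a"
    return "equal"

print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a_after_b
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 3}))  # concurrent
```

In the concurrent case Riak can hand both siblings back to the application to resolve, whereas a timestamp-based scheme has to discard one.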


To be honest I don't know, I wasn't there for the initial decision, but I know it wasn't just about CAP. It could have been as simple as Riak was easier to use (which I don't know either)


> Why did they not just run cassandra?

Not Invented Here can run very deep in some branches of an organization. Depending on how engineering performance evaluations work, writing a homebrew database could totally be something that aligns with the company incentives. It might not make a single bit of sense from a business standpoint but hey, if the company rewards such behavior don't be surprised when engineers flush millions down the tube "innovating" a brand new wheel.


Dynamo paper and DynamoDB are two very different things…


Err you're right, they re-used something that implemented the Dynamo paper (Riak) and implemented on top of it the DynamoDB API.


It's a shame they don't open source it. It's funny too, being AWS they really don't have to worry about AWS running a cheaper service, so at that point why not open source it.


There is a compatible open source alternative here, https://www.scylladb.com/alternator/


If you’ve read their paper, there is a lot of detail in it to create your own. Of course they haven’t given out the code but the paper is a pretty solid design document.


They probably view it as a competitive advantage that Azure or GCP would try to copy if they figured out the "secret sauce."


I kinda doubt it. It's probably just that open sourcing it won't provide much utility (I bet lots of code is aws specific) and just adds a new maintenance burden for them.


The way you need to write code for a massively scalable service is just different. And the things you need to operate a service are also just different.


Azure has a better product (Cosmos is damn good) and Google engineers have too much hubris to import something from lowly Amazon engineers.


Azure has Cosmos and Google has Datastore. They would never.


Azure has Cosmos which is arguably better than DynamoDB for a lot of use cases.


So they just rolled out global replication, and I can't for the life of me figure out how they resolve write conflicts without cell timestamps or any other obvious CRDT measures.

Questions were handwaved away with the usual Amazon black-box non-answers, which always smell like they're hiding problems.

Any ideas how this is working? It seems bolt-on and not well thought out, and I doubt they'll ever pay for Aphyr to put it through his torture tests.


From: https://aws.amazon.com/dynamodb/global-tables/

Consistency and conflict resolution

Any changes made to any item in any replica table are replicated to all the other replicas within the same global table. In a global table, a newly written item is usually propagated to all replica tables within a second. With a global table, each replica table stores the same set of data items. DynamoDB does not support partial replication of only some of the items. If applications update the same item in different Regions at about the same time, conflicts can arise. To help ensure eventual consistency, DynamoDB global tables use a last-writer-wins reconciliation between concurrent updates, in which DynamoDB makes a best effort to determine the last writer. With this conflict resolution mechanism, all replicas agree on the latest update and converge toward a state in which they all have identical data.


So a write that doesn’t “win” just gets silently discarded?


Honestly your expectations are too high. Conflict resolution is row-level last-write-wins. It's not a globally distributed database, it's just a pile of regional DynamoDB tables duct taped together... They're not going to hire Aphyr for testing because there's nothing for him to test.
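Crudely, item-level last-write-wins can be sketched like this. The timestamp field and region tiebreak are illustrative assumptions, not DynamoDB internals, since AWS only documents the behavior as "best effort":

```python
import dataclasses

# Hypothetical sketch of item-level last-write-wins reconciliation.
# Field names are illustrative; the real tiebreaker is undocumented.
@dataclasses.dataclass
class Version:
    value: str
    timestamp: float   # wall-clock write time (best effort)
    region: str        # deterministic tiebreaker when timestamps collide

def lww_merge(a: Version, b: Version) -> Version:
    # Highest timestamp wins; ties are broken deterministically so every
    # replica converges on the same winner. The loser is simply dropped.
    return max(a, b, key=lambda v: (v.timestamp, v.region))

us = Version("x=1", 1000.000, "us-east-1")
eu = Version("x=2", 1000.001, "eu-west-1")
print(lww_merge(us, eu).value)  # "x=2" -- the us-east-1 write is silently lost
```

Whichever version loses the comparison is discarded, which is why the answer to the "silently discarded" question above is basically yes.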


"And way too many operational issues."

I've seen this kind of thing mentioned many times, which is pretty baffling TBH given Dynamo's pretty good reputation in the industry. Are these issues mostly confined to the stateless components of the product, or do they ever see data loss?


I can't say, at the risk of violating some NDA, but a lot of it is internal stuff that customers will never even be aware of, or that would require too much effort for them to break.

There have been times when bad deployments happened and customers were impacted.


Careful about running your systems in us-east-1, folks


I've worked at 5 different tech companies now - this is par for the course. And at every single one, people wished they could go back and do it again, but by that point the product was too successful, so they ran with it.



