A terrible schema from a clueless programmer (rachelbythebay.com)
33 points by ciprian_craciun on Nov 7, 2021 | 24 comments


I must be missing something, but the "solution" with the 3NF and five separate tables with their own indices seems artificial and like pointless overengineering. Isn't this solved by a single index?

Or maybe a relational DB is already overengineering. Using a hash table (some libdbm lookalike) to map (ip,helo,from,to) to (time) solves the problem, too, and doesn't have a low performance failure mode.
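A minimal sketch of that hash-table idea in Python (all names, and the greylisting window, are my assumptions, not from the post; a dbm-style store would need the tuple serialized to a string key, while an in-memory dict takes the tuple directly):

```python
import time

# Hypothetical greylisting store: map the (ip, helo, from, to) tuple
# to the time we first saw it.
seen = {}

GREYLIST_WINDOW = 300  # seconds a sender must wait before retrying (assumed)

def check(ip, helo, mail_from, rcpt_to, now=None):
    """Return True if this tuple was first seen long enough ago to accept."""
    now = time.time() if now is None else now
    key = (ip, helo, mail_from, rcpt_to)
    first_seen = seen.setdefault(key, now)  # record on first sight
    return now - first_seen >= GREYLIST_WINDOW
```

Lookup and insert are both O(1) on average, which is the "no low performance failure mode" being claimed.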

This doesn't change the major point about everyone having been a newbie at some point, though.


Are there any libdbm-style databases that support "tuples" as keys and values, instead of plain "strings"? I've felt like I wanted them in some applications, but didn't want to stringify everything.


I'm not familiar with "libdbm-style databases", but CouchDB's map/reduce views allow you to emit any JS object as the key, like so:

emit(key, value)

E.g.

emit([0,"one"], 2)

And you can query by the value of the key.


DBM and similar databases are simple on-disk key-value stores. The use case is a lot like SQLite's: you just need to put some data on disk and retrieve it efficiently later.

Can CouchDB be used in that fashion? Seems a lot more complicated, more like MongoDB than anything.


Definitely more complicated than a key-value store or SQLite; more like a distributed document database with (optional) map/reduce views (indexes) and, more recently, Mongo-like indexes.

The best use case for me, though, is durable key/value storage that you can sync (two-way, no need for a master).

As for performance, it's pretty fast, but not Redis fast, since data is written to disk using B-tree indexes. Retrieving multiple values (documents) with similar-looking keys (ids) can be quite fast.


You can, however, cache 3NF values better down the stack (i.e. you can cache the lookup of a string to a PK and then only use PKs down the line).
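A small sketch of that string-to-PK caching, using SQLite and invented table/column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE helo (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

_pk_cache = {}  # string -> primary key, so repeat lookups skip the DB

def helo_pk(name):
    """Return the PK for a HELO string, inserting it on first sight."""
    if name not in _pk_cache:
        conn.execute("INSERT OR IGNORE INTO helo (name) VALUES (?)", (name,))
        row = conn.execute("SELECT id FROM helo WHERE name = ?", (name,)).fetchone()
        _pk_cache[name] = row[0]
    return _pk_cache[name]
```

After the first lookup of a given string, everything downstream can pass around the small integer instead of the string.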


The important bit comes at the end:

> That's right, I was that clueless newbie who came up with a completely ridiculous abuse of a SQL database that was slow, bloated, and obviously wrong at a glance to anyone who had a clue.

> My point is: EVERYONE goes through this, particularly if operating in a vacuum with no mentorship, guidance, or reference points. Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?


I'm kinda shocked that relational databases work so poorly out of the box that you can't even have a simple table with the 4 or 5 values you care about, but have to split it into 5 different tables and then create business rules about how to connect those back into the original table.


The hitch with relational databases is that you need to know what you are doing. This is mostly true for programming in general and a big reason why a lot of software isn't much faster today than it was in 1995, even though processors are many orders of magnitude faster. It's much more visible with relational databases, as they tend to accumulate so much data that bad solutions immediately break.

In this case, I don't think the 3NF solution they came up with was necessary. It typically ends up requiring really complicated transactions, and in some cases it may even be slower because of cache thrashing. What they really needed was to index their table. You can index strings; that's fine. It's reasonably fast: even with a hundred million rows it's doable.
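For instance, a single compound index over the four string columns (sketched with SQLite; the column names are guesses at the schema in the post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE greylist (
        ip TEXT, helo TEXT, mail_from TEXT, rcpt_to TEXT, first_seen INTEGER
    );
    -- One compound index turns the lookup into a single B-tree descent
    -- instead of a full scan with four string comparisons per row.
    CREATE INDEX greylist_key ON greylist (ip, helo, mail_from, rcpt_to);
""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT first_seen FROM greylist "
    "WHERE ip = ? AND helo = ? AND mail_from = ? AND rcpt_to = ?",
    ("1.2.3.4", "mx", "a@b", "c@d"),
).fetchone()
print(plan)  # the query plan should mention greylist_key, not a table scan
```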

"But why doesn't the database build indices automatically?", I hear someone about to type, it's because indices make inserts slower and require more memory. In many cases auto-indexing would probably help, but it would break a lot of use-cases too.


There are databases out there that analyse usage and filtering patterns and will generate and drop indexes automatically. That is a benefit to some people (normally more analytical workloads than critical production workloads for obvious reasons).


Most databases offer some sort of optimization tooling, not all of it automatic. You just gotta know it's there.


That's just bad design; indexes would have solved that.


How? My understanding was that the four strings formed a natural key, and so wouldn't the index just be the size of the table? What do you gain from that? I'm a database noob so please enlighten me.

I thought the benefit here was that each of the four tables was much smaller (the original table was potentially the Cartesian join of all four mini tables) so fewer string comparisons were done.

I would have considered hashing the values in those four columns. If string comparisons are the issue, that might eliminate the problem without creating extra tables with indexes that need to be scanned. Not sure how it would compare, though.
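One way to sketch that hashing idea (SQLite, invented names; the WHERE clause re-checks the four columns so a hash collision can't return the wrong row):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE greylist (
        key_hash BLOB, ip TEXT, helo TEXT, mail_from TEXT, rcpt_to TEXT,
        first_seen INTEGER
    );
    -- Index one fixed-width hash instead of four variable-length strings.
    CREATE INDEX greylist_hash ON greylist (key_hash);
""")

def key_hash(ip, helo, mail_from, rcpt_to):
    # NUL separators keep ("a", "bc") distinct from ("ab", "c")
    raw = "\0".join((ip, helo, mail_from, rcpt_to)).encode()
    return hashlib.sha1(raw).digest()

def lookup(ip, helo, mail_from, rcpt_to):
    return conn.execute(
        "SELECT first_seen FROM greylist WHERE key_hash = ? "
        "AND ip = ? AND helo = ? AND mail_from = ? AND rcpt_to = ?",
        (key_hash(ip, helo, mail_from, rcpt_to), ip, helo, mail_from, rcpt_to),
    ).fetchone()
```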


In theory, you'd be performing fewer string comparisons. In practice, it's far from clear that the original schema was that bad, performance-wise (except for the nonsensical storage of IP addresses as strings). Even if you have separate indices for the columns, even just picking the index with the highest selectivity (which is something a good database engine will do naturally) will prune most of the comparisons in a logical conjunction that can't succeed. If you have a compound index, you can look for the exact values of multiple columns at once. And if the indices are implemented correctly in the database engine, they will be compressing the keys, so the index should be smaller than the table anyway.

> I would have considered hashing the values in those four columns.

That's a good idea that definitely works, but at least to me it's not clear that it's more efficient than a compound index or a computed index (where a compound index is basically the special case of a computed index on a concatenation that is supported natively even in RDBMS packages that don't support generalized computed indices). It might very well end up a wash.


I'm a bit rusty with MySQL, but it feels like he could have just indexed the original table...


That was a she, not a he.


> work so poorly

They don't work poorly, and there is no alternative. The database should not monitor the contents and opportunistically normalise the schema itself.

It's not "business rules"; it's just normalisation.


I remember when I first started to do "real work" in MySQL (instead of just tiny CRUD apps). I wanted to store a tree structure, and talked to the most skilled DBA in the room (who happened to be our group admin). Turns out, the way to do this in SQL is to make parent pointers (not lists of child pointers), exactly the opposite of how everybody else in the world does it. Later I learned that Postgres has compound types and array types, which work fine for trees.
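The parent-pointer layout (often called an adjacency list) looks like this, sketched with SQLite; a recursive CTE makes walking it straightforward:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),  -- NULL for the root
        name TEXT
    );
    INSERT INTO node VALUES (1, NULL, 'root'), (2, 1, 'left'),
                            (3, 1, 'right'), (4, 2, 'leaf');
""")

# Walk down from the root, carrying the depth along.
rows = conn.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM node WHERE parent_id IS NULL
        UNION ALL
        SELECT n.id, n.name, t.depth + 1
        FROM node n JOIN tree t ON n.parent_id = t.id
    )
    SELECT name, depth FROM tree ORDER BY depth, id
""").fetchall()
print(rows)  # [('root', 0), ('left', 1), ('right', 1), ('leaf', 2)]
```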


I've been there too but wouldn't using an int to store the IP make more sense than pushing it out to its own table? Just as fast and one less table.
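For IPv4, the conversion is a couple of stdlib calls in either direction (Python sketch; the function names are mine):

```python
import socket
import struct

def ip_to_int(ip):
    """Pack a dotted-quad IPv4 address into a 32-bit integer."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def int_to_ip(n):
    """Unpack a 32-bit integer back into dotted-quad form."""
    return socket.inet_ntoa(struct.pack("!I", n))
```

A 4-byte integer column then replaces the string column, and comparisons become single integer compares.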


Worked really well before IPv6 was a thing.


I think I care less about whether the programmer is clueless, and more about the immediate effects of that cluelessness. If their clueless implementation is meeting the minimum requirements and goals that it was intended for, then it doesn't matter if it's terrible or not, it's good enough. If it's not meeting the minimum requirements, then I want them to do it in a way that does.

Another thing I think about more often is when my code will die. Everybody's code gets replaced eventually. How difficult will it be for the next person to replace your code? What kind of impact are you having on future generations? If I write my code to be super efficient and "fancy", will it be that much more difficult to replace in the future? Could I have written it in a different way that made it easier to deal with? I hope I make the right decisions. I don't want somebody to be cursing me when I'm long gone.


looks at greylisting table schema

looks at Rachel's blog post

looks at greylisting table schema again

Doh! :P


Greylisting was a fun time. And yes, I too have absolutely been that person :)


Isn't this a reinvention of indices?



