This surprised me as well; I was told that in Postgres, at least, there is no difference in query speed between selecting a subset of a record and the entire thing.
Short answer: Any relational database is going to read the entire row when you select it. The only savings are network traffic and, probably, some space in internal buffers.
Longer answer: Any relational database will read the entire page the row resides on when you select a row.
This means that tables with more columns pack fewer rows per page, so more pages have to be read and buffered to return the same number of rows.
This also means there is an exception: columns that are stored in overflow pages away from the rest of the row, such as blob or text columns or (in some databases) very long varchars, may not be read at all if you don't select them.
There is an additional, very important exception: if you use a covering index, the db will not need to read the data page the row resides on at all. For example, if you have an index on (username, user_id) and you run "select user_id from table where username=xxxxx", it can read the user_id straight from the leaf node of the index, and no bookmark lookup to the data pages is needed. In some databases the primary key is always "covered" and you never need a bookmark lookup to get it.
It depends. In MySQL, if every column you select is part of a single index, the entire query can be answered from that index, which is usually entirely in memory, so it's enormously faster.
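A rough sketch of how to verify that, assuming a hypothetical users table with an index on (username, user_id) and MySQLdb as the driver: when the whole query can be answered from the index, EXPLAIN reports "Using index" in the Extra column.

    import MySQLdb

    # Connection parameters and table layout are placeholders.
    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="appdb")
    cur = conn.cursor()

    # A composite index that "covers" the query below.
    cur.execute("CREATE INDEX ix_users_username_id ON users (username, user_id)")

    # EXPLAIN shows whether MySQL can answer the query from the index alone.
    cur.execute("EXPLAIN SELECT user_id FROM users WHERE username = %s", ("xxxxx",))
    for row in cur.fetchall():
        print(row)   # look for "Using index" in the Extra column

    conn.close()

If Extra is empty (or only says "Using where"), MySQL still has to touch the data pages for each matching row.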
The query speed may be the same (I don't know whether that's true or not), but fetching the extra data from disk and returning it over a network will certainly be slower, especially if the unnecessary columns contain large strings or blobs.
Forget the speed of the database. What happens when you need to transmit the returned data across a network? Would you rather transmit 32 bytes per row, or 10kB?
Of course, that comes down to the scaling problem. The application I'm currently working on can't use ORMs (though we have rolled a few of our own read-only ones), because of 1) the size of the data we're going through, and 2) the lack of support for splitting reads and read/writes to different connections.
(1) is definitely a problem I've seen. SQLAlchemy, for example, takes forever to initialize objects representing even a single row. That's fine if you have fewer than 100 tables with fewer than 100 rows each and almost never join, but start scaling and you run into issues.
"forever to initialize a single row" is obviously some pretty severe hyperbole, but the slowness of loading rows into fully formed objects is largely a product of CPython's sluggishness in creating new objects. This also has nothing to do with how many rows are in the table, only how many rows you happen to be fetching over the wire for a particular query.
Performance is also something SQLAlchemy is obsessive about, so the speed of loading objects in SQLA has increased by orders of magnitude over the course of 0.4, 0.5, 0.6, 0.7. If your opinions are based on earlier versions, your impressions may be out of date.
Sorry to be vague. In my benchmarks it takes around 30 ms to initialize an instance of a model as the result of a query, using Python 2.6 and SQLAlchemy 0.6. This is in contrast to a raw MySQLdb query returning a dict, which takes 2-4 ms.
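Roughly this kind of comparison, to be concrete -- not the actual benchmark code, and the table, model and connection details here are invented -- timing raw MySQLdb dict rows against fully mapped SQLAlchemy objects:

    import time
    import MySQLdb
    import MySQLdb.cursors
    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    # Table layout and connection details are made up for illustration.

    # --- raw MySQLdb, rows come back as plain dicts ---
    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret",
                           db="appdb", cursorclass=MySQLdb.cursors.DictCursor)
    cur = conn.cursor()
    start = time.time()
    cur.execute("SELECT id, username, email FROM users LIMIT 1000")
    rows = cur.fetchall()                    # no mapper, no identity map, just dicts
    print("MySQLdb dicts:  %.1f ms" % ((time.time() - start) * 1000))

    # --- SQLAlchemy ORM, every row becomes a fully mapped object ---
    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        username = Column(String(64))
        email = Column(String(255))

    engine = create_engine("mysql://app:secret@localhost/appdb")
    session = sessionmaker(bind=engine)()
    start = time.time()
    users = session.query(User).limit(1000).all()   # one User instance per row
    print("SQLAlchemy ORM: %.1f ms" % ((time.time() - start) * 1000))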
MySQLdb is written in C. So you're comparing the speed of raw C code to pure Python (which runs on top of that C code). The vast difference in speed between pure C and Python has little to do with SQLAlchemy, which I can assure you generates those objects as fast as possible while still maintaining proper behavior.
Not really sure what you mean. MySQLdb, i.e. this project: http://mysql-python.sourceforge.net/FAQ.html ? When you ask this library to execute a query and return rows, the network communication with the MySQL server is done in C code, i.e. by _mysql. It doesn't make sense to say "_mysql is twice as fast as MySQLdb" - "_mysql" is an intrinsic component of it. The Python portion of MySQLdb does very little.
Here is what the FAQ included with the project says:
"MySQLdb is a thin Python wrapper around ``_mysql`` which makes it
compatible with the Python DB API interface (version 2). In reality,
a fair amount of the code which implements the API is in ``_mysql``
for the sake of efficiency."
In fact, if you do want a MySQLdb that's pure Python, there's an entirely separate project for that called pymysql: http://code.google.com/p/pymysql/. This library will be a lot slower than MySQLdb.
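As an aside, pymysql can even register itself under the MySQLdb name, which makes it easy to swap the two for a comparison. A minimal sketch (connection parameters are placeholders):

    import pymysql
    pymysql.install_as_MySQLdb()   # registers pymysql under the name "MySQLdb"

    import MySQLdb                 # now actually resolves to pymysql
    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="appdb")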
I think that would depend on the data types and sizes. A single row in a PG table cannot take up more than the page size (8 kB). If you have a 100 kB TEXT field, the row will instead contain a reference to an "overflow" table rather than the data itself. See http://developer.postgresql.org/pgdocs/postgres/storage-toas... -- presumably the corresponding TOAST data is not looked up unless needed (and you can check for e.g. IS NULL in the main row).
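Something along these lines shows how much of each table has been pushed out to its TOAST (overflow) relation. This is a sketch using psycopg2, though the same query works fine from psql; the database name is a placeholder:

    import psycopg2

    conn = psycopg2.connect("dbname=appdb")   # connection string is a placeholder
    cur = conn.cursor()

    # For every ordinary table that has a TOAST table, compare the size of the
    # main relation with the size of its overflow (TOAST) relation.
    cur.execute("""
        SELECT relname,
               pg_size_pretty(pg_relation_size(oid))           AS main_size,
               pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
          FROM pg_class
         WHERE relkind = 'r' AND reltoastrelid <> 0
         ORDER BY pg_relation_size(reltoastrelid) DESC
    """)
    for relname, main_size, toast_size in cur.fetchall():
        print("%s  main=%s  toast=%s" % (relname, main_size, toast_size))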
(Looking over my database, less than 1% of the 30-40 GB total is in overflow tables, so unless you routinely store large amounts of data it won't affect you.)
(I used to think the "toast" tables were temporary tables and wondered when Postgres would delete them).