If you have a Python CRUD app, there are many steps you should take to speed up the app before parallelizing queries.
The first couple that come to mind:
- Caching. The 90-10 rule applies to most CRUD apps. Use varnish to get rid of some requests before they hit your webserver or make a redis/memcache LRU layer to prevent queries from hitting your database.
- Shifting more work away from Python to the database. A surprising number of apps don't do pagination properly.
For example, if you want to show the top 10 X objects for each item Y, you don't need to pull back X * Y rows. You can use a window function to get the top 10 for each Y and get 10 * Y rows back instead (rough sketch at the end of this comment).
- Using PyPy if you can. It's a free performance boost, in most cases.
- Smarter indexing. Postgres's partial indexes are really powerful.
- Intermediate caches. If you have common GROUP BY queries, a materialized view could go a long way.
- If you reallllllly need it, you can reach for Cython and rewrite the slow parts of your app in C.
My point is - parallelizing queries should be one of your last steps in speeding up your app. It often adds a lot of complexity, and there are a lot of cheap, easy optimizations out there.
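For the window function point above, a rough sketch of what I mean (the x/y tables, the score column, and the psycopg2 connection string are all hypothetical; Postgres syntax):

import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    SELECT * FROM (
        SELECT y.id AS group_id, x.*,
               ROW_NUMBER() OVER (PARTITION BY y.id ORDER BY x.score DESC) AS rn
        FROM y JOIN x ON x.y_id = y.id
    ) ranked
    WHERE rn <= 10
""")
rows = cur.fetchall()  # 10 * Y rows come back, not X * Y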
> a surprising number of apps don't do pagination properly
Could you elaborate on that? What's "properly"? Do you mean that they don't do it at all, or they do it in memory instead of in an indexed query? Or is there a technique here that I'm missing?
A surprising number of people either make N queries because they don't know what their code is doing (really easy for a beginner to do with Django models when not using prefetch) or, more commonly, they make queries that get N rows back.
You almost never need N rows. N rows are bad because Python has to parse them. Odds are, there's a way to get a constant number of rows back for every query in every view of your webapp.
I've definitely seen the first - probably the first major performance issue that everyone runs into when using an ORM.
By "N rows", do you mean having a view that only renders a constant number of rows, but your database query returns an larger collection that's then filtered in Python code? That seems like an obvious bug that can be usually fixed with a LIMIT clause.
LIMIT/OFFSET isn't the optimal way to do pagination either, because the database has to scan past all of the offset rows before it can fetch the limit. I'm having a hard time finding the particular resource that dives into it, but here are the basics:
SELECT *
FROM t
OFFSET 1000 LIMIT 10; -- has to read and discard the first 1000 rows
vs
SELECT *
FROM t
WHERE id between 1000 and 1010; -- uses index
I was wondering what you do when say record 1001 gets deleted...your second query will only return 9 rows.
It's explained in your linked slides that LIMIT is fine; it's the OFFSET you need to worry about, so the query would be better written along these lines:
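(A rough keyset/"seek" sketch, not the slides' exact query; cur is a DB-API cursor and last_seen_id is the id the previous page ended on, assumed to be indexed.)

cur.execute("""
    SELECT *
    FROM t
    WHERE id > %s       -- walks the primary key index; no offset scan
    ORDER BY id
    LIMIT 10
""", (last_seen_id,))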
Interesting, I wasn't aware of this. Thanks for the explanation - I'll have to keep this in mind the next time I'm implementing pagination for a very large table.
A simple way to implement fast pagination is to set item.idx whenever you insert an item into a collection. It starts at 0 and gets incremented each time.
From there, given the number of items per page, you can trivially determine which items to display given a page number or which page to display given an item.idx. And the query uses the efficient WHERE.
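A minimal sketch of that scheme, assuming a hypothetical items table with an indexed (collection_id, idx) pair and a psycopg2-style DB-API cursor:

PAGE_SIZE = 25  # items per page, hypothetical

def fetch_page(cur, collection_id, page):
    # idx was assigned 0, 1, 2, ... at insert time, so a page is a fixed idx range
    lo = page * PAGE_SIZE
    cur.execute("""
        SELECT * FROM items
        WHERE collection_id = %s AND idx >= %s AND idx < %s
        ORDER BY idx
    """, (collection_id, lo, lo + PAGE_SIZE))
    return cur.fetchall()

def page_for(idx):
    # which page a given item.idx lands on
    return idx // PAGE_SIZE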
Of course, the last_seen pagination is only usable for pagination where users aren't deep linking into pages (like HN).
In many cases it's trickier than that too: if you're using UUIDs as the primary key, you'll need to find another column to use BETWEEN on and index it to make this work.
Plus, as stated in the slideshow, you can't provide jumps to arbitrary pages. In some cases this doesn't matter, but worth taking note of.
> - Shifting more work away from Python to the database. A surprising number of apps don't do pagination properly.
This is very true. Get as much as you can out of the database before doing work in Python. Creating objects such as new lists is very expensive in Python, especially when you are dealing with a large amount of data. Iterating over data just to do some aggregation that could be done in the DB is slow and expensive. Cache the result afterwards if you expect it to be useful.
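To make that concrete, a small sketch (conn is assumed to be a SQLAlchemy 1.x-era connection and orders a hypothetical table):

# slow: pull every row back into Python just to add the values up
total = sum(row[0] for row in conn.execute("SELECT amount FROM orders"))

# better: let the database aggregate and hand back a single row
total = conn.execute("SELECT SUM(amount) FROM orders").scalar()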
Async has never been about speed. Cooperative multitasking doesn't magically gzip your instruction pipeline or something (actually CM is suboptimal). It's just the only tool for those of us who want to avoid giving the keys to arguably the most important subsystem in the kernel to agents outside our control: The scheduler.
Ever tried to ssh into a one-thread-per-connection setup under heavy load? Assuming you managed to log in, it'll be very very difficult to get htop to execute when it's competing with 2k other processes for cpu time.
So, spawning a thread for every incoming connection request is a bad idea. But what about spawning a thread for every outgoing database connection?
The memory overhead of a thread compared to a TCP socket may be significant, but it's nothing compared to a database connection and all the other resources tied up to serve that one database query. And best of all, you can refuse to set up a database connection without upsetting your user and e.g. serve from cache instead, but you can't refuse an incoming connection request.
According to the Postgres guys [1]: "A formula which has held up pretty well across a lot of benchmarks for years is that for optimal throughput the number of active connections should be somewhere near ((core_count * 2) + effective_spindle_count)"
Assuming you got two tablespaces on two raid 1 arrays on a 16 core machine, that's 16 * 2 + 2 = 34 threads plus the event loop thread, maximum.
According to those numbers, there doesn't seem to be much stopping you from running your data access logic within a thread.
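A rough shape of that setup, assuming asyncio on the event loop side and a blocking driver underneath (the function names and the 16-core/2-spindle sizing are illustrative, not a recommendation):

import asyncio
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = (16 * 2) + 2  # the formula above, for a 16-core / 2-spindle box
db_pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

def run_query(sql, params):
    # the blocking driver call (psycopg2, MySQLdb, ...) would go here
    ...

@asyncio.coroutine
def handle_request(loop, sql, params):
    # hand the blocking call to the bounded pool; the event loop thread stays free
    result = yield from loop.run_in_executor(db_pool, run_query, sql, params)
    return result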
"Ever tried to ssh into a one-thread-per-connection setup under heavy load? Assuming you managed to log in, it'll be very very difficult to get htop to execute when it's competing with 2k other processes for cpu time."
Nothing about your comment is compatible with my experience. It is certainly not true that a Linux box with 2000 threads blocked on i/o will be having any sort of bad time. If you've really got 2000 threads competing for CPU time then your problem transcends execution architecture: you've simply admitted more work than you can reasonably discharge. Neither threads nor callbacks can solve this problem for you.
Also I'm not sure why you think that the kernel thread scheduler is all that good, or why it shouldn't be considered "outside our control." For my users the kernel thread scheduler is just another black object inside a dark box. The standard scheduler is pretty good for general purposes, but I doubt it's optimal for any particular case. In some loads it might be useful to yield to a specific thread that you think is holding a mutex your thread needs to acquire. This is cooperative multitasking, basically.
> It is certainly not true that a Linux box with 2000 threads blocked on i/o will be having any sort of bad time.
You and I must have very different perceptions about the way a server is "having a bad time" :)
I was assuming they were at various stages of processing an incoming request, which means they were blocked on either legitimate disk i/o or swapping. It's very difficult to log in even locally in that case, because the login process can't read /etc/passwd in time.
> Also I'm not sure why you think that the kernel thread scheduler is all that good, or why it shouldn't be considered "outside our control."
I was just trying to say that it's dangerous to give non-trusted peers big influence on the way the scheduler behaves.
OK, but it sounds like the main problem there is local disk access, which is the great satan anyway. You can generate a machine-hosing writeout workload using only one thread on Linux, because Linux loves to starve readers if it can write instead.
I think the unstated second dimension of your comment is that most operating system distributions come out of the installer with absolutely the wrong parameters for running many threads. I think the default thread stack size is 2MB still, and the socket buffers are all huge, and there are limits on how many processes you can have that make people think those limits are meaningful, when they're really not.
> Cooperative multitasking doesn't magically gzip your instruction pipeline or something (actually CM is suboptimal).
There are multiple meanings of speed in a concurrent system. Is it throughput of each individual connection? Average throughput? Is it minimal latency, or average latency?
What cooperative multi-tasking does is let all tasks make progress. Without having to have an explicit scheduling algorithm in each individual task.
There could be many reasons why ssh-ing might be slow in a one-thread-per-connection setup. It could be that memory is low. Maybe the machine is over capacity. Maybe it is good that your ssh is slow because some higher-priority business data is being transferred.
Threads also don't necessarily mean OS threads. Threads could mean green threads (as in Python greenlet coroutines), Go's goroutines, or Erlang's processes. Those are N:M threading models.
For example, WhatsApp was running with 2M concurrent TCP connections on their FreeBSD servers 3-4 years ago, each connection with a separate lightweight process. So the logical model works well. It is the platforms and languages that are behind, so to speak.
Note that the 'academic theorists' you cite would rather use an actor based system (and not events), and elsewhere in academia there's quite outspoken criticism of event-based systems[1] even for high-concurrency use cases. So, yes, threads have their problems, but event-based systems are not the panacea.
I just want to note that asynchronous programming is analogous to cooperative multitasking, as used in the Windows 3.1 era. It seems, if we value low latency, that we should not pursue that route for the long term, or we should be very cautious about it. For tasks that are not purely I/O bound, its use is questionable.
It is not analogous to cooperative multitasking, since Windows 3.1 dealt with different applications -- but asynchronous programming is targeting only a single (server) application.
I would also add that asynchronously implemented server applications oftentimes do better than systems that primarily rely on threading.
> but asynchronous programming is targeting only a single (server) application
As systems get more complex, this is not true anymore, because thinking otherwise would put the composability of the software at risk. Large software is typically built from many heterogeneous components.
> I would also add that asynchronously implemented server applications oftentimes do better than systems that primarily rely on threading.
In the case of purely I/O bound applications, you may be right. Otherwise, it is a matter of finding the "sweet spot": just because bubble sort is faster than quicksort for certain inputs does not mean it is the sort routine we should put at the center of our frameworks and build our software around.
Also note that asynchronously servicing N users, where each request has some probability p > 0 of "locking" the system for longer than expected, will increasingly become an unresponsive experience as N grows.
> Large software is typically built from many heterogeneous components.
That is just one of the reasons why software is getting more and more unreliable (not the asynchronous implementation). Software stacks are getting more and more complex and implementors lose track of the complexity.
Throwing dirt on software with clear and less complex implementations is not going to make things better.
Of course I would not argue that asynchronous implementation suits every kind of software, particularly with today's development tools, but simply comparing it with different kinds of things to make it look bad is not a proper style of argument.
Asyncio isn't a good fit for classic web frameworks like Django. These applications connect to an excellent, fast, local database, maybe a caching server, and that's it. Their HTTP requests are designed to be quick and stateless.
A modern application may be designed very differently. Websockets almost require some kind of asynchronous concurrency, especially beyond simple push-notifications. Talking to remote databases, micro-services or even big-data frameworks is a very different ball game: Latency and processing time can quickly add up, making asyncio concurrency more attractive.
Finally, user input is very asynchronous and slow. Asyncio lets you do stuff like "command = yield from interact(); if command == "start": ...", both in websockets and in GUI frameworks like kivy.
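Something like this, sketched with the websockets library (ws.recv()/ws.send() just stand in for whatever "interact()" is in your framework):

import asyncio

@asyncio.coroutine
def lobby(ws):
    command = yield from ws.recv()      # suspends until the user sends something
    if command == "start":
        yield from ws.send("starting game...")
    else:
        yield from ws.send("unknown command")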
I've been lucky enough to have benefited from SQLAlchemy and Mako (but it's been a while). Thanks.
This article looked like it was going to hit the sweet spot of stuff I'm curious about, but I found I was still left with questions. If you (or anyone) will indulge me... I'll try and ask a question to help clarify matters.
I work at a University on legacy ERP system(s). During registration there are 800+ concurrent connections, but normally it floats around 200. Almost all of these connections are idle. As you pointed out, a db may not be io/context bound (still hazy on that one). At the end of the day I consider myself technically astute, but basically a CRUD business programmer. I understand ACID and transactions; threads and async, maybe not so much.
Where I've always thought async could provide benefit would be in the following scenario. Our apps make a large number of procedural DB calls today. If after studying them I realize that many are independent (i.e. reads) and could be 'batched', could that not provide a big performance/latency improvement? I.e. instead of the serial sequence of calls that happens now (even inside a stored proc), async allows me to submit multiple SQL calls at once. That's what I'm calling a batch. In this ideal world, SQLAlchemy would take care of the details (perhaps with some guidance directives as to whether ordering of results is important) and assemble the results.
Is this not a possible future 'async' sqlalchemy with superior responsiveness? Don't threads block on each sql request?
if you want to send out a series of long-reply SQL calls and wait for them all in a batch, that is doable with Postgresql's async support, but they'd all be on distinct database connections, so you wouldn't get transactional consistency between these calls - but maybe that's not important. You can do the same thing with threads, though it would mean spinning up that many threads; at least it would be something you could test in the short term to see if it is in fact feasible.
The rudimentary SQLAlchemy-emulation system within aiopg right now can probably accommodate this use case but it is Postgresql specific. "Legacy ERP system" sounds like there's some different database in play there; if you are relying upon closed-source drivers you'd have to get access to a non-blocking API within them. Else you're stuck with threads.
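a rough sketch of the thread version of that "batch" (engine is an assumed SQLAlchemy Engine; the three query strings are placeholders for your independent reads):

from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    # each call checks out its own connection, so there is no
    # transactional consistency across the batch
    conn = engine.connect()
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

queries = [students_sql, courses_sql, holds_sql]  # hypothetical independent reads
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(run_query, queries))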
acveilleux's point about caching here is very relevant and I thought of that also. if these are indeed independently read sets of fairly non-changing data, pulling it from a cache is the more traditional approach to reducing latency (as well as the need for 800+ database connections).
Postgres lets you synchronize snapshots across connections so that they all see the same data (though subsequent changes are not visible to the other transactions unless you export another snapshot.)
http://www.postgresql.org/docs/9.4/static/functions-admin.ht...
This lets you parallelize work across multiple processes while maintaining consistency. When the client is the limiting factor you can use this with multiprocessing. When the db server is the limiting factor you can just use threads (or async.) Postgres backend processes are essentially single threaded.
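Roughly, with two psycopg2 connections (cur1/cur2 are cursors on separate connections, both assumed to be in autocommit mode so BEGIN starts the transaction explicitly; this is a sketch, not production code):

# connection 1: open a transaction and publish its snapshot
cur1.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
cur1.execute("SELECT pg_export_snapshot()")
snapshot_id = cur1.fetchone()[0]

# connection 2 (or a worker process): adopt the same snapshot before reading
cur2.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
cur2.execute("SET TRANSACTION SNAPSHOT %s", (snapshot_id,))
# both connections now see identical data for the duration of their transactions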
So caching. Doesn't the db do that? And as much as I hate to say it, are the added complexities (webserver caches) better than the even more traditional approach - throw hardware at it? Always lots to think about. Thanks!
When I was more junior I was always told to use caching as a last resort. It's a good attitude to take to make sure you're not doing something stupid and hiding it with caching. These days though I look for caching opportunities up-front. In fact, I'll design with them in mind.
I did some work for a client some time ago who were expecting a lot of read load. Their backend had a bunch of constantly changing data in mongo - but it only refreshed every 10 seconds. I told them initially to just output the aggregated data to an S3 object and have all the clients access it from there. They decided to run loads of servers instead; they were muttering something about AWS Autoscale (even though I told them that wouldn't help).
As expected, I got a call one Friday evening asking if I could take a look at why their servers were timing out. When I got there, there were about 15 frontend servers hammering 1 poor mongo box that was aggregating the same query again and again - and within any 10 second window always getting the same result. I stripped it down to 1 frontend box with an nginx cache (after jumping through a hoop to support jsonp).
After the dust settled they apparently didn't want to admit that it was something that could just be solved with caching so it was described as a configuration issue to the business.
> In practice, you'll end up with so many "yield from" lines in your code that you're right back to "well, I guess I could context switch just about anywhere", which is the problem you were trying to avoid in the first place.
This is backwards, like when my Java colleague complained I was putting "final" in too many places. The point isn't the lines that contain "yield from". It's the lines that don't contain "yield from". You can make an explicit choice that you don't want context switches to happen on particular lines. In traditional multithreading you can kinda-sorta do this with mutexes - but it's less efficient and, more importantly, more error-prone.
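A tiny sketch of what that buys you (accounts is shared in-process state; save_to_db is a hypothetical coroutine):

import asyncio

@asyncio.coroutine
def transfer(accounts, src, dst, amount):
    # no "yield from" between the check and the update, so no other task
    # can interleave here -- the read-modify-write is effectively atomic
    if accounts[src] >= amount:
        accounts[src] -= amount
        accounts[dst] += amount
    # a context switch can only happen at an explicit yield point:
    yield from save_to_db(accounts)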
> Database Code Handles Concurrency through ACID, Not In-Process Synchronization
Maybe. The problem is, not everything is (or should be) in the database. Even for CRUD webapps, we tend to end up needing a general-purpose programming language (otherwise you'd just write them in MS Access, no?) And so we need general-purpose language mechanisms for dealing with concurrency.
> The point isn't the lines that contain "yield from". It's the lines that don't contain "yield from"
What is the point though? Why does it matter where context switches happen (either CPU or IO ones)? You are doing the job of the scheduler which is like being teleported back to Windows 3.1. Adding "yield froms" turns your functions into generators. The job of the code is maybe to update shopping carts or send tweets or something like that; now that function returns a generator and everything on top of it has to deal with it.
The original point, which I think you missed, is that the typical advantage of "yield from" (or deferreds, or callbacks) is that you don't need to worry about synchronization. Look no mutexes, this is fantastic! Except that is wrong. As the application grows, top level code starts to look like:
r1 = yield from f1()
r2 = yield from f2(r1)
r3 = yield from f3(r2)
...
With callbacks it looks even uglier. That code is logically approaching the code that looks like:
r1 = f1()
r2 = f2(r1)
r3 = f3(r2)
...
In a multithreaded program without mutexes.
Therefore this thing exists: http://twistedmatrix.com/documents/8.1.0/api/twisted.interne... and I had to use it often enough. Because what happens is two clients would start 2 concurrent callback chains, and if they start updating some shared data (a database or internal structure) you've got a data race and you need to use the DeferredSemaphore.
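The asyncio analogue is the same idea, for what it's worth (read_current/write_back are hypothetical coroutines touching the shared data):

import asyncio

lock = asyncio.Lock()   # plays the role of Twisted's DeferredSemaphore(1)

@asyncio.coroutine
def update_shared(state, key, delta):
    with (yield from lock):     # only one "callback chain" inside at a time
        current = yield from read_current(state, key)
        yield from write_back(state, key, current + delta)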
> You are doing the job of the scheduler which is like being teleported back to Windows 3.1.
Indeed it is. The difference is that modern programming techniques are good enough that we can do this with minimal overhead. Like how modern fighter planes have gone back to being aerodynamically unstable.
> Therefore this thing exists: http://twistedmatrix.com/documents/8.1.0/api/twisted.interne.... and I had to use it often enough. Because what happens is two clients would start 2 concurrent callback chains, and if they start updating some shared data ( a database or internal structure ) you've got a data race and you need to use the DeferredSemaphore.
That's one approach, but as you say it simply recapitulates the problems of traditional multithreading. If that were the only option, we might as well use threads and mutices.
But there are other options. We can accumulate effects that need to happen as a single transaction through our async chain (state monad) and then execute them all at once, with no possibility of yielding in the middle. In languages with true concurrency we can use an actor; in single-threaded event-driven interpreters we don't even need that. We get a model that has the power of open-and-close transactions (whether they be database transactions or mutices), but is clearer and simpler to reason about.
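Sketching the accumulate-then-apply option (fetch_order/fetch_stock/apply_effects are hypothetical; the point is that the effects are applied in one synchronous block with no yield points inside it):

import asyncio

@asyncio.coroutine
def handle_order(order_id):
    # context switches can happen during the reads...
    order = yield from fetch_order(order_id)
    stock = yield from fetch_stock(order["item_id"])
    # ...but the effects are computed as plain data and applied in one
    # synchronous call, with no yield in the middle
    if stock["available"] >= order["qty"]:
        effects = [("decrement_stock", order["item_id"], order["qty"]),
                   ("mark_shipped", order_id)]
        apply_effects(effects)   # ordinary function; runs to completion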
> For database code, you have exactly one technique to use in order to assure correct concurrency, and that is by using ACID-oriented constructs and techniques.
Actually no. And in the transfer example he listed, it isn't even how banks do it in the real world. Mostly they use eventual consistency. And there are alternatives to ACID, albeit far less simple. You can use something like Zookeeper to handle transactions. Or if you are using an eventually consistent database (e.g. Cassandra), simply set the quorum such that you are querying all nodes.
i'm curious about which banks in the real world use an eventually consistent transaction management approach and if you'd be willing to cite your sources.
I look forward to the followup 'why the yield from needs to be a real, structural part of the program and not just, for example, a magic comment' because I have the same kind of question :)
> Python is Very, Very Slow compared to your database
I never actually tried to measure this, so I have 2 questions:
1. Have I understood correctly that author implies that webapp and DB are running on one server? But that's usually just not the case!
2. Does anybody have an idea of how we could compose a more or less realistic benchmark for this, preferably in a language-agnostic manner, so I could write a few scripts, connect a few databases on multiple hosts, run it and see how expensive all this stuff actually is, app-code/DB performance wise?
2. https://www.techempower.com/benchmarks/ is the best effort I've seen. Still has all the problems of benchmarks (that is, it should be seen as giving an upper bound for performance of a particular tech stack, which you will almost never reach in practice), but it at least gives you end-to-end numbers on something like a realistic problem.
> Have I understood correctly that author implies that webapp and DB are running on one server? But that's usually just not the case!
no. Please see the benchmark suite where I ran the async/threaded performance tests both on the same machine as well as on different machines. As for the PyMySQL example, I can assure you, running it on the same machine, different machines, whatever, you'll see something very similar as far as that Python profiling result.
So the basic take away seems to be: don't bother using async patterns for single, low latency connections to a server on your local network.
For anything where you're dealing with thousands of connections from random Internet hosts, "just spawn a thread for it" does not cut it. If you take that approach, you're setting yourself up to be accidentally DoS'd at some point in the near future. Async, on the other hand, has more than proven itself to be apt for this kind of scenario.
I'd want data. The system I work on does in fact spawn a thread to handle each and every connection and in fact each connection thread spawns numerous child threads to exploit available parallelism within the request. The code is fully blocking and linear and anyone can read it and see what it is doing. The mentioned system is one of the largest public networks services on earth.
I am very skeptical of the idea that you must not handle thousands of connections with a thread per connection. High tens of thousands of threads per core is the minimum level where I would start to worry.
At this point those who blindly advocate async programming as generally faster just show their level of proficiency (a lack thereof).
The fact that threads can be just as performant (or as we saw, even more performant) for IO code should not be surprising for anyone who knows how stuff works at the lower levels.
BTW this irrational "async is always webscale" crap has been happening in the Java community as well. There is a nice summary of the outcome:
"Thousands of Threads and Blocking I/O: The old way to write Java Servers is New again (and way better)"
Non-blocking IO based on select (and friends -- epoll, kqueue, ...) works well for very short callback chains. Think a proxy (haproxy, a webserver) or a one-page demo -- "Look Ma! I got a webscale server running in 3 lines of code!". Large business applications based on callback chains (even disguised as Deferreds, Futures and Promises) easily turn into a spaghetti mess.
Going back to asyncio. I am less optimistic about it and I never liked it. It is good that it tried to unify and standardize non-blocking IO. But we already had that, it is called Twisted. Twisted did "async is cool" before it was really cool. It is a fantastic framework (I used it for 5 years professionally) but in large code bases you feel its pain. BUT that is not the worst part; the worst part is it fragments the library ecosystem. This is really bad, especially for Python, since one can argue the ecosystem of libraries is what makes Python great. With Twisted I had to go find Twisted versions of drivers for databases. Now for asyncio I would have to look for asyncio versions of libraries.
For Python I like either the classic threads for IO or eventlet/gevent threads. BTW eventlet should work with PyPy as well. The latter are great because they do not fragment the library ecosystem but they rely on monkey-patching. I can pick threaded database drivers, monkey patch the socket code and it can work with green threads. Or not work, because monkey-patching breaks sometimes...
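The gevent flavor of that, roughly (connection details and the list of query strings are placeholders):

from gevent import monkey
monkey.patch_all()        # must run before anything else imports socket/ssl

import gevent
import pymysql            # pure-Python blocking driver; now cooperates with green threads

def fetch(sql):
    conn = pymysql.connect(host="db", user="app", password="...", db="app")
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()

jobs = [gevent.spawn(fetch, sql) for sql in queries]   # queries: assumed list of SQL strings
gevent.joinall(jobs)
results = [job.value for job in jobs]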
Even better, for larger concurrent applications I like channels and actors. Pick your eventlet green threads + queues. Or Go's channels. Or Akka. Or Erlang's processes. Clojure's STM is great as well. There are so many better abstractions for serious concurrent applications that if anyone picks callbacks as their default mechanism, they should be able to justify and rationalize it well (like, say, "I only know Javascript so I picked Node.js so I am using callbacks" or "I am building a proxy that maintains hundreds of thousands of TCP socket connections", etc.).
> With Twisted I had to go find Twisted versions of drivers for databases. Now for asyncio I would have to look for asyncio versions of libraries.
Isn't this an argument in favour of asyncio (in as much as it is/becomes the "one blessed/stdlib interface")? Migration will take time, of course. But that is the nature of change...
Agreed that is what I like about it. However it came a bit too late and it will force fragmentation of the libraries even more.
Guido at the time was going through an "async" phase, listening to Twisted developers and perhaps looking at the cool party across the street that Node.js was having, and he didn't listen to or entertain many other alternatives (supporting PyPy with STM, supporting or mainlining greenlet-based approaches like eventlet and gevent -- libraries that are probably the most commonly used for concurrent programming in Python).
If you need that much performance out of Python, it's probably time to switch to Go. With Python, you still have the Global Interpreter Lock, even in PyPy. Multiple CPUs, which you probably have available, don't help.
Multiple processes in Python at scale mean really clunky interprocess communication, lots of copies of everything the interpreter loaded, and lots of cache misses.
Multiple CPUs don't help for a single request, but the majority of deployments use processes (and threads within the processes) to run multiple requests in parallel. You can definitely saturate all the CPUs with a single Django application, provided the application server is running multiple processes.
Sure. But you aren't likely to be changing core business logic.
And believe me, the cost of retraining, recoding and retesting all of your code is going to be significantly higher than the cost of buying better hardware.
> ...the speed of Python is not nearly as fast as your database, when dealing in terms of standard CRUD-style applications...
The blanket assumptions that most developers are building "standard CRUD-style applications" and that the database is never a bottleneck make it difficult for me to take this post seriously. Sure, if you're running a simple to-do list app, you probably won't need asynchronous I/O, but in my experience, real-world business apps actually tend to use fairly complex database logic. And sure, there are ways of speeding up your app that don't involve touching the Python side (caching using Redis, denormalizing, etc), but they are a lot harder to implement than asynchronous I/O. Why give up on some free performance for many types of applications?
Also, the author mentions and then completely dismisses PyPy. Why? I've used PyPy on a production web app, and I can say for certain that from that point, the Python code completely ceased being the bottleneck at any point--the majority of the time spent was in the database. Again, I could have implemented better caching and restructured my schema, but at that point, dropping in asyncio would have been way easier and allowed the app to scale with much less effort.
There is definitely some validity to the post, but the author makes blanket assumptions that drastically weaken his argument.
I ran benchmarks for the app I mentioned. I don't have them on me because I no longer work for the company, but the database was the bottleneck on every request once we switched to PyPy. Again, I'm not saying your benchmarks are invalid, but the assumptions you make don't hold for all (or probably even the majority) of business web apps.
My post is aimed first and foremost at the Openstack community. Openstack is a large series of CRUD-based applications that don't run any stored procedures or business logic on the database. Whether or not the "majority" of business apps are stored-procedure oriented is very questionable; I've worked for years in finance and while I worked on stored procedure applications for sure, most apps were definitely just CRUD-based apps. The vast majority of apps we see for example in Django or Pyramid are most definitely CRUD-based apps.
Also, I use PyPy a lot, and at best I get about a 40-50% speedup. If we factor that into the profiling where I'm showing IO taking up about 5% of the time, that still produces an application that is very much CPU bound.
If there's any point to take from the post, it's the historical underpinnings of non-blocking IO: we use it to poll a bunch of sleeping connections when we have hundreds or thousands of them. It was never intended to be used on a small set of database connections, and I think even if the database calls are slow, it will be very seldom that you'd see asynchronous context switching handling concurrency better than native threads.
Anecdotally, I've seen quite a few Django apps using gevent "just because". If you don't need it, you're just incurring additional overhead for no gain (which is what the author is saying).
There are legitimate usage cases for async patterns, but I have seen them used even in situations where they made things slower.
Download my suite and run them! Show me asyncio beating out threads in some database-centric scenario. I was really hoping to see that happen in some scenario or another.
> there are ways of speeding up your app that don't involve touching the Python side (caching using Redis, denormalizing, etc), but they are a lot harder to implement than asynchronous I/O. Why give up on some free performance for many types of applications?
1) I can't say I find migrating blocking code to non-blocking code trivial. The bad thing about it is that you never know what you left behind. Blocking code doesn't tell you when it's blocking your event loop.
2) Async helps apps use less memory by doing away with thread overhead in connections, but is otherwise suboptimal (hinders latency) under higher cpu load scenarios. It never was a way of getting additional "free performance" for your app.
saying that it's the database is also a blanket statement and assumption. why was the majority of the time spent in the database? was it because of suboptimal code? was it because a subsystem has not been implemented properly? pointing fingers at the database type or the application server type is equally irresponsible.