Thoughts about Software Transactional Memory
I've been reading and talking about Software Transactional Memory (STM) a bit. Commentators are conflicted as to whether it can perform well, with some claiming its performance degrades terribly under heavy load.
I was wondering, can we provide the same tools that STM provides at the language level with a smarter underlying implementation? The performance complaint centers around the fact that STM is implemented as an optimistic system, where each transaction attempts to perform its work, and detects conflicts with others and retries. With a high level of contention this can mean a lot of wasted effort; it can also make the character of the system hard to understand, with surprise jumps (down) in performance as load approaches certain levels.
Pessimistic locking systems, on the other hand, have predictable performance. A process has to lock a series of locks, does the work, releases them. There's no uncertainty or (much) wasted effort -- just time wasted acquiring locks. (Of course some CPU cycles are spent actually in the locking code, but this shouldn't be significant). The time spent acquiring the locks isn't really opportunity cost if the locking system is well-designed. We would expect with a good system that even if some threads are blocked on locks, overall the system is efficiently used. At least we assume this insofar as the algorithm that the program embodies actually is parallizable in the first place.
So, I wonder, can any regular language provide an
(We can't do this kind of reasoning in a SQL database environment because database servers don't know in advance what kind of queries they will be asked to evaluate. However, a compiler does know all about the code it will be evaluating, so we would expect the optimal locking algorithm for programs to be different and better than a database server's approach.)
A simple algorithm can provide correct semantics: every
Let's say that our processes are represented as
We have some set of processes like
Let's approach from another direction. Rather than having a single global lock, let's give every variable its own lock. Thus the process
This is efficient, because we allow
What about guaranteeing progress? Let's say we run all the processes in parallel. Perhaps
It's easy to see that we have a full deadlock -- all the processes are waiting on one another. Of course, it is not a particularly likely circumstance, but we as professional software engineers don't build our systems upon wishful thinking -- right?! We need a different way to do things.
Really, the problem as described above is the Dining Philosophers problem, and only a small adjustment is needed to prevent deadlock. The trick is in that process
It sounds to me like this solution would be viable, but probably not efficient. One could imagine locking in order based on the memory locations of the objects (assuming they're constant) or imposing some other ordering scheme on object locks.
This would work if the amount of shared state is small (a couple of objects accessed per
From here we could imagine improving our algorithm, aggregating locks together to reduce their number but preserve the same atomic semantics. For example, we might say that we only need to lock an array to access objects contained in the array, rather than locking each individual object. (This would require assuming that the objects cannot be referenced any other way.) Where common locks exist, we could imagine coalescing them into a single lock. For example, if many processes access exactly two variables, we could imagine the algorithm determining that they both will be protected by the same lock. Another alternative is to have a single global lock which a process locks, then may obtain at once any number of other locks at once quickly. There are a lot of possible optimizations we could research to make this locking system faster.
A quick poll of some PhD friends suggests that there is probably not much research in this area. There's a lot to explore! We could even imagine a locking system that dynamically profiles itself and re-organizes the locking structure to maximize throughput. If the environment is low-level enough, the scheduler could even be aware of the performance characteristics and could schedule accordingly (schedulers already do know which threads are blocked on locks, e.g.).
When the programming language I am designing gets a bit further along I will attempt to implement a simple
Is STM the solution? STM is much easier when one is working in a pure functional programming language with little or no shared state in many places. However, given that procedural languages are used widely in the industry, I hope other researchers are as interested in these problems as I am!
Overall, STM feels a lot to me like Garbage Collection (GC). The goal of it isn't to make programs run more efficiently; the goal is to make programming easier and thereby use programmer time more efficiently. Working with locks is difficult and it's easy to make mistakes; if STM can alleviate this with only a 20% performance decrease, it would be a huge win. I say it's like GC because it's a similar tradeoff: we'll give up memory efficiency and even runtime performance, and even yet suffer possible random freezes as the GC kicks in, but in allowing these things we make the much larger gain of freeing programmers from the concern of memory management. Managing memory and managing locks: both meticulous tasks well suited for a computer.
By adopting GC we've freed ourselves from the tiring troubleshooting of obscure memory issues, and become more productive. I wonder whether freeing ourselves from similar concurrent bugs would bring about a similar but even greater productivity increase. If we gain a powerful Software Transactional Memory model, not only will developers free themselves from troubleshooting concurrency bugs, but we may find developers generally being more willing to write concurrent software.
Think about it! Writing concurrent software is hard -- why bother if it will only provide a minor speedup? Commentators often note that software like Microsoft Word won't speed up on multiple processors. It's not because this software couldn't be sped up by multiple processors, it's that the gain in performance is not worth the extra development cost (which is likely to be small for most desktop software). But if the added development cost is low, suddenly concurrent implementations become more viable.
That's where I see things heading -- mainstream programming languages make concurrency easy, not just in the Big Server platform (Erlang) but in the everyday desktop platform. Once there is first-class language support, concurrent software will become a lot more available.
However, developers will still need to develop concurrent algorithms, which would take extra effort. And there's still the question of whether compilers that optimized as I've described above could ever be implemented. Donald Knuth, in a recent interview with Andrew Binstock, expressed doubt about whether the highly concurrent hardware will be well-used:
Cheers to all the researchers out there working hard to make this pipe dream reality.
[Edit: I have recently found an interesting paper relevant to the topic: Advanced Contention Management for Dynamic Software Transactional Memory. I'll write a follow-up if it's particularly relevant.]
I was wondering, can we provide the same tools that STM provides at the language level with a smarter underlying implementation? The performance complaint centers around the fact that STM is implemented as an optimistic system, where each transaction attempts to perform its work, and detects conflicts with others and retries. With a high level of contention this can mean a lot of wasted effort; it can also make the character of the system hard to understand, with surprise jumps (down) in performance as load approaches certain levels.
Pessimistic locking systems, on the other hand, have predictable performance. A process has to lock a series of locks, does the work, releases them. There's no uncertainty or (much) wasted effort -- just time wasted acquiring locks. (Of course some CPU cycles are spent actually in the locking code, but this shouldn't be significant). The time spent acquiring the locks isn't really opportunity cost if the locking system is well-designed. We would expect with a good system that even if some threads are blocked on locks, overall the system is efficiently used. At least we assume this insofar as the algorithm that the program embodies actually is parallizable in the first place.
So, I wonder, can any regular language provide an
atomic { } block? Let's say I provide this feature in a language. I need to generate locks around the blocks to provide correct semantics; I need to optimize so that it runs efficiently.(We can't do this kind of reasoning in a SQL database environment because database servers don't know in advance what kind of queries they will be asked to evaluate. However, a compiler does know all about the code it will be evaluating, so we would expect the optimal locking algorithm for programs to be different and better than a database server's approach.)
A simple algorithm can provide correct semantics: every
atomic block locks a single global lock. This is trivially correct but not very efficient. Can we do better?Let's say that our processes are represented as
{A, B, C} where the block operates on some shared variables A, B, C. The particular operations or order is unimportant since we're guaranteeing that process has a transactional view of all variables. (Well, we could in theory optimize blocks down for the programmer, by detecting that (say) one atomic block actually could be split up into two. Let's ignore that for now or just assume a separate and correct algorithm for minimizing the data shared in atomic blocks.)We have some set of processes like
{A, B} {B, C} {C, D} {D, A}. We humans know that, for example, both process {A, B} and {C, D} could execute in parallel with no conflict -- but they don't, because a process can only execute its atomic block if it's holding the global lock. We want our STM system to be at least smart enough to run parallel tasks in parallel, so this isn't good enough.Let's approach from another direction. Rather than having a single global lock, let's give every variable its own lock. Thus the process
{D, A} must get the lock for D and A before proceeding.This is efficient, because we allow
{A, B} and {C, D} to run in parallel; they will each lock their respective locks and go on their way.What about guaranteeing progress? Let's say we run all the processes in parallel. Perhaps
{D, A} runs far enough to get the lock for D; next the process {C, D} runs, gets C, blocks on D; next {B, C} runs and blocks on C; and {A, B} runs and blocks on B. Let me describe each process and the locks it currently holds:{A, B}:A(waiting onB){B, C}:B(waiting onC){C, D}:C(waiting onD){D, A}:D(waiting onA)
It's easy to see that we have a full deadlock -- all the processes are waiting on one another. Of course, it is not a particularly likely circumstance, but we as professional software engineers don't build our systems upon wishful thinking -- right?! We need a different way to do things.
Really, the problem as described above is the Dining Philosophers problem, and only a small adjustment is needed to prevent deadlock. The trick is in that process
{D, A}. What we need is a locking convention; as long as there is a global total order on the order in which processes acquire locks, we are ensured that there is no deadlock. This formulation of the problem naturally lends itself to one: we can lock all locks in alphabetical order. This will work even if processes require many locks, like {A, B, C}, {B, C, D}, etc.It sounds to me like this solution would be viable, but probably not efficient. One could imagine locking in order based on the memory locations of the objects (assuming they're constant) or imposing some other ordering scheme on object locks.
This would work if the amount of shared state is small (a couple of objects accessed per
atomic), but what if the number is large? Could we use this for a game, if we need to (say) iterate through an array of all objects in the world and update their physics? This system breaks down if blocks need to access large numbers of objects.From here we could imagine improving our algorithm, aggregating locks together to reduce their number but preserve the same atomic semantics. For example, we might say that we only need to lock an array to access objects contained in the array, rather than locking each individual object. (This would require assuming that the objects cannot be referenced any other way.) Where common locks exist, we could imagine coalescing them into a single lock. For example, if many processes access exactly two variables, we could imagine the algorithm determining that they both will be protected by the same lock. Another alternative is to have a single global lock which a process locks, then may obtain at once any number of other locks at once quickly. There are a lot of possible optimizations we could research to make this locking system faster.
A quick poll of some PhD friends suggests that there is probably not much research in this area. There's a lot to explore! We could even imagine a locking system that dynamically profiles itself and re-organizes the locking structure to maximize throughput. If the environment is low-level enough, the scheduler could even be aware of the performance characteristics and could schedule accordingly (schedulers already do know which threads are blocked on locks, e.g.).
When the programming language I am designing gets a bit further along I will attempt to implement a simple
atomic block construct. One could imagine a simple naive global lock implementation for development followed by later optimization as necessary; it might be the case that a naive implementation is sufficient for many simple cases (like basic GUIs). If it is a useful way to program, hopefully we can discover solid implementations of the construct.Is STM the solution? STM is much easier when one is working in a pure functional programming language with little or no shared state in many places. However, given that procedural languages are used widely in the industry, I hope other researchers are as interested in these problems as I am!
Overall, STM feels a lot to me like Garbage Collection (GC). The goal of it isn't to make programs run more efficiently; the goal is to make programming easier and thereby use programmer time more efficiently. Working with locks is difficult and it's easy to make mistakes; if STM can alleviate this with only a 20% performance decrease, it would be a huge win. I say it's like GC because it's a similar tradeoff: we'll give up memory efficiency and even runtime performance, and even yet suffer possible random freezes as the GC kicks in, but in allowing these things we make the much larger gain of freeing programmers from the concern of memory management. Managing memory and managing locks: both meticulous tasks well suited for a computer.
By adopting GC we've freed ourselves from the tiring troubleshooting of obscure memory issues, and become more productive. I wonder whether freeing ourselves from similar concurrent bugs would bring about a similar but even greater productivity increase. If we gain a powerful Software Transactional Memory model, not only will developers free themselves from troubleshooting concurrency bugs, but we may find developers generally being more willing to write concurrent software.
Think about it! Writing concurrent software is hard -- why bother if it will only provide a minor speedup? Commentators often note that software like Microsoft Word won't speed up on multiple processors. It's not because this software couldn't be sped up by multiple processors, it's that the gain in performance is not worth the extra development cost (which is likely to be small for most desktop software). But if the added development cost is low, suddenly concurrent implementations become more viable.
That's where I see things heading -- mainstream programming languages make concurrency easy, not just in the Big Server platform (Erlang) but in the everyday desktop platform. Once there is first-class language support, concurrent software will become a lot more available.
However, developers will still need to develop concurrent algorithms, which would take extra effort. And there's still the question of whether compilers that optimized as I've described above could ever be implemented. Donald Knuth, in a recent interview with Andrew Binstock, expressed doubt about whether the highly concurrent hardware will be well-used:
To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the "Itanium" approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.
Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX.
How many programmers do you know who are enthusiastic about these promised machines of the future? I hear almost nothing but grief from software people, although the hardware folks in our department assure me that I’m wrong.
I know that important applications for parallelism exist—rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc. But all these applications require dedicated code and special-purpose techniques, which will need to be changed substantially every few years. ...
The machine I use today has dual processors. I get to use them both only when I’m running two independent jobs at the same time; that’s nice, but it happens only a few minutes every week. If I had four processors, or eight, or more, I still wouldn’t be any better off, considering the kind of work I do—even though I’m using my computer almost every day during most of the day. So why should I be so happy about the future that hardware vendors promise? They think a magic bullet will come along to make multicores speed up my kind of work; I think it’s a pipe dream.
Cheers to all the researchers out there working hard to make this pipe dream reality.
[Edit: I have recently found an interesting paper relevant to the topic: Advanced Contention Management for Dynamic Software Transactional Memory. I'll write a follow-up if it's particularly relevant.]


0 Comments:
Post a Comment
<< Home