A Brief Introduction to Multi-Threading in PHP
- Foreword
- Execution
- Sharing
- Synchronization
- Pitfalls
- WTF ??
Preface
If you are a PHP programmer who spends a lot of time at the console, or someone who is interested in high-performance, modern PHP programming, this document is for you.
The intention here is to provide information concise enough that you (and the community at large) remember it, in the hope that one day all of this will be common knowledge among PHP programmers.
By the end of the document, you should have a clear understanding of how and why pthreads exists, and how it executes.
If you have any comments, suggestions or insults please forward them to [email protected]
Insults will be ignored.
Foreword
Since PHP4, released May 22nd 2000, PHP has been equipped to execute isolated instances of the interpreter in multiple threads within a single process, without any context interfering with another. We call this TSRM (the Thread Safe Resource Manager); it is a rarely studied, omnipresent part of PHP that nobody really talks about.
If you have ever used XAMPP or PHP on Windows, it’s likely that you used a threaded PHP without even knowing it.
TSRM has the ability to create isolated instances of the interpreter, which is how pthreads executes userland threads in PHP. The instances of the interpreter are as isolated as they are when executing any threaded build of PHP, the Apache2 Worker MPM PHP5 Module for example. The job of pthreads is to facilitate communication and synchronization between the otherwise isolated contexts.
Exactly how TSRM works is beyond the scope of this document and would only confuse the reader (and the subject); suffice to say that PHP has been able to work in a multi-threaded environment for more than a decade. The implementation is stable. There is, however, one well known but completely misunderstood pitfall, which I shall clarify: PHP is a wrapper around third parties, and every part of PHP is implemented like this. If a third party does not implement their library in a re-entrant (thread safe) way, then the PHP wrapper for that library will fail and/or cause unexpected behaviour during execution. A well known example is locale handling, where the underlying setting belongs to the whole process rather than to a single thread. It should be clear that this is beyond the control of PHP or pthreads. Such libraries are well known (documented) and/or obvious; the vast majority of extensions will have no problem executing in a pthreads application.
Threading in user land was never a concern for the PHP team, and it remains as such today. You should understand that in the world where PHP does its business, there's already a defined method of scaling - add hardware. Over the many years PHP has existed, hardware has got cheaper and cheaper and so this became less and less of a concern for the PHP team. While it was getting cheaper, it also got much more powerful; today, our mobile phones and tablets have dual and quad core architectures and plenty of RAM to go with it, our desktops and servers commonly have 8 or 16 cores, 16 and 32 gigabytes of RAM, though we may not always be able to have two within budget and having two desktops is rarely useful for most of us.
In addition to the concerns of the PHP team, there are concerns of the programmer: PHP was written for the non-programmer, and it is many hobbyists' native tongue. The reason PHP is so easily adopted is that it is an easy language to learn and write. Multi-threaded programming is not easy for most; even with the most coherent and reliable API, there are different things to think about, and many misconceptions. The PHP group do not wish for user land multi-threading to be a core feature, and it has never been given serious attention - and rightly so. PHP should not be made complex for everyone.
All things considered, there are still benefits to be had from allowing PHP to utilize its production ready and tested features to allow a means of making the most out of what we have, when adding more isn't always an option, and for a lot of tasks is never really needed if you can take advantage of all you have.
A note about nothing, or more precisely, sharing nothing: The architecture of PHP is referred to as Shared Nothing. This simply means that whenever PHP services a request, via any SAPI, the environment for that request (the data structures PHP requires to operate) is isolated from that of every other request. On the surface, pthreads would appear to violate this standard and break the architecture that keeps PHP executing. Relax, this is not so. In fact, another job of pthreads (one that is never evident to the programmer) is to maintain that architecture; it does this using copy-on-read and copy-on-write semantics and carefully programmed mutex manipulation. The upshot is that any time a user reads from or writes to an object, or executes its methods, it is safe to assume that the operation was safe, and there is no need for further action such as the explicit use of a mutex by the programmer.
Terms in the foreword that are new to the reader should now be researched, as they may appear throughout this document.
Execution
Threading is about dividing your instructions into units of execution, and distributing those units among your processors and/or cores in such a way as to maximize the throughput of your application.
This should always be done using as few threads as possible.
pthreads exposes two models of execution: the Thread model and the Worker model. They expose much of the same functionality to the programmer and are internally very similar, with one key difference: what they consider to be the unit of execution.
A Thread is representative of both an interpreter context and a unit of execution (that’s its ::run method).
A Worker is representative of an interpreter context; its ::run method is used to configure that context. The unit of execution for this model is the Stackable, more precisely Stackable::run.
When the programmer calls Thread::start, a new thread is created, a PHP interpreter context is initialized and then (safely) manipulated to mirror the context that made the call to ::start. Execution continues concurrently in both contexts at this point. Execution in the Thread is passed to the ::run method of the Thread. At the end of the ::run method the context for the Thread is destroyed.
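As a minimal sketch of the Thread model just described, assuming a thread safe (ZTS) build of PHP with a pthreads release from the era of this document loaded: the class name Greeter and its members are hypothetical, and Thread::join, which this document does not otherwise discuss, is used to block until ::run has returned.

```php
<?php
// Guard: pthreads is an extension, not part of core PHP.
if (!extension_loaded('pthreads')) {
    exit("pthreads is not loaded; nothing to demonstrate\n");
}

// Hypothetical example class: one interpreter context, one unit of execution.
class Greeter extends Thread {
    public function __construct($name) {
        // Written in the creating context, before the thread exists
        $this->name = $name;
    }

    public function run() {
        // Executes in the new context once ::start has been called
        $this->greeting = "hello " . $this->name;
    }
}

$thread = new Greeter("world");
$thread->start();   // create the thread and its interpreter context
$thread->join();    // assumption: block here until ::run has returned

echo $thread->greeting, "\n";
```

Without the extension the script simply reports that and exits, so it can be tried on any build before committing to a threaded one.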
When the programmer calls Worker::start, a new thread is created and a PHP interpreter context is initialized in the same way as for a normal Thread. When execution in the Worker leaves Worker::run, the Worker begins to pop Stackables from the stack and execute them in the order they were stacked. If there are no items on the stack, the Worker will wait for some to appear. The Worker will continue to do this until Worker::shutdown is called. If Worker::shutdown is called while items remain on the stack, they will be executed first, and the context that called Worker::shutdown will block until shutdown can occur.
Great care should be taken to avoid wasting contexts unnecessarily; starting a Thread or Worker is not free. Where you can, use the Worker model; this almost eliminates the tendency to be wasteful while multi-threading. Almost, but not completely ...
There is a tendency to be wasteful; it’s a common misunderstanding to think that threading anything will make it faster. It will not. More threads do not always equate to more throughput, in the same way as more water does not always equate to wetter.
Thinking outside the box is a prerequisite of a good multi-threaded programmer; common sense should dictate that more water does mean wetter, but if you consider the central point of the bottom of the bowl: Once it is wet, it does not matter how much water you place on top, it cannot get wetter ...
Too much water, or threads, and you will drown.
The author of pthreads will not take responsibility for drowning programmers, or their code.
Sharing
Threading would be rather useless if threads could not manipulate a common set of data, which appears to be a problem in a shared nothing architecture. I don’t see shared nothing as a hindrance; I see it as a rather big helpful push in the right direction.
One of the normal problems for a programmer writing multi-threaded code is the safety and synchronization of data; it is normally very easy to corrupt an array if 10 threads manipulate it at once.
Shared Nothing solves this problem; if no two contexts ever manipulate the same data then they cannot corrupt each other's stack, and the architecture is maintained along with its stability.
Objects descending from pthreads utilize a thread safe member storage table that works slightly differently from that of other objects. When you write a member to such an object, the table is locked, the data is copied and stored in the table, and the lock is released. When a subsequent read of that member occurs, the table is locked, the data is copied for return, and the lock is released. This means that no two contexts ever manipulate the same physical data - Shared Nothing.
Some data does not lend itself to being easily copied; PHP has a solution to this in the form of the serialization API. Serialization is utilized on arrays, and on objects not descended from pthreads. Objects descended from pthreads are never serialized; as such, you should always use pthreads objects as containers for data you intend to manipulate in multiple contexts.
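The effect of copying by serialization can be seen in plain PHP, without pthreads at all. The sketch below is only an illustration of the idea, not a view into how pthreads stores members: it contrasts ordinary assignment, where both variables refer to one object, with a serialize/unserialize round trip, which produces a separate object. The variable names are hypothetical.

```php
<?php
// Plain PHP, no pthreads required.
$original = new stdClass();
$original->count = 1;

// Ordinary assignment copies the handle: both names refer to one object.
$alias = $original;
$alias->count = 2;

// A serialization round trip builds a separate object from a string.
$copy = unserialize(serialize($original));
$copy->count = 3;

var_dump($original->count);     // int(2): changed through $alias, not through $copy
var_dump($copy->count);         // int(3)
var_dump($copy === $original);  // bool(false): two distinct objects
```

Under the model described above, this is why a change made to a copy you have read is not seen by anyone else until you write it back.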
All objects descending from pthreads can be manipulated, by any context with a reference, as arrays and objects, they also include methods for manipulating members in a thread safe manner. There shouldn’t be a kind of data set you cannot implement with what is exposed by pthreads, and basic sets (arrays) are built in.
This is all done in such a way that minimizes memory usage while still maintaining architecture and safety. It may seem wasteful, but it’s a small price to pay, that diminishes with the price of memory.
Synchronization
Sharing isn’t enough; the last piece of the puzzle is synchronization. This is going to be a topic completely alien to a lot of programmers.
While you are executing, and sharing, you must also be able to control when to share, and when to execute; it is no good trying to manipulate data that does not exist!
Synchronization can be used to put a thread into a receptive, but sleepy state, known as waiting, and can be used to awaken such a thread, known as notifying.
Synchronizing with a unit of execution is easy, but it comes with a danger of misuse. I hope to give a brief, simple explanation that will stick in your mind and help you to avoid it.
Make this your mantra: Only ever wait FOR something.
    $this->synchronized(function(){
        $this->wait();
    });
The above code looks simple enough, but what or who is it waiting for, and what happens if whatever it is waiting for has already sent notification? Waiting forever is the price for not paying attention to your own mantra.
The syntax of synchronization may look a bit strange; here’s an explanation that gives you a good reason to keep typing all that stuff: when you call ::synchronized, a mutex (lock) is acquired. When you call ::wait, that mutex is atomically released so that other contexts can acquire it while the waiting context blocks on a condition, waiting for notification; the mutex is acquired again before the waiting context continues.
Waiting for something looks like this:
    $this->synchronized(function(){
        if (!$this->data) {
            $this->wait();
        }
    });
    /* I can manipulate $this->data and know it exists */
While notification looks like this:
    $that->synchronized(function($that){
        $that->data = "some";
        $that->notify();
    }, $that);
In the notification example, the waiting context cannot be left hanging around forever. If the notifier acquires the synchronization lock before the other context has started waiting, that context need not wait at all: by the time it can acquire the lock the data is already set, so the check fails and ::wait is never called. If instead the notifier acquired the lock because the waiting thread atomically released it in ::wait, the call to ::notify awakens the waiting thread, which then continues executing.
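Putting the waiting and notifying halves together gives something like the sketch below. It assumes a thread safe build with a pthreads release from the era of this document loaded; the class name Consumer and the member result are hypothetical, and Thread::join (not covered in this document) is assumed as the way to wait for ::run to finish.

```php
<?php
// Guard: pthreads is an extension, not part of core PHP.
if (!extension_loaded('pthreads')) {
    exit("pthreads is not loaded; nothing to demonstrate\n");
}

// Hypothetical class: waits FOR $this->data, then uses it.
class Consumer extends Thread {
    public function run() {
        $this->synchronized(function () {
            // The mantra: only wait if the thing we need is not there yet
            if (!$this->data) {
                $this->wait();
            }
        });
        // From here on $this->data is known to exist
        $this->result = strtoupper($this->data);
    }
}

$consumer = new Consumer();
$consumer->start();

// The creating context sets the data and notifies while holding the same lock.
$consumer->synchronized(function ($that) {
    $that->data = "some";
    $that->notify();
}, $consumer);

$consumer->join();   // assumption: block until ::run has returned
echo $consumer->result, "\n";
```

Whichever context wins the race for the lock, the Consumer either finds the data already set and skips ::wait, or is woken by ::notify.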
This kind of explicit synchronization can make for powerful programming, study it well.
Pitfalls
The garbage collection built into PHP was never prepared for this kind of prolonged execution. If pthreads followed the PHP way and edited the reference counts of objects when it accepted them (as an argument to a method, or as the data for a member property), memory usage would soar and it would become difficult to retain control of your own code.
So we do not do the done thing; in a pthreads application, you are responsible for the objects you create, you are also responsible for retaining a reference to objects that are going to be executed, or accessed from other executing contexts, until that execution or access has taken place.
This circumvents the problem of out of control memory usage, but it creates another problem: dreaded segfaults.
Segmentation faults occur when you instruct a processor to address memory that it cannot access; they abort execution. The prime suspect when you encounter segmentation faults during development is a reference to an object that has already been destroyed in the context that originally created it.
Avoiding these segmentation faults sounds much more complex than it really is; this is best illustrated with a (bad) example:
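    class W extends Worker {
        public function run(){}
    }
    class S extends Stackable {
        public function run(){}
    }
    /* 1: create the Worker */
    $w = new W();
    /* 2: create the jobs */
    $j = array(
        new S(), new S(), new S()
    );
    /* 3: stack the jobs; stacking does not retain them for you */
    foreach ($j as $job)
        $w->stack($job);
    /* 4: BAD, the array is emptied before the Worker has executed the jobs */
    $j = array();
    $w->start();
    $w->shutdown();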
The above example will always segfault; steps 1-3 are perfectly normal, but before the Worker is started the stacked objects are deleted, resulting in a segfault when the Worker is allowed to start. Your code will not always look so explicit, but if you can see a route where this could conceivably happen, then program a different way.
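For contrast, here is a sketch of the same program written so that the references outlive the execution. It assumes a pthreads release in which Worker and Stackable exist as used above; the member done is hypothetical and only there so that completion can be observed.

```php
<?php
// Guard: pthreads is an extension, not part of core PHP.
if (!extension_loaded('pthreads')) {
    exit("pthreads is not loaded; nothing to demonstrate\n");
}

class W extends Worker {
    public function run() {}
}

class S extends Stackable {
    public function run() {
        $this->done = true;   // hypothetical member, set so completion can be seen
    }
}

$w = new W();

// $jobs keeps every Stackable alive for as long as the Worker may touch it.
$jobs = array(new S(), new S(), new S());

foreach ($jobs as $job) {
    $w->stack($job);
}

$w->start();
$w->shutdown();   // blocks until every stacked job has been executed

// Only after shutdown has returned is it safe to let go of $jobs.
foreach ($jobs as $job) {
    var_dump($job->done);
}
```

The only real change from the bad example is where the array is released: after Worker::shutdown has returned rather than before Worker::start.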
Other symptoms of this kind of programming error are the fatal error
Call to a member function member() on a non-object in /my/code.php
and the notice
Trying to get property of non-object in /my/code.php
If you experience these errors, carefully look over your code and make sure everything you have passed to any other context exists all the time it is being referenced or executed in any other context.
This is probably the hardest part of creating applications with pthreads, but it doesn't take a lot to avoid; plan with care, and program with even more care.
WTF ??
I hear the criticism that I have taken something simple, that's PHP, and made it more complex by exposing this kind of functionality. I hear you; I would argue that I have taken something complex, and made it relatively simple.
Something being complex, or difficult, is no kind of justification for avoiding it. The complexity of anything should decrease as your knowledge increases; if it does not, then you are not taking in the right kind of information. This is the nature of learning.
To the idea that I haven't made anything simple; oh rly? If the task is simple: get two things done at once, the implementation is simple. The fact that you are even considering complex ideas is the thing you should be paying attention to!!
To the rest of the nay-sayers: Progress is made by pushing forwards, when we all push at once, we make more progress !!
Even if you hate the idea, I hope I've said enough to convince you to give it a try before you form a long lasting opinion that will affect your decisions in the future, what is the worst that can happen !?
What you mean is that worker doesn't refcount (making it a bit like weakref).
I don't see why the PHP thread system is too complex or confusing. It just makes the PHP engine go from being like a singleton to something you can have multiple instances of. For each instance you can have a thread as each instance is isolated and has no access to anything from any other instance.
The obvious problem with that is that each thread has its own data. If you want more than one thread working on the data then you have to copy it. Basically what you get is like a mini-fork. Instead of copying the entire process and all its data you only copy the data you need your instance to work on but it's still a copy. That means that either you'll still have duplication and may also need some kind of ITC (inter thread/instance communication, SHM, etc) for more complex scenarios or to avoid the duplication. That's at least the "safe" option.
pthreads takes the opposite approach as far as I can tell from your example. It allows data access between threads but leaves it up to you to make sure it's safe. The same as in C or any language where you need to decide where to have locking or other relevant mechanisms, but in PHP it's probably more difficult because of all the things that can happen under the hood that you aren't directly privy to like you are in C.
There are also hybrid approaches of all sorts not explored (besides making PHP actually threaded). If you separate call from return in a C library PHP extends then it's easy to have it use threads, though it tends to be harder to get libraries to work well together (though masquerading as a stream could do it). This isn't just a problem for threaded but also for asynchronous PHP. A lot of multiplexed libraries work well on their own but not against other libraries. Other hybrid approaches closer to real threading might, rather than copying data, transfer ownership. It would be like, in effect, calling unset on it in the current instance and setting it in the new instance, handing over your pointer. Things like that however only work in certain situations (like having a refcount of 1 for example, sort of like an unset where if you'd GC you can hand it over instead).
For the missing refs, it might then be normal for a programmer to hold a reference to everything. Now you have another problem. Now you have memory leaks, and these are harder to detect as you never get a segfault or error message until ages later down the line where you have OOM all of a sudden. The difficulty in this is that you can't just ref all the things. You really want to get it perfect. That's nothing to do with PHP. You'll get memory leaks doing that in any language with heavily managed memory: Java, JavaScript, Perl, Python, etc.
What I don't get is your explanation for it. PHP is extremely good for running for a long time and that was by and large the case even in 2012, though to be fair PHP at that time hadn't been at its newly achieved level of stability for long.
"The garbage collection built into PHP was never prepared for this kind of prolonged execution"
This really tends not to be true. I've been making long running and very intensive processes with a variety of complex loads also including various IO such as sockets (async) for easily half a decade now. I find that PHP does a pretty good job with high uptime. Most memory leaks I've experienced are the few cases I've gotten it wrong when writing extensions and usually from frameworks which are made only with the request model in mind.
I strongly doubt even in 2012 that's the reason. It really makes no sense to be honest. It seems more like to me the library just opted for maximum speed and programmer responsibility for stability or correct function (the alternative, making PHP truly thread safe, just not being viable). Or some internals kludge.