Even back then, I had a fascination with massive distributed parallelism, despite the fact that my Apple IIe would never get me there.
It wasn't until my fourth year of university that I had the chance to use a real parallel computer for my honors thesis project. I was able to use a handful of Transputers - processors with four high-speed serial interconnects that could be joined together like carbon atoms. I built a cube-shaped topology and used it to do an A* best-first search to solve the knapsack problem. Back then, I used a language called Occam, which was based on Hoare's Communicating Sequential Processes [1]. I discovered two interesting things as part of this project.
Discovery #1 - The two-colored algorithm for distributed termination detection
The first thing I discovered was that when there is no global state, it is surprisingly hard to determine when you are done. Whenever one of the processors finished all of its work, it would ask its neighbors for more to do; that way, all the processors stayed busy until all the work was done. The problem was that if you passed around an "are you done?" message asking each processor whether it was finished, each might answer "yes", but between the time processor N answered "yes" and the time the remaining processors answered, processor N might have picked up work from processor M. Meanwhile, processor M finishes its own work before the done request arrives and answers "yes, I'm done" as well. In other words, the remaining work might play hide-and-seek with the "are you done?" messages.
The solution to this problem was something I found in the literature [2]. Basically, you form a Hamiltonian cycle out of the processors (that is, a circular path that goes through each processor exactly once).
Then you give a yellow token to one of the nodes and whenever it runs out of work, it passes the yellow token to the next node in the Hamiltonian circuit. Eventually, the yellow token arrives back at the original node.
If the original node has been idle the whole time, it converts the token to a red token and passes it around again. If the red token arrives back at the original node and it has still been idle the whole time, then processing is done.
On the other hand, if a node receives a token (either red or yellow) and it hasn't been idle, it converts the token back to a yellow token, and the process repeats.
Discovery #2 - Super-linear speedup
The second thing I discovered was something that was hard to explain at first. I had expected that as the number of processors increased, the efficiency of each processor would get worse and worse; in other words, eight processors might perform only 6x as fast as one processor. In actual fact, I observed the opposite: eight processors performed more than 8x as fast as one processor. This seemed to defy the laws of nature. As it turns out, my sub-par choice of data structures [3], plus the non-deterministic jitter added by parallelism, led to a situation where the parallel algorithm was superior to the sequential algorithm. In fact, I was able to reproduce the same super-linear speedup by running eight processes on a single processor. I was later able to confirm that other researchers had independently observed the same phenomenon. [4]
Conclusion
Parallelism leads to unexpected new problems and phenomena. Modern tools and libraries have changed things a lot since my early days (more on this in a future post), but it is still surprising what you will discover.

---
[1] Interesting fact: the Go language uses the same paradigm. (http://golang.org/)
[2] Oliver Vornberger, Transputer networks for operations research problems, Journal of Microcomputer Applications, Volume 13, Issue 1, January 1990, Pages 69-79, ISSN 0745-7138, 10.1016/0745-7138(90)90007-T. (http://www.sciencedirect.com/science/article/pii/074571389090007T)
[3] Tip of the day: a heap is better than a binary tree for implementing a priority queue.
[4] A search for "super-linear speedup" will give you lots of examples.