Parallel Computing
Most modern computers possess more than one CPU, and several computers can be combined together in a cluster. Harnessing the power of these multiple CPUs allows many computations to be completed more quickly. There are two major factors that influence performance: the speed of the CPUs themselves, and the speed of their access to memory. In a cluster, it’s fairly obvious that a given CPU will have fastest access to the RAM within the same computer (node). Perhaps more surprisingly, similar issues are relevant on a typical multicore laptop, due to differences in the speed of main memory and the cache. Consequently, a good multiprocessing environment should allow control over the “ownership” of a chunk of memory by a particular CPU. Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processes in separate memory domains at once.
Julia’s implementation of message passing is different from other environments such as MPI [1]. Communication in Julia is generally “one-sided”, meaning that the programmer needs to explicitly manage only one process in a two-process operation. Furthermore, these operations typically do not look like “message send” and “message receive” but rather resemble higher-level operations like calls to user functions.
Parallel programming in Julia is built on two primitives: remote
references and remote calls. A remote reference is an object that can
be used from any process to refer to an object stored on a particular
process. A remote call is a request by one process to call a certain
function on certain arguments on another (possibly the same) process.
A remote call returns a remote reference to its result. Remote calls
return immediately; the process that made the call proceeds to its
next operation while the remote call happens somewhere else. You can
wait for a remote call to finish by calling wait() on its remote
reference, and you can obtain the full value of the result using
fetch(). You can store a value to a remote reference using put!().
Let’s try this out. Starting with julia -p n provides n worker
processes on the local machine. Generally it makes sense for n to
equal the number of CPU cores on the machine.

    $ ./julia -p 2

    julia> r = remotecall(2, rand, 2, 2)
    RemoteRef(2,1,5)

    julia> fetch(r)
    2x2 Float64 Array:
     0.60401   0.501111
     0.174572  0.157411

    julia> s = @spawnat 2 1 .+ fetch(r)
    RemoteRef(2,1,7)

    julia> fetch(s)
    2x2 Float64 Array:
     1.60401  1.50111
     1.17457  1.15741

The first argument to remotecall() is the index of the process
that will do the work. Most parallel programming in Julia does not
reference specific processes or the number of processes available,
but remotecall() is considered a low-level interface providing
finer control. The second argument to remotecall() is the function
to call, and the remaining arguments will be passed to this
function. As you can see, in the first line we asked process 2 to
construct a 2-by-2 random matrix, and in the second line we asked it
to add 1 to it. The result of both calculations is available in the
two remote references, r and s. The @spawnat macro
evaluates the expression in the second argument on the process
specified by the first argument.
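The session above does not exercise wait() or put!(). The following is a minimal sketch of both, assuming the Julia 0.4-era API used throughout this section (remotecall(pid, f, args...) and RemoteRef). The variable names w, total and rr are illustrative only, and first(workers()) is used so that the snippet also runs when no extra workers were started.

```julia
# Sketch only: assumes the 0.4-era API described in this section.
w = first(workers())            # a worker id (process 1 if no workers were added)

r = remotecall(w, sum, 1:100)   # returns immediately with a remote reference
wait(r)                         # block until the remote computation has finished
total = fetch(r)                # move the result to the calling process

rr = RemoteRef(w)               # an empty reference whose storage lives on process w
put!(rr, total)                 # store a value into it
stored = take!(rr)              # read the value back and remove it from the reference
```

Here fetch() leaves the value in place, whereas take!() removes it, so rr is empty again after the last line.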
Occasionally you might want a remotely-computed value immediately. This
typically happens when you read from a remote object to obtain data
needed by the next local operation. The function remotecall_fetch()
exists for this purpose. It is equivalent to fetch(remotecall(...))
but is more efficient.

    julia> remotecall_fetch(2, getindex, r, 1, 1)
    0.10824216411304866

Remember that getindex(r,1,1) is equivalent to r[1,1], so this call
fetches the first element of the remote reference r.
The syntax of remotecall() is not especially convenient. The macro
@spawn makes things easier. It operates on an expression rather than
a function, and picks where to do the operation for you:

    julia> r = @spawn rand(2,2)
    RemoteRef(1,1,0)

    julia> s = @spawn 1 .+ fetch(r)
    RemoteRef(1,1,1)

    julia> fetch(s)
    1.10824216411304866 1.13798233877923116
    1.12376292706355074 1.18750497916607167

Note that we used 1 .+ fetch(r) instead of 1 .+ r. This is because we
do not know where the code will run, so in general a fetch() might be
required to move r to the process doing the addition. In this
case, @spawn is smart enough to perform the computation on the
process that owns r, so the fetch() will be a no-op.
(It is worth noting that @spawn is not built-in but defined in Julia
as a macro. It is possible to define your own such constructs.)
Code Availability and Loading Packages
Your code must be available on any process that runs it. For example, type the following into the Julia prompt:

    julia> function rand2(dims...)
               return 2*rand(dims...)
           end

    julia> rand2(2,2)
    2x2 Float64 Array:
     0.153756  0.368514
     1.15119   0.918912

    julia> fetch(@spawn rand2(2,2))
    ERROR: On worker 2:
    function rand2 not defined on process 2

Process 1 knew about the function rand2, but process 2 did not.
Most commonly you’ll be loading code from files or packages, and you
have a considerable amount of flexibility in controlling which
processes load code. Consider a file, "DummyModule.jl", containing
the following code:

    module DummyModule

    export MyType, f

    type MyType
        a::Int
    end

    f(x) = x^2+1

    println("loaded")

    end

Starting Julia with julia -p 2, you can use this to verify the following:
- include("DummyModule.jl") loads the file on just a single process (whichever one executes the statement).
- using DummyModule causes the module to be loaded on all processes; however, the module is brought into scope only on the one executing the statement.
- As long as DummyModule is loaded on process 2, commands like

      rr = RemoteRef(2)
      put!(rr, MyType(7))

  allow you to store an object of type MyType on process 2 even if DummyModule is not in scope on process 2.
You can force a command to run on all processes using the @everywhere macro.
For example, @everywhere can also be used to directly define a variable or function on all processes:

    julia> @everywhere id = myid()

    julia> remotecall_fetch(2, ()->id)
    2

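As a complement, here is a minimal sketch of using @everywhere to define a function on every process, which avoids the "not defined on process 2" error shown earlier for rand2. It assumes the 0.4-era API of this section; the variable name A is illustrative.

```julia
# Define rand2 on all processes, not only on the one at the prompt.
@everywhere function rand2(dims...)
    return 2*rand(dims...)
end

# Any process chosen by @spawn now knows rand2.
A = fetch(@spawn rand2(2, 2))
println(size(A))
```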
A file can also be preloaded on multiple processes at startup, and a driver script can be used to drive the computation:

    julia -p <n> -L file1.jl -L file2.jl driver.jl

Each process has an associated identifier. The process providing the interactive Julia prompt always has an id equal to 1, as would the Julia process running the driver script in the example above. The processes used by default for parallel operations are referred to as “workers”. When there is only one process, process 1 is considered a worker. Otherwise, workers are considered to be all processes other than process 1.
The base Julia installation has in-built support for two types of clusters:
- A local cluster specified with the -p option as shown above.
- A cluster spanning machines using the --machinefile option. This uses a passwordless ssh login to start Julia worker processes (from the same path as the current host) on the specified machines.
Functions addprocs(), rmprocs(), workers(), and others are available as a programmatic means of
adding, removing and querying the processes in a cluster.
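A minimal sketch of managing processes programmatically, assuming the 0.4-era behaviour in which addprocs(n) starts n local workers and returns their ids. The names before and new_ids are illustrative.

```julia
before = nprocs()              # number of processes before adding workers

new_ids = addprocs(2)          # start two local workers; returns their process ids
println("added workers: ", new_ids)
println("all workers:   ", workers())
println("process count: ", before, " -> ", nprocs())

rmprocs(new_ids)               # ask the newly added workers to shut down
```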
Note that workers do not run a .juliarc.jl startup script, nor do they synchronize their global state
(such as global variables, new method definitions, and loaded modules) with any of the other running processes.
Other types of clusters can be supported by writing your own custom
ClusterManager, as described below in the ClusterManagers section.
Data Movement
Sending messages and moving data constitute most of the overhead in a parallel program. Reducing the number of messages and the amount of data sent is critical to achieving performance and scalability. To this end, it is important to understand the data movement performed by Julia’s various parallel programming constructs.
fetch() can be considered an explicit data movement operation, since
it directly asks that an object be moved to the local machine.
@spawn (and a few related constructs) also moves data, but this is
not as obvious, hence it can be called an implicit data movement
operation. Consider these two approaches to constructing and squaring a
random matrix:

    # method 1
    A = rand(1000,1000)
    Bref = @spawn A^2
    ...
    fetch(Bref)

    # method 2
    Bref = @spawn rand(1000,1000)^2
    ...
    fetch(Bref)

The difference seems trivial, but in fact is quite significant due to
the behavior of @spawn. In the first method, a random matrix is
constructed locally, then sent to another process where it is squared.
In the second method, a random matrix is both constructed and squared on
another process. Therefore the second method sends much less data than
the first.
In this toy example, the two methods are easy to distinguish and choose
from. However, in a real program designing data movement might require
more thought and likely some measurement. For example, if the first
process needs matrix A then the first method might be better. Or,
if computing A is expensive and only the current process has it,
then moving it to another process might be unavoidable. Or, if the
current process has very little to do between the @spawn and
fetch(Bref) then it might be better to eliminate the parallelism
altogether. Or imagine rand(1000,1000) is replaced with a more
expensive operation. Then it might make sense to add another @spawn
statement just for this step.
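One rough way to do such a measurement is sketched below. The helper compare_spawn_costs is hypothetical (it is not part of Julia), and the timings include the cost of fetching the result back in both cases, so they only give a coarse comparison. The matrix is created inside a function so that it is a local variable captured by @spawn.

```julia
# Hypothetical helper: time the two approaches for an n-by-n matrix.
function compare_spawn_costs(n)
    A = rand(n, n)
    # method 1: A is built locally and shipped to wherever @spawn runs
    t1 = @elapsed fetch(@spawn A^2)
    # method 2: the matrix is built and squared where @spawn runs
    t2 = @elapsed fetch(@spawn rand(n, n)^2)
    (t1, t2)
end

t1, t2 = compare_spawn_costs(1000)
println("method 1: ", t1, " s, method 2: ", t2, " s")
```

The first call also pays compilation cost, so running the comparison more than once gives more representative numbers.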
Parallel Map and Loops
Fortunately, many useful parallel computations do not require data
movement. A common example is a Monte Carlo simulation, where multiple
processes can handle independent simulation trials simultaneously. We
can use @spawn to flip coins on two processes. First, write the
following function in count_heads.jl:

    function count_heads(n)
        c::Int = 0
        for i=1:n
            c += rand(Bool)
        end
        c
    end

The function count_heads simply adds together n random bits.
Here is how we can perform some trials on two processes, and add together the
results:

    require("count_heads")

    a = @spawn count_heads(100000000)
    b = @spawn count_heads(100000000)
    fetch(a)+fetch(b)

This example demonstrates a powerful and often-used
parallel programming pattern. Many iterations run independently over
several processes, and then their results are combined using some
function. The combination process is called a reduction, since it is
generally tensor-rank-reducing: a vector of numbers is reduced to a
single number, or a matrix is reduced to a single row or column, etc. In
code, this typically looks like the pattern x = f(x, v[i]), where
x is the accumulator, f is the reduction function, and the
v[i] are the elements being reduced. It is desirable for f to be
associative, so that it does not matter what order the operations are
performed in.
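The reduction pattern can be written out by hand as in the sketch below, which assumes the 0.4-era API of this section. It defines count_heads with @everywhere instead of loading a file, and the names refs and nheads are illustrative.

```julia
# Make count_heads available on every process.
@everywhere function count_heads(n)
    c::Int = 0
    for i = 1:n
        c += rand(Bool)
    end
    c
end

# Launch four independent trials; each @spawn returns a remote reference.
refs = Any[]
for k = 1:4
    r = @spawn count_heads(1000)
    push!(refs, r)
end

# Reduce the fetched results with +, i.e. x = f(x, v[i]) with f = +.
nheads = reduce(+, map(fetch, refs))
println(nheads)
```

Because + is associative, the order in which the four partial counts are combined does not affect the total.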
Notice that our use of this pattern with count_heads can be
generalized. We used two explicit @spawn statements, which limits
the parallelism to two processes. To run on any number of processes,
we can use a parallel for loop, which can be written in Julia like
this:

    nheads = @parallel (+) for i=1:200000000
      Int(rand(Bool))
    end

This construct implements the pattern of assigning iterations to
multiple processes, and combining them with a specified reduction (in
this case (+)). The result of each iteration is taken as the value
of the last expression inside the loop. The whole parallel loop
expression itself evaluates to the final answer.
Note that although parallel for loops look like serial for loops, their behavior is dramatically different. In particular, the iterations do not happen in a specified order, and writes to variables or arrays will not be globally visible since iterations run on different processes. Any variables used inside the parallel loop will be copied and broadcast to each process.
For example, the following code will not work as intended:

    a = zeros(100000)
    @parallel for i=1:100000
      a[i] = i
    end

This code will not initialize all of a, since each
process will have a separate copy of it. Parallel for loops like these
must be avoided. Fortunately, distributed arrays can be used to get
around this limitation (see the DistributedArrays.jl package).
Using “outside” variables in parallel loops is perfectly reasonable if the variables are read-only:

    a = randn(1000)
    @parallel (+) for i=1:100000
      f(a[rand(1:end)])
    end

Here each iteration applies f to a randomly-chosen sample from a
vector a shared by all processes.
As you can see, the reduction operator can be omitted if it is not needed.
In that case, the loop executes asynchronously, i.e. it spawns independent
tasks on all available workers and returns an array of RemoteRef objects
immediately without waiting for completion.
The caller can wait for the RemoteRef completions at a later
point by calling fetch() on them, or wait for completion at the end of the
loop by prefixing it with @sync, like @sync @parallel for.
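Both ways of waiting can be sketched as follows, assuming the 0.4-era @parallel macro described in this section. The loop bodies are placeholders and the name refs is illustrative.

```julia
# Without a reducer the loop returns immediately with remote references.
refs = @parallel for i = 1:4
    i^2                      # placeholder work; the value is not collected
end
for r in refs
    wait(r)                  # block until that chunk of iterations has finished
end

# Alternatively, wait for the whole loop at the point where it is written.
@sync @parallel for i = 1:4
    println("iteration ", i, " ran on process ", myid())
end
```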
In some cases no reduction operator is needed, and we merely wish to
apply a function to all integers in some range (or, more generally, to
all elements in some collection). This is another useful operation
called parallel map, implemented in Julia as the pmap() function.
For example, we could compute the singular values of several large
random matrices in parallel as follows:

    M = Matrix{Float64}[rand(1000,1000) for i=1:10]
    pmap(svd, M)

Julia’s pmap() is designed for the case where each function call does
a large amount of work. In contrast, @parallel for can handle
situations where each iteration is tiny, perhaps merely summing two
numbers. Only worker processes are used by both pmap() and @parallel for
for the parallel computation. In case of @parallel for, the final reduction
is done on the calling process.
Synchronization With Remote References
Scheduling
Julia’s parallel programming platform uses
Tasks (aka Coroutines) to switch among
multiple computations. Whenever code performs a communication operation
like fetch() or wait(), the current task is suspended and a
scheduler picks another task to run. A task is restarted when the event
it is waiting for completes.
For many problems, it is not necessary to think about tasks directly. However, they can be used to wait for multiple events at the same time, which provides for dynamic scheduling. In dynamic scheduling, a program decides what to compute or where to compute it based on when other jobs finish. This is needed for unpredictable or unbalanced workloads, where we want to assign more work to processes only when they finish their current tasks.
As an example, consider computing the singular values of matrices of different sizes:

    M = Matrix{Float64}[rand(800,800), rand(600,600), rand(800,800), rand(600,600)]
    pmap(svd, M)

If one process handles both 800x800 matrices and another handles both
600x600 matrices, we will not get as much scalability as we could. The
solution is to make a local task to “feed” work to each process when
it completes its current task. This can be seen in the implementation of
pmap():

    function pmap(f, lst)
        np = nprocs()  # determine the number of processes available
        n = length(lst)
        results = cell(n)
        i = 1
        # function to produce the next work item from the queue.
        # in this case it's just an index.
        nextidx() = (idx=i; i+=1; idx)
        @sync begin
            for p=1:np
                if p != myid() || np == 1
                    @async begin
                        while true
                            idx = nextidx()
                            if idx > n
                                break
                            end
                            results[idx] = remotecall_fetch(p, f, lst[idx])
                        end
                    end
                end
            end
        end
        results
    end

@async is similar to @spawn, but only runs tasks on the
local process. We use it to create a “feeder” task for each process.
Each task picks the next index that needs to be computed, then waits for
its process to finish, then repeats until we run out of indexes. Note
that the feeder tasks do not begin to execute until the main task
reaches the end of the @sync block, at which point it surrenders
control and waits for all the local tasks to complete before returning
from the function. The feeder tasks are able to share state via
nextidx() because they all run on the same process. No locking is
required, since the tasks are scheduled cooperatively and not
preemptively. This means context switches only occur at well-defined
points: in this case, when remotecall_fetch() is called.
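The cooperative sharing of nextidx() can be tried without any workers. The sketch below is not the pmap() implementation: run_feeders is a hypothetical helper in which yield() stands in for the remote call as the only point where a task can be switched out, and the results are restricted to integers to keep the example short.

```julia
# Hypothetical local-only variant of the feeder pattern.
function run_feeders(f, lst, ntasks)
    n = length(lst)
    results = zeros(Int, n)
    i = 1
    # shared work queue: hands out the next index to whichever task asks
    nextidx() = (idx = i; i += 1; idx)
    @sync begin
        for t = 1:ntasks
            @async begin
                while true
                    idx = nextidx()
                    if idx > n
                        break
                    end
                    yield()                    # stand-in for remotecall_fetch: tasks switch here
                    results[idx] = f(lst[idx])
                end
            end
        end
    end
    results
end

squares = run_feeders(x -> x^2, [1, 2, 3, 4, 5], 2)
println(squares)
```

Each index is handed out exactly once even though two tasks share the counter, because a task cannot be interrupted between reading and incrementing i.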
Channels
Channels provide for a fast means of inter-task communication. A
Channel(T::Type, n::Int) is a shared queue of maximum length n
holding objects of type T. Multiple readers can read off the channel
via fetch and take!. Multiple writers can add to the channel via
put!. isready tests for the presence of any object in
the channel, while wait waits for an object to become available.
close closes a Channel. On a closed channel, put! will fail,
while take! and fetch successfully return any existing values
until it is emptied.
A Channel can be used as an iterable object in a for loop, in which
case the loop runs as long as the channel has data or is open. The loop
variable takes on all values added to the channel. An empty, closed channel
causes the for loop to terminate.
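A minimal sketch of a producer task feeding a channel that is drained by a for loop. The function name sum_channel is illustrative, and the Channel{Int}(10) constructor form is the one used later in this section.

```julia
# One producer task writes five values and closes the channel;
# the for loop ends once the channel is both empty and closed.
function sum_channel()
    c = Channel{Int}(10)
    @async begin
        for i in 1:5
            put!(c, i)
        end
        close(c)
    end
    total = 0
    for v in c
        total += v
    end
    total
end

println(sum_channel())
```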
RemoteRefs and AbstractChannels
A RemoteRef is a proxy for an implementation of an AbstractChannel.
A concrete implementation of an AbstractChannel (like Channel) is required
to implement put!, take!, fetch, isready and wait. The remote object
referred to by a RemoteRef() or RemoteRef(pid) is stored in a Channel{Any}(1),
i.e., a channel of size 1 capable of holding objects of Any type.
Methods put!, take!, fetch, isready and wait on a RemoteRef are proxied onto
the backing store on the remote process.
The constructor RemoteRef(f::Function, pid) allows us to construct references to channels holding
more than one value of a specific type. f() is a function executed on pid and it must return
an AbstractChannel.
For example, RemoteRef(()->Channel{Int}(10), pid) will return a reference to a channel of type Int
and size 10.
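A short sketch of using such a reference as a remote queue, assuming the 0.4-era RemoteRef constructor described above. The names pid, rr, a and b are illustrative, and first(workers()) is used so that the snippet also runs on a single process.

```julia
pid = first(workers())                        # process that will own the channel

# Reference to a channel of up to 10 Ints stored on process pid.
rr = RemoteRef(() -> Channel{Int}(10), pid)

for i = 1:3
    put!(rr, i)                               # values are queued on process pid
end

a = take!(rr)                                 # values come back in insertion order
b = take!(rr)
println((a, b))
```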
RemoteRef can thus be used to refer to user implemented AbstractChannel objects. A simple
example of this is provided in examples/dictchannel.jl, which uses a dictionary as its remote store.
Distributed Garbage Collection
Objects referred to by remote references can be freed only when all held references in the cluster are deleted.
The node where the value is stored keeps track of which of the workers have a reference to it.
Every time a RemoteRef is serialized to a worker, the node pointed to by the reference is
notified. And every time a RemoteRef is garbage collected locally, the node owning the value
is again notified.
The notifications are done via sending of “tracking” messages - an “add reference” message when a reference is serialized to a different process and a “delete reference” message when a reference is locally garbage collected.
It is important to note that when an object is locally garbage collected depends on the size of the object and the current memory pressure in the system.
In case of remote references, the size of the local reference object is quite small, while the value
stored on the remote node may be quite large. Since the local object may not be collected immediately,
it is a good practice to explicitly call finalize on local instances of RemoteRef. Explicitly
calling finalize results in an immediate message sent to the remote node to go ahead and
remove its reference to the value.
Once finalized, a reference becomes invalid and cannot be used in any further calls.
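A minimal sketch of this practice, assuming the 0.4-era API of this section. The names r and s are illustrative.

```julia
r = @spawn rand(1000, 1000)   # the large matrix lives on whichever process ran the call
s = sum(fetch(r))             # use the value while the reference is still valid

finalize(r)                   # tell the owning process it may release the matrix now
# r must not be used after this point
println(s)
```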
Like remote references, SharedArray objects are also dependent on garbage collection
on the creating node to release references from all participating workers. Code which
creates many short lived shared array objects would benefit from explicitly
finalizing these objects as soon as possible. This results in both memory and file
handles mapping the shared segment being released sooner.
ClusterManagers
The launching, management and networking of Julia processes into a logical
cluster is done via cluster managers. A ClusterManager is responsible for
- launching worker processes in a cluster environment
- managing events during the lifetime of each worker
- optionally, a cluster manager can also provide data transport
A Julia cluster has the following characteristics:
- The initial Julia process, also called the master, is special and has an id of 1.
- Only the master process can add or remove worker processes.
- All processes can directly communicate with each other.
Connections between workers (using the in-built TCP/IP transport) are established in the following manner:
- addprocs() is called on the master process with a ClusterManager object
- addprocs() calls the appropriate launch() method, which spawns the required number of worker processes on appropriate machines
- Each worker starts listening on a free port and writes out its host and port information to STDOUT
- The cluster manager captures the stdout of each worker and makes it available to the master process
- The master process parses this information and sets up TCP/IP connections to each worker
- Every worker is also notified of other workers in the cluster
- Each worker connects to all workers whose id is less than its own id
- In this way a mesh network is established, wherein every worker is directly connected with every other worker
While the default transport layer uses plain TCP sockets, it is possible for a Julia cluster to provide its own transport.
Julia provides two in-built cluster managers:
- LocalManager, used when addprocs() or addprocs(np::Integer) are called
- SSHManager, used when addprocs(hostnames::Array) is called with a list of hostnames
LocalManager is used to launch additional workers on the same host, thereby leveraging multi-core
and multi-processor hardware.
Thus, a minimal cluster manager would need to:
- be a subtype of the abstract ClusterManager
- implement launch(), a method responsible for launching new workers
- implement manage(), which is called at various events during a worker’s lifetime
addprocs(manager::FooManager) requires FooManager to implement:

    function launch(manager::FooManager, params::Dict, launched::Array, c::Condition)
        ...
    end

    function manage(manager::FooManager, id::Integer, config::WorkerConfig, op::Symbol)
        ...
    end

As an example let us see how the LocalManager, the manager responsible for
starting workers on the same host, is implemented:

    immutable LocalManager <: ClusterManager
        np::Integer
    end

    function launch(manager::LocalManager, params::Dict, launched::Array, c::Condition)
        ...
    end

    function manage(manager::LocalManager, id::Integer, config::WorkerConfig, op::Symbol)
        ...
    end

The launch() method takes the following arguments:
- manager::ClusterManager - the cluster manager addprocs() is called with
- params::Dict - all the keyword arguments passed to addprocs()
- launched::Array - the array to append one or more WorkerConfig objects to
- c::Condition - the condition variable to be notified as and when workers are launched
The launch() method is called asynchronously in a separate task. The termination of this task
signals that all requested workers have been launched. Hence the launch() function MUST exit as soon
as all the requested workers have been launched.
Newly launched workers are connected to each other, and the master process, in an all-to-all manner.
Specifying the command argument --worker results in the launched processes initializing themselves
as workers and connections being set up via TCP/IP sockets. Optionally, --bind-to bind_addr[:port]
may also be specified to enable other workers to connect to it at the specified bind_addr and port.
This is useful for multi-homed hosts.
For non-TCP/IP transports (for example, an implementation may choose to use MPI as the transport),
--worker must NOT be specified. Instead, newly launched workers should call init_worker()
before using any of the parallel constructs.
For every worker launched, the launch() method must add a WorkerConfig
object (with appropriate fields initialized) to launched.

    type WorkerConfig
        # Common fields relevant to all cluster managers
        io::Nullable{IO}
        host::Nullable{AbstractString}
        port::Nullable{Integer}

        # Used when launching additional workers at a host
        count::Nullable{Union{Int, Symbol}}
        exename::Nullable{AbstractString}
        exeflags::Nullable{Cmd}

        # External cluster managers can use this to store information at a per-worker level
        # Can be a dict if multiple fields need to be stored.
        userdata::Nullable{Any}

        # SSHManager / SSH tunnel connections to workers
        tunnel::Nullable{Bool}
        bind_addr::Nullable{AbstractString}
        sshflags::Nullable{Cmd}
        max_parallel::Nullable{Integer}

        connect_at::Nullable{Any}

        .....
    end

Most of the fields in WorkerConfig are used by the inbuilt managers.
Custom cluster managers would typically specify only io or host / port:
If io is specified, it is used to read host/port information. A Julia
worker prints out its bind address and port at startup. This allows Julia
workers to listen on any free port available instead of requiring worker ports
to be configured manually.
If io is not specified, host and port are used to connect.
count, exename and exeflags are relevant for launching additional workers from a worker.
For example, a cluster manager may launch a single worker per node, and use that to launch
additional workers. count with an integer value n will launch a total of n workers,
while a value of :auto will launch as many workers as cores on that machine.
exename is the name of the julia executable including the full path.
exeflags should be set to the required command line arguments for new workers.
tunnel, bind_addr, sshflags and max_parallel are used when an ssh tunnel is
required to connect to the workers from the master process.
userdata is provided for custom cluster managers to store their own worker specific information.
manage(manager::FooManager, id::Integer, config::WorkerConfig, op::Symbol) is called at different
times during the worker’s lifetime with appropriate op values:
- with :register / :deregister when a worker is added / removed from the Julia worker pool.
- with :interrupt when interrupt(workers) is called. The ClusterManager should signal the appropriate worker with an interrupt signal.
- with :finalize for cleanup purposes.
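The dispatch on op can be sketched as below. FooManager is the placeholder name used in this section, not a real manager, and the bodies only print messages. What a real manager does for each event (for example, how it delivers an interrupt) depends on how its workers were launched and is not shown here. The sketch assumes the 0.4-era Base exports ClusterManager, WorkerConfig and manage.

```julia
import Base: manage

# Placeholder manager type, as used in the text above.
immutable FooManager <: ClusterManager
end

function manage(manager::FooManager, id::Integer, config::WorkerConfig, op::Symbol)
    if op == :register
        println("worker ", id, " was added to the pool")
    elseif op == :deregister
        println("worker ", id, " was removed from the pool")
    elseif op == :interrupt
        # a real manager would signal the worker process here
        println("interrupt requested for worker ", id)
    elseif op == :finalize
        # release any per-worker resources held by the manager
        println("cleaning up after worker ", id)
    end
    nothing
end
```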
Cluster Managers with custom transports
Replacing the default TCP/IP all-to-all socket connections with a custom transport layer is a little more involved. Each Julia process has as many communication tasks as the workers it is connected to. For example, consider a Julia cluster of 32 processes in an all-to-all mesh network:
- Each Julia process thus has 31 communication tasks
- Each task handles all incoming messages from a single remote worker in a message processing loop
- The message processing loop waits on an AsyncStream object (for example, a TCP socket in the default implementation), reads an entire message, processes it and waits for the next one
- Sending messages to a process is done directly from any Julia task, not just communication tasks, again via the appropriate AsyncStream object
Replacing the default transport requires the new implementation to set up connections to remote workers and to provide appropriate
AsyncStream objects that the message processing loops can wait on. The manager specific callbacks to be implemented are:

    connect(manager::FooManager, pid::Integer, config::WorkerConfig)
    kill(manager::FooManager, pid::Int, config::WorkerConfig)

The default implementation (which uses TCP/IP sockets) is implemented as connect(manager::ClusterManager, pid::Integer, config::WorkerConfig).
connect should return a pair of AsyncStream objects, one for reading data sent from worker pid,
and the other to write data that needs to be sent to worker pid. Custom cluster managers can use an in-memory BufferStream
as the plumbing to proxy data between the custom, possibly non-AsyncStream transport and Julia’s in-built parallel infrastructure.
A BufferStream is an in-memory IOBuffer which behaves like an AsyncStream.
Folder examples/clustermanager/0mq is an example of using ZeroMQ to connect Julia workers in a star network with a 0MQ broker in the middle.
Note: The Julia processes are still all logically connected to each other - any worker can message any other worker directly without any
awareness of 0MQ being used as the transport layer.
When using custom transports:
- Julia workers must NOT be started with --worker. Starting with --worker will result in the newly launched workers defaulting to the TCP/IP socket transport implementation
- For every incoming logical connection with a worker, Base.process_messages(rd::AsyncStream, wr::AsyncStream) must be called. This launches a new task that handles reading and writing of messages from/to the worker represented by the AsyncStream objects
- init_worker(manager::FooManager) MUST be called as part of worker process initialization
- Field connect_at::Any in WorkerConfig can be set by the cluster manager when launch is called. The value of this field is passed in all connect callbacks. Typically, it carries information on how to connect to a worker. For example, the TCP/IP socket transport uses this field to specify the (host, port) tuple at which to connect to a worker
kill(manager, pid, config) is called to remove a worker from the cluster.
On the master process, the corresponding AsyncStream objects must be closed by the implementation to ensure proper cleanup. The default
implementation simply executes an exit() call on the specified remote worker.
examples/clustermanager/simple is an example that shows a simple implementation using unix domain sockets for cluster setup.
Specifying network topology (Experimental)
Keyword argument topology to addprocs is used to specify how the workers must be connected to each other:
- :all_to_all : is the default, where all workers are connected to each other.
- :master_slave : only the driver process, i.e. pid 1, has connections to the workers.
- :custom : the launch method of the cluster manager specifies the connection topology. Fields ident and connect_idents in WorkerConfig are used to specify the same. connect_idents is a list of ClusterManager provided identifiers to workers that the worker identified by ident must connect to.
Currently, sending a message between unconnected workers results in an error. This behaviour, as well as the functionality and interface, should be considered experimental in nature and may change in future releases.
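A minimal sketch of requesting a non-default topology, assuming the experimental topology keyword described above. The names ids and r are illustrative. With :master_slave the workers are connected only to process 1, so all calls in the sketch are issued from the master.

```julia
# Start two local workers that are connected only to the master process.
ids = addprocs(2; topology=:master_slave)

# The master can still talk to each worker directly.
r = @spawnat ids[1] myid()
println("reply from worker ", fetch(r))

rmprocs(ids)
```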
Footnotes
[1] In this context, MPI refers to the MPI-1 standard. Beginning with MPI-2, the MPI standards committee introduced a new set of communication mechanisms, collectively referred to as Remote Memory Access (RMA). The motivation for adding RMA to the MPI standard was to facilitate one-sided communication patterns. For additional information on the latest MPI standard, see http://www.mpi-forum.org/docs.