Performance Tips¶
In the following sections, we briefly go through a few techniques that can help make your Julia code run as fast as possible.
Avoid global variables¶
A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.
Any code that is performance critical or being benchmarked should be inside a function.
We find that global names are frequently constants, and declaring them as such greatly improves performance:
const DEFAULT_VAL = 0
Uses of non-constant globals can be optimized by annotating their types at the point of use:
global x
y = f(x::Int + 1)
Writing functions is better style. It leads to more reusable code and clarifies what steps are being done, and what their inputs and outputs are.
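As a rough sketch of the difference (illustrative names; exact timings will vary), compare a loop that reads a non-constant global with the same loop operating on a function argument:

data = rand(10^6)        # a non-constant global: its type could change at any time

function sum_global()
    s = 0.0
    for x in data        # `data` is global, so the compiler cannot assume its type
        s += x
    end
    return s
end

function sum_arg(v)
    s = 0.0
    for x in v           # `v` is an argument with a known concrete type
        s += x
    end
    return s
end

@time sum_global()       # typically slower and allocates
@time sum_arg(data)      # typically much faster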
NOTE: All code in the REPL is evaluated in global scope, so a variable defined and assigned at toplevel will be a global variable.
In the following REPL session:
julia> x = 1.0
is equivalent to:
julia> global x = 1.0
so all the performance issues discussed previously apply.
Measure performance with @time and pay attention to memory allocation¶
The most useful tool for measuring performance is the @time
macro.
The following example illustrates good working style:
julia> function f(n)
           s = 0
           for i = 1:n
               s += i/2
           end
           s
       end
f (generic function with 1 method)

julia> @time f(1)
elapsed time: 0.004710563 seconds (93504 bytes allocated)
0.5

julia> @time f(10^6)
elapsed time: 0.04123202 seconds (32002136 bytes allocated)
2.5000025e11
On the first call (@time f(1)
), f
gets compiled. (If you’ve
not yet used @time
in this session, it will also compile functions
needed for timing.) You should not take the results of this run
seriously. For the second run, note that in addition to reporting the
time, it also indicated that a large amount of memory was allocated.
This is the single biggest advantage of @time
vs. functions like
tic()
and toc()
, which only report time.
Unexpected memory allocation is almost always a sign of some problem with your code, usually a problem with type-stability. Consequently, in addition to the allocation itself, it’s very likely that the code generated for your function is far from optimal. Take such indications seriously and follow the advice below.
As a teaser, note that an improved version of this function allocates no memory (except to return the result to the REPL) and has an order of magnitude faster execution after the first call:
julia> @time f_improved(1)  # first call
elapsed time: 0.003702172 seconds (78944 bytes allocated)
0.5

julia> @time f_improved(10^6)
elapsed time: 0.004313644 seconds (112 bytes allocated)
2.5000025e11
Below you’ll learn how to spot the problem with f
and how to fix it.
In some situations, your function may need to allocate memory as part of its operation, and this can complicate the simple picture above. In such cases, consider using one of the tools below to diagnose problems, or write a version of your function that separates allocation from its algorithmic aspects (see Pre-allocating outputs).
Tools¶
Julia and its package ecosystem include tools that may help you diagnose problems and improve the performance of your code:
- Profiling allows you to measure the performance of your running code and identify lines that serve as bottlenecks. For complex projects, the ProfileView package can help you visualize your profiling results.
- Unexpectedly-large memory allocations—as reported by @time, @allocated, or the profiler (through calls to the garbage-collection routines)—hint that there might be issues with your code. If you don't see another reason for the allocations, suspect a type problem. You can also start Julia with the --track-allocation=user option and examine the resulting *.mem files to see information about where those allocations occur. See Memory allocation analysis.
- @code_warntype generates a representation of your code that can be helpful in finding expressions that result in type uncertainty. See @code_warntype below.
- The Lint and TypeCheck packages can also warn you of certain types of programming errors.
Avoid containers with abstract type parameters¶
When working with parameterized types, including arrays, it is best to avoid parameterizing with abstract types where possible.
Consider the following:
a = Real[]    # typeof(a) = Array{Real,1}
if (f = rand()) < .8
    push!(a, f)
end
Because a is an array of abstract type Real, it must be able to hold any Real value. Since Real objects can be of arbitrary size and structure, a must be represented as an array of pointers to individually allocated Real objects. Because f will always be a Float64, we should instead use:
a = Float64[]    # typeof(a) = Array{Float64,1}
which will create a contiguous block of 64-bit floating-point values that can be manipulated efficiently.
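As a rough sketch of the difference this makes (illustrative names; exact timings will vary):

a_abstract = Real[rand() for i = 1:10^6]      # elements stored as pointers to boxed Real values
a_concrete = Float64[rand() for i = 1:10^6]   # elements stored inline as contiguous Float64s

function sumvec(v)
    s = 0.0
    for x in v
        s += x
    end
    return s
end

@time sumvec(a_abstract)   # slower: each element is loaded through a pointer
@time sumvec(a_concrete)   # faster: elements are read directly from memory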
See also the discussion under Parametric Types.
Type declarations¶
In many languages with optional type declarations, adding declarations is the principal way to make code run faster. This is not the case in Julia. In Julia, the compiler generally knows the types of all function arguments, local variables, and expressions. However, there are a few specific instances where declarations are helpful.
Avoid fields with abstract type¶
Types can be declared without specifying the types of their fields:
julia> type MyAmbiguousType
           a
       end
This allows a
to be of any type. This can often be useful, but it
does have a downside: for objects of type MyAmbiguousType
, the
compiler will not be able to generate high-performance code. The
reason is that the compiler uses the types of objects, not their
values, to determine how to build code. Unfortunately, very little can
be inferred about an object of type MyAmbiguousType
:
julia>b=MyAmbiguousType("Hello")MyAmbiguousType("Hello")julia>c=MyAmbiguousType(17)MyAmbiguousType(17)julia>typeof(b)MyAmbiguousTypejulia>typeof(c)MyAmbiguousType
b
and c
have the same type, yet their underlying
representation of data in memory is very different. Even if you stored
just numeric values in field a
, the fact that the memory
representation of a UInt8
differs from a Float64
also means
that the CPU needs to handle them using two different kinds of
instructions. Since the required information is not available in the
type, such decisions have to be made at run-time. This slows
performance.
You can do better by declaring the type of a
. Here, we are focused
on the case where a
might be any one of several types, in which
case the natural solution is to use parameters. For example:
julia> type MyType{T<:AbstractFloat}
           a::T
       end
This is a better choice than
julia> type MyStillAmbiguousType
           a::AbstractFloat
       end
because the first version specifies the type of a
from the type of
the wrapper object. For example:
julia> m = MyType(3.2)
MyType{Float64}(3.2)

julia> t = MyStillAmbiguousType(3.2)
MyStillAmbiguousType(3.2)

julia> typeof(m)
MyType{Float64}

julia> typeof(t)
MyStillAmbiguousType
The type of field a
can be readily determined from the type of
m
, but not from the type of t
. Indeed, in t
it’s possible
to change the type of field a
:
julia> typeof(t.a)
Float64

julia> t.a = 4.5f0
4.5f0

julia> typeof(t.a)
Float32
In contrast, once m
is constructed, the type of m.a
cannot
change:
julia> m.a = 4.5f0
4.5f0

julia> typeof(m.a)
Float64
The fact that the type of m.a
is known from m
‘s type—coupled
with the fact that its type cannot change mid-function—allows the
compiler to generate highly-optimized code for objects like m
but
not for objects like t
.
Of course, all of this is true only if we construct m
with a
concrete type. We can break this by explicitly constructing it with
an abstract type:
julia> m = MyType{AbstractFloat}(3.2)
MyType{AbstractFloat}(3.2)

julia> typeof(m.a)
Float64

julia> m.a = 4.5f0
4.5f0

julia> typeof(m.a)
Float32
For all practical purposes, such objects behave identically to those
of MyStillAmbiguousType
.
It’s quite instructive to compare the sheer amount of code generated for a simple function
func(m::MyType) = m.a + 1
using
code_llvm(func, (MyType{Float64},))
code_llvm(func, (MyType{AbstractFloat},))
code_llvm(func, (MyType,))
For reasons of length the results are not shown here, but you may wish to try this yourself. Because the type is fully-specified in the first case, the compiler doesn’t need to generate any code to resolve the type at run-time. This results in shorter and faster code.
Avoid fields with abstract containers¶
The same best practices also work for container types:
julia> type MySimpleContainer{A<:AbstractVector}
           a::A
       end

julia> type MyAmbiguousContainer{T}
           a::AbstractVector{T}
       end
For example:
julia> c = MySimpleContainer(1:3);

julia> typeof(c)
MySimpleContainer{UnitRange{Int64}}

julia> c = MySimpleContainer([1:3;]);

julia> typeof(c)
MySimpleContainer{Array{Int64,1}}

julia> b = MyAmbiguousContainer(1:3);

julia> typeof(b)
MyAmbiguousContainer{Int64}

julia> b = MyAmbiguousContainer([1:3;]);

julia> typeof(b)
MyAmbiguousContainer{Int64}
For MySimpleContainer
, the object is fully-specified by its type
and parameters, so the compiler can generate optimized functions. In
most instances, this will probably suffice.
While the compiler can now do its job perfectly well, there are cases
where you might wish that your code could do different things
depending on the element type of a
. Usually the best way to
achieve this is to wrap your specific operation (here, foo
) in a
separate function:
function sumfoo(c::MySimpleContainer)
    s = 0
    for x in c.a
        s += foo(x)
    end
    s
end

foo(x::Integer) = x
foo(x::AbstractFloat) = round(x)
This keeps things simple, while allowing the compiler to generate optimized code in all cases.
However, there are cases where you may need to declare different
versions of the outer function for different element types of
a
. You could do it like this:
function myfun{T<:AbstractFloat}(c::MySimpleContainer{Vector{T}})
    ...
end
function myfun{T<:Integer}(c::MySimpleContainer{Vector{T}})
    ...
end
This works fine for Vector{T}
, but we’d also have to write
explicit versions for UnitRange{T}
or other abstract types. To
prevent such tedium, you can use two parameters in the declaration of
MyContainer
:
type MyContainer{T, A<:AbstractVector}
    a::A
end
MyContainer(v::AbstractVector) = MyContainer{eltype(v), typeof(v)}(v)

julia> b = MyContainer(1.3:5);

julia> typeof(b)
MyContainer{Float64,UnitRange{Float64}}
Note the somewhat surprising fact that T
doesn’t appear in the
declaration of field a
, a point that we’ll return to in a moment.
With this approach, one can write functions such as:
function myfunc{T<:Integer, A<:AbstractArray}(c::MyContainer{T,A})
    return c.a[1]+1
end
# Note: because we can only define MyContainer for
# A<:AbstractArray, and any unspecified parameters are arbitrary,
# the previous could have been written more succinctly as
#     function myfunc{T<:Integer}(c::MyContainer{T})

function myfunc{T<:AbstractFloat}(c::MyContainer{T})
    return c.a[1]+2
end

function myfunc{T<:Integer}(c::MyContainer{T,Vector{T}})
    return c.a[1]+3
end

julia> myfunc(MyContainer(1:3))
2

julia> myfunc(MyContainer(1.0:3))
3.0

julia> myfunc(MyContainer([1:3]))
4
As you can see, with this approach it’s possible to specialize on both
the element type T
and the array type A
.
However, there’s one remaining hole: we haven’t enforced that A
has element type T
, so it’s perfectly possible to construct an
object like this:
julia> b = MyContainer{Int64, UnitRange{Float64}}(1.3:5);

julia> typeof(b)
MyContainer{Int64,UnitRange{Float64}}
To prevent this, we can add an inner constructor:
type MyBetterContainer{T<:Real, A<:AbstractVector}
    a::A
    MyBetterContainer(v::AbstractVector{T}) = new(v)
end
MyBetterContainer(v::AbstractVector) = MyBetterContainer{eltype(v), typeof(v)}(v)

julia> b = MyBetterContainer(1.3:5);

julia> typeof(b)
MyBetterContainer{Float64,UnitRange{Float64}}

julia> b = MyBetterContainer{Int64, UnitRange{Float64}}(1.3:5);
ERROR: no method MyBetterContainer(UnitRange{Float64},)
The inner constructor requires that the element type of A
be T
.
Annotate values taken from untyped locations¶
It is often convenient to work with data structures that may contain
values of any type (arrays of type Array{Any}
). But, if you’re using one of
these structures and happen to know the type of an element, it helps to
share this knowledge with the compiler:
function foo(a::Array{Any,1})
    x = a[1]::Int32
    b = x+1
    ...
end
Here, we happened to know that the first element of a
would be an
Int32
. Making an annotation like this has the added benefit that it
will raise a run-time error if the value is not of the expected type,
potentially catching certain bugs earlier.
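A minimal runnable sketch of this annotation (the function name is hypothetical):

function first_plus_one(a::Array{Any,1})
    x = a[1]::Int32    # assert the element type; throws a run-time error if it is not an Int32
    return x + 1
end

first_plus_one(Any[Int32(7), "other", :junk])   # returns 8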
Declare types of keyword arguments¶
Keyword arguments can have declared types:
function with_keyword(x; name::Int = 1)
    ...
end
Functions are specialized on the types of keyword arguments, so these declarations will not affect performance of code inside the function. However, they will reduce the overhead of calls to the function that include keyword arguments.
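As a hedged sketch (hypothetical function name), a complete definition with a typed keyword argument might look like:

function padded_label(x; width::Int = 8)
    return rpad(string(x), width)   # pad the printed form of x to `width` characters
end

padded_label("abc")              # call with only positional arguments
padded_label("abc"; width = 12)  # keyword call: the Int declaration reduces call overhead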
Functions with keyword arguments have near-zero overhead for call sites that pass only positional arguments.
Passing dynamic lists of keyword arguments, as in f(x;keywords...)
,
can be slow and should be avoided in performance-sensitive code.
Break functions into multiple definitions¶
Writing a function as many small definitions allows the compiler to directly call the most applicable code, or even inline it.
Here is an example of a “compound function” that should really be written as multiple definitions:
function norm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A,A)))
    elseif isa(A, Matrix)
        return max(svd(A)[2])
    else
        error("norm: invalid argument")
    end
end
This can be written more concisely and efficiently as:
norm(x::Vector) = sqrt(real(dot(x,x)))
norm(A::Matrix) = max(svd(A)[2])
Write “type-stable” functions¶
When possible, it helps to ensure that a function always returns a value of the same type. Consider the following definition:
pos(x) = x < 0 ? 0 : x
Although this seems innocent enough, the problem is that 0
is an
integer (of type Int
) and x
might be of any type. Thus,
depending on the value of x
, this function might return a value of
either of two types. This behavior is allowed, and may be desirable in
some cases. But it can easily be fixed as follows:
pos(x) = x < 0 ? zero(x) : x
There is also a one()
function, and a more general oftype(x,y)
function, which returns y
converted to the type of x
.
Avoid changing the type of a variable¶
An analogous “type-stability” problem exists for variables used repeatedly within a function:
function foo()
    x = 1
    for i = 1:10
        x = x / bar()
    end
    return x
end
Local variable x
starts as an integer, and after one loop iteration
becomes a floating-point number (the result of the / operator). This
makes it more difficult for the compiler to optimize the body of the
loop. There are several possible fixes:
- Initialize x with x = 1.0 (as sketched below)
- Declare the type of x: x::Float64 = 1
- Use an explicit conversion: x = one(T)
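A sketch of the loop rewritten with the first fix, so that x keeps a single type throughout (bar() is the same placeholder as above, assumed to return a Float64):

function foo()
    x = 1.0              # start as a Float64, so the division never changes x's type
    for i = 1:10
        x = x / bar()
    end
    return x
end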
Separate kernel functions (aka, function barriers)¶
Many functions follow a pattern of performing some set-up work, and then running many iterations to perform a core computation. Where possible, it is a good idea to put these core computations in separate functions. For example, the following contrived function returns an array of a randomly-chosen type:
function strange_twos(n)
    a = Array(rand(Bool) ? Int64 : Float64, n)
    for i = 1:n
        a[i] = 2
    end
    return a
end
This should be written as:
function fill_twos!(a)
    for i = 1:length(a)
        a[i] = 2
    end
end

function strange_twos(n)
    a = Array(rand(Bool) ? Int64 : Float64, n)
    fill_twos!(a)
    return a
end
Julia’s compiler specializes code for argument types at function
boundaries, so in the original implementation it does not know the type
of a
during the loop (since it is chosen randomly). Therefore the
second version is generally faster since the inner loop can be
recompiled as part of fill_twos!
for different types of a
.
The second form is also often better style and can lead to more code reuse.
This pattern is used in several places in the standard library. For
example, see hvcat_fill
in
abstractarray.jl,
or the fill!
function, which we could have used instead of writing
our own fill_twos!
.
Functions like strange_twos
occur when dealing with data of
uncertain type, for example data loaded from an input file that might
contain either integers, floats, strings, or something else.
Types with values-as-parameters¶
Let’s say you want to create an N
-dimensional array that
has size 3 along each axis. Such arrays can be created like this:
A = fill(5.0, (3, 3))
This approach works very well: the compiler can figure out that A
is an Array{Float64,2}
because it knows the type of the fill value
(5.0::Float64
) and the dimensionality ((3,3)::NTuple{2,Int}
).
This implies that the compiler can generate very efficient code for
any future usage of A
in the same function.
But now let’s say you want to write a function that creates a 3×3×... array in arbitrary dimensions; you might be tempted to write a function
function array3(fillval, N)
    fill(fillval, ntuple(d->3, N))
end
This works, but (as you can verify for yourself using @code_warntype array3(5.0, 2)
) the problem is that the output type cannot be
inferred: the argument N
is a value of type Int
, and
type-inference does not (and cannot) predict its value in
advance. This means that code using the output of this function has to
be conservative, checking the type on each access of A
; such code
will be very slow.
Now, one very good way to solve such problems is by using the
function-barrier technique. However, in some cases you might
want to eliminate the type-instability altogether. In such cases, one
approach is to pass the dimensionality as a parameter, for example
through Val{T}
(see “Value types”):
function array3{N}(fillval, ::Type{Val{N}})
    fill(fillval, ntuple(d->3, Val{N}))
end
Julia has a specialized version of ntuple
that accepts a
Val{::Int}
as the second parameter; by passing N
as a
type-parameter, you make its “value” known to the compiler.
Consequently, this version of array3
allows the compiler to
predict the return type.
However, making use of such techniques can be surprisingly subtle. For
example, it would be of no help if you called array3
from a
function like this:
function call_array3(fillval, n)
    A = array3(fillval, Val{n})
end
Here, you’ve created the same problem all over again: the compiler
can’t guess the type of n
, so it doesn’t know the type of
Val{n}
. Attempting to use Val
, but doing so incorrectly, can
easily make performance worse in many situations. (Only in
situations where you’re effectively combining Val
with the
function-barrier trick, to make the kernel function more efficient,
should code like the above be used.)
An example of correct usage of Val
would be:
function filter3{T,N}(A::AbstractArray{T,N})
    kernel = array3(1, Val{N})
    filter(A, kernel)
end
In this example, N
is passed as a parameter, so its “value” is
known to the compiler. Essentially, Val{T}
works only when T
is either hard-coded (Val{3}
) or already specified in the
type-domain.
The dangers of abusing multiple dispatch (aka, more on types with values-as-parameters)¶
Once one learns to appreciate multiple dispatch, there’s an understandable tendency to go crazy and try to use it for everything. For example, you might imagine using it to store information, e.g.
immutable Car{Make,Model}
    year::Int
    ...more fields...
end
and then dispatch on objects like Car{:Honda,:Accord}(year,args...)
.
This might be worthwhile when the following are true:
- You require CPU-intensive processing on each Car, and it becomes vastly more efficient if you know the Make and Model at compile time.
- You have homogeneous lists of the same type of Car to process, so that you can store them all in an Array{Car{:Honda,:Accord},N}.
When the latter holds, a function processing such a homogeneous array can be productively specialized: Julia knows the type of each element in advance (all objects in the container have the same concrete type), so Julia can “look up” the correct method calls when the function is being compiled (obviating the need to check at run-time) and thereby emit efficient code for processing the whole list.
When these do not hold, then it’s likely that you’ll get no benefit;
worse, the resulting “combinatorial explosion of types” will be
counterproductive. If items[i+1]
has a different type than
items[i]
, Julia has to look up the type at run-time, search for the
appropriate method in method tables, decide (via type intersection)
which one matches, determine whether it has been JIT-compiled yet (and
do so if not), and then make the call. In essence, you’re asking the
full type system and JIT-compilation machinery to basically execute
the equivalent of a switch statement or dictionary lookup in your own
code.
Some run-time benchmarks comparing (1) type dispatch, (2) dictionary lookup, and (3) a “switch” statement can be found on the mailing list.
Perhaps even worse than the run-time impact is the compile-time
impact: Julia will compile specialized functions for each different
Car{Make,Model}
; if you have hundreds or thousands of such types,
then every function that accepts such an object as a parameter (from a
custom get_year
function you might write yourself, to the generic
push!
function in the standard library) will have hundreds or
thousands of variants compiled for it. Each of these increases the
size of the cache of compiled code, the length of internal lists of
methods, etc. Excess enthusiasm for values-as-parameters can easily
waste enormous resources.
Access arrays in memory order, along columns¶
Multidimensional arrays in Julia are stored in column-major order. This
means that arrays are stacked one column at a time. This can be verified
using the vec
function or the syntax [:]
as shown below (notice
that the array is ordered [1 3 2 4], not [1 2 3 4]
):
julia> x = [1 2; 3 4]
2×2 Array{Int64,2}:
 1  2
 3  4

julia> x[:]
4-element Array{Int64,1}:
 1
 3
 2
 4
This convention for ordering arrays is common in many languages like
Fortran, Matlab, and R (to name a few). The alternative to column-major
ordering is row-major ordering, which is the convention adopted by C and
Python (numpy
) among other languages. Remembering the ordering of
arrays can have significant performance effects when looping over
arrays. A rule of thumb to keep in mind is that with column-major
arrays, the first index changes most rapidly. Essentially this means
that looping will be faster if the inner-most loop index is the first to
appear in a slice expression.
Consider the following contrived example. Imagine we wanted to write a
function that accepts a Vector
and returns a square Matrix
with either the rows or the columns filled with copies of the input
vector. Assume that it is not important whether rows or columns are
filled with these copies (perhaps the rest of the code can be easily
adapted accordingly). We could conceivably do this in at least four ways
(in addition to the recommended call to the built-in repmat()
):
function copy_cols{T}(x::Vector{T})
    n = size(x, 1)
    out = Array{T}(n, n)
    for i = 1:n
        out[:, i] = x
    end
    out
end

function copy_rows{T}(x::Vector{T})
    n = size(x, 1)
    out = Array{T}(n, n)
    for i = 1:n
        out[i, :] = x
    end
    out
end

function copy_col_row{T}(x::Vector{T})
    n = size(x, 1)
    out = Array{T}(n, n)
    for col = 1:n, row = 1:n
        out[row, col] = x[row]
    end
    out
end

function copy_row_col{T}(x::Vector{T})
    n = size(x, 1)
    out = Array{T}(n, n)
    for row = 1:n, col = 1:n
        out[row, col] = x[col]
    end
    out
end
Now we will time each of these functions using the same random 10000
by 1
input vector:
julia> x = randn(10000);

julia> fmt(f) = println(rpad(string(f)*": ", 14, ' '), @elapsed f(x))

julia> map(fmt, Any[copy_cols, copy_rows, copy_col_row, copy_row_col]);
copy_cols:    0.331706323
copy_rows:    1.799009911
copy_col_row: 0.415630047
copy_row_col: 1.721531501
Notice that copy_cols
is much faster than copy_rows
. This is
expected because copy_cols
respects the column-based memory layout
of the Matrix
and fills it one column at a time. Additionally,
copy_col_row
is much faster than copy_row_col
because it follows
our rule of thumb that the first element to appear in a slice expression
should be coupled with the inner-most loop.
Pre-allocating outputs¶
If your function returns an Array or some other complex type, it may have to allocate memory. Unfortunately, oftentimes allocation and its converse, garbage collection, are substantial bottlenecks.
Sometimes you can circumvent the need to allocate memory on each function call by preallocating the output. As a trivial example, compare
function xinc(x)
    return [x, x+1, x+2]
end

function loopinc()
    y = 0
    for i = 1:10^7
        ret = xinc(i)
        y += ret[2]
    end
    y
end
with
function xinc!{T}(ret::AbstractVector{T}, x::T)
    ret[1] = x
    ret[2] = x+1
    ret[3] = x+2
    nothing
end

function loopinc_prealloc()
    ret = Array{Int}(3)
    y = 0
    for i = 1:10^7
        xinc!(ret, i)
        y += ret[2]
    end
    y
end
Timing results:
julia> @time loopinc()
elapsed time: 1.955026528 seconds (1279975584 bytes allocated)
50000015000000

julia> @time loopinc_prealloc()
elapsed time: 0.078639163 seconds (144 bytes allocated)
50000015000000
Preallocation has other advantages, for example by allowing the
caller to control the “output” type from an algorithm. In the example
above, we could have passed a SubArray
rather than an Array
,
had we so desired.
Taken to its extreme, pre-allocation can make your code uglier, so
performance measurements and some judgment may be required. However,
for “vectorized” (element-wise) functions, the convenient syntax
x .= f.(y)
can be used for in-place operations with fused loops
and no temporary arrays (Dot Syntax for Vectorizing Functions).
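A minimal sketch of such a fused, in-place update:

f(a) = 3a^2 + 5a + 2     # an arbitrary scalar function
x = rand(1000)
y = similar(x)
y .= f.(x)               # overwrites y element-wise; no temporary array is allocated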
Avoid string interpolation for I/O¶
When writing data to a file (or other I/O device), forming extra intermediate strings is a source of overhead. Instead of:
println(file,"$a$b")
use:
println(file,a," ",b)
The first version of the code forms a string, then writes it to the file, while the second version writes values directly to the file. Also notice that in some cases string interpolation can be harder to read. Consider:
println(file,"$(f(a))$(f(b))")
versus:
println(file, f(a), f(b))
Optimize network I/O during parallel execution¶
When executing a remote function in parallel:
responses = Vector{Any}(nworkers())
@sync begin
    for (idx, pid) in enumerate(workers())
        @async responses[idx] = remotecall_fetch(pid, foo, args...)
    end
end
is faster than:
refs = Vector{Any}(nworkers())
for (idx, pid) in enumerate(workers())
    refs[idx] = @spawnat pid foo(args...)
end
responses = [fetch(r) for r in refs]
The former results in a single network round-trip to every worker, while the
latter results in two network calls - first by the @spawnat
and the
second due to the fetch
(or even a wait
). The fetch
/wait
is also executed serially, resulting in overall poorer performance.
Fix deprecation warnings¶
A deprecated function internally performs a lookup in order to print a relevant warning only once. This extra lookup can cause a significant slowdown, so all uses of deprecated functions should be modified as suggested by the warnings.
Tweaks¶
These are some minor points that might help in tight inner loops.
- Avoid unnecessary arrays. For example, instead of sum([x,y,z]) use x+y+z.
- Use abs2(z) instead of abs(z)^2 for complex z. In general, try to rewrite code to use abs2() instead of abs() for complex arguments.
- Use div(x,y) for truncating division of integers instead of trunc(x/y), fld(x,y) instead of floor(x/y), and cld(x,y) instead of ceil(x/y). (A short sketch of these tweaks follows below.)
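A small illustrative sketch of these tweaks (values chosen arbitrarily):

z = 3.0 + 4.0im
abs2(z)          # 25.0; avoids the square root that abs(z)^2 would compute and then undo

x, y = 7, 2
div(x, y)        # 3: truncating integer division, without the float round-trip of trunc(x/y)
fld(x, y)        # 3: floored division
cld(x, y)        # 4: ceiling division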
Performance Annotations¶
Sometimes you can enable better optimization by promising certain program properties.
- Use @inbounds to eliminate array bounds checking within expressions. Be certain before doing this. If the subscripts are ever out of bounds, you may suffer crashes or silent corruption.
- Use @fastmath to allow floating point optimizations that are correct for real numbers, but lead to differences for IEEE numbers. Be careful when doing this, as this may change numerical results. This corresponds to the -ffast-math option of clang.
- Write @simd in front of for loops that are amenable to vectorization. This feature is experimental and could change or disappear in future versions of Julia.
Note: While @simd
needs to be placed directly in front of a
loop, both @inbounds
and @fastmath
can be applied to
several statements at once, e.g. using begin
... end
, or even
to a whole function.
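For instance, a minimal sketch (hypothetical function) that applies @inbounds to several statements at once:

function sum3(x)
    s = zero(eltype(x))
    @inbounds begin      # bounds checks are elided for every access in this block
        s += x[1]
        s += x[2]
        s += x[3]
    end
    return s
end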
Here is an example with both @inbounds
and @simd
markup:
function inner(x, y)
    s = zero(eltype(x))
    for i = 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

function timeit(n, reps)
    x = rand(Float32, n)
    y = rand(Float32, n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s += inner(x, y)
    end
    println("GFlop/sec        = ", 2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s += innersimd(x, y)
    end
    println("GFlop/sec (SIMD) = ", 2.0*n*reps/time*1E-9)
end

timeit(1000, 1000)
On a computer with a 2.4GHz Intel Core i5 processor, this produces:
GFlop/sec        = 1.9467069505224963
GFlop/sec (SIMD) = 17.578554163920018
(GFlop/sec
measures the performance, and larger numbers are better.)
The range for a @simd for loop should be a one-dimensional range.
loop should be a one-dimensional range.
A variable used for accumulating, such as s
in the example, is called
a reduction variable. By using @simd
, you are asserting several
properties of the loop:
- It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
- Floating-point operations on reduction variables can be reordered, possibly causing different results than without @simd.
. - No iteration ever waits on another iteration to make forward progress.
A loop containing break
, continue
, or @goto
will cause a
compile-time error.
Using @simd
merely gives the compiler license to vectorize. Whether
it actually does so depends on the compiler. To actually benefit from the
current implementation, your loop should have the following additional
properties:
- The loop must be an innermost loop.
- The loop body must be straight-line code. This is why @inbounds is currently needed for all array accesses. The compiler can sometimes turn short &&, ||, and ?: expressions into straight-line code, if it is safe to evaluate all operands unconditionally. Consider using ifelse() instead of ?: in the loop if it is safe to do so.
- Accesses must have a stride pattern and cannot be “gathers” (random-index reads) or “scatters” (random-index writes).
- The stride should be unit stride.
- In some simple cases, for example with 2-3 arrays accessed in a loop, the
LLVM auto-vectorization may kick in automatically, leading to no further
speedup with @simd.
Here is an example with all three kinds of markup. This program first calculates the finite difference of a one-dimensional array, and then evaluates the L2-norm of the result:
function init!(u)
    n = length(u)
    dx = 1.0 / (n-1)
    @fastmath @inbounds @simd for i in 1:n
        u[i] = sin(2pi*dx*i)
    end
end

function deriv!(u, du)
    n = length(u)
    dx = 1.0 / (n-1)
    @fastmath @inbounds du[1] = (u[2] - u[1]) / dx
    @fastmath @inbounds @simd for i in 2:n-1
        du[i] = (u[i+1] - u[i-1]) / (2*dx)
    end
    @fastmath @inbounds du[n] = (u[n] - u[n-1]) / dx
end

function norm(u)
    n = length(u)
    T = eltype(u)
    s = zero(T)
    @fastmath @inbounds @simd for i in 1:n
        s += u[i]^2
    end
    @fastmath @inbounds return sqrt(s/n)
end

function main()
    n = 2000
    u = Array{Float64}(n)
    init!(u)
    du = similar(u)

    deriv!(u, du)
    nu = norm(du)

    @time for i in 1:10^6
        deriv!(u, du)
        nu = norm(du)
    end

    println(nu)
end

main()
On a computer with a 2.7 GHz Intel Core i7 processor, this produces:
$ julia wave.jl
elapsed time: 1.207814709 seconds (0 bytes allocated)

$ julia --math-mode=ieee wave.jl
elapsed time: 4.487083643 seconds (0 bytes allocated)
Here, the option --math-mode=ieee
disables the @fastmath
macro, so that we can compare results.
In this case, the speedup due to @fastmath
is a factor of about
3.7. This is unusually large – in general, the speedup will be
smaller. (In this particular example, the working set of the benchmark
is small enough to fit into the L1 cache of the processor, so that
memory access latency does not play a role, and computing time is
dominated by CPU usage. In many real world programs this is not the
case.) Also, in this case this optimization does not change the result
– in general, the result will be slightly different. In some cases,
especially for numerically unstable algorithms, the result can be very
different.
The annotation @fastmath
re-arranges floating point
expressions, e.g. changing the order of evaluation, or assuming that
certain special cases (inf, nan) cannot occur. In this case (and on
this particular computer), the main difference is that the expression
1/(2*dx)
in the function deriv! is hoisted out of the loop
(i.e. calculated outside the loop), as if one had written idx=1/(2*dx)
. In the loop, the expression .../(2*dx)
then becomes
...*idx
, which is much faster to evaluate. Of course, both the
actual optimization that is applied by the compiler as well as the
resulting speedup depend very much on the hardware. You can examine
the change in generated code by using Julia’s code_native()
function.
Treat Subnormal Numbers as Zeros¶
Subnormal numbers, formerly called denormal numbers,
are useful in many contexts, but incur a performance penalty on some hardware.
A call set_zero_subnormals(true)
grants permission for floating-point operations to treat subnormal
inputs or outputs as zeros, which may improve performance on some hardware.
A call set_zero_subnormals(false)
enforces strict IEEE behavior for subnormal numbers.
Below is an example where subnormals noticeably impact performance on some hardware:
function timestep{T}(b::Vector{T}, a::Vector{T}, Δt::T)
    @assert length(a)==length(b)
    n = length(b)
    b[1] = 1                            # Boundary condition
    for i = 2:n-1
        b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt
    end
    b[n] = 0                            # Boundary condition
end

function heatflow{T}(a::Vector{T}, nstep::Integer)
    b = similar(a)
    for t = 1:div(nstep,2)              # Assume nstep is even
        timestep(b, a, T(0.1))
        timestep(a, b, T(0.1))
    end
end

heatflow(zeros(Float32,10), 2)          # Force compilation
for trial = 1:6
    a = zeros(Float32, 1000)
    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a, 1000)
end
This example generates many subnormal numbers because the values in a
become
an exponentially decreasing curve, which slowly flattens out over time.
Treating subnormals as zeros should be used with caution, because doing so
breaks some identities, such as x-y==0
implies x==y
:
julia> x = 3f-38; y = 2f-38;

julia> set_zero_subnormals(false); (x-y, x==y)
(1.0000001f-38,false)

julia> set_zero_subnormals(true); (x-y, x==y)
(0.0f0,false)
In some applications, an alternative to zeroing subnormal numbers is
to inject a tiny bit of noise. For example, instead of
initializing a
with zeros, initialize it with:
a = rand(Float32, 1000) * 1.f-9
@code_warntype¶
The macro @code_warntype
(or its function variant code_warntype()
)
can sometimes be helpful in diagnosing type-related problems. Here’s an
example:
pos(x) = x < 0 ? 0 : x

function f(x)
    y = pos(x)
    sin(y*x+1)
end

julia> @code_warntype f(3.2)
Variables:
  x::Float64
  y::UNION(INT64,FLOAT64)
  _var0::Float64
  _var3::Tuple{Int64}
  _var4::UNION(INT64,FLOAT64)
  _var1::Float64
  _var2::Float64

Body:
  begin  # none, line 2:
      _var0 = (top(box))(Float64,(top(sitofp))(Float64,0))
      unless (top(box))(Bool,(top(or_int))((top(lt_float))(x::Float64,_var0::Float64)::Bool,(top(box))(Bool,(top(and_int))((top(box))(Bool,(top(and_int))((top(eq_float))(x::Float64,_var0::Float64)::Bool,(top(lt_float))(_var0::Float64,9.223372036854776e18)::Bool)),(top(slt_int))((top(box))(Int64,(top(fptosi))(Int64,_var0::Float64)),0)::Bool)))) goto 1
      _var4 = 0
      goto 2
      1:
      _var4 = x::Float64
      2:
      y = _var4::UNION(INT64,FLOAT64) # line 3:
      _var1 = y::UNION(INT64,FLOAT64) * x::Float64::Float64
      _var2 = (top(box))(Float64,(top(add_float))(_var1::Float64,(top(box))(Float64,(top(sitofp))(Float64,1))))
      return (GlobalRef(Base.Math,:nan_dom_err))((top(ccall))($(Expr(:call1,:(top(tuple)),"sin",GlobalRef(Base.Math,:libm))),Float64,$(Expr(:call1,:(top(tuple)),:Float64)),_var2::Float64,0)::Float64,_var2::Float64)::Float64
  end::Float64
Interpreting the output of @code_warntype
, like that of its cousins
@code_lowered
, @code_typed
, @code_llvm
, and
@code_native
, takes a little practice. Your
code is being presented in a form that has been partially digested on
its way to generating compiled machine code. Most of the expressions
are annotated by a type, indicated by the ::T
(where T
might
be Float64
, for example). The most important characteristic of
@code_warntype
is that non-concrete types are displayed in red; in
the above example, such output is shown in all-caps.
The top part of the output summarizes the type information for the different
variables internal to the function. You can see that y
, one of the
variables you created, is a Union{Int64,Float64}
, due to the
type-instability of pos
. There is another variable, _var4
, which you
can see also has the same type.
The next lines represent the body of f
. The lines starting with a
number followed by a colon (1:
, 2:
) are labels, and represent
targets for jumps (via goto
) in your code. Looking at the body,
you can see that pos
has been inlined into f
—everything
before 2:
comes from code defined in pos
.
Starting at 2:
, the variable y
is defined, and again annotated
as a Union
type. Next, we see that the compiler created the
temporary variable _var1
to hold the result of y*x
. Because
a Float64
times either an Int64
or Float64
yields a
Float64
, all type-instability ends here. The net result is that
f(x::Float64)
will not be type-unstable in its output, even if some of the
intermediate computations are type-unstable.
How you use this information is up to you. Obviously, it would be far
and away best to fix pos
to be type-stable: if you did so, all of
the variables in f
would be concrete, and its performance would be
optimal. However, there are circumstances where this kind of
ephemeral type instability might not matter too much: for example,
if pos
is never used in isolation, the fact that f
‘s output
is type-stable (for Float64
inputs) will shield later code from
the propagating effects of type instability. This is particularly
relevant in cases where fixing the type instability is difficult or
impossible: for example, currently it’s not possible to infer the
return type of an anonymous function. In such cases, the tips above
(e.g., adding type annotations and/or breaking up functions) are your
best tools to contain the “damage” from type instability.
The following examples may help you interpret expressions marked as containing non-leaf types:
- Function body ending in end::Union{T1,T2}
  - Interpretation: function with unstable return type
  - Suggestion: make the return value type-stable, even if you have to annotate it
- f(x::T)::Union{T1,T2}
  - Interpretation: call to a type-unstable function
  - Suggestion: fix the function, or if necessary annotate the return value
- (top(arrayref))(A::Array{Any,1},1)::Any
  - Interpretation: accessing elements of poorly-typed arrays
  - Suggestion: use arrays with better-defined types, or if necessary annotate the type of individual element accesses
- (top(getfield))(A::ArrayContainer{Float64},:data)::Array{Float64,N}
  - Interpretation: getting a field that is of non-leaf type. In this case, ArrayContainer had a field data::Array{T}. But Array needs the dimension N, too, to be a concrete type.
  - Suggestion: use concrete types like Array{T,3} or Array{T,N}, where N is now a parameter of ArrayContainer