Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
Is the timing of MATLAB reliable? If yes, can we reproduce the performance with julia, fortran, etc.?

Ask Question
Originally this is a problem coming up in mathematica.SE , but since multiple programming languages have involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
s = 15000;
% for-loop version
H = zeros(s,s);
for c = 1:s
    for r = 1:s
        H(r,c) = 1/(r+c-1);
%Elapsed time is 1.359625 seconds.... For-loop 
% vectorized version
c = 1:s;
r = c';
HH=1./(r+c-1);
%Elapsed time is 0.047916 seconds.... Vectorized
isequal(H,HH)
the vectorized code piece is more than 25 times faster than the pure for-loop code piece. Though I don't have access to MATLAB so cannot test the sample myself, the timing 1.359625 seems to suggest it's tested on an average PC, just as mine.
But I cannot reproduce the timing with other languages like fortran or julia! (We know, both of them are famous for their performance of numeric calculation. Well, I admit I'm by no means an expert of fortran or julia. )
The followings are the samples I used for test. I'm using a laptop with i7-8565U CPU, Win 10.
fortran
fortran code is compiled with gfortran (TDM-GCC-10.3.0-2, with compile option -Ofast).
program tst
use, intrinsic :: iso_fortran_env
implicit none
integer,parameter::s=15000
integer::r,c
real(real64)::hmn(s,s)
do r=1,s
    do c=1, s
        hmn(r,c)=1._real64/(r + c - 1)
    end do 
end do
print *, hmn(s,s)
end program
compilation timing: 0.2057823 seconds

execution timing: 0.7179657 seconds
julia
Version of julia is 1.6.3.
@time (s=15000; Hmm=[1. /(r+c-1) for r=1:s,c=1:s];)
Timing: 0.7945998 seconds
Here comes the question:
Is the timing of MATLAB reliable?
If the answer to 1st question is yes, then how can we reproduce the performance (for 2 GHz CPU, the timing should be around 0.05 seconds) with julia, fortran, or any other programming languages?
                tic/toc could be replaced with timeit() for more robust benchmarking, but that just averages multiple runs. Other than that "reliable" is subjective, sure tic/toc time stuff, what are you trying to determine about the reliability?  Why do you expect a given execution time for one language based on the execution time of other languages? FWIW I can reproduce the MATLAB timings, the vectorised version is significantly faster - not sure why you wouldn't expect that?
– Wolfie
                Dec 30, 2021 at 12:05
                @Wolfie Well, sorry, I'm not sure if I've understood your question, but MATLAB, fortran, julia are all famous for numeric calculations, and all the samples are for the same purpose. (Generate a matrix whose general term is 1. /(r+c-1). ) I think it's natural to expect that their performances are about the same when the samples are fully optimized?
– xzczd
                Dec 30, 2021 at 12:10
                @francescalus Er… why is the tag [fortran] removed? I did test the problem with fortran and included a fortran sample in the question.
– xzczd
                Dec 30, 2021 at 12:32
                "No" would be subjective as I said before I think? tic/toc is reliably measuring execution time but we must understand what is happening during execution. Then the answer to your second question is simply you can reproduce the performance benchmarks by running MATLAB from a "cold-start" with a clear workspace
– Wolfie
                Dec 30, 2021 at 12:49
Just to add on the Julia side - make sure you use BenchmarkToolsto benchmark, wrap the code you want to benchmark in functions so as not to benchmark in global scope, and interpolate any variables you pass to @btime.
Here's how I would do it:
julia> s = 15_000;
julia> function f_loop!(H)
           for c ∈ 1:size(H, 1)
               for r ∈ 1:size(H, 1)
                   H[r, c] = 1 / (r + c - 1)
f_loop! (generic function with 1 method)
julia> function f_vec!(H)
           c = 1:size(H, 1)
           r = c'
           H .= 1 ./ (r .+ c .- 1)
f_vec! (generic function with 1 method)
julia> H = zeros(s, s);
julia> using BenchmarkTools
julia> @btime f_loop!($H);
  625.891 ms (0 allocations: 0 bytes)
julia> H = zeros(s, s);
julia> @btime f_vec!($H);
  625.248 ms (0 allocations: 0 bytes)
So both versions come in at the same time, which is what I'd expect for such a straightforward operation where a properly type-inferred code should compile down to roughly the same machine code.
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example, however
On first run (clear workspace)
Loop approach takes 2.08 sec
Vectorised approach takes 1.04 sec
Vectorisation saves 50% execution time
On second run (workspace not cleared)
Loop approach takes 2.55 sec
Vectorised approach takes 0.065 sec
Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs.
However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
Test Num  |  Workspace cleared  |    s    |  Loop (sec)  |  Vectorised (sec) 
    1     |       Yes           |  15000  |    2.10      |       1.41
    2     |        No           |  15000  |    2.73      |       0.07
    3     |        No           |  15000  |    2.50      |       0.07
    4     |        No           |  15001  |    2.74      |       1.73
See the variation between tests 2 and 3, this is why timeit would have been helpful for an average runtime (see footnote). The difference in output sizes between tests 3 and 4 are pretty small, but the execution time returns to a similar magnitude of that in test 1 for the vectorised approach, suggesting that the re-allocation to create HH costs most of the time.
Footnote: tic/toc timings in MATLAB can be improved by using the in-built timeit function, which essentially takes an average over several runs. One interesting thing to observe from the workings of timeit though is that it explicitly "warms up" (quoting a comment) the tic/toc function by calling it a couple of times. You can see when running tic/toc a few times from a clear workspace (with no intermediate code) that the first call takes longer than subsequent calls, as there must be some overhead for getting the timer initialised.
                Did you put the code in a function M-file? Or are you copy-pasting to the command window? The two will give wildly different timings because the JIT doesn’t get used in the second case. Also, given the large fluctuation in timing, it looks to me that you’re seeing the effect of other stuff happening on your system at the same time. This is one of the reasons to use timeit, a single timing is meaningless.
– Cris Luengo
                Dec 30, 2021 at 15:15
                @Cris I get pretty comparable timing running it either way, of course splitting the calculation options into their own m-files has the effect of giving each a clear private workspace so you don't see the memory-related speed jump on subsequent runs, otherwise I can reproduce it when using a function, script, or "CTRL+Enter" to run section. Indeed I would recommend timeit, I think my point here was more for orders of magnitude than a precise method comparison, hopefully even with the fluctuations the points stand (i.e. approx. 2.5sec vs 1.5sec vs 0.1sec). Ran many times while testing.
– Wolfie
                Dec 30, 2021 at 16:12
I hope that the following modified benchmark could bring some new light to the problem:
s = 15000;
% for-loop version
H = zeros(s,s);
for i =1:10
  for c = 1:s
    for r = 1:s
        H(r,c) = H(r,c) + 1/(r+c-1+i);
% vectorized version
HH = zeros(s,s);
c = 1:s;
r = c';
for i=1:10
   HH= HH + 1./(r+c-1+i);
isequal(H,HH)
In this case any kind of "cashing" is avoided by changing of matrix H (HH) at each for-loop (over "i") iteration.
In this case we get:
Elapsed time is 3.737275 seconds. (for-loop)
Elapsed time is 1.143387 seconds. (vectorized)
So, there is still performance improvement (~ 3x) due to the vectorization, which is probably done by implicit multi-threading implementation of vectorized Matlab commands.
Yes, tic/toc vs timeit is not strictly consistent, but the overall timing functionality is very similar.
                This new benchmark conflates two things - the initial memory allocation and the element-wise operation to get each value. My answer showed that the memory allocation was a pretty big overhead for the vec approach, and you're only doing that once in each case despite doing the element-wise calculation 10 times each. You could move the i loop around the entire benchmark (since all that does is give us multiple tests to average) and immediately inside the loop use clearvars -except i to clear all variables in the calcs. This gives results much more in line with mine - approx a 2x difference.
– Wolfie
                Dec 30, 2021 at 14:10
To add to this, here is a simple python script which does the vectorized operation with numpy:
from timeit import default_timer
import numpy as np
s = 15000
start = default_timer()
# for-loop
H = np.zeros([s, s])
for c in range(1, s):
    for r in range(1, s):
        H[r, c] = 1 / (r + c - 1)
end = default_timer()
print(end - start)
start = default_timer()
# vectorized
c = np.arange(1, s).reshape([1, -1])
r = c.T
HH = 1 / (c + r - 1)
end = default_timer()
print(end - start)
for-loop:   32.94566780002788 seconds

vectorized: 0.494859800033737 seconds
While the for-loop version is terribly slow, the vectorized version is faster than the posted fortran/julia times. Numpy internally tries to use special SIMD hardware instructions to speed up arithmetic on vectors, which can make a significant difference. It's possible that the fortran/julia compilers weren't able to generate those instructions from the provided code, but numpy/matlab were able to. However, Matlab is still about 10x faster than the numpy code, which I don't think would be explained by better use of SIMD instructions. Instead, they may also be using multiple threads to parallelize the computation, since the matrix is fairly large.
Ultimately, I think the matlab numbers are plausible, but I'm not sure exactly how they're getting their speedup.
                The timing of your python sample on my laptop is, for-loop: around 60 seconds, vectorized: around 1.1 seconds. It doesn't beat fortran/julia. (Wow, you have a really fast CPU! )
– xzczd
                Jan 5, 2022 at 8:48
                Numpy is a thin wrapper around BLAS. Where BLAS is used in Julia or Fortran I would expect similar speeds.
– Francis King
                Jan 17, 2022 at 15:14
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.