XClose

COMP0233: Research Software Engineering With Python

Home
Menu

Miscellaneous libraries to improve your code's performance

Over the course of years, Python ecosystem has developed more and more solutions to combat the relatively slow performance of the language. As seen in the previous lessons, the speed of computation matters quite a lot while using a technology in scientific setting. This lesson introduces a few such popular libraries that have been widely adopted in the scientific use of Python.

Compiling Python code

We know that Python is an interpreted and not a compiled language, but is there a way to compile Python code? There are a few libraries/frameworks that lets you just in time (JIT) or ahead of time (AOT) compile Python code. Both the techniques allow users to compile their Python code without using any explicit low level langauge's bindings, but both the techniques are different from each other.

Just in time compilers compile functions/methods at runtime whereas ahead of time compilers compile the entire code before runtime. AOT can do much more compiler optimizations than JIT, but AOT compilers produce huge binaries and that too only for specific platforms. JIT compilers have limited optimization routines but they produce small and platform independent binaries.

JIT and AOT compilers for Python include, but are not limited to -

  • Numba: a Just-In-Time Compiler for Numerical Functions in Python
  • mypyc: compiles Python modules to C extensions using standard Python type hints
  • JAX: a Python library for accelerator-oriented array computation and program transformation

Although all of them were built to speed-up Python code, each one of them is a bit different from each other when it comes to their use-case. We will specifically look into Numba in this lesson.

Numba

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code. The good thing about Numba is that it works on plain Python loops and one doesn't need to configure any compilers manually to JIT compile Python code. Another good thing is that it understands NumPy natively, but as detailed in its documentation, it only understands a subset of NumPy functionalities.

Numba provides users with an easy to use function decorator - jit. Let's start by importing the libraries we will use in this lesson.

In [1]:
import math

from numba import jit, njit
import numpy as np

We can mark a function to be JIT compiled by decorating it with numba's @jit. The decorator takes a nopython argument that tells Numba to enter the compilation mode and not fall back to usual Python execution.

Here, we are showing a usual python function, and one that's decorated. We don't need to duplicate and change its name when using numba, but we want to keep both of them here to compare their execution times.

In [2]:
def f(x):
    return np.sqrt(x)

@jit(nopython=True)
def jit_f(x):
    return np.sqrt(x)

It looks like the jit decorator should make our Numba compile our function and make it much faster than the non-jit version. Let's test this out.

Note that the first function call is not included while timing because that is where Numba compiles the function. The compilation at runtime is called just in time compilation and resultant binaries are cached for the future runs.

In [3]:
data = np.random.uniform(low=0.0, high=100.0, size=1_000)
In [4]:
%%timeit
f(data)
1.99 μs ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [5]:
%%timeit -n 1 -r 1
_ = jit_f(data)  # compilation and run
506 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [6]:
%%timeit
jit_f(data)  # just run
1.69 μs ± 2.48 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Surprisingly, the JITted function was slower than plain Python and NumPy implementation! Why did this happen? Numba does not provide valuable performance gains over pure Python or NumPy code for simple operations and small dataset. The JITted function turned out to be slower than the non-JIT implementation because of the compilation overhead. Note that the result from the compilation run could be very noisy and could give a higher than real value, as mentioned in the previous lessons. Let's try increasing the size of our data and perform a non-NumPy list comprehension on the data.

The jit decorator with nopython=True is so widely used there exists an alias decorator for the same - @njit

In [7]:
data = np.random.uniform(low=0.0, high=100.0, size=1_000_000)
In [8]:
def f(x):
    return [math.sqrt(elem) for elem in x]

@njit
def jit_f(x):
    return [math.sqrt(elem) for elem in x]
In [9]:
%%timeit
f(data)
115 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]:
%%timeit -n 1 -r 1
_ = jit_f(data)  # compilation and run
127 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [11]:
%%timeit
jit_f(data)  # just run
55.2 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That was way faster than the non-JIT function! But, the result was still slower than the NumPy implementation. NumPy is still good for relatively simple computations, but as the complexity increases, Numba functions start outperforming NumPy implementations.

Let's go back to our plain Python mandelbrot code from the previous lessons and JIT compile it -

In [12]:
@njit
def mandel1(position, limit=50):
    
    value = position
    
    while abs(value) < 2:
        limit -= 1        
        value = value**2 + position
        if limit < 0:
            return 0
        
    return limit

xmin = -1.5
ymin = -1.0
xmax = 0.5
ymax = 1.0
resolution = 300
xstep = (xmax - xmin) / resolution
ystep = (ymax - ymin) / resolution
xs = [(xmin + (xmax - xmin) * i / resolution) for i in range(resolution)]
ys = [(ymin + (ymax - ymin) * i / resolution) for i in range(resolution)]
In [13]:
%%timeit -n 1 -r 1
data = [[mandel1(complex(x, y)) for x in xs] for y in ys]  # compilation and run
183 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [14]:
%%timeit
data = [[mandel1(complex(x, y)) for x in xs] for y in ys]  # just run
63.2 ms ± 134 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The compiled code already beats our fastest NumPy implementation! It is not necessary the compiled code will perform better than NumPy code, but it usually gives performance gains for signifantly large computations. As always, it is good to measure the performance to check if there are any gains.

Let's try JITting our NumPy code.

In [15]:
@njit
def mandel_numpy(position,limit=50):
    value = position
    diverged_at_count = np.zeros(position.shape)
    while limit > 0:
        limit -= 1
        value = value**2 + position
        diverging = value * np.conj(value) > 4
        first_diverged_this_time = np.logical_and(diverging, diverged_at_count == 0)
        diverged_at_count[first_diverged_this_time] = limit
        value[diverging] = 2
        
    return diverged_at_count

ymatrix, xmatrix = np.mgrid[ymin:ymax:ystep, xmin:xmax:xstep]
values = xmatrix + 1j * ymatrix
In [16]:
%%timeit
mandel_numpy(values)
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
Cell In[16], line 1
----> 1 get_ipython().run_cell_magic('timeit', '', 'mandel_numpy(values)\n')

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/IPython/core/interactiveshell.py:2541, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
   2539 with self.builtin_trap:
   2540     args = (magic_arg_s, cell)
-> 2541     result = fn(*args, **kwargs)
   2543 # The code below prevents the output from being displayed
   2544 # when using magics with decorator @output_can_be_silenced
   2545 # when the last Python token in the expression is a ';'.
   2546 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/IPython/core/magics/execution.py:1185, in ExecutionMagics.timeit(self, line, cell, local_ns)
   1183 for index in range(0, 10):
   1184     number = 10 ** index
-> 1185     time_number = timer.timeit(number)
   1186     if time_number >= 0.2:
   1187         break

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/IPython/core/magics/execution.py:173, in Timer.timeit(self, number)
    171 gc.disable()
    172 try:
--> 173     timing = self.inner(it, self.timer)
    174 finally:
    175     if gcold:

File <magic-timeit>:1, in inner(_it, _timer)

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/numba/core/dispatcher.py:423, in _DispatcherBase._compile_for_args(self, *args, **kws)
    419         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    420                f"by the following argument(s):\n{args_str}\n")
    421         e.patch_message(msg)
--> 423     error_rewrite(e, 'typing')
    424 except errors.UnsupportedError as e:
    425     # Something unsupported is present in the user code, add help info
    426     error_rewrite(e, 'unsupported_error')

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/numba/core/dispatcher.py:364, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    362     raise e
    363 else:
--> 364     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function setitem>) found for signature:
 
 >>> setitem(array(float64, 2d, C), array(bool, 2d, C), int64)
 
There are 16 candidate implementations:
     - Of which 14 did not match due to:
     Overload of function 'setitem': File: <numerous>: Line N/A.
       With argument(s): '(array(float64, 2d, C), array(bool, 2d, C), int64)':
      No match.
     - Of which 2 did not match due to:
     Overload in function 'SetItemBuffer.generic': File: numba/core/typing/arraydecl.py: Line 221.
       With argument(s): '(array(float64, 2d, C), array(bool, 2d, C), int64)':
      Rejected as the implementation raised a specific error:
        NumbaTypeError: Multi-dimensional indices are not supported.
  raised from /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/numba/core/typing/arraydecl.py:91

During: typing of setitem at /tmp/ipykernel_12413/2672903129.py (10)

File "../../../../../../tmp/ipykernel_12413/2672903129.py", line 10:
<source missing, REPL/exec in use?>

That does not work. The error might be solvable or it might just be out of Numba's scope. Numba does not distinguish between plain Python and NumPy; hence both the implementations are broken down to the same machine code. Therefore, for Numba, it makes sense to write complex computations with the ease of Python loops and lists than with NumPy functions. Moreover, Numba only understands a subset of Python and NumPy so it is possible that a NumPy snippet does not work but a simplified Python loop does.

Let's make minor adjustments to fix the NumPy implementation and measure its performance. We flatten the NumPy arrays and consider only the real part while performing the comparison.

In [17]:
@njit
def mandel_numpy(position,limit=50):
    value = position.flatten()
    diverged_at_count = np.zeros(position.shape).flatten()
    while limit > 0:
        limit -= 1
        value = value**2 + position.flatten()
        diverging = (value * np.conj(value)).real > 4
        first_diverged_this_time = (np.logical_and(diverging, diverged_at_count == 0))
        diverged_at_count[first_diverged_this_time] = limit
        value[diverging] = 2

    return diverged_at_count.reshape(position.shape)

ymatrix, xmatrix = np.mgrid[ymin:ymax:ystep, xmin:xmax:xstep]
values = xmatrix + 1j * ymatrix
In [18]:
%%timeit -n 1 -r 1
mandel_numpy(values)  # compilation and run
879 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [19]:
%%timeit
mandel_numpy(values)  # just run
33.7 ms ± 60.1 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The code performs similar to the plain Python example!

Numba also has functionalities to vectorize, parallelize, and strictly type check the code. All of these functions boost the performance even further or helps Numba to avoid falling back to the "object" mode (nopython=False). Refer to Numba's documentation for a complete list of features.

Numba support in Scientific Python ecosystem

Most of the scientific libraries nowadays ship Numba support with them. For example, Awkward Array is a library that lets users perfor JIT compilable operations on non-uniform data, something that NumPy misses. Similarly, Akimbo provides JIT compilable processing of nested, non-uniform data in dataframes. There are other niche libraries, such as, vector that provides Numba support for high energy physics vector.

Compiled Python bindings

Several Python libraries/frameworks are written in a compiled language but provide Python bindings for their compiled code. This code does not have to be explicitly marked to be compiled by developers. A few Python libraries that have their core written in compiled languages but provide Python bindings -

  • Cython: superset of Python that supports calling C functions and declaring C types on variables and class attributes
  • Scipy: wraps highly-optimized scientific implementations written in low-level languages like Fortran, C, and C++
  • Polars: dataframes powered by a multithreaded, vectorized query engine, written in Rust
  • boost-histogram: Python bindings for the C++14 Boost::Histogram library
  • Awkward Array: compiled and fast manipulation of JSON-like data with NumPy-like idioms
  • Astropy: astronomy and astrophysics core library

JIT or AOT compiling libraries and the libraries providing Python bindings for a compiled language are not mutually exclusive. For instance, PyTorch and Tensorflow offer users "eager" and "graph" executation. Eager execution builds computational graph at runtime, making it slow, easy to debug, and JIT compilable. On the other hand, "graph" execution builds the computational graph at the kernel level, making it fast, hard to debug, and with no need to JIT compile. Similarly, Awkward Array provides Python bindings for array creation, but high-level operations on these arrays can be JIT compiled.

We will specifically look into Cython in the next lesson.

Distributed and parallel computing with Python

Along with the speed of execution of a program, scientific computation usually requires a large amount of memory because of the massive nature of data produced at scientific experiments. This issue can be addressed by employing distributed and parallel computing frameworks to spread the computation across multiple machines or multiple cores in a single machine. Parallel computing at a smaller level can be achieved by distributing a task over multiple threads, but Python's GIL (Global Interpreter Lock) blocks the interpretor from doing this for computational tasks. There are ongoing efforts, such as the implementation of PEP 703 to make GIL optional, but Numba allows users to bypass GIL by pass nogil=True to @jit.

At a larger scale, Dask can be used to parallelize computation over multiple machine or "nodes". Dask can even parallelize the code marked with @numba.jit(nogil=True) to multiple threads in a machine, but it does not itself bypass the GIL.

Dask

Dask is a Python library for parallel and distributed computing. Dask supports arrays, dataframes, Python objects, as well as tasks on computer clusters and nodes. The API of Dask is very similar to that of NumPy, Pandas, and plain Python, allowing users to quickly parallelize their implementations. Let's create a dummy dataframe using dask.

In [20]:
import dask

df = dask.datasets.make_people()  # returns a dask "bag"
df = df.to_dataframe()
In [21]:
df
# The computation gave us a dask object and not the actual answer. Why is that?
# Displaying the dataframe just displays the metadata of the variable, and not any
# data. This is because of the "lazy" nature of dask. Dask has "lazy" execution,
# which means that it will store the operations on the data and
# create a task graph for the same, but will not execute the operations until a
# user explicitly asks for the result. The metadata specifies `npartitions=10`,
# which means that the dataframe is split into 10 parts that will be accessed parallely.
# We can explicitly tell dask to give us the dataframe values using `.compute()`.
Out[21]:
Dask DataFrame Structure:
age name occupation telephone address credit-card
npartitions=10
int64 string string string string string
... ... ... ... ... ...
... ... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Dask Name: to_pyarrow_string, 1 expression
In [22]:
df.compute()
Out[22]:
age name occupation telephone address credit-card
0 45 ('Maia', 'Payne') Hardware Dealer +1-605-014-6242 {'address': '721 Martinez Extension', 'city': ... {'number': '2390 7408 2475 9349', 'expiration-...
1 42 ('Jarrod', 'Hawkins') Statistician +1-208-996-2055 {'address': '568 Ada Turnpike', 'city': 'Coral... {'number': '3723 469750 19735', 'expiration-da...
2 89 ('Katharyn', 'Munoz') Insurance Assessor +1-212-206-5280 {'address': '570 Sandpiper Cove Motorway', 'ci... {'number': '3454 333440 88536', 'expiration-da...
3 20 ('Marilou', 'Mccray') Health Advisor +13363524926 {'address': '476 Judah Run', 'city': 'Loveland'} {'number': '5453 9284 7658 4746', 'expiration-...
4 59 ('Ramiro', 'Mcclure') Remedial Therapist +15087806013 {'address': '1219 Huron Extension', 'city': 'D... {'number': '4932 7965 9218 0291', 'expiration-...
... ... ... ... ... ... ...
995 44 ('Toby', 'Spencer') Church Warden +16069230460 {'address': '334 Magnolia Alley', 'city': 'St.... {'number': '5586 2069 0457 0888', 'expiration-...
996 73 ('Johnny', 'Frye') Studio Manager +18509329659 {'address': '1231 Richter Bend', 'city': 'Port... {'number': '4743 1282 8431 8873', 'expiration-...
997 10 ('Angila', 'Mcdowell') Medical Assistant +15415352742 {'address': '1225 Loraine Station', 'city': 'A... {'number': '2718 2485 4173 5639', 'expiration-...
998 17 ('Nathanael', 'Guthrie') Bacon Curer +1-914-915-2091 {'address': '951 Powhattan Field', 'city': 'Ne... {'number': '4008 7787 5322 9421', 'expiration-...
999 105 ('Ward', 'Woodward') Store Detective +15405468978 {'address': '1005 Daniel Burnham Glen', 'city'... {'number': '5203 7768 8217 3189', 'expiration-...

10000 rows × 6 columns

We can now perform pandas like operation on our dataframe and let dask create a task graph for the same.

In [23]:
new_df = df.groupby("occupation").age.max()
new_df
Out[23]:
Dask Series Structure:
npartitions=1
    int64
      ...
Dask Name: max, 3 expressions
Expr=Max(frame=FromGraph(74ef6e8)[['age', 'occupation']], observed=False, chunk_kwargs={'numeric_only': False}, aggregate_kwargs={'numeric_only': False}, _slice='age')

Instead of computing, let's visualize the task graph.

In [24]:
dask.visualize(new_df, filename="visualization.png")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/backend/execute.py:76, in run_check(cmd, input_lines, encoding, quiet, **kwargs)
     75         kwargs['stdout'] = kwargs['stderr'] = subprocess.PIPE
---> 76     proc = _run_input_lines(cmd, input_lines, kwargs=kwargs)
     77 else:

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/backend/execute.py:96, in _run_input_lines(cmd, input_lines, kwargs)
     95 def _run_input_lines(cmd, input_lines, *, kwargs):
---> 96     popen = subprocess.Popen(cmd, stdin=subprocess.PIPE, **kwargs)
     98     stdin_write = popen.stdin.write

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/subprocess.py:1026, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize, process_group)
   1023             self.stderr = io.TextIOWrapper(self.stderr,
   1024                     encoding=encoding, errors=errors)
-> 1026     self._execute_child(args, executable, preexec_fn, close_fds,
   1027                         pass_fds, cwd, env,
   1028                         startupinfo, creationflags, shell,
   1029                         p2cread, p2cwrite,
   1030                         c2pread, c2pwrite,
   1031                         errread, errwrite,
   1032                         restore_signals,
   1033                         gid, gids, uid, umask,
   1034                         start_new_session, process_group)
   1035 except:
   1036     # Cleanup if the child failed starting.

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/subprocess.py:1955, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session, process_group)
   1954 if err_filename is not None:
-> 1955     raise child_exception_type(errno_num, err_msg, err_filename)
   1956 else:

FileNotFoundError: [Errno 2] No such file or directory: PosixPath('dot')

The above exception was the direct cause of the following exception:

ExecutableNotFound                        Traceback (most recent call last)
Cell In[24], line 1
----> 1 dask.visualize(new_df, filename="visualization.png")

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/dask/base.py:761, in visualize(filename, traverse, optimize_graph, maxval, engine, *args, **kwargs)
    757 args, _ = unpack_collections(*args, traverse=traverse)
    759 dsk = dict(collections_to_dsk(args, optimize_graph=optimize_graph))
--> 761 return visualize_dsk(
    762     dsk=dsk,
    763     filename=filename,
    764     traverse=traverse,
    765     optimize_graph=optimize_graph,
    766     maxval=maxval,
    767     engine=engine,
    768     **kwargs,
    769 )

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/dask/base.py:896, in visualize_dsk(dsk, filename, traverse, optimize_graph, maxval, o, engine, limit, **kwargs)
    893 if engine == "graphviz":
    894     from dask.dot import dot_graph
--> 896     return dot_graph(dsk, filename=filename, **kwargs)
    897 elif engine in ("cytoscape", "ipycytoscape"):
    898     from dask.dot import cytoscape_graph

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/dask/dot.py:274, in dot_graph(dsk, filename, format, **kwargs)
    236 """
    237 Render a task graph using dot.
    238 
   (...)
    271 dask.dot.to_graphviz
    272 """
    273 g = to_graphviz(dsk, **kwargs)
--> 274 return graphviz_to_file(g, filename, format)

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/dask/dot.py:291, in graphviz_to_file(g, filename, format)
    288 if format is None:
    289     format = "png"
--> 291 data = g.pipe(format=format)
    292 if not data:
    293     raise RuntimeError(
    294         "Graphviz failed to properly produce an image. "
    295         "This probably means your installation of graphviz "
   (...)
    298         "issues/485 for more information."
    299     )

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/piping.py:104, in Pipe.pipe(self, format, renderer, formatter, neato_no_op, quiet, engine, encoding)
     55 def pipe(self,
     56          format: typing.Optional[str] = None,
     57          renderer: typing.Optional[str] = None,
   (...)
     61          engine: typing.Optional[str] = None,
     62          encoding: typing.Optional[str] = None) -> typing.Union[bytes, str]:
     63     """Return the source piped through the Graphviz layout command.
     64 
     65     Args:
   (...)
    102         '<?xml version='
    103     """
--> 104     return self._pipe_legacy(format,
    105                              renderer=renderer,
    106                              formatter=formatter,
    107                              neato_no_op=neato_no_op,
    108                              quiet=quiet,
    109                              engine=engine,
    110                              encoding=encoding)

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/_tools.py:171, in deprecate_positional_args.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    162     wanted = ', '.join(f'{name}={value!r}'
    163                        for name, value in deprecated.items())
    164     warnings.warn(f'The signature of {func.__name__} will be reduced'
    165                   f' to {supported_number} positional args'
    166                   f' {list(supported)}: pass {wanted}'
    167                   ' as keyword arg(s)',
    168                   stacklevel=stacklevel,
    169                   category=category)
--> 171 return func(*args, **kwargs)

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/piping.py:121, in Pipe._pipe_legacy(self, format, renderer, formatter, neato_no_op, quiet, engine, encoding)
    112 @_tools.deprecate_positional_args(supported_number=2)
    113 def _pipe_legacy(self,
    114                  format: typing.Optional[str] = None,
   (...)
    119                  engine: typing.Optional[str] = None,
    120                  encoding: typing.Optional[str] = None) -> typing.Union[bytes, str]:
--> 121     return self._pipe_future(format,
    122                              renderer=renderer,
    123                              formatter=formatter,
    124                              neato_no_op=neato_no_op,
    125                              quiet=quiet,
    126                              engine=engine,
    127                              encoding=encoding)

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/piping.py:161, in Pipe._pipe_future(self, format, renderer, formatter, neato_no_op, quiet, engine, encoding)
    159     else:
    160         return raw.decode(encoding)
--> 161 return self._pipe_lines(*args, input_encoding=self.encoding, **kwargs)

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/backend/piping.py:161, in pipe_lines(engine, format, input_lines, input_encoding, renderer, formatter, neato_no_op, quiet)
    155 cmd = dot_command.command(engine, format,
    156                           renderer=renderer,
    157                           formatter=formatter,
    158                           neato_no_op=neato_no_op)
    159 kwargs = {'input_lines': (line.encode(input_encoding) for line in input_lines)}
--> 161 proc = execute.run_check(cmd, capture_output=True, quiet=quiet, **kwargs)
    162 return proc.stdout

File /opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/graphviz/backend/execute.py:81, in run_check(cmd, input_lines, encoding, quiet, **kwargs)
     79 except OSError as e:
     80     if e.errno == errno.ENOENT:
---> 81         raise ExecutableNotFound(cmd) from e
     82     raise
     84 if not quiet and proc.stderr:

ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH

We can see that the task graph starts with 10 independent branches because our dataframe was split into 10 partitions at the start. Let's compute the answer.

In [25]:
new_df.compute()
Out[25]:
occupation
Accountant           111
Accounts Clerk        80
Accounts Manager      94
Acoustic Engineer    118
Actor                116
                    ... 
Gun Smith             93
Jeweller             116
Genealogist           86
Lathe Operator        63
Presser              117
Name: age, Length: 1155, dtype: int64

Similarly, one can peform such computations on arrays and selected Python data structures.

Dask support in Scientific Python ecosystem

Similar to the adoption of Numba in the scientific Python ecosystem, dask is being adopted at an increasing rate. Some examples include dask-awkward for ragged data computation, dask-histogram for parallel filling of histograms dask support in cuDF, for parallelizing CUDA dataframes, and dask-sql, a distributed SQL Engine.

GPU accelerated computing in Python

The last method for speeding up Python that we will look at is running your code on GPUs instead of CPUs. GPUs are specifically built to speed up computational tasks such as doing linear algebra or experimenting with computer graphics. Support for GPUs can be provided by writing custom kernels or by using Pythonic GPU libraris such as CuPy.

CuPy

CuPy is an array-based library enabling Python code to be scaled to GPUs. CuPy's API is very similar to NumPy and SciPy, and it was built as their extension to GPUs. There are a few differences in the API, and even a few missing functions, but it is actively developed and in a stable state. CuPy requires CUDA or AMD ROCm to be installed on your machine. After the installation, majority of the NumPy and SciPy scripts can be converted to CuPy by just replacing numpy and scipy with cupy.

import cupy as cp

print(cp.cuda.runtime.getDeviceCount())  # check if CuPy identifies the CUDA GPU

x = cp.arange(6).reshape(2, 3).astype('f')
y = cp.arange(6).reshape(2, 3).astype('f')

result = x + y

The code is not executed in this lesson as the service where these lessons are built do not posses a GPU.

CuPy support in Scientific Python ecosystem

Similar to Numba and Dask, CuPy is being adopted in multiple scientific libraries to provide GPU capabilities to the users. Dask along with its exosystem libraries, such as dask-image provide CuPy and GPU support. Similarly, packages like Awkward Array, cupy-xarray, and pyxem use CuPy internally to offer GPU support.

Writing custom GPU kernels and binding them to Python

Some Python libraries write their own custom kernels to provide GPU support to the users. For instance, cuDF uses libcudf, a C++/CUDA dataframe library to provide GPU support for dataframes in Python. Similarly, Tensorflow and PyTorch have their own GPU kernels and they do not depend on CuPy to run their code on GPUs. One can even write custom GPU kernels in CuPy, and libraries like Awkward Array leverage that for ragged data computation.