NumPy¶

The scientific Python trilogy¶

Why is Python so popular for research work?

MATLAB has typically been the most popular "language of technical computing", with strong built-in support for efficient numerical analysis with matrices (the mat in MATLAB is for matrix, not maths), and plotting.

Other dynamic languages can be argued to have cleaner, more logical syntax, for example Ruby or Scheme.

But Python users developed three critical libraries, matching the power of MATLAB for scientific work:

Matplotlib, a plotting library for creating visualizations in Python.
NumPy, a fast numeric computing library offering a flexible n-dimensional array type.
IPython, an interactive Python interpreter that later led to the Jupyter notebook interface.

Who created those?

John D. Hunter created Matplotlib
Travis Oliphant is the primary creator of NumPy, founding contributor of SciPy, and he's also the founder of Anaconda
Fernando Perez created IPython

By combining a plotting library, a fast numeric library, and an easy-to-use interface allowing live plotting commands in a persistent environment, the powerful capabilities of MATLAB were matched by a free and open toolchain.

We've learned about Matplotlib and IPython in this course already. NumPy is the last part of the trilogy.

Limitations of Python lists¶

The standard Python list is inherently one dimensional. To make a matrix (two-dimensional array of numbers), we could create a list of lists:

In [1]:

x = [[row_index + col_index for col_index in range(5)] for row_index in range(5)]
x

Out[1]:

[[0, 1, 2, 3, 4],
 [1, 2, 3, 4, 5],
 [2, 3, 4, 5, 6],
 [3, 4, 5, 6, 7],
 [4, 5, 6, 7, 8]]

However, applying an operation to every element is a pain. We would like to be able to use an intuitive syntax like the following

In [2]:

x + 5

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 x + 5

TypeError: can only concatenate list (not "int") to list

As the + operator is used to perform concatenation for lists we instead have to use something more cumbersome such as the following.

In [3]:

[[elem + 5 for elem in row] for row in x]

Out[3]:

[[5, 6, 7, 8, 9],
 [6, 7, 8, 9, 10],
 [7, 8, 9, 10, 11],
 [8, 9, 10, 11, 12],
 [9, 10, 11, 12, 13]]

Common useful operations like transposing a matrix or reshaping a 10 by 10 matrix into a 20 by 5 matrix are not easy to perform with nested Python lists.
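As a rough illustration of why this is awkward, the sketch below transposes and reshapes a small nested list by hand. The variable names are illustrative only, and this is just one of several ways these operations could be written in plain Python.

```python
# A 2 by 3 matrix stored as a list of rows
matrix = [[row_index + col_index for col_index in range(3)] for row_index in range(2)]

# Transposing: unpack the rows into zip to iterate over the columns
transposed = [list(column) for column in zip(*matrix)]

# Reshaping to 3 by 2: flatten first, then regroup into rows of length 2
flat = [elem for row in matrix for elem in row]
reshaped = [flat[start:start + 2] for start in range(0, len(flat), 2)]

print(transposed)
print(reshaped)
```

Neither operation is hard, but both require the reader to decode a comprehension rather than read a named operation.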

Importing NumPy¶

The main NumPy application programming interface (API) is exposed via the top-level numpy module. This is not part of the Python standard library but instead needs to be separately installed, for example using a package manager such as pip or conda.

As we will typically need to access the names defined in the numpy module a lot when working with NumPy in code, it is common in practice to import numpy under the shorthand name np. While modern editing environments such as the JupyterLab interface have tab-completion functionality which makes the saving in keystrokes less important, the shorter name still has value in reducing line length and is a very common convention, so we will follow suit here.

In [4]:

import numpy as np

The NumPy array¶

NumPy's ndarray type represents a multidimensional grid of values of a shared type. In NumPy nomenclature each dimension is termed an axis and the tuple of sizes of all the dimensions of the array is termed the array shape. We can construct an ndarray using the np.array function. The first positional argument to this function should be an array-like Python object: this can be another array, or more commonly a (nested) sequence, with the constraint that sequences at each level must be of the same length. For example, to construct a one-dimensional array with 5 integer elements we can pass in a corresponding list of 5 integers.

In [5]:

my_array = np.array([0, 1, 2, 3, 4])
type(my_array)

Out[5]:

numpy.ndarray

The NumPy array seems at first to be very similar to a list:

In [6]:

my_array

Out[6]:

array([0, 1, 2, 3, 4])

In [7]:

my_array[2]

Out[7]:

np.int64(2)

In [8]:

for element in my_array:
    print("Hello " * element)

Hello 
Hello Hello 
Hello Hello Hello 
Hello Hello Hello Hello

However, we see there are some differences, for example:

In [9]:

my_array.append(4)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 my_array.append(4)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

For NumPy arrays it is generally expected that you will not change the size of an array once it has been defined, and the way arrays are stored in memory would make such resize operations inefficient. Python lists, on the other hand, can be efficiently appended to, joined and split. However, you gain a lot of functionality in return for this limitation.
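If a longer array is needed, one option is to build a new array from the old one, for example with np.concatenate. The sketch below recreates my_array and shows that the original array is left unchanged. Each such call copies all of the data, so repeatedly growing an array this way inside a loop is likely to be slow.

```python
import numpy as np

my_array = np.array([0, 1, 2, 3, 4])

# np.concatenate returns a new, longer array rather than resizing in place
extended = np.concatenate([my_array, [4]])

print(extended)        # the new array with the extra element
print(my_array.shape)  # the original array is unchanged
```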

Array creation routines¶

As well as np.array, NumPy has various other routines available for creating arrays. For example, np.arange provides an array equivalent of the built-in range function, with start, stop and step arguments that have the same semantics as for range. When called with integer arguments, np.arange returns the same array as passing the corresponding range call to np.array. For example

In [10]:

np.arange(0, 10)

Out[10]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [11]:

np.array(range(0, 10))

Out[11]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [12]:

np.arange(1, 5, 2)

Out[12]:

array([1, 3])

In [13]:

np.array(range(1, 5, 2))

Out[13]:

array([1, 3])

Unlike range, np.arange can also be used with non-integer arguments, for example

In [14]:

np.arange(0.0, 0.5, 0.1)

Out[14]:

array([0. , 0.1, 0.2, 0.3, 0.4])

Beware, however, that because of the limits of floating point precision, using np.arange with non-integer arguments can sometimes lead to seemingly inconsistent outputs:

In [15]:

print(np.arange(14.1, 15.1, 0.1))
print(np.arange(15.1, 16.1, 0.1))

[14.1 14.2 14.3 14.4 14.5 14.6 14.7 14.8 14.9 15. ]
[15.1 15.2 15.3 15.4 15.5 15.6 15.7 15.8 15.9 16.  16.1]

The np.linspace function is an alternative array creation routine which offers a safer approach for constructing arrays of equally spaced non-integer values. The first three arguments to np.linspace are start, stop and num. With the default keyword argument values these correspond respectively to the first value of the returned sequence, the last value of the sequence and the number of values in the sequence. Unlike the stop argument to np.arange, the stop value in np.linspace is by default an inclusive upper bound on the sequence values. The num argument explicitly states the length of the returned sequence, which prevents the inconsistencies in returned length for similar inputs encountered above with np.arange. For example, the following cell constructs an array of 11 evenly spaced floating point values between 15.1 and 16.1 inclusive

In [16]:

np.linspace(15.1, 16.1, 11)

Out[16]:

array([15.1, 15.2, 15.3, 15.4, 15.5, 15.6, 15.7, 15.8, 15.9, 16. , 16.1])

np.linspace also accepts an optional boolean keyword argument endpoint which can be used to specify whether the stop argument corresponds to the last sample; if False then the first num of the num + 1 evenly spaced samples between start and stop (inclusive) are returned. For example

In [17]:

np.linspace(15.1, 16.1, 10, endpoint=False)

Out[17]:

array([15.1, 15.2, 15.3, 15.4, 15.5, 15.6, 15.7, 15.8, 15.9, 16. ])
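As a small sketch of how endpoint affects the spacing, the optional retstep keyword argument of np.linspace can be used to also return the step between samples. The values chosen here are arbitrary.

```python
import numpy as np

# With the default endpoint=True, 5 samples span [0, 1] in 4 equal steps
closed, closed_step = np.linspace(0.0, 1.0, 5, retstep=True)

# With endpoint=False, 4 samples span [0, 1) in steps of (stop - start) / num
half_open, half_open_step = np.linspace(0.0, 1.0, 4, endpoint=False, retstep=True)

print(closed, closed_step)
print(half_open, half_open_step)
```

Both calls use the same step, but only the first includes the stop value.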

NumPy also provides routines for constructing arrays with all one elements (np.ones), all zero elements (np.zeros) and all elements equal to a specified value (np.full)

In [18]:

np.ones(shape=5)

Out[18]:

array([1., 1., 1., 1., 1.])

In [19]:

np.zeros(shape=10)

Out[19]:

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [20]:

np.full(shape=100, fill_value=42)

Out[20]:

array([42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
       42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
       42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
       42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
       42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
       42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42])

The np.empty function can be used to construct an array in which the array memory is left uninitialised. While this can potentially be slightly cheaper than initialising arrays with a defined value, care needs to be taken not to use the uninitialised values, as these will depend on whatever was stored in the memory previously (and may not even evaluate to a valid number)

In [21]:

np.empty(50)

Out[21]:

array([4.65049051e-310, 0.00000000e+000, 0.00000000e+000, 6.92498816e-310,
       2.27270197e-322, 6.92498786e-310, 6.92498786e-310, 6.92498821e-310,
       6.92498819e-310, 6.92498821e-310, 6.92498821e-310, 6.92498821e-310,
       6.92498820e-310, 6.92498783e-310, 6.92498821e-310, 6.92498821e-310,
       6.92498820e-310, 6.92498821e-310, 6.92498786e-310, 6.92498821e-310,
       6.92498821e-310, 6.92498821e-310, 6.92498821e-310, 6.92498820e-310,
       6.92498820e-310, 6.92498820e-310, 6.92498818e-310, 6.92498821e-310,
       6.92498821e-310, 6.92498786e-310, 6.92498821e-310, 6.92498786e-310,
       6.92498786e-310, 6.92498821e-310, 6.92498821e-310, 6.92498819e-310,
       6.92498819e-310, 6.92498821e-310, 6.92498786e-310, 6.92498821e-310,
       6.92498821e-310, 6.92498786e-310, 6.92498821e-310, 6.92498821e-310,
       6.92498820e-310, 6.92498819e-310, 6.92498821e-310, 6.92498820e-310,
       6.92498821e-310, 6.92498821e-310])

Array data types¶

A Python list can contain data of mixed type:

In [22]:

x = ['hello', 2, 3.4, True]
for el in x:
    print(type(el))

<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>

In most cases all the elements of a NumPy array will be of the same type. The data type of an array can be specified by the dtype argument to np.array (as well as to other array creation routines such as np.arange and np.linspace). If omitted the default is to use the 'minimum' (least generic) type required to hold the objects in the (nested) sequence, with all the objects cast to this type. The results of this type conversion can sometimes be non-intuitive. For example, in the following

In [23]:

for el in np.array(x):
    print(type(el))

<class 'numpy.str_'>
<class 'numpy.str_'>
<class 'numpy.str_'>
<class 'numpy.str_'>

the array data type has been automatically set to the numpy.str_ string type as the other integer, float and bool objects can all be represented as strings. In contrast if we repeat the same code snippet but exclude the first string entry in the list x

In [24]:

for el in np.array(x[1:]):
    print(type(el))

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>

we see that all the array elements are now numpy.float64 double-precision floating point values, as both integer and bool values can be represented as floats. The main takeaway here is that when constructing NumPy arrays with np.array it is generally advisable either to always use (nested) sequences containing objects of a uniform type or to explicitly specify the data type using the dtype argument, to ensure the constructed array has the data type you expect!

An important exception to the rule-of-thumb that all elements in NumPy arrays are of the same type is the NumPy array with an object datatype. This is the 'catch-all' data type used when no other type can represent all the objects in the nested sequence passed to np.array or if this data type is explicitly specified via the dtype argument. In this case the array stores only references to the objects and when the array elements are accessed the original objects are recovered:

In [25]:

for el in np.array(x, dtype=object):
    print(type(el))

<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>

While this flexibility allows arrays to hold objects of any type, as we will see most of NumPy's performance gains arise from using arrays with specific data types that allow using efficient compiled code to implement computations.

Arrays have a dtype attribute which specifies their data type:

In [26]:

x = np.array([2., 3.4, 7.2, 0.])
x.dtype

Out[26]:

dtype('float64')

NumPy supports a wide range of numeric data types of varying precision levels. The type code in the data type string representation typically consists of the combination of a primitive type and integer specifying the bit width of the value.
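As an illustration, the following sketch inspects a few of these data types using np.dtype. The selection of type names is arbitrary; kind is a one-character code for the primitive type and itemsize gives the size of one element in bytes.

```python
import numpy as np

for name in ("int8", "uint16", "int64", "float32", "float64", "complex128"):
    data_type = np.dtype(name)
    # itemsize is in bytes, so multiply by 8 to get the bit width
    print(name, data_type.kind, data_type.itemsize * 8)
```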

NumPy will also convert Python type names to corresponding data types:

In [27]:

int_array = np.array(x, dtype=int)

In [28]:

float_array = np.array(x, dtype=float)

In [29]:

int_array

Out[29]:

array([2, 3, 7, 0])

In [30]:

float_array

Out[30]:

array([2. , 3.4, 7.2, 0. ])

In [31]:

int_array.dtype

Out[31]:

dtype('int64')

In [32]:

float_array.dtype

Out[32]:

dtype('float64')
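Note that converting floating point values to integers discards the fractional part rather than rounding to the nearest integer. The sketch below uses the ndarray.astype method, which returns a converted copy of an existing array, and np.round to round before converting. The example values are arbitrary.

```python
import numpy as np

values = np.array([2.0, 3.4, 7.9, -1.7])

truncated = values.astype(int)           # fractional part is dropped
rounded = np.round(values).astype(int)   # round to nearest first, then convert

print(truncated)
print(rounded)
```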

Elementwise operations and scalar broadcasting¶

Most arithmetic operations can be applied directly to NumPy arrays, with the elementwise interpretation of the operations corresponding to what we would intuitively expect from the corresponding mathematical notation.

In [33]:

my_array = np.arange(5)
my_array + my_array

Out[33]:

array([0, 2, 4, 6, 8])

In [34]:

my_array * my_array

Out[34]:

array([ 0,  1,  4,  9, 16])

We can also use unary operations, for example

In [35]:

-my_array

Out[35]:

array([ 0, -1, -2, -3, -4])

As well as binary operations between arrays of the same shape as above, we can also apply binary operations to mixes of arrays and scalars. A binary operation involving a scalar and an array is broadcast: the scalar is treated as if it were an array of the same shape as the array operand, and the operation is then performed elementwise. This again gives compact expressions which correspond with how we would typically interpret such expressions in mathematical notation

In [36]:

my_array * 2

Out[36]:

array([0, 2, 4, 6, 8])

In [37]:

my_array + 1

Out[37]:

array([1, 2, 3, 4, 5])

In [38]:

2 ** my_array

Out[38]:

array([ 1,  2,  4,  8, 16])

These vectorised operations are very fast. For example, we can use the %timeit magic to compare the time taken to use a list comprehension to compute the squares of the first 10 000 integers:

In [39]:

%timeit [x**2 for x in range(10_000)]

538 μs ± 3.86 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

with the time taken to compute the corresponding array of squared integers using NumPy:

In [40]:

%timeit np.arange(10_000)**2

7.13 μs ± 70.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Note that this speed-up is a consequence of all of the array elements being known in advance to be of a specific data type, which enables the use of efficient compiled loops in NumPy's backend when computing the operation. The performance advantage is lost when using arrays with the catch-all object data type, as in this case NumPy has to revert to looping over the array elements in Python:

In [41]:

%timeit np.arange(10_000, dtype=object)**2

500 μs ± 710 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

NumPy mathematical functions¶

NumPy comes with vectorised versions of common mathematical functions which work elementwise when applied to arrays.

Compared to the list comprehensions used previously, these significantly simplify the process of plotting functions using Matplotlib:

In [42]:

from matplotlib import pyplot as plt
x = np.linspace(-3, 3, 100)
for func in (np.sin, np.cos, np.tanh, np.arctan, np.absolute):
    plt.plot(x, func(x), label=func.__name__)
plt.legend()

Out[42]:

<matplotlib.legend.Legend at 0x7f7a487ee8a0>

[Figure: line plots of sin, cos, tanh, arctan and absolute for x between -3 and 3, with a legend]

Multi-dimensional arrays¶

A particularly powerful feature of NumPy is its ability to handle arrays of (almost) arbitrary dimension. For example to create a three dimensional array of zeros we can use the following

In [43]:

np.zeros(shape=(3, 4, 2))  # or equivalently np.zeros((3, 4, 2))

Out[43]:

array([[[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]]])

Unlike nested lists in Python, we can change the shape of NumPy arrays using the np.reshape function (or ndarray.reshape method), provided the total number of elements in the new shape matches that of the old shape. For example

In [44]:

x = np.arange(12).reshape((3, 4))
x

Out[44]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
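Two details of reshape are worth a quick sketch: one entry of the new shape can be given as -1, in which case NumPy infers it from the total number of elements, and a shape with a mismatched number of elements raises a ValueError.

```python
import numpy as np

x = np.arange(12)

# -1 asks NumPy to infer this dimension: 12 elements / 3 rows = 4 columns
inferred = x.reshape((3, -1))
print(inferred.shape)

# 5 * 3 = 15 elements does not match the 12 available, so this fails
try:
    x.reshape((5, 3))
except ValueError as error:
    print("reshape failed:", error)
```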

We can also reorder the dimensions of arrays using the np.transpose function or corresponding ndarray.transpose method. By default this reverses the order of the axes (dimensions)

In [45]:

x.transpose()

Out[45]:

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

The shorthand ndarray.T property can also be used to access the result of calling the transpose method with its default arguments

In [46]:

x.T

Out[46]:

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

We can also pass in a specific permutation of the axes indices to transpose to get different reorderings, for example

In [47]:

y = np.arange(24).reshape((3, 4, 2))
y.transpose((0, 2, 1))

Out[47]:

array([[[ 0,  2,  4,  6],
        [ 1,  3,  5,  7]],

       [[ 8, 10, 12, 14],
        [ 9, 11, 13, 15]],

       [[16, 18, 20, 22],
        [17, 19, 21, 23]]])

The shape of a particular array can always be accessed using the ndarray.shape attribute

In [48]:

y.shape

Out[48]:

(3, 4, 2)

In [49]:

y.transpose((0, 2, 1)).shape

Out[49]:

(3, 2, 4)

The total number of dimensions (axes) of an array can be accessed using the ndarray.ndim attribute

In [50]:

y.ndim

Out[50]:

3

Array indexing and slicing¶

A multidimensional array accepts a comma-delimited sequence of integer indices or index ranges enclosed in square brackets, of up to the number of array dimensions. For an array with n dimensions if n integer indices are specified then the result is a scalar value of the array's data type

In [51]:

x = np.arange(40).reshape([4, 5, 2])
x[2, 1, 0]

Out[51]:

np.int64(22)

If we pass m < n indices, then the indices are used to select a slice of the array corresponding to using the specified indices to select the first m dimensions and selecting all of the remaining n - m dimensions

In [52]:

x[2, 1]

Out[52]:

array([22, 23])

In [53]:

x[2]

Out[53]:

array([[20, 21],
       [22, 23],
       [24, 25],
       [26, 27],
       [28, 29]])

Similar to lists, NumPy arrays also support an extended indexing syntax to support selecting portions of a particular dimension; somewhat confusingly the term slice is used to refer both to the outcome of selecting a portion of a sequence and the object and associated syntactic sugar used to select such portions. Passing a colon separated range start:stop:step as the index for an array dimension, will select the elements for which this dimension's index is in the corresponding range object range(start, stop, step) for example

In [54]:

np.arange(10)[1:10:2]

Out[54]:

array([1, 3, 5, 7, 9])

As for lists the step component of the range can be omitted; in this case the second colon can also be left out.

In [55]:

np.arange(10)[1:3]

Out[55]:

array([1, 2])

If the start argument is omitted it is implicitly assumed to be zero, that is to start from the beginning of the dimension. If the stop argument is omitted it is implicitly assumed to be equal to the length of that dimension, that is to continue through the final index along that dimension. Combining these rules together means that, for example, a plain colon : will be interpreted as referring to all indices in that dimension

In [56]:

np.arange(10)[:]

Out[56]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
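The correspondence with range described above can be checked directly. This sketch compares a few slices against the equivalent range objects; the particular slices are arbitrary, and the check relies on the array values being equal to their own indices.

```python
import numpy as np

values = np.arange(10)

# start:stop:step selects the indices in range(start, stop, step)
assert values[1:10:2].tolist() == list(range(1, 10, 2))

# An omitted start defaults to 0 and an omitted stop to the length
assert values[:3].tolist() == list(range(0, 3))
assert values[7:].tolist() == list(range(7, 10))
assert values[::3].tolist() == list(range(0, 10, 3))
```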

For multidimensional arrays we can slice along multiple dimensions simultaneously

In [57]:

x[2:, :1, 0]

Out[57]:

array([[20],
       [30]])

As for lists, the built-in slice object can be used to programmatically define slices; for example, the following is equivalent to the slicing in the above cell

In [58]:

x[slice(2, None), slice(None, 1), 0]

Out[58]:

array([[20],
       [30]])

Reduction operations¶

NumPy provides various functions and associated ndarray methods which apply an operation along one or more array axes, resulting in an output of reduced dimension. The reduction operations include

sum, prod: compute the sum or product along one or more dimensions,
min, max: compute the minimum or maximum along one or more dimensions,
argmin, argmax: compute the index of the minimum or maximum along a single dimension, or over the flattened array if no axis is given,
mean, std: compute the empirical mean or standard deviation along one or more dimensions.

All of these operations include both a functional form available in the numpy module namespace, and a corresponding ndarray method. The interface of each ndarray method matches that of the function, other than the first positional array argument being set to the array the method is called on; for example np.sum(x) and x.sum() will give equivalent results.

In [59]:

x = np.arange(12).reshape(2, 2, 3)
x

Out[59]:

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])

In [60]:

np.sum(x)  # Sums along all axes

Out[60]:

np.int64(66)

In [61]:

x.sum() # Also sums along all axes

Out[61]:

np.int64(66)

All the reduction operations accept an optional axis argument which specifies the array axis (dimension) or axes to apply the reduction along. This defaults to None corresponding to applying along all axes, with the returned output then a scalar.

In [62]:

x.sum(axis=None)  # Also sums along all axes

Out[62]:

np.int64(66)

If axis is set to an integer (corresponding to a valid axis index for the array) then the reduction will be applied only along that axis, resulting in a returned output of dimension one less than the original array.

In [63]:

x.sum(axis=0)  # Sums along the first axis

Out[63]:

array([[ 6,  8, 10],
       [12, 14, 16]])

In [64]:

x.sum(axis=1)  # Sums along the second axis

Out[64]:

array([[ 3,  5,  7],
       [15, 17, 19]])

In [65]:

x.sum(axis=2)  # Sums along the third axis

Out[65]:

array([[ 3, 12],
       [21, 30]])

If axis is set to a tuple of integer axis indices, the reduction is applied along all the corresponding axes, with the returned output then of dimension equal to the original array dimension minus the length of the axis index tuple.

In [66]:

x.sum(axis=(0, 1)) # Sums along the first and second axes

Out[66]:

array([18, 22, 26])

In [67]:

x.sum(axis=(0, 1, 2))  # Also sums along all axes

Out[67]:

np.int64(66)
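The other reductions listed above follow the same pattern. The sketch below applies a few of them to the same array x. np.unravel_index is used here to convert the flat index returned by argmax (when no axis is given) back into one index per axis.

```python
import numpy as np

x = np.arange(12).reshape(2, 2, 3)

print(x.max(axis=0))          # maximum over the first axis, shape (2, 3)
print(x.mean(axis=(0, 1)))    # mean over the first two axes, shape (3,)
print(x.argmax(axis=2))       # position of the maximum along the last axis

# Without an axis, argmax returns an index into the flattened array
flat_index = x.argmax()
print(np.unravel_index(flat_index, x.shape))
```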

Advanced broadcasting¶

We earlier encountered the concept of broadcasting in the context of binary operations on mixes of scalars and arrays. A very powerful feature of NumPy is that broadcasting is also extended to apply to operations involving arrays with differing but compatible shapes. This allows us to broadcast a smaller array across a larger array without needlessly repeating the data in the smaller array.

NumPy's binary operations are usually applied elementwise on pairs of arrays, with the simplest case being when the arrays have exactly matching shapes

In [68]:

np.arange(0, 5) * np.arange(5, 10)

Out[68]:

array([ 0,  6, 14, 24, 36])

In [69]:

np.ones((3, 2)) + np.zeros((3, 2))

Out[69]:

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

If we apply binary operations to arrays with non-matching shapes we will typically get an error

In [70]:

np.arange(5) * np.arange(6)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[70], line 1
----> 1 np.arange(5) * np.arange(6)

ValueError: operands could not be broadcast together with shapes (5,) (6,)

In [71]:

np.ones((2, 3)) * np.zeros((2, 4))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[71], line 1
----> 1 np.ones((2, 3)) * np.zeros((2, 4))

ValueError: operands could not be broadcast together with shapes (2,3) (2,4)

However, the condition that array shapes exactly match is relaxed to allow operations to be performed on pairs of arrays for which the shapes are compatible under certain rules. The shapes of two arguments to a binary operation are considered compatible if, working from the rightmost dimension leftwards, all dimensions which are defined for both arrays are either equal or one of them is equal to one. For any dimension in which only one of the arrays has size one, that array is treated as if its single element along that dimension was repeated a number of times equal to the size of the corresponding dimension of the other array.

This provides a convenient way for example to perform an outer product operation on vectors (one dimensional arrays)

In [72]:

np.arange(5).reshape(5, 1) * np.arange(5, 10).reshape(1, 5)

Out[72]:

array([[ 0,  0,  0,  0,  0],
       [ 5,  6,  7,  8,  9],
       [10, 12, 14, 16, 18],
       [15, 18, 21, 24, 27],
       [20, 24, 28, 32, 36]])

Importantly, arrays do not need to have the same number of dimensions to have compatible shapes, provided the rightmost dimensions of the array with more dimensions are compatible with the shape of the array with fewer dimensions; the missing leftmost dimensions of the latter are treated as if they were of size one. For example

In [73]:

np.arange(6).reshape(3, 2) + np.arange(2)

Out[73]:

array([[0, 2],
       [2, 4],
       [4, 6]])
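To experiment with these rules without building arrays, the np.broadcast_shapes function (available in NumPy 1.20 and later) returns the shape that a set of input shapes would broadcast to, or raises a ValueError if they are incompatible. A brief sketch, with arbitrarily chosen shapes:

```python
import numpy as np

# Size-one dimensions are stretched to match the other array
print(np.broadcast_shapes((5, 1), (1, 5)))

# Missing leading dimensions are treated as size one
print(np.broadcast_shapes((3, 2), (2,)))

# Trailing dimensions 3 and 4 differ and neither is one, so this fails
try:
    np.broadcast_shapes((2, 3), (2, 4))
except ValueError as error:
    print("incompatible:", error)
```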

For a more complete description of NumPy's broadcasting rules including some helpful visualisation see this article in the official documentation.

Adding new axes¶

Broadcasting is very powerful, and NumPy allows indexing with np.newaxis to temporarily create new length one dimensions on the fly, rather than explicitly calling reshape.

In [74]:

x = np.arange(10).reshape(2, 5)
y = np.arange(8).reshape(2, 2, 2)

In [75]:

x.reshape(2, 5, 1, 1).shape

Out[75]:

(2, 5, 1, 1)

In [76]:

x[:, :, np.newaxis, np.newaxis].shape

Out[76]:

(2, 5, 1, 1)

In [77]:

y[:, np.newaxis, :, :].shape

Out[77]:

(2, 1, 2, 2)

In [78]:

res = x[:, :, np.newaxis, np.newaxis] * y[:, np.newaxis, :, :]
res.shape

Out[78]:

(2, 5, 2, 2)

This is particularly useful when performing outer product type operations

In [79]:

x = np.arange(5)
y = np.arange(5, 10)
x[:, np.newaxis] * y[np.newaxis, :]

Out[79]:

array([[ 0,  0,  0,  0,  0],
       [ 5,  6,  7,  8,  9],
       [10, 12, 14, 16, 18],
       [15, 18, 21, 24, 27],
       [20, 24, 28, 32, 36]])

Note that newaxis works because an array with extra length one dimensions has the same overall size (and so can be a view to the same underlying data) just with a different shape. In other words, a $3 \times 1 \times 3$ and a $3 \times 3$ array contain the same data, differently shaped:

In [80]:

three_by_three = np.arange(9).reshape(3, 3)
three_by_three

Out[80]:

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [81]:

three_by_three[:, np.newaxis, :]

Out[81]:

array([[[0, 1, 2]],

       [[3, 4, 5]],

       [[6, 7, 8]]])
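One way to check this is with np.shares_memory, which reports whether two arrays overlap in memory. The sketch below also shows that np.newaxis is simply an alias for None, so either can be used when indexing.

```python
import numpy as np

three_by_three = np.arange(9).reshape(3, 3)
expanded = three_by_three[:, np.newaxis, :]

print(expanded.shape)                               # (3, 1, 3)
print(np.shares_memory(three_by_three, expanded))   # same underlying data
print(np.newaxis is None)                           # newaxis is an alias for None
```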

Matrix multiplications¶

NumPy interprets the standard * operator as elementwise multiplication

In [82]:

a = np.arange(9).reshape(3, 3)
b = np.arange(3, 12).reshape(3, 3)
a * b

Out[82]:

array([[ 0,  4, 10],
       [18, 28, 40],
       [54, 70, 88]])

To perform matrix multiplication we use the @ operator instead

In [83]:

a @ b

Out[83]:

array([[ 24,  27,  30],
       [ 78,  90, 102],
       [132, 153, 174]])

As well as matrix-matrix products (that is products of pairs of two dimensional arrays) this can also be used for matrix-vector products (products of a two dimensional array and one dimensional array) and vector-vector products (products of pairs of one dimensional arrays)

In [84]:

v = np.arange(3)
a @ v

Out[84]:

array([ 5, 14, 23])

In [85]:

v @ v

Out[85]:

np.int64(5)

We can alternatively use a built in function:

In [86]:

np.dot(a, b)

Out[86]:

array([[ 24,  27,  30],
       [ 78,  90, 102],
       [132, 153, 174]])

It is also possible to express matrix multiplication using broadcasting and newaxis:

In [87]:

a[:, :, np.newaxis].shape

Out[87]:

(3, 3, 1)

In [88]:

b[np.newaxis, :, :].shape

Out[88]:

(1, 3, 3)

In [89]:

a[:, :, np.newaxis] * b[np.newaxis, :, :]

Out[89]:

array([[[ 0,  0,  0],
        [ 6,  7,  8],
        [18, 20, 22]],

       [[ 9, 12, 15],
        [24, 28, 32],
        [45, 50, 55]],

       [[18, 24, 30],
        [42, 49, 56],
        [72, 80, 88]]])

In [90]:

(a[:, :, np.newaxis] * b[np.newaxis, :, :]).sum(1)

Out[90]:

array([[ 24,  27,  30],
       [ 78,  90, 102],
       [132, 153, 174]])

Or if you prefer:

In [91]:

(a.reshape(3, 3, 1) * b.reshape(1, 3, 3)).sum(1)

Out[91]:

array([[ 24,  27,  30],
       [ 78,  90, 102],
       [132, 153, 174]])

We use broadcasting to generate $A_{ij}B_{jk}$ as a three-dimensional array:

In [92]:

a.reshape(3, 3, 1) * b.reshape(1, 3, 3)

Out[92]:

array([[[ 0,  0,  0],
        [ 6,  7,  8],
        [18, 20, 22]],

       [[ 9, 12, 15],
        [24, 28, 32],
        [45, 50, 55]],

       [[18, 24, 30],
        [42, 49, 56],
        [72, 80, 88]]])

Then we sum over the middle, $j$ axis (which is axis 1 of the three axes numbered 0, 1, 2) of this three-dimensional array. Thus we generate $\sum_j A_{ij}B_{jk}$.

We can see that the broadcasting concept gives us a powerful and efficient way to express many linear algebra operations computationally.
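The same index notation can be written almost verbatim using np.einsum, which takes a subscript string describing the input and output indices and sums over any index missing from the output. This is offered as an alternative sketch rather than a replacement for the @ operator.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = np.arange(3, 12).reshape(3, 3)

# "ij,jk->ik": multiply A_ij by B_jk and sum over the repeated index j
product = np.einsum("ij,jk->ik", a, b)

print(np.array_equal(product, a @ b))
```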

Structured arrays¶

So far we have encountered arrays with 'simple' data types corresponding to a single type. NumPy also offers arrays with structured data types, sometimes termed record arrays, for which each array element is composed of several fields, with each field having the same data type across all array elements. These are a special array structure designed to match the comma-separated values (CSV) record and field model. We saw this when we looked at CSV files:

In [93]:

x = np.arange(50).reshape((10, 5))

In [94]:

record_x = x.view(
    dtype={
        'names': ["col1", "col2", "another", "more", "last"], 
        'formats': [int] * 5 
    } 
)

In [95]:

record_x

Out[95]:

array([[( 0,  1,  2,  3,  4)],
       [( 5,  6,  7,  8,  9)],
       [(10, 11, 12, 13, 14)],
       [(15, 16, 17, 18, 19)],
       [(20, 21, 22, 23, 24)],
       [(25, 26, 27, 28, 29)],
       [(30, 31, 32, 33, 34)],
       [(35, 36, 37, 38, 39)],
       [(40, 41, 42, 43, 44)],
       [(45, 46, 47, 48, 49)]],
      dtype=[('col1', '<i8'), ('col2', '<i8'), ('another', '<i8'), ('more', '<i8'), ('last', '<i8')])

Record arrays can be addressed with field names as if they were a dictionary:

In [96]:

record_x['col1']

Out[96]:

array([[ 0],
       [ 5],
       [10],
       [15],
       [20],
       [25],
       [30],
       [35],
       [40],
       [45]])
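Structured arrays can also be created directly, and the fields do not all need to share a data type. The following sketch uses made-up field names and values: each element is given as a tuple, and the dtype is a list of (name, type) pairs.

```python
import numpy as np

# Hypothetical records with a string, an integer and a float field
measurements = np.array(
    [("a", 1, 0.5), ("b", 2, 1.5), ("c", 3, 2.5)],
    dtype=[("label", "U1"), ("count", "i8"), ("value", "f8")],
)

print(measurements["value"])         # one field across all records
print(measurements[1])               # one record with all of its fields
print(measurements["count"].sum())   # fields behave like ordinary arrays
```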

Comparison operators and boolean indexing¶

NumPy defines comparison operators like == and < to apply to arrays elementwise and also to broadcast in the same way as arithmetic operations

In [97]:

x = np.arange(-1, 2)[:, np.newaxis] * np.arange(-2, 2)[np.newaxis, :]
x

Out[97]:

array([[ 2,  1,  0, -1],
       [ 0,  0,  0,  0],
       [-2, -1,  0,  1]])

In [98]:

is_zero = (x == 0)
is_zero

Out[98]:

array([[False, False,  True, False],
       [ True,  True,  True,  True],
       [False, False,  True, False]])

Boolean arrays can also be used to filter the elements of an array matching some condition. For example

In [99]:

x[is_zero]

Out[99]:

array([0, 0, 0, 0, 0, 0])

We can use the unary negation operator ~ to negate conditions

In [100]:

x[~is_zero]

Out[100]:

array([ 2,  1, -1, -2, -1,  1])

Although when used to get items from an array, boolean indexing results in a new one dimensional array, if the boolean indexing is instead part of an assignment statement the selected elements of the array are changed in place

In [101]:

x[is_zero] = 5

In [102]:

x

Out[102]:

array([[ 2,  1,  5, -1],
       [ 5,  5,  5,  5],
       [-2, -1,  5,  1]])
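Conditions can be combined elementwise with the & (and) and | (or) operators. Because these operators bind more tightly than comparisons, each comparison needs its own parentheses. A short sketch, rebuilding the array x as it was before the assignment above:

```python
import numpy as np

x = np.arange(-1, 2)[:, np.newaxis] * np.arange(-2, 2)[np.newaxis, :]

# Elements that are non-zero and greater than -2
print(x[(x != 0) & (x > -2)])

# Elements that are less than -1 or greater than 1
print(x[(x < -1) | (x > 1)])
```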

For more details about boolean array indexing see the official documentation.

Copies and views¶

Care needs to be taken when assigning to slices of a NumPy array

In [103]:

x = np.arange(6).reshape((3, 2))
x

Out[103]:

array([[0, 1],
       [2, 3],
       [4, 5]])

In [104]:

y = x[0, :]
y[1] = -99
x

Out[104]:

array([[  0, -99],
       [  2,   3],
       [  4,   5]])

In general NumPy will try to return views to the same underlying array data buffer when performing indexing and slicing operations, where possible. These views share the same underlying data in memory with the original array, and so make such indexing operations cheap in both memory usage and computational cost (by avoiding unnecessary copies). As we saw above, however, this means that if we assign to a slice we will also update the original array. We can use the np.copy function or corresponding ndarray.copy method to force creation of an array referencing a new copy of the underlying data

In [105]:

x = np.arange(6).reshape((3, 2))
y = x[0, :].copy()
y[1] = -99
x

Out[105]:

array([[0, 1],
       [2, 3],
       [4, 5]])
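Not every indexing operation can return a view. As noted earlier, boolean indexing produces a new array, so modifying its result leaves the original untouched, whereas modifying a slice does not. A minimal sketch of the difference:

```python
import numpy as np

x = np.arange(6).reshape((3, 2))

first_row = x[0, :]     # slicing returns a view
large = x[x > 2]        # boolean indexing returns a copy

first_row[0] = -1       # writes through to x
large[0] = -99          # only changes the copy

print(x)
print(large)
```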

More details are given in the official documentation.