XClose

COMP0233: Research Software Engineering With Python

Home
Menu

NumPy

The Scientific Python Trilogy

Why is Python so popular for research work?

MATLAB has typically been the most popular "language of technical computing", with strong built-in support for efficient numerical analysis with matrices (the mat in MATLAB is for Matrix, not Maths), and plotting.

Other dynamic languages have cleaner, more logical syntax (Ruby, Haskell)

But Python users developed three critical libraries, matching the power of MATLAB for scientific work:

By combining a plotting library, a matrix maths library, and an easy-to-use interface allowing live plotting commands in a persistent environment, the powerful capabilities of MATLAB were matched by a free and open toolchain.

We've learned about Matplotlib and IPython in this course already. NumPy is the last part of the trilogy.

Limitations of Python Lists

The normal Python list is just one dimensional. To make a matrix, we have to nest Python lists:

In [1]:
x = [list(range(5)) for N in range(5)]
In [2]:
x
Out[2]:
[[0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4]]
In [3]:
x[2][2]
Out[3]:
2

Applying an operation to every element is a pain:

In [4]:
x + 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 x + 5

TypeError: can only concatenate list (not "int") to list
In [5]:
[[elem + 5 for elem in row] for row in x]
Out[5]:
[[5, 6, 7, 8, 9],
 [5, 6, 7, 8, 9],
 [5, 6, 7, 8, 9],
 [5, 6, 7, 8, 9],
 [5, 6, 7, 8, 9]]

Common useful operations like transposing a matrix or reshaping a 10 by 10 matrix into a 20 by 5 matrix are not easy to code in raw Python lists.

The NumPy array

NumPy's array type represents a multidimensional matrix $M_{i,j,k...n}$

The NumPy array seems at first to be just like a list. For example, we can index it and iterate over it:

In [6]:
import numpy as np
my_array = np.array(range(5))
In [7]:
my_array
Out[7]:
array([0, 1, 2, 3, 4])
In [8]:
my_array[2]
Out[8]:
2
In [9]:
for element in my_array:
    print("Hello" * element)
Hello
HelloHello
HelloHelloHello
HelloHelloHelloHello

We can also see our first weakness of NumPy arrays versus Python lists:

In [10]:
my_array.append(4)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 my_array.append(4)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

For NumPy arrays, you typically don't change the data size once you've defined your array, whereas for Python lists, you can do this efficiently. However, you get back lots of goodies in return...

Elementwise Operations

Most operations can be applied element-wise automatically!

In [11]:
my_array + 2
Out[11]:
array([2, 3, 4, 5, 6])

These "vectorized" operations are very fast: (the %%timeit magic reports how long it takes to run a cell; there is more information available if interested)

In [12]:
import numpy as np
big_list = range(10000)
big_array = np.arange(10000)
In [13]:
%%timeit
[x**2 for x in big_list]
3.04 ms ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]:
%%timeit
big_array**2
3.21 µs ± 67.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

arange and linspace

NumPy has two methods for quickly defining evenly-spaced arrays of (floating-point) numbers. These can be useful, for example, in plotting.

The first method is arange:

In [15]:
x = np.arange(0, 10, 0.1)  # Start, stop, step size

This is similar to Python's range, although note that we can't use non-integer steps with the latter!

In [16]:
y = list(range(0, 10, 0.1))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 y = list(range(0, 10, 0.1))

TypeError: 'float' object cannot be interpreted as an integer

The second method is linspace:

In [17]:
import math
values = np.linspace(0, math.pi, 100)  # Start, stop, number of steps
In [18]:
values
Out[18]:
array([0.        , 0.03173326, 0.06346652, 0.09519978, 0.12693304,
       0.1586663 , 0.19039955, 0.22213281, 0.25386607, 0.28559933,
       0.31733259, 0.34906585, 0.38079911, 0.41253237, 0.44426563,
       0.47599889, 0.50773215, 0.53946541, 0.57119866, 0.60293192,
       0.63466518, 0.66639844, 0.6981317 , 0.72986496, 0.76159822,
       0.79333148, 0.82506474, 0.856798  , 0.88853126, 0.92026451,
       0.95199777, 0.98373103, 1.01546429, 1.04719755, 1.07893081,
       1.11066407, 1.14239733, 1.17413059, 1.20586385, 1.23759711,
       1.26933037, 1.30106362, 1.33279688, 1.36453014, 1.3962634 ,
       1.42799666, 1.45972992, 1.49146318, 1.52319644, 1.5549297 ,
       1.58666296, 1.61839622, 1.65012947, 1.68186273, 1.71359599,
       1.74532925, 1.77706251, 1.80879577, 1.84052903, 1.87226229,
       1.90399555, 1.93572881, 1.96746207, 1.99919533, 2.03092858,
       2.06266184, 2.0943951 , 2.12612836, 2.15786162, 2.18959488,
       2.22132814, 2.2530614 , 2.28479466, 2.31652792, 2.34826118,
       2.37999443, 2.41172769, 2.44346095, 2.47519421, 2.50692747,
       2.53866073, 2.57039399, 2.60212725, 2.63386051, 2.66559377,
       2.69732703, 2.72906028, 2.76079354, 2.7925268 , 2.82426006,
       2.85599332, 2.88772658, 2.91945984, 2.9511931 , 2.98292636,
       3.01465962, 3.04639288, 3.07812614, 3.10985939, 3.14159265])

Regardless of the method used, the array of values that we get can be used in the same way.

In fact, NumPy comes with "vectorised" versions of common functions which work element-by-element when applied to arrays:

In [19]:
%matplotlib inline

from matplotlib import pyplot as plt
plt.plot(values, np.sin(values))
Out[19]:
[<matplotlib.lines.Line2D at 0x7f9c40252cd0>]
No description has been provided for this image

So we don't have to use awkward list comprehensions when using these.

Multi-Dimensional Arrays

NumPy's true power comes from multi-dimensional arrays:

In [20]:
np.zeros([3, 4, 2])  # 3 arrays with 4 rows and 2 columns each
Out[20]:
array([[[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]]])

Unlike a list-of-lists in Python, we can reshape arrays:

In [21]:
x = np.array(range(40))
x
Out[21]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])
In [22]:
y = x.reshape([4, 5, 2])
y
Out[22]:
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9]],

       [[10, 11],
        [12, 13],
        [14, 15],
        [16, 17],
        [18, 19]],

       [[20, 21],
        [22, 23],
        [24, 25],
        [26, 27],
        [28, 29]],

       [[30, 31],
        [32, 33],
        [34, 35],
        [36, 37],
        [38, 39]]])

And index multiple columns at once:

In [23]:
y[3, 2, 1]
Out[23]:
35

Including selecting on inner axes while taking all from the outermost:

In [24]:
y[:, 2, 1]
Out[24]:
array([ 5, 15, 25, 35])

And subselecting ranges:

In [25]:
y[2:, :1, :]  # Last 2 axes, 1st row, all columns
Out[25]:
array([[[20, 21]],

       [[30, 31]]])

And transpose arrays:

In [26]:
y.transpose()
Out[26]:
array([[[ 0, 10, 20, 30],
        [ 2, 12, 22, 32],
        [ 4, 14, 24, 34],
        [ 6, 16, 26, 36],
        [ 8, 18, 28, 38]],

       [[ 1, 11, 21, 31],
        [ 3, 13, 23, 33],
        [ 5, 15, 25, 35],
        [ 7, 17, 27, 37],
        [ 9, 19, 29, 39]]])

You can get the dimensions of an array with shape:

In [27]:
y.shape
Out[27]:
(4, 5, 2)
In [28]:
y.transpose().shape
Out[28]:
(2, 5, 4)

Some numpy functions apply by default to the whole array, but can be chosen to act only on certain axes:

In [29]:
x = np.arange(12).reshape(4,3)
x
Out[29]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [30]:
x.mean(1)  # Mean along the second axis, leaving the first.
Out[30]:
array([ 1.,  4.,  7., 10.])
In [31]:
x.mean(0)  # Mean along the first axis, leaving the second.
Out[31]:
array([4.5, 5.5, 6.5])
In [32]:
x.mean()  # mean of all axes
Out[32]:
5.5

Array Datatypes

A Python list can contain data of mixed type:

In [33]:
x = ['hello', 2, 3.4]
In [34]:
type(x[2])
Out[34]:
float
In [35]:
type(x[1])
Out[35]:
int

A NumPy array always contains just one datatype:

In [36]:
np.array(x)
Out[36]:
array(['hello', '2', '3.4'], dtype='<U32')

NumPy will choose the least-generic-possible datatype that can contain the data:

In [37]:
y = np.array([2, 3.4])
In [38]:
y
Out[38]:
array([2. , 3.4])

You can access the array's dtype, or check the type of individual elements:

In [39]:
y.dtype
Out[39]:
dtype('float64')
In [40]:
type(y[0])
Out[40]:
numpy.float64
In [41]:
z = np.array([3, 4, 5])
z
Out[41]:
array([3, 4, 5])
In [42]:
type(z[0])
Out[42]:
numpy.int64

The results are, when you get to know them, fairly obvious string codes for datatypes: NumPy supports all kinds of datatypes beyond the python basics.

NumPy will convert python type names to dtypes:

In [43]:
x = [2, 3.4, 7.2, 0]
In [44]:
int_array = np.array(x, dtype=int)
In [45]:
float_array = np.array(x, dtype=float)
In [46]:
int_array
Out[46]:
array([2, 3, 7, 0])
In [47]:
float_array
Out[47]:
array([2. , 3.4, 7.2, 0. ])
In [48]:
int_array.dtype
Out[48]:
dtype('int64')
In [49]:
float_array.dtype
Out[49]:
dtype('float64')

Broadcasting

This is another really powerful feature of NumPy.

By default, array operations are element-by-element:

In [50]:
np.arange(5) * np.arange(5)
Out[50]:
array([ 0,  1,  4,  9, 16])

If we multiply arrays with non-matching shapes we get an error:

In [51]:
np.arange(5) * np.arange(6)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[51], line 1
----> 1 np.arange(5) * np.arange(6)

ValueError: operands could not be broadcast together with shapes (5,) (6,) 
In [52]:
np.zeros([2,3]) * np.zeros([2,4])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[52], line 1
----> 1 np.zeros([2,3]) * np.zeros([2,4])

ValueError: operands could not be broadcast together with shapes (2,3) (2,4) 
In [53]:
m1 = np.arange(100).reshape([10, 10])
In [54]:
m2 = np.arange(100).reshape([10, 5, 2])
In [55]:
m1 + m2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[55], line 1
----> 1 m1 + m2

ValueError: operands could not be broadcast together with shapes (10,10) (10,5,2) 

Arrays must match in all dimensions in order to be compatible:

In [56]:
np.ones([3, 3]) * np.ones([3, 3]) # Note elementwise multiply, *not* matrix multiply.
Out[56]:
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

Except, that if one array has any Dimension 1, then the data is REPEATED to match the other.

In [57]:
col = np.arange(10).reshape([10, 1])
col
Out[57]:
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
In [58]:
row = col.transpose()
row
Out[58]:
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [59]:
col.shape # "Column Vector"
Out[59]:
(10, 1)
In [60]:
row.shape # "Row Vector"
Out[60]:
(1, 10)
In [61]:
row + col
Out[61]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12],
       [ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13],
       [ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14],
       [ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15],
       [ 7,  8,  9, 10, 11, 12, 13, 14, 15, 16],
       [ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17],
       [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]])
In [62]:
10 * row + col
Out[62]:
array([[ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
       [ 1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
       [ 2, 12, 22, 32, 42, 52, 62, 72, 82, 92],
       [ 3, 13, 23, 33, 43, 53, 63, 73, 83, 93],
       [ 4, 14, 24, 34, 44, 54, 64, 74, 84, 94],
       [ 5, 15, 25, 35, 45, 55, 65, 75, 85, 95],
       [ 6, 16, 26, 36, 46, 56, 66, 76, 86, 96],
       [ 7, 17, 27, 37, 47, 57, 67, 77, 87, 97],
       [ 8, 18, 28, 38, 48, 58, 68, 78, 88, 98],
       [ 9, 19, 29, 39, 49, 59, 69, 79, 89, 99]])

This works for arrays with more than one unit dimension.

Newaxis

Broadcasting is very powerful, and numpy allows indexing with np.newaxis to temporarily create new one-long dimensions on the fly.

In [63]:
import numpy as np
x = np.arange(10).reshape(2, 5)
y = np.arange(8).reshape(2, 2, 2)
In [64]:
x
Out[64]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [65]:
y
Out[65]:
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
In [66]:
x[:, :, np.newaxis, np.newaxis].shape
Out[66]:
(2, 5, 1, 1)
In [67]:
y[:, np.newaxis, :, :].shape
Out[67]:
(2, 1, 2, 2)
In [68]:
res = x[:, :, np.newaxis, np.newaxis] * y[:, np.newaxis, :, :]
In [69]:
res.shape
Out[69]:
(2, 5, 2, 2)
In [70]:
np.sum(res)
Out[70]:
830

Note that newaxis works because a $3 \times 1 \times 3$ array and a $3 \times 3$ array contain the same data, differently shaped:

In [71]:
threebythree = np.arange(9).reshape(3, 3)
threebythree
Out[71]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [72]:
threebythree[:, np.newaxis, :]
Out[72]:
array([[[0, 1, 2]],

       [[3, 4, 5]],

       [[6, 7, 8]]])

Dot Products

NumPy multiply is element-by-element, not a dot-product:

In [73]:
a = np.arange(9).reshape(3, 3)
a
Out[73]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [74]:
b = np.arange(3, 12).reshape(3, 3)
b
Out[74]:
array([[ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [75]:
a * b
Out[75]:
array([[ 0,  4, 10],
       [18, 28, 40],
       [54, 70, 88]])

To get a dot-product, (matrix inner product) we can use a built in function:

In [76]:
np.dot(a, b)
Out[76]:
array([[ 24,  27,  30],
       [ 78,  90, 102],
       [132, 153, 174]])

Though it is possible to represent this in the algebra of broadcasting and newaxis:

In [77]:
a[:, :, np.newaxis].shape
Out[77]:
(3, 3, 1)
In [78]:
b[np.newaxis, :, :].shape
Out[78]:
(1, 3, 3)
In [79]:
a[:, :, np.newaxis] * b[np.newaxis, :, :]
Out[79]:
array([[[ 0,  0,  0],
        [ 6,  7,  8],
        [18, 20, 22]],

       [[ 9, 12, 15],
        [24, 28, 32],
        [45, 50, 55]],

       [[18, 24, 30],
        [42, 49, 56],
        [72, 80, 88]]])
In [80]:
(a[:, :, np.newaxis] * b[np.newaxis, :, :]).sum(1)
Out[80]:
array([[ 24,  27,  30],
       [ 78,  90, 102],
       [132, 153, 174]])

Or if you prefer:

In [81]:
(a.reshape(3, 3, 1) * b.reshape(1, 3, 3)).sum(1)
Out[81]:
array([[ 24,  27,  30],
       [ 78,  90, 102],
       [132, 153, 174]])

We use broadcasting to generate $A_{ij}B_{jk}$ as a 3-d matrix:

In [82]:
a.reshape(3, 3, 1) * b.reshape(1, 3, 3)
Out[82]:
array([[[ 0,  0,  0],
        [ 6,  7,  8],
        [18, 20, 22]],

       [[ 9, 12, 15],
        [24, 28, 32],
        [45, 50, 55]],

       [[18, 24, 30],
        [42, 49, 56],
        [72, 80, 88]]])

Then we sum over the middle, $j$ axis, [which is the 1-axis of three axes numbered (0,1,2)] of this 3-d matrix. Thus we generate $\Sigma_j A_{ij}B_{jk}$.

We can see that the broadcasting concept gives us a powerful and efficient way to express many linear algebra operations computationally.

Record Arrays

These are a special array structure designed to match the CSV "Record and Field" model. It's a very different structure from the normal NumPy array, and different fields can contain different datatypes. We saw this when we looked at CSV files:

In [83]:
x = np.arange(50).reshape([10, 5])
In [84]:
record_x = x.view(dtype={'names': ["col1", "col2", "another", "more", "last"], 
                         'formats': [int]*5 })
In [85]:
record_x
Out[85]:
array([[( 0,  1,  2,  3,  4)],
       [( 5,  6,  7,  8,  9)],
       [(10, 11, 12, 13, 14)],
       [(15, 16, 17, 18, 19)],
       [(20, 21, 22, 23, 24)],
       [(25, 26, 27, 28, 29)],
       [(30, 31, 32, 33, 34)],
       [(35, 36, 37, 38, 39)],
       [(40, 41, 42, 43, 44)],
       [(45, 46, 47, 48, 49)]],
      dtype=[('col1', '<i8'), ('col2', '<i8'), ('another', '<i8'), ('more', '<i8'), ('last', '<i8')])

Record arrays can be addressed with field names like they were a dictionary:

In [86]:
record_x['col1']
Out[86]:
array([[ 0],
       [ 5],
       [10],
       [15],
       [20],
       [25],
       [30],
       [35],
       [40],
       [45]])

We've seen these already when we used NumPy's CSV parser.

Logical arrays, masking, and selection

Numpy defines operators like == and < to apply to arrays element by element:

In [87]:
x = np.zeros([3, 4])
x
Out[87]:
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
In [88]:
y = np.arange(-1, 2)[:, np.newaxis] * np.arange(-2, 2)[np.newaxis, :]
y
Out[88]:
array([[ 2,  1,  0, -1],
       [ 0,  0,  0,  0],
       [-2, -1,  0,  1]])
In [89]:
iszero = x == y
iszero
Out[89]:
array([[False, False,  True, False],
       [ True,  True,  True,  True],
       [False, False,  True, False]])

A logical array can be used to select elements from an array:

In [90]:
y[np.logical_not(iszero)]
Out[90]:
array([ 2,  1, -1, -2, -1,  1])

Although when printed, this comes out as a flat list, if assigned to, the selected elements of the array are changed!

In [91]:
y[iszero] = 5
In [92]:
y
Out[92]:
array([[ 2,  1,  5, -1],
       [ 5,  5,  5,  5],
       [-2, -1,  5,  1]])

Numpy memory

Numpy memory management can be tricksy:

In [93]:
x = np.arange(5)
y = x[:]
In [94]:
y[2] = 0
x
Out[94]:
array([0, 1, 0, 3, 4])

It does not behave like lists!

In [95]:
x = list(range(5))
y = x[:]
In [96]:
y[2] = 0
x
Out[96]:
[0, 1, 2, 3, 4]

We must use np.copy to force separate memory. Otherwise NumPy tries its hardest to make slices be views on data.

Now, this has all been very theoretical, but let's go through a practical example, and see how powerful NumPy can be.