
Centre for Advanced Research Computing
Particularly when dealing with “big data”, the size of the datasets and the resources at our disposal impose limits on what computations we can realistically do.
Even when a solution for a problem exists in theory, in practice we are bound by limited resources:
Being able to solve a problem on a small dataset doesn’t guarantee that we can solve it on a large one!
We must also consider the scaling behaviour: a dataset that is 10x larger may require much more than 10x longer to process.
How an algorithm scales with the input size \(N\) can be expressed in asymptotic ‘big-O’ notation. For example:
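As a rough illustration, we can count the basic steps taken by a single pass over the data, which is \(O(N)\), and by a comparison of every pair of items, which is \(O(N^2)\). The two functions below are hypothetical examples chosen for this sketch, not code from the lecture:

```python
def count_steps_linear(items):
    """Find the maximum with a single pass: one step per item, O(N)."""
    steps = 0
    largest = items[0]
    for value in items:
        steps += 1
        if value > largest:
            largest = value
    return steps


def count_steps_pairwise(items):
    """Visit every pair of items once: N*(N-1)/2 steps, O(N^2)."""
    steps = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            steps += 1  # one step for the pair (items[i], items[j])
    return steps


for n in (10, 100, 1000):
    data = list(range(n))
    print(n, count_steps_linear(data), count_steps_pairwise(data))
```

Each tenfold increase in \(N\) multiplies the linear step count by 10, but the pairwise step count by roughly 100.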
The scaling behaviour of an algorithm can have a significant effect on how well it can handle large input data:
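To get a feel for the numbers, the sketch below estimates runtimes for a few common scalings. It assumes a machine that performs \(10^9\) basic operations per second and ignores constant factors; both are simplifications made purely for illustration, and the names `OPS_PER_SECOND`, `scalings` and `estimated_seconds` are hypothetical.

```python
import math

OPS_PER_SECOND = 1e9  # assumed throughput, for illustration only

# Number of basic operations as a function of N for some common scalings
scalings = {
    "O(N)": lambda n: n,
    "O(N log N)": lambda n: n * math.log2(n),
    "O(N^2)": lambda n: n ** 2,
}


def estimated_seconds(scaling, n):
    """Estimated runtime in seconds under the assumed throughput."""
    return scalings[scaling](n) / OPS_PER_SECOND


for n in (10**3, 10**6, 10**9):
    row = ", ".join(
        f"{name}: {estimated_seconds(name, n):.3g} s" for name in scalings
    )
    print(f"N = {n:.0e} -> {row}")
```

Under these assumptions, an \(O(N^2)\) algorithm applied to \(N = 10^9\) items would need about \(10^9\) seconds (roughly 30 years), whereas an \(O(N)\) algorithm would need about one second.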

An obvious approach is to throw more resources at the problem: more memory, faster networks, better processors. However:
A paradigm that has become very popular in recent years is the use of cloud resources: storage and computing resources held remotely and available on demand.
We saw an example of this in Lecture 22 when downloading data from S3 (Amazon Web Services).
Major providers include Amazon Web Services, Microsoft Azure and Google Cloud Platform.
Cloud providers offer a variety of resources, such as:
This offers several advantages:
However, there are also considerations to keep in mind:
Instead of using more or bigger resources, we could look at different kinds of technologies and computational approaches:
These usually involve more work in adapting the solution but, if effective, can yield important benefits.
A small resource (e.g. computer) can handle a small task:

Faced with a large task, instead of increasing the size of the resource…

…break it down into smaller tasks:

This is still not trivial:
How do we efficiently break down tasks and combine results?
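A minimal sketch of the idea, assuming the task is a sum of squares over a list: the data is split into chunks, each chunk is processed independently, and the partial results are combined. The helper names `split_into_chunks` and `sum_of_squares` are hypothetical. Threads are used only to keep the sketch short; a real workload would more likely use separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Break a list into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks each take one extra item
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The small task: process one chunk independently of the others."""
    return sum(x * x for x in chunk)


data = list(range(1_000))
chunks = split_into_chunks(data, n_chunks=4)

# Each chunk is handled as a separate small task
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(sum_of_squares, chunks))

# Combine the partial results into the final answer
total = sum(partial_results)
print(total)
```

Here the combination step is trivial because addition can be applied to the partial results in any order; tasks whose pieces depend on each other are much harder to split this way.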
Image credit: Ben Congdon, https://github.com/bcongdon/corral
We can construct a very minimal (and non-parallelised) Python implementation of a MapReduce ‘engine’ as follows:
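One possible minimal sketch is shown below. The name `map_reduce`, its signature, and the convention that the mapper returns a list of (key, value) pairs are assumptions made for illustration rather than a definitive implementation.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(data, mapper, reducer):
    """Run a serial MapReduce over the items in `data`.

    mapper:  takes one item, returns an iterable of (key, value) pairs
    reducer: takes two values for the same key, returns their combination
    """
    groups = defaultdict(list)
    for item in data:
        # Map: turn each input item into (key, value) pairs
        for key, value in mapper(item):
            # Shuffle: collect together all values that share a key
            groups[key].append(value)
    # Reduce: combine the values collected for each key into one result
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Tiny usage example: sum the odd and even numbers separately
result = map_reduce(
    [1, 2, 3, 4],
    mapper=lambda x: [("even" if x % 2 == 0 else "odd", x)],
    reducer=lambda a, b: a + b,
)
print(result)  # {'odd': 4, 'even': 6}
```

In a real MapReduce system the map and reduce steps would run in parallel across many machines; this sketch only shows the flow of data.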
As an example task, imagine we want to count the number of occurrences of each character in a sentence.
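We can apply MapReduce by mapping each character to a (character, 1) pair, grouping the pairs by character, and reducing each group by summing its values.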
In code:
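A sketch of this example, using a compact stand-in for the MapReduce engine so that the snippet runs on its own. The input sentence is an assumption (it is a pangram consistent with the counts shown next), and `map_reduce`, `emit_char` and `add` are illustrative names.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(data, mapper, reducer):
    """Compact stand-in engine: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for item in data:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


def emit_char(char):
    """Map step: emit a (character, 1) pair for the given character."""
    return [(char, 1)]


def add(a, b):
    """Reduce step: sum the counts collected for one character."""
    return a + b


sentence = "The quick brown fox jumps over the lazy dog."  # assumed input
counts = map_reduce(sentence, emit_char, add)
print(counts)
```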
{'T': 1, 'h': 2, 'e': 3, ' ': 8, 'q': 1, 'u': 2, 'i': 1, 'c': 1, 'k': 1, 'b': 1, 'r': 2, 'o': 4, 'w': 1, 'n': 1, 'f': 1, 'x': 1, 'j': 1, 'm': 1, 'p': 1, 's': 1, 'v': 1, 't': 1, 'l': 1, 'a': 1, 'z': 1, 'y': 1, 'd': 1, 'g': 1, '.': 1}
Or sorting the output:
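One way to do this, assuming the counts are held in an ordinary dictionary. To keep the snippet self-contained, the counts are rebuilt here with `collections.Counter` as a stand-in for the MapReduce result, and the input sentence is again an assumption.

```python
from collections import Counter

sentence = "The quick brown fox jumps over the lazy dog."  # assumed input
counts = dict(Counter(sentence))  # stand-in for the MapReduce result

# Sort the (character, count) pairs by character and rebuild the dictionary;
# dictionaries preserve insertion order, so the result prints in sorted order
sorted_counts = dict(sorted(counts.items()))
print(sorted_counts)
```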
{' ': 8, '.': 1, 'T': 1, 'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 3, 'f': 1, 'g': 1, 'h': 2, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1, 'o': 4, 'p': 1, 'q': 1, 'r': 2, 's': 1, 't': 1, 'u': 2, 'v': 1, 'w': 1, 'x': 1, 'y': 1, 'z': 1}