The key to Pogo’s speed is the use of GPUs (Graphics Processing Units). The standard CPU (Central Processing Unit) in a computer is very flexible at one-off tasks, such as those which arise when running a word processing package or a web browser. However, when it comes to graphics rendering, the same (often very simple) calculation is run for each pixel, each time with a slightly different input such as a coordinate location. GPUs were developed to address this problem, typically featuring hundreds of cores which can be run in parallel, with the cores optimised to run the same calculation each time. Additionally, in order to allow for the highest possible frame rates, GPUs are designed to have high memory bandwidth, i.e. they can access their memory much faster than a traditional CPU can access the system RAM, which is in part achieved by having memory integrated into the GPU card itself.
The two key features, lightweight parallel capabilities and high memory bandwidth, enable Pogo’s speed. The explicit time domain solver updates values at every node at each time step, a highly parallel problem requiring fast access to memory.
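To illustrate why this kind of update parallelises so well, the following is a minimal sketch of one explicit time step. It is an assumption-laden stand-in rather than Pogo's own kernel: it uses a 1D finite-difference wave equation instead of finite elements, and the names explicit_step and courant are hypothetical.

```python
def explicit_step(u_prev, u_curr, courant):
    """Advance a 1D wave field by one time step using central differences.

    u_prev, u_curr: nodal values at the previous and current time steps.
    courant: c*dt/dx, assumed to be <= 1 so that the scheme is stable.
    The two end nodes are held at zero (an arbitrary choice for this sketch).
    """
    n = len(u_curr)
    u_next = [0.0] * n
    c2 = courant * courant
    # Each node's new value depends only on old values at that node and its
    # immediate neighbours, so every loop iteration is independent of the
    # others and could be given to a separate GPU thread.
    for i in range(1, n - 1):
        u_next[i] = (2.0 * u_curr[i] - u_prev[i]
                     + c2 * (u_curr[i + 1] - 2.0 * u_curr[i] + u_curr[i - 1]))
    return u_next
```

Very little arithmetic is done per node, but several values must be read from memory and one written back, which is why memory access rather than calculation tends to dominate.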
Pogo uses Nvidia’s CUDA to access the GPU. This is a flexible technology, allowing the software to run on any CUDA capable card. It does mean, however, that only Nvidia cards are suitable for Pogo. It is advised that the CUDA compute capability of the card is greater than 2.0 (see the list on the Nvidia website), although it is no longer possible to purchase new cards with a lower compute capability than this.
The main performance limiting parameter is the memory bandwidth. Pogo (and the explicit time domain method generally) is bandwidth limited; this means that the speed is limited by how quickly the solver can load data from memory rather than how quickly the calculation can be done. In practice this means that the number of GPU cores is largely irrelevant; instead the run time is inversely proportional to the memory bandwidth, so doubling the bandwidth roughly halves the run time. Typical good bandwidths are around 300GB/s. Gaming GPUs are a good choice for Pogo, being cheap due to market competition, yet generally possessing high bandwidths. However, for more dedicated multi-GPU systems, it may be necessary to go for professional cards such as those from the Tesla range due to practical constraints such as cooling.
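As a rough illustration of what bandwidth limited means, the run time of the solve can be estimated as the total data moved divided by the memory bandwidth. The sketch below assumes this simple model; the function name and the figure of 100 bytes moved per node per time step are hypothetical placeholders, not measured Pogo values.

```python
def estimated_run_time_s(n_nodes, n_steps, bytes_per_node_step, bandwidth_gb_s):
    """Estimate the solve time in seconds for a bandwidth-limited solver.

    bytes_per_node_step is the (assumed) amount of data read and written
    for each node at each time step; bandwidth_gb_s is in GB/s (1e9 bytes/s).
    """
    bytes_moved = n_nodes * n_steps * bytes_per_node_step
    return bytes_moved / (bandwidth_gb_s * 1e9)


# Hypothetical model: 10 million nodes, 10,000 time steps, 100 bytes per
# node per step, on a 300GB/s card.
t_300 = estimated_run_time_s(10_000_000, 10_000, 100, 300)  # about 33 s
# The same model on a card with twice the bandwidth takes half the time.
t_600 = estimated_run_time_s(10_000_000, 10_000, 100, 600)
```

Note that the number of cores does not appear anywhere in this estimate.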
The second consideration is model size. A typical GPU has far less memory than the system RAM, and this can limit the size of model which can be stored in the GPU memory. Note that the entire model must fit in GPU memory; transferring data from system RAM to the GPU is very slow and this would have to be done at every time step, so Pogo does not support this. Most 2D models can be run on a single card; however, medium-to-large 3D models will need multiple cards.
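A quick way to judge whether a model is likely to fit is to compare its approximate storage requirement with the memory available per card. The following sketch assumes a fixed storage cost per node and that only part of each card's memory is usable; bytes_per_node and usable_fraction are hypothetical values chosen for illustration, and the real requirement depends on element type and mesh connectivity.

```python
import math


def gpus_needed(n_nodes, bytes_per_node, gpu_memory_gb, usable_fraction=0.9):
    """Estimate how many cards are needed to hold a model in GPU memory.

    Assumes the model splits evenly across cards and ignores any values
    duplicated at the boundaries between sections.
    """
    model_bytes = n_nodes * bytes_per_node
    usable_bytes = gpu_memory_gb * 1e9 * usable_fraction
    return max(1, math.ceil(model_bytes / usable_bytes))


# Hypothetical 2D model: 5 million nodes at 200 bytes per node (1GB).
cards_2d = gpus_needed(5_000_000, 200, gpu_memory_gb=8)
# Hypothetical 3D model: 200 million nodes at 200 bytes per node (40GB).
cards_3d = gpus_needed(200_000_000, 200, gpu_memory_gb=8)
```

With these assumed figures the 2D model fits on one 8GB card, while the 3D model needs six.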
The amount of time Pogo takes to run is dependent on the GPU (as discussed above, this is primarily dependent on memory bandwidth; more information can be found in the paper doi:10.1016/j.jcp.2013.10.017). It can also depend on the CPU and other system parameters for pre- and post-processing (if these are included in the run-time comparison).
At the time of writing (December 2016) a good individual card for Pogo would be the GTX 1080; this has 8GB of memory and 320GB/s bandwidth. It is available for around £600 in the UK. Given a suitable system, i.e. one with adequate cooling, four of these cards could be combined.
Imperial has several systems which are used to run Pogo. The primary one for the largest jobs consists of 8 K80 cards; these are dual-GPU cards with 12GB per GPU, making 8 x 2 x 12 = 192GB total memory. This has proved suitable for almost all 3D problems (inevitably there is always a PhD student who wants to run the biggest model possible!). Bandwidth for the K80 is 240GB/s per GPU, making 16 x 240 = 3840GB/s in total.
Pogo performs well across multiple GPUs, with very little resulting overhead. It achieves this by efficiently splitting the model into separate sections, one for each GPU. The boundary values for each section are calculated first, then these are transferred to the other GPUs. While the transfer is happening, the remaining internal values of each section are calculated, so in general the transfer time is hidden behind the calculation.
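The ordering described above can be sketched as follows. This is an illustrative toy under several assumptions, not Pogo's implementation: it uses a 1D finite-difference wave equation split into two sections, runs sequentially in plain Python, and all function names are hypothetical. On real hardware the exchange in step 2 would be an asynchronous copy running at the same time as step 3.

```python
def update_node(u_prev, u_curr, i, c2):
    """Central-difference update of node i from old values only."""
    return (2.0 * u_curr[i] - u_prev[i]
            + c2 * (u_curr[i + 1] - 2.0 * u_curr[i] + u_curr[i - 1]))


def step_single(u_prev, u_curr, c2):
    """Reference: one time step on the whole model, end nodes fixed at zero."""
    u_next = [0.0] * len(u_curr)
    for i in range(1, len(u_curr) - 1):
        u_next[i] = update_node(u_prev, u_curr, i, c2)
    return u_next


def step_split(l_prev, l_curr, r_prev, r_curr, c2):
    """One time step on two sections.

    Left arrays hold the left section's nodes followed by one copy of the
    neighbouring node; right arrays hold one copied node, then their own.
    """
    l_next = [0.0] * len(l_curr)
    r_next = [0.0] * len(r_curr)
    lb = len(l_curr) - 2  # left section's boundary node
    rb = 1                # right section's boundary node
    # 1. Boundary nodes are calculated first.
    l_next[lb] = update_node(l_prev, l_curr, lb, c2)
    r_next[rb] = update_node(r_prev, r_curr, rb, c2)
    # 2. Boundary values are sent to the neighbouring section.
    l_next[lb + 1] = r_next[rb]
    r_next[rb - 1] = l_next[lb]
    # 3. Internal nodes need none of the exchanged values, so this work
    #    can proceed while the transfer is in flight.
    for i in range(1, lb):
        l_next[i] = update_node(l_prev, l_curr, i, c2)
    for i in range(rb + 1, len(r_curr) - 1):
        r_next[i] = update_node(r_prev, r_curr, i, c2)
    return l_next, r_next


def demo(n_steps=5, c2=0.25):
    """Run the single and split versions side by side on a 10-node model."""
    u_prev = [0.0] * 10
    u_curr = [0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
    l_prev, l_curr = u_prev[:6], u_curr[:6]  # nodes 0-4 plus a copy of node 5
    r_prev, r_curr = u_prev[4:], u_curr[4:]  # a copy of node 4 plus nodes 5-9
    for _ in range(n_steps):
        u_prev, u_curr = u_curr, step_single(u_prev, u_curr, c2)
        l_next, r_next = step_split(l_prev, l_curr, r_prev, r_curr, c2)
        l_prev, l_curr, r_prev, r_curr = l_curr, l_next, r_curr, r_next
    return u_curr, l_curr[:5] + r_curr[1:]
```

Calling demo() returns the field from the single-section run and the field reassembled from the two sections; the two should agree, since the split changes only the order in which nodes are updated.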
In some cases it may be necessary to use multiple GPUs; as highlighted above, this will be the case if a single card does not have enough memory to hold the model on its own. It is also possible to improve speed by using multiple GPUs. Although the transfer between GPUs generally has very little overhead associated with it, the final performance is dependent on the entire system.
Clearly there are physical requirements for a system to be able to house multiple GPUs, particularly since high performance GPUs are large. Cooling is also extremely important. When setting up a multi-GPU system, it is therefore advised that users purchase an entire system, including the GPUs, rather than fitting multiple cards into a separately purchased chassis. Such a system will have been designed to cope with the power, cooling and physical requirements, and is likely to be covered by warranty. Most manufacturers of such systems will provide a choice of GPUs.
Note that the multiple GPU version is not available under an open source agreement; this version requires special developments specific to each system, which can be acquired on a consultancy basis.