
School of the Biological Sciences


1) Chris Gilligan, Plant Sciences

The group carries out stochastic simulation and Bayesian parameter estimation for epidemic models, and also runs Lagrangian simulations that place heavy demands on meteorological data. They have been using about 80TB of Research Data Store for data processing and 50TB of Research Cold Store for longer-term storage. In about nine months they have used over 600,000 hours of CPU time, about half paid-for and half free (SL2 and SL3). To estimate resources for a grant budget, they look at the amount of resources required to do similar previous work.

2) Chris Illingworth, Genetics

"My work regularly involves statistical analyses of genomic data which can be computationally intensive to perform; as such I make regular use of the HPC. After changes made in the last couple of years the HPC is easy to use and well-managed. I have had a fantastic response from the team who run it whenever I have encountered any problems or had any queries about the system. When applying for grants it can be tough up-front to estimate how much time and storage to apply for, but experience has been useful in working out approximately how much is needed for the daily requirements of my lab; a mixture of central storage, and locally provided storage in the department, has been ideal for my needs."

3) Ben Luisi, Biochemistry

"My colleagues and I have been using the HPC extensively to process data from cryo-electron microscopy and generate structures of macromolecules with biomedical relevance. We have been using both GPU and CPU modes for extensive calculations and 100TB of RDS storage space for handling our large datasets and the intermediate files generated from data processing and analysis. The resources have been invaluable for our work, which has allowed us to elucidate the high-resolution structures of molecular machines that transport molecules across biological membranes and enzyme assemblies that use RNA to help regulate the control of gene expression."

4) Russell Hamilton, Centre for Trophoblast Research

"The Centre for Trophoblast Research Bioinformatics Facility operates under a cost recovery model. When determining the bioinformatics costs for a grant proposal we use a standard hourly rate, set to include compute, storage (short and long term) as well as the analysis time. The number of hours specified in the costing is calculated based on the type (e.g. RNA-Seq) and size of the experiment (number of samples, and groups to be compared)."

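The costing approach described above (hours derived from experiment type and size, multiplied by a standard hourly rate) might be sketched as follows. This is only an illustration: the rate, the per-type hour figures, the pairwise-comparison rule and all function names are hypothetical assumptions, not the Facility's actual model.

```python
# Hypothetical placeholder figures; the real rate and hour allowances are not
# given in the text above.
HOURLY_RATE_GBP = 50.0
BASE_HOURS = {"rna-seq": 10.0, "chip-seq": 12.0}        # setup and QC per experiment
HOURS_PER_SAMPLE = {"rna-seq": 0.5, "chip-seq": 0.75}   # processing per sample
HOURS_PER_COMPARISON = 2.0                              # analysis per group comparison


def estimate_hours(experiment_type: str, n_samples: int, n_groups: int) -> float:
    """Estimate chargeable hours from experiment type and size."""
    key = experiment_type.lower()
    # Assumption: every pair of groups is compared once.
    comparisons = n_groups * (n_groups - 1) // 2
    return (
        BASE_HOURS[key]
        + HOURS_PER_SAMPLE[key] * n_samples
        + HOURS_PER_COMPARISON * comparisons
    )


def estimate_cost(experiment_type: str, n_samples: int, n_groups: int) -> float:
    """Convert the hour estimate to a cost using the single standard rate."""
    return estimate_hours(experiment_type, n_samples, n_groups) * HOURLY_RATE_GBP


if __name__ == "__main__":
    print(estimate_cost("RNA-Seq", n_samples=24, n_groups=3))
```

Because compute, storage and analysis time are folded into one hourly rate, the only inputs a grant applicant would need to supply in this sketch are the experiment type, the number of samples and the number of groups.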
5) Aylwyn Scally, Genetics

"In general my process for estimating compute usage and storage is as follows. I first identify previous projects that are most similar in terms of computation needs and usage type. To do this I match projects based on a breakdown into the following categories: pipeline development, algorithm development, data processing, simulation/sampling (including Monte Carlo methods), and large memory tasks. (Most of my jobs are trivially parallel at large scale, but some things like assembly and/or certain simulations need to hold tens of gigabytes in memory.) Then I scale by the amount of data involved or another relevant factor. I rarely make any allowance for increases in CPU/IO efficiency compared to a few years ago, because in my experience the application complexity usually scales as well. Estimating development time is tricky, particularly as a lot of it is done outside the HPC, but one has to budget for some contingency because not all development can be done in this way, particularly for issues specific to the HPC architecture. Also, even after much development, there will always be many 'production' runs at scale which have to be repeated because a bug/error emerges at a late stage, or data issues become clear during postprocessing and analysis. Storage costs are estimated similarly; I count on keeping most intermediate files around during the life of the project (compressed where possible), and then reducing to a set of key outputs afterwards. I always aim to write pipelines which clean up redundant files as they go."

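The estimation process described above could be sketched roughly as below: take the per-category CPU hours of a similar past project, scale them by the ratio of data volumes, and add a contingency for repeated production runs. Linear scaling, the 25% default contingency, the example figures and the function name are all assumptions made for illustration; no efficiency discount is applied, in line with the reasoning in the quote.

```python
def estimate_cpu_hours(
    reference_hours: dict[str, float],
    reference_data_tb: float,
    new_data_tb: float,
    contingency: float = 0.25,
) -> dict[str, float]:
    """Scale a past project's per-category CPU hours to a new project.

    Assumes usage grows linearly with data volume (a simplification) and
    applies the same contingency fraction to every category.
    """
    if reference_data_tb <= 0:
        raise ValueError("reference_data_tb must be positive")
    scale = new_data_tb / reference_data_tb
    return {
        category: hours * scale * (1.0 + contingency)
        for category, hours in reference_hours.items()
    }


if __name__ == "__main__":
    # Hypothetical usage figures for a previous, similar project.
    past_project = {
        "pipeline development": 500.0,
        "data processing": 1000.0,
        "simulation/sampling": 2000.0,
    }
    estimate = estimate_cpu_hours(past_project, reference_data_tb=2.0, new_data_tb=5.0)
    for category, hours in estimate.items():
        print(f"{category}: {hours:.0f} CPU hours")
    print(f"total: {sum(estimate.values()):.0f} CPU hours")
```

In practice some categories, development time in particular, may not scale with data volume at all, so a per-category scale factor might be a more realistic refinement.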
6) Andrea Manica, Zoology

"My group uses the HPC facilities mainly for analysing genetic data and performing demographic and spatial modelling.

Genetic data are currently generated using next-generation sequencing (NGS) technologies, and this approach requires several steps (NGS data processing) before we can get a final dataset to work on. For each sample, several raw fastq files representing different libraries are processed in parallel. The process starts with adapter trimming, followed by alignment, duplicate removal, local realignment and variant calling. Each of these steps can be computationally time-consuming or demand high memory, so they are run on the SL3 or SL2 queue accordingly. Generally, each raw fastq file is split into smaller files, which are then analysed independently using multiple cores to speed up the process.

HPC facilities are also crucial to perform demographic and spatial modelling. When we reconstruct the history of a population, we design several demographic models to generate simulations (artificial datasets) that aim to cover many parameter combinations. Parameter ranges can be quite wide and millions of simulations are needed to fully explore all possible values. Therefore, independent runs are generally submitted in parallel to generate multiple simulations at the same time. This approach would not be feasible on just a few cores on a laptop and the HPC facilities allow us to substantially speed up the process. Coupling demographic modelling with climatic and spatial information makes the reconstruction more computationally intensive as we have additional parameters to explore such as the geographical origin of the demographic event under study. Therefore, having the possibility of using several cores for multiple jobs helps us to test different scenarios in a reasonable time.

Each project requires some test runs to estimate the amount of resources and computational time needed. These tests are generally carried out on a few samples and the estimates are then scaled up for the whole dataset in order to provide an overall picture of the resources needed for each project. Storage is getting more and more important specifically to process genetic samples, considering their file size. Therefore, batches of samples are generally processed on the HPC facility and then transferred back to local storage. Besides the storage, other key factors such as run time, memory and the number of cores available should be considered before deciding on the most appropriate service level for each project."
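The fastq splitting step mentioned in the account above (each raw file divided into smaller files that are analysed independently) can be illustrated with a minimal sketch. It assumes uncompressed FASTQ with exactly four lines per record, which is the common layout but not guaranteed; the function name and the output naming scheme are hypothetical, and the group's actual tooling is not described in the text.

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, records_per_chunk, out_dir):
    """Split a FASTQ file into chunk files containing whole records.

    Assumes plain-text FASTQ with four lines per record (header, sequence,
    '+', qualities). Returns the chunk paths in order.
    """
    path = Path(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * records_per_chunk
    chunks = []
    with path.open() as handle:
        while True:
            # Read the next block of whole records; an empty block means EOF.
            block = list(islice(handle, lines_per_chunk))
            if not block:
                break
            chunk_path = out_dir / f"{path.stem}.part{len(chunks):04d}.fastq"
            chunk_path.write_text("".join(block))
            chunks.append(chunk_path)
    return chunks
```

Each returned chunk could then be handed to a separate job or array task, so that trimming and alignment run on many cores at once before the per-chunk results are merged.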

7) Jenny Barna, Biological Sciences

"There are a number of cases where the conventional HPC and available storage options did not meet needs that had been met by the School's computing facility. In order to close that facility down in 2017 we needed to move the last few use cases to virtual servers (VMs). These included software with a web interface and a web-based method of sharing data. Other cases include provision of a VM to groups who have put in a private fibre to link the new Cryo-EM facility to the HPCS. More recently, to aid a group moving during building works, we have provided virtual infrastructure so at least some of the infrastructure can stay in one place while the group has to move to and fro (details under 8). These are all examples of the use of the OpenStack research computing cloud service."

8) Cambridge Centre for Proteomics

The Cambridge Centre for Proteomics (CCP) strives to develop robust proteomics technology, which it applies to a wide variety of biological questions, making new technologies available to collaborators of CCP and customers of the Core Facility. A fundamental part of this work is identifying proteins from mass spectrometry data. The proprietary software that is the main work-horse for the group is now housed on an OpenStack VM. This enables searching 'in the cloud' irrespective of where the group is located. The OpenStack infrastructure allows use of multiple CPU cores, extensive RAM and TBs of data storage. An advantage of this technology is that additional computing resources can be added in the future.


School scheme to apply for pump-priming or stop-gap resources

Note: ample resources remain to be claimed under the scheme. The only resource heavily used is Research Data Store.