6.1. How much memory does a de novo assembly take?
The amount of memory required for any particular de novo assembly will depend on a combination of genome size, data type, data quality and data volume. In simple (and vague) terms, the larger the de Bruijn graph gets, the more memory is needed. Data quality can have an impact on the size of the graph, as can features in the data such as lots of repeats, high heterozygosity, and so on.
Our White Paper on de novo assembly describes the tool and gives details of example assemblies, including the amount of memory on the systems the assemblies were run on:
The White Paper outlines the memory requirements when using the Assembly Cell. For users of the Genomics Workbench, a de novo assembly where you have requested Simple Contig output runs in three phases: a pre-processing phase, a computational phase and a post-processing phase. The Assembly Cell de novo assembly program is the same as the computational phase of a de novo assembly run via the Genomics Workbench, so running via the Workbench adds some overhead. To translate the memory values reported in the White Paper into memory requirements for the Workbench:
- If nothing else will be run on the machine except the de novo assembly via the Genomics Workbench, take the amount of memory reported as used in the White Paper and add approximately 1/4 to 1/3 of that amount again to get the minimum you would need on your system.
- More commonly, you will also want the machine to be usable for other small tasks, in which case you should consider adding 1/2 of the reported amount again, or even doubling it.
For example, if you decided, based on the White Paper information, that your assembly task would likely require about 24 GB of memory for the computational phase, then you should be considering a system with 32 to 48 GB of memory.
We cannot make guarantees of course, as so much depends on your particular data set.
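The rule of thumb above can be written out as a small calculation. This is only a sketch of the arithmetic in this section, not part of any CLC software: the function name `workbench_memory_range` and its parameters are hypothetical, and the result is a rough starting point rather than a guarantee.

```python
def workbench_memory_range(computational_gb, dedicated_machine=True):
    """Return a (low, high) estimate in GB of total system memory.

    computational_gb: memory reported for the computational phase
        (for example, a value taken from the White Paper).
    dedicated_machine: True if nothing else will run on the machine.
    """
    if computational_gb <= 0:
        raise ValueError("computational_gb must be positive")
    if dedicated_machine:
        # Add 1/4 to 1/3 of the reported amount again.
        low = computational_gb + computational_gb / 4
        high = computational_gb + computational_gb / 3
    else:
        # Add 1/2 of the reported amount again, or double it.
        low = computational_gb + computational_gb / 2
        high = 2.0 * computational_gb
    return low, high


# The 24 GB example from the text.
print(workbench_memory_range(24))                           # (30.0, 32.0)
print(workbench_memory_range(24, dedicated_machine=False))  # (36.0, 48.0)
```

The 32 to 48 GB range quoted in the example runs from the upper estimate for a dedicated machine to the upper estimate for a machine shared with other small tasks.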
We outline our recommended system requirements for running the Genomics Workbench on our website, but for the reasons outlined above, this does not include any specifics about requirements for de novo assembly:
If you are running out of memory, it may be that you simply do not have enough memory to run the assembly you are trying to run. However, there are some things that can improve the situation with regard to memory use. Some of the suggestions below (especially 2 and 3) can also affect the quality of the output and the speed at which the assembly completes.
Are other tasks running on the same machine at the same time? This could be other de novo assemblies, other read mappings, or other jobs not related to your CLC software that require memory to run. The computational phase of the de novo assembly assumes it has access to as much memory as it needs and does not account for situations where other tasks are running at the same time. If several jobs were running on the system when the error occurred, please try running a single de novo assembly again when nothing else is running on the system.
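One way to check whether other tasks are holding memory before starting an assembly is to look at how much memory the operating system reports as available. The sketch below assumes a Linux machine, where `/proc/meminfo` reports a `MemAvailable` value in kB; the helper names `parse_meminfo` and `has_enough_available` and the sample figures are hypothetical, and none of this is part of the CLC software.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB.

    Lines without a kB unit (plain counters) are skipped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == 2 and fields[1] == "kB" and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def has_enough_available(meminfo_text, required_gb):
    """Return True if MemAvailable is at least required_gb (in GiB)."""
    info = parse_meminfo(meminfo_text)
    if "MemAvailable" not in info:
        raise KeyError("MemAvailable is not reported on this system")
    return info["MemAvailable"] >= required_gb * 1024 * 1024


# Made-up sample: a 32 GiB machine with 26 GiB currently available.
SAMPLE_MEMINFO = """MemTotal:       33554432 kB
MemFree:         1048576 kB
MemAvailable:   27262976 kB
HugePages_Total:       0
"""

print(has_enough_available(SAMPLE_MEMINFO, 24))  # True
print(has_enough_available(SAMPLE_MEMINFO, 32))  # False
```

On a real Linux system, the text would come from reading `/proc/meminfo` (for example with `open("/proc/meminfo").read()`) instead of the sample string. If much less memory is available than the machine has installed, other jobs are probably still running.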