Background

The amount of memory required for any particular de novo assembly will depend on a combination of
 

  • genome size,
  • data type,
  • data quality and
  • data volume.

In simple (and vague) terms, the larger the de Bruijn graph gets, the more memory is needed. Data quality can have an impact on the size of the graph, as can features in the data such as lots of repeats, high heterozygosity, and so on.

Our White Paper on de novo assembly describes the tool and gives details of example assemblies, including the amount of memory on the systems those assemblies were run on:
 

http://resources.qiagenbioinformatics.com//white-papers/White_paper_on_de_novo_assembly_4.pdf
 

For users of QIAGEN CLC Genomics Workbench, a de novo assembly run with the Simple Contig output option selected has three phases:
 

  • a pre-processing phase,
  • a computational phase and
  • a post-processing phase.
 

The post-processing phase is not discussed in the White Paper. Some additional memory overhead should therefore be expected when running a de novo assembly via CLC Genomics Workbench.

When estimating memory requirements for the Workbench relative to the memory values reported in the White Paper:

  • If nothing else will be run on the machine except the de novo assembly via CLC Genomics Workbench, take the amount of memory reported as used in the White Paper and add approximately 1/4 to 1/3 of that amount again to estimate the minimum you would need on your system.
  • More commonly, you will also want the machine to be usable for other small tasks, in which case you should consider adding 1/2 of the reported amount again, or possibly doubling it.


For example, if you decided, based on the White Paper information, that your assembly task would likely require about 24 GB of memory for the computational phase, then you should be considering a system with 32 GB (24 GB plus 1/3 again) to 48 GB (24 GB doubled) of memory.
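The arithmetic behind this rule of thumb can be sketched in a few lines of Python. This is only an illustration of the fractions given above, not a sizing tool provided with the software: the function name suggested_memory_range_gb and its parameters are hypothetical, and the result is a rough estimate that depends entirely on the figure you take from the White Paper.

```python
from fractions import Fraction


def suggested_memory_range_gb(computational_gb, dedicated=False):
    """Return a (low, high) system memory estimate in GB.

    computational_gb: memory you expect the computational phase to need,
        based on the White Paper (an estimate you supply yourself).
    dedicated: True if nothing else will run on the machine.

    Hypothetical helper; it only applies the rule-of-thumb fractions
    from the text: add 1/4 to 1/3 again on a dedicated machine,
    or 1/2 to a full time again on a shared one.
    """
    if dedicated:
        low_extra, high_extra = Fraction(1, 4), Fraction(1, 3)
    else:
        low_extra, high_extra = Fraction(1, 2), Fraction(1, 1)
    low = float(computational_gb * (1 + low_extra))
    high = float(computational_gb * (1 + high_extra))
    return low, high


# The 24 GB example from the text:
print(suggested_memory_range_gb(24, dedicated=True))  # (30.0, 32.0)
print(suggested_memory_range_gb(24))                  # (36.0, 48.0)
```

The 32 to 48 GB range quoted in the example corresponds to the upper dedicated-machine estimate (32 GB) through the upper shared-machine estimate (48 GB).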

We cannot make guarantees of course, as so much depends on your particular data set.

We outline our recommended system requirements for running CLC Genomics Workbench on our website, but for the reasons outlined above, this does not include any specifics about requirements for de novo assembly:

https://digitalinsights.qiagen.com/technical-support/system-requirements/
 

Suggestions

If you are running out of memory, it may be that your system does not have enough memory for the assembly you are trying to run. However, there are some things that can reduce memory use. Some of the suggestions below (especially II and III) can also affect the quality of the output and the speed at which the assembly completes.


Suggestion I

Are other tasks running on the same machine at the same time? These could be other de novo assemblies, read mappings, or jobs unrelated to your CLC software that require memory to run. The computational phase of the de novo assembly assumes it has access to as much memory as it needs and does not account for situations where other tasks are running at the same time. If several things were running on the system when the error occurred, please try running a single de novo assembly again while nothing else is running on the system.

 
Suggestion II
 
Have you trimmed all adapters from your sequences? This can make a big difference to the resource demands of de novo analysis as well as to the quality of the outputs.
 
Adapter trimming is covered in the manual in this section:
 
http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Adapter_trimming.html
 
 
Suggestion III
 
Have you trimmed your reads for quality? Using only high-quality data as input can make a big difference to the resource demands of de novo analysis.
 
Quality trimming is covered in the manual in this section:
 
http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Quality_trimming.html
 
 
 

Other things to check

These things may or may not help with the memory use, but they will aid in getting quality results:
 
Parameters

Suggestions for adjusting assembly parameters can be found in the Running the Assembly section of the Best practices guide for de novo assemblies:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Best_practices.html