Why should a CLC de novo assembly not be run in stages?

This FAQ describes why including all data in one de novo assembly is recommended.

Our recommendation is to include all high quality data in one de novo assembly, rather than taking a "staged approach", e.g. using de novo assembled contigs as input for a new de novo assembly.

Assembly of all sequencing data is recommended

We would recommend assembling all the data in a single assembly, so that the full graph information can be used, and the reads themselves are the representatives of what is in your sample. Detailed information for why we make this recommendation is in the Background section below.

When using a combination of high quality data along with lower quality, paired data, please consider using the latter for guidance only. This is an option presented to you in the Wizard when you are setting up your de novo analysis. When the Guidance only option is used, only the pairing distance information from those reads will be used for the scaffolding step. That is, the construction of the word table and the graph will not be based on these reads.

Assembling all sequencing data together may require a machine with a substantial amount of memory, depending on the volume of data you have. If you have very limited memory resources we would recommend performing this larger assembly on a more powerful machine. If this is absolutely not possible, then you could try one of the following options:

Choose Simple contig output for your assembly, rather than mapping your reads back to your contigs. You may still not have enough memory to run a large assembly, but it may be worth trying this.
If the above idea does not work, then your only remaining choice is to make multiple copies of the contigs you already have and enter those as reads. Please keep in mind the limitations of such an approach, described in the Background section below.

We recommend referring to our manual to learn more about how de novo assembly works in CLC software:
https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=De_Novo_sequencing.html

Background

There are two primary reasons we do not recommend using contig results as input for new assemblies.

1. Input contigs are treated as read equivalents

If you carry out an assembly step by step, it involves taking the contigs of an earlier assembly, and inputting them as "reads" into another assembly, along with sequence reads from another dataset.

The input contig is seen as a single read. If one of the data sets did not have very deep coverage of some regions, it might look like certain reads (in this case, one of your contigs from an earlier assembly) were only seen once, and so the assembler would not consider them strong evidence that the region exists. These would be dropped from the assembly. In this case, there is a good chance of losing data that you really did have evidence for in the reads used in an earlier assembly.
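The effect can be illustrated with a minimal sketch. It is not how the CLC assembler is implemented: it assumes a simple word (k-mer) count with a hypothetical minimum-count cutoff, and the names kmer_counts, supported_kmers, k and min_count are invented for the illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every word (k-mer) of length k across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def supported_kmers(sequences, k, min_count):
    """Keep only the words seen at least min_count times (assumed cutoff)."""
    counts = kmer_counts(sequences, k)
    return {kmer for kmer, n in counts.items() if n >= min_count}


region = "ACGTTGCAAGCT"

# Five original reads covering the region: every word is seen five times.
original_reads = [region] * 5
# The same region entered as one contig: every word is seen only once.
contig_as_read = [region]

kept_from_reads = supported_kmers(original_reads, k=4, min_count=2)
kept_from_contig = supported_kmers(contig_as_read, k=4, min_count=2)

print(len(kept_from_reads), "words kept when the original reads are used")
print(len(kept_from_contig), "words kept when the contig is used as a read")
```

In this toy case the region survives the cutoff when the original reads are assembled, but disappears when only the contig stands in for them, even though the underlying evidence was the same.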

To account for the point above, you could decide to make multiple copies of your initial contig set, and then use those as input to a de novo assembly along with other sequence reads. Apart from the fact that you have lost the graph information, duplicating contig sequences implies that you have a lot more confidence in the existence and accuracy of your contig set than perhaps is warranted. Each of the contigs you enter will be seen as a read: the contig sequence is seen as a real measure of your sample. Of course, it isn't - it's a sequence generated from a prior assembly.
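As a rough illustration of that overconfidence, the sketch below uses the same simplified word-count model (an assumption, not the CLC implementation). The sequences, the copy number, and the function name kmer_counts are hypothetical. A contig carrying one wrong base is duplicated five times and combined with two real reads that carry the correct base.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every word (k-mer) of length k across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


true_sequence = "ACGTTGCAAGCT"
# Hypothetical contig from an earlier assembly with one wrong base (C -> G).
contig_with_error = "ACGTTGGAAGCT"

real_reads = [true_sequence] * 2          # low-coverage real measurements
contig_copies = [contig_with_error] * 5   # contig duplicated to boost its weight

counts = kmer_counts(real_reads + contig_copies, k=4)

# Word spanning the correct base, supported only by the real reads.
print("TGCA (correct):", counts["TGCA"])
# Word spanning the wrong base, supported only by the duplicated contig.
print("TGGA (error):  ", counts["TGGA"])
```

Here the erroneous word ends up with more apparent support than the correct one, purely because of how many copies of the contig were entered, not because the sample was measured that way.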

2. Input contigs do not contain underlying assembly information

When you enter your contigs as reads, they are treated as reads - that is, they are treated as a single "true" measurement, and there is no knowledge of the coverage or graph paths that were traversed to create them. This means that you lose some information that could be important, and might be especially useful with paired data sets.
