6.6. Do you have de novo assembly improvement suggestions?
We highly recommend referring to the de novo assembly section of the manual before proceeding. The information there describes how the assembler works and should help in considering the parameters that may suit your assembly.
In this FAQ entry, we provide some tips that have helped others with their de novo assemblies. We cannot give specific advice or guarantees though, as so much depends on the input data and the genome being assembled.
Before you begin your Assembly
In de novo assembly, more data does not always lead to a better result. Very high coverage at a location increases the probability that the location is treated as a sequencing error. This is undesirable because overlapping sequencing errors can result in poor assembly quality. We therefore recommend using data sets with an average read coverage below 100x.
If you expect the average coverage of your genome to be greater than 100x, you can use the Sample Reads tool to reduce coverage. To determine how many reads you need to sample to obtain a maximum average coverage of 100x, you can do the following calculation:
- Obtain an estimated size of the genome you intend to assemble.
- Multiply this genome size by 100. This value will be the total number of bases you should use as input for assembly.
- Divide the total input bases by the average length of your sequencing reads.
- Use this number as the number of reads to obtain as output from the Sample Reads tool.
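The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name and the example genome size and read length are hypothetical and are not part of the Sample Reads tool.

```python
import math


def reads_to_sample(genome_size_bp, mean_read_length_bp, target_coverage=100):
    """Return the number of reads giving at most the target average coverage.

    Hypothetical helper: it only reproduces the arithmetic described above.
    """
    if genome_size_bp <= 0 or mean_read_length_bp <= 0 or target_coverage <= 0:
        raise ValueError("all inputs must be positive")
    # Total number of bases to use as input for the assembly.
    total_bases = genome_size_bp * target_coverage
    # Round down so the average coverage does not exceed the target.
    return math.floor(total_bases / mean_read_length_bp)


# Example with assumed values: a 5 Mb genome sequenced with 150 bp reads.
print(reads_to_sample(5_000_000, 150))  # prints 3333333
```

Rounding down keeps the sampled data at or just below the chosen coverage rather than slightly above it.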
Running your Assembly: Parameter Setting
The two parameters that can be adjusted to improve assembly quality are
- Word Size and
- Bubble Size.
The default values for these parameters can work reasonably well on a range of data sets, but we recommend that you choose and evaluate these values based on what you know about your data.
If your data is of high quality (e.g. good Illumina reads), where you expect long regions of high quality, a larger Word Size, such as a value above 30, is recommended. If your data has a higher error rate, e.g. data where homopolymer errors are common, a Word Size below 30 is recommended. Whenever possible, the Word Size should be less than the expected number of bases between sequencing errors.
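As a rough way to reason about this guideline, the sketch below assumes that sequencing errors occur independently and at a uniform per-base rate. This is a simplification: real error profiles, such as homopolymer errors, are not uniform. The function names and the 1% error rate are hypothetical and serve only to illustrate the relationship between error rate and Word Size.

```python
def expected_bases_between_errors(error_rate):
    """Average spacing between errors under a uniform per-base error rate."""
    if not 0 < error_rate < 1:
        raise ValueError("error_rate must be between 0 and 1")
    return 1.0 / error_rate


def error_free_word_probability(word_size, error_rate):
    """Probability that a word of the given size contains no error.

    Assumes errors at each base are independent (a simplification).
    """
    if not 0 < error_rate < 1:
        raise ValueError("error_rate must be between 0 and 1")
    return (1.0 - error_rate) ** word_size


# Example with an assumed 1% per-base error rate.
rate = 0.01
print(expected_bases_between_errors(rate))
for word_size in (20, 30, 50):
    # Larger words are less likely to be error free at the same error rate.
    print(word_size, round(error_free_word_probability(word_size, rate), 3))
```

Under these assumptions, the larger the Word Size relative to the spacing between errors, the smaller the fraction of words that are error free.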
When adjusting Bubble Size, the repeat structure of your genome should be considered in conjunction with the sequence quality. If you do not expect a repetitive genome, you may wish to choose a higher Bubble Size to improve contiguity. If you anticipate more repeats, you may wish to use a smaller Bubble Size to reduce the possibility of collapsing repeat regions. In cases where the sequence quality is not high, a larger Bubble Size may make more sense for your data.
If you are not sure of what parameters would be best suited for your data, we recommend identifying optimal settings for your de novo assembly empirically. To do so, you may run multiple assembly jobs with different parameters and compare the results. Comparing the results of multiple assemblies is often a challenge. For example, you may have one assembly with a large N50 and another with a larger total contig length. How do you decide which is better? Is the one with the large contig sizes better or the one with more total sequence? Ultimately, the answer to these questions will depend on what the goal of your downstream analysis is. To help with this comparison, we provide some basic guidelines in the sections below.
Evaluating and Refining your Assembly
Assessing Assembly Quality
Contiguity: How many contigs are there?
A high N50 and a low number of contigs, relative to your expected number of chromosomes, is ideal. If you aren't sure what N50 and contig number might be reasonable to expect, you could get an idea by looking at existing assemblies of a similar genome, should these exist. For an even better sense of what would be reasonable for your data, you could make comparisons to an assembly of a similar genome, assembled using a similar amount and type of data.
If your assembly results include a large number of very small contigs, it may be that you set the minimum contig length filter too low. Very small contigs, particularly those of low coverage, can generally be ignored.
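When comparing several assemblies, it can help to compute the same contiguity figures for each of them. The following is a minimal sketch, assuming you have exported the contig lengths as a list of integers. The function name and the `min_contig_length` parameter are hypothetical and unrelated to the minimum contig length filter in the assembler itself. N50 is computed using the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length.

```python
def assembly_stats(contig_lengths, min_contig_length=0):
    """Return contig count, total length and N50 for a list of contig lengths."""
    # Ignore contigs shorter than the chosen cutoff, longest first.
    lengths = sorted(
        (n for n in contig_lengths if n >= min_contig_length), reverse=True
    )
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the running sum reaches half the total.
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(lengths), "total_length": total, "n50": n50}


# Example with made-up contig lengths.
print(assembly_stats([100, 200, 300, 400, 500]))
```

Running this on the contig lengths of each candidate assembly, with the same cutoff, gives figures that can be compared side by side.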
Completeness: How much of the genome is captured in the assembly?
If a total genome length of 5 Mb is expected based on existing literature or similar genomes that have already been assembled, but the sum of all contig lengths is only 3.5 Mb, you may wish to reconsider your assembly parameters.
Two common reasons for an assembly output that is shorter than expected are:
A Word Size that is higher than optimal for your data: A high Word Size will increase the probability of discarding words because they overlap with sequencing errors. If a word is seen only once, the unique word will be discarded, even if there are many other words that are identical except for one base (e.g. a sequencing error). A discarded word will not be considered in constructing the assembly graph and will therefore be excluded from the assembly contig sequences.
A Bubble Size that is higher than optimal for your data: A high Bubble Size will increase the probability that two similar sequences are classified as a repeat and thus collapsed into a single contig. It is sometimes possible to identify collapsed repeats by looking at the mapping of your reads to the assembled contigs. A collapsed repeat will show as a high peak of coverage in one location.
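The effect of Word Size on discarded words can be illustrated with a toy example. The sketch below is not the assembler's implementation: it simply counts words in a small set of made-up reads and reports how many are seen only once. The read sequences and function names are hypothetical.

```python
from collections import Counter


def count_words(reads, word_size):
    """Count every word (substring of length word_size) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - word_size + 1):
            counts[read[i:i + word_size]] += 1
    return counts


def singleton_words(reads, word_size):
    """Number of distinct words that occur exactly once."""
    return sum(1 for c in count_words(reads, word_size).values() if c == 1)


# Five error-free copies of a made-up 20 bp sequence plus one read
# carrying a single substitution at position 10 (0-based).
true_seq = "ACGTTGCAAGGCTTACCGAT"
error_read = true_seq[:10] + "T" + true_seq[11:]
reads = [true_seq] * 5 + [error_read]

print(singleton_words(reads, 4))  # prints 4
print(singleton_words(reads, 8))  # prints 8
```

In this example a single error produces as many unique words as the Word Size, so doubling the Word Size doubles the number of words that would be seen only once.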
Depending on the resources available for the organism you are working on, you might also assess assembly completeness by mapping the assembled contig sequences to a known reference. You can then check for regions of the reference genome that have not been covered by the assembled contigs. Whether this is sensible depends on the sample and reference organisms and what is known about their expected differences.
Correctness: Do the contigs that have been assembled accurately represent the genome?
One key question in assessing correctness is whether the assembly is contaminated with any foreign organism sequence data. To check this, you could run a BLAST search using your assembled contigs as query sequences against a database containing possible contaminant species data.
In addition to BLAST, checking the coverage can help to identify contaminant sequence data. The coverage of a contaminant contig is often different from that of the desired organism, so you can compare the potential contaminant contigs to the rest of the assembled contigs. You can check for these coverage differences between contigs as follows:
- Map the reads used as input for de novo assembly to your contigs (if you do not already have mapping output).
- Create a Detailed Mapping Report.
- In the Result handling step of the wizard, check the option to Create separate table with statistics for each mapping.
- Review the average coverage for each contig in this resulting table.
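If the per-contig table is large, reviewing it by eye can be tedious. The sketch below is one possible way to flag contigs whose average coverage differs markedly from the rest, assuming you have exported the contig names and average coverage values into a Python dictionary. The function name, the example values and the two-fold threshold are all assumptions chosen for illustration; they are not values recommended by the tool.

```python
from statistics import median


def flag_coverage_outliers(coverage_by_contig, fold=2.0):
    """Return contigs whose coverage is more than `fold` times above or
    below the median coverage of all contigs."""
    if not coverage_by_contig:
        return {}
    med = median(coverage_by_contig.values())
    flagged = {}
    for name, cov in coverage_by_contig.items():
        if cov > med * fold or cov < med / fold:
            flagged[name] = cov
    return flagged


# Example with made-up average coverage values per contig.
coverage = {
    "contig_1": 48.0,
    "contig_2": 52.0,
    "contig_3": 50.0,
    "contig_4": 9.5,
    "contig_5": 160.0,
}
print(flag_coverage_outliers(coverage))
```

A flagged contig is only a candidate for closer inspection. Unusually high coverage can also indicate a collapsed repeat rather than contamination, so the BLAST results should be considered alongside the coverage.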
If there are contigs that have good matches to a very different organism and there are discernible coverage differences, you could either remove those contigs from the assembly or run a new assembly after removing the contaminant reads. One way to remove the contaminant reads would be to
- run a read mapping against the foreign organism's genome and
- check the option to Collect un-mapped reads.
The un-mapped reads Sequence List should now be clean of the contamination. You can then use this set of reads in a new de novo assembly.
Assessing the correctness of an assembly also involves checking for misassemblies, that is, making sure the assembler did not join segments of sequence that should not have been joined. This is more difficult. One option for identifying misassemblies is to try running the InDels and Structural Variants tool. If this tool identifies structural variation within the assembly, that could indicate an issue that should be investigated.
Assembly quality assessment guidelines have been provided by: Sarah Young, Assistant Director, Microbial Informatics, The Broad Institute
Post-assembly improvements
If you are working with a smaller genome, the CLC Genome Finishing Module may be of interest to you. It has been developed to help finish small genomes, such as those of microbes, eukaryotic parasites, or fungi, reducing the extensive workload associated with genome finishing and facilitating as many steps in the procedure as possible. The PDF version of the user manual for this module can be downloaded from the following link:
The CLC Genome Finishing Module is a commercial product available as a plugin. If you download and install the plugin into your CLC Workbench through the Plugins manager, you can request a free 2-week trial license. The Plugins manager is launched by clicking on the Plugins button in the top toolbar. Please note that to install plugins, you need to be running your Workbench as an administrative user.
Once installed, you may wish to try the following:
- Improving Contiguity: The Join Contigs tool can help to reduce the number of contigs in your assembly by joining contigs that are likely to belong to one contiguous sequence.
- Improving Completeness: The Align Contigs tool can be used to map your contigs to a reference. This will help to identify areas of the genome that are not covered by your assembled contigs.
- Improving Correctness: The Analyze Contigs tool can be used for detecting misassemblies by analyzing reads mapped to contigs.