5.10. How do I remove contaminant sequences from my set of reads?
There are instances when you know, or suspect, that your sample includes contaminant sequence. In these cases, it is preferable to perform downstream analysis with the contaminant removed. Here we outline the steps you can take to remove contaminant data in the Biomedical Genomics Workbench (BxWB) or CLC Genomics Workbench (GWB) when you know what the contaminating sequence is. The procedure described for BxWB can also be used in GWB when working with Tracks.
Furthermore, we outline how contaminating sequence from an unknown organism can be removed. This section is only relevant for CLC Genomics Workbench, as it involves de novo assembly and BLAST, which are tools unique to that Workbench.
Please note that we are unable to guarantee complete removal of all contaminant sequence data.
- Remove known contaminating sequence
- Remove unknown contaminating sequence in CLC Genomics Workbench
To remove your contaminant sequence, we recommend that you map your reads to the full reference for the biological sample in one mapping step. This means that you map your reads against the full known reference as well as the known contaminant reference sequence. In the subsequent step you extract the reads mapping only to the known reference sequence.
- Obtain the sequences of the reference genome and the contaminant in FASTA format. The human reference genome can be obtained as a FASTA file by downloading it through Data Management and then exporting it from the Workbench in FASTA format, or by downloading it from the NCBI or Ensembl FTP site outside the Workbench.
- Import all the FASTA files (the reference for the biological sample and the contaminant) at the same time through Import tracks to obtain a single Genome Track including all chromosomes/contigs. Save the Genome Track to a folder in your Navigation Area. In the rest of this FAQ we will refer to this Genome Track as the Combined Genome Track.
- Map the reads to the Combined Genome Track using the Map Reads to Reference tool. You may wish to collect the un-mapped reads for further analysis.
- Create a BED file with annotations covering the chromosomes of the standard reference genome, e.g. hg19 or hg38. The BED file format is described in the following link: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
- Import the BED file using the Import tracks tool, with the Combined Genome Track as the Reference Track.
- Extract the reads mapping to the chromosomes of the standard reference genome using the Extract reads based on overlap tool with the BED track as overlap track. This will create a new Reads Track, which has the genomic coordinates of the Combined Genome Track but includes only the reads mapping to the standard chromosomes.
- Extract the reads mapping to the standard chromosomes to a new sequence list using Extract sequences tool.
You have now obtained a sequence list including only the reads mapping to the reference of interest, which can then be used as input for a new mapping or workflow using the standard version of the reference genome of your interest.
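The BED file described in the steps above can be written by hand, but for a genome with many chromosomes a small script may be easier. The following is a minimal sketch, not part of the Workbench: the function name `bed_lines` and the chromosome names and lengths are hypothetical placeholders, and the names and lengths you use must match those in your Combined Genome Track. It writes one three-column BED line per chromosome, using the 0-based, half-open coordinates of the BED format.

```python
def bed_lines(chrom_lengths):
    """Return one BED line (chrom, start, end) per chromosome.

    BED coordinates are 0-based and half-open, so a chromosome of
    length N is fully covered by start=0 and end=N.
    """
    return [f"{name}\t0\t{length}" for name, length in chrom_lengths.items()]


# Hypothetical names and lengths for illustration only; replace them with
# the chromosomes of your standard reference (do not list the contaminant).
example_lengths = {"chr1": 1000, "chr2": 800}

bed_text = "\n".join(bed_lines(example_lengths)) + "\n"
print(bed_text, end="")
```

Only the chromosomes of the standard reference should be listed, so that reads mapping to the contaminant fall outside every BED interval.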
- Create a reference Sequence List that includes the known contaminant sequence(s) as well as the desired reference.
- Map all reads to this combined reference, choosing the options to Create stand-alone read mappings and Collect un-mapped reads in the Result handling step of the wizard.
- When the mapping completes, open the resulting Mapping Table and select the rows corresponding to your desired reference, then click the button at the bottom of the table to Extract subset.
- Use the new subset mapping as input for the Extract Sequences tool and choose the option to create a new Sequence List rather than individual sequences.
- The resulting extracted Sequence List, together with the Sequence List of un-mapped reads, can then be used as input for downstream analysis.
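If the desired reference and the contaminant sequences are in separate FASTA files, one way to prepare a combined reference before import is to concatenate them outside the Workbench. The sketch below is only an illustration under that assumption: `combine_fasta` is a hypothetical helper, it works on FASTA text already read into memory, and it rejects duplicate sequence names so that the rows for the desired reference can be told apart from the contaminant rows after mapping.

```python
def combine_fasta(fasta_texts):
    """Concatenate several FASTA texts into one, checking name uniqueness."""
    seen = set()
    out = []
    for text in fasta_texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines between records
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                if name in seen:
                    raise ValueError(f"duplicate sequence name: {name!r}")
                seen.add(name)
            out.append(line)
    return "\n".join(out) + "\n"


# Toy sequences for illustration only.
desired = ">desired_ref\nACGTACGT\n"
contaminant = ">contaminant_ref\nTTGGCCAA"  # note: no trailing newline
print(combine_fasta([desired, contaminant]), end="")
```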
The reason we recommend mapping to both the contaminant and desired sequence data at the same time is as follows:
When mapping, the reference sequences used should reflect the source the sample was generated from. When mapping reads generated from potentially contaminated data, it is possible that there are reads in the sample set representing this contaminant sequence. If this occurs, and you map your reads against a reference that only includes the desired reference, it is possible contaminant reads could be mapped to your desired reference sequence because the read mapper will still try to map all reads.
Alternatively, if you map only to the contaminant reference in order to remove contaminant reads, desired reads may map to the contaminant sequence and be incorrectly removed. This is especially likely when the contaminant sequence is similar to the desired sequence, for example, when attempting to remove mouse DNA from a human sample. Specifically, when the true source region of a desired read is not included in the reference you provide, the read may still map to the contaminant reference if it matches well enough according to the mapping parameters you provide. Such reads often do not match the contaminant as well as they would have matched their true source region, so had that region been available, they would likely have mapped preferentially there. A read that is biologically from the desired sequence but maps to the known contaminant reference would then be discarded from future analysis.
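The reasoning above can be illustrated with a toy model. This is not the Workbench read mapper: the functions `best_identity` and `assign`, the sequences, and the 0.8 identity threshold are all assumptions made for illustration. A read is scored against each reference by its best ungapped placement and assigned to the highest-scoring reference, provided the score passes the threshold.

```python
def best_identity(read, ref):
    """Best fraction of matching bases over all ungapped placements."""
    n = len(read)
    best = 0.0
    for start in range(len(ref) - n + 1):
        window = ref[start:start + n]
        matches = sum(a == b for a, b in zip(read, window))
        best = max(best, matches / n)
    return best


def assign(read, references, min_identity=0.8):
    """Name of the best-matching reference, or None if below threshold."""
    scores = {name: best_identity(read, seq) for name, seq in references.items()}
    name = max(scores, key=scores.get)
    return name if scores[name] >= min_identity else None


# Toy sequences: the contaminant differs from the desired reference at two
# positions, mimicking two closely related genomes.
desired = "ACGTACGTTAGCCGATAGCT"
contaminant = "ACGTACGATAGCCGTTAGCT"
read = desired[4:18]  # a read that truly comes from the desired sequence

# Mapping only to the contaminant: the desired read still maps and would be
# removed along with the true contaminant reads.
print(assign(read, {"contaminant": contaminant}))

# Mapping to both at once: the read is placed on its true source.
print(assign(read, {"desired": desired, "contaminant": contaminant}))
```

In this toy case the read matches the contaminant at 12 of 14 positions, which passes the threshold when no better target is offered, but it matches the desired reference perfectly when both are available.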
If you do not have a reference for the contaminant sequence, you could first try to identify what it is and then follow the instructions in the section above. To do so, please try the following:
- Using very stringent parameters, map all reads to the known reference. To increase parameter stringency, you may increase the length fraction and similarity fraction values.
- In the Result handling step of the wizard choose the options to Collect un-mapped reads.
- When the mapping completes, use the un-mapped reads Sequence List as input for de novo assembly. In the Select mapping options step of the wizard, choose the Create simple contig sequences (fast) option. If you are uncertain of appropriate parameters for your data, you may use the defaults.
- When the assembly completes, BLAST the resulting contig sequences to a database of potential contaminants, or if you are not sure of what it could be, then you may wish to BLAST to all of NR. If you have a very large number of contigs, then you may wish to select the largest ~20 contig sequences for this BLAST job.
- When the BLAST results return, select a few of the largest contig results from the Overview BLAST table and click the button to Open BLAST Output.
- For each BLAST table view, with the top hit selected, click the Download and Save button.
- Use these saved sequences as the contaminant reference sequences and follow the instructions above.
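If you export the assembled contigs in FASTA format, picking the largest contigs for the BLAST job can also be done outside the Workbench. The following is a minimal sketch under that assumption; `parse_fasta` and `largest_contigs` are hypothetical helper names, and the default of 20 simply mirrors the approximate number suggested in the steps above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def largest_contigs(text, n=20):
    """Return the n longest records, longest first."""
    records = parse_fasta(text)
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]


# Toy contigs for illustration only.
contigs = ">contig_1\nAC\n>contig_2\nACGT\nAC\n>contig_3\nA\n"
for name, seq in largest_contigs(contigs, n=2):
    print(name, len(seq))
```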