You may know, or suspect, that your sample includes contaminant reads. In these cases, it is preferable to perform downstream analysis with the contaminant reads removed.

Here we outline the steps you can take to remove known and unknown contaminant reads.

Please note that we are unable to guarantee complete removal of all contaminant sequence data.


Remove known contaminating sequence

To remove your contaminant reads, we recommend that you map your reads to the full reference for the biological sample in one mapping step. This means that you map your reads against the full reference genome of interest, as well as the known contaminant reference genome. In the subsequent step you extract the reads mapping only to the reference genome of interest.

In QIAGEN CLC Genomics Workbench it is possible to do this in two ways, one using Tracks and one using Stand-alone objects. Here we outline both options.
 

Remove reads from a known contaminant using Tracks

  1. Obtain the sequences of the reference genome and the contaminating genome in FASTA format.
  2. Import all the FASTA files (reference genome of interest and contaminant genome) at the same time through Import tracks to obtain a single Genome Track including all chromosomes/contigs. Save the Genome Track to a folder in your Navigation Area. In the rest of this FAQ we will refer to this Genome Track as the Combined Genome Track.
  3. Map the reads to the Combined Genome Track using the Map Reads to Reference tool. Collect un-mapped reads in the Result handling step of the wizard.
  4. Create a BED file with annotations covering the chromosomes of the reference genome of interest. The BED file format is described in the following link: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
  5. Import the BED file using the Import tracks tool, with the Combined Genome Track as the Reference Track.
  6. Extract the reads mapping to the chromosomes of interest to a new Sequence List using the Extract Reads tool, with the imported BED file as the overlap track.
The resulting extracted Sequence List, together with the Sequence List of un-mapped reads, can then be used as input for downstream analysis.
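
The BED file in step 4 can be written by hand, but for genomes with many chromosomes or contigs it may be easier to generate it from the FASTA file of the reference genome of interest. The following is a minimal Python sketch, not part of the Workbench. It assumes that the chromosome names in the Combined Genome Track match the first word of each FASTA header; please check the names in your track and adjust if they differ. The function names and the example sequences are hypothetical.

```python
import io

def fasta_lengths(handle):
    """Yield (name, length) for each record in a FASTA-formatted text stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the sequence name is the first word of the header line.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length

def bed_lines(handle):
    """Return one BED line per sequence, covering the whole sequence."""
    # BED starts are 0-based and ends are exclusive, so 0..length spans a full sequence.
    return [f"{name}\t0\t{length}" for name, length in fasta_lengths(handle)]

# Small in-memory example standing in for the FASTA file of the genome of interest.
example_fasta = io.StringIO(">chr1 example\nACGTACGT\nACGT\n>chr2\nGGCC\n")
example_bed = bed_lines(example_fasta)
print("\n".join(example_bed))
```

For a real file, open it with `open("reference_of_interest.fasta")` (a placeholder file name) in place of the in-memory example, and write the returned lines to a file with a .bed extension.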

 

Remove known contaminating sequence using stand-alone objects

  1. Create a reference Sequence List that includes the reference genome of interest as well as the known contaminant reference genome.
  2. Map all reads to this combined reference, choosing the options to Create stand-alone read mappings and Collect un-mapped reads in the Result handling step of the wizard.
  3. When the mapping completes, open the resulting Mapping Table and select the rows corresponding to your desired reference, then click the button at the bottom of the table to Extract subset.
  4. Use the new subset mapping as input for the Extract Sequences tool and choose the option to create a new Sequence List.
The resulting extracted Sequence List, together with the Sequence List of un-mapped reads, can then be used as input for downstream analysis.
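
If you prefer to prepare the combined reference from step 1 as a single FASTA file before importing it, the sketch below shows one way to do so in Python. It is an illustration only, not a Workbench feature. It refuses duplicate sequence names, on the assumption that each reference sequence should be uniquely identifiable when you later select the rows for your desired reference. The function name and example sequences are hypothetical.

```python
def merge_fasta(texts):
    """Concatenate FASTA texts into one, refusing duplicate sequence names."""
    seen = set()
    merged = []
    for text in texts:
        for line in text.splitlines():
            line = line.rstrip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the sequence name is the first word of the header line.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                if name in seen:
                    raise ValueError(f"duplicate sequence name: {name}")
                seen.add(name)
            merged.append(line)
    return "\n".join(merged) + "\n"

# Small in-memory examples standing in for the two reference files.
genome_of_interest = ">chr1\nACGT\n"
contaminant_genome = ">contaminant_1\nTTGA\n"
combined_fasta = merge_fasta([genome_of_interest, contaminant_genome])
print(combined_fasta, end="")
```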
 

Background

The reason we recommend mapping to both the reference genome of interest and the known contaminant genome at the same time is as follows:

When mapping, the reference sequences used should reflect the source the sample was generated from. A potentially contaminated sample may contain reads that represent the contaminant sequence. If you map such a sample against a reference that includes only the genome of interest, contaminant reads could be mapped to this genome, because the read mapper will still try to map all reads.

Alternatively, if you map only to the contaminant reference in order to remove contaminant reads, desired reads could map to the contaminant sequence and be incorrectly removed. This is especially likely when the contaminant sequence is similar to the desired sequence, for example, when attempting to remove mouse DNA from a human sample. When the true source of a read is not included in the reference you provide, the read may still map to the contaminant reference if it matches well enough according to the mapping parameters you provide. Such reads often do not match the contaminant as well as they would have matched their true source region, so if that region had been available to map to, the reads would likely have mapped preferentially there. With a contaminant-only reference, a read that biologically comes from the desired sequence but maps to the contaminant reference would be discarded from further analysis.
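
The reasoning above can be illustrated with a toy Python example. This is not the algorithm used by the read mapper: it simply compares a read to each reference position by position, with no gaps, and accepts the best match if it reaches a minimum similarity. The sequences, the function names, and the threshold of 0.8 are made up for illustration.

```python
def similarity(read, region):
    """Fraction of positions at which the read and the region carry the same base."""
    matches = sum(a == b for a, b in zip(read, region))
    return matches / len(read)

def best_reference(read, references, min_similarity=0.8):
    """Return the name of the best-matching reference, or None if none is similar enough."""
    scored = {name: similarity(read, seq) for name, seq in references.items()}
    name = max(scored, key=scored.get)
    return name if scored[name] >= min_similarity else None

target = "ACGTACGTAC"       # region in the genome of interest
contaminant = "ACGTACGTTC"  # similar region in the contaminant, one base differs

desired_read = "ACGTACGTAC"      # truly comes from the genome of interest
contaminant_read = "ACGTACGTTC"  # truly comes from the contaminant

# Mapping to a single reference: each read is still placed, on the wrong genome.
target_only = best_reference(contaminant_read, {"target": target})
contaminant_only = best_reference(desired_read, {"contaminant": contaminant})

# Mapping to both references at once: each read goes to its better match.
combined_desired = best_reference(
    desired_read, {"target": target, "contaminant": contaminant})
combined_contaminant = best_reference(
    contaminant_read, {"target": target, "contaminant": contaminant})

print(target_only, contaminant_only, combined_desired, combined_contaminant)
```

In this toy case each read matches the wrong reference at 9 of 10 positions, which is enough to pass the threshold when that reference is the only one offered. When both references are offered, the perfect match wins.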

 

Remove unknown contaminating sequence

If you do not have a reference for the contaminant sequence, you could first try to identify what it is, then follow the instructions in the section above. To do so, please try the following:

  1. Map all reads to the known reference genome using very stringent parameters. To increase parameter stringency, you may increase the length fraction and similarity fraction values.
  2. In the Result handling step of the wizard, choose the option to Collect un-mapped reads.
  3. When the mapping completes, use the un-mapped reads as input for a de novo assembly. In the Select mapping options step of the wizard, choose the Create simple contig sequences (fast) option. If you are uncertain of appropriate parameters for your data, you may use the defaults.
  4. When the assembly completes, BLAST the resulting contig sequences against a database of potential contaminants, or, if you are not sure what the contaminant could be, against all of NR. If you have a very large number of contigs, you may wish to select only the largest ~20 contig sequences for this BLAST job.
  5. When the BLAST results return, select a few of the largest contig results from the Overview BLAST table and click the button to Open BLAST Output.
  6. For each BLAST table view, with the top hit selected, click the Download and Save button.
  7. Use these saved sequences as the contaminant reference sequences and follow the instructions above.
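
If you have exported the assembled contigs in FASTA format and want to pick the largest ones for step 4 outside the Workbench, a short Python sketch such as the following could be used. It is an illustration only. The function names and example contigs are hypothetical, and the default of 20 simply mirrors the approximate number suggested in step 4.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def largest_contigs(records, n=20):
    """Return the n longest records, longest first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]

# Small in-memory example standing in for an exported contig file.
contigs_fasta = ">contig_1\nACGT\n>contig_2\nACGTACGTAC\n>contig_3\nACGTAC\n"
top_two = largest_contigs(parse_fasta(contigs_fasta), n=2)
for header, seq in top_two:
    print(f">{header}\n{seq}")
```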