How are mappings to circular references handled when exporting to or importing from SAM/BAM files?

The way QIAGEN CLC Genomics Workbench handles the export of mappings to, and the import of mappings from, SAM/BAM format reflects the fact that the SAM/BAM specification has no explicit convention for reads that span the origin of a circular reference (https://samtools.github.io/hts-specs/SAMv1.pdf). With this in mind, our aim is to represent the information as correctly as possible.

 

Things of note when exporting mappings with a circular reference to SAM/BAM format

  • Reads that map across the origin of a circular reference sequence are labeled as unmapped (flag 4) when exported to SAM/BAM format from QIAGEN CLC Genomics Workbench.
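To see which reads in an exported file carry this flag, the FLAG column of a SAM record can be tested for the 0x4 bit. The following is a minimal sketch in plain Python, not part of the Workbench; the function names and the example records are hypothetical and serve only as illustration.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(sam_line: str) -> bool:
    """Return True if a SAM alignment line has the unmapped bit (0x4) set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory SAM column
    return bool(flag & UNMAPPED_FLAG)


def count_unmapped(sam_lines) -> int:
    """Count unmapped records, skipping blank lines and '@' header lines."""
    return sum(
        1
        for line in sam_lines
        if line.strip() and not line.startswith("@") and is_unmapped(line)
    )


# Hypothetical records: one mapped read and one unmapped read.
example = [
    "@SQ\tSN:plasmid\tLN:100",
    "read1\t0\tplasmid\t10\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
print(count_unmapped(example))
```

For BAM files, the records would first need to be converted to SAM text or read with a dedicated library.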

If your primary aim is to export a mapping to SAM/BAM format, you may wish to mark circular references as linear before running the read mapping. Each read that would have mapped across the origin of the circular reference will then map to the end of the linearized reference where it matches best, provided that the minimum length and similarity fraction requirements are met. The part of the read that matches the other end of the linearized reference will extend beyond the end of the reference.
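The geometry of a read that crosses the origin can be sketched as follows. This is an illustration only: the coordinates are assumed to be 0-based and half-open, the function names are hypothetical, and the length of each part is used as a simple stand-in for "where it best matches". The Workbench's actual placement depends on its alignment scoring, which is not modeled here.

```python
def split_at_origin(start, length, ref_len):
    """Split a read placed on a circular reference into linear segments.

    Assumes 0 <= start < ref_len and 0 < length <= ref_len.
    Returns a list of (start, end) tuples, 0-based and half-open.
    """
    end = start + length
    if end <= ref_len:
        return [(start, end)]  # the read does not cross the origin
    # The read crosses the origin: one part at the right-hand end of the
    # reference, the remainder at the left-hand end.
    return [(start, ref_len), (0, end - ref_len)]


def longer_end(start, length, ref_len):
    """Return which end of the linearized reference holds the longer part.

    Returns None if the read does not cross the origin. Part length is an
    illustrative proxy only, not the Workbench's scoring.
    """
    segments = split_at_origin(start, length, ref_len)
    if len(segments) == 1:
        return None
    right_part, left_part = segments
    right_len = right_part[1] - right_part[0]
    left_len = left_part[1] - left_part[0]
    return "right" if right_len >= left_len else "left"


# A 10-base read starting 3 bases before the end of a 100-base reference.
print(split_at_origin(97, 10, 100))
print(longer_end(97, 10, 100))
```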

In general, we recommend marking circular genomes as circular so that reads that span the origin can be mapped and viewed in the Workbench accordingly.

 

Things of note when importing mappings with a circular reference from SAM/BAM format

  • Mappings imported into the Workbench from a SAM/BAM file will be presented as if the references were linear, even when the references selected when importing the SAM/BAM file are marked as circular.
  • Reads flagged as unmapped in the SAM/BAM file can be imported to a sequence list by checking the "Import unmapped reads" option in the SAM/BAM Mapping Files import tool.
  • Reads that originally mapped across the origin of a circular reference will have been marked as unmapped if the mapping was exported from a Workbench. Such reads will therefore not be present in the re-imported mapping, and you would likely notice a drop in the coverage level at either end of the reference sequence in such a case.
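The coverage drop described in the last point can be illustrated with a small sketch. It assumes reads are given as hypothetical (start, length) pairs with 0-based starts on a circular reference, and it mimics export and re-import simply by dropping every read that crosses the origin. The read counts in the example are invented for illustration.

```python
def coverage_at(position, reads, ref_len, keep_origin_spanning=True):
    """Count the reads covering `position` on a circular reference.

    reads: iterable of (start, length) with 0-based start and
    length <= ref_len. With keep_origin_spanning=False, reads that cross
    the origin are skipped, mimicking reads that were exported as
    unmapped and are therefore absent from the re-imported mapping.
    """
    depth = 0
    for start, length in reads:
        spans_origin = start + length > ref_len
        if spans_origin and not keep_origin_spanning:
            continue
        # Distance from the read start to the position, going around the circle.
        offset = (position - start) % ref_len
        if offset < length:
            depth += 1
    return depth


# Hypothetical mapping: 17 reads cross the origin, 3 start at position 0.
ref_len = 100
reads = [(95, 10)] * 17 + [(0, 10)] * 3

before = coverage_at(0, reads, ref_len)
after = coverage_at(0, reads, ref_len, keep_origin_spanning=False)
print(before, after)
```

Comparing the two values at the first and last positions of the reference gives a rough idea of how much coverage would be lost at the ends after a round trip through SAM/BAM.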

When working with SAM/BAM files and where one or more reference sequence is circular, we recommend that you:

The images below demonstrate some differences observed when mapping to a circular reference that has been either marked as circular or linear in the Workbench.

Figure 1: Visualizations of mappings to circular or linear versions of a reference sequence in a CLC Workbench. The top images show the left and right hand ends of a mapping done against a reference marked as linear. The bottom images show the corresponding ends of a mapping done against a reference marked as circular.

Linear reference, top images: A read that would have mapped across the origin here maps to one end of the linearized reference and extends beyond it, as indicated by a < symbol to the left of the read at the 5' end, or a > symbol to the right of the read at the 3' end. In this example, 10 reads mapped to the 5' end of the reference, while 6 reads mapped to the 3' end. One read that mapped to the circular reference could not map to either end of the linearized version.

Circular reference, bottom images: The origin of the circular reference is at position 0 in the view. Reads that mapped across the origin are indicated by a << symbol at the left hand side and a >> symbol at the right hand side. Here, 17 reads map across the origin, and the coverage at each end of the reference is thus 17.

Figure 2: Comparison between mappings done with circular or linear versions of a reference in a CLC Workbench, before and after export to and re-import from a BAM file.
 

Track 1, top track: The left hand end of a mapping against a reference marked as linear. Reads that extended beyond the end of the reference sequence are present, as illustrated by the < symbols at the left hand side.

Track 2: The left hand end of the mapping shown in track 1, after it was exported to BAM format and re-imported into the Workbench. The reads that extended beyond the end of the reference sequence are still present after re-import, as illustrated by the < symbols at the left hand side.

Track 3: The left hand end of a mapping against a reference marked as circular. Reads that map across the origin of the reference sequence are present, as illustrated by the << symbols at the left hand side.

Track 4, bottom track: The left hand end of the mapping shown in track 3, after it was exported to BAM format and re-imported into the Workbench. Reads that spanned the circular reference origin in the original mapping are marked as unmapped in the exported BAM file. They are thus not present in the mapping after re-import into the Workbench.