2.5. How are mappings to circular references handled when exporting to or importing from SAM/BAM files?
The way the CLC Workbenches handle the export of mappings to, and the import of mappings from, SAM/BAM format reflects the fact that the SAM/BAM format has no explicit convention for handling reads that span the origin of a circular reference (https://samtools.github.io/hts-specs/SAMv1.pdf). With this in mind, our aim is to represent the information as correctly as possible.
Things of note when exporting mappings with a circular reference to SAM/BAM format
- Reads that map across the origin of a circular reference sequence are labeled as unmapped (flag 4) when exported to SAM/BAM format from the CLC Genomics Workbench and Biomedical Genomics Workbench.
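As a rough illustration of what "flag 4" means in an exported file, the sketch below counts the records in SAM text whose FLAG field has the 0x4 (segment unmapped) bit set, as defined in the SAM specification. The function name `count_unmapped` and the example records are hypothetical and made up for illustration; this is not part of any CLC Workbench tool.

```python
import io

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def count_unmapped(sam_lines):
    """Return (unmapped, total) record counts for an iterable of SAM text lines."""
    unmapped = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second mandatory column
        total += 1
        if flag & UNMAPPED:
            unmapped += 1
    return unmapped, total


# Hypothetical records, invented for this example only
example = io.StringIO(
    "@SQ\tSN:plasmid\tLN:5000\n"
    "read1\t0\tplasmid\t100\t60\t50M\t*\t0\t0\t*\t*\n"
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n"
    "read3\t16\tplasmid\t4900\t60\t50M\t*\t0\t0\t*\t*\n"
)
print(count_unmapped(example))
```

Comparing this count before and after switching a reference from circular to linear may give a quick indication of how many reads were affected, although reads can also be unmapped for unrelated reasons.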
If your primary aim is to export a mapping to SAM/BAM format, then you may wish to mark circular references as linear before running the read mapping. If this is done, then reads that would have mapped across the origin of the circular reference will map to the end of the linearized reference where they best match, assuming that minimum length and similarity fraction requirements are met. The part of the read that matches the other end of the linearized reference will extend beyond the end of the reference.
Otherwise, we would generally recommend that circular genomes are marked as circular so that reads that span the origin can be mapped and viewed accordingly.
Things of note when importing mappings with a circular reference from SAM/BAM format
- Mappings imported into the CLC Workbench from a SAM/BAM file will be presented as if the references were linear, even when the references selected when importing the SAM/BAM file are marked as circular.
- Reads flagged as unmapped in the SAM/BAM file can be imported to a sequence list by checking the "Import unmapped reads" option in the SAM/BAM Mapping Files import tool.
- Reads that originally mapped across the origin of a circular reference will be marked as unmapped if you exported the mapping from a CLC Workbench. Such reads would therefore not be present in the re-imported mapping. You would likely notice a drop in the coverage level at either end of the reference sequence in such a case.
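The coverage drop described above can be sketched with a toy model. The code below is only an illustration under simplifying assumptions: reads are given as hypothetical `(start, length)` pairs with 0-based starts, every base is treated as aligned, and the names `coverage` and `spans_origin` are invented here. It does not reproduce how a CLC Workbench computes coverage.

```python
def spans_origin(ref_len, start, length):
    """True if a read starting at `start` runs past the last reference position."""
    return start + length > ref_len


def coverage(ref_len, reads):
    """Per-position coverage on a circular reference; overhangs wrap to position 0."""
    cov = [0] * ref_len
    for start, length in reads:
        for offset in range(length):
            cov[(start + offset) % ref_len] += 1
    return cov


REF_LEN = 20  # hypothetical tiny reference
reads = [(0, 5), (8, 5), (15, 5), (18, 5)]  # the last read spans the origin

before = coverage(REF_LEN, reads)

# Model the export and re-import: origin-spanning reads become unmapped
# and are therefore missing from the re-imported mapping.
kept = [r for r in reads if not spans_origin(REF_LEN, *r)]
after = coverage(REF_LEN, kept)

print("first/last position before:", before[0], before[-1])
print("first/last position after: ", after[0], after[-1])
```

In this toy example the coverage at the first and last positions is lower after the origin-spanning read is removed, while positions away from the ends are unchanged.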
When working with SAM/BAM files where one or more reference sequences are circular, we recommend that you:
- import the SAM/BAM file into the CLC Workbench as a sequence list. Here you only
- ensure any circular references in the CLC Workbench are marked as circular, and
- run a new read mapping job, to map the reads against these references.
The images below demonstrate some differences that can be observed when mapping to a circular reference that has been marked in the CLC Workbench as circular, compared with one that has been marked as linear.
Figure 1: Visualizations of mappings to circular or linear versions of a reference sequence in a CLC Workbench. The top images show the left and right hand ends of a mapping done against a reference marked as linear. The corresponding images for a mapping done against a reference marked as circular are shown in the bottom images.
Linear reference, top images: A read that would have mapped across the origin here maps to one end of the linearized reference, and extends beyond it, as indicated by a < to the left of the read at the 5' end, or a > symbol to the right of the read at the 3' end. In this example 10 reads mapped to the 5' end of the reference, while 6 reads mapped to the 3' end. One read that mapped to the circular reference could not map to either end of the linearized version.
Circular reference, bottom images: The origin of the circular reference is at position 0 in the view. Reads that mapped across the origin are indicated by a << symbol at the left hand side and a >> symbol at the right hand side. Here, 17 reads map across the origin, and the coverage at each end of the reference is thus 17.
Figure 2: Comparison between mappings done with circular or linear versions of a reference in a CLC Workbench before and after export to and re-import from a BAM file.
Track 1, top track: The left hand end of a mapping against a reference marked as linear. Reads that extended beyond the end of the reference sequence are present, as illustrated by the < symbols at the left hand side.
Track 2: The left hand end of the mapping shown in track 1, after it was exported to BAM format and re-imported into the CLC Workbench. The reads that extended beyond the end of the reference sequence are still present after re-import, as illustrated by the < symbols at the left hand side.
Track 3: The left hand end of a mapping against a reference marked as circular. Reads that map across the origin of the reference sequence are present, as illustrated by the << symbols at the left hand side.
Track 4, bottom track: The left hand end of the mapping shown in track 3, after it was exported to BAM format and re-imported into the CLC Workbench. Reads that spanned the circular reference origin in the original mapping are marked as unmapped in the exported BAM file. They are thus not present in the mapping after re-import into the CLC Workbench.