
2.2. Why do I run out of memory when importing a BAM file containing paired data?

Memory consumption for BAM import depends mainly on the number of broken pairs. A member of a broken pair is kept in memory until its mate is found. If your BAM file contains many broken pairs, then this could account for the high amount of memory being used.

If you have a SAM/BAM file containing many broken pairs and are running out of memory when trying to import it, please try sorting the file by read name and then importing it. You can sort by read name using the samtools command samtools sort -n.
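As an illustration, the read name sort could be run from a small Python script along these lines. This is only a sketch: it assumes a samtools 1.x installation is available on your PATH, and the helper function names and file names are placeholders rather than anything provided by the Workbench.

```python
import shutil
import subprocess


def name_sort_command(input_bam, output_bam):
    """Build the samtools command that sorts a BAM file by read name."""
    # -n sorts by read name (QNAME) instead of by coordinate.
    # -o names the output file (samtools 1.x syntax).
    return ["samtools", "sort", "-n", "-o", output_bam, input_bam]


def name_sort(input_bam, output_bam):
    """Run the sort. Raises if samtools is missing or exits with an error."""
    if shutil.which("samtools") is None:
        raise FileNotFoundError("samtools was not found on PATH")
    subprocess.run(name_sort_command(input_bam, output_bam), check=True)


# Placeholder file names, shown only to illustrate the resulting command.
print(" ".join(name_sort_command("mapped.bam", "mapped.namesorted.bam")))
```

The resulting name-sorted file is the one you would then try importing.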

Importing the sorted file should mean that broken pair mates are found earlier, thus releasing the broken pair member from memory earlier.
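The effect can be illustrated with a toy model. The sketch below is not the Workbench's import code. It only assumes, as described above, that a read is held until a second read with the same name arrives, and it reports the largest number of reads held at any one time.

```python
def peak_pending_reads(read_names):
    """Return the largest number of reads waiting for a mate at any one time.

    Toy model only: a read is held until a second read with the same
    name arrives, at which point the held read is released.
    """
    pending = set()
    peak = 0
    for name in read_names:
        if name in pending:
            pending.remove(name)  # mate found, release the held read
        else:
            pending.add(name)     # hold this read until its mate shows up
        peak = max(peak, len(pending))
    return peak


# Mates far apart in the file (made-up read names).
scattered = ["r1", "r2", "r3", "r4", "r1", "r2", "r3", "r4"]
# The same reads sorted on read name, so mates are adjacent.
name_sorted = sorted(scattered)

print(peak_pending_reads(scattered))    # 4
print(peak_pending_reads(name_sorted))  # 1
```

In this model the unsorted order holds all four reads at once, while the name-sorted order never holds more than one.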

The above assumes that the paired data in your SAM/BAM file meets the SAM specification in terms of the naming of the mates of the pairs. Section 1.4 of the SAM specification has the following information:

"QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template"

What this means is that the expectation in a SAM/BAM file is that members of a pair will have the same name.

We have seen a couple of instances where importing a BAM file containing mapped, paired data failed with an out of memory error because each member of a pair in the BAM file had a different name. Due to the different names, each such read is seen as part of a pair, but a pair for which the mate can never be found, as it has a different name. In that case, many sequences would be held in memory, which can eventually lead to an out of memory error.

For example, if you had a read pair with names like this:

CAAAA_7_0013_5111_1
CAAAA_7_0013_5111_2

instead of both members of that pair having the name CAAAA_7_0013_5111, then you will likely run into an out of memory error.
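A quick way to spot this naming problem is to look for read names that occur only once. The sketch below is a heuristic illustration, not a CLC tool. It assumes the mate suffixes look like "_1"/"_2" or "/1"/"/2" at the end of the name, which may not hold for your data, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: mate suffixes are "_1"/"_2" or "/1"/"/2" at the end of the name.
MATE_SUFFIX = re.compile(r"[_/][12]$")


def strip_mate_suffix(name):
    """Remove a trailing mate suffix, if present, from a read name."""
    return MATE_SUFFIX.sub("", name)


def unmatched_names(read_names):
    """Return names that occur only once, so no mate can be found by name."""
    counts = Counter(read_names)
    return [name for name, n in counts.items() if n == 1]


names = ["CAAAA_7_0013_5111_1", "CAAAA_7_0013_5111_2"]
print(unmatched_names(names))  # both names are reported as unmatched

shared = [strip_mate_suffix(n) for n in names]
print(shared)                   # ['CAAAA_7_0013_5111', 'CAAAA_7_0013_5111']
print(unmatched_names(shared))  # []
```

If most names in a paired data set are reported as unmatched, the file probably does not follow the QNAME convention quoted above.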

If you had enough memory to import a file with the sort of naming shown above without encountering an out of memory error, then the mapping that results from that import will likely not be what you want. That is, all such reads would have been recorded as single reads instead of members of a pair, because when a mate was never found, the sequence would have been marked as single by the CLC Genomics Workbench.

If you are able to generate a BAM format file that meets the SAM specifications, then the CLC Workbench should be able to import the resulting BAM file.

 

Further details about SAM/BAM formats and the CLC Genomics Workbench

Information on the flags that the CLC bio Workbench uses for SAM/BAM format files is outlined in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Flags.html

and the SAM specification is at:

http://samtools.sourceforge.net/SAM1.pdf

One open source tool that can be useful in checking a SAM or BAM format file is "samtools". We cannot provide support for such third-party tools, but if you are interested in it, the samtools package is available from:

http://sourceforge.net/projects/samtools/files/samtools
