How can I import a BAM file containing data mapped to the hg19 UCSC genome?

How to import a BAM file mapped to hg19 UCSC genome; differences between UCSC vs. Ensembl or Genbank obtained hg19 genomes

To import a BAM format file for which the UCSC hg19 reference was used for the mapping process, it is necessary to have the UCSC reference sequences selected in the import wizard of QIAGEN CLC Genomics Workbench.

These reference sequences are different from the hg19 reference obtained through the Reference Data Manager tool in QIAGEN CLC Genomics Workbench.

To successfully import your UCSC hg19 based BAM file it is necessary to:

Download and import the UCSC hg19 reference sequences into the Workbench
Import your BAM file with the UCSC hg19 reference sequences selected during the import step

Details on how to download the UCSC hg19 reference sequences, along with background information, are given below.

Obtaining the UCSC hg19 reference sequences

If you do not have the UCSC hg19 reference sequences, you may obtain them using one of the following methods:

Request the original reference used to generate the BAM
Download the reference data from UCSC
Download the UCSC hg19 reference from NCBI

Request the original reference used to generate the BAM

The best way to ensure you are using the proper reference sequences would be to ask the provider of your BAM file. The person or group who ran the mapping job to generate the BAM file will likely have access to the original fasta file that was used as reference for the creation of the BAM file. After you have obtained these sequences, you may then import them into the Workbench.

Download the reference data from UCSC

UCSC provides its hg19 reference sequence data on its website. To obtain this data directly from UCSC:

Go to the UCSC hg19 directory of chromosome data:
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/
Download each chromosome represented in your BAM files by scrolling to the bottom of the page and clicking each link. The standard set of chromosomes most likely to be included is:
- chr1.fa.gz
- chr2.fa.gz
- chr3.fa.gz
- chr4.fa.gz
- chr5.fa.gz
- chr6.fa.gz
- chr7.fa.gz
- chr8.fa.gz
- chr9.fa.gz
- chr10.fa.gz
- chr11.fa.gz
- chr12.fa.gz
- chr13.fa.gz
- chr14.fa.gz
- chr15.fa.gz
- chr16.fa.gz
- chr17.fa.gz
- chr18.fa.gz
- chr19.fa.gz
- chr20.fa.gz
- chr21.fa.gz
- chr22.fa.gz
- chrM.fa.gz
- chrX.fa.gz
- chrY.fa.gz
Import all downloaded files into the Workbench by selecting all the gz fasta files in the Import tracks wizard.
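
If you prefer not to click each link, the download step can be scripted. The following is a minimal sketch in Python, assuming the files are still served under the directory URL given above with the names listed; the function names and the target folder `hg19_ucsc` are illustrative only and are not part of the Workbench.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Directory listed above; assumed to still serve one .fa.gz file per chromosome.
BASE_URL = "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/"


def chromosome_file_names():
    """Return the 25 standard file names: chr1-chr22, chrM, chrX, chrY."""
    names = [f"chr{n}" for n in range(1, 23)] + ["chrM", "chrX", "chrY"]
    return [f"{name}.fa.gz" for name in names]


def download_chromosomes(target_dir="hg19_ucsc"):
    """Download each standard chromosome file into target_dir (hypothetical folder)."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for file_name in chromosome_file_names():
        destination = out / file_name
        if not destination.exists():  # skip files that are already present
            urlretrieve(BASE_URL + file_name, str(destination))
    return out
```

Calling `download_chromosomes()` should leave the gz FASTA files in one folder, ready to be selected together in the import wizard. If your BAM file refers to additional sequences, add their file names to the list.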

More general information about the UCSC provided human data can be found on their webpage:
http://hgdownload.soe.ucsc.edu/downloads.html#human

Download the UCSC hg19 reference from NCBI

Download and import the 22 human autosomes and both sex chromosomes from hg19/GRCh37 and the older mitochondrial sequence (NC_001807), with annotations, from Genbank.
- To do this, use the tool at Download | Search for Sequences at NCBI
- Copy and paste the following text as search term as shown in the image below.
  
  NC_000001.10, NC_000002.11, NC_000003.11, NC_000004.11, NC_000005.9, NC_000006.11, NC_000007.13, NC_000008.10, NC_000009.11, NC_000010.10, NC_000011.9, NC_000012.11, NC_000013.10, NC_000014.8, NC_000015.9, NC_000016.9, NC_000017.10, NC_000018.9, NC_000019.9, NC_000020.10, NC_000021.8, NC_000022.10, NC_000023.10, NC_000024.9, NC_001807.4
- Select all the rows that result from this search and then click on the button labeled Download and Save.
  
  Note that this search will give you the 24 normal chromosomes and the mitochondrial chromosome only. If you wish to get a reference set including the ChrUn clone contigs you will need to look up the NCBI accession numbers for these as well.
Change the name of each reference sequence so that it matches the name used in your SAM/BAM mapping file. If the standard UCSC references were used, chromosome 1 will most likely be named chr1, chromosome 2 will be named chr2, and so on. A table linking the NCBI chromosome names with the corresponding standard UCSC names is provided below. You can rename the sequences manually or use the Rename Sequences in Lists tool within the Workbench.
Once the sequences have the correct names, you can convert these reference sequences to Tracks, as described in our manual here:
https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Convert_Tracks.html
Because the files downloaded from NCBI are Genbank files, they include annotations, so you can choose to create annotation tracks as well. Click on the green + symbol beside the Annotation tracks box in the Wizard to choose the annotation types you wish to create tracks for.

Background information

To import mapping data from SAM/BAM files, the reference sequences in the Workbench must match those recorded in the file in both name and length. If a reference sequence differs in either name or length from what is reported in the BAM file, then the Workbench will not see this as a match. This is because reads need to be placed in the right position against the right reference sequence.
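
To see in advance which sequences would fail to match, you can compare the names and lengths recorded in the BAM header with those of your reference set. The sketch below is an illustration only, not how the Workbench performs the check. It assumes you have the header as text (for example from `samtools view -H`, if samtools is available) and have collected the names and lengths of your reference sequences into a dictionary yourself; both function names are hypothetical.

```python
def parse_sq_lines(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM/BAM header."""
    references = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value, e.g. SN:chr1
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        references[fields["SN"]] = int(fields["LN"])
    return references


def find_mismatches(bam_refs, workbench_refs):
    """List every BAM reference whose name or length has no exact match."""
    problems = []
    for name, length in bam_refs.items():
        if name not in workbench_refs:
            problems.append(f"{name}: no reference with this name")
        elif workbench_refs[name] != length:
            problems.append(
                f"{name}: length {length} in BAM, {workbench_refs[name]} in reference"
            )
    return problems
```

An empty list from `find_mismatches` means every sequence named in the BAM header has a reference with the same name and length.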

For people working with the hg19 human genome reference, different versions of the mitochondrial sequence are in common use. For example, the hg19 reference sequences provided by Ensembl or Genbank use the newer mitochondrial sequence (length 16569 bp, Genbank accession NC_012920). UCSC uses a different, older mitochondrial reference (length 16571 bp, Genbank accession NC_001807.4). For further information about UCSC's decision to use a different mitochondrial sequence, please see the UCSC Note on chrM:

http://genome-euro.ucsc.edu/cgi-bin/hgGateway?hgsid=187301261&clade=mammal&org=Human&db=hg19&redirect=auto&source=genome.ucsc.edu

If your mappings are against the hg19 reference sequence and you are seeing warnings about the mitochondrial sequence when importing a SAM or BAM mapping file into the CLC Genomics Workbench, the most common cause is that your mapping was done using UCSC references and the reference set in the Workbench is from Ensembl or Genbank. These Ensembl and Genbank versions of hg19 include the newer mitochondrial reference, NC_012920, rather than the older one included in the UCSC version, NC_001807.4. If you used the Reference Data Manager tool, the hg19 reference genome is from Ensembl and thus has the newer mitochondrial sequence version (length 16569 bp).
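
Since the two mitochondrial sequences differ in length, the length recorded for the mitochondrial sequence gives a quick hint about which reference set was used. The following sketch assumes you already have the sequence names and lengths as a dictionary. The set of mitochondrial names it checks is an assumption (only chrM is confirmed above as the UCSC name), and the function name is hypothetical.

```python
# Lengths and accessions as stated in this article.
MITO_LENGTHS = {
    16571: "NC_001807.4 (older sequence, used by UCSC hg19)",
    16569: "NC_012920 (newer sequence, used by Ensembl and Genbank hg19)",
}

# Assumed naming variants for the mitochondrial sequence; adjust as needed.
MITO_NAMES = {"chrM", "chrMT", "MT", "M"}


def describe_mitochondrial_reference(references):
    """Report which mitochondrial sequence a {name: length} set appears to use."""
    for name, length in references.items():
        if name in MITO_NAMES:
            return MITO_LENGTHS.get(length, f"unrecognized length {length}")
    return "no mitochondrial sequence found"
```

A result starting with NC_001807.4 suggests the UCSC reference sequences are needed for the import. Treat this as a hint rather than proof, as the other chromosomes still have to match in name and length.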

Linking of Genbank GRCh37 accession numbers, sequence names and UCSC hg19 reference sequences
NCBI Accession	NCBI name	UCSC name
NC_000001.10	NC_000001	chr1
NC_000002.11	NC_000002	chr2
NC_000003.11	NC_000003	chr3
NC_000004.11	NC_000004	chr4
NC_000005.9	NC_000005	chr5
NC_000006.11	NC_000006	chr6
NC_000007.13	NC_000007	chr7
NC_000008.10	NC_000008	chr8
NC_000009.11	NC_000009	chr9
NC_000010.10	NC_000010	chr10
NC_000011.9	NC_000011	chr11
NC_000012.11	NC_000012	chr12
NC_000013.10	NC_000013	chr13
NC_000014.8	NC_000014	chr14
NC_000015.9	NC_000015	chr15
NC_000016.9	NC_000016	chr16
NC_000017.10	NC_000017	chr17
NC_000018.9	NC_000018	chr18
NC_000019.9	NC_000019	chr19
NC_000020.10	NC_000020	chr20
NC_000021.8	NC_000021	chr21
NC_000022.10	NC_000022	chr22
NC_000023.10	NC_000023	chrX
NC_000024.9	NC_000024	chrY
NC_001807.4	NC_001807	chrM
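
If you would rather rename the sequences in a FASTA file before importing it, instead of using the Rename Sequences in Lists tool, the table above can be applied with a short script. This is a sketch under the assumption that each FASTA header starts with the NCBI accession (with or without the version suffix); the function names are hypothetical and the script runs outside the Workbench.

```python
def ncbi_to_ucsc_names():
    """Build the NCBI name -> UCSC name mapping shown in the table above."""
    mapping = {f"NC_{n:06d}": f"chr{n}" for n in range(1, 23)}
    mapping.update({"NC_000023": "chrX", "NC_000024": "chrY", "NC_001807": "chrM"})
    return mapping


def rename_fasta_headers(lines, mapping):
    """Yield FASTA lines with known NCBI headers replaced by UCSC names.

    Lines are expected without trailing newlines. Headers whose accession
    is not in the mapping are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            first_word = line[1:].split()[0]
            accession = first_word.split(".")[0]  # drop the version, e.g. ".10"
            if accession in mapping:
                yield ">" + mapping[accession]
                continue
        yield line
```

For example, a header such as `>NC_000001.10 Homo sapiens chromosome 1` becomes `>chr1`, while sequence lines are left untouched.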
