To import a BAM file whose reads were mapped against the UCSC hg19 reference, the UCSC hg19 reference sequences must be selected in the import wizard of QIAGEN CLC Genomics Workbench.

These reference sequences differ from the hg19 reference obtained through the Reference Data Manager tool in QIAGEN CLC Genomics Workbench.

To successfully import your UCSC hg19 based BAM file it is necessary to:

  1. Download and import the UCSC hg19 reference sequences into the Workbench
     
  2. Import your BAM file with the UCSC hg19 reference sequences selected during the import step 

Detailed information about how to download the UCSC hg19 reference sequences, as well as background information, is provided below.
 

Obtaining the UCSC hg19 reference sequences

If you do not have the UCSC hg19 reference sequences, you can obtain them using one of the methods below:

 

Request the original reference used to generate the BAM

The best way to ensure you are using the proper reference sequences is to ask the provider of your BAM file. The person or group who ran the mapping job will likely have access to the original FASTA file that was used as the reference when the BAM file was created. After you have obtained these sequences, you can import them into the Workbench.

 

Download the reference data from UCSC

UCSC provides its hg19 reference sequence data on its website. To download this data directly from UCSC:

  1. Go to the UCSC hg19 directory of chromosome data: 
         http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/
     
  2. Download each chromosome represented in your BAM files by scrolling to the bottom of the page and clicking each link. The standard set of chromosomes most likely to be included is as follows:
     
    • chr1.fa.gz
    • chr2.fa.gz
    • chr3.fa.gz
    • chr4.fa.gz
    • chr5.fa.gz
    • chr6.fa.gz
    • chr7.fa.gz
    • chr8.fa.gz
    • chr9.fa.gz
    • chr10.fa.gz
    • chr11.fa.gz
    • chr12.fa.gz
    • chr13.fa.gz
    • chr14.fa.gz
    • chr15.fa.gz
    • chr16.fa.gz
    • chr17.fa.gz
    • chr18.fa.gz
    • chr19.fa.gz
    • chr20.fa.gz
    • chr21.fa.gz
    • chr22.fa.gz
    • chrM.fa.gz
    • chrX.fa.gz
    • chrY.fa.gz
       
  3. Import all downloaded files into the Workbench by selecting all the gz fasta files in the Import tracks wizard.
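
As an alternative to clicking each link, the download step above can be scripted. The following is a minimal sketch, assuming Python 3 is available and that the files are still served under the directory URL given in step 1. The function names and the target directory name hg19_ucsc are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Directory URL taken from step 1 above.
BASE_URL = "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/"


def chromosome_files():
    """Return the standard file names: chr1..chr22, then chrM, chrX, chrY."""
    names = [str(number) for number in range(1, 23)] + ["M", "X", "Y"]
    return [f"chr{name}.fa.gz" for name in names]


def download_all(target_dir="hg19_ucsc"):
    """Download each file into target_dir, skipping files already present.

    target_dir is a hypothetical directory name used for this sketch.
    """
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for file_name in chromosome_files():
        destination = out / file_name
        if not destination.exists():
            urlretrieve(BASE_URL + file_name, str(destination))
    return out
```

Calling download_all() would fetch the 25 files listed above into one folder, from which they can all be selected together in the import wizard. If your BAM file refers to additional sequences, their file names would need to be added to the list.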

More general information about the UCSC provided human data can be found on their webpage: 
     http://hgdownload.soe.ucsc.edu/downloads.html#human 

 

Download the UCSC hg19 reference from NCBI 

  1. Download and import the 22 human autosomes and both sex chromosomes from hg19/GRCh37 and the older mitochondrial sequence (NC_001807), with annotations, from Genbank.
     
    • To do this, use the tool at Download | Search for Sequences at NCBI
       
    • Copy and paste the following text as the search term:

      NC_000001.10, NC_000002.11, NC_000003.11, NC_000004.11, NC_000005.9, NC_000006.11, NC_000007.13, NC_000008.10, NC_000009.11, NC_000010.10, NC_000011.9, NC_000012.11, NC_000013.10, NC_000014.8, NC_000015.9, NC_000016.9, NC_000017.10, NC_000018.9, NC_000019.9, NC_000020.10, NC_000021.8, NC_000022.10, NC_000023.10, NC_000024.9, NC_001807.4
       
    • Select all the rows that result from this search and then click on the button labeled Download and Save

      Note that this search will give you only the 24 standard chromosomes (the 22 autosomes, X and Y) and the mitochondrial chromosome. If you wish to get a reference set that includes the ChrUn clone contigs, you will need to look up the NCBI accession numbers for these as well.

  2. Change the name of each reference sequence so that it matches the name used in your SAM/BAM mapping file. If the standard UCSC references were used, it is likely that chromosome 1 will be named chr1, chromosome 2 will be named chr2 and so on. A table linking the NCBI chromosome names with the corresponding standard UCSC names is provided below. You can rename the sequences manually or use the Rename Sequences in Lists tool within the Workbench.
  3. Once the sequences have the correct names, you can convert these reference sequences to Tracks, as described in our manual here: 
    https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Convert_Tracks.html
  4. Because the files downloaded from NCBI are Genbank files, which include annotations, you can choose to create annotation tracks as well. Click on the green + symbol beside the Annotation tracks box in the Wizard to choose the annotation types you wish to create tracks for.

 

Background information

To import mapping data from SAM/BAM files, the reference sequences in the Workbench must match those recorded in the SAM/BAM file in both name and length. If a reference sequence differs in either name or length from what is reported in the BAM file, the Workbench will not treat it as a match. This is because reads need to be placed at the right position on the right reference sequence.
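
The name and length check described above can be illustrated outside the Workbench. The sketch below is not how the Workbench performs the check; it only assumes that the header text of the BAM file is available (for example from samtools view -H) and that the names and lengths of the reference set are known. The function names and the reference dictionary are hypothetical.

```python
def parse_sq_lines(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM/BAM header."""
    references = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type has the form TAG:VALUE.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        references[fields["SN"]] = int(fields["LN"])
    return references


def find_mismatches(bam_refs, reference_set):
    """List BAM references whose name is missing or whose length differs."""
    problems = []
    for name, length in bam_refs.items():
        if name not in reference_set:
            problems.append(f"{name}: name not found in reference set")
        elif reference_set[name] != length:
            problems.append(
                f"{name}: length {length} in BAM, "
                f"{reference_set[name]} in reference set"
            )
    return problems
```

For example, a header that reports chrM with length 16571 compared against a reference set in which chrM has length 16569 would produce one mismatch entry for chrM, which corresponds to the situation described in the rest of this section.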

For people working with the hg19 human genome reference, different versions of the mitochondrial sequence are in common use. For example, the hg19 reference sequences provided by Ensembl or Genbank use the newer mitochondrial sequence (length 16569 bp, Genbank accession NC_012920). UCSC uses a different, older mitochondrial reference (length 16571 bp, Genbank accession NC_001807.4). For further information about UCSC's decision to use a different mitochondrial sequence, please see the UCSC Note on chrM:

http://genome-euro.ucsc.edu/cgi-bin/hgGateway?hgsid=187301261&clade=mammal&org=Human&db=hg19&redirect=auto&source=genome.ucsc.edu

If your mappings are against the hg19 reference sequence and you see warnings about the mitochondrial sequence when importing a SAM or BAM mapping file into the CLC Genomics Workbench, the most common cause is that the mapping was done using UCSC references while the reference set in the Workbench is from Ensembl or Genbank. The Ensembl and Genbank versions of hg19 include the newer mitochondrial reference, NC_012920, rather than the older one included in the UCSC version, NC_001807.4. If you used the Reference Data Manager tool, the hg19 reference genome is from Ensembl and thus has the newer mitochondrial sequence (length 16569 bp).
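
Since the two mitochondrial references differ in length, the length reported for the mitochondrial sequence in a BAM header indicates which one was used. A minimal sketch using only the lengths and accessions given above; the function name is hypothetical.

```python
# Lengths and accessions as stated in the text above.
MITO_BY_LENGTH = {
    16571: "NC_001807.4 (older sequence, used in UCSC hg19)",
    16569: "NC_012920 (newer sequence, used by Ensembl and Genbank)",
}


def describe_mito(length):
    """Return which mitochondrial reference a reported length corresponds to."""
    return MITO_BY_LENGTH.get(length, "unknown mitochondrial sequence")
```

A length that matches neither entry suggests that a different reference was used, in which case it is best to ask the provider of the BAM file.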


Linking of Genbank GRCh37 accession numbers, sequence names and UCSC hg19 reference sequences

NCBI Accession    NCBI name    UCSC name
NC_000001.10      NC_000001    chr1
NC_000002.11      NC_000002    chr2
NC_000003.11      NC_000003    chr3
NC_000004.11      NC_000004    chr4
NC_000005.9       NC_000005    chr5
NC_000006.11      NC_000006    chr6
NC_000007.13      NC_000007    chr7
NC_000008.10      NC_000008    chr8
NC_000009.11      NC_000009    chr9
NC_000010.10      NC_000010    chr10
NC_000011.9       NC_000011    chr11
NC_000012.11      NC_000012    chr12
NC_000013.10      NC_000013    chr13
NC_000014.8       NC_000014    chr14
NC_000015.9       NC_000015    chr15
NC_000016.9       NC_000016    chr16
NC_000017.10      NC_000017    chr17
NC_000018.9       NC_000018    chr18
NC_000019.9       NC_000019    chr19
NC_000020.10      NC_000020    chr20
NC_000021.8       NC_000021    chr21
NC_000022.10      NC_000022    chr22
NC_000023.10      NC_000023    chrX
NC_000024.9       NC_000024    chrY
NC_001807.4       NC_001807    chrM
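
The table can also be expressed as a lookup, which may be useful if the sequences are handled as FASTA files outside the Workbench. This is a sketch only: inside the Workbench, the Rename Sequences in Lists tool mentioned above serves this purpose, and the function names here are hypothetical.

```python
# NCBI name (accession without version) -> UCSC name, following the table.
NCBI_TO_UCSC = {f"NC_{number:06d}": f"chr{number}" for number in range(1, 23)}
NCBI_TO_UCSC.update(
    {"NC_000023": "chrX", "NC_000024": "chrY", "NC_001807": "chrM"}
)


def ucsc_name(ncbi_identifier):
    """Map an NCBI accession, with or without version, to its UCSC name."""
    base = ncbi_identifier.split(".", 1)[0]
    return NCBI_TO_UCSC[base]


def rename_fasta_header(line):
    """Rewrite a FASTA header line to the UCSC name; leave other lines as-is.

    Expects a line without its trailing newline.
    """
    if not line.startswith(">"):
        return line
    identifier = line[1:].split()[0]
    return ">" + ucsc_name(identifier)
```

An accession that is not in the table raises a KeyError, which makes it obvious when a sequence outside the 25 listed here is encountered.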