
2.3. How can I import a BAM file containing data mapped to the hg19 UCSC genome?

To import a BAM file whose reads were mapped against the UCSC hg19 reference, the UCSC reference sequences must be selected in the import wizard of the Workbench. These sequences are not the same as the hg19 reference obtained through the Download Reference Genome tool in Genomics Workbench or through Data Management in Biomedical Genomics Workbench. This FAQ page assumes that the BAM file you are importing was generated using the UCSC hg19 sequences as the reference for the mapping job.

To successfully import your UCSC hg19 based BAM file it is necessary to:

  1. Obtain and import the UCSC hg19 reference sequences into the Workbench

  2. Import your BAM file with the UCSC hg19 reference sequences selected

Details on how to obtain the UCSC hg19 reference sequences, along with background information, are given below.

 


 

Obtaining the UCSC hg19 reference sequences

If you do not have the UCSC hg19 reference sequences, you can obtain them using one of the methods below:

 

Request the original reference used to generate the BAM

The best way to ensure you are using the correct reference sequences is to ask the provider of your BAM file. The person or group who ran the mapping job will likely have access to the original FASTA file that was used as the reference. Once you have obtained these sequences, you can import them into the Workbench.

 

Download the reference data from UCSC

UCSC provides the hg19 reference sequence data on its website, where it can be downloaded directly. To obtain this data:

  1. Go to the UCSC hg19 directory of chromosome data: 
         http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/

  2. Download each chromosome represented in your BAM files by scrolling to the bottom of the page and clicking each link. The standard set of chromosomes most likely to be included is:

    • chr1.fa.gz
    • chr2.fa.gz
    • chr3.fa.gz
    • chr4.fa.gz
    • chr5.fa.gz
    • chr6.fa.gz
    • chr7.fa.gz
    • chr8.fa.gz
    • chr9.fa.gz
    • chr10.fa.gz
    • chr11.fa.gz
    • chr12.fa.gz
    • chr13.fa.gz
    • chr14.fa.gz
    • chr15.fa.gz
    • chr16.fa.gz
    • chr17.fa.gz
    • chr18.fa.gz
    • chr19.fa.gz
    • chr20.fa.gz
    • chr21.fa.gz
    • chr22.fa.gz
    • chrM.fa.gz
    • chrX.fa.gz
    • chrY.fa.gz

  3. Import all downloaded files into the Workbench by selecting all the gz fasta files in the Import tracks wizard.

More general information about the UCSC provided human data can be found on their webpage: 
     http://hgdownload.soe.ucsc.edu/downloads.html#human 
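
If you prefer not to click each link by hand, the download step can be scripted. The following is a minimal sketch in Python, assuming the directory URL and file names listed above; the function names and the destination folder `hg19_ucsc` are hypothetical and chosen only for illustration. The downloaded files still need to be imported into the Workbench as described in step 3.

```python
import urllib.request
from pathlib import Path

# Directory of per-chromosome hg19 files, as given in step 1 above.
BASE_URL = "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/"


def chromosome_files():
    """Return the standard file names: chr1..chr22, then chrM, chrX, chrY."""
    names = [f"chr{i}" for i in range(1, 23)] + ["chrM", "chrX", "chrY"]
    return [f"{name}.fa.gz" for name in names]


def download_all(dest_dir="hg19_ucsc"):
    """Download each file into dest_dir, skipping files that already exist.

    dest_dir is a hypothetical folder name; change it to suit your setup.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for file_name in chromosome_files():
        target = dest / file_name
        if not target.exists():
            urllib.request.urlretrieve(BASE_URL + file_name, str(target))
    return dest
```

Calling `download_all()` fetches the 25 files in the list above. If your BAM file refers to additional sequences, add their file names to the list before downloading.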

 

Download the UCSC hg19 reference from NCBI (Only CLC Genomics Workbench)

  1. Download and import the 22 human autosomes and both sex chromosomes from hg19/GRCh37 and the older mitochondrial sequence (NC_001807), with annotations, from Genbank.

    • To do this, use the tool at  Download | Search for Sequences at NCBI

    • Copy and paste the following text as the search term, as shown in the image below.

      NC_000001.10, NC_000002.11, NC_000003.11, NC_000004.11, NC_000005.9, NC_000006.11, NC_000007.13, NC_000008.10, NC_000009.11, NC_000010.10, NC_000011.9, NC_000012.11, NC_000013.10, NC_000014.8, NC_000015.9, NC_000016.9, NC_000017.10, NC_000018.9, NC_000019.9, NC_000020.10, NC_000021.8, NC_000022.10, NC_000023.10, NC_000024.9, NC_001807.4

    • Select all the rows that result from this search and then click on the button labeled Download and Save.

      Note that this search will give you the 24 normal chromosomes and the mitochondrial chromosome only. If you wish to get a reference set including the ChrUn clone contigs you will need to look up the NCBI accession numbers for these as well.

      NCBI Search

  2. Change the name of each reference sequence so that it matches the name used in your SAM/BAM mapping file. If the standard UCSC references were used, it is likely that chromosome 1 will be named chr1, chromosome 2 will be named chr2, and so on. A table linking the NCBI chromosome names with the corresponding standard UCSC names is provided below.

    You can do this by hand within the Workbench or you can make use of the Batch Rename plugin.

  3. Once the sequences have the correct names, you can convert these reference sequences to Tracks, as described in our manual here:

    http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Convert_tracks.html

    Because the files downloaded from NCBI are Genbank files, which include annotations, you can choose to create annotation tracks as well. Click on the green + symbol beside the Annotation tracks box in the wizard to choose the annotation types you wish to create tracks for.
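
The renaming rule in step 2 is regular enough to express in a few lines, which can help when preparing a rename list or checking names by eye. This is a small sketch in Python, not a Workbench feature: the function name `ucsc_name` is hypothetical, and the mapping covers only the 25 accessions listed in this FAQ.

```python
# Chromosomes 23 and 24 in the NCBI numbering are the sex chromosomes.
SPECIAL = {23: "chrX", 24: "chrY"}


def ucsc_name(accession):
    """Map an NCBI accession such as 'NC_000023.10' to its UCSC hg19 name."""
    base = accession.strip().split(".")[0]  # drop the version suffix
    if base == "NC_001807":
        return "chrM"  # the older mitochondrial sequence used by UCSC hg19
    prefix, _, digits = base.partition("_")
    if prefix != "NC" or not digits.isdigit() or not 1 <= int(digits) <= 24:
        raise ValueError(f"No UCSC hg19 name known for {accession!r}")
    number = int(digits)
    return SPECIAL.get(number, f"chr{number}")
```

For example, `ucsc_name("NC_000005.9")` returns `chr5`. Accessions outside the listed set, such as the ChrUn contigs, raise a `ValueError` rather than producing a guessed name.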

 

Configuring Data Management in Biomedical Genomics Workbench to use UCSC hg19

If you are working with the Biomedical Genomics Workbench, please follow the instructions in the FAQ: How can I use a different reference genome in the Biomedical Genomics Workbench?

 

Background information

To import mapping data from SAM/BAM files, the reference sequences in the Workbench must match those recorded in the file in both name and length. If a reference sequence differs in either name or length from what is reported in the BAM file, the Workbench will not treat it as a match. This is because reads need to be placed at the right position on the right reference sequence.
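
To see which names and lengths a BAM file expects before importing it, you can inspect the @SQ lines of its header (for example, the text printed by `samtools view -H`). The sketch below, in Python, parses such header text and compares it with a set of reference names and lengths. It only illustrates the name and length rule described above and is not the Workbench's own matching logic; the function names and the `workbench_refs` dictionary, which you would fill in by hand, are hypothetical.

```python
def parse_sq_lines(header_text):
    """Return {reference name: length} from the @SQ lines of a SAM header."""
    refs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each remaining field has the form TAG:VALUE, e.g. SN:chr1 or LN:249250621.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        refs[fields["SN"]] = int(fields["LN"])
    return refs


def find_mismatches(bam_refs, workbench_refs):
    """List references in the BAM header with no name-and-length match."""
    problems = []
    for name, length in bam_refs.items():
        if name not in workbench_refs:
            problems.append(f"{name}: no reference with this name")
        elif workbench_refs[name] != length:
            problems.append(
                f"{name}: length {length} in BAM, "
                f"{workbench_refs[name]} in reference"
            )
    return problems
```

An empty list from `find_mismatches` means every reference named in the header has a counterpart with the same name and length in the set you supplied.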

For people working with the hg19 human genome reference, different versions of the mitochondrial sequence are in common use. For example, the hg19 reference sequences provided by Ensembl or Genbank use the newer mitochondrial sequence (length 16569 bp, Genbank accession NC_012920). UCSC uses a different mitochondrial reference (length 16571 bp, Genbank accession NC_001807.4). For further information about the UCSC decision to use a different mitochondrial sequence, please see the UCSC Note on chrM:

http://genome-euro.ucsc.edu/cgi-bin/hgGateway?hgsid=187301261&clade=mammal&org=Human&db=hg19&redirect=auto&source=genome.ucsc.edu

If your mappings are against the hg19 reference sequence and you see warnings about the mitochondrial sequence when importing a SAM or BAM mapping file into the CLC Genomics Workbench, the most common cause is that the mapping was done using UCSC references while the reference set in the Workbench is from Ensembl or Genbank. The Ensembl and Genbank versions of hg19 include the newer mitochondrial reference, NC_012920, rather than the older one included in the UCSC version, NC_001807.4. If you used the Download Reference Genome Data tool or Data Management, the hg19 reference genome is from Ensembl and thus has the newer mitochondrial sequence (length 16569 bp).
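
Because the two mitochondrial sequences differ in length, the length reported for the mitochondrial reference is a quick indicator of which version a file or reference set uses. A minimal sketch in Python, using only the two lengths and accessions given above; the names `MITO_BY_LENGTH` and `describe_mito` are hypothetical.

```python
# Lengths and accessions as stated in this FAQ.
MITO_BY_LENGTH = {
    16571: "NC_001807.4 (older sequence, used in UCSC hg19)",
    16569: "NC_012920 (newer sequence, used in Ensembl and Genbank hg19)",
}


def describe_mito(length):
    """Name the hg19 mitochondrial reference that has the given length."""
    return MITO_BY_LENGTH.get(length, "unrecognized mitochondrial sequence")
```

If the BAM header reports a length of 16571 and the reference in the Workbench has a length of 16569, the two descriptions differ, which points to the UCSC versus Ensembl/Genbank situation described above.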

  

 

Linking of Genbank GRCh37 accession numbers, sequence names and UCSC hg19 reference sequences
NCBI Accession    NCBI name    UCSC name
NC_000001.10      NC_000001    chr1
NC_000002.11      NC_000002    chr2
NC_000003.11      NC_000003    chr3
NC_000004.11      NC_000004    chr4
NC_000005.9       NC_000005    chr5
NC_000006.11      NC_000006    chr6
NC_000007.13      NC_000007    chr7
NC_000008.10      NC_000008    chr8
NC_000009.11      NC_000009    chr9
NC_000010.10      NC_000010    chr10
NC_000011.9       NC_000011    chr11
NC_000012.11      NC_000012    chr12
NC_000013.10      NC_000013    chr13
NC_000014.8       NC_000014    chr14
NC_000015.9       NC_000015    chr15
NC_000016.9       NC_000016    chr16
NC_000017.10      NC_000017    chr17
NC_000018.9       NC_000018    chr18
NC_000019.9       NC_000019    chr19
NC_000020.10      NC_000020    chr20
NC_000021.8       NC_000021    chr21
NC_000022.10      NC_000022    chr22
NC_000023.10      NC_000023    chrX
NC_000024.9       NC_000024    chrY
NC_001807.4       NC_001807    chrM
