To import a BAM file whose reads were mapped against the UCSC hg19 reference, the UCSC hg19 reference sequences must be selected in the import wizard of QIAGEN CLC Genomics Workbench.
These reference sequences are different from the hg19 reference obtained through the Reference Data Manager tool in QIAGEN CLC Genomics Workbench.
To successfully import your UCSC hg19-based BAM file, you need to:
- Download and import the UCSC hg19 reference sequences into the Workbench
- Import your BAM file with the UCSC hg19 reference sequences selected during the import step
Detailed instructions for obtaining the UCSC hg19 reference sequences, along with background information, are given below.
Obtaining the UCSC hg19 reference sequences
If you do not have the UCSC hg19 reference sequences you may obtain them through one of the proposed methods:
The best way to ensure you are using the proper reference sequences would be to ask the provider of your BAM file. The person or group who ran the mapping job to generate the BAM file will likely have access to the original fasta file that was used as reference for the creation of the BAM file. After you have obtained these sequences, you may then import them into the Workbench.
UCSC provides its hg19 reference sequence data on its website. To download this data directly from UCSC:
- Go to the UCSC hg19 directory of chromosome data:
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/
- Download each chromosome represented in your BAM files by scrolling to the bottom of the page and clicking each link. The standard chromosomes most likely to be included are:
- chr1.fa.gz
- chr2.fa.gz
- chr3.fa.gz
- chr4.fa.gz
- chr5.fa.gz
- chr6.fa.gz
- chr7.fa.gz
- chr8.fa.gz
- chr9.fa.gz
- chr10.fa.gz
- chr11.fa.gz
- chr12.fa.gz
- chr13.fa.gz
- chr14.fa.gz
- chr15.fa.gz
- chr16.fa.gz
- chr17.fa.gz
- chr18.fa.gz
- chr19.fa.gz
- chr20.fa.gz
- chr21.fa.gz
- chr22.fa.gz
- chrM.fa.gz
- chrX.fa.gz
- chrY.fa.gz
- Import all downloaded files into the Workbench by selecting all the gz fasta files in the Import tracks wizard.
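If clicking 25 links by hand is inconvenient, the download step above can be scripted. The following is a minimal sketch, assuming the file names listed above and the directory URL given earlier; the function names and the `dest_dir` parameter are illustrative only and are not part of the Workbench or of any UCSC tool.

```python
import os
import urllib.request

# Directory URL taken from this article.
BASE_URL = "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/"


def chromosome_files():
    """Return the standard file names: chr1..chr22, chrM, chrX, chrY."""
    names = [str(n) for n in range(1, 23)] + ["M", "X", "Y"]
    return [f"chr{name}.fa.gz" for name in names]


def chromosome_urls(base_url=BASE_URL):
    """Return the full download URL for each standard chromosome file."""
    return [base_url + filename for filename in chromosome_files()]


def download_all(dest_dir="."):
    """Download every standard chromosome file into dest_dir (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in chromosome_urls():
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)
```

Calling `download_all("hg19_ucsc")` would fetch the files into a folder of that name, after which they can be imported into the Workbench as described in the last step above. If your BAM file uses only a subset of the chromosomes, trim the list accordingly.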
More general information about the UCSC provided human data can be found on their webpage:
http://hgdownload.soe.ucsc.edu/downloads.html#human
Alternatively, download and import the 22 human autosomes and both sex chromosomes from hg19/GRCh37 and the older mitochondrial sequence (NC_001807), with annotations, from Genbank:
- Open the tool at Download | Search for Sequences at NCBI.
- Copy and paste the following text as the search term:
NC_000001.10, NC_000002.11, NC_000003.11, NC_000004.11, NC_000005.9, NC_000006.11, NC_000007.13, NC_000008.10, NC_000009.11, NC_000010.10, NC_000011.9, NC_000012.11, NC_000013.10, NC_000014.8, NC_000015.9, NC_000016.9, NC_000017.10, NC_000018.9, NC_000019.9, NC_000020.10, NC_000021.8, NC_000022.10, NC_000023.10, NC_000024.9, NC_001807.4
- Select all the rows that result from this search and then click on the button labeled Download and Save.
Note that this search will give you the 24 normal chromosomes and the mitochondrial chromosome only. If you wish to get a reference set including the ChrUn clone contigs you will need to look up the NCBI accession numbers for these as well.
- Change the name of each reference sequence so that it matches the name used in your SAM/BAM mapping file. If the standard UCSC references have been used, then it is likely that chromosome 1 will be named chr1, chromosome 2 will be named chr2 and so on. A table linking the NCBI chromosome names with the corresponding standard UCSC names is provided below. You can rename the sequences manually or you can make use of the Rename Sequences in Lists tool within the Workbench.
- Once the sequences have the correct names, you can convert these reference sequences to Tracks, as described in our manual here:
https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Convert_Tracks.html
- Because the files downloaded from NCBI are Genbank files, you can choose to create annotation tracks as well. Click on the green + symbol beside the Annotation tracks box in the wizard to choose the annotation types you wish to create tracks for.
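The renaming rule used in the steps above is regular enough to express in a few lines, which can be handy for preparing a rename list or checking names outside the Workbench. This is a sketch only: the function name `ucsc_name` is hypothetical, and the mapping simply follows the table at the end of this article (NC_000001 to NC_000022 become chr1 to chr22, NC_000023 becomes chrX, NC_000024 becomes chrY, and NC_001807 becomes chrM).

```python
def ucsc_name(accession):
    """Map an NCBI accession such as 'NC_000005.9' to its UCSC hg19 name.

    The version suffix ('.9') is optional and is ignored.
    Raises ValueError for accessions outside the 25 covered by this article.
    """
    base = accession.strip().split(".", 1)[0]
    if base == "NC_001807":
        return "chrM"
    if base.startswith("NC_0000") and base[3:].isdigit():
        number = int(base[3:])
        if 1 <= number <= 22:
            return f"chr{number}"
        if number == 23:
            return "chrX"
        if number == 24:
            return "chrY"
    raise ValueError(f"No UCSC hg19 name known for {accession!r}")
```

For example, `ucsc_name("NC_000023.10")` gives `chrX`. Accessions for chrUn contigs are not covered and would need to be added by hand.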
Background information
To import mapping data from SAM/BAM files into the Workbench, the reference sequences in the Workbench must match those recorded in the file in both name and length. If a reference sequence differs in either name or length from what is reported in the BAM file, then the Workbench will not see this as a match. This is because reads need to be placed in the right position against the right reference sequence.
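To see in advance which references would fail this name-and-length check, you can compare the `@SQ` lines of the SAM/BAM header (printed, for example, by `samtools view -H`) against the names and lengths of your reference sequences. The sketch below assumes you have the header as text and have entered the Workbench reference names and lengths yourself in a dictionary; `workbench_refs` and both function names are hypothetical, and the Workbench performs its own check on import.

```python
def parse_sq_lines(header_text):
    """Return {reference name: length} from the @SQ lines of a SAM header."""
    refs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type has the form TAG:value.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        refs[fields["SN"]] = int(fields["LN"])
    return refs


def find_mismatches(bam_refs, workbench_refs):
    """List references in the BAM header with no exact name-and-length match."""
    problems = []
    for name, length in bam_refs.items():
        if name not in workbench_refs:
            problems.append(f"{name}: no reference sequence with this name")
        elif workbench_refs[name] != length:
            problems.append(
                f"{name}: length {workbench_refs[name]} differs from "
                f"{length} in the BAM header"
            )
    return problems
```

A header containing `SN:chrM` with `LN:16571`, compared against a reference set whose chrM is 16569 bp long, would be reported as a length mismatch.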
For people working with the hg19 human genome reference, different versions of the mitochondrial sequence are in common use. For example, the hg19 reference sequences provided by Ensembl or Genbank use the newer mitochondrial sequence (length 16569 bp, Genbank accession NC_012920). UCSC uses a different, older mitochondrial reference (length 16571 bp, Genbank accession NC_001807.4). For further information about UCSC's decision to use a different mitochondrial sequence, please see the UCSC Note on chrM:
http://genome-euro.ucsc.edu/cgi-bin/hgGateway?hgsid=187301261&clade=mammal&org=Human&db=hg19&redirect=auto&source=genome.ucsc.edu
If your mappings are against the hg19 reference sequence and you are seeing warnings about the mitochondrial sequence when importing a SAM or BAM mapping file into the CLC Genomics Workbench, the most common cause is that your mapping was done using UCSC references while the reference set in the Workbench is from Ensembl or Genbank. The Ensembl and Genbank versions of hg19 include the newer mitochondrial reference, NC_012920, rather than the older one included in the UCSC version, NC_001807.4. If you used the Reference Data Manager tool, the hg19 reference genome is from Ensembl and thus has the newer mitochondrial sequence (length 16569 bp).
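Since the two mitochondrial sequences differ in length, the length reported for the mitochondrial reference in a BAM header is a quick indicator of which reference set was used for mapping. The lookup below is a small sketch built only from the two lengths given in this article; the function name is hypothetical, and any other length is reported as unrecognised.

```python
# Lengths and accessions as stated in this article.
MITO_BY_LENGTH = {
    16571: "NC_001807 (older sequence, used in the UCSC hg19 set)",
    16569: "NC_012920 (newer sequence, used in the Ensembl and Genbank sets)",
}


def describe_mito(length):
    """Return which mitochondrial reference a given sequence length points to."""
    return MITO_BY_LENGTH.get(length, "unrecognised mitochondrial sequence length")
```

For instance, a BAM header reporting a mitochondrial length of 16571 points to the UCSC reference set, so the UCSC sequences should be selected during import.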
Linking of Genbank GRCh37 accession numbers, sequence names and UCSC hg19 reference sequences

NCBI Accession | NCBI name | UCSC name |
---|---|---|
NC_000001.10 | NC_000001 | chr1 |
NC_000002.11 | NC_000002 | chr2 |
NC_000003.11 | NC_000003 | chr3 |
NC_000004.11 | NC_000004 | chr4 |
NC_000005.9 | NC_000005 | chr5 |
NC_000006.11 | NC_000006 | chr6 |
NC_000007.13 | NC_000007 | chr7 |
NC_000008.10 | NC_000008 | chr8 |
NC_000009.11 | NC_000009 | chr9 |
NC_000010.10 | NC_000010 | chr10 |
NC_000011.9 | NC_000011 | chr11 |
NC_000012.11 | NC_000012 | chr12 |
NC_000013.10 | NC_000013 | chr13 |
NC_000014.8 | NC_000014 | chr14 |
NC_000015.9 | NC_000015 | chr15 |
NC_000016.9 | NC_000016 | chr16 |
NC_000017.10 | NC_000017 | chr17 |
NC_000018.9 | NC_000018 | chr18 |
NC_000019.9 | NC_000019 | chr19 |
NC_000020.10 | NC_000020 | chr20 |
NC_000021.8 | NC_000021 | chr21 |
NC_000022.10 | NC_000022 | chr22 |
NC_000023.10 | NC_000023 | chrX |
NC_000024.9 | NC_000024 | chrY |
NC_001807.4 | NC_001807 | chrM |