
3.1. How do I create a custom reference track containing a subset of sequences from a larger track set?

The information in this entry is relevant when, for example, you are working with amplified sequences that originate from a single chromosome and you wish to map your reads to just that chromosome.

The information here is not appropriate in the case of data such as exon or amplicon data. Please see the associated FAQ "Should I use a masked reference when working with exon or amplicon sequences?" for more information about this.

Assuming that it makes sense for the analysis to include only a subset of a larger reference set, there are three general routes you can take:

  1. Use individual annotated reference sequences from the repository of your choice
  2. Use individual unannotated reference sequences and annotation files from the repository of your choice
  3. Get the reference information of interest from a larger track set, for example a track set generated using the Download Genomes tool of the CLC Genomics Workbench

For any of the methods described, more than one reference sequence can be put into the custom reference set.

 


Use individual annotated reference sequences from the repository of your choice

  • Download your annotated reference(s) from the repository of your choice. Common annotated sequence formats include GenBank and EMBL formats.
  • Import the saved .zip file using the Standard Import option. The reference will then be imported as a DNA sequence with annotations.
  • Convert the DNA sequence to a track using the Convert to Tracks tool

If your reference is deposited at NCBI, you can use the Search for sequences at NCBI tool in the Genomics Workbench instead of going to the repository.

  • First go to Download | Search for sequences at NCBI in the top panel of the Genomics Workbench
  • Type in the accession number of the reference chromosome that you would like to download, e.g. NC_000017 (human chr17), NC_006119 (chicken chr32), etc.
  • Highlight the hit and press Download and Save. The reference will then be downloaded as a DNA sequence with annotations.
  • Convert the DNA sequence to a track using the Convert to Tracks tool

Please note that you do not get any variant annotations using this approach.
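A mistyped accession number will return no hit, or the wrong hit, in the search. As a minimal sketch, assuming the common RefSeq layout of a two-letter prefix, an underscore, six or nine digits and an optional version suffix, the format of an accession can be checked before searching. The function name and the pattern are illustrative assumptions and are not part of the Genomics Workbench. The pattern does not cover other accession styles, such as those used for WGS contigs.

```python
import re

# Assumed pattern for common RefSeq accessions, e.g. NC_006119 or NC_006119.5:
# two uppercase letters, an underscore, 6 or 9 digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_\d{6}(\d{3})?(\.\d+)?$")


def looks_like_refseq(accession):
    """Return True if the text matches the assumed RefSeq accession layout."""
    return REFSEQ_PATTERN.match(accession.strip()) is not None


if __name__ == "__main__":
    for candidate in ["NC_006119", "NC_006119.5", "NC006119"]:
        status = "ok" if looks_like_refseq(candidate) else "check format"
        print(candidate, status)
```

This only checks the shape of the text. It does not confirm that the accession exists at NCBI.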

 

Use individual unannotated reference sequences and annotation files from the repository of your choice

  • First, go to the FTP download site at a repository of your choice, e.g. Ensembl, NCBI, UCSC, WormBase etc.
  • Download your reference chromosome (DNA sequence) in .fasta format
  • Download the annotations in .gff3, .gtf/.gff2 or .bed format
  • Download the variant annotations in .gvf, .vcf, wiggle, UCSC Variation database table dump (.txt), or COSMIC variation database (.tsv) format
  • Import the reference chromosome (.fasta file) using the Track Import option
  • Next, import the annotations (one file type at a time) using the Track Import option. You need to specify the reference during this import, which is why the reference must be imported first.
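Some repositories offer only a whole-genome FASTA file and a genome-wide annotation file. One option in that case is to reduce both files to the sequences of interest before import, so that the annotation file refers only to sequences present in the reference. The following is a minimal sketch in Python, assuming a plain-text FASTA file and a tab-delimited GFF3 file in which the first column holds the sequence name. The function names are hypothetical and this filtering is done outside the Workbench.

```python
import io


def subset_fasta(lines, wanted):
    """Yield the FASTA lines of records whose name is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted
        if keep:
            yield line


def subset_gff3(lines, wanted):
    """Yield GFF3 directives plus features whose seqid (column 1) is in `wanted`."""
    for line in lines:
        if line.startswith("##FASTA"):
            break  # stop at an embedded sequence section, if present
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) > 1 and parts[1] in wanted:
                yield line
        elif line.startswith("#"):
            yield line  # other directives and comments are kept as-is
        elif line.split("\t", 1)[0] in wanted:
            yield line


# Small in-memory example; real use would pass open file handles instead.
fasta_text = ">17 dna:chromosome\nACGT\nTTGA\n>18 dna:chromosome\nGGCC\n"
gff_text = (
    "##gff-version 3\n"
    "17\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "18\tsrc\tgene\t1\t4\t.\t-\t.\tID=g2\n"
)
fasta_subset = "".join(subset_fasta(io.StringIO(fasta_text), {"17"}))
gff_subset = "".join(subset_gff3(io.StringIO(gff_text), {"17"}))
```

The sequence names passed in `wanted` must match the names used in the files exactly. For example, "17" and "chr17" are treated as different names.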

An example of a repository is Ensembl. This repository holds most of the genomes that you can download through the Download Genomes tool in the Workbench.

The FTP area of Ensembl can be found at the following link:

http://www.ensembl.org/info/data/ftp/index.html

 

Get the reference information of interest from a larger track set

If you have already downloaded the full reference genome using the Download Genomes tool in the Workbench, or imported the full reference genome yourself, a reference consisting of one or a few chromosomes can be created as follows.

  • If necessary, convert all tracks of interest (including annotation tracks) to stand-alone sequences using the Convert from Tracks tool for each reference genome track.
  • For each stand-alone reference open the Sequence List in table view. Select the sequences you wish to be included in your final reference and click the button below the table to Create New Sequence List.
  • Once you have created multiple Sequence Lists or single sequences that you wish to be included in your new reference, combine all of them into a new single Sequence List.
  • Use the new Sequence List consisting of all desired reference sequences as input for the Convert to Tracks tool. Make sure to select all annotation types of interest in the Select tracks to create step of the wizard.

When the process has completed, you will have a reference genome track along with the selected annotation tracks.

Please note that variations cannot be included when creating a reference subset in this way.
