
3.15. Double gene and transcript annotations from Ensembl gtf files from late February to March 10, 2014

Description

With the release of Ensembl version 75 came substantial changes to the annotation format. The CLC Genomics Workbench cannot yet import the annotations from Ensembl version 75 correctly. The symptom of the problem is that annotations such as genes and transcripts are duplicated: each gene and each transcript appears twice in the Workbench. (See image below.)

The settings for the Download Genomes tool in the Genomics Workbench versions 7.0 and 6.x have been altered to retrieve data from Ensembl version 74 rather than version 75. The annotation format used in Ensembl version 74 is interpreted correctly by the CLC Genomics Workbench.

This issue has been addressed for the Genomics Workbench 7.0.1 and newer.

Recommended action if your data is affected

If you have used the Download Genomes tool to retrieve annotations from Ensembl since late February, or if you have downloaded gtf annotation files from Ensembl version 75 yourself and imported them into the Workbench, then we recommend that:

  • You delete these annotations. (See information at the bottom of this page.)
  • You download and import version 74 of the Ensembl annotations, either by using Download Genomes or by downloading the gtf file from Ensembl and importing it using the import tools of the Workbench.
  • If you have run RNA-seq analyses or other analyses that depend on these annotations, please re-run these after importing Ensembl version 74 annotations.
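If you downloaded gtf files yourself and still have them on disk, the release number can often be read from the file name. The sketch below assumes the files kept their original Ensembl names, which end in the release number followed by `.gtf` or `.gtf.gz` (for example `Homo_sapiens.GRCh37.75.gtf.gz`). The function names are hypothetical helpers for illustration and are not part of the Workbench.

```python
import re
from typing import Optional

# Assumption: the file still has its original Ensembl name, which ends in
# ".<release>.gtf" or ".<release>.gtf.gz".
_RELEASE_RE = re.compile(r"\.(\d+)\.gtf(?:\.gz)?$")


def ensembl_release_from_filename(filename: str) -> Optional[int]:
    """Return the release number embedded in the file name, or None."""
    match = _RELEASE_RE.search(filename)
    return int(match.group(1)) if match else None


def is_affected_release(filename: str, affected: int = 75) -> bool:
    """True if the file name indicates the affected Ensembl release."""
    return ensembl_release_from_filename(filename) == affected
```

A renamed file returns `None`, in which case the history information in the Workbench is the more reliable check.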

Who this affects

  • People who have imported gtf files from Ensembl version 75 into the Workbench using the Annotate with GFF tool or the Import Tracks tool
  • People who have used Download Genomes to import data from Ensembl. This includes annotations for human, mouse and many of the other genomes offered via the Download Genomes tool.

This does not affect people using plant or bacterial genomes provided via Ensembl, as these do not come from the same source.

This does not affect people using annotations from other resources, for example, TAIR.

How can you tell if your annotations are affected?

1) Check the history information for your annotation track. Do this by opening the annotation track and clicking on the small icon at the bottom of the window that looks like a clock (version 7) or a book with a bookmark (earlier versions). Here, you can see which version of the annotations was used. If it is 74 or earlier, then your data is not affected.

2) Zoom in on an annotation track, for example the gene track. If your data is affected, each annotation will be duplicated, as shown in the image below.

 

3) If you are working with RNA-Seq analysis results and you sort the gene table output on the feature ID, you will notice that genes are present twice: once with their expected name and once with that name with -1 appended.
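For a large gene table, the third check can be scripted once the feature IDs have been exported from the table. The following is a minimal sketch, assuming you have the feature IDs as a plain list of strings; `find_duplicated_names` is a hypothetical helper, not a Workbench function. It reports every name for which the same name with -1 appended is also present.

```python
from typing import Iterable, List


def find_duplicated_names(names: Iterable[str], suffix: str = "-1") -> List[str]:
    """Return names that also occur with `suffix` appended, sorted."""
    seen = set(names)
    # A name is flagged only if both "NAME" and "NAME-1" are in the table.
    return sorted(name for name in seen if name + suffix in seen)


feature_ids = ["BRCA1", "BRCA1-1", "TP53", "TP53-1", "ABC-1"]
print(find_duplicated_names(feature_ids))
```

Some genuine gene names end in -1, so an occasional hit does not prove the data is affected. If nearly every gene is reported, the annotations were most likely imported from Ensembl version 75.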

 

What does it mean for your results?

The duplicate Gene annotations will pose a problem if you run the RNA-Seq Analysis tool and set the 'Maximum number of hits for a read' to '1'. The reference used by the RNA-Seq tool is a list of all Gene annotation sequences. Because all Gene annotations are duplicated, each Gene sequence is present twice in that list. Consequently, a read that maps to a specific gene is seen as mapping equally well to two positions in the reference: the two identical Gene sequences. With the parameter setting 'Maximum number of hits for a read' = '1', such a read exceeds the allowed number of hits, so no reads will map.
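The effect of the hit limit can be illustrated with a toy model. This is a deliberately simplified sketch and not the Workbench's mapping algorithm: it assumes a read "hits" a reference sequence when it occurs in it as an exact substring, and that a read is kept only if its number of hits is at least 1 and at most `max_hits`. All names and sequences are made up for illustration.

```python
from typing import List, Tuple

Reference = Tuple[str, str]  # (gene name, gene sequence)


def mapped_reads(reads: List[str], references: List[Reference],
                 max_hits: int = 1) -> List[str]:
    """Keep reads that hit between 1 and max_hits reference sequences."""
    kept = []
    for read in reads:
        # Toy definition of a hit: exact substring match.
        hits = sum(1 for _, seq in references if read in seq)
        if 1 <= hits <= max_hits:
            kept.append(read)
    return kept


genes = [("geneA", "ACGTACGTTT"), ("geneB", "GGGCCCAAAT")]
# Simulate the problem: every gene is present a second time as "<name>-1".
duplicated = genes + [(name + "-1", seq) for name, seq in genes]
reads = ["ACGTAC", "CCCAAA"]

print(mapped_reads(reads, genes))       # each read has one hit and is kept
print(mapped_reads(reads, duplicated))  # each read has two hits and is dropped
```

In this model, raising `max_hits` to 2 would keep the reads again, but the duplicate entries would still be present in the results, so re-importing the annotations remains the recommended fix.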

In addition, if you run the tool 'Annotate with overlap Information' to add Gene annotation information to, for example, a variant track, then the duplicate Gene annotations will result in duplicate entries in the resulting annotation columns of the variant track table view.

We recommend that you re-run any analyses that have depended on annotations from Ensembl version 75.

 

How to delete annotations

If you are working with tracks, you just need to move the relevant track objects to the trash.

If you are working with stand-alone reference sequences or sequence lists, then information on deleting annotations is provided in the manual here:

http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=Removing_annotations.html
