CLC FAQ - Analyses-related questions

Questions related to the analysis tasks that can be run using CLC Workbenches and Servers

1. General

1.1. What do the navigation area icons mean?

The icons in the navigation area signify specific data object types.

The screen shots below show a non-exhaustive list of the type of data each icon represents in the Workbench. The type of data object is indicated with the text to the right of the icon. These data types are divided into different types of analysis categories. Some categories are only relevant to specific Workbenches or Modules.

General:

 

RNA-Seq and Small RNA Analysis:

 

 

Microarray:

 

Molecular Biology and Classical Sequence Analysis:

 

Whole Genome Alignment plugin:

 

Microbial Genomics and MLST Module:

 

Genome Finishing Module:

 

Retired tools:

Not recognized files:

1.2. How do I save and apply customized View Settings?

To save the view settings in the side panel, so they are applied the next time that you open an object of the same type, please follow the instructions in this FAQ.

Examples of view settings that are often customized include, but are not limited to:

  • Selected columns in table views
  • View of restriction enzymes or motifs
  • Personalized color settings

 

To save and apply your view settings please:

  • Select the view settings that you prefer in the side panel
  • Click on the Save View... button in the lower right corner of the workbench (Figure 1)
  • Enter a name for the view settings in the text box
  • Check the Save for all ... views check box (Figure 2)
  • Click the Save button to save the settings
  • Select the saved view setting to apply from the drop down list
  • Check the Use as standard view settings for a ... view check box (Figure 3)
  • Click the Apply button
  • Click the Close button to close the window

After saving the view settings in this way, they will automatically be applied the next time that you open an object of this type.

 

Figure 1. Click on the Save View ... button to save view settings. In this example, we save table view settings after editing the columns to show.

 

Figure 2. Enter a name for the view settings and click Save.

 

 

Figure 3.  Customize the standard settings using the option Use as standard view setting for a ... view.

 

For more information about Saving, removing and applying saved settings please see the manual following the links below:

CLC Main Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcmainworkbench/current/index.php?manual=View_settings_Side_Panel.html

CLC Genomics Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=View_settings_Side_Panel.html

1.3. Why is the progress bar stalled/why is the progress bar at 100% but my job is still running?

You can view the progress of tasks running in the Workbench, or on the CLC Server when the job was launched via the Workbench, in the Processes tab at the bottom left hand side of the Workbench. You can see a progress bar indicating the stage the job is at, and just above that, you can see text describing the stage the task is at.

The percentage progress reported in the Workbench should be interpreted as an indicator of the stage an analysis or action is at rather than an estimate of how much time it will take before that analysis or action is complete. The progress bar does not closely track progression within a particular phase of an analysis or action.

This means that you can often see a rapid progression of the progress bar, or jumps to a higher percentage, or it can seem like the progress bar is stalled at a certain percentage level for quite some time.  Usually the job is running fine, and when the phase the task is on finishes, the progress bar will jump to a higher percentage level.

Simply put, the progress bar is an approximate measure of how far an analysis or action has proceeded in terms of completed steps rather than compute time.

One known issue is that for some tasks, the progress percentage is marked as 100%, while the text describing the process suggests that data is still being written out. The text describing what is happening is more reliable than the percentage completion in this case.
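To illustrate why a step-based percentage can appear to stall or jump, here is a minimal Python sketch. It is not how the Workbench computes progress: the phase names and durations are made up purely for illustration.

```python
# Hypothetical analysis phases with made-up durations in seconds.
# These names and numbers are illustrative only, not Workbench internals.
PHASES = [
    ("Reading input", 5),
    ("Mapping reads", 600),
    ("Writing output", 60),
]


def step_based_progress(completed_phases: int, total_phases: int) -> float:
    """Percentage based on the number of completed phases."""
    return 100.0 * completed_phases / total_phases


def time_based_progress(elapsed: float, total: float) -> float:
    """Percentage based on elapsed compute time."""
    return 100.0 * elapsed / total


total_time = sum(duration for _, duration in PHASES)
elapsed = 0
for index, (name, duration) in enumerate(PHASES, start=1):
    elapsed += duration
    print(
        f"{name}: step-based {step_based_progress(index, len(PHASES)):.0f}%, "
        f"time-based {time_based_progress(elapsed, total_time):.0f}%"
    )
```

In this toy example the step-based value reaches 33% after the first short phase, although less than 1% of the compute time has passed, and then stays at 33% for the whole of the long second phase.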

 

1.4. How can I view trace data in the Workbench?

Trace data associated with sequences can be viewed when working with sequence lists or stand alone read mappings in the CLC Workbenches.

Trace data cannot be viewed within track-based data types.

Trace data can be viewed when running unlicensed CLC Workbenches in View Mode. In contrast, trace data could not be viewed in the now-discontinued free CLC Sequence Viewer.


The importer to use when importing data with trace information

The Standard Importer brings in the trace information. The NGS Sanger importer does not.

Traditional Sanger sequencing data should be imported using the standard import option. In the CLC Genomics Workbench this importer is found at

Import | Standard Import

The CLC Genomics Workbench import option Import | Sanger is designed for importing large, NGS-scale data sets and removes trace information. If you want to retain trace data, this is NOT the importer to use. This Sanger data importer is explained in further detail in the user manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sanger_sequencing_data.html

 

 

Viewing traces in a sequence or sequence list

To view trace information for sequences, either a lone sequence or sequences in a sequence list, open the sequence or sequence list in the viewing area and use the viewing options, usually found at the right hand side of the viewing area. Open the section called Nucleotide info and check the relevant options under the Trace data section, as shown in the image below.

This, and more details about the Nucleotide info view settings, can be found in the manual at:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Nucleotide_info.html

 

Viewing traces in a stand-alone mapping

The information above about viewing trace data also applies if you have mapped Sanger reads to a reference and wish to view the trace data in the resulting stand-alone mapping object. Here, one additional setting is needed. Check that:

  • Under the Read Layout section, Compactness is set to Not compact.

  • As described above, under the Nucleotide info section, the option to view the Traces is checked and all the base traces are checked.

 

 

Viewing data for track-based mappings

Trace data associated with reads in reads tracks, that is track-based mappings, cannot be viewed while in track format.

To view trace information for such reads within the mapping, please convert the reads track (track-based mapping) to stand-alone format using the tool:

Toolbox | Track Tools | Convert From Tracks

Then use the information above about viewing traces in a stand-alone read mapping.

1.5. How can I change the decimal notation used by the Workbench?

The notation used by the Workbench in reports, tables and wizards is determined by the 'locale' settings of your system. By default, the Workbench stores the locale of your computer as the locale to use. This locale setting determines whether a dot (.) or a comma (,) will be used as the decimal separator.

So, for example, if your machine has a Danish locale setting, then the locale setting of your Workbench will, by default, be Danish as well - and the decimal separator will consequently be a comma.

If you wish to change the locale setting for your Workbench you can do so under

Edit | Preferences | General - Locale Setting
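As a small illustration of the effect of a locale on decimal notation, the Python sketch below formats the same value with two different separators. The locale codes and the lookup table are assumptions made for this example; the Workbench itself takes the separator from its own locale preference.

```python
# Illustrative decimal separators for a few locales.
# The locale codes and this small lookup table are assumptions for the
# sketch; they are not read from the Workbench or the operating system.
DECIMAL_SEPARATOR = {
    "en_US": ".",
    "da_DK": ",",
    "de_DE": ",",
}


def format_decimal(value: float, locale_code: str, digits: int = 2) -> str:
    """Format a number using the decimal separator of the given locale."""
    text = f"{value:.{digits}f}"  # Python always formats with a dot here
    return text.replace(".", DECIMAL_SEPARATOR[locale_code])


print(format_decimal(0.05, "en_US"))  # 0.05
print(format_decimal(0.05, "da_DK"))  # 0,05
```

The same value is therefore written differently depending on which locale is selected.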

 

 

 

 

2. Sequences and sequence lists

2.1. How do I mark a sequence as circular or linear?

Sequences have an attribute that marks them as circular or linear. If this feature is defined in the original GenBank file imported into the Workbench, the information is retained. This information will be shown when you open your sequence in text view (see the screen shot below).

If this attribute is not part of the sequence metadata on import, then the sequence will be considered as linear.
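If you want to check the topology stated in a GenBank file before import, the following Python sketch reads it from the LOCUS line, which can carry the word "linear" or "circular". This is an illustrative helper, not part of the Workbench; the function name and the example LOCUS lines are made up.

```python
def genbank_topology(locus_line: str) -> str:
    """Return 'circular' or 'linear' for a GenBank LOCUS line.

    If the line does not state a topology, 'linear' is returned,
    mirroring the default behaviour described above.
    """
    # Skip the "LOCUS" keyword and the locus name, then look for the keyword.
    tokens = locus_line.lower().split()[2:]
    if "circular" in tokens:
        return "circular"
    return "linear"


# Made-up LOCUS lines for illustration.
examples = [
    "LOCUS       EXAMPLE1     16569 bp    DNA     circular PRI 01-JAN-2000",
    "LOCUS       EXAMPLE2      5000 bp    DNA     linear   BCT 01-JAN-2000",
    "LOCUS       EXAMPLE3      1200 bp    DNA              SYN 01-JAN-2000",
]
for line in examples:
    print(line.split()[1], genbank_topology(line))
```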


A single sequence can be marked as circular or linear either from the sequence or table view. Several or all sequences in a sequence list can be marked as circular or linear at the same time from the table view. The following paragraphs describe the different options:

 

Marking sequence as circular or linear from the table view

If you have a list of sequences for which you wish to mark one or all sequences as circular or linear, then it can be done in the following way:

  • Open up your sequence list in table view
  • Select the row with the sequence to mark as circular. To select all rows of the table, click a single row, then press Ctrl+A or ⌘+A on Mac
  • Move the mouse cursor over the column called Linear
  • Depress the mouse button and choose the option called Edit Linear
  • Choose the option from the menu that will appear:
    • Update linear to Circular or Linear

 

  • Click the Update button
  • Save the edits to the sequence list by clicking the Save button in the toolbar or by using the keyboard shortcut Ctrl+S or ⌘+S on Mac

 

Marking sequences as circular or linear from the sequence view

If you have a single sequence object or an individual sequence in a sequence list, that you wish to mark as circular or linear, then it can be done in the following way:

  • Open your sequence or sequence list in a graphical view
  • Move the mouse cursor over the name of the sequence you would like to change from linear to circular or vice versa
  • Depress the mouse button and choose the option from the menu that will appear:
    • Make Sequence Circular or Make Sequence Linear

  • Save the edits to the sequence or sequence list by clicking the Save button in the toolbar or by using the keyboard shortcut Ctrl+S or ⌘+S on Mac

 

 

What to do if I have a sequence track?

Only stand-alone sequences can be marked as circular or linear. Hence, to mark the mitochondrial chromosome as circular in a genome sequence track, please convert the track to a stand-alone sequence or sequence list, follow the steps outlined above, and then convert the final objects back to track format. Conversions to and from tracks can be done using the tools in the Genomics Workbench under:

Toolbox | Track Tools | Track Conversion

2.2. How can I make subsets of a Sequence List?

One of the most commonly used data objects within the Workbench is the Sequence List. Detailed information about Sequence Lists can be found at the following manual page: 

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sequence_Lists.html

Subsets of Sequence Lists can be created in different ways.


Create a subset of a Sequence List via the table view

  • Open the Sequence List in table view by clicking on the small icon of a table at the bottom of the view.
  • Select the rows representing the sequences you wish to include in the new Sequence List.
  • Click on the button at the bottom of the view labelled Create New Sequence List.

This will create a new Sequence List consisting only of the selected sequences. You will need to explicitly save this new Sequence List.

 

How to do this is explained in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sequence_Lists.html

 

Create a subset of sequences by filtering for the desired sequences based on an attribute

Example 1: Using basic filtering based on text within an attribute shown in the Sequence List table view

To create a subset of sequences based on common text found within  a sequence description or any other attribute found within the Sequence List:

  • Open the Sequence List in table view (see the section "Create a subset of a Sequence List via the table view" above).
  • Type the text you wish to use to select the subset of sequences into the filtering text box. 
  • When only the sequences of interest are in view select all the rows of the table. To select all rows click any row in the table, then press Ctrl+A or ⌘+A on Mac.
  • Click the Create New Sequence List button found below the table.

This will create a new Sequence List consisting only of those filtered/desired sequences. You will need to explicitly save this new Sequence List.

 

Example 2: Using advanced filtering  based on text within an attribute shown in the Sequence List table view

Advanced filtering also allows you to extract a sublist of sequences based on a list of sequence names, or other attribute. This list should contain text unique to the sequences of interest separated by new lines, commas, or semicolons.

      • Open the Sequence List in table view by clicking on the small icon of a table at the bottom of the view.
      • Expand the Filtering view at the top of the table for Advanced Filtering.
      • Select the column name from the drop down menu that represents the text in your list, such as "Name".
      • In the drop down menu to the right of the column, choose the option "is in list".
      • Copy and paste the list into the filter text box and click the Filter button.
      • Select all rows in view by clicking a single row, then pressing Ctrl+A or ⌘+A on Mac.
      • Click on the button at the bottom of the view labelled Create New Sequence List.

This will create a new Sequence List consisting only of those desired sequences. You will need to explicitly save this new Sequence List.
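The idea behind the "is in list" filter can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout, the function names and exact, case-sensitive matching are all choices made for the example, and the Workbench filter may match differently.

```python
import re


def parse_name_list(text: str) -> set[str]:
    """Split pasted text on new lines, commas, or semicolons."""
    parts = re.split(r"[\r\n,;]+", text)
    return {part.strip() for part in parts if part.strip()}


def filter_in_list(records: list[dict], column: str, pasted: str) -> list[dict]:
    """Keep the records whose value in `column` appears in the pasted list.

    Exact matching is an assumption of this sketch.
    """
    wanted = parse_name_list(pasted)
    return [record for record in records if record[column] in wanted]


# Made-up table rows standing in for a Sequence List table view.
rows = [
    {"Name": "seq1", "Size": 120},
    {"Name": "seq2", "Size": 85},
    {"Name": "seq3", "Size": 240},
]
subset = filter_in_list(rows, "Name", "seq1, seq3;\nseq9")
print([row["Name"] for row in subset])
```

Names in the pasted list that do not occur in the table, such as "seq9" here, are simply ignored.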

 

More information about filtering tables can be found in our manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Filtering_tables.html

Create a subset of sequences based on length using the Trim Reads tool


In the CLC Genomics Workbench, the Trim Reads tool can be used to create a subset of sequences that fall within a specified length range.

This tool is launched by going to:

Toolbox | NGS Core Tools | Trim Reads

The steps involved here are:

      • Specify the sequence list(s) you wish to use.
      • Uncheck all the Quality Trimming options when you reach that stage of the Wizard.
      • Ensure no Trim Adapter List is selected when you reach the Adapter trimming section of the Wizard.

        If one is listed at this stage, please click the arrow button in the bottom left of the wizard dialogue box to go back to the default settings for this, thereby deselecting the Adapter trim list.

      • The Sequence Filtering stage (image below), is the key stage for this process. Here:
        • Uncheck options in the Trim Bases section.
        • Check one or both of the Filter on Length options.
        • Enter the minimum and/or maximum length sequences you wish to include in your new subset of reads.

This is shown in the image below, where a minimum threshold of 75 bp and a maximum threshold of 125 bp are specified.

 

After running the Trim Reads tool in this way, a new sequence list will be created. It will have the same name as the Sequence List you provided, with the text "trimmed" appended to it. The sequences within this new Sequence List are those from the original list that had a length within the size range you specified.

If you are working with paired reads, make sure to check the Save broken pairs option in the 5th step of the wizard. Selecting this option generates a second Sequence List containing reads that fall within the specified length range but whose mates did not. These are now single reads.
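The length filtering and the handling of broken pairs described above can be sketched in Python. This is an illustration, not the Trim Reads implementation: the function names are made up, and treating both thresholds as inclusive is an assumption.

```python
def in_range(sequence, min_length=None, max_length=None):
    """True if the sequence length lies within the (assumed inclusive) range."""
    if min_length is not None and len(sequence) < min_length:
        return False
    if max_length is not None and len(sequence) > max_length:
        return False
    return True


def filter_pairs(pairs, min_length=None, max_length=None):
    """Split read pairs into intact pairs and broken-pair single reads."""
    intact, broken = [], []
    for first, second in pairs:
        keep_first = in_range(first, min_length, max_length)
        keep_second = in_range(second, min_length, max_length)
        if keep_first and keep_second:
            intact.append((first, second))
        elif keep_first:
            broken.append(first)   # mate was outside the length range
        elif keep_second:
            broken.append(second)  # mate was outside the length range
    return intact, broken


# Made-up reads: lengths (100, 100), (100, 50) and (10, 200).
pairs = [("A" * 100, "C" * 100), ("G" * 100, "T" * 50), ("A" * 10, "C" * 200)]
intact, broken = filter_pairs(pairs, min_length=75, max_length=125)
print(len(intact), "intact pair(s),", len(broken), "broken-pair read(s)")
```

With the 75 to 125 bp range from the example above, the first pair is kept intact, one read from the second pair is kept as a single read, and the third pair is discarded entirely.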

  

Create subsets of sequences based on sequence name using the Sort Sequences by Name tool

The Sort Sequences by Name tool creates multiple Sequence List subsets based on sequence names. The sorted sequence lists are non-redundant, and combined, they include the entirety of the original set of sequences. This differs from other tools described here, where only a single subset is created.

The Sort Sequences by Name tool is launched in the CLC Main Workbench by going to:

Toolbox | Sequencing Data Analysis | Sort Sequences by Name

 

In the CLC Genomics Workbench, it is launched by going to:

Toolbox | Molecular Biology Tools | Sequencing Data Analysis | Sort Sequences by Name

 

In the Biomedical Genomics Workbench, it is launched by going to:

Toolbox | Sanger Sequencing | Sequencing Data Analysis | Sort Sequences by Name

 

A description of how to make use of this tool can be found in the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sort_sequences_name.html

 

 

Extract a random subsample of sequences

Generate a random subset of sequences via the CLC Genomics Workbench

A tool that can generate a random subset of sequences is the Sample Reads tool. Instructions for making use of this tool are found in the manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sample_reads.html

This Sample Reads tool first became available in Genomics Workbench 7.5. Older versions of the Workbench can make use of the Sample Reads tool included within the CLC Genome Finishing Module. This is a commercial module, which can be downloaded and installed in the Workbench as a plugin. You can read more about this particular tool in the Genome Finishing Module manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomefinishing/current/User_Manual.pdf

You may download and install this plugin through the plugin manager of the Workbench. 

Generate a random subset of sequences with the CLC Assembly Cell

A random sample of sequencing reads can be selected from a larger set using the clc_sample_reads tool, distributed with the CLC Assembly Cell product. The Assembly Cell  is a collection of binary executables run via the command line.

Information about the clc_sample_reads tool can be found in our online manual here:

http://resources.qiagenbioinformatics.com/manuals/clcassemblycell/current/index.php?manual=Options_clc_sample_reads.html

General information about the CLC Assembly Cell can be found here:

https://www.qiagenbioinformatics.com/products/clc-assembly-cell/
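For readers who want to see what random subsampling amounts to, here is a minimal Python sketch. It is not the algorithm used by the Sample Reads tool or by clc_sample_reads; the function name and the seed parameter are assumptions made for the example.

```python
import random


def sample_reads(reads, count, seed=None):
    """Return `count` randomly chosen reads, kept in their original order.

    If `count` is at least the number of reads, all reads are returned.
    A fixed `seed` makes the selection reproducible.
    """
    if count >= len(reads):
        return list(reads)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(reads)), count))
    return [reads[index] for index in chosen]


# Made-up read names standing in for a sequence list.
reads = [f"read_{number}" for number in range(1, 11)]
print(sample_reads(reads, 3, seed=42))
```

Sampling is done without replacement, so no read appears twice in the subset.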

2.3. How can I concatenate sequence lists and when do I need to?


How to concatenate fastq files from different lanes

From QIAGEN CLC Genomics Workbench 20 onward, FASTQ files from the same Illumina sequencing run but from different lanes can be merged into a single sequence list during import by selecting the option Join reads from different lanes.

This functionality is described on the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Illumina.html

 

How to concatenate sequence lists together

How to concatenate two or more sequence lists together is covered in our Workbench manuals. For Genomics Workbench, the relevant manual link is: 

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sequence_Lists.html

The information there is pertinent to all CLC Workbenches.

 

Concatenating sequence lists is not necessary in most cases

Analysis tools that accept sequence lists as input can accept two or more sequence lists at once. Thus, there is no need to concatenate the lists prior to analysis of data that should be analyzed together.

For example, if you have two or more sets of sequence reads from a single sample that you wish to enter into a mapping, de novo assembly or other tool, you just select the relevant sequence lists in the Wizard, as shown below:

If you have two or more sets of sequence reads from each sample and wish to analyze the samples using the Batch option, set up a folder structure with a top-level folder containing one folder per sample, each holding the sequence lists to be included in the analysis. If you check Batch and select the top-level folder, the content of each sample folder will be analyzed as one batch unit. That is, all the reads from the sequence lists in a sample folder will be analyzed as if they came from one large sequence list:
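The folder-based grouping into batch units can be pictured with the following Python sketch. It only mimics the idea on an ordinary file system: plain files stand in for sequence lists, and the folder and file names are made up. It does not read a CLC data location.

```python
import tempfile
from pathlib import Path


def batch_units(top_folder: Path) -> dict[str, list[str]]:
    """Map each sample subfolder name to the sorted item names it contains."""
    units = {}
    for sample_folder in sorted(top_folder.iterdir()):
        if sample_folder.is_dir():
            units[sample_folder.name] = sorted(
                item.name for item in sample_folder.iterdir() if item.is_file()
            )
    return units


def build_example(top_folder: Path) -> None:
    """Create a small, made-up folder layout for demonstration."""
    layout = {
        "sample_A": ["lane1_reads", "lane2_reads"],
        "sample_B": ["lane1_reads"],
    }
    for sample, names in layout.items():
        (top_folder / sample).mkdir()
        for name in names:
            (top_folder / sample / name).touch()


with tempfile.TemporaryDirectory() as tmp:
    build_example(Path(tmp))
    for sample, items in batch_units(Path(tmp)).items():
        print(sample, "->", items)
```

Each key of the returned dictionary corresponds to one batch unit, and all items listed under it would be analyzed together.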

 

If running a workflow in batch, the batch units can be defined based on either folder structure or metadata, as described on the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Running_workflows_in_batch_mode.html

 

Concatenating two or more sequence lists makes sense when...

Cases where concatenating sequence lists can be useful are:

1) Viewing annotations across a sequence set

If you wished to view and search all annotations on all the sequences in a set, then those sequences would need to be in a single sequence list. Relevant manual links include:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=View_Annotations_in_sequence_views.html

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=View_Annotations_in_table.html

 

2) Organization (convenience)

One could store many sequence lists in a folder, but an alternative would be to concatenate them into one sequence list.

Please note that we recommend that this action be taken only on smaller lists (e.g. thousands of sequences or less) and not very large sequence lists, such as lists of high throughput sequencing data.

2.4. Why are my sequences shown as "No label" in the navigation area of the Workbench?

Sequences in the Navigation area of the Workbench can sometimes be shown as something like "No label". This can occur if your sequence representation is set to something other than the default Name. If so, objects showing "No label" do not have information for the sequence representation that is currently selected.

It is possible to change the sequence representation in two ways. One is via the navigation area of the Workbench by right clicking a sequence and selecting the Sequence Representation | Name option from the drop down menu.  This is described in the manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Change_how_sequences_are_displayed.html

 

You may also change this setting through the user preferences dialogue. To do so, first go to the user preferences dialogue as described in the manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=User_preferences_settings.html

Then, adjust the sequence representation setting to be the default option, Name:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Change_how_sequences_are_displayed.html 

3. Reference Genome

3.1. How do I create a custom reference track containing a subset of sequences from a larger track set?

The information in this entry can be relevant when working with amplified sequences originating from only one chromosome and where you wish to map your reads to just that chromosome.

The information here is not appropriate in the case of data such as exon or amplicon data. Please see the associated FAQ "Should I use a masked reference when working with exon or amplicon sequences?" for more information about this.

Assuming that it makes sense for the analysis to include only a subset of a larger reference set, there are three general routes you can take:

  1. Use individual annotated reference sequences from the repository of your choice
  2. Use individual unannotated reference sequences and annotation files from the repository of your choice
  3. Get the reference information of interest from a larger track set, for example a track set generated using the Download Genomes tool of the CLC Genomics Workbench

For any of the methods described, more than one reference sequence can be put into the custom reference set.

 


Use individual annotated reference sequences from the repository of your choice

  • Download your annotated reference(s) from the repository of your choice. Common annotated sequence formats include GenBank and EMBL formats.
  • Import the saved .zip file using the Standard Import option. Now your reference will be imported as a DNA sequence with annotations
  • Convert the DNA sequence to a track using the Convert to Tracks tool

If your reference is deposited at NCBI, you can use the Search for sequences at NCBI tool in the Genomics Workbench instead of going to the repository.

  • First go to Download | Search for sequences at NCBI in the top panel of the Genomics Workbench
  • Type in the accession number of the reference chromosome that you would like to download, e.g. NC_000017 (human chr17), NC_006119 (chicken chr32), etc.
  • Highlight the hit and press Download and Save. Now the reference will be downloaded as a DNA sequence with annotations
  • Convert the DNA sequence to a track using the Convert to Tracks tool

Please note that you do not get any variant annotations using this approach.

 

Use individual unannotated reference sequences and annotation files from the repository of your choice

  • First, go to the FTP download site at a repository of your choice, e.g. Ensembl, NCBI, UCSC, WormBase etc.
  • Download your reference chromosome (DNA sequence) in .fasta format
  • Download the annotations in .gff3, .gtf/.gff2 or .bed format
  • Download the variant annotations in .gvf, .vcf, wiggle, UCSC Variation database table dump (.txt), or COSMIC variation database (.tsv) format
  • Then, import the reference chromosome (.fasta file) using the Track Import option
  • Next, import the annotations (one file type at a time) using the Track Import option. On import you must specify the reference, which is why the reference file needs to be imported first.

An example of a repository is Ensembl. This repository holds most of the genomes that you can download through the download genome function in the workbench.

The FTP area of Ensembl can be found by following the link:

http://www.ensembl.org/info/data/ftp/index.html

 

Get the reference information of interest from a larger track set

If you have already downloaded the full reference genome using the download genome option in the workbench or imported the full reference genome, a reference consisting of one or a few chromosomes can be created as follows.

  • If necessary, convert all tracks of interest (including annotation tracks) to stand-alone sequences using the Convert from Tracks tool for each reference genome track.
  • For each stand-alone reference open the Sequence List in table view. Select the sequences you wish to be included in your final reference and click the button below the table to Create New Sequence List.
  • Once you have created multiple Sequence Lists or single sequences that you wish to be included in your new reference, combine all of them into a new single Sequence List.
  • Use the new Sequence List consisting of all desired reference sequences as input for the Convert to Tracks tool. Make sure to select all annotation types of interest in the Select tracks to create step of the wizard.

When the process has completed, you will now have a reference genome track along with selected annotation tracks. 

Please note that variations cannot be included when creating a reference subset in this way.

4. Tracks

4.1. How can I check that a set of tracks are compatible?

All information in tracks is tied to genomic positions. Thus, all tracks being used for a particular analysis or in a track list must be based on the same coordinate system. That coordinate-system is provided by a reference genome. Tracks based on the same coordinate system, that is, the same reference genome, can be analyzed and viewed together in a meaningful way.

This means that it is vital that files containing genome coordinates, such as BED files, be imported against a reference track with the same coordinate system.

This FAQ addresses how to check if tracks are compatible and how to obtain compatible tracks if they are not.

 

Check compatibility of existing tracks by making a track list

If you can make a track list with a set of tracks, then those tracks are compatible, and could be used together in an analysis. To do this, go to: 

File | New | Tracklist

and select the tracks of interest.

If a track is incompatible with others selected, you will see a message in the wizard saying "Select tracks from same genome".

 

What to do if a track is incompatible with other tracks of interest

For standard reference data, such as hg38 annotations, we suggest one of these routes:

  • Use the CLC Genomics Workbench Reference Data Manager, and download the reference data of interest. This may be either individual elements from the relevant reference data set, or a full reference data set. Within a set, all the tracks are compatible.

    OR

 

For custom data, for example target regions of interest:

  • Import the data as a track, specifying the relevant reference genome track during import, and
  • After import, check that the target regions are located where you expect by making a track list containing the newly imported track and other relevant tracks.
    For example, for a target region track imported from a BED file, include it in a track list with mRNA or CDS annotation tracks. Check the target regions are placed where you expected, particularly near the end of the chromosomes, as any coordinate mismatches will be most obvious there. 
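A quick sanity check that can be run on a BED file before import is to confirm that every region names a known chromosome and lies within its length. The Python sketch below does this under stated assumptions: the function name, the chromosome lengths and the BED lines are made up, and BED coordinates are taken as 0-based with an exclusive end. Passing this check does not prove that the file was made against the same genome build; it only catches the most obvious mismatches.

```python
def out_of_range_regions(bed_lines, chromosome_lengths):
    """Return BED regions that name an unknown chromosome or exceed its length."""
    problems = []
    for line in bed_lines:
        stripped = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        chrom, start_text, end_text = stripped.split()[:3]
        start, end = int(start_text), int(end_text)
        length = chromosome_lengths.get(chrom)
        if length is None:
            problems.append((chrom, start, end, "unknown chromosome"))
        elif start < 0 or start > end or end > length:
            problems.append((chrom, start, end, "invalid or out-of-range coordinates"))
    return problems


# Made-up reference lengths and target regions for illustration.
reference = {"chr1": 1000}
bed = [
    "# example target regions",
    "chr1\t0\t100",
    "chr1\t950\t1001",
    "chr2\t10\t20",
]
for problem in out_of_range_regions(bed, reference):
    print(problem)
```

Note that a naming difference such as "chr1" versus "1" is reported here as an unknown chromosome.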

 

4.2. What is the difference between stand-alone objects and tracks?

The CLC Workbenches and Servers use two different object types, called Tracks and Stand-alone objects.

A key difference between a Stand-alone and Track-based sequence object is that a Stand-alone sequence contains all information relevant to that sequence in a single object. This includes any annotations as well as the sequence information itself.

In contrast, a Track-format object for a sequence contains very little information apart from the sequence itself. Annotations associated with that sequence are held in separate tracks. For example, you might have a Track for the Gene annotations, a Track for the CDS annotations, etc. This provides greater flexibility in that you can view sets of Tracks in Track Lists, and thus can easily view together particular sets of annotations, or other information (e.g. Reads Tracks, Variant Tracks, etc.), at any given time.
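The conceptual difference between the two layouts can be modelled with a few Python data classes. This is purely a teaching sketch: the class and field names are invented and say nothing about how the Workbench stores its data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "Gene" or "CDS"
    start: int
    end: int


@dataclass
class StandAloneSequence:
    """The sequence and all of its annotations held in one object."""
    name: str
    residues: str
    annotations: list[Annotation] = field(default_factory=list)


@dataclass
class SequenceTrack:
    """The sequence only; annotations live in separate tracks."""
    name: str
    residues: str


@dataclass
class AnnotationTrack:
    """One annotation type, tied to a reference sequence by name."""
    reference_name: str
    kind: str
    annotations: list[Annotation] = field(default_factory=list)


def to_tracks(sequence: StandAloneSequence):
    """Split a stand-alone sequence into a sequence track plus one track per type."""
    by_kind: dict[str, list[Annotation]] = {}
    for annotation in sequence.annotations:
        by_kind.setdefault(annotation.kind, []).append(annotation)
    sequence_track = SequenceTrack(sequence.name, sequence.residues)
    annotation_tracks = [
        AnnotationTrack(sequence.name, kind, items) for kind, items in by_kind.items()
    ]
    return sequence_track, annotation_tracks


example = StandAloneSequence(
    "chrExample",
    "ACGTACGTACGT",
    [Annotation("Gene", 0, 12), Annotation("CDS", 3, 9), Annotation("CDS", 9, 12)],
)
sequence_track, annotation_tracks = to_tracks(example)
print([(track.kind, len(track.annotations)) for track in annotation_tracks])
```

In this model every annotation track refers back to the same reference name, which is what allows separate tracks to be viewed together.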

Tracks are particularly useful when working with resequencing projects where several samples are analyzed against the same reference sequence. Therefore, this object type is used for RNA-Seq, Epigenomics and Resequencing.

Tracks are described in more detail in the following manual page:

     http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Tracks.html

 

Stand-alone objects, on the other hand, are more flexible as they are not tied to a common reference. They are used for Sanger sequence analysis, cloning, de novo assembly, genome finishing, metagenomics, etc.

Based on the nature of the analyses that our different Workbenches can be used for, the Biomedical Genomics Workbench primarily works using Tracks, the CLC Genomics Workbench uses both Tracks and Stand-alone objects, while the Main Workbench only uses Stand-alone objects.

In CLC Genomics Workbench, objects can be converted to Tracks or from Tracks using tools in the Track Tools section of the Workbench Toolbox.

The simplest way to determine if an object is a Track or a Stand-alone object is by looking at the icon for that object in the Navigation Area of your Workbench.  The icons for track-based objects include a small blue bar graph at the bottom. Therefore, objects without these blue bars are not in track format.

 

The following images include examples of Track and Stand-alone data objects.

 

 

Example of a Track list:

 


Example of a Stand-alone mapping viewed along with an annotated variant table:

4.3. Can chromosomes downloaded from NCBI be used as a track-based reference?

Sequences downloaded and imported using the tool:

   Download | Search for Sequences at NCBI

are imported as stand-alone sequence objects or sequence lists.

These are easily converted to tracks by using the tool:

   Toolbox | Tracks | Convert to tracks

You can download any set of sequences and convert them to Tracks if you wish.

 

Relevant manual links

Searching and downloading sequences from GenBank - the NCBI Entrez database:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Search_Sequences_at_NCBI.html

 

Converting to tracks:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Convert_Tracks.html

 

5. Read mapping

5.1. How can I view the bases for mapped reads that extend beyond the end of a reference sequence?

This information is relevant for

  • stand-alone mappings with reference and consensus sequences
  • stand-alone mappings with just a contig sequence, such as produced by mapping reads back to de novo assembled contigs. For such data, please just replace the word "reference" with "contig" in the instructions below.

This information does not pertain to track-based objects.

 

For reads that map at the end of a reference sequence, any section of the read extending beyond the end of the reference sequence will not be visible. The existence of read information extending beyond the end of the reference sequence is indicated by an arrow (>). This is shown in figure 1 of the pdf attached to this page.

The steps you need to take to be able to view the bases in such unaligned read ends are:

  1. Edit the reference sequence, adding a reasonably long stretch of N characters to the end. How long depends on how long your reads are and how long the unaligned ends you wish to view are.

  2. Re-run the mapping using this edited reference.

 

Editing your reference sequence

If you wished to add a series of Ns to the start of a reference sequence, then you would:

  • Open the reference sequences object and highlight the first nucleotide.
  • Right click the highlighted nucleotide and select "Edit Selection" in the menu that appears. This is shown in figure 4 of the attachment.
  • Type in at least as many Ns as your reads are long and click the button labeled "OK" to save this change.

To add Ns at the other end of the reference, just select the last base of the reference and repeat the above actions.
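If you prefer to prepare the padded reference outside the Workbench before importing it, the same idea can be sketched in a few lines of Python. This is only an illustration of the padding step, not a Workbench feature; the function name `pad_reference` and its parameters are hypothetical.

```python
def pad_reference(sequence: str, read_length: int,
                  pad_start: bool = True, pad_end: bool = True) -> str:
    """Return the reference with a run of N characters added to the chosen end(s).

    read_length is used as the padding length, following the advice above to
    add at least as many Ns as your reads are long.
    """
    if read_length < 0:
        raise ValueError("read_length must be non-negative")
    padding = "N" * read_length
    prefix = padding if pad_start else ""
    suffix = padding if pad_end else ""
    return prefix + sequence + suffix


# Toy reference and a read length of 5 bases.
print(pad_reference("ACGTACGT", read_length=5))  # NNNNNACGTACGTNNNNN
```

The padded sequence would then be imported and used as the reference when re-running the mapping.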

 

 

Working with mappings with extended reference sequences

After mapping to your edited reference, the ends of any reads extending into the area with the Ns will be considered non-matching, and will therefore appear in faded colours.

If no un-mapped ends appear then please make sure that you have selected to show the sequence ends in the side panel. This is illustrated in figure 2 of the pdf attached to this page.

To get the consensus sequence for such regions, you need to manually drag the end point for each read to be considered. To do this:

  • Make sure you are working with the Selection cursor. To make sure you are, click on the button marked with an arrow and labelled Selection in the top toolbar of the Workbench.
  • Please ensure the compactness setting for the mapping is set to "Not compact"
  • Put the selection cursor on the grey vertical line between the faded and non-faded parts and depress the left mouse button.
  • Keeping the mouse button depressed, for each relevant read, drag to the left to extend the 5' end, or to the right to extend the 3' end.

All read ends that you want to include in the consensus (or contig) calculation should be dragged to include them in the mapping in the above manner.

The consensus sequence will appear after you have released the mouse button after dragging.

 

 

Additional information about viewing options for mappings

If you select the compactness view called Packed in the side panel and then choose to Show mismatches, the nucleotides are colored in a way that may make it easier for you to get an idea of how well the consensus represents your reads.

This is illustrated in figure 4 of the attachment.

Alternatively, if you want to include all reads, but without viewing the quality scores, you may wish to try the compactness view called Low.

You can find further details about viewing options for read mappings in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html

5.2. What are the memory requirements for the read mapper?

The amount of memory used for mapping is linked with the size of the genome you are mapping to. On our system requirements page we give some suggestions for how much memory you will need to map to some organisms of different size.

You can find our system requirements page following the link below:

https://www.qiagenbioinformatics.com/system-requirements/

 

5.3. Why is the total memory used for my read mapping more than specified as the maximum for the java process?

There are three phases when running a read mapping job:

  • a pre-processing phase
  • a computational phase
  • a post-processing phase.

The pre- and post-processing phases are part of the CLC java process, and are thus subject to the heap size restrictions you place on java using the Xmx setting in the vmoptions file. How to alter this setting in the Workbench is described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/workbenchdeployment/current/index.php?manual=Setting_amount_memory_available_JVM.html

The computational phase, however, is run via a native binary. It is thus not subject to any java-related settings and will use the amount of memory it needs.

The system requirements for the Workbench are provided on our website (see link below). The information includes a guide to the amount of memory required for running mappings against certain sample genomes. The amount of memory required is linked with the size of the genome you are mapping to.

https://www.qiagenbioinformatics.com/system-requirements/

5.4. How can I see all my sequences when viewing high coverage areas of a mapping?

Working with read mapping tracks

To see all individual reads when viewing high coverage areas, you will need to:

  • Zoom into the mapping within a region of interest.
  • Hold down the Alt key and scroll with the mouse wheel.

This will allow you to scroll down the mapped reads in the view.

You may also wish to increase the depth of the read track when viewing the read data within a mapping. To do this, place the mouse cursor in the name area of the open map track, and hover over the lower boundary of the track. You should see the cursor change to a double-headed arrow. When in that state, you can depress the left mouse button and drag the height of the track upwards or downwards.

A full list of shortcuts, including the Alt+Scroll Wheel shortcut mentioned above, can be found in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=List_shortcuts.html

 

Working with stand-alone read mapping data

To be able to scroll through all reads in a stand-alone read mapping, please choose any compactness level except Packed in the Read Layout section of the side panel.

Different compactness levels can be selected in the Read Layout section of the side panel when inspecting a stand-alone read mapping. The default compactness level for a mapping is the Packed setting.

While all other compactness settings will stack the reads on top of each other, such that no two reads lie side by side, the Packed setting uses the available space more efficiently, allowing more reads to be seen in the same view. The benefit of this setting is a better overview when scrolling laterally through a read mapping.

However, a consequence of the organization of the Packed setting is that not all reads covering a given site are necessarily displayed. To aid with this, the vertical space used can be adjusted in the Side Panel settings in the section Read layout | Packed read height.

To display all the reads covering a given site, you can either increase the level of Packed read height, or, you can switch from the Packed view to another compactness level, for instance Low.

Listed below are a few characteristics of the Packed setting:

  • When there are more reads than the specified packed read height allows for, a grey overflow graph will be displayed below the reads in the Packed setting.
  • When zoomed in to 100% in the Packed setting, the individual residues can be seen. But when zoomed out the reads will be represented as lines just as with the Compact setting.
  • Please note that the packed mode is special because it does not allow any editing of the read sequences and selections, and furthermore the color coding that can be specified elsewhere in the Side Panel does not take effect.

A more detailed description of the available compactness settings can be found in our online manual under this link:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Mapping_view_settings.html

 

Note that there is a shortcut for changing the compactness. Just press and hold the Alt key while you scroll using your mouse wheel or touchpad.

5.5. Why are there ambiguity codes and N's in the paired reads of my mapping?

The presence of ambiguity codes and N's in mapped reads can both relate to the use of overlapping paired reads for which there is a conflicting base, insertion or deletion between the two reads. When paired reads overlap they will appear as one concatenated blue read, for which conflicts will be shown with an ambiguity code and insertion/deletions with an N.

 

Ambiguity code in mapped reads:

An ambiguity code will be assigned to any base where the two members of a pair of reads overlap, but do not agree on the base call.

N's in mapped reads:

An N will be assigned to any position where the two members of a pair of reads overlap, but one of the reads has an insertion or deletion at that position.
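The two rules above can be illustrated with a small Python sketch. It uses the standard IUPAC two-base ambiguity codes and assumes the overlapping parts of the two reads are already aligned to each other, with "-" marking a gap. How the Workbench implements this internally is not described here, so treat the function names and the gap convention as assumptions made for illustration.

```python
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}
GAP = "-"  # assumed gap symbol in the aligned overlap


def merge_position(base_fwd: str, base_rev: str) -> str:
    """Merge one overlapping position of a read pair into a single symbol."""
    base_fwd, base_rev = base_fwd.upper(), base_rev.upper()
    if base_fwd == GAP and base_rev == GAP:
        return GAP
    if GAP in (base_fwd, base_rev):
        # Insertion or deletion in only one of the two reads.
        return "N"
    # Agreeing bases map to themselves, conflicting bases to an ambiguity code.
    return IUPAC_CODES.get(frozenset((base_fwd, base_rev)), "N")


def merge_overlap(fwd: str, rev: str) -> str:
    """Merge two aligned, equal-length overlap strings position by position."""
    if len(fwd) != len(rev):
        raise ValueError("aligned overlaps must have the same length")
    return "".join(merge_position(a, b) for a, b in zip(fwd, rev))


# G/A conflict at the third position, indel at the sixth position.
print(merge_overlap("ACGTA-G", "ACATAAG"))  # ACRTANG
```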

 

 

To view the conflicts in the individual forward and reverse reads you can go to the settings panel and check the box for Disconnect paired reads. When selecting the Disconnect paired reads option in the settings panel the individual forward and reverse reads will be shown revealing the deletion, conflict, and insertion.

 

 

How is this information handled by the Workbench variant callers:

The Fixed Ploidy and Low Frequency Variant Detection tools will ignore overlapping reads that do not agree about the variant base, whereas the Basic Variant Detection tool will consider the position if only one of the reads passes the quality filter.

 

More general information about how overlapping paired reads are handled can be found on the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Detailed_information_about_overlapping_paired_reads.html

 

 

5.6. Why do I see a grey box with text saying "Too much data for rendering" when I view a mapping?

For deep mappings, whether stand-alone read mappings or reads tracks, you may see a grey box with the text "Too much data for rendering" or "Too much data for rendering. Zoom in to view data" when zoomed out.

The conditions where there is too much data to show in the view are given in more detail further down in this entry.

To view your data in detail, please zoom further in. As the amount of data covered becomes less, you will be able to see the mapping in more detail.

How to use the zoom tools is described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Zoom_selection_in_View_Area.html

 

If the main reason for wishing to view the mapping fully zoomed out is to get a general idea of coverage across the mapping, the "Create Mapping Graph" tool could be of interest. It can be found in the CLC Genomics Workbench toolbox at:

Track Tools | Graphs | Create Mapping Graph

Using that tool, one can specify the particular coverage type of interest, for example, maybe it is read coverage, or perhaps just paired read coverage, or unaligned ends coverage, etc. Then a graph of the mapping is made. As it is a graph of the type of coverage specified, it contains much less information than the full reads mapping track. This can be viewed easily when fully zoomed out.

To view one or more mapping graph tracks at once, perhaps with the reads mapping track as well, a track list could be created.

If coverage is of interest, another tool that might be of interest is the Coverage Analysis tool, described in the manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Track_lists.html

 

Conditions leading to the message "Too much data for rendering"

The message "Too much data for rendering" on top of the grey box appears when:

  • You are working with track-based mappings and there are more than 500,000 reads in the (visible) interval you can see in the Workbench. Paired reads count as 1 read for this limit.

  • You are working with a stand-alone mapping and:

    a) There are more than 200,000 reads in the interval you can see in the Workbench viewing area. Paired reads count as 1 read for this limit.

    OR

    b) The region of the reference sequence in the Workbench view is longer than 200,000 bases.

The intervals counted over are 100 base pairs in both directions. To illustrate the meaning of "interval" here, if you set the text size to huge and resized the Workbench window to be very, very thin, so you could only see 10 bases of the reference, then the number of reads being considered will be those over a 210 base region.
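The conditions above can be written out as a short Python sketch. The limits and the 100-base flank are taken from the text; the function names and the "track" / "stand-alone" labels are hypothetical and only serve the illustration.

```python
FLANK_BASES = 100                  # bases counted on each side of the visible region
TRACK_READ_LIMIT = 500_000         # reads, track-based mappings
STANDALONE_READ_LIMIT = 200_000    # reads, stand-alone mappings
STANDALONE_REGION_LIMIT = 200_000  # visible reference bases, stand-alone mappings


def counted_interval_length(visible_bases: int) -> int:
    """Length of the region over which reads are counted."""
    return visible_bases + 2 * FLANK_BASES


def too_much_data(mapping_type: str, reads_in_interval: int,
                  visible_bases: int) -> bool:
    """Return True if the conditions listed above would be met.

    reads_in_interval should count each read pair as one read.
    """
    if mapping_type == "track":
        return reads_in_interval > TRACK_READ_LIMIT
    if mapping_type == "stand-alone":
        return (reads_in_interval > STANDALONE_READ_LIMIT
                or visible_bases > STANDALONE_REGION_LIMIT)
    raise ValueError("mapping_type must be 'track' or 'stand-alone'")


print(counted_interval_length(10))                 # 210
print(too_much_data("stand-alone", 150_000, 500))  # False
print(too_much_data("stand-alone", 250_000, 500))  # True
```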

 

5.7. How can I identify low or zero coverage regions in a mapping?

If you have a read mapping in track format, you can get low or zero coverage regions annotated by first running this tool:

Toolbox | Track Tools | Graphs | Create Mapping Graph Track

followed by this tool:

Toolbox | Track Tools | Graphs | Identify Graph Threshold areas

where you can set the minimum and maximum coverage to look for. This tool will then generate an annotation track of those regions.

These tools are described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Create_Mapping_Graph.html

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Identify_Graph_Threshold_Areas.html
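To make the idea of threshold areas concrete, here is a minimal Python sketch that finds runs of positions whose coverage lies between a minimum and a maximum value. It is a simplified illustration of the concept, not the algorithm used by the Workbench tools; the function name and the 1-based coordinate convention are assumptions.

```python
def find_threshold_areas(coverage, minimum=0, maximum=0):
    """Return 1-based, inclusive (start, end) pairs for runs of positions
    whose coverage satisfies minimum <= coverage <= maximum.

    coverage is a list with one coverage value per reference position.
    """
    areas = []
    run_start = None
    for position, value in enumerate(coverage, start=1):
        inside = minimum <= value <= maximum
        if inside and run_start is None:
            run_start = position                      # a new run begins
        elif not inside and run_start is not None:
            areas.append((run_start, position - 1))   # the run just ended
            run_start = None
    if run_start is not None:
        areas.append((run_start, len(coverage)))      # run reaches the end
    return areas


coverage = [5, 3, 0, 0, 0, 7, 9, 0, 2, 0]
print(find_threshold_areas(coverage))              # [(3, 5), (8, 8), (10, 10)]
print(find_threshold_areas(coverage, 0, 3))        # [(2, 5), (8, 10)]
```

With both thresholds set to 0 only zero coverage regions are reported; raising the maximum also reports low coverage regions.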

 

Below is an image of a track list including a read mapping  track and the track outputs of the tools above, where the aim was to identify zero coverage areas in the mapping. 

 

By opening the track containing the low coverage annotations in table view, you can see information, like the coordinates of the annotations. From within a track list, you can open a particular annotation track in table view by just double clicking on the name of the track. If you open just the track, not within a track list, just click on the little icon that looks like a table at the bottom of the view.

You may also be interested in the Coverage Analysis tool. This does something somewhat different, but may still be of interest. It is described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Coverage_analysis.html

5.8. How can I export the coverage for each position in my mapping?

There are two options for exporting coverage values for every position of a reference.

Export Mapping Coverage

Available for: Genomics Workbench 7, Main Workbench 7, Biomedical Genomics Workbench 2.1 and later versions of these workbenches.

Input: Read mapping tracks, stand-alone read mappings and mapping graph tracks.

Description: This export option sends detailed coverage information to a Tab Separated Values (.tsv) file.

To export the detailed coverage information use the File | Export menu, the Export button from the Workbench toolbar, or right-click on the input file in your navigation area and select Export. Select the Mapping Coverage as export format. A description of the information exported using this tool is provided in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Read_mapping_formats.html

 

Export just coverage values to a comma separated text file

Available for: All CLC Workbenches that support read mappings.

Input: Stand-alone read mappings. (Not read mapping tracks.)

Description: This route leads to the generation of a comma delimited text file containing the coverage values for every position of a reference in a stand-alone read mapping object.

How to do this is described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Export_graph_data_points_file.html#sec:exportgraphincsvformat

If you cannot see the coverage graph in your stand-alone read mapping, then please check the viewing settings in the right hand panel of the Workbench. The Graph setting is described in our manual on this page in the *Alignment Info* section:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Mapping_view_settings.html

If you are working with a track-based mapping, please consider using the export tool and choosing the Mapping coverage option, as described above, instead.

If you really want to use the option described here to get a file containing coverage values for a read mapping track, then you must first convert it to a stand-alone object using the Convert From Tracks tool found at:

Toolbox | Track Tools | Convert From Tracks


5.9. Should I use a masked reference when working with exome or amplicon data?

Based on the reason described in the paragraph below we recommend that you map exome and amplicon reads to the full genome representing the biological source of those reads.

After mapping to the full reference genome you can restrict InDel and Structural Variant Detection, Variant Detection, and Quality reporting to the target regions.

Therefore, our exome and targeted amplicon sequencing Ready-to-use Workflows in Biomedical Genomics Workbench do not, by default, include the option for reference masking during mapping, but only during the downstream steps.

That said, there can be situations where masking is appropriate, as this can depend on the organism and amplicon design. To investigate whether masking is appropriate in your case, please run the analyses both with and without reference masking and compare the results. This is also described in our manual at:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=References_masking.html

 

The reasoning

When mapping reads generated from defined regions of the genome, for example in the case of amplicon or exome data, it is possible that there are reads in the sample set representing regions of your genome that were not among the intended targets. If this occurs, and you map your reads against a reference that has been masked so only the intended target regions are available, then the mapper will try to map all reads. This includes, of course, any reads that were generated from regions outside the intended regions.

In this case, chances are that at least some of these reads, generated from regions of the genome outside the intended regions, will map to the masked reference. That is, reads that represent source regions that are not included in the reference you provide may still map to the reference, if they map well enough (according to the parameters you provide). Such mapped reads often do not match as well as they would have to their true source region in the genome, so if the true source region had been available to map to, the reads would likely have mapped preferentially there.

5.10. How do I remove contaminant sequences from my set of reads?

There are instances when you may know that your sample includes contaminant sequence, or you may suspect there is contaminant sequence in it. In these cases, it would be preferable if downstream analysis could be performed with the contaminant removed. Here we outline the steps you can take to remove contaminant data in the Biomedical Genomics Workbench (BxWB) or CLC Genomics Workbench (GWB) when you know what the contaminating sequence is. The procedure described for BxWB can also be used in GWB, if working with Tracks.

Furthermore, we outline how contaminating sequence from an unknown organism can be removed. This section is only relevant for CLC Genomics Workbench, as it involves de novo assembly and BLAST, which are tools unique to that Workbench.

Please note that we are unable to guarantee complete removal of all contaminant sequence data.

 

Remove known contaminating sequence

To remove your contaminant sequence, we recommend that you map your reads to the full reference for the biological sample in one mapping step. This means that you map your reads against the full known reference as well as the known contaminant reference sequence. In the subsequent step you extract the reads mapping only to the known reference sequence.

Remove known contaminating sequence in Biomedical Genomics Workbench

  1. Obtain the sequence of the reference genome and the contaminating sequence in fasta format. The human reference genome can be obtained as a fasta file by downloading it through Data Management and then exporting it from the Workbench in fasta format, or by downloading it from the NCBI or ENSEMBL ftp site outside the Workbench.
  2. Import all the fasta files containing the reference for the biological sample at the same time through Import tracks to obtain a single Genome Track including all chromosomes/contigs. Save the Genome Track to a folder in your Navigation Area. In the rest of this FAQ we will refer to this Genome Track as the Combined Genome Track.
  3. Map the reads to the Combined Genome Track using the Map Reads to Reference tool. You may wish to collect the un-mapped reads for further analysis.
  4. Create a BED file with annotations covering the chromosomes of the standard reference genome, e.g. hg19 or hg38. The BED file format is described in the following link: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
  5. Import the BED file using the Import tracks tool, with the Combined Genome Track as the Reference Track.
  6. Extract the reads mapping to the chromosomes of the standard reference genome using the Extract reads based on overlap tool, with the BED track as the overlap track. This will create a new Reads Track, which has the genomic coordinates of the Combined Genome Track, but only includes the reads mapping to the standard chromosomes.
  7. Extract the reads mapping to the standard chromosomes to a new sequence list using the Extract sequences tool.

You have now obtained a sequence list including only the reads mapping to the reference of interest, which can then be used as input for a new mapping or workflow using the standard version of the reference genome of your interest.
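The BED file needed in step 4 can be written by hand or generated with a short script. The Python sketch below writes one whole-chromosome interval per line using the three required BED columns (name, 0-based start, end). The chromosome names and lengths shown are placeholders, not real hg19 or hg38 values; replace them with the names and lengths used in your own Combined Genome Track.

```python
def make_bed_lines(chromosome_lengths):
    """Return one BED line per chromosome, covering it from start to end.

    BED start coordinates are 0-based and end coordinates are exclusive,
    so a whole chromosome of length L is written as 0..L.
    """
    return [f"{name}\t0\t{length}"
            for name, length in chromosome_lengths.items()]


# Placeholder names and lengths for illustration only.
standard_chromosomes = {"chr1": 1000, "chr2": 800}

bed_text = "\n".join(make_bed_lines(standard_chromosomes)) + "\n"
print(bed_text, end="")
# To save the result, write bed_text to a file, for example:
# with open("standard_chromosomes.bed", "w") as handle:
#     handle.write(bed_text)
```

Only the chromosomes of the standard reference should be listed, so that reads mapping to the contaminant sequences are left out in step 6.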

Remove known contaminating sequence in CLC Genomics Workbench

  1. Create a reference Sequence List that includes the known contaminant sequence(s) as well as the desired reference.
  2. Map all reads to this combined reference, choosing the options to Create stand-alone read mappings and Collect un-mapped reads in the Result handling step of the wizard.
  3. When the mapping completes, open the resulting Mapping Table and select the rows corresponding to your desired reference, then click the button at the bottom of the table to Extract subset.
  4. Use the new subset mapping as input for the Extract Sequences tool and choose the option to create a new Sequence List rather than individual sequences.
  5. The resulting extracted Sequence List, together with the Sequence List of un-mapped reads, can then be used as input for downstream analysis.

Background regarding references

The reason we recommend mapping to both the contaminant and desired sequence data at the same time is as follows:

When mapping, the reference sequences used should reflect the source the sample was generated from. When mapping reads generated from potentially contaminated data, it is possible that there are reads in the sample set representing this contaminant sequence. If this occurs, and you map your reads against a reference that only includes the desired reference, it is possible contaminant reads could be mapped to your desired reference sequence because the read mapper will still try to map all reads.

Alternatively, if you map only to the contaminant reference in order to remove contaminant reads, it is possible that desired reads could map to this contaminant sequence and be incorrectly removed. Chances are that at least some of these non-contaminant, desired reads will map to the known contaminant reference. This is especially likely when the contaminant sequence is similar to the desired sequence, for example, when attempting to remove mouse DNA from a human sample. Specifically, when the sequence that the desired reads come from is not included in the reference you provide, those reads may still map to the contaminant reference, if they map well enough (according to the parameters you provide). Such mapped reads often do not match as well as they would have to their true source region, so if the true source region had been available to map to, the reads would likely have mapped preferentially there. In this example, a read that is biologically from the desired sequence but maps to the known contaminant reference would be discarded from future analysis.
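The reasoning above can be illustrated with a toy Python example. It is not the Workbench read mapper: it simply slides a read along each reference without gaps, counts mismatches, and assigns the read to the reference with the fewest mismatches, provided that number is within an allowed maximum. All names and sequences are made up for the illustration.

```python
def best_hit(read, reference):
    """Lowest mismatch count over all ungapped placements of read on reference."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        count = sum(a != b for a, b in zip(read, window))
        if best is None or count < best:
            best = count
    return best  # None if the read is longer than the reference


def assign_read(read, references, max_mismatches):
    """Name of the reference with the best hit, or None if nothing maps."""
    best_name, best_count = None, None
    for name, sequence in references.items():
        count = best_hit(read, sequence)
        if count is None or count > max_mismatches:
            continue
        if best_count is None or count < best_count:
            best_name, best_count = name, count
    return best_name


desired = "ACGTACGTTAGC"
contaminant = "ACGTACCTTAGC"   # differs from the desired sequence at one base
read = "GTACGTTA"              # truly comes from the desired sequence

# Mapping to the contaminant only: the read still maps and would be removed.
print(assign_read(read, {"contaminant": contaminant}, max_mismatches=1))
# contaminant

# Mapping to both at once: the read goes to its true source.
print(assign_read(read, {"desired": desired, "contaminant": contaminant},
                  max_mismatches=1))
# desired
```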

 

Remove unknown contaminating sequence in CLC Genomics Workbench

If you do not have a reference for the contaminant sequence, you could first try to identify what it is, and then follow the instructions in the above section. To do so, please try the following:

  1. Using very stringent parameters, Map all reads to the known reference. To increase parameter stringency, you may increase the length and similarity fraction values.
  2. In the Result handling step of the wizard choose the options to Collect un-mapped reads.
  3. When the mapping completes, use the un-mapped reads Sequence List as input for de novo assembly. In the Select mapping options step of the wizard, choose the Create simple contig sequences (fast) option. If you are uncertain of appropriate parameters for your data, you may use the defaults. 
  4. When the assembly completes, BLAST the resulting contig sequences to a database of potential contaminants, or if you are not sure of what it could be, then you may wish to BLAST to all of NR. If you have a very large number of contigs, then you may wish to select the largest ~20 contig sequences for this BLAST job.
  5. When the BLAST results return, select a few of the largest contig results from the Overview BLAST table and click the button to Open BLAST Output.
  6. For each BLAST table view, with the top hit selected, click the Download and Save button.
  7. Use these saved sequences as the contaminant reference sequences and follow the instructions above.
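If you export the contigs in fasta format and want to pick the largest ones for BLAST outside the Workbench (step 4), a minimal Python sketch could look like the following. The parser handles plain fasta text only, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Parse fasta-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)   # sequence may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def largest_contigs(records, count=20):
    """Return the `count` longest records, longest first."""
    return sorted(records, key=lambda record: len(record[1]),
                  reverse=True)[:count]


fasta_text = ">contig_1\nACGT\n>contig_2\nACGTACGTAC\nGG\n>contig_3\nAC\n"
for name, sequence in largest_contigs(parse_fasta(fasta_text), count=2):
    print(name, len(sequence))
# contig_2 12
# contig_1 4
```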

6. De novo assembly

6.1. How much memory does a de novo assembly take?

Background

The amount of memory required for any particular de novo assembly will depend on a combination of genome size, data type, data quality and data volume. In simple (and vague) terms, the larger the de Bruijn graph gets, the more memory is needed. Data quality can have an impact on the size of the graph, as can features in the data such as lots of repeats, high heterozygosity, and so on.

Our White Paper about the de novo assembly provides details of the tool as well as giving details of example assemblies, including the memory of the systems that the assemblies were run on:

http://resources.qiagenbioinformatics.com//white-papers/White_paper_on_de_novo_assembly_4.pdf

The white paper outlines the memory requirements when using the Assembly Cell. For users of the Genomics Workbench, there are three phases when running a de novo assembly where you have requested Simple Contig output: a pre-processing phase, a computational phase and a post-processing phase. The Assembly Cell de novo assembly program is the same as the computational phase of running a de novo assembly via the Genomics Workbench. There is thus some additional overhead when running a de novo assembly via the Genomics Workbench. In terms of what this means for considering memory requirements for the Workbench relative to the memory values reported in the White Paper:

  • If nothing else will be run on the machine except the de novo assembly via the Genomics Workbench, you can try adding approximately 1/4 to 1/3 of the amount of memory reported as used in the White Paper when considering the minimum amount that you would need on your system.
  • More normally, you may also wish to be able to use your machine for other small tasks, in which case you should consider adding 1/2 again, or maybe the full amount again, to the memory reported in the White Paper.

For example, if you decided that your assembly task would likely require about 24 GB of memory for the computational phase, based on the White Paper information, then you should be considering a system with 32 to 48 GB of memory.
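The rule of thumb above can be written as a small Python calculation. The multipliers come directly from the two bullet points; the function name is hypothetical, and the results are rough guides only, in line with the caveat that much depends on your data set.

```python
from fractions import Fraction


def workbench_memory_ranges(computational_gb):
    """Return suggested (low, high) system memory in GB for two situations.

    dedicated: nothing else runs on the machine (add 1/4 to 1/3 again)
    shared:    machine also used for other small tasks (add 1/2 to 1x again)
    """
    base = Fraction(computational_gb)
    dedicated = (float(base * Fraction(5, 4)), float(base * Fraction(4, 3)))
    shared = (float(base * Fraction(3, 2)), float(base * 2))
    return {"dedicated": dedicated, "shared": shared}


print(workbench_memory_ranges(24))
# {'dedicated': (30.0, 32.0), 'shared': (36.0, 48.0)}
```

For a 24 GB computational phase this gives 30 to 32 GB for a dedicated machine and 36 to 48 GB for a shared one, which spans the 32 to 48 GB range quoted in the example.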

We cannot make guarantees of course, as so much depends on your particular data set.

We outline our recommended system requirements for running the Genomics Workbench on our website, but for the reasons outlined above, this does not include any specifics about requirements for de novo assembly:

https://www.qiagenbioinformatics.com/system-requirements/

 

Suggestions

If you are running out of memory, then it may be that you do not have enough memory to run the assembly you are trying to run. However, there are some things that can improve the situation with regard to memory use. Some of the suggestions below (especially II and III) can also affect the quality of the output and the speed at which the assembly will complete.

 

Suggestion I

Are other tasks running on the same machine at the same time? This could be other de novo assemblies, other read mappings, or other jobs not related to your CLC software that require memory to run. The computational phase of the de novo assembly will assume it has access to as much memory as it needs and does not account for situations where other tasks are running at the same time. If multiple things were running on the system when the error occurred, please try running a single de novo assembly again, when nothing else is running on the system.

 
Suggestion II
 
Have you trimmed off all adapters in your sequence? This can also make a big difference to the resource demands of de novo analysis as well as to the quality of outputs.
 
Adapter trimming is covered in the manual in this section:
 
 
 
Suggestion III
 
Have you trimmed your reads for quality? Entering only high quality data can make a big difference to the resource demands of de novo analysis.
 
Quality trimming is covered in the manual in this section:
 
 
 
 
Information on how to improve your de novo assembly can be found in our best practices guides in the manual page:
 
 

6.2. Why is the progress bar for my assembly stuck at 78%?

For users of the Genomics Workbench, there are three phases when running a de novo assembly: a pre-processing phase, a computational phase and a post-processing phase. The pre- and post-processing phases are part of a Java process. The computational phase is run via a native binary.

During the computational phase, there are several stages. The final stage is that of optimizing the de Bruijn graph. Up to that stage, the computational phase is able to run on multiple cores. This final optimization stage, however, currently runs on only 1 core.

For an analysis where Simple Contig output has been chosen, you will likely see this final stage start when the Workbench progress bar is at approximately 78%. The length of this optimization phase depends greatly on the quality of the data and the complexity of the genome being assembled. This phase can take a long time, especially in comparison to the amount of time that was taken up to that point.

The progress bar is, unfortunately, not very sensitive to the progression of this particular phase of the analysis. The lack of change to the progress bar (e.g. seemingly hanging at, say, 78% or 80%) is normal and does not mean the analysis has stopped. (The percentage progress reported in the Workbench should be interpreted more as a statement about what stage of the analysis you are on than as an estimate of how much time it will take before the analysis is complete.)

Our advice would be to leave the analysis running. Assuming no other problems later on, the analysis should finish normally.

Our developers are aware of the fact that the graph optimization phase would benefit from being able to be run on multiple cores and this is something they plan to work on in future. Associated with this, we hope the progress bar will also be worked on such that it provides better feedback on the progression of the analysis.

6.3. Why do I get contigs shorter than the minimum contig length I specified?

If, when you set up your de novo assembly in the Genomics Workbench, you either chose Simple Contig output, or you chose to map your reads back to the contigs and unchecked the box labelled "Update contigs", then you should not see contigs shorter than the minimum contig length you requested.

If, however, you chose to map your reads back to the contigs and also checked the box labelled "Update contigs", then the contigs returned to you may include some that are shorter than the minimum length you specified.

This is because in this case, after the assembly of contigs is done, all contigs that meet the length restriction you set are kept, and these are passed to the read mapping tool. Then all the reads are mapped back to those contigs. The option to update the contigs means that:

  • any contigs with no reads mapping to them are thrown away, and
  • any regions with no evidence in contigs are thrown away

For the latter situation, let's say there was a long contig, and there were many reads mapping to the 5' end, and many reads mapping to the 3' end, but none for a long tract in the middle. That middle bit will get chopped out. Similarly, regions at ends with no coverage by reads would be trimmed away.

The removal of regions that no reads map back to can result in the final list of contigs generated containing members shorter than the minimum length you designated for the output of the assembly itself.
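The effect of the "Update contigs" option described above can be illustrated with a small sketch. This is not the Workbench's implementation: the function name update_contig and the per-position coverage list are assumptions made purely for illustration. It shows how a contig that passed the minimum length filter can end up as shorter pieces once regions without read coverage are removed.

```python
def update_contig(contig, coverage):
    """Return the pieces of `contig` that are covered by at least one read.

    `coverage` holds one read depth per contig position (hypothetical input).
    Runs of zero coverage are dropped, so an uncovered tract in the middle
    splits the contig and uncovered ends are trimmed away.
    """
    if len(contig) != len(coverage):
        raise ValueError("contig and coverage must have the same length")
    pieces, start = [], None
    for pos, depth in enumerate(coverage):
        if depth > 0 and start is None:
            start = pos                       # a covered stretch begins
        elif depth == 0 and start is not None:
            pieces.append(contig[start:pos])  # a covered stretch ends
            start = None
    if start is not None:
        pieces.append(contig[start:])         # covered up to the 3' end
    return pieces


# A 20 bp contig that would pass a 15 bp minimum length filter.
contig = "ACGTACGTACGTACGTACGT"
# No reads on the first 2 bases or on a 6 bp tract in the middle.
coverage = [0, 0] + [5] * 6 + [0] * 6 + [3] * 6
print(update_contig(contig, coverage))  # two 6 bp pieces, both below 15 bp
```

A contig with no mapped reads at all yields an empty list in this sketch, which corresponds to the contig being thrown away.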

If you choose to continue with the contigs that were returned to you after the mapping, and after the updating of the contigs, you can create a sublist of those meeting any new size restriction you set by using the filtering tools on the table of results, and creating a new data object containing just the results you are interested in.

Information on filtering tables can be found in our manual here:

http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=Filtering_tables.html

 

6.4. What do N characters represent in the output of my de novo assembly?

When considering submission of your genomic assembly to a public repository such as NCBI, it is important to know what the N characters in the assembly stand for. 

N characters in de novo assembly outputs can represent two things, depending on the de novo assembly parameters that were used. 

 

If the de novo assembly was performed with the option "Perform Scaffolding" turned OFF, then the N characters can represent:

  1. Positions where all the input sequencing reads themselves contained Ns.

 

If the de novo assembly was performed with the option "Perform Scaffolding" turned ON, then the N characters can represent:

  1. Positions where all the input sequencing reads themselves contained Ns.
  2. Regions between scaffolded contigs. Here, the number of Ns represents the approximate distance between contigs in the reported scaffolding.

The first option should not occur often, and it can be confirmed by checking whether there is a scaffold annotation associated with tracts of Ns in the assembly output. One way to do this is illustrated in the following figure:

 

For more information regarding how scaffolding can be used to optimize the graph using paired reads, please refer to the following manual section:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Optimization_graph_using_paired_reads.html#sec:scaffolding
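If you want to locate the tracts of Ns in an exported contig sequence outside the Workbench, a short script can list them. The following is a minimal sketch; the function name find_n_tracts is hypothetical and the sequence is a made-up example. The reported positions can then be compared against the scaffold annotations in the Workbench.

```python
import re


def find_n_tracts(sequence, min_length=1):
    """Return (start, end, length) for each run of N characters.

    Positions are 1-based and inclusive. Both upper and lower case N
    are counted.
    """
    tracts = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            tracts.append((match.start() + 1, match.end(), length))
    return tracts


example = "ACGTNNNNNACGTNACGT"
print(find_n_tracts(example))                # [(5, 9, 5), (14, 14, 1)]
print(find_n_tracts(example, min_length=2))  # [(5, 9, 5)]
```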

 

Direct export to AGP format, suitable for submission to NCBI, is available in CLC Genomics Workbench. It is described in the following manual section:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=AGP_export.html

 

To close gaps represented by Ns, we recommend using the CLC Genome Finishing Module. More information regarding this module can be found on the QIAGEN Bioinformatics webpage:

https://www.qiagenbioinformatics.com/products/clc-genome-finishing-module/

6.5. Should I run my assembly in stages?

We are aware of users who have attempted using de novo assembled contigs as input for a new de novo assembly. Rather than staging your assembly in this way, we would recommend including all high quality data in one big assembly.

 

Assembly of all sequencing data is recommended

We would recommend assembling all the data in a single assembly, so that the full graph information can be used, and the reads themselves are the representatives of what is in your sample. Detailed information for why we make this recommendation is in the Background section below.

When using a combination of high quality data along with lower quality, paired data, please consider using the latter for guidance only. This is an option presented to you in the Wizard when you are setting up your de novo analysis. When the Guidance only option is used, only the pairing distance information from those reads will be used for the scaffolding step. That is, the construction of the word table and the graph will not be based on these reads.

Assembling all sequencing data together may require a machine with a substantial amount of memory, depending on the volume of data you have. If you have very limited memory resources we would recommend performing this larger assembly on a more powerful machine. If this is absolutely not possible, then you could try one of the following options:

  1. Choose Simple contig output for your assembly, rather than mapping your reads back to your contigs. You may still not have enough memory to run a large assembly, but it may be worth trying this.

  2. If the above idea does not work, then your only choice is to try the idea of making multiple copies of contigs you already have and entering those as reads. Please keep in mind the limitations of such an approach described in the Background.

We recommend referring to our manual to learn more about how de novo assembly works in CLC software:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=De_Novo_sequencing.html

You can also refer to this same information in the pdf version of the manual, available for download here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/User_Manual.pdf

 

Background

There are two primary reasons we do not recommend using contig results as input for new assemblies.

1. Input contigs are treated as read equivalents

If you carry out an assembly step by step, it involves taking the contigs of an earlier assembly, and inputting them as "reads" into another assembly, along with sequence reads from another dataset.

The input contig is seen as a single read. If one of the data sets did not have very deep coverage of some regions, it might look like certain reads (in this case, one of your contigs from earlier assemblies) were only seen once, and thus would not be considered by the assembler to be strong evidence for a region existing. These would be dropped from the assembly. In this case, there is a good chance of losing data that you really did have evidence for in the reads from an earlier assembly.

You could decide to make multiple copies of your initial contig set, and then use those as input to a de novo assembly, along with other sequence reads to account for the point above. Apart from the fact that you have lost the graph information, duplicating contig sequences implies that you have a lot more confidence in the existence and accuracy of your contig set than perhaps is warranted. Each of the contigs you enter will be seen as a read: the contig sequence is seen as a real measure of your sample. Of course, it isn't - it's a sequence generated from a prior assembly.

 

2. Input contigs do not contain underlying assembly information

When you enter your contigs as reads, they are treated as reads - that is, they are treated as a single "true" measurement, and there is no knowledge of the coverage or graph paths that were traversed to create them. This means that you lose some information that could be important, and might be especially useful with paired data sets.
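The loss of evidence described in this Background section can be seen in a toy k-mer count. This is only a sketch of the general idea, not the CLC assembler: the function kmer_counts, the word size of 4 and the example sequence are all assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every word of length k across the input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


region = "ACGTTGCA"
k = 4

# Five original reads covering the region: each word is seen five times.
from_reads = kmer_counts([region] * 5, k)
# The same region entered as one contig: each word is seen only once.
from_contig = kmer_counts([region], k)

print(from_reads["ACGT"])   # 5
print(from_contig["ACGT"])  # 1
```

In the second case the region looks like something observed a single time, even though five reads originally supported it. Copying the contig several times raises the count, but the number then reflects how many copies were made rather than anything measured in the sample.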

6.6. How can contig coverage be included in exported fasta headers?

This FAQ page addresses how you can export a fasta file that includes coverage information along with the sequence name. This is a common requirement for downstream applications such as RAST or MG-RAST.

If you have not run the De Novo Assembly tool...

To get coverage information included in the fasta header please take the following steps:

  1. Run the De novo assembly tool with the mapping options:
    • Map reads back to the contigs (slow).
    • Update contigs.

  2. Extract the consensus sequences to a Sequence List
    • Select all rows in the mapping table by clicking a single row, then pressing Ctrl+A or ⌘+A on Mac.
    • Click the Extract Contigs button

     


  3. Export the Sequence List to a fasta format file. The average coverage will then be included in the header.

 

 

If you have already run the De Novo Assembly tool with the Create simple contig sequences (fast) option selected...

If you have chosen the option Create simple contig sequences (fast) for the De novo assembly, then you can run the mapping using the Map reads to contigs tool to get the Mapping table with the coverage information. In this case, please use the options:

  • Update contigs
  • Create stand-alone read mappings

Then Extract the contigs and export as shown in steps 2 and 3 above.

If you have NOT chosen to Update Contigs, then a button labelled Extract Consensus will be present in the Mapping table instead of the Extract Contigs button. In this case the coverage will not be included upon extraction, as then the consensus is extracted, rather than the contig sequence. In this case you will need to re-run the De novo assembly as described above.

 

Background

During the De novo assembly, coverage information regarding the read sequences contributing to each contig is not calculated. This is because contigs are built from the words, or k-mers, used to generate the de Bruijn graph, rather than from the original reads. This means that the reads that map to the contig sequences are not necessarily the original reads that contributed to that contig sequence. The coverage information produced by the steps above is the coverage resulting from mapping reads to the contig sequences.
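As a rough illustration of what an average coverage derived from a read mapping means, the sketch below divides the number of mapped read bases by the contig length. The function name and the interval representation are assumptions, and the value reported by the Workbench may be computed differently in its details.

```python
def average_coverage(contig_length, mapped_reads):
    """Average read depth over a contig.

    `mapped_reads` is a list of (start, end) positions on the contig,
    0-based with `end` exclusive (a hypothetical representation).
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    mapped_bases = sum(end - start for start, end in mapped_reads)
    return mapped_bases / contig_length


# Three 50 bp reads mapped to a 100 bp contig.
print(average_coverage(100, [(0, 50), (25, 75), (50, 100)]))  # 1.5
```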

For more information on how the De novo assembly works please see the manual as follows:

The CLC de novo assembly algorithm

 

 

 

6.7. How can I do a Hybrid Assembly of Long and Short Reads?

It is possible to do a hybrid assembly of long reads, e.g. PacBio or Oxford Nanopore, and short reads, e.g. Illumina, in two different ways using QIAGEN CLC software.

  1. Assemble long reads using the De Novo Assemble Long Reads tool, followed by polishing using short reads with the Polish with Reads tool. Both tools come with the free Long Reads Support (beta) plugin.
  2. Assemble the short reads using the De Novo Assembly tool built into the Genomics Workbench. The resulting contigs can then be joined using the long reads with the Join Contigs tool that comes with the commercial Genome Finishing Module.

The small benchmark below shows that option 1 is in general the better approach. However, if option 1 does not give good results on your data, we recommend trying option 2 instead.

 

Benchmark comparing options for hybrid assembly using QIAGEN CLC software:

AP = Alignment percentage

ANI = Average Nucleotide Identity 

Note: Join Contigs tool cannot use reads longer than 99,999 base pairs
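Before trying option 2, it can be useful to know how many of your long reads exceed this limit. The following is a minimal sketch that works on (name, sequence) pairs; the function name and the way the reads are held in memory are assumptions, and only the 99,999 base pair limit comes from the note above.

```python
MAX_JOIN_CONTIGS_READ_LENGTH = 99_999  # limit quoted in the note above


def partition_by_length(records, max_length=MAX_JOIN_CONTIGS_READ_LENGTH):
    """Split (name, sequence) pairs into usable reads and reads that are too long."""
    usable, too_long = [], []
    for name, seq in records:
        if len(seq) <= max_length:
            usable.append((name, seq))
        else:
            too_long.append((name, seq))
    return usable, too_long


# Made-up reads: one exactly at the limit, one a single base over it.
reads = [("read_a", "A" * 99_999), ("read_b", "A" * 100_000)]
usable, too_long = partition_by_length(reads)
print([name for name, _ in usable])    # ['read_a']
print([name for name, _ in too_long])  # ['read_b']
```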

 

The images below show example workflows for the two options:

 

Option 1: De Novo Assemble Long Reads and Polish with Reads Workflow

 

Option 2: De Novo Assembly Short Reads and Join Contigs using Long Reads

 

7. BLAST

7.1. Can I run a local BLAST search against multiple blast databases simultaneously?

By default, you will be offered single BLAST databases to search via the CLC Workbench BLAST interface.

By creating a simple text file outside the Workbench, and saving it in the appropriate location, you can search multiple databases in a single search. To do this:

1) Create a text file and name it with the suffix .nal if your databases were created from nucleotide sequences, or the suffix .pal if your databases were created from peptide sequences.

Attached is a file showing the type of format this file should take. Just replace the names of the databases in that file (which are in quotes), with the names of databases you wish to use. You can add as many databases to that list as you like, just as long as you keep the same type of formatting as in the example file.

2) Put this file into the same folder on your system as your databases are stored in.

Please note that steps 1 and 2 are done outside the Workbench.
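The two steps above can also be scripted. The sketch below writes an alias file using the TITLE and DBLIST lines of the NCBI BLAST alias file convention, with the database names in quotes as described in step 1. The function names, the alias name and the database names are hypothetical; replace them with the names of your own databases.

```python
from pathlib import Path


def alias_file_text(title, database_names):
    """Build the text of a BLAST alias file listing several databases."""
    dblist = " ".join(f'"{name}"' for name in database_names)
    return f"TITLE {title}\nDBLIST {dblist}\n"


def write_alias_file(folder, alias_name, database_names, nucleotide=True):
    """Write <alias_name>.nal (nucleotide) or <alias_name>.pal (peptide).

    `folder` should be the folder where the databases themselves are stored.
    """
    suffix = ".nal" if nucleotide else ".pal"
    path = Path(folder) / f"{alias_name}{suffix}"
    path.write_text(alias_file_text(alias_name, database_names))
    return path


# Hypothetical database names, shown here without writing any file.
print(alias_file_text("combined", ["db_one", "db_two"]))
```

Calling write_alias_file with the path to your database folder, for example write_alias_file("/path/to/databases", "combined", ["db_one", "db_two"]), would create combined.nal in that folder.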

Now, when you go to run a search using your Workbench, one of the databases offered to you for blastn, tblastn or tblastx should have the same name as the .nal file you created (but without the .nal suffix). If you run a blastp or blastx search, then one of the databases offered to you should have the same name as the .pal file you created (but without the .pal suffix).

7.2. Why are BLAST searches at the NCBI taking so long?

Some parameter choices can affect how long a BLAST job will take, but in the case of launching searches that will run on the NCBI Servers, the time it takes for BLAST searches to run and complete will greatly depend on how busy the NCBI BLAST Servers are. When submitting BLAST searches to the NCBI BLAST Servers using the CLC Workbenches, you will likely see better execution time during off-peak hours. (Unfortunately, it is hard to predict exactly when these hours are, but generally speaking, outside US office hours is probably better than within them.)

Our tests indicate that the time required for a search submitted from CLC Workbenches to the NCBI to complete can vary substantially and that these times do not necessarily reflect how long the same job would take if run directly at the NCBI BLAST website. The NCBI controls the availability of server time and the prioritization of BLAST jobs submitted via different routes. These are not things we have control over.

Read more in our manual about remote BLAST searches.

Read about running BLAST searches locally via the Workbench.

If you are running a BLAST search using a single sequence list with many sequences, then one way you could try speeding things up would be to split up your sequence list into a few smaller sequence lists, and then submit each of those lists as a separate BLAST search job. It may be useful to note, when deciding if this is something you wish to do, that the results for each BLAST job you launch are separate. There is currently no way to merge these outputs within the Workbench.

If you are interested in making a few sequence lists from a larger list, you can find information about this in our Frequently Asked Questions (FAQ) area here:

How can I make subsets of a Sequence List?
 

7.3. Why are my local BLAST searches taking so long?

When your BLAST search takes a long time, it is likely that it is still running as expected.
 

What can affect the speed of a local blast search?

How long a BLAST job takes to run depends on many things, including:

  • The size of the query set, i.e. how much sequence data is there, and also the nature of your query set; a few very large sequences can take longer than many smaller sequences.

  • The size of the database you are searching

  • The parameter choices, such as the number of threads you specified, the word size, expect value and number of hits to be reported.  The default number of threads specified in the Workbench is the number of cores on your machine. Please note however, beyond 4 to 6 threads, you may not see much benefit in speed. The specifics of this depend on your exact search, for example searching with many small sequences may scale better than searching with a few large sequences.

  • The type of BLAST search you are running. For example, blastx searches take much longer than blastn searches as the entire query set has to be translated in 6 frames and then a search for each of those frames is executed. The tblastx and tblastn searches will take even longer for similar reasons.

  • How busy the disks where the databases are stored are.
    In our experience running BLAST searches, via CLC software or using NCBI BLAST+ commands directly, if disks where the databases are stored are very busy, a search that takes only a few minutes at quiet times can take up to hours when demand on the disks is very high.
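To make the translation cost mentioned above concrete, the sketch below translates one nucleotide query in all six reading frames using the standard genetic code. It is an illustration of the concept only, not the code BLAST+ uses, and all function names are assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the translations for frames +1, +2, +3, -1, -2 and -3."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
```

One nucleotide query becomes six protein queries in this way, which is part of why translated searches take longer than blastn searches on the same input.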

 

Checking whether blast is still running

When you launch a local blast job on a CLC Workbench or CLC Genomics Server, an NCBI BLAST+ program is run on your local system in the background for the actual searching. If you are concerned about whether a blast job you launched from the CLC Workbench is still running, please try checking for the relevant executable among the processes your machine is running.

Checking the running processes on a system can be done using the Task Manager (Windows), Activity Monitor (Mac) or by checking the process table (Linux). The BLAST executables have the same name as the type of blast search launched. For example, if you ran a blastn search, look for a running process with the name blastn. For a blastx search, look for a running process with the name blastx.
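The same check can be scripted. The sketch below is one possible approach using only the Python standard library: it lists process names with tasklist on Windows or ps on Mac and Linux, and then looks for the BLAST program names. The function names are hypothetical, and the exact output of these system commands can vary between systems.

```python
import os
import platform
import subprocess

BLAST_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def running_blast_programs(process_names):
    """Return the sorted BLAST program names found among the process names."""
    found = set()
    for name in process_names:
        base = os.path.basename(name.strip()).lower()
        if base.endswith(".exe"):
            base = base[:-4]  # Windows image names carry an .exe suffix
        if base in BLAST_PROGRAMS:
            found.add(base)
    return sorted(found)


def list_process_names():
    """Ask the operating system for the names of all running processes."""
    if platform.system() == "Windows":
        output = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH"],
            capture_output=True, text=True, check=True,
        ).stdout
        # The first CSV field of each line is the quoted image name.
        return [line.split('","')[0].strip('"') for line in output.splitlines() if line]
    output = subprocess.run(
        ["ps", "-A", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return output.splitlines()
```

Running print(running_blast_programs(list_process_names())) while a search is in progress should list the BLAST program being used. An empty list suggests that no BLAST+ program is currently running on that machine.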

Running multiple blast searches simultaneously

For much of the time during a search, BLAST+ programs do not use all threads available. Thus, for large query sets, the overall search time for a large sequence list can sometimes be decreased by splitting the query set into several smaller sequence lists, and then launching separate blast searches so that several blast searches are running at the same time.

Considerations when running BLAST searches simultaneously 

  • There is no guarantee time will be saved using this route.
  • Each CLC Workbench BLAST job will report the results separately.
  • The number of threads per search job, defined when setting up each search, should take into account the available resources of the machine. (The number may need to be decreased per job when several jobs will be run at the same time.)
  • Each blast job being run requires its own disk I/O and memory. For example, each search requires that the database be read from disk into memory. This is a particularly important consideration when working with large databases.

When trying this route, we would generally recommend limiting the number of simultaneous blast jobs to something relatively conservative (e.g. 2 to 4 jobs or so) in the first instance, and testing the impact on performance as you increase from there.

 You can find information about how to split up sequence lists in our Frequently Asked Questions (FAQ) area here:

How can I make subsets of a Sequence List?
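If your query sequences are also available outside the Workbench, the idea of splitting a query set into a few lists of similar size can be sketched as follows. This is an illustration only: the function name is hypothetical, and balancing by total sequence length is just one reasonable choice, made here because a few very large sequences can dominate the run time.

```python
def split_sequences(records, n_parts):
    """Split (name, sequence) pairs into n_parts lists of similar total length.

    Greedy approach: take the longest sequences first and always add the
    next one to the list that currently holds the fewest bases.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    totals = [0] * n_parts
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        smallest = totals.index(min(totals))
        parts[smallest].append((name, seq))
        totals[smallest] += len(seq)
    return parts


# Made-up query set with sequences of 10, 6, 5 and 1 bases.
queries = [("a", "A" * 10), ("b", "A" * 6), ("c", "A" * 5), ("d", "A")]
for part in split_sequences(queries, 2):
    print([name for name, _ in part])
```

With the conservative starting point recommended above, n_parts would be somewhere between 2 and 4.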

7.4. Why does the progress bar report 0% when running a local blast search?

It is not uncommon for a CLC Workbench progress bar under the Processes tab to report 0% progress for most of the time the search is running.

When you launch a local blast job on a CLC Workbench or CLC Genomics Server, an NCBI BLAST+ program is run on your local system in the background for the actual searching. The NCBI BLAST+ programs do not report the details of their progress. Thus, the Workbench cannot reliably update its progress bar until the background blast task completes. For many searches, this will be very shortly before the Workbench job completes and the blast results are returned to you.

When launching a Workbench blast job using a large query set (>100 megabases at time of writing in 2018), subsets of a maximum size of 100 megabases are temporarily created and each of these subsets is searched sequentially. The results of each of these searches are merged to create a single blast report that is returned to you just before the Workbench task completes. In a case like this, the Workbench progress bar remains at 0% during the blast search of the first query subset. After it completes, the progress bar is updated and the search with the second query subset commences. The progress bar is next updated when that second blast run completes, and so on.
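The stepwise behavior of the progress bar can be pictured with a small sketch. It is not the Workbench's code: the function names are hypothetical, the way sequences are grouped is an assumption, and the progress values assume each subset counts equally, which may not match what the Workbench reports.

```python
def make_subsets(records, max_bases=100_000_000):
    """Group (name, sequence) pairs, in order, into subsets of at most max_bases.

    A single sequence longer than max_bases ends up alone in its own subset
    in this sketch.
    """
    subsets, current, current_size = [], [], 0
    for name, seq in records:
        if current and current_size + len(seq) > max_bases:
            subsets.append(current)
            current, current_size = [], 0
        current.append((name, seq))
        current_size += len(seq)
    if current:
        subsets.append(current)
    return subsets


def progress_steps(n_subsets):
    """Progress values seen if the bar only moves when a subset finishes."""
    return [round(100 * done / n_subsets) for done in range(n_subsets + 1)]


# Small made-up example with a limit of 10 bases instead of 100 megabases.
queries = [("a", "A" * 6), ("b", "A" * 5), ("c", "A" * 4)]
subsets = make_subsets(queries, max_bases=10)
print(len(subsets))                  # 2
print(progress_steps(len(subsets)))  # [0, 50, 100]
```

With a single subset, the only values are 0 and 100, which matches a progress bar that stays at 0% until the search is almost finished.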

 

 

7.5. What does the error CPU usage limit exceeded mean when I run a search at the NCBI?

The NCBI offers their web-based BLAST search service publicly, so they need to set priorities and limits. One limit they set is the amount of time any one job is allowed to run. Currently, this is about one hour of joint CPU time, as outlined on the NCBI website here:

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#sigxcpu

If you see the error "CPU usage limit exceeded" when you run a BLAST search at the NCBI, you have exceeded this limit.

 

Limits and priority settings at the NCBI are not something that we control.

7.6. How to extract BLAST hit regions?

To extract the BLAST hit region for a single hit:

  1. Go to the BLAST Graphics. In case of a Multi BLAST, open the BLAST Graphics for the query of interest.
  2. Right click the name of the hit for which you would like to extract the hit region.
  3. Select Open Copy of Hit Region in New View in the menu that appears.

The sequence for the hit region will now open in a new window of the workbench. You can then save the sequence by dragging it to the navigation area or by going to Toolbar | File | Save as.

 

 

 

To extract the BLAST hit regions for all hits:

  1. Go to the BLAST Graphics. In case of a Multi BLAST, open the BLAST Graphics for the query of interest.
  2. Then go to: Toolbox | Classical Sequence Analysis | General Sequence Analysis | Extract Sequences
  3. The open BLAST Graphics will automatically be selected in the Extract Sequences wizard.
  4. Click Next, choose to Extract to new Sequence list and finish the wizard.

You will now get a sequence list including all hit regions for the query.

 

8. RNA-seq

8.1. How can I create a Metadata Table for my RNA-Seq experiment?


This FAQ gives advice and shows examples of how to create a Metadata Table to be used as input for the Differential Expression for RNA-Seq tool. The Metadata Table also affects how results can be visualized using PCA for RNA-Seq, Create Heat Map for RNA-Seq, Create Expression Browser, and Create Venn Diagram for RNA-Seq.

It is possible to create a Metadata Table within the Workbench, but typically it will be imported from an Excel file. Thus, this FAQ will focus on generating the metadata using Excel.

The full manual chapter about Metadata can be found as follows:

Metadata

 

The FAQ includes the following sections:

 

Data association

The first column in the metadata table, also called the Key ID column, is used to identify the sample to which the metadata should be added. It therefore needs to include a unique Sample ID which can be used to recognize the sample.

The Sample ID can be an exact match of the sample name. However, in many cases, the sample name may contain extra information added by the sequencing machine, e.g. _S1_L001_R1_001, or by the Workbench, e.g. (paired). While importing metadata, both exact and partial match options for association are available. For partial matches, the sample name will be divided into parts based on delimiters, and the first whole part of the name must match the metadata entry exactly for the association to work.
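The difference between exact and partial matching can be sketched as follows. This is an illustration under stated assumptions rather than the Workbench's implementation: the set of delimiters, the function names and the sample names are all hypothetical.

```python
import re

# Assumed delimiters; the Workbench's actual set may differ.
DELIMITERS = r"[ _\-.()]+"


def first_part(sample_name):
    """Return the first whole part of a sample name after splitting on delimiters."""
    parts = [part for part in re.split(DELIMITERS, sample_name) if part]
    return parts[0] if parts else ""


def associate(sample_names, metadata_ids, partial=True):
    """Map each sample name to the matching Sample ID, or None if none matches."""
    ids = set(metadata_ids)
    matches = {}
    for name in sample_names:
        if name in ids:
            matches[name] = name              # exact match
        elif partial and first_part(name) in ids:
            matches[name] = first_part(name)  # first whole part matches
        else:
            matches[name] = None
    return matches


samples = ["Liver1_S1_L001_R1_001 (paired)", "Liver10_S2_L001_R1_001", "Kidney"]
print(associate(samples, ["Liver1", "Kidney"]))
```

In this example the first sample is associated with Liver1, while Liver10_S2_L001_R1_001 is not, because its first whole part is Liver10 rather than Liver1.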

 

 

 

Defining groups to compare

Following the Sample ID, the metadata table should contain at least one more column to describe the groups of samples to compare for detection of differential expression, e.g. Treatment vs Control.

It is possible to add as many columns as you wish, depending on the number of factors and confounding factors in the experiment. If you wish to compare groups of samples where multiple factors vary, then you also need a column describing those groups.

For more information on this please see the related FAQ:

How can I identify differential expression between groups affected by more than one factor?

 

Confounding factors

In an ideal scenario, RNA-seq samples being analyzed together should be handled similarly in terms of sample preparation and sequencing, such that the only differing factor(s) are those of interest for the study.

However, this may not always be possible in practice and therefore it is possible to control for confounding factor(s) that may affect gene expression.

To control for a confounding factor, the grouping of samples affected by the confounding factor needs to be different than the groups used to test for the main experimental factor. Thus, possible confounding factors need to be considered early during experimental design and recorded for each sample. Furthermore, there need to be at least two samples affected by a confounding factor to control for it during differential expression.

Some examples of confounding factors are listed below.

  • Paired experimental design. Gene expression may vary between individuals due to their genetic background and environment. Thus, if conducting a study where samples from the same individuals are compared, e.g. before and after treatment or different tissues from the same individual, then it is possible to control for the individual. This may also be referred to as sample pairing.
  • Batch effects during the experiment. Consider an experiment examining the gene expression differences between two plant genotypes growing in a greenhouse. Due to space limitations, not all plants can be placed close to the window, where there is more light. If one genotype is placed close to the window and the other away from the window, it would not be possible to distinguish expression differences due to 'genotype' from those due to 'position in the greenhouse'. However, if the plants are randomized/mixed in the greenhouse regardless of their genotype, it would be possible to check using a PCA plot whether the position in the greenhouse affects gene expression and then control for the effect (if there is any) in the Differential Expression for RNA-Seq tool.
  • Batch effects during wet lab procedures. In some experiments, it is not practical to extract mRNA from all samples at the same time (e.g. tumor biopsies from different patients). The extraction batch can then be recorded for each sample and treated as a confounding factor.

Not all experiments are affected by confounding factors. If no confounding factors are recorded, or no confounding factors influence the gene expression (as seen from a PCA plot), then the "While controlling for" box in the Differential Expression for RNA-Seq tool can be left blank.

 

Metadata used for visualization

The provided metadata is not only used to specify the groups to compare in Differential Expression for RNA-Seq tool. The metadata table also affects how results can be visualized using PCA for RNA-Seq, Create Heat Map for RNA-Seq, Create Expression Browser, and Create Venn Diagram for RNA-Seq. For easy interpretation of the figures, especially if they are to be used in a presentation or publication, the naming in the metadata table should be carefully considered, as the words included in the metadata will be shown in the figures. Thus, we recommend using words describing the samples rather than yes/no or numbering (1, 2, and 3) in the metadata.

Example metadata with two uninformative columns and three informative:

 

Expression browser based on each of the metadata columns B and D. The information is the same, but using words describing the samples makes the data easier to interpret:

 

 

Heatmap showing each of the five metadata columns:

 

Order of samples in the Metadata

If you wish to use the comparison options All group pairs or Across groups (ANOVA-like), the order of the samples (rows) in the metadata can be used to control the order of sample comparison. More about this can be found in the related FAQ:

How can I change the order of groups being compared in the Differential Expression for RNA-seq tool?

8.2. How can I identify differential expression between groups affected by more than one factor?

The Differential Expression for RNA-Seq tool only allows for selection of one column from the metadata for Test differential expression due to:

 

For comparing groups of samples affected by several factors an additional column describing these groups should be included in the metadata.

An example of this could be a time series experiment with the factors:

  • Treatment (No treatment, Mock Treatment and Treatment)
  • Time points (0 hours, 12 hours and 24 hours)

To compare 'Treatment 12 hours' vs 'Mock treatment 12 hours', a separate column describing both the treatment and the time should be included in the metadata:

For visualization purposes, you may also wish to include the information in separate columns.
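If the metadata is prepared with a script rather than by hand, the combined column can be generated from the separate factor columns. The sketch below is one possible way to do this; the column names, sample IDs and function names are hypothetical, and the output is plain CSV text that can be opened in Excel.

```python
import csv
import io


def add_group_column(rows, factors, new_column="Group", separator=" "):
    """Return copies of the rows with one extra column joining the factor values."""
    return [
        {**row, new_column: separator.join(row[factor] for factor in factors)}
        for row in rows
    ]


def to_csv(rows):
    """Render a list of dictionaries as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


metadata = [
    {"Sample ID": "S1", "Treatment": "Treatment", "Time": "12 hours"},
    {"Sample ID": "S2", "Treatment": "Mock treatment", "Time": "12 hours"},
    {"Sample ID": "S3", "Treatment": "No treatment", "Time": "0 hours"},
]
print(to_csv(add_group_column(metadata, ["Treatment", "Time"])))
```

The separate Treatment and Time columns are kept, so they remain available for visualization.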

 

8.3. How can I change the order of groups being compared in the Differential Expression for RNA-seq tool?

Option: All group pairs or Against control group

When running the Differential Expression for RNA-Seq tool with the option All group pairs or Against control group, the order of the comparison is study vs. control.

Hence, the Fold Change value in a statistical comparison track will tell you how expression levels in the study (the group listed before vs.) compare to those in the control (the group listed after vs.).

  • If expression values in study are twice as large as in control, the fold change will be +2.
  • If expression values in control are twice as large as in study, the fold change will be -2.
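The sign convention in the two bullets above can be written as a small function. This only illustrates the convention: the function name is hypothetical, and the fold changes reported by the Workbench come from its statistical model rather than from a simple ratio of two values.

```python
def signed_fold_change(study, control):
    """Fold change of study relative to control using the +/- convention above."""
    if study <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    if study >= control:
        return study / control
    return -(control / study)


print(signed_fold_change(20, 10))  # 2.0, study is twice as large as control
print(signed_fold_change(10, 20))  # -2.0, control is twice as large as study
```

Swapping which group is treated as study and which as control flips the sign, which is why the order of the groups matters when reading the results.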

 

The easiest way to ensure that the desired group is used as control is to choose the comparison option Against control group.


If using the option All group pairs the order of the comparison depends on the order of the input tracks selected in the tool wizard.

Thus, if selecting samples in the order 1 to 4:


 

You will get comparison against the lower numbers:

 

Whereas, if you select the samples in the wizard in reverse order from 4 to 1:

 

You get the comparison against the higher numbers:

 

Option: Across groups (ANOVA-like)

 

For the comparison option Across groups (ANOVA-like), the output is a single Statistical Comparison track called "Due to..."

 

For this comparison option, the Fold change reports the maximum pairwise fold change between any pair of groups.

 

The order of the comparison depends on the order of the input tracks selected in the tool wizard.

For CLC Genomics Workbench version 10.00 to 12.0.x and the Advanced RNA-Seq plugin (installed to CLC Genomics Workbench 9.xx), the order of comparison for Across groups (ANOVA-like) was the reverse of the order used by the option All group pairs. Hence, if the same samples were selected in the same order in the wizard for the All group pairs and Across groups (ANOVA-like) options, the fold changes reported by these two tests had opposite signs.

For CLC Genomics Workbench version 20 this was changed. The All group pairs and Across groups (ANOVA-like) comparisons in Differential Expression for RNA-Seq now compare expressions in the same direction. The direction is as explained for All group pairs.

 

Use metadata to select samples in the desired order

The metadata can be used to control the order of sample selection in the Differential Expression for RNA-Seq tool wizard. 

This can be done using the "Find in Navigation Area" option to have the data pre-selected when starting the wizard or by using the Right-click option to start the tool from the metadata.

Find in Navigation Area option:

 

Launch tool from metadata table option:

 

 

In the example below the following metadata was used to control the order of the samples for analysis with the All group pairs option:

 

 

 

 

 

8.4. Why do I get the error message "Inconsistent input: The provided metadata does not describe the input data." in the Differential Expression for RNA-Seq wizard?

The error message "Inconsistent input: The provided metadata does not describe the input data." indicates a mismatch between the selected metadata and the selected Expression Tracks.

 

 

 

The reasons for this can be:

  • A Metadata Table for a different experiment was selected
  • Expression Tracks (GE) or (TE) for a different experiment (Metadata Table) were selected
  • Data was never associated with the selected Metadata Table created within the Workbench
  • A copy of the Metadata Table was selected
  • The Metadata Table and associated data was moved to a different location
  • Association with the Metadata Table was lost by a persistence issue on the hard drive

The first thing to check is if the correct Metadata Table was selected (Section A). If the correct Metadata Table was selected, then continue to check if the Expression Tracks are associated with the Metadata Table (Section B).

 

Section A: Is the correct Metadata Table selected?

To check if a different Metadata Table was selected, please follow these steps:

  1. Open the window to browse to the Metadata Table in the Experimental design and comparisons step of the Differential Expression for RNA-Seq wizard.
  2. Right-click on the selected Metadata Table and click Show Location.

The selected Metadata Table will now be highlighted in the Navigation Area window of the wizard, revealing if the expected Metadata Table was indeed selected.

 

 

To solve the issue, navigate to the Metadata Table describing the selected input tracks.

By default, the Workbench always names the Metadata Table "Samples" on import. To avoid selecting the wrong Metadata Table, you may rename the object with a unique name for easy recognition.

 

Section B: Check if the Expression Tracks are associated with the Metadata Table

To check if the Expression Tracks are associated with the Metadata Table, please follow these steps:

  1. Open the Metadata Table.
  2. Select all rows of the table. To do this click a single row, then press Ctrl+A or ⌘+A on Mac to select all rows.
  3. Click the button labelled Find Associated Data.

 

If Expression Tracks (GE) or (TE) are associated with the Metadata Table, then continue as follows:

  1. Filter to the (GE) or (TE) tracks depending on which ones you wish to analyze.
  2. Click the button labelled Find in Navigation Area.
  3. Launch the Differential Expression for RNA-Seq tool wizard again. This time the associated Expression Tracks will be pre-selected in the wizard and the error should not show again. Hence, un-associated data must have been selected in the previous launch.

 

If you get the No data found message:

 

 

The No data found message can arise for four reasons:

  1. The Metadata Table was created within the Workbench, but data was never associated with it.

To solve the issue, continue to associate data as described in the following manual section: Associating data elements with metadata

  2. A copy of the Metadata Table was selected.

If a Metadata Table and its associated data are copied within a location, e.g. to move all data into a subfolder, then the copied data (Expression Tracks etc.) will be associated with the original Metadata Table and not with the copy.

To solve the issue, please delete the copied Metadata Table and use the original. If the original Metadata Table has been moved to the recycle bin, it can be restored. To avoid such issues in the future, please move the data instead of copying it.

Relevant manual pages regarding restoring and moving data can be found at:

  3. The Metadata Table and associated data were moved to a different location using an older version of the Workbench.

When moving data between locations, the original data is kept. This means that you are essentially doing a copy instead of a move operation. Hence, the data moved to the new file location is associated with the Metadata Table in the original Location.

Thus, if moving data to a new Location, the data needs to be re-associated with the Metadata Table. Information on how to associate data with a Metadata Table is described in the following manual section: Associating data elements with metadata

From CLC Genomics Workbench version 20.0, Metadata Tables can be moved to a new File Location while maintaining metadata associations.

  4. If you are sure data was associated with the Metadata Table and it is not a copy, then try to rebuild the index for the location. If re-indexing does not solve the issue, then you will need to associate the data again.

Information on how to rebuild the index is described in the following section of the manual for the Genomics Workbench and Server.

 

 

 

 

 

 

 

 

8.5. How can I analyze and visualize strand specific RNA-Seq data?

If your RNA-Seq data is prepared using a strand specific protocol, you can analyze it by selecting either the strand specific option Forward or Reverse in the Mapping options step of the RNA-Seq Analysis wizard. If selecting the option Forward, the forward strand for each gene will be used as reference, whereas if selecting the option Reverse, the reverse strand for each gene will be used. The choice of option depends on whether you have used a forward or reverse strand protocol. The option Both will use both strands as reference and should be applied if you have non-strand specific RNA-Seq data.

The strand specific option only applies to the annotated gene regions. In the inter-genic regions both the Plus and Minus strand of the Reference Genome Track will be used as reference.

The results from the RNA-Seq analysis can be visualized in a Track List/Genome Browser view. Here, the reads are depicted against the Plus strand in the Reference Genome Track. Thus, if you have used the option Forward, the single or paired reads will be shown mapped in their forward direction to genes on the Plus strand and in reverse complement direction to genes on the Minus strand. If the option Reverse was selected, this would be the opposite.

When working with single reads, you can, by default, see the orientation of the reads in the Reads Track, as reads mapped in the forward direction are colored green, while reads mapped in their reverse orientation are colored red. Likewise, you can by default see the orientation of paired reads in the Reads Track. Paired reads that mapped in their forward orientation will be shown in blue, while paired reads that mapped in reverse orientation will be colored light blue.
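The orientation and color rules described above can be summarized in a small lookup function. This is a sketch for illustration only; the function name and return format are hypothetical, and it simply restates the rules in this FAQ, not the Workbench's rendering logic.

```python
def read_display(strand_option, gene_strand, paired=False):
    """Return (orientation, color) for reads from a gene, following the rules above."""
    if strand_option not in ("Forward", "Reverse"):
        raise ValueError("this sketch only covers the strand specific options")
    if gene_strand not in ("Plus", "Minus"):
        raise ValueError("gene_strand must be 'Plus' or 'Minus'")
    # Forward option: reads appear forward on Plus-strand genes.
    # The Reverse option flips this relationship.
    forward = (gene_strand == "Plus") == (strand_option == "Forward")
    if paired:
        # Light blue corresponds to the Highlight reverse paired reads setting.
        color = "blue" if forward else "light blue"
    else:
        color = "green" if forward else "red"
    return ("forward" if forward else "reverse", color)


print(read_display("Forward", "Plus"))
print(read_display("Forward", "Minus", paired=True))
```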

Example screen shots with single and paired reads prepared with a strand specific protocol for the Forward strand can be found in the sections below.

 

Single reads:

 

Figure 1: Single reads mapped with the strand specific option Forward to a gene on the Plus strand. The reads are depicted in their forward orientation, which is visualized by the color green.

 

 

Figure 2: Single reads mapped with the strand specific option Forward to a gene on the Minus strand. The reads are depicted in their reverse orientation, which is visualized by the color red.

 

Figure 3: Single reads mapped with the strand specific option Forward to two overlapping genes, one on each strand. As a strand specific protocol has been used, it is now possible to know from which gene the reads originate. The green reads are from the gene on the Plus strand and the red reads are from the gene on the Minus strand.

 

 

Paired reads:

 

Figure 4: Paired reads mapped with the strand specific option Forward to a gene on the Plus strand. The paired reads are depicted in their forward orientation, which is visualized by the color blue, also when the option Highlight reverse paired reads in the side panel of the Reads Track is checked.

 

Figure 5: Paired reads mapped with the strand specific option Forward to a gene on the Minus strand. The paired reads are depicted in their reverse orientation, which can be visualized by the color light blue, when the option Highlight reverse paired reads in the side panel of the Reads Track is selected.

 

Figure 6: Paired reads mapped with the strand specific option Forward to two overlapping genes, one on each strand. As a strand specific protocol has been used, it is now possible to know from which gene the reads originate. The blue reads are from the gene on the Plus strand and the light blue reads are from the gene on the Minus strand.

 

 

 

Background


If using a strand specific protocol for creating the RNA-seq library, then the strand information will be maintained.

This information can then be used to ensure that the reads map to the gene from which they originate. This is important in cases where you have overlapping genes located on different strands or paralogous genes located on different strands of the same chromosome. Without a strand specific protocol, this would not be possible. For more information on this please see Parkhomchuk et al., 2009.

8.6. How to add additional information from the annotation tracks to my RNA-Seq results?

This FAQ includes two sections:

-----------------------

How to add additional information from the Annotation Tracks to Expression Tracks (GE) or (TE)

The feature information that is carried over from the Annotation Tracks (Gene and mRNA/RNA) to the Expression Tracks (GE and TE) depends on the information that is available in the Annotation Track. The list below shows the maximal set of information columns that can be carried over:

  • Name
  • Chromosome
  • Region
  • GeneID/Transcript ID
  • Hyperlink to available database, e.g. ENSEMBL, RefSeq, etc.
  • Biotype

In some cases there might be more relevant information in the annotation track that you wish to add to your RNA-Seq results. This can, for example, be the information from the "Description" column, which is often available from a GFF3 file, or the "Product" information that may be available if the reference originates from a GenBank file.

Additional information that is available in the Annotation Track, can be added to the Expression Track (GE and/or TE), as well as the Statistical Comparison Track, by using the Annotate with Overlap Information tool found under the following Genomics Workbench toolbox menu:

Track Tools | Annotate and Filter | Annotate with Overlap Information

The tool is described on the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Annotate_with_overlap_information.html

After annotating the Expression Track or Statistical Comparison Track with information from the Annotation Track the added columns are available to the right hand side in the table view.

 

How to add additional information from the Annotation Tracks to Expression Browser

The available feature information that is carried over from the annotation track to the Expression Browser includes:

  • Name
  • Chromosome
  • Region
  • Identifier

In some cases there might be more relevant information in the annotation track that you wish to add to your Expression Browser. This can, for example, be the information from the "Description" column, which is often available from a GFF3 file, or the "Product" information that may be available if the reference originates from a GenBank file. At the moment it is not possible to add more information from the annotation tracks to the Expression Browser directly. However, you can easily create a Generic Annotation File from the Annotation Track, and then add that to the Expression Browser.

To create a Generic Annotation File from the Annotation Track please follow the procedure described below:

  1. Click the Export button in the toolbar.
  2. Type in csv in the search field of the export wizard.
  3. Choose the option Table CSV and click Select (Figure 1).
  4. Select the Annotation Track of interest in the export wizard and click Next.
  5. Make sure to uncheck the Export all columns option in the Set parameters step of the wizard and click Next (Figure 2).
  6. Select the Name and Description columns and any other columns that you would like to add to the experiment (Figure 3). The Name column is needed for adding the description to the right feature, so this should always be included. Click Next.
  7. Choose where to save the file and click Finish.
  8. After export, open the .csv file in Notepad++ or another text editing program (Figure 4).
  9. Change the header Name to Feature ID (Figure 5) and change the other selected column names to headers listed in the manual section Generic annotation file for expression data format. Description is one of the available headers, so it does not need to be edited.
  10. Save the edits.
  11. Import the .csv file to the CLC Genomics Workbench using the Standard import.
  12. After import, the icon of the imported annotation file should look as shown in Figure 6. If the icon does not look like this, then something has gone wrong with the formatting or a header is not valid.
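The header edit in step 9 can also be scripted instead of done by hand in a text editor. The following is a minimal sketch in Python, assuming a comma-separated export with the headers `Name` and `Description` as in the example above. The sample row is hypothetical, and the delimiter and quoting of a real export should be checked before relying on it.

```python
import csv
import io


def rename_headers(csv_text, renames):
    """Return CSV text with header names replaced according to the renames mapping."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("the CSV file is empty")
    # Only the first row (the header) is changed; data rows are passed through.
    rows[0] = [renames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical exported content with one annotation row.
exported = 'Name,Description\nGENE1,"hypothetical protein, example"\n'
print(rename_headers(exported, {"Name": "Feature ID"}))
```

To apply this to a real file, read the exported file into a string, pass it to `rename_headers`, and write the result to a new .csv file for import.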

 

Figure 1: Choose to export to Table CSV.

 

Figure 2: Uncheck the option "Export all columns".

 

Figure 3: Select the columns of interest. In this example "Name" and "Description".

 

Figure 4: The exported file opened in Notepad++.

 

Figure 5: Edit the column headers so that they are valid for the Generic annotation file for expression data format.

 

Figure 6: The Generic annotation file imports as an annotation table.

 

Finally, you can create an Expression Browser choosing the newly imported Generic Annotation file as an annotation resource. How to create an Expression Browser is described in the manual as follows:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Create_Expression_Browser.html

 

Figure 7: Expression Browser view showing the "Description" as an annotation.

 

 

 

 

 

 

 

8.7. Why do I get p-values of zero from Differential Expression Analysis?

The p-value is the "probability of the observations under the null hypothesis". In case of Differential Expression Analysis the null hypothesis is the assumption of no differential expression.

The less likely the observation is under the assumption that there is no differential expression, the smaller the p-values will become. In theory, it's possible to get a p-value of precisely zero in any statistical test, if the observation is simply impossible under the null hypothesis. In practice, this is extremely rare.

The most likely reason that p-values of zero are observed in the Statistical comparison track is so-called "arithmetic underflow". This happens when extremely small (positive) numbers cannot be represented by the computer. This should not be an issue, as the "true" numbers are so small that for practical purposes, they might as well have been 0.0.

CLC Genomics Workbench uses double precision floating point variables (Double) for the P-value column in the Statistical comparison track. Floating point variables can contain a very large span of values (for Double, up to around ±1.798E+308), but values are stored with limited precision because a fixed number of bits is used.

Double type variables carry about 15 to 17 significant decimal digits. This means that there is a limit on the smallest value x > 0 that can be returned from a calculation involving Double, dependent on input values. The end result is that cells in result tables, such as the P-value column, may be automatically rounded off to the nearest representable value.

The "real" P-values in the zero-value cells are actually unknown but bounded: 1.1102230246251565E-16 > P > 0 .
 
Since it is not possible to plot a p-value of zero on a logarithmic scale, such values are rounded for the volcano plot. The default setting is 1E-16.
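The effect can be reproduced in any language that uses double precision numbers. The sketch below is a Python illustration of one way a zero can arise, namely when a tiny tail probability is obtained by subtracting a value very close to 1 from 1. It is not necessarily the calculation the Workbench performs, and the function names are hypothetical. The bound quoted above, 1.1102230246251565E-16, equals 2 to the power of -53, which is the spacing between Double values just below 1.0.

```python
import math
import sys

# Spacing between adjacent Double values just below 1.0.
HALF_EPSILON = 2.0 ** -53


def upper_tail(cdf_value):
    """Compute a p-value as 1 - CDF; tiny tails are lost if cdf_value rounds to 1.0."""
    return 1.0 - cdf_value


def volcano_y(p_value, floor=1e-16):
    """Return -log10(p), replacing a p-value of exactly zero by the floor value."""
    if p_value == 0.0:
        p_value = floor
    return -math.log10(p_value)


print(HALF_EPSILON)             # the bound quoted for zero-value cells
print(sys.float_info.max)       # largest finite Double
print(upper_tail(1.0 - 1e-17))  # the 1e-17 tail cannot be represented next to 1.0
print(volcano_y(0.0))           # plotted height of a zero p-value with the default floor
```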

The image below shows an example of a Statistical comparison track and volcano plot in a split view. The p-values represented as 0.000 in the table are rounded to 1E-16 (plotted at -log10(1E-16) = 16) and highlighted in the volcano plot:

 

8.8. How can I create a heat map for a specific list of genes?

This FAQ addresses how to create a heat map to visualize only a specific list of genes.

The FAQ is divided into three sections:

 

How to create a heat map for a specific list of genes

You can create a heat map for only a specific list of genes using the "Specify features" option in the Create Heat Map for RNA-Seq tool wizard. This functionality is also described on the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Clustering_features_samples.html

 

In the wizard you can either enter the list of gene names by typing or copying/pasting your list of gene names into the text box:

Here, any white-space characters, and ",", and ";" are accepted as separators. 
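To check how a pasted list will be split, the separator rule above can be mimicked outside the Workbench. The sketch below is a Python approximation using a regular expression; the gene names are hypothetical and the Workbench's own parsing may differ in edge cases.

```python
import re


def split_feature_names(text):
    """Split a pasted list on white space, commas and semicolons, dropping empty items."""
    return [name for name in re.split(r"[\s,;]+", text) if name]


# Hypothetical gene names using a mix of the accepted separators.
pasted = "GENE_A, GENE_B;GENE_C\nGENE_D\tGENE_E"
print(split_feature_names(pasted))
```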

OR

By specifying a filtered feature track only containing the listed genes:

 

The latter option is preferable if you wish to visualize the same set of genes on a routine basis or to save the gene list in the Navigation Area for later use. Furthermore, the feature track option can be used if the list of features contains more than 10000 characters.

For information on how to create the filtered feature track see section 3.

 

Example showing how to create a heat map visualizing differentially expressed genes for a specific GO term

In this example, 39 differentially expressed genes involved in the GO biological process cornification were identified by a Gene Set Test. We will visualize these 39 genes on a heat map.

The steps are:

  • First open the resulting table from the Gene Set Test
  • Select to show DE Genes (Names) in the side panel
  • Right click the cell with the DE Gene names and select the option to Copy Cell

  • Launch Create Heat Map for RNA-Seq tool and follow the wizard to the Set filtering step
  • Select Specify features as the filter setting
  • Paste the DE gene names in the Keep these features text box by clicking Ctrl+V or ⌘+V on Mac

  • Follow the wizard to save the heat map

 

You can now edit the layout of the heat map by setting different side panel settings:

 

How to create a filtered feature track

To create a filtered feature track please follow the steps below:

  • Open the feature track used for the RNA-Seq Analysis in the table view
  • Click the arrow in the top right corner to open the Advanced filter option
  • Set the filter to Name | is in list |
  • Type or copy/paste the list of gene names and click the button labelled "Filter" 
  • Select all rows of the table by clicking a single row, then pressing Ctrl+A or ⌘+A on Mac
  • Click the button labelled "Create Track from Selection"

9. Variant detection and reporting

9.1. Which steps should I follow to perform a resequencing analysis in the CLC Genomics Workbench?

To perform your resequencing study, you can follow the steps below:

1 Import Reference and Sequencing Reads

a) Import or Download Reference Genome

There are several options for importing a reference genome into the CLC Genomics Workbench. These options are listed below:

  • Download Genomes
  • Search for Sequences at NCBI
  • Track import to be used for import of FASTA and GFF3 or GFF2/GTF/GVF files

The easiest way to download the reference genome for selected organisms is via the 'Download Genomes' tool in the Workbench. Another option is to perform a search in the NCBI Entrez database. More information on these tools is provided in the sections below:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Download_Genomes.html

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Search_Sequences_at_NCBI.html

If your reference sequence and annotations are in separate files and the reference sequence is in FASTA file format, you will first need to import the reference sequence using the 'Import Tracks' tool. You can find this tool here: 

Import in the Toolbar | Tracks

To import reference annotations, you should again use the 'Import Tracks' tool. The annotations can be imported in GFF3 or GFF2/GTF/GVF file format. Please make sure to select the reference genome just imported from the FASTA file at the bottom of the wizard. More information can be found in our manual below:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_tracks.html

*Annotating variants with known variants from variant databases is a key concept when you are working with resequencing data. In a later step (step 10), when you have the identified variant list, you may want to annotate the variants with known variants from variant databases. Any variant track can be used as a known variants track. You can import or download the variant track from variant database resources specific for the organism that you are working with by using the 'Import Tracks' tool. You will also need to have obtained the reference sequence file relevant to the variant track in the Workbench prior to importing it.

b) Import Sequencing Reads

There are dedicated tools for importing high-throughput sequencing data into the CLC Genomics Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_high_throughput_sequencing_data.html

For example, for importing Illumina reads into the Workbench we have the Illumina importer. Please click on the Import button in the top toolbar and choose Illumina. If you have paired reads, you should select "Paired reads" in the General options. For more information on the Illumina importer, please see our manual below:

 http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Illumina.html

 

2 Trim Sequence

To remove unwanted or poor quality bases from the reads prior to mapping, you can use our 'Trim Reads' tool. This includes quality trimming, adapter trimming and length trimming. You can access the 'Trim Reads' tool from: 

Toolbox | Preparing Sequencing Data | Trim Reads

For more information please see:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Trim_Reads.html

 

3 Map Reads to Reference

In this step you will map the trimmed reads to the reference sequence. Please run the 'Map Reads to Reference' tool from:

Toolbox | Resequencing Analysis | Map Reads to Reference

Please see our manual and the subsection pages below on Map Reads to Reference.

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Map_Reads_Reference.html

 

4 InDels and Structural Variant detection

The 'InDels and Structural Variants' tool will help you to identify structural variants such as insertions, deletions, inversions, translocations and tandem duplications in read mappings. This tool relies exclusively on information derived from unaligned ends of the reads in the mappings.

The Reads track output from the 'Map Reads to Reference' tool can be used as input for the 'InDels and Structural Variants' tool, which can be accessed from:

Toolbox | Resequencing Analysis | Variant Detection | InDels and Structural Variants

More information on this tool can be found in our manual (please see the subsection pages):

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=InDels_Structural_Variants.html

 

5 Prepare Guidance Variant Track

The InDel variant track and the Structural Variant track obtained from step 4 can be combined using the 'Prepare Guidance Variant Track' tool. The tool is part of the Biomedical Genomics Analysis plugin, which needs to be installed in the Workbench before this tool can be used. Once the plugin is installed, the tool is available from:

Tools | Resequencing Analysis | Prepare Guidance Variant Track

The combined track can then be used as a guidance track to use with the Local realignment tool in the next step. 

More information about this tool is available from the link below:

http://resources.qiagenbioinformatics.com/manuals/biomedicalgenomicsanalysis/current/index.php?manual=Prepare_Guidance_Variant_Track.html

 

6 Local Realignment

Next, you can run the 'Local Realignment' tool to improve the alignments in the existing read mapping using the combined guidance track obtained in step 5. You can run this tool from:

Toolbox | Resequencing Analysis | Local Realignment

Please see our manual on Local Realignment:  
  
http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Local_Realignment.html

Using the combined InDel and Structural Variant track as guidance will locally realign the reads to include InDels up to 200 bp in the mapped part of the reads, so that such larger InDels can be called by the variant detection tools. This would otherwise not be possible, as such large InDels cannot be included in the mapped reads during the standard read mapping procedure.

 

7 Create Statistics for Target Regions (optional for targeted amplicon sequencing)

For a targeted amplicon sequencing experiment, you may run the 'QC for Targeted Sequencing' tool, which reports the performance (enrichment and specificity) of a targeted resequencing experiment. Here, you need to provide an annotation track with the target regions (e.g. an imported BED file) and a mapping file (Reads track). The tool investigates the read mapping to determine whether the targeted regions have been appropriately covered by sequencing reads, and how specifically the reads map to the targeted regions.
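If the target regions are only available as a list of coordinates, a BED file can be written with a few lines of code before import. The sketch below is a Python illustration with hypothetical region names and coordinates. It relies on the BED convention that start positions are 0-based and end positions are exclusive, so 1-based inclusive coordinates need their start reduced by one.

```python
def bed_line(chrom, start_1based, end_1based, name):
    """Format one BED line from 1-based inclusive coordinates."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1-based inclusive coordinates with start <= end")
    # BED starts are 0-based; BED ends are exclusive, so the end value is unchanged.
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


# Hypothetical target regions given as 1-based inclusive coordinates.
targets = [("chr1", 1001, 1200, "amplicon_1"), ("chr2", 501, 650, "amplicon_2")]
bed_text = "\n".join(bed_line(*t) for t in targets) + "\n"
print(bed_text)
```

The chromosome names in the file should match those of the reference genome used for the read mapping.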

The target region BED file can be imported using the 'Import Tracks' option. You can run this tool from:

Toolbox | Resequencing Analysis | QC for Targeted Sequencing

More information on the 'QC for Targeted Sequencing' tool can be found at:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=QC_Targeted_Sequencing.html

 

8 Identifying Variants/Mutations

a) Variant Detection

You can then run variant detection on your locally realigned read mapping. Please see the manual for an overview of the variant detection tools that we have in the Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Variant_Detection_tools.html

You can access the variant detection tools in the Workbench from:

Toolbox | Resequencing Analysis | Variant Detection

The variant track output is discussed here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=_variant_track_output.html

b) Identify Known Mutations from Sample Mappings

If you are simply interested in knowing whether the variants in a given list are present in your samples or not, you can use the 'Identify Known Mutations from Sample Mappings' tool.

Two types of input are required to run this tool:

  • A variant track that holds the specific variants that you wish to test for. If you want to test for variants from an external database, you need to import the variants (e.g. a VCF file) beforehand using the Import Tracks option. They will then be saved as a variant track in the Workbench Navigation Area.
  • The read mapping(s) that you wish to check for the presence (or absence) of specific variants. Please use the reads track from the Local Realignment step.  

You can run this tool from:

Toolbox | Resequencing Analysis | Identify Known Mutations from Sample Mappings

For more information on how to run the 'Identify Known Mutations from Sample Mappings' tool, please see our manual below:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Identify_Known_Mutations_from_Sample_Mappings.html

 

9 Predict Functional Consequences

a) Predict Amino Acid Changes

To predict or classify the functional impact of the variants, you can run the 'Amino Acid Changes' tool. This tool adds the HGVS nomenclature of variants within the coding regions of genes. To identify the functional impact in your identified variant list, please run the Amino Acid Changes tool from:

Toolbox | Resequencing Analysis | Functional Consequences | Amino Acid Changes

More information on this tool can be found in our manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Amino_Acid_Changes.html

b) Predict Splice Site Effect

The 'Predict Splice Site Effect' tool analyzes a variant track and determines whether the variants fall within potential splice sites. You can run this tool from:

Toolbox | Resequencing Analysis | Functional Consequences | Predict Splice Site Effect

A detailed description on this tool can be found at:  
  
http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Predict_Splice_Site_Effect.html

 

10 Annotate Variants

To annotate variants with information from databases of known variants, you can use 'Annotate from Known Variants', accessible from:

Toolbox | Resequencing | Variant annotation | Annotate from Known Variants

To know more about this tool, please see our manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Annotate_from_Known_Variants.html

 

11 Create Track List for visualization and inspection of the data

Finally, you can create a track list for easy navigation to the detected variants for visualization and inspection of the data.

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Track_lists.html

 

Workflow in CLC Genomics Workbench

To avoid going through the tool wizards for each of these tools you may wish to build a Workflow, which is a pipeline of interconnected tools. Figure 1 below shows an example of a Workflow that can be used as a variant detection pipeline.

Please see our manual and its subsection pages for more information on creating a Workflow.

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Creating_workflow.html

Figure 1: Example of a resequencing Workflow including the most important steps. More elements, inputs and outputs can be added.

If you have multiple samples, you can also run the workflow in batch, which automates the processing of multiple samples by going through the wizard steps only once. You can read about the batch function in the manual as follows:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Batch_processing.html

Starting from CLC Genomics Workbench version 12.0, you can also use various ready-to-use Workflows relevant for your NGS dataset by installing the Biomedical Genomics Analysis plugin. These Workflows were formerly part of the Biomedical Genomics Workbench. 

Detailed information about these Workflows is available in the following section:

http://resources.qiagenbioinformatics.com/manuals/biomedicalgenomicsanalysis/current/index.php?manual=Ready_to_use_workflows_descriptions_guidelines.html

NGS application specific Workflows are described below:

Whole Genome Sequencing (WGS)

Whole Exome Sequencing (WES)

Targeted amplicon sequencing (TAS)

Whole Transcriptome Sequencing (WTS)

 

 

9.2. How to add flanking sequence to variants in the CLC Genomics Workbench?

There are two different ways in the CLC Genomics Workbench to add flanking sequences to variant information. The first option adds a column to the variant table, which shows the variant including the flanking sequence. The other option creates a new sequence list of the variants with flanking sequences.

The first better shows the flanking sequence in context with the variants. If you are going to do further analysis with the sequence information, or need to export it as a sequence list, the second option might be more suitable.  In many cases, it may suit the analysis requirements to run both tools.

 

Adding flanking information to the variant table

Go to the toolbox of the CLC Genomics Workbench:

Resequencing Analysis | Annotate and Filter Variants | Annotate with Flanking Sequence

In the wizard, first choose the variant track. In the next step, choose the reference track and the flanking size that you would like to include.

This tool is described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Annotate_with_flanking_sequence.html

 

Creating a sequence list with the flanking sequence (including the variant)

Go to the toolbox of the CLC Genomics Workbench:

Classical Sequence Analysis | General Sequence Analysis | Extract Annotations

In the wizard you,

  • Select the variant track.
  • Then choose the reference sequence and the types of annotations that you would like to extract, i.e. SNV, MNV, Insertion, Deletion and/or Replacement. In this step you also choose the length of the flanking sequence. With this option, a different number of residues can be selected for the upstream and the downstream flanking sequence.
  • Finally you choose how to name the new sequences. Here you need to make sure to include annotation region and chromosome, so that you can identify the variant later.

This tool is described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Extract_Annotations.html
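The idea behind both options can be illustrated outside the Workbench with a minimal Python sketch. It is not how the tools are implemented; the function names, the 0-based coordinates and the naming pattern are assumptions made for this illustration only. It shows how separate upstream and downstream flank lengths are applied, and why the chromosome and region belong in the name of each extracted sequence.

```python
def variant_with_flanks(reference, start, end, upstream=5, downstream=5):
    """Return (upstream flank, reference allele, downstream flank).

    start/end are 0-based, end-exclusive coordinates on the reference
    (an assumption of this sketch). Flanks are clipped at the sequence ends.
    """
    left = reference[max(0, start - upstream):start]
    allele = reference[start:end]
    right = reference[end:end + downstream]
    return left, allele, right


def sequence_name(chromosome, start, end, variant_type):
    """Hypothetical naming pattern: chromosome plus 1-based region and type."""
    return f"{chromosome}:{start + 1}..{end} {variant_type}"


reference = "ACGTACGTTAGCCGATAGGCTA"
# SNV at 0-based position 8, with 4 upstream and 6 downstream residues.
left, allele, right = variant_with_flanks(reference, 8, 9, upstream=4, downstream=6)
print(sequence_name("chr1", 8, 9, "SNV"), left, allele, right)
```

Keeping the flanks and the allele as separate values mirrors the table-based option, while joining them into one string corresponds to the extracted sequence list.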

9.3. How can I get amino acid prediction information for the variants in my data?

The Workbench offers two different ways of viewing mapping data - the Stand-alone view and Tracks.

If you work with the Stand-alone view, all the relevant information for looking for sites of a potential amino acid change is included within the mapping object itself.  In this case we can predict the amino acid changes based on the CDS annotations contained in the mapping object, and thus this information is then included in the annotated (variant) table.

If, on the other hand, you work with tracks, each type of data has its own track. For example, reference sequences are held in a different track than CDS annotations, which are in their own track. Similarly, gene annotations are in a single track, variants are in a single track, and so on.  Since in this case the mapping data and the CDS annotations are held in two different objects, a second step is needed in order to find and output sites of potential amino acid changes.

This additional step is to run the Amino Acid Changes analysis tool, which can be found here:

Toolbox | Resequencing | Functional Consequences | Amino Acid Changes

How to use the tool is described in the manual following the link below:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Amino_acid_changes.html

 

Please note that if you start with a stand-alone mapping object and choose to output to both a track and table output type, then the table output will contain the amino acid change information but the track-based output will not. This is for consistency. If you wish to get a track containing both the variant information and the amino acid change information, please just run the Amino Acid Changes analysis tool, as suggested above.
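To illustrate why the CDS annotation is needed in addition to the variant itself, here is a minimal Python sketch of how an amino acid change can be derived for a single-nucleotide variant. It is an illustration only, not the Amino Acid Changes tool: it assumes a forward-strand, in-frame coding sequence, the standard genetic code and 0-based positions, and the function name is hypothetical.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_change(cds, cds_pos, alt_base):
    """Return (reference aa, 1-based codon number, variant aa) for an SNV.

    cds is the coding sequence on the forward strand, starting in frame;
    cds_pos is the 0-based position of the variant within the CDS.
    """
    codon_start = cds_pos - cds_pos % 3
    ref_codon = cds[codon_start:codon_start + 3].upper()
    offset = cds_pos - codon_start
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    return CODON_TABLE[ref_codon], codon_start // 3 + 1, CODON_TABLE[alt_codon]


# CDS encoding Met-Glu-Val; an A>T change in the second codon gives Glu2Val.
print(amino_acid_change("ATGGAAGTT", 4, "T"))
```

Without the codon frame supplied by the CDS annotation, the same variant could not be assigned to a codon, which is why the track-based route needs the extra step.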

9.4. How are the homozygous and heterozygous calls determined?

The meaning of the columns in a variant track, including Zygosity, is covered in the manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Variant_tracks.html

After variants are called by one of the variant detection tools, the "Zygosity" column in the variant output table reports if there is only one variant at that position (homozygous), or more than one (heterozygous). Thus, the status of a variant as homozygous or heterozygous in the Zygosity column reflects the parameters and filtering used when the variant calling was run, as these factors affect what will or will not be called as a variant.

In other words, even if there are several different bases in the reads at a position, if only one variant was actually called for that position, the variant zygosity value will be marked as Homozygous in the output variant table.
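The rule described above can be written down as a small Python sketch. It follows only the description in this FAQ (one called variant at a position gives Homozygous, more than one gives Heterozygous) and is not the variant caller's implementation; the input format and function name are assumptions for illustration.

```python
from collections import defaultdict


def label_zygosity(called_variants):
    """Label each site from the alleles that were actually called there.

    called_variants is an iterable of (chromosome, position, allele) tuples
    that survived variant calling and filtering.
    """
    alleles_at = defaultdict(set)
    for chromosome, position, allele in called_variants:
        alleles_at[(chromosome, position)].add(allele)
    return {
        site: "Homozygous" if len(alleles) == 1 else "Heterozygous"
        for site, alleles in alleles_at.items()
    }


calls = [("chr1", 100, "A"), ("chr1", 200, "C"), ("chr1", 200, "T")]
print(label_zygosity(calls))
```

Note that the reads themselves are not an input here: bases that were observed but not called as variants do not influence the label.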

10. Trimming

10.1. How can I find the location and orientation of my primers or adapters in the reads?

The Trim Reads tool can automatically trim the sequencing adapters from paired reads. If your single reads include adapters that you wish to trim away, this can be done with the Trim Reads tool by including a Trim Adapter List. Likewise, primers can be trimmed away using a Trim Adapter List. To create a Trim Adapter List, you need to know the location (5' or 3' end) and orientation (forward or reverse complement) of the adapter/primer. If you do not know the location and orientation of the adapter/primer, you can find out using the steps described in this FAQ.

 

Quick approach using Find in the side panel

A simple and quick approach is to use the Find option in the side panel:

  • Open the Read list in the sequence view
  • Open Find tab in the side panel
  • Enter the primer or adapter sequence
  • Click Find

This option is easy to use if the adapter/primer is present at high frequency with perfect matches. By default, the Find option searches both the positive and the negative strand. The adapter/primer should be entered in the Trim Adapter List as it is seen in the reads.
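For readers who want to check a handful of exported reads outside the Workbench, the same kind of exact search on both strands can be sketched in a few lines of Python. This is not the Find function itself; the function names and the example sequences are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_adapter(read, adapter):
    """Return sorted (start, orientation) tuples for every exact occurrence."""
    hits = []
    queries = (
        ("forward", adapter),
        ("reverse complement", reverse_complement(adapter)),
    )
    for orientation, query in queries:
        start = read.find(query)
        while start != -1:
            hits.append((start, orientation))
            start = read.find(query, start + 1)
    return sorted(hits)


adapter = "ACGTTG"  # hypothetical adapter, for illustration only
print(find_adapter("ACGTTGTTTTTTTT", adapter))  # adapter at the 5' end
print(find_adapter("TTTTTTTTCAACGT", adapter))  # reverse complement at the 3' end
```

The reported start position relative to the read length indicates whether the adapter sits at the 5' or the 3' end, and the orientation shows which form to enter in the Trim Adapter List. A palindromic adapter would be reported once per strand.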

 

 

Thorough approach using motif search

If you wish to do a more thorough search allowing mismatches in the adapter/primer sequence and annotating the presence of the adapter/primer, then you can use the motif search tool.

1) First create a subset of reads (minimum 100 reads, maximum 100,000). Creating a small subset of reads is necessary, as the tools used in the next step were not designed for large datasets.

You can create a subset using the Sample reads tool: 

Toolbox | Utility Tools | Sample reads

In the wizard please select the option to Sample an absolute number and set the Sample size to the number you wish to include. Analysis is faster on a smaller dataset, but if the adapter/primer is only present at a low frequency a larger subset is needed.

The output from the Sample reads tool is a new Read List that includes only the specified number of reads from the original Read List.
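Sampling an absolute number of reads amounts to drawing that many reads at random without replacement. A minimal Python sketch of the idea, assuming the reads are already held in a list (the function name and the seed parameter are assumptions, not part of the Sample reads tool):

```python
import random


def sample_reads(reads, sample_size, seed=None):
    """Return sample_size reads drawn at random without replacement.

    If fewer reads are available than requested, all reads are returned.
    """
    if sample_size >= len(reads):
        return list(reads)
    return random.Random(seed).sample(reads, sample_size)


reads = [f"read_{i}" for i in range(1000)]
subset = sample_reads(reads, 100, seed=1)
print(len(subset))
```

A larger sample size raises the chance of catching an adapter/primer that is only present at low frequency, at the cost of a slower search.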

 

 

2)  Next, you can look for the adapter/primer in your reads by running a motif search. This can be done using either the Motif Search tool from the Toolbox or the dynamic motif search. The advantage of the Motif Search from the Toolbox is that it allows you to account for sequencing errors in the adapter/primer and provides a table overview of the identified motifs (adapters/primers). The dynamic motif search, on the other hand, allows you to quickly add new motifs (adapters/primers) and to save the view settings, and thereby apply the motif (adapter/primer) search to other Read Lists.

Both options are described below:

2a) Motif search from the Toolbox 

  • First create a Motif List with the adapter sequences: 

File | New | Create Motif List

In the wizard you can import a fasta file with all the adapter/primer sequences in the 5' - 3' orientation. If you do not have a fasta format file with your adapters/primers, you can add each adapter sequence manually by clicking the Add button.

  • After creating the Motif List, run the Motif Search tool on the Read List with the 100 – 100,000 reads: 

    Toolbox | Classical Sequence Analysis | General Sequence Analysis | Motif Search

    Use the following parameter settings:
    • Motif search type: Motif List
    • Choose the newly created Motif List
    • Set the Accuracy (%) to somewhere between 50 and 100% depending on the accuracy you expect of your sequencing data.
    • Select Include Negative Strand
    • Make sure that the option to Add annotations to sequences is selected
  • Finish the wizard. The output of the Motif search from the Toolbox is a motif table and the input Read List updated with motif annotations.
  • View the motif annotations on the reads. To do this, make sure to select to show the Motif annotations in the side panel. If only a few motifs (adapters/primers) are found, it can be helpful to view the Read List in a split view with the motif table. In the split view you can select a row in the motif table and the view of the Read List will then jump to this position. 

    Please see the example screen shot below:

 

 

 

2b) Dynamic motif search

  • A dynamic motif can be added either by clicking the Add Motif button and then pasting the adapter/primer name and its sequence into the window that opens, or by clicking the Manage Motifs button, after which you can select a Motif List.
  • Select to Include reverse motif and to show the added adapter/primer in the side panel. 

    Please see the example screen shot below:

 

 

After identifying the location and orientation of the adapter/primer in the reads you can create a Trim Adapter List as described in the manual page linked below. From CLC Genomics Workbench 11 and onwards the adapter/primer sequence should be added in the orientation as seen on the reads, no matter if you trim on the 5' or 3' end.

Creating a new Trim adapter list
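The thorough approach in this FAQ, a search that tolerates mismatches and covers both strands, can be sketched in Python as follows. This is an illustration only and not the Motif Search tool: it assumes that accuracy means the percentage of identical positions in an ungapped window, and the function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_hits(read, motif, min_accuracy=80.0, include_negative_strand=True):
    """Return (start, strand, accuracy %) for windows meeting the threshold.

    Only substitutions are considered; gaps are not modelled in this sketch.
    """
    queries = [("+", motif)]
    if include_negative_strand:
        queries.append(("-", reverse_complement(motif)))
    hits = []
    for strand, query in queries:
        for start in range(len(read) - len(query) + 1):
            window = read[start:start + len(query)]
            matches = sum(a == b for a, b in zip(window, query))
            accuracy = 100.0 * matches / len(query)
            if accuracy >= min_accuracy:
                hits.append((start, strand, accuracy))
    return hits


motif = "GATTACAGGC"                # hypothetical primer
read = "GATTACTGGC" + "T" * 10      # primer with one mismatch at the 5' end
print(motif_hits(read, motif, min_accuracy=80.0))
```

Lowering the accuracy threshold finds more degraded copies of the adapter/primer but also more chance matches, which is why a value matched to the expected sequencing accuracy is suggested above.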

10.2. How to trim 16S rRNA primers of paired Illumina reads?

This FAQ describes how to trim the 16S rRNA primers from 2 x 301 bp Illumina paired reads prepared using a stranded protocol.

If the reads have been prepared with a stranded protocol that sequences the top strand of the 16S rRNA amplicon, then the forward primer will be present in the 5' end of Read 1 and the reverse primer in the 5' end of Read 2. An example using the primers Bakt_341F 5′-CCTACGGGNGGCWGCAG-3′ and Bakt_805R 5′-GACTACHVGGGTATCTAATCC-3′, which generate amplicons of approx. 450 bp covering variable regions 3–4, is shown below:

 

To trim such primers, you can create a Trim adapter list as shown on the image below:

 

In the Trim Adapter List, the read to trim has been specified. Specifying the read to trim allows the option to discard reads in which the expected primer is not found. This is recommended, as such reads might be of low quality or stem from contamination of the sample.
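The primers above contain degenerate IUPAC bases (N, W, H, V), so a match has to accept any of the bases each code stands for. The following Python sketch shows the principle of trimming such a primer from the 5' end and discarding reads that lack it. It is not the Trim Reads tool: it only accepts exact matches to the degenerate pattern, without the alignment scoring the Workbench uses, and the function names and example reads are made up.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_5prime_primer(read, primer):
    """Return the read without its 5' primer, or None if the primer is absent."""
    match = primer_regex(primer).match(read.upper())
    if match is None:
        return None  # corresponds to discarding the read
    return read[match.end():]


BAKT_341F = "CCTACGGGNGGCWGCAG"
BAKT_805R = "GACTACHVGGGTATCTAATCC"

print(trim_5prime_primer("CCTACGGGAGGCAGCAGTGGGGAATATT", BAKT_341F))  # Read 1
print(trim_5prime_primer("GACTACCAGGGTATCTAATCCTGTT", BAKT_805R))     # Read 2
print(trim_5prime_primer("TTTTTTTTTTTTTTTTTTTTTTTT", BAKT_341F))      # no primer
```

For paired data, a pair would typically be kept only when both reads return a trimmed sequence.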

Since the full primer sequence is found in the 5' end of the read, it can be recognized for trimming both as an internal match and as an end match. You may wish to optimize the match thresholds for your adapters. How to create a Trim Adapter List and set thresholds for when a match is found is described in the manual pages as follows:

Creating a new Trim adapter list

 

If you are unsure which protocol has been used to generate your reads, you may wish to check the location and orientation of the primers in the reads. How to do this is described in the related FAQ.

10.3. How to trim adapters from miRNA data sequenced on Illumina machine?

To trim the small RNA adapter from Illumina microRNA (miRNA) reads please create a Trim Adapter List in the following way:

  1. Go to: File | New | Trim Adapter List.
  2. Click the Add row button.
  3. Type or copy/paste the name of the adapter
  4. Type or copy/paste the sequence of the adapter from Illumina's FAQ page: https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html or custom letter https://support.illumina.com/downloads/illumina-adapter-sequences-document-1000000002694.html 1, depending on the adapter being used.
  5. Choose to trim "All Reads"
  6. Choose the action when an adapter is found "Remove the adapter and following sequence (3' trim)"
  7. For reads without adapters choose "Discard the read"
  8. Click Next
  9. Leave default options for Alignment score costs, but optimize Match thresholds for Internal or End matches according to the sequenced read length and your preferences for the specificity. Two examples are included in Figure 1. Details on the Alignment score costs and Match thresholds can be found on the manual page as follows: Creating a new Trim adapter list
  10. Click Finish.

 

Why should the Trim adapter list be created this way?

For miRNA data you will normally sequence through the miRNA and into the attached adapter sequence. If the read includes the full adapter or remnants of it, this is an indication that the read is indeed a miRNA and not mRNA, rRNA or DNA that has not been completely removed from the sample. Therefore, this kind of data is trimmed using the trimming action "Discard when not found". This option will remove the adapter when found and discard the reads for which the adapter is not found.

Since the read is sequenced from the 5' end through the miRNA sequence and into the adapter sequence, it is the 3' end which should be trimmed away. 
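The combination of a 3' trim with discarding of reads that lack the adapter can be sketched in Python as follows. This is a simplified illustration, not the Trim Reads tool: it only handles exact matches (a full adapter inside the read, or an adapter prefix running off the end of the read), and the sequences, the function name and the min_overlap parameter are hypothetical.

```python
def trim_3prime_adapter(read, adapter, min_overlap=6):
    """Remove the adapter and everything after it; return None if not found."""
    # Case 1: the complete adapter lies inside the read (internal match).
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: only the start of the adapter fits before the read ends (end match).
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return None  # corresponds to discarding the read


INSERT = "ACGATTGCATTGACCGTAGCA"    # made-up 21 nt stand-in for a miRNA
ADAPTER = "CAGTTCGGATACCTGAGCTAA"   # made-up 21 nt adapter, not an Illumina sequence

print(trim_3prime_adapter(INSERT + ADAPTER + "ACGTACGT", ADAPTER))  # internal match
print(trim_3prime_adapter(INSERT + ADAPTER[:15], ADAPTER))          # end match
print(trim_3prime_adapter(INSERT + "TTTTTTTT", ADAPTER))            # no adapter
```

The two cases correspond to the internal and end matches that are configured through the match thresholds in the Trim Adapter List.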

 

Figure 1: 36 nucleotide (nt) long read including the Small RNA v1.5 3′ Adapter (example read is from SRR038853, downloaded from SRA) and 50 nt long read including the TruSeq Small RNA Adapter (example read is from SRX1818566, downloaded from SRA)

 

If your reads are as in the 36 nt long example, the first nt are the miRNA (21 nt in this example), followed by the adapter (24 nt in this example with the Small RNA v1.5 3′ Adapter), which then extends beyond the end of the read. Hence, you need to select the option "Allow end trimming". The minimum end score should be set according to the specificity that you wish to use for recognizing an adapter. If you, for example, set it to 6, as done in our tutorial, you will allow for up to three mismatches or two gaps in cases where the miRNA is 21 nt long.

If your reads are longer, say 50 nt, then the adapter sequence will be found in the middle of the read. The first nt are the miRNA (21 nt in this example), followed by the 21 nt adapter (TruSeq Small RNA in this example) and 8 nt of sequence after the adapter. In this case you will need to select the option "Allow internal matches". The minimum internal score should be set according to the specificity that you wish to use for recognizing an adapter. If you leave the default, which is 10, up to three mismatches or two gaps are allowed for the 21 nt adapter.
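The numbers in the two examples above can be checked with a short calculation. The Python sketch below assumes that each matching base scores +1, each mismatch costs 2 and each gap costs 3, and that a gap means one adapter base is missing from the read. These values are assumptions that reproduce the figures quoted in this FAQ; please compare them with the Alignment score costs shown in your own Trim Adapter List before relying on them.

```python
def alignment_score(adapter_bases, mismatches=0, gaps=0,
                    mismatch_cost=2, gap_cost=3):
    """Score an adapter alignment: +1 per match minus mismatch and gap costs."""
    matches = adapter_bases - mismatches - gaps
    return matches - mismatch_cost * mismatches - gap_cost * gaps


# End match: 36 nt read, 21 nt miRNA, so 15 adapter bases are visible.
end_bases = 36 - 21
for mismatches in (3, 4):
    print("end, mismatches", mismatches, alignment_score(end_bases, mismatches=mismatches))
for gaps in (2, 3):
    print("end, gaps", gaps, alignment_score(end_bases, gaps=gaps))

# Internal match: the full 21 nt adapter lies inside a 50 nt read.
for mismatches in (3, 4):
    print("internal, mismatches", mismatches, alignment_score(21, mismatches=mismatches))
for gaps in (2, 3):
    print("internal, gaps", gaps, alignment_score(21, gaps=gaps))
```

Under these assumptions, three mismatches or two gaps still reach the minimum end score of 6 and the minimum internal score of 10, while four mismatches or three gaps fall below both thresholds.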

 

Our tutorial using the 36 nt long reads can be found on our webpage as follows:

https://www.qiagenbioinformatics.com/clc-genomics-workbench-resources/

 

1 Oligonucleotide sequences © 2007-2013 Illumina, Inc. All rights reserved. No sponsorship or affiliation. Link provided for convenience. QIAGEN not responsible for content at link.

11. Phylogenetic trees

11.1. How to save the decorations on a phylogenetic tree?

When decorating a phylogenetic tree using the options in the side panel, the decoration information is not saved to the file of the phylogenetic tree, as all the decoration options in the side panel are only view settings. It is, however, possible to save the view settings for a phylogenetic tree. When saving the view settings of a tree, you can select whether to apply them only to the phylogenetic tree that you are currently working with or to save them for tree views in general.

The advantages of using view settings instead of saving the decorations to the actual phylogenetic tree file are:

  • The option to save and apply different decorations to different phylogenetic trees
  • The option to save different view settings, e.g. cladogram with red diamond-shaped nodes, phylogram with bootstrap values etc., and then quickly apply these settings to different trees by clicking only two buttons
  • The option to automatically show your favorite layout using the Use as standard view settings for tree view option (described below)

 

1. To save and apply your view settings on a specific phylogenetic tree please:

  • Select the view settings that you prefer in the side panel
  • Click on the Save View... button in the lower right corner of the workbench (Figure 1)
  • Enter a name for the view settings in the text box (Figure 2)
  • Click the Save button to save the view settings
  • Click the Close button to close the window
  • Save the phylogenetic tree

After saving the view settings in this way, they will automatically be applied the next time that you open the phylogenetic tree.

 

Figure 1. Click on the Save View... button to save view settings.

 

 

Figure 2. Enter a name for the view settings and press Save.

 

 

 

2. To save your view settings on phylogenetic trees in general please:

  • Select the view settings that you prefer in the side panel
  • Click on the Save View... button in the lower right corner of the workbench (Figure 1)
  • Enter a name for the view settings in the text box
  • Check the Save for all tree views check box (Figure 3)
  • Click the Save button to save the settings
  • Click the Close button to close the window

 

Figure 3. Save the view settings for application to other trees.

 

 

 

3. To apply Saved View Settings to a phylogenetic tree please:

  • Open the phylogenetic tree
  • Click on the Save View... icon in the lower right corner of the workbench
  • Select the saved view setting to apply from the drop-down list and click the Apply button (Figure 4)
  • Click the Close button to close the window

The saved view will now be applied.

 


Figure 4. Select which saved view to apply or to revert to the CLC Standard Settings.

 

 

 

4. Customize the standard view settings for phylogenetic trees

  • Create and save view settings as described in section 2, but don't close the window
  • Select the saved view setting to apply from the drop-down list
  • Check the Use as standard view settings for a tree view check box (Figure 5)
  • Click the Apply button
  • Click the Close button to close the window

The view will now be used for all phylogenetic trees.

 

Figure 5. Customize the standard settings using the option Use as standard view settings for a tree view

 

 

 

5. To Remove Tree View Settings please:

  • Open the phylogenetic tree
  • Click on the Save View... button in the lower right corner of the workbench
  • Select the view to remove from the drop-down list in the lower part of the window
  • Click the Remove button (Figure 6)
  • Click the Close button to close the window

 


Figure 6. Select the view settings that you would like to remove again.

 

 

For more information about Saving, applying and removing saved settings please see the manual following the links below:

CLC Main Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcmainworkbench/current/index.php?manual=View_settings_Side_Panel.html

CLC Genomics Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=View_settings_Side_Panel.html

Biomedical Genomics Workbench:

http://resources.qiagenbioinformatics.com/manuals/biomedicalgenomicsworkbench/current/index.php?manual=View_settings_Side_Panel.html

 

 

For more information about phylogenetic trees, please see the manual sections linked below:

CLC Main Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcmainworkbench/current/index.php?manual=Phylogenetic_trees.html

CLC Genomics Workbench:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Phylogenetic_trees.html

 

You might also be interested in our tutorial on visualization of trees, which you can find here:

http://resources.qiagenbioinformatics.com/tutorials/Phylogeny-module-visualization-of-Trees-and-Metadata.pdf

12. Gateway Cloning

12.1. Why are my att sites not recognized by the Gateway Cloning tool?

The Workbench comes with a pre-defined list of Gateway cloning recombination sites. The cloning tool will only recognize sites found in this list. These sites and the recombination logics can be modified by downloading and editing a properties file. Information on how to do this is found in the user manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Technical_information_about_modifying_Gateway_cloning_sites.html#sec:techgatewayinfo

12.2. Why is my destination vector not being accepted by the Gateway Cloning tool?

When running the Gateway Cloning tool 'Create Expression Clone (LR)', you will select a destination vector as part of the setup.

For a destination vector to be recognized as such, it must contain the toxin ccdB in addition to the appropriate att sites. ccdB must be present as either 1) a ccdB annotation or 2) the ccdB sequence itself.

  1. If the ccdB toxin is represented by an annotation, this annotation should be named 'ccdB'. The annotation type is not important, and neither is the length of the annotation or the sequence underneath it.
  2. If the ccdB requirement is met by including the ccdB sequence in the vector, the sequence must be identical to the sequence below:

The ccdB sequence:

> ATGCAGTTTAAGGTTTACACCTATAAAAGAGAGAGCCGTTATCGTCTGTTTGTGGATGTACAGAGTGATA
TTATTGACACGCCCGGGCGACGGATGGTGATCCCCCTGGCCAGTGCACGTCTGCTGTCAGATAAAGTCTC
CCGTGAACTTTACCCGGTGGTGCATATCGGGGATGAAAGCTGGCGCATGATGACCACCGATATGGCCAGT
GTGCCGGTCTCCGTTATCGGGGAAGAAGTGGCTGATCTCAGCCACCGCGAAAATGACATCAAAAACGCCA
TTAACCTGATGTTCTGGGGAATATAA
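The two ways of meeting the ccdB requirement can be summarised in a small Python sketch that you could use to pre-check a vector before running the tool. It is not the Workbench's own check: the function name is hypothetical, att sites are not examined, only the given strand is searched, and comparing the sequence case-insensitively is an assumption. The ccdB sequence shown above would be passed in as ccdb_sequence.

```python
def meets_ccdb_requirement(vector_sequence, annotation_names, ccdb_sequence):
    """Return True if the vector has a 'ccdB' annotation or the exact ccdB sequence.

    annotation_names: names of the annotations on the vector.
    ccdb_sequence: the reference ccdB sequence, without line breaks.
    """
    has_annotation = "ccdB" in annotation_names
    has_sequence = ccdb_sequence.upper() in vector_sequence.upper()
    return has_annotation or has_sequence


# Short made-up stand-in; in practice use the full ccdB sequence listed above.
stand_in_ccdb = "ATGCAGTTTAAGG"
print(meets_ccdb_requirement("GGGG" + stand_in_ccdb + "CCCC", [], stand_in_ccdb))
print(meets_ccdb_requirement("GGGGCCCC", ["ccdB", "attR1"], stand_in_ccdb))
print(meets_ccdb_requirement("GGGGCCCC", ["attR1"], stand_in_ccdb))
```

A vector whose ccdB region differs from the reference by even a single base would fail the sequence check, in which case adding an annotation named 'ccdB' is the alternative route described above.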