CLC FAQ - Workflows, Batching and other Workbench utilities

Questions about tools that help in handling analyses in the Workbench, such as batching, batch renaming and Workflows

1. Running analyses in batches

1.1. How can I run a batch job with multiple libraries for each sample?

From CLC Genomics Workbench 20 it is possible to define batch units using metadata when running Workflows in batch. Information on how to do this can be found on the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Running_workflows_in_batch_mode.html#sec:batchworkflows_general_info

 

In most cases it is fastest to use metadata. It is still possible to define batch units based on folder structure, as described below in this FAQ, but this is only recommended if you have very few samples or are running individual tools one by one.

 

To define batch units based on folder structure in the Navigation Area, you need to create one folder for each batch unit plus a top folder for your experiment.

In the example below we have three samples, called A, B, and C. For each sample three libraries have been sequenced, e.g. A-1, A-2, and A-3 (Figure 1).

Figure 1: Folder structure with a top folder for the experiment and one folder for each sample.

 

This means that the contents of each folder under the top folder you choose when you start a batch analysis are considered a batch unit. In Figure 1, the three folders "Sample A", "Sample B" and "Sample C" are considered as batch units. So, for example, everything within the "Sample A" folder will be used in a given analysis run.
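
As an illustration of how folder structure defines batch units, the following minimal Python sketch groups element paths by their subfolder under the top folder. It is not Workbench code: the function name batch_units and the example paths are hypothetical, and elements placed directly in the top folder are simply skipped in this sketch.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def batch_units(top_folder, element_paths):
    """Group elements by the first subfolder below top_folder (illustrative only)."""
    top = PurePosixPath(top_folder)
    units = defaultdict(list)
    for path in element_paths:
        relative = PurePosixPath(path).relative_to(top)
        if len(relative.parts) < 2:
            # Element sits directly in the top folder; ignored in this sketch.
            continue
        # The first path component is the batch-unit folder, e.g. "Sample A".
        units[relative.parts[0]].append(relative.name)
    return dict(units)


# Hypothetical layout matching the example: three libraries per sample.
paths = [
    "Experiment/Sample A/A-1",
    "Experiment/Sample A/A-2",
    "Experiment/Sample A/A-3",
    "Experiment/Sample B/B-1",
]
print(batch_units("Experiment", paths))
```

Each key of the returned dictionary corresponds to one batch unit, and its value lists the elements that would be analyzed together.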

You can, of course, set restrictions on the data to be used as input from the batch folders. This is described in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Batch_overview.html

 

To analyze the samples in batch please follow the steps below:

  • Open the wizard and check the Batch option.
  • You may now select the top folder holding all the sample folders (Figure 2).
  • Follow the wizard as usual to set the parameters.
  • In the Result handling step you have two options when running in batch. These are: Save in input folder and Save in specified location. For the latter option there is an additional option to Create subfolders per batch unit.
  • Select the option which you find most suitable for your procedures.

 

Figure 2: The Batch option has been checked, after which the top folder can be selected.

In the example shown in Figure 3, the option Save in input folder was selected and a Reads Track has been produced for each sample. The Reads Tracks are named according to the first library, but the History tab of each Reads Track shows that all three libraries were included in the analysis.

Figure 3: The History tab of the output file shows the files included in the analysis.

 

1.2. How to import, arrange and batch analyze data from an Illumina NextSeq machine when having multiple samples in older versions of the Workbench?

This FAQ is only relevant for older versions of the Workbench. From CLC Genomics Workbench 20, data from different lanes can either be merged on import or defined as batch units for analysis using metadata.

For more information please see the following manual pages:

 

Merge lanes on import:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Illumina.html

 

Define batch units based on metadata:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Running_workflows_in_batch_mode.html

 

 

Illumina NextSeq machines have four physical lanes and produce eight fastq files per sample, i.e. four R1 and four R2 fastq files per sample.
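
The file count can be illustrated with a short Python sketch that lists the fastq file names expected for one sample. The .fastq.gz extension, the sample prefix "A_S1" and the function name are assumptions made for this example; check them against the files from your own run.

```python
from itertools import product


def expected_fastq_names(sample_prefix, lanes=4):
    """List one R1 and one R2 file name per lane (assumed naming pattern)."""
    return [
        f"{sample_prefix}_L{lane:03d}_R{read}_001.fastq.gz"
        for lane, read in product(range(1, lanes + 1), (1, 2))
    ]


names = expected_fastq_names("A_S1")  # "A_S1" is a hypothetical sample prefix
print(len(names))  # four lanes x two reads
for name in names:
    print(name)
```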

 

To batch analyze samples consisting of multiple sequence lists in CLC Genomics Workbench, the sequence lists must either be concatenated or placed in folders representing batch units. This is described in the related FAQs linked at the bottom of this FAQ.

 

In this FAQ we describe how you can use a single multiple inputs workflow to automate the folder generation and batch analysis of the samples without concatenating the sequence lists. A multiple inputs workflow can sort your sequence lists into folders based on an Excel spreadsheet describing which fastq files belong to which samples, after which each sample is analyzed in batch.

 

More information on batch launching workflows with multiple inputs can be found at the manual page:

Batch launching workflows with multiple inputs

Furthermore, we suggest using the batch rename function to keep sample names short and informative.

 

A detailed guide is found below. It is divided into the following sections:

  1. Build a workflow with multiple inputs to prepare and analyze the data
  2. Create Excel spreadsheet describing the samples
  3. Import fastq files and batch rename the resulting sequence lists
  4. Run the installed workflow in batch mode
  5. Batch rename output files from the Workflow
  6. Relevant manual pages

 

1. Build a workflow with multiple inputs to prepare and analyze the data

The workflow may include Trim Reads and QC for Sequencing Reads with individual inputs, even though the same data will be used as input for the two tools. When running in batch mode this workflow will then automatically sort and arrange samples in folders based on the Excel spreadsheet describing the samples. The workflow may also include analysis steps based on the data application, e.g. de novo assembly, resequencing, etc. In this example we use de novo assembly as the application.

With the default naming option in a workflow, an output object produced from multiple inputs is named according to the first input object. We therefore suggest using a generic name, e.g. Assembly with mapping, and then adding the sample name using batch rename after the analysis. If outputting the Trimmed Reads and Unmapped reads, we suggest collecting these into a folder, as several files are output. In that case the default name is appropriate.

To run the workflow in batch mode the workflow needs to be installed. This is done by clicking the Installation button.

 

2. Create Excel spreadsheet describing the samples

The Excel spreadsheet should include three columns:

  • The first column should include a Unique ID for the fastq files, e.g. ID incl. lane number (L001, L002, L003 and L004).
  • The second column should include a sample name for the grouping. This may be the shared ID from the fastq files or a descriptive name of the sample.
  • The third column should include the type of data. All data should have the same type, e.g. NextSeq reads.
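
For many samples, the three columns can be generated from the fastq file names instead of being typed by hand. The Python sketch below is one possible way to do this under stated assumptions: file names follow the pattern <sample>_L00x_R1_001.fastq(.gz), the first column header "ID" is hypothetical, and the table is written as CSV text, which would still need to be opened in Excel and saved as a spreadsheet.

```python
import csv
import io
import re

# Assumed file name pattern: <sample>_L001_R1_001.fastq.gz (lanes L001 to L004).
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<lane>L00[1-4])_R[12]_001\.fastq(\.gz)?$"
)


def metadata_rows(fastq_names, data_type="NextSeq reads"):
    """Return one (unique ID, sample name, type) row per sample and lane."""
    rows = {}
    for name in fastq_names:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue  # not a fastq file following the assumed pattern
        unique_id = f"{match['sample']}_{match['lane']}"
        # R1 and R2 of the same lane share one row.
        rows[unique_id] = (unique_id, match["sample"], data_type)
    return [rows[key] for key in sorted(rows)]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "Sample ID", "Type"])  # "ID" is a hypothetical header
    writer.writerows(rows)
    return buffer.getvalue()


files = [
    "A_S1_L001_R1_001.fastq.gz",
    "A_S1_L001_R2_001.fastq.gz",
    "A_S1_L002_R1_001.fastq.gz",
    "A_S1_L002_R2_001.fastq.gz",
]
print(to_csv(metadata_rows(files)))
```

R1 and R2 files from the same lane map to a single row because they end up in one paired sequence list after import.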

 

 

3. Import fastq files and batch rename the resulting sequence lists

Gather all fastq files for all samples in one folder before import to allow import of all samples in one go. You also need to create a folder for the imported files in the Workbench as this folder needs to be selected for the following batch workflow.

On import the Workbench merges fastq files into one paired sequence list for each lane. When several files are used as input for one output, the resulting object in the Workbench is named based on the first input file. Hence, the sequence list names will include R1_001, even though the lists include both the R1 and R2 reads. Furthermore, (paired) is appended to the name to indicate that the sequence list includes paired reads. The batch rename function can be used to remove _R1_001 from the names by replacing _R1_001 with nothing.

Batch renaming of all sequence lists is done in the following way:

  • Launch the batch rename tool.
  • Use the right click option to Add folder content to the batch naming and click Next.

  • In this example no objects should be excluded. Click Next.
  • Select the option Rename Elements and click Next.
  • Choose the option Replace part of name and enter: _R1_001, in the From box. Leave the To box empty.

  • Click Finish.
  • Confirm Renaming by clicking Yes.

The sequence lists are now renamed.
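
The effect of the Replace part of name step can be previewed with a small Python sketch. This only mimics the renaming on plain strings; the function name and the example names are hypothetical.

```python
def replace_part_of_name(names, old, new=""):
    """Replace the text `old` in every name with `new` (empty by default)."""
    return [name.replace(old, new) for name in names]


# Hypothetical sequence list names as they might look after import.
imported = [
    "A_S1_L001_R1_001 (paired)",
    "A_S1_L002_R1_001 (paired)",
]
for name in replace_part_of_name(imported, "_R1_001"):
    print(name)
```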

4. Run the installed workflow in batch mode

  • Right click the installed workflow and choose the option Run in batch mode.

  • Select the Excel spreadsheet describing the data.
  • Select the Folder with all the imported samples.
  • Select the Partial option for the data association.

 

  • Click Next.
  • Set Group by to Sample ID and Type to Type.
  • Choose the Type, e.g. NextSeq Reads in this example, for both the first and the second input.
  • Click Next.

  • Create or choose a folder where you want to store the workflow results.
  • Click Finish.

The output of the workflow is a subfolder for each batch unit/sample, named according to the Sample ID. Each subfolder contains:

  • Trim Reads Report
  • Graphical QC Report
  • Assembly Report
  • Assembly with mapping
  • Folder with the trimmed reads
  • Folder with the unmapped reads

 

 

5. Batch rename output files from the Workflow

The batch rename option in the Workbench can be used to add sample ID to the results.

This is done in the following way:

  • Launch Batch rename tool.
  • Select top workflow result folder and use the right click option to Add folder content (recursively) and click Next.

  • Exclude the sequence lists from the batch rename using the text: (paired)

  • Select the option Rename Elements and click Next.
  • Select the option Add text to name. Use the Shift + F1 option to see options. Choose #BR-F# to add the folder name. In this example we wish to add the sample name at the beginning and therefore we included a space after #BR-F#.

  • Click Finish
  • Confirm Renaming by clicking Yes.

The sample names from the subfolders have now been added to the beginning of the object names.
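
As a rough model of this renaming step, the Python sketch below puts the folder name and a space in front of every object name, while leaving names containing (paired) untouched. It only imitates the outcome on strings; the function name and example names are hypothetical.

```python
def add_folder_name(folder, names, exclude="(paired)"):
    """Prefix each name with the folder name unless it contains `exclude`."""
    renamed = []
    for name in names:
        if exclude in name:
            renamed.append(name)  # excluded from the batch rename
        else:
            renamed.append(f"{folder} {name}")
    return renamed


# Hypothetical content of one result subfolder named "A_S1".
outputs = ["Trim Reads Report", "Assembly with mapping", "A_S1_L001 (paired)"]
for name in add_folder_name("A_S1", outputs):
    print(name)
```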

 

 

The trimmed sequence list will be named according to the inputs, while the unmapped reads will be named according to the first input, but with the original name in brackets [].

 

6. Relevant manual pages

Relevant manual pages related to the information in this FAQ are:

1.3. How to concatenate four sequence lists of NextSeq data into one sequence list?

This FAQ relates to CLC Genomics Workbench 12.0.x and previous versions. From CLC Genomics Workbench 20, reads from different lanes can be joined on import. For more information on this please see the manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Illumina.html

 

 

Illumina NextSeq machines have four physical lanes and produce eight fastq files per sample, i.e. four R1 and four R2 fastq files.

 

If you wish to have these concatenated into one sequence list (and not four) for analysis in CLC Genomics Workbench, this can be done after import using the New Sequence list option found under:

File | New | Sequence list

 

If your sequencing center uses Illumina's bcl2fastq tool, an alternative solution is to ask the sequencing center to use the "--no-lane-splitting" option, which forces bcl2fastq to output only two fastq files per sample, i.e. one R1 and one R2 fastq file.

 

Please note that it is not necessary to concatenate the fastq files before analysis in CLC Genomics Workbench. For more information about how to analyze without concatenating, please see the related FAQ pages found at the bottom of the page.

 

If you have a large number of sequence lists that you wish to concatenate, we suggest that you use a multiple inputs workflow to automate the concatenation based on an Excel spreadsheet describing which fastq files belong to which samples.

More information on batch launching workflows with multiple inputs can be found at the manual page:

Batch launching workflows with multiple inputs

Furthermore, we suggest using the batch rename function to keep sample names short and informative.

 

A detailed guide giving an example of how this can be done is found below. It is divided into the following sections:

  1. Build a workflow with multiple inputs to concatenate the sequence lists and produce a QC report for the reads.
  2. Create Excel spreadsheet describing the samples
  3. Import fastq files and batch rename the resulting sequence lists
  4. Run the installed workflow in batch mode
  5. Batch rename output files from the Workflow
  6. Relevant manual pages

 

1. Build a workflow with multiple inputs to concatenate the sequence lists

Build a multiple input workflow including Sequence list and QC for Sequencing Reads. When running in batch mode this workflow will then automatically concatenate and arrange samples in folders based on the Excel spreadsheet describing the samples.

In this example the Graphical Report output was configured to "Graphical QC Report" to ease the batch renaming at a later stage.

 

 

2. Create Excel spreadsheet describing the samples

The Excel spreadsheet should include three columns:

  • The first column should include a Unique ID for the fastq files, e.g. ID incl. lane number (L001, L002, L003 and L004).
  • The second column should include a sample name for the grouping. This may be the shared ID from the fastq files or a descriptive name of the sample.
  • The third column should include the type of data. All data should have the same type, e.g. NextSeq reads.

 

 

3. Import fastq files and batch rename the resulting sequence lists

Gather all fastq files for all samples in one folder before import to allow import of all samples in one go. You also need to create a folder for the imported files in the Workbench as this folder needs to be selected for the following batch workflow.

On import the Workbench merges fastq files into one paired sequence list for each lane. When several files are used as input for one output, the resulting object in the Workbench is named based on the first input file. Hence, the sequence list names will include R1_001, even though the lists include both the R1 and R2 reads. Furthermore, (paired) is appended to the name to indicate that the sequence list includes paired reads. The batch rename function can be used to remove _R1_001 from the names by replacing _R1_001 with nothing.

 

Batch renaming of all sequence lists is done in the following way:

  • Launch the batch rename tool.
  • Use the right click option to Add folder content to the batch naming and click Next.

  • In this example no objects should be excluded. Click Next.
  • Select the option Rename Elements and click Next.
  • Choose the option Replace part of name and enter: _R1_001, in the From box. Leave the To box empty.

  • Click Finish.
  • Confirm Renaming by clicking Yes.

The sequence lists are now renamed.

 

 

4. Run the installed workflow in batch mode

  • Right click the installed workflow and choose the option Run in batch mode.

 

  • Select the Excel spreadsheet describing the data.
  • Select the Folder with all the imported samples.
  • Select Partial option for the data association.
  • Click Next.

  • Set Group by to Sample ID and Type to Type.
  • Choose the Type, e.g. NextSeq Reads in this example, for both the first and the second input.
  • Click Next.

 

  • Create or choose a folder where you want to store the workflow results.
  • Click Finish.

The output of the workflow is a subfolder for each batch unit/sample, named according to the Sample ID. Each subfolder contains:

  • Sequence List
  • Graphical Report

 

 

5. Batch rename output files from the Workflow

The batch rename option in the Workbench can be used to replace New Sequence List with the sample ID and to add the sample ID to the QC report.

 

To replace New Sequence List with the Sample ID:

  • Launch Batch rename tool.
  • Select top workflow result folder and use the right click option to Add folder content (recursively) and click Next.

  • Filter using the search term New to include only the New Sequence Lists.
  • Select the option Rename Elements.
  • Choose the option Replace full name and use Shift + F1 to see the options. Select #BR-F# to replace with the parent folder name.

  • Click Finish
  • Confirm Renaming by clicking Yes.

 

Repeat to append the Sample ID to the Graphical Report:

  • This time choose the option Add text to name. Choose #BR-F# to add the folder name. In this example we wish to add the sample name at the beginning and therefore we included a space after #BR-F#.

The sequence lists are now concatenated, renamed and ready for your subsequent analysis.

 

 

6. Relevant manual pages

Relevant manual pages related to the information in this FAQ are:

 

1.4. How can I keep the input sample name in the extracted consensus sequences from mappings?

The name of an extracted consensus sequence will by default be the reference name followed by "consensus". When extracting and exporting consensus sequences (e.g. to fasta format) from multiple samples mapped to the same reference, all the consensus sequences will therefore have the same name.

The image below illustrates the purpose of this FAQ, which is to replace the default name with the sample name.

 

 

 

The instructions below show steps to replace the reference sequence names on the consensus sequences with the input sample names.

 

Step 1: To name the consensus sequence extracted from a read mapping according to your input samples, you can include Extract Consensus Sequence in a Workflow. The Workflow needs to include a mapping step followed by Extract Consensus Sequence. To name the sequence list holding the consensus sequence according to the input sample, use the placeholder {input} or {2} when configuring the final output name. With {input} or {2}, the output is given the same name as the input.

More details about output names in Workflows can be found at the following manual page: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Output.html

 

 

  • To run the Workflow with multiple samples, select the Batch option
  • The samples can be selected either by selecting the folder with the samples or the individual elements.

 

 

Step 2: Use the Batch rename tool to rename the consensus sequences within the sequence lists. To rename the consensus sequences in multiple sequence lists at the same time, you can follow this approach:

  • Launch Batch rename tool.
  • Use right-click option to Add folder content or Add folder content (recursively).

  • Choose the option Rename sequences in sequence lists.

  • To replace the full name of the consensus sequence, select the option Replace full name. If you would rather add the sample name to the existing name, you can use the option Add text to name.
  • Use the Shift + F1 option to see options. Choose #BR-PE# to replace the name of the consensus sequence with that of the parent element. In this case the parent element is the sequence list that includes the consensus sequence.

  • The consensus sequences within the sequence list are now renamed

 

The batch rename tool is described in detail at the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Batch_Rename.html

 

Step 3: Export to FASTA. Once the consensus sequences have been renamed, you can export them in fasta format. The sample name will now be retained in the exported fasta header.

The consensus sequences can either be exported to multiple fasta files or to one single fasta file. If you wish to export to a single fasta file, then select the option Output a single file. If outputting a single file, you may wish to add a custom name. If leaving the default, the fasta file will be named according to the first input.
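
To show why the renaming in Step 2 matters for the export, here is a minimal Python sketch that renames each sequence to the name of its parent sequence list and then writes all sequences to one FASTA text. It models the data as plain dictionaries and is not the Workbench exporter; the function names, the sample names and the short sequences are hypothetical.

```python
def rename_to_parent(sequence_lists):
    """Give every sequence the name of the sequence list that contains it."""
    return {
        list_name: [(list_name, sequence) for _, sequence in records]
        for list_name, records in sequence_lists.items()
    }


def to_fasta(sequence_lists, width=60):
    """Write all sequences to a single FASTA-formatted string."""
    lines = []
    for records in sequence_lists.values():
        for name, sequence in records:
            lines.append(f">{name}")
            # Wrap the sequence at `width` characters per line.
            lines.extend(
                sequence[i:i + width] for i in range(0, len(sequence), width)
            )
    return "\n".join(lines) + "\n"


# Before renaming, both consensus sequences carry the same default name.
lists = {
    "Sample A": [("ref consensus", "ACGTACGT")],
    "Sample B": [("ref consensus", "ACGTACGA")],
}
print(to_fasta(rename_to_parent(lists)))
```

Without the renaming step, both headers in the combined file would read "ref consensus" and the samples could no longer be told apart.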

 

 

Export from the Workbench is described in the manual page as follows:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Data_export.html

 

NB: If using a Workbench version before CLC Genomics Workbench 12, then the batch rename function is installed as a plugin.  The Plugins manager is launched by clicking on the Plugins button in the top toolbar. Please note that to install Plugins, you need to be running your Workbench as an administrative user.

1.5. How can I trim and assemble my forward and reverse Sanger sequence for each sample in batch?

1. Trim in batch

To trim several Sanger sequences at the same time, select all sequences as input for the Trim Sequences tool. This is most easily done using the right-click option "Add folder contents" in the wizard.

The Trim Sequences tool will add annotations to the input sequences that mark the trimmed regions.

 

2. Assemble in batch

To assemble the sequences, build a Workflow containing only the Assemble Sequences element.

 

Create a metadata table in Excel that defines the two Sanger sequences to assemble.

In its simplest form, the metadata only needs to include the sequence name (or a unique prefix) and the sample name, with the two sequences from each sample sharing the same sample name.

 

However, it is also possible to add additional metadata information that you wish to store for the sequences.
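
The way such a metadata table pairs forward and reverse reads can be sketched in Python as follows. This is a simplified model and not the Workbench's matching logic; the prefixes, file names and the function name are hypothetical.

```python
from collections import defaultdict


def group_by_sample(metadata, element_names):
    """Assign each element to a sample via the first matching name prefix."""
    units = defaultdict(list)
    for name in element_names:
        for prefix, sample in metadata:
            if name.startswith(prefix):
                units[sample].append(name)
                break  # each element belongs to one batch unit only
    return dict(units)


# Two rows per sample: one for the forward and one for the reverse read.
metadata = [
    ("S1_F", "Sample 1"),
    ("S1_R", "Sample 1"),
    ("S2_F", "Sample 2"),
    ("S2_R", "Sample 2"),
]
elements = ["S1_F.ab1", "S1_R.ab1", "S2_F.ab1", "S2_R.ab1"]
print(group_by_sample(metadata, elements))
```

Each resulting group holds the two sequences that would be assembled together as one batch unit.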

 

 

To run the Assemble Sequences Workflow, follow these steps:

  • Launch the Workflow
  • Click the "Batch" option and select the folder with the sequences to assemble

 

  • In the next step select the option "Use metadata"
  • Navigate to the Excel sheet with the metadata by clicking on the folder to the right
  • Select the column to define the batch units on. In this case it is "Sample", but it can be named anything you like

  • In the next step you see an overview of the batch units

In the "Result handling" step you can choose to "Create subfolders per batch unit"

  • If this option is selected, subfolders named after the batch identifier (sample names in this example) will be created
  • The actual contig will be named based on the configuration of the workflow output name. In the example below default naming is used

In the "Result handling" step you can also choose to create a "Workflow Result Metadata". This can be used to navigate to a specific contig and save information about the sample. Additional information can be added to the "Workflow Result Metadata" after it is saved in the "Navigation Area".

 

Relevant manual pages for more details are:

 

Known limitations

  • It is currently not possible to use the Trim Sequences tool in a Workflow. This will be included in a future release of the Workbench.
  • It is currently only possible to name subfolders based on the batch identifier in the metadata. However, the output contigs can be renamed based on the folder name using the Batch Rename tool.

1.6. How can I trim and assemble my forward and reverse Sanger sequence for each sample in batch using QIAGEN CLC Genomics Workbench?

The Trim Sequences tool, meant for trimming Sanger data, cannot currently be used in a Workflow, but it will be workflow-enabled in the next major release.

If using QIAGEN CLC Genomics Workbench, a work-around to create a "Trim and Assemble Batch Workflow" in the current version (20.x) is to replace the Trim Sequences tool with the Trim Reads tool meant for NGS data.

The major difference between the Trim Reads and Trim Sequences tools is that the Trim Reads tool does not allow automatic trimming of vector sequence using the UniVec database or trimming based on saved sequences. Instead, it uses a Trim adapter list specifying the sequences to trim. Furthermore, the Trim Reads tool outputs a new sequence list from which the trimmed parts are removed, rather than adding a trim annotation to the original input sequence. The quality trimming works in the same way for the two tools.

To trim and assemble the sequences, build a Workflow containing the Trim Reads tool followed by the Assemble Sequences tool.

If primer sequences should be trimmed, configure the Trim Reads tool with a Trim adapter list.

In this example the Workflow was configured with a Trim adapter list to remove the gene specific PCR primers from the 5' end and their reverse complement from the 3' end.

 

For information about how to create a Trim adapter list please see the following manual page:

Creating a new Trim adapter list

 

For information about how to run the Workflow in batch please see the related FAQ:

How can I trim and assemble my forward and reverse Sanger sequence for each sample in batch?

2. Workflows

2.1. How can I update a workflow?

When new versions of the Workbenches are released, some of the tools that are part of workflows installed by the user or distributed with plugins may change due to the addition of new parameters, improvements or bug fixes. When this happens, the installed workflow may no longer be valid and must be updated to be able to run the workflow on the newest version of the Workbench.

When you try to run a workflow that needs to be updated, the error message "your workflow and workflow elements must be updated to run on the newest version." will be shown in the workflow wizard.

How to update workflows is described in the user manuals as follows:

CLC Genomics Workbench: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Updating_workflows.html

CLC Main Workbench: http://resources.qiagenbioinformatics.com/manuals/clcmainworkbench/current/index.php?manual=Updating_workflows.html