2.2. How can I make subsets of a Sequence List?

One of the most commonly used data objects within the Workbench is the Sequence List. Detailed information about Sequence Lists can be found at the following manual page: 

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sequence_Lists.html

Subsets of Sequence Lists can be created in several different ways.


Create a subset of a Sequence List via the table view

  • Open the Sequence List in table view by clicking on the small icon of a table at the bottom of the view.
  • Select the rows representing the sequences you wish to include in the new Sequence List.
  • Click on the button at the bottom of the view labelled Create New Sequence List.

This will create a new Sequence List consisting only of the selected sequences. You will need to explicitly save this new Sequence List.

 

How to do this is explained in our manual here:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sequence_Lists.html

 

Create a subset of sequences by filtering for the desired sequences based on an attribute

Example 1: Using basic filtering based on text within an attribute shown in the Sequence List table view

To create a subset of sequences based on common text found within  a sequence description or any other attribute found within the Sequence List:

  • Open the Sequence List in table view, as described in the section "Create a subset of a Sequence List via the table view" above.
  • Type the text you wish to use to select the subset of sequences into the filtering text box.
  • When only the sequences of interest are in view, select all the rows of the table. To select all rows, click any row in the table, then press Ctrl+A (⌘+A on Mac).
  • Click the Create New Sequence List button found below the table.

This will create a new Sequence List consisting only of those filtered/desired sequences. You will need to explicitly save this new Sequence List.
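For readers who want to see the idea behind basic filtering outside the Workbench, the following is a minimal Python sketch. It is not how the Workbench implements its filter: the row dictionaries, the `basic_filter` function, and the case-insensitive substring matching across all columns are assumptions made for illustration only.

```python
def basic_filter(rows, text):
    """Keep rows where any attribute value contains the filter text.

    Assumption for this sketch: matching is a case-insensitive substring
    search across every column of the table.
    """
    needle = text.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Hypothetical table rows standing in for a Sequence List table view.
rows = [
    {"Name": "seq1", "Description": "Homo sapiens BRCA1", "Length": 120},
    {"Name": "seq2", "Description": "Mus musculus Brca1", "Length": 98},
    {"Name": "seq3", "Description": "Homo sapiens TP53", "Length": 75},
]

subset = basic_filter(rows, "homo sapiens")
print([row["Name"] for row in subset])  # ['seq1', 'seq3']
```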

 

Example 2: Using advanced filtering based on text within an attribute shown in the Sequence List table view

Advanced filtering also allows you to extract a sublist of sequences based on a list of sequence names, or a list of values of another attribute. This list should contain text unique to the sequences of interest, with the entries separated by new lines, commas, or semicolons.

      • Open the Sequence List in table view by clicking on the small icon of a table at the bottom of the view.
      • Expand the Filtering view at the top of the table for Advanced Filtering.
      • Select the column name from the drop down menu that represents the text in your list, such as "Name".
      • In the drop down menu to the right of the column, choose the option "is in list".
      • Copy and paste the list into the filter text box and click the Filter button.
      • Select all rows in view by clicking a single row, then pressing Ctrl+A or ⌘+A on Mac.
      • Click on the button at the bottom of the view labelled Create New Sequence List.

This will create a new Sequence List consisting only of those desired sequences. You will need to explicitly save this new Sequence List.
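The "is in list" selection can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the function names `parse_name_list` and `filter_by_names` are hypothetical, the records are made-up (name, sequence) tuples, and the sketch assumes an exact match on the chosen attribute, which may differ from how the Workbench compares values.

```python
import re


def parse_name_list(text):
    """Split pasted text on new lines, commas, or semicolons; drop blanks."""
    return [item.strip() for item in re.split(r"[\n,;]+", text) if item.strip()]


def filter_by_names(records, pasted_text):
    """Keep (name, sequence) records whose name appears in the pasted list.

    Assumption for this sketch: names must match exactly.
    """
    wanted = set(parse_name_list(pasted_text))
    return [(name, seq) for name, seq in records if name in wanted]


# Hypothetical records standing in for the sequences in a Sequence List.
records = [("seq1", "ACGT"), ("seq2", "GGCC"), ("seq3", "TTAA")]

# The list mixes all three separators mentioned above.
subset = filter_by_names(records, "seq1, seq3;seq9\n")
print(subset)  # [('seq1', 'ACGT'), ('seq3', 'TTAA')]
```

A helper like `parse_name_list` can also be used to tidy a list of names before pasting it into the filter text box.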

 

More information about filtering tables can be found in our manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Filtering_tables.html

Create a subset of sequences based on length using the Trim Reads tool


In the CLC Genomics Workbench, the Trim Reads tool can be used to create a subset of sequences that fall within a specified length range.

This tool is launched by going to:

Toolbox | NGS Core Tools | Trim Reads

The steps involved here are:

      • Specify the sequence list(s) you wish to use.
      • Uncheck all the Quality Trimming options when you reach that stage of the Wizard.
      • Ensure no Trim Adapter List is selected when you reach the Adapter trimming section of the Wizard.

        If one is listed at this stage, please click the arrow button in the bottom left of the wizard dialogue box to go back to the default settings for this, thereby deselecting the Adapter trim list.

      • The Sequence Filtering stage (image below) is the key stage for this process. Here:
        • Uncheck options in the Trim Bases section.
        • Check one or both of the Filter on Length options.
        • Enter the minimum and/or maximum length of the sequences you wish to include in your new subset of reads.

This is shown in the image below, where a minimum threshold of 75 bp and a maximum threshold of 125 bp are specified.

 

After running the Trim Reads tool in this way, a new Sequence List will be created. It will have the same name as the Sequence List you provided, with the text "trimmed" appended to it. The sequences within this new Sequence List are those from the original list that had a length within the size range you specified.

If you are working with paired reads, make sure to check the Save broken pairs option in the fifth step of the wizard. Selecting this option will generate a second Sequence List. It contains reads that fall within the specified length range but whose partner read in the pair did not, so they are now single reads.
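The length filtering and broken-pair behaviour described above can be sketched in Python as follows. This is a conceptual illustration, not the Trim Reads implementation: the function names are hypothetical, reads are represented as plain strings, and the sketch assumes that both length thresholds are inclusive.

```python
def in_range(seq, min_len=None, max_len=None):
    """Return True if the sequence length lies within the optional bounds.

    Assumption for this sketch: both bounds are inclusive.
    """
    if min_len is not None and len(seq) < min_len:
        return False
    if max_len is not None and len(seq) > max_len:
        return False
    return True


def filter_pairs_by_length(pairs, min_len=None, max_len=None):
    """Split read pairs into intact pairs and single reads from broken pairs."""
    intact, broken = [], []
    for read1, read2 in pairs:
        ok1 = in_range(read1, min_len, max_len)
        ok2 = in_range(read2, min_len, max_len)
        if ok1 and ok2:
            intact.append((read1, read2))
        elif ok1:
            broken.append(read1)
        elif ok2:
            broken.append(read2)
        # If neither read is in range, the whole pair is discarded.
    return intact, broken


# Hypothetical read pairs, using the 75 bp and 125 bp thresholds from the text.
pairs = [
    ("A" * 100, "C" * 110),  # both in range: kept as a pair
    ("A" * 100, "C" * 50),   # second read too short: first becomes a single read
    ("A" * 200, "C" * 30),   # neither in range: dropped
]
intact, broken = filter_pairs_by_length(pairs, min_len=75, max_len=125)
print(len(intact), len(broken))  # 1 1
```

In this picture, `intact` corresponds to the main output list and `broken` to the second list produced when broken pairs are saved.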

  

Create subsets of sequences based on sequence name using the Sort Sequences by Name tool

The Sort Sequences by Name tool creates multiple Sequence List subsets based on sequence names. The sorted Sequence Lists are non-redundant and, combined, they include the entirety of the original set of sequences. This differs from the other tools described here, where only a single subset is created.

The Sort Sequences by Name tool is launched in the CLC Main Workbench by going to:

Toolbox | Sequencing Data Analysis | Sort Sequences by Name

 

In the CLC Genomics Workbench, it is launched by going to:

Toolbox | Molecular Biology Tools | Sequencing Data Analysis | Sort Sequences by Name

 

In the Biomedical Genomics Workbench, it is launched by going to:

Toolbox | Sanger Sequencing | Sequencing Data Analysis | Sort Sequences by Name

 

A description of how to make use of this tool can be found in the following manual page:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sort_sequences_name.html
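
To illustrate what it means for the resulting lists to be non-redundant and, together, complete, here is a minimal Python sketch of partitioning sequences by name. The grouping rule is an assumption made for illustration (the text before the first underscore is used as the group key), and `partition_by_name` is a hypothetical function; the options the tool actually offers are described in the manual page above.

```python
from collections import defaultdict


def partition_by_name(records, delimiter="_"):
    """Group (name, sequence) records by the part of the name before the delimiter.

    Every record lands in exactly one group, so the groups do not overlap
    and together contain all of the input records.
    """
    groups = defaultdict(list)
    for name, seq in records:
        key = name.split(delimiter, 1)[0]
        groups[key].append((name, seq))
    return dict(groups)


# Hypothetical sequence names with a sample prefix.
records = [
    ("sampleA_fwd", "ACGT"),
    ("sampleB_fwd", "GGCC"),
    ("sampleA_rev", "TTAA"),
]
groups = partition_by_name(records)
print(sorted(groups))  # ['sampleA', 'sampleB']
```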

 

 

Extract a random subsample of sequences

Generate a random subset of sequences via the CLC Genomics Workbench

The Sample Reads tool can generate a random subset of sequences. Instructions for making use of this tool are found in the manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Sample_reads.html

This Sample Reads tool first became available in Genomics Workbench 7.5. Older versions of the Workbench can make use of the Sample Reads tool included within the CLC Genome Finishing Module. This is a commercial module, which can be downloaded and installed in the Workbench as a plugin. You can read more about this particular tool in the Genome Finishing Module manual:

http://resources.qiagenbioinformatics.com/manuals/clcgenomefinishing/current/User_Manual.pdf

You may download and install this plugin through the plugin manager of the Workbench. 

Generate a random subset of sequences with the CLC Assembly Cell

A random sample of sequencing reads can be selected from a larger set using the clc_sample_reads tool, distributed with the CLC Assembly Cell product. The Assembly Cell is a collection of binary executables run via the command line.

Information about the clc_sample_reads tool can be found in our online manual here:

http://resources.qiagenbioinformatics.com/manuals/clcassemblycell/current/index.php?manual=Options_clc_sample_reads.html

General information about the CLC Assembly Cell can be found here:

https://www.qiagenbioinformatics.com/products/clc-assembly-cell/
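
As a conceptual illustration of random subsampling, the following Python sketch draws a fixed number of reads without replacement. It does not reproduce the algorithm or options of the Sample Reads tool or of clc_sample_reads; the `sample_reads` function, its `seed` parameter, and the example records are assumptions for this sketch.

```python
import random


def sample_reads(records, n, seed=None):
    """Return n records chosen at random without replacement.

    The original order of the chosen records is preserved. If n is at least
    the number of records, all records are returned. A fixed seed makes the
    sample reproducible.
    """
    if n >= len(records):
        return list(records)
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in picked]


# Hypothetical reads standing in for a larger Sequence List.
records = [("read%d" % i, "ACGT") for i in range(10)]
subset = sample_reads(records, 3, seed=42)
print(len(subset))  # 3
```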
