8.1. How can I create a Metadata Table for my RNA-Seq experiment?
This FAQ gives advice and examples on how to create a Metadata Table for use as input to the Differential Expression for RNA-Seq tool. The Metadata Table also affects how results can be visualized using PCA for RNA-Seq, Create Heat Map for RNA-Seq, Create Expression Browser, and Create Venn Diagram for RNA-Seq.
It is possible to create a Metadata Table within the Workbench, but typically it is imported from an Excel file. This FAQ therefore focuses on creating the metadata in Excel.
The full manual chapter about Metadata can be found as follows:
The FAQ includes the following sections:
- Data association
- Defining groups to compare
- Confounding factors
- Metadata used for visualization
- Order of samples in the Metadata
The first column of the metadata table, also called the Key ID column, identifies the sample to which each row of metadata should be added. It therefore needs to contain a unique Sample ID that can be used to recognize the sample.
The Sample ID can be an exact match of the sample name. However, in many cases the sample name contains extra information added by the sequencing machine, e.g. _S1_L001_R1_001, or by the Workbench, e.g. (paired). When importing metadata, both exact and partial match options are available for the association. For partial matches, the sample name is divided into parts based on delimiters, and the first whole part of the name must match the metadata entry exactly for the association to work.
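The partial matching described above can be illustrated with a short Python sketch. This is not the Workbench's implementation: the delimiter set, the function names `first_part` and `associate`, and the sample names are all assumptions made for illustration. It is only meant to show why a Sample ID such as Control1 can be associated with a sample named Control1_S1_L001_R1_001 (paired).

```python
import re

# Assumed delimiter set; the delimiters the Workbench actually uses may differ.
DELIMITERS = r"[ _\-.()]+"

def first_part(sample_name):
    """Return the first delimiter-separated part of a sample name."""
    parts = [p for p in re.split(DELIMITERS, sample_name) if p]
    return parts[0] if parts else ""

def associate(sample_names, metadata_ids, partial=True):
    """Map each sample name to a metadata Sample ID, or None if nothing matches."""
    ids = set(metadata_ids)
    result = {}
    for name in sample_names:
        if name in ids:
            # Exact match of the whole sample name.
            result[name] = name
        elif partial and first_part(name) in ids:
            # Partial match: the first whole part equals a Sample ID.
            result[name] = first_part(name)
        else:
            result[name] = None
    return result

# Hypothetical sample names and Sample IDs.
samples = [
    "Control1_S1_L001_R1_001 (paired)",
    "Treated1_S2_L001_R1_001 (paired)",
    "Treated2",
]
metadata_ids = ["Control1", "Treated1", "Treated2", "Control2"]
matches = associate(samples, metadata_ids)
```

In this sketch a Sample ID of Control would not match Control1_S1_L001_R1_001, because the first whole part of that name is Control1.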
Following the Sample ID, the metadata table should contain at least one more column to describe the groups of samples to compare for detection of differential expression, e.g. Treatment vs Control.
It is possible to add as many columns as you wish, depending on the number of factors and confounding factors in the experiment. If you wish to compare groups of samples in which multiple factors vary, you also need a column that defines those groups.
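As a rough illustration of a column that defines groups across multiple factors, the Python sketch below builds a small metadata table and adds a combined column. The sample IDs, the factor names Genotype and Treatment, the column name Group, and the helper functions are hypothetical. The result is written as CSV text, which can be opened in Excel and saved from there.

```python
import csv
import io

# Hypothetical samples and factor levels, for illustration only.
rows = [
    {"Sample ID": "S01", "Genotype": "WildType", "Treatment": "Control"},
    {"Sample ID": "S02", "Genotype": "WildType", "Treatment": "Treated"},
    {"Sample ID": "S03", "Genotype": "Mutant", "Treatment": "Control"},
    {"Sample ID": "S04", "Genotype": "Mutant", "Treatment": "Treated"},
]

def add_combined_group(rows, factors, column_name):
    """Return copies of the rows with a column joining the levels of several factors."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row[column_name] = "_".join(row[f] for f in factors)
        out.append(new_row)
    return out

def to_csv(rows):
    """Serialize the rows as CSV text, with the Sample ID as the first column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

table = add_combined_group(rows, ["Genotype", "Treatment"], "Group")
csv_text = to_csv(table)
```

With such a column, a group like Mutant_Treated can be compared directly against WildType_Control.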
For more information on this please see the related FAQ:
In an ideal scenario, RNA-seq samples analyzed together are handled identically during sample preparation and sequencing, so that the only differing factor(s) are those of interest for the study.
However, this is not always possible in practice, and it is therefore possible to control for confounding factor(s) that may affect gene expression.
To control for a confounding factor, the grouping of samples affected by the confounding factor needs to be different from the groups used to test for the main experimental factor. Thus, possible confounding factors need to be considered early during experimental design and recorded for each sample. Furthermore, at least two samples need to be affected by a confounding factor to control for it during differential expression.
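The two conditions above can be checked on a metadata table before import. The Python sketch below is a simplified check under stated assumptions: it interprets "at least two samples" as each level of the confounding factor being shared by at least two samples, and it only detects the case where the confounding factor splits the samples into exactly the same groups as the main factor. The function name `check_confounder`, the column names, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def check_confounder(rows, main_factor, confounder, key="Sample ID"):
    """Return a list of problems that would prevent controlling for the confounder."""
    problems = []

    # Assumed reading of the rule: every confounder level needs at least two samples.
    counts = Counter(row[confounder] for row in rows)
    for level, n in sorted(counts.items()):
        if n < 2:
            problems.append(f"'{level}' in '{confounder}' has only {n} sample(s)")

    def partition(column):
        """Group sample IDs by the value in the given column."""
        groups = defaultdict(set)
        for row in rows:
            groups[row[column]].add(row[key])
        return {frozenset(g) for g in groups.values()}

    # If both columns split the samples identically, the effects cannot be separated.
    if partition(main_factor) == partition(confounder):
        problems.append(f"'{confounder}' groups samples exactly like '{main_factor}'")
    return problems

# Hypothetical paired design: each patient is sampled before and after treatment.
paired = [
    {"Sample ID": "S1", "Treatment": "Before", "Patient": "P1"},
    {"Sample ID": "S2", "Treatment": "After", "Patient": "P1"},
    {"Sample ID": "S3", "Treatment": "Before", "Patient": "P2"},
    {"Sample ID": "S4", "Treatment": "After", "Patient": "P2"},
]

# Hypothetical design where genotype and greenhouse position cannot be separated.
confounded = [
    {"Sample ID": "S1", "Genotype": "A", "Position": "Window"},
    {"Sample ID": "S2", "Genotype": "A", "Position": "Window"},
    {"Sample ID": "S3", "Genotype": "B", "Position": "Away"},
    {"Sample ID": "S4", "Genotype": "B", "Position": "Away"},
]

paired_problems = check_confounder(paired, "Treatment", "Patient")
confounded_problems = check_confounder(confounded, "Genotype", "Position")
```

An empty list means that neither problem was detected; it does not guarantee that the design is suitable in every other respect.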
Some examples of confounding factors are listed below.
- Paired experimental design. Gene expression may vary between individuals due to their genetic background and environment. Thus, in a study where samples from the same individuals are compared, e.g. before and after treatment, or different tissues from the same individual, it is possible to control for the individual. This may also be referred to as sample pairing.
- Batch effects during the experiment. Consider an experiment examining the gene expression differences between two plant genotypes growing in a greenhouse. Due to space limitations, not all plants can stay close to the window, where there is more light. If one genotype is placed close to the window and the other away from it, it is not possible to distinguish expression differences due to 'genotype' from those due to 'position in the greenhouse'. However, if the placement of the plants is randomized regardless of genotype, a PCA plot can be used to check whether the position in the greenhouse affects gene expression, and any such effect can then be controlled for in the Differential Expression for RNA-Seq tool.
- Batch effects during wet lab procedures. These can arise in experiments where it is not practical to extract mRNA from all samples at the same time (e.g. tumor biopsies from different patients).
Not all experiments are affected by confounding factors. If no confounding factors are recorded, or the PCA plot shows that none influence gene expression, then the "While controlling for" box in the Differential Expression for RNA-Seq tool can be left blank.
The provided metadata is not only used to specify the groups to compare in the Differential Expression for RNA-Seq tool. The metadata table also affects how results can be visualized using PCA for RNA-Seq, Create Heat Map for RNA-Seq, Create Expression Browser, and Create Venn Diagram for RNA-Seq. The words included in the metadata are shown in the figures, so the naming in the metadata table should be considered carefully to make the figures easy to interpret, especially if they are to be used in a presentation or publication. We therefore recommend using words that describe the samples rather than yes/no or numbering (1, 2, and 3) in the metadata.
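If a metadata table already contains yes/no or numeric codes, they can be replaced with descriptive words before import. The following Python sketch shows one way to do this; the column name Treatment, the replacement labels, and the helper `relabel` are hypothetical, and the same result can be achieved by editing the column directly in Excel.

```python
def relabel(rows, column, labels):
    """Return copies of the rows with coded values in one column replaced by words."""
    out = []
    for row in rows:
        new_row = dict(row)
        # Values without a replacement are kept unchanged.
        new_row[column] = labels.get(row[column], row[column])
        out.append(new_row)
    return out

# Hypothetical metadata using uninformative yes/no codes.
coded = [
    {"Sample ID": "S01", "Treatment": "no"},
    {"Sample ID": "S02", "Treatment": "yes"},
]

descriptive = relabel(coded, "Treatment", {"yes": "Drug treated", "no": "Untreated control"})
```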
Example metadata with two uninformative columns and three informative:
Expression browser based on each of the metadata columns B and D. The information is the same, but using words describing the samples makes the data easier to interpret:
Heatmap showing each of the five metadata columns:
If you wish to use the comparison options All group pairs or Across groups (Anova-like), the order of the samples (rows) in the metadata can be used to control the order of sample comparison. More about this can be found in the related FAQ: