ReSDK Tables

ReSDK tables are helper classes for aggregating collection data in tabular format. Currently, we have four flavours:

RNATables

Imagine you are modelling gene expression data from a given collection. Ideally, you would want all expression values organized in a table where rows represents samples and columns represent genes. Class RNATables gives you just that (and more).

We will present the functionality of RNATables through an example. We will:

  • Create an instance of RNATables and examine it’s attributes

  • Fetch raw expressions and select TIS signature genes with sufficient coverage

  • Normalize expression values (log-transform) and visualize samples in a simple PCA plot

First, connect to a Resolwe server, pick a collection and create and instance of RNATables:

import resdk
from resdk.tables import RNATables
res = resdk.Resolwe(url='https://app.genialis.com/')
res.login()
collection = res.collection.get("sum149-fresh-for-rename")
sum149 = RNATables(collection)

Object sum149 is an instance of RNATables and has many attributes. For a complete list see the SDK Reference, here we list the most commonly used ones:

# Expressions raw counts
sum149.rc

# Expressions normalized counts
sum149.exp
# See normalization method
sum149.exp.attrs["exp_type"]

# Sample metadata
sum149.meta

# Sample QC metrics
sum149.qc

# Dictionary that maps gene ID's into gene symbols
sum149.id_to_symbol
# This is handy to rename column names (gene ID's) to gene symbols
sum149.rc.rename(columns=sum149.id_to_symbol)

Note

Expressions and metadata are cached in memory as well as on disk. At each time they are re-requested a check is made that local and server side of data is synced. If so, cached data is used. Otherwise, new data will be pulled from server.

In our example we will only work with a set of TIS signature genes:

TIS_GENES = ["CD3D", "IDO1", "CIITA", "CD3E", "CCL5", "GZMK", "CD2", "HLA-DRA", "CXCL13", "IL2RG", "NKG7", "HLA-E", "CXCR6", "LAG3", "TAGAP", "CXCL10", "STAT1", "GZMB"]

We will identify low expressed genes and only keep the ones with average raw expression above 20:

tis_rc = sum149.rc.rename(columns=sum149.id_to_symbol)[TIS_GENES]
mean = tis_rc.mean(axis=0)
high_expressed_genes = mean.loc[mean > 20].index

Now, lets select TPM normalized expressions and keep only highly expressed tis genes. We also transform to log2(TPM + 1):

import numpy as np
tis_tpm = sum149.exp.rename(columns=sum149.id_to_symbol)[high_expressed_genes]
tis_tpm_log = np.log(tis_tpm + 1)

Finally, we perform PCA and visualize the results:

from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True)
Y = pca.fit_transform(tis_tpm_log)

import matplotlib.pyplot as plt
for ((x, y), sample_name) in zip(Y, tis_tpm.index):
    plt.plot(x, y, 'bo')
    plt.text(x, y, sample_name)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]})")
plt.show()

MethylationTables

Similar as RNATables provide access to raw counts and normalized expression values of RNA data, MethylationTables allow for fast access of beta and m-values of methylation data:

meth = resdk.tables.MethylationTables(<collection-with-methylation-data>)

# Methylation beta-values
meth.beta

# Methylation m-values
meth.mval

MATables

Similar as RNATables provide access to raw counts and normalized expression values of RNA data, MATables allow for fast access of expression values per probe of microarray:

ma = resdk.tables.MATables(<collection-with-microarray-data>)

# Microarray expressions values (columns are probe ID's)
ma.exp

VariantTables

Similar as RNATables provide access to raw counts and normalized expression values of RNA data, VariantTables allow for fast access of variant data present in Data of type data:mutationstable:

vt = resdk.tables.VariantTables(<collection-with-variant-data>)
vt.variants

The output of the above would look something like this:

sample_id

chr1_123_C>T_Gly11Asp

chr1_126_T>C_Asp12Gly

101

2

0

102

0

1

In rows, there are sample ID’s. In columns there are variants where each variant is given as: <chromosome>_<position>_<nucleotide-change>_<amino-acid-change>. Values in table can be 0 (no mutation), 1 (heterozygous mutation) or 2 (homozygous mutation).

The above example gives an ideal situation where the mutation status for each position is known. However, this is not always the case.

Missing values and discard_fakes argument

Very often, there is no info about a certain variant / sample, so values can also be NaN (unknown). Other common case is just the info that there is no mutation on a given position. This is a valid information also. Given the above, a more realistic example of output is:

sample_id

chr1_123_C>T_Gly11Asp

chr1_126_T>C_Asp12Gly

chr1_127

101

2

NaN

0

102

0

1

NaN

One can se that for some combination of variants / samples there is no information: a value in table is NaN. It is up to a user if this is interpreted as no variant or something else. In the first case, one can quickly convert NaN to 0 with vt.variants.fillna(0). One can also see that there is a column (chr1_127) that is not actually representing a variant. One may call this a “fake” variant. It is a way of signalling the absence of variant on a given position. Usually this is not useful, but is some cases it is. If you would like your output to contain such fake variants please specify discard_fakes=False in VariantTables constructor.

Inspecting depth

The reason for NaN values may be that the read depth on certain position is too low for GATK to reliably call a variant. In such case, it is worth inspecting the depth or depth per base:

# Similar as above but one gets depth on particular variant / sample
vt.depth
# One can also get depth for specific base
vt.depth_a
vt.depth_c
vt.depth_t
vt.depth_g

Filtering mutations

Process mutations-table accepts an input mutations which specifies the gene (and optionally amino acid change) of interest. It restricts the scope of mutation to just a given gene or amino acid.

However, it can happen that not all the samples have the same mutations input. In such cases, it makes little sense to merge the information about mutations from multiple samples. By default, VariantTables checks that all Data is computed with same mutations input. If this is not true, it will raise an error.

But if you provide additional argument mutations it will limit the mutations to only those in the given gene. An example:

# Sample 101 has mutations input "FHIT, BRCA2"
# Sample 102 has mutations input "BRCA2"

# This would cause error, since the mutations inputs are not the same
vt = resdk.tables.VariantTables(<collection>)
vt.variants

# This would limit the variants to just the ones in BRCA2 gene.
vt = resdk.tables.VariantTables(<collection> mutations=["BRCA2"])
vt.variants