Getting started

This tutorial is for bioinformaticians. It helps you install ReSDK and explains some basic commands. We will connect to an instance of Genialis Server, run some basic queries, and align raw reads to a genome.

Installation

Installation is easy: just make sure you have Python and pip installed on your computer. Then run this command in the terminal (CMD on Windows):

pip install resdk
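To confirm that the installation succeeded, you can ask Python which version of the package it sees. The following is a minimal sketch that uses only the standard library (Python 3.8 or newer); the helper name installed_version is hypothetical and not part of ReSDK.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if resdk is installed, otherwise None.
print(installed_version("resdk"))
```

If this prints None, pip most likely installed the package into a different Python environment than the one you are running.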

Registration

The examples presented here require access to a public Genialis Server that is configured for the examples in this tutorial. Some parts of the documentation will work for registered users only. Please request a Demo on Genialis Server before you continue, and remember your username and password.

Connect to Genialis Server

Start the Python interpreter by typing python into the command line. You’ll recognize the interpreter by its ‘>>>’ prompt. Now we can connect to the Genialis Server:

import resdk

# Create a Resolwe object to interact with the server and log in
res = resdk.Resolwe(url='https://app.genialis.com')
res.login()

# Enable verbose logging to standard output
resdk.start_logging()

Note

If you omit the login() line, you will be logged in as an anonymous user. Note that anonymous users do not have access to the full set of features.

Note

When connecting to the server through an interactive session, we suggest you use the resdk.start_logging() command. This allows you to see important messages (e.g. warnings and errors) when executing commands.

Note

To avoid copy-pasting of the commands, you can download all the code used in this section.

Query data

Before we start querying data on the server we should become familiar with what a data object is. Everything that is uploaded or created (via processes) on a server is a data object. The data object contains a complete record of the processing that has occurred. It stores the inputs (files, arguments, parameters…), the process (the algorithm) and the outputs (files, images, numbers…). Let’s count all data objects on the server that we can access:

res.data.count()

This is the number of data objects on the server that you have permission to access. As a new user, you can only see a small subset of all data objects. Data objects are referenced by id, slug, and name.

Note

id is the auto-generated unique identifier of an object. IDs are integers.

slug is the unique name of an object. The slug is automatically created from the name but can also be edited by the user. Only lowercase letters, numbers and dashes are allowed (whitespace and uppercase letters are not accepted).

name is an arbitrary, non-unique name of an object.
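The slug rule above (lowercase letters, numbers and dashes only) can be illustrated with a small local sketch. Both helpers, to_slug and is_valid_slug, are hypothetical and are not part of ReSDK; the server's own slug generation may differ in details such as how it handles duplicates.

```python
import re


def to_slug(name):
    """Approximate a slug from a free-form name (illustrative only)."""
    slug = name.lower()
    # Replace every run of disallowed characters with a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Drop dashes left over at either end.
    return slug.strip("-")


def is_valid_slug(slug):
    """Check that a slug uses only lowercase letters, numbers and dashes."""
    return re.fullmatch(r"[a-z0-9-]+", slug) is not None


print(to_slug("Homo sapiens STAR index"))
print(is_valid_slug("resdk-example-reads"))
print(is_valid_slug("Not A Slug"))
```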

Let’s say we now want to find some genome indices. We don’t always know the id, slug, or name by heart, but we can use filters to find them. We will first count all genome index data objects:

res.data.filter(type='data:index').count()

This is quite a lot of objects! We can filter even further:

res.data.filter(type="data:index:star", name__contains="Homo sapiens")

Note

For a complete list of filtering options, pass an invalid filtering argument. The resulting error message lists all available options. For example:

res.data.filter(foo="bar")
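Filter arguments such as name__contains combine a field name and a lookup, separated by a double underscore. The sketch below imitates that convention locally on a list of dictionaries, only to show how such an argument is read. It is not how ReSDK implements filtering (the real filtering happens on the server); the matches function and the records are hypothetical.

```python
def matches(record, **filters):
    """Return True if `record` (a dict) satisfies every filter."""
    for key, expected in filters.items():
        # Split "name__contains" into the field ("name") and the lookup ("contains").
        field, _, lookup = key.partition("__")
        value = record[field]
        if lookup == "":
            # No lookup given: require an exact match.
            if value != expected:
                return False
        elif lookup == "contains":
            # Substring match.
            if expected not in value:
                return False
        else:
            raise ValueError(f"Unsupported lookup in this sketch: {lookup!r}")
    return True


# Hypothetical records standing in for data objects on a server.
records = [
    {"id": 1, "slug": "star-index-hs", "name": "STAR index Homo sapiens"},
    {"id": 2, "slug": "star-index-mm", "name": "STAR index Mus musculus"},
]

selected = [r["slug"] for r in records if matches(r, name__contains="Homo sapiens")]
print(selected)
```

Several filters given together must all hold, which mirrors how the two arguments in the earlier filter call narrow the result.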

Later in this tutorial we will need the genome index with a specific slug. We will get it and store a reference to it:

# Get data object by slug
genome_index = res.data.get('resdk-example-genome-index')

We have now seen how to use filters to find and get what we want. Let’s query and get a paired-end FASTQ data object:

# All paired-end fastq objects
res.data.filter(type='data:reads:fastq:paired')

# Get specific object by slug
reads = res.data.get('resdk-example-reads')

We now have genome index and reads data objects. We can learn more about an object by accessing its attributes. For example, we can find out who created the object:

reads.contributor

and inspect the list of files it contains:

reads.files()

These and many other data object attributes/methods are described here.

Run alignment

A common analysis in bioinformatics is to align sequencing reads to a reference genome. This is done by running a certain process. A process uses an algorithm or a sequence of algorithms to turn given inputs into outputs. Here we will only test the STAR alignment process, but many more processes are available (see the Process catalog). This process automatically creates a BAM alignment file and BAI index, along with some other files.

Let’s run STAR on our reads, using our genome:

bam = res.run(
    slug='alignment-star',
    input={
        'reads': reads.id,
        'genome': genome_index.id,
    },
)

This might seem like a complicated statement, but note that we only run a process with a specific slug and its required inputs. The processing may take some time. Note that we have stored the reference to the alignment object in the bam variable. We can check the status of the process to determine whether the processing has finished:

bam.status

Status OK indicates that processing has finished successfully. If the status is not OK yet, run the bam.update() and bam.status commands again in a few minutes. We can inspect our newly created data object:

# Get the latest info about the object from the server
bam.update()
bam.status
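Instead of re-running these two commands by hand, the update-and-check cycle can be wrapped in a small polling loop. The sketch below relies only on the update() method and status attribute shown above. The wait_for_status helper is hypothetical, and treating 'ER' as the error status is an assumption that is not stated in this tutorial.

```python
import time


def wait_for_status(data, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Refresh `data` until it reaches a final status, then return that status."""
    final_statuses = ("OK", "ER")  # "ER" as the error status is an assumption
    waited = 0
    while True:
        data.update()  # fetch the latest info about the object from the server
        if data.status in final_statuses:
            return data.status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {data.status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the alignment object from above, the call would be wait_for_status(bam). The sleep parameter exists only so that the waiting can be replaced in tests.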

As with any other data object, it has its own id, slug, and name. We can explore the process inputs and outputs:

# Process inputs
bam.input

# Process outputs
bam.output

Download the outputs to your local disk:

bam.download()

We have come to the end of Getting started. You now know some basic ReSDK concepts and commands. Yet, we have only scratched the surface. By continuing with the Tutorials, you will become familiar with more advanced features, and will soon be able to perform powerful analyses on your data.