Create, modify and organize data

To begin, we need some sample data to work with. You may use your own reads (.fastq) files, or download an example set we have provided:

import resdk

res = resdk.Resolwe(url='https://app.genialis.com')
res.login()

# Get example reads
example = res.data.get('resdk-example-reads')
# Download them to current working directory
example.download(
    field_name='fastq',
    download_dir='./',
)
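
If you want to confirm that the download succeeded, a quick check of the working directory is enough. This is a minimal sketch; the file name below is an assumption, use whatever name the download actually produced:

import os

# Hypothetical name of the downloaded reads file
print(os.path.exists('./reads.fastq.gz'))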

Note

To avoid copy-pasting the commands, you can download all the code used in this section.

Organize resources

Before all else, one needs to prepare space for work. In our case, this means creating a “container” where the produced data will reside. So let’s create a collection and then put some data inside!

# create a new collection object in your running instance of Resolwe (res)
test_collection = res.collection.create(name='Test collection')
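
If you like, you can describe the collection right away. A minimal sketch, assuming the usual pattern of setting an attribute and calling save():

# Add a short description and push the change to the server
test_collection.description = 'Collection used for this tutorial'
test_collection.save()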

Upload files

We will upload single-end FASTQ reads with the upload-fastq-single process.

# Upload FASTQ reads
reads = res.run(
    slug='upload-fastq-single',
    input={
        'src': './reads.fastq.gz',
    },
    collection=test_collection,
)

What just happened? First, we chose a process to run, using its slug upload-fastq-single. Each process requires some inputs; in this case there is only one input, named src, which is the location of the reads on our computer. Uploading a FASTQ file creates a new Data object on the server containing the uploaded reads. Additionally, we ensured that the new Data object is put inside test_collection.

The upload process also created a Sample object that the reads data is associated with. You can access it with:

reads.sample

Note

You can also upload your files by providing a URL: just replace the path to your local files with the URL. This comes in handy when your files are large and/or stored on a remote server, and you don’t want to download them to your computer just to upload them to the Resolwe server again.
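
For example, a sketch of an upload from a URL (the address below is purely illustrative):

# Upload reads directly from a (hypothetical) remote location
reads_from_url = res.run(
    slug='upload-fastq-single',
    input={
        'src': 'https://example.com/path/to/reads.fastq.gz',
    },
    collection=test_collection,
)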

Modify data

Both the Data object with reads and its Sample are owned by you, and you have permission to modify them. For example:

# Change name
reads.name = 'My first data'
reads.save()

Note the save() part! Without it, the change is only applied locally (on your computer); calling save() makes sure that all changes are also applied on the server.
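
The same pattern applies to the sample object; for example, a short sketch:

# Rename the sample and save the change to the server
reads.sample.name = 'My first sample'
reads.sample.save()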

Note

Some fields cannot (and should not) be changed. For example, you cannot modify created or contributor fields. You will get an error if you try.

Annotate Samples

The next thing to do after uploading some data is to annotate the samples this data belongs to. This is done by assigning a value to a predefined field on a given sample. See the example below.

Each sample should be assigned a species. This is done by attaching the general.species field to a sample and assigning it a value, e.g. Homo sapiens.

reads.sample.set_annotation("general.species", "Homo sapiens")

Annotation Fields

You might be wondering why the example above requires the general.species string instead of, e.g., just species. The answer lies in AnnotationFields. These are predefined objects that are available for annotating samples. Besides a name, they carry several other properties. Let’s examine some of them:

# Get the field by its group and name:
field = res.annotation_field.get(group__name="general", name="species")
# Same thing, but in shorter syntax
field = res.annotation_field.from_path("general.species")
# Examine some of the field attributes
field.name
field.group
field.description

Note

Each field is uniquely defined by the combination of name and group.

If you wish to examine which fields are available, use a query:

res.annotation_field.all()
# You can also filter the results
res.annotation_field.filter(group__name="general")
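
To get a readable overview, you can loop over the query results; a short sketch using the field attributes shown above:

# List fields in the "general" group together with their descriptions
for field in res.annotation_field.filter(group__name="general"):
    print(field.name, "-", field.description)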

You may be wondering whether you can create your own fields or groups. The answer is no. Time has proven that keeping things organized requires the use of a curated set of predefined fields. If you feel that you absolutely need an additional annotation field, let us know or use resources such as Metadata.

Annotation Values

As mentioned before, fields are only one part of an annotation. The other part are annotation values, stored as standalone AnnotationValue resources. They connect a field with the actual value.

# Get an AnnotationValue
ann_value = reads.sample.get_annotation("general.species")
# The actual value
ann_value.value
# The corresponding field
ann_value.field
# The corresponding sample
ann_value.sample

As a shortcut, you can get all the AnnotationValues for a given sample by:

reads.sample.annotations
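
For example, a quick way to review all annotations on a sample, using the attributes shown above:

# Print each annotation as "field: value"
for ann_value in reads.sample.annotations:
    print(f"{ann_value.field}: {ann_value.value}")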

Helper methods

Sometimes it is convenient to represent annotations with a dictionary, where keys are field names and values are annotation values. You can get all the annotations for a given sample in this format by calling:

reads.sample.get_annotations()

Multiple annotations stored in a dictionary can be assigned to a sample by:

annotations = {
    "general.species": "Homo sapiens", "general.description": "Description"
}
reads.sample.set_annotations(annotations)

An annotation is deleted from a sample by setting its value to None when calling the set_annotation or set_annotations helper methods. To avoid the confirmation prompt, you can set force=True.

reads.sample.set_annotation("general.description", None, force=True)

Run analyses

Various bioinformatic processes are available to properly analyze sequencing data. Many of these pipelines are available via the Resolwe SDK and are listed in the Process catalog of the Resolwe Bioinformatics documentation.

After uploading the reads file, the next step is to align the reads to a genome. We will use the STAR aligner, which is wrapped in a process with the slug alignment-star. Inputs and outputs of this process are described in the STAR process catalog. We will define the input files, and the process will run its algorithm, transforming inputs into outputs.

# Get genome
genome_index = res.data.get('resdk-example-genome-index')

alignment = res.run(
    slug='alignment-star',
    input={
        'genome': genome_index,
        'reads': reads,
    },
)

Let’s take a closer look at the code above. We chose the alignment process by its slug 'alignment-star'. For inputs we provided the data objects reads and genome_index. The reads object was created with the 'upload-fastq-single' process, while the genome index data object was already on the server, and we simply used its slug to retrieve it. The alignment-star process automatically takes the right files from the data objects specified in the inputs and creates output files: a BAM alignment file, a BAI index and some more…

You probably noticed that we got the result almost instantly, while a typical alignment run takes much longer (often hours). This is because processing runs asynchronously, so the returned data object does not yet have an OK status or outputs when returned.

# Get the latest meta data from the server
alignment.update()

# See the process progress
alignment.process_progress

# Print the status of data
alignment.status

Status OK indicates that processing has finished successfully, but you will also encounter other statuses. They are given as two-letter abbreviations; to understand their meanings, check the status reference. When processing is done, all outputs are written to disk and you can inspect them:

# See process output
alignment.output
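
If you want your script to block until processing finishes, a simple polling loop is enough. This is a minimal sketch; the set of final statuses below is an assumption based on the status reference:

import time

# Poll the server until the data object reaches a final status
while alignment.status not in ('OK', 'ER'):
    time.sleep(10)
    alignment.update()
print(alignment.status)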

So far, we have used the run() method twice: to upload reads (yes, uploading files is just a matter of using an upload process) and to run the alignment. You can check the full signature of the run() method.

Run workflows

Typical data analysis is often a sequence of processes. Raw data or initial input is analysed by running a process on it that outputs some data. This data is fed as input into another process that produces another set of outputs, which is then again fed into another process, and so on. Sometimes this sequence is so commonly used that one wants to simplify its execution. This can be done with a so-called “workflow”. Workflows are special processes that run a stack of processes. On the outside, they look exactly the same as a normal process: they have a process slug, inputs… For example, we can run the workflow “General RNA-seq pipeline” on our reads:

# Run a workflow
res.run(
    slug='workflow-bbduk-star-featurecounts-qc',
    input={
        'reads': reads,
        'genome': res.data.get('resdk-example-genome-index'),
        'annotation': res.data.get('resdk-example-annotation'),
        'rrna_reference': res.data.get('resdk-example-rrna-index'),
        'globin_reference': res.data.get('resdk-example-globin-index'),
    }
)

Solving problems

Sometimes a data object will not have an “OK” status. In such a case, it is helpful to be able to check what went wrong (and where). The stdout() method on data objects can help: it returns the standard output of the data object (as a string). The output is long but exceedingly useful for debugging. You can also inspect the info, warning and error logs.

# Update the data object to get the most recent info
alignment.update()

# Print process' standard output
print(alignment.stdout())

# Access process' execution information
alignment.process_info

# Access process' execution warnings
alignment.process_warning

# Access process' execution errors
alignment.process_error
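
To locate failed data objects in the first place, you can also query by status. A sketch, assuming the status filter field accepts the two-letter abbreviations mentioned above:

# List data objects that ended with an error status
for data in res.data.filter(status='ER'):
    print(data.id, data.name)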