Create, modify and organize data
To begin, we need some sample data to work with. You may use your own reads (.fastq files), or download the example set we provide:
import resdk
res = resdk.Resolwe(url='https://app.genialis.com')
res.login()
# Get example reads
example = res.data.get('resdk-example-reads')
# Download them to current working directory
example.download(
    field_name='fastq',
    download_dir='./',
)
Note
To avoid copy-pasting the commands, you can download all the code used in this section.
Organize resources
Before anything else, we need to prepare a space to work in. In our case, this means creating a “container” where the produced data will reside. So let’s create a collection and then put some data inside!
# create a new collection object in your running instance of Resolwe (res)
test_collection = res.collection.create(name='Test collection')
Upload files
We will upload single-end FASTQ reads with the upload-fastq-single process.
# Upload FASTQ reads
reads = res.run(
    slug='upload-fastq-single',
    input={
        'src': './reads.fastq.gz',
    },
    collection=test_collection,
)
What just happened? First, we chose a process to run, using its slug upload-fastq-single. Each process requires some inputs; in this case there is only one input, named src, which is the location of the reads on our computer. Uploading a fastq file creates a new Data object on the server containing the uploaded reads. Additionally, we ensured that the new Data object is put inside test_collection.
The upload process also created a Sample object for the reads data to be associated with. You can access it by:
reads.sample
Note
You can also upload your files by providing a URL. Just replace the path to your local files with the URL. This comes in handy when your files are large and/or stored on a remote server, and you don’t want to download them to your computer just to upload them to the Resolwe server again.
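Before handing a value to src, it can help to check what kind of source it is. The following is a minimal sketch and not part of Resolwe SDK: describe_src is a hypothetical helper, the example URL is a placeholder, and the list of URL schemes is an assumption rather than something the server is known to enforce.

```python
import os
from urllib.parse import urlparse

# Assumption: these are the URL schemes worth treating as remote sources.
REMOTE_SCHEMES = ("http", "https", "ftp")


def describe_src(src):
    """Classify an upload source as 'url', 'local file' or 'missing'."""
    scheme = urlparse(src).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        return "url"
    if os.path.isfile(src):
        return "local file"
    return "missing"


# Placeholder URL, used only to show the classification.
print(describe_src("https://example.com/reads.fastq.gz"))
```

A 'missing' result is a cue to fix the path before starting an upload.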
Modify data
Both the Data object with reads and the Sample are owned by you, and you have permission to modify them. For example:
# Change name
reads.name = 'My first data'
reads.save()
Note the save() part! Without it, the change is only applied locally (on your computer). Calling save() makes sure that all changes are also applied on the server.
Note
Some fields cannot (and should not) be changed. For example, you cannot modify the created or contributor fields. You will get an error if you try.
Annotate Samples
The next thing to do after uploading some data is to annotate the sample this data belongs to. This is done by assigning a value to a predefined field on a given sample. See the example below.
Each sample should be assigned a species. This is done by attaching the general.species field to a sample and assigning it a value, e.g. Homo sapiens.
reads.sample.set_annotation("general.species", "Homo sapiens")
Annotation Fields
You might be wondering why the example above requires the string general.species instead of, e.g., just species. The answer lies in AnnotationField objects. These are predefined objects that are available to annotate samples. They primarily have a name, but also other properties. Let’s examine some of those:
# Get the field by its group and name:
field = res.annotation_field.get(group__name="general", name="species")
# Same thing, but in shorter syntax
field = res.annotation_field.from_path("general.species")
# Examine some of the field attributes
field.name
field.group
field.description
Note
Each field is uniquely defined by the combination of name and group.
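To make the relation between the dotted path and the (group, name) pair concrete, here is a minimal sketch. The helper field_path_to_query is hypothetical and not part of Resolwe SDK, and it assumes that a path contains exactly one dot separating the group from the name.

```python
def field_path_to_query(path):
    """Split 'group.name' into the filter arguments used to look up a field.

    Assumption: a path has exactly one dot, with the group before it
    and the field name after it.
    """
    group, sep, name = path.partition(".")
    if not sep or not group or not name or "." in name:
        raise ValueError(f"Expected a path like 'group.name', got {path!r}")
    return {"group__name": group, "name": name}


query = field_path_to_query("general.species")
print(query)
```

The resulting dictionary uses the same keyword names as the res.annotation_field.get(...) call shown earlier, so it could be passed on as keyword arguments.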
If you wish to see which fields are available, use a query:
res.annotation_field.all()
# You can also filter the results
res.annotation_field.filter(group__name="general")
You may be wondering whether you can create your own fields / groups. The answer is no. Time has proven that keeping things organized requires the usage of a selected set of predefined fields. If you absolutely feel that you need an additional annotation field, let us know or use resources such as Metadata.
Annotation Values
As mentioned before, fields are only one part of the annotation. The other part is the annotation values, stored as a standalone resource, AnnotationValue. They connect the field with the actual value.
# Get an AnnotationValue
ann_value = reads.sample.get_annotation("general.species")
# The actual value
ann_value.value
# The corresponding field
ann_value.field
# The corresponding sample
ann_value.sample
As a shortcut, you can get all the AnnotationValue objects for a given sample by:
reads.sample.annotations
Helper methods
Sometimes it is convenient to represent the annotations as a dictionary, where keys are field names and values are annotation values. You can get all the annotations for a given sample in this format by calling:
reads.sample.get_annotations()
Multiple annotations stored in a dictionary can be assigned to a sample by:
annotations = {
    "general.species": "Homo sapiens",
    "general.description": "Description",
}
reads.sample.set_annotations(annotations)
An annotation is deleted from the sample by setting its value to None when calling the set_annotation or set_annotations helper methods. To avoid the confirmation prompt, you can set force=True.
reads.sample.set_annotation("general.description", None, force=True)
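Because None means "delete", a single dictionary can describe both updates and removals. The sketch below is a hypothetical helper (not part of Resolwe SDK) that compares two plain dictionaries. It assumes the current annotations are available as a path-to-value dictionary, as described for get_annotations() above.

```python
def annotation_updates(current, desired):
    """Build a dictionary of changes that turns `current` into `desired`.

    Changed or new fields keep their desired value; fields that exist in
    `current` but not in `desired` are mapped to None (meaning: delete).
    """
    updates = {
        path: value
        for path, value in desired.items()
        if current.get(path) != value
    }
    for path in current:
        if path not in desired:
            updates[path] = None
    return updates


current = {"general.species": "Homo sapiens", "general.description": "Old"}
desired = {"general.species": "Mus musculus"}
print(annotation_updates(current, desired))
```

The result could then be handed to reads.sample.set_annotations(...); unchanged fields are left out, so they are not touched.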
Run analyses
Various bioinformatic processes are available to properly analyze sequencing data. Many of these pipelines are available via Resolwe SDK, and are listed in the Process catalog of the Resolwe Bioinformatics documentation.
After uploading the reads file, the next step is to align the reads to a genome. We will use the STAR aligner, which is wrapped in a process with slug alignment-star. Inputs and outputs of this process are described in the STAR process catalog. We will define input files, and the process will run its algorithm that transforms inputs into outputs.
# Get genome
genome_index = res.data.get('resdk-example-genome-index')
alignment = res.run(
    slug='alignment-star',
    input={
        'genome': genome_index,
        'reads': reads,
    },
)
Let’s take a closer look at the code above. We defined the alignment process by its slug 'alignment-star'. For inputs we defined the data objects reads and genome. The reads object was created with the upload-fastq-single process, while the genome data object was already on the server and we just used its slug to identify it. The alignment-star process will automatically take the right files from the data objects specified in the inputs and create output files: a bam alignment file, a bai index and some more.
You probably noticed that res.run() returned almost instantly, even though a typical alignment can run for hours. This is because processing runs asynchronously: the returned data object does not yet have an OK status or any outputs.
# Get the latest metadata from the server
alignment.update()
# See the process progress
alignment.process_progress
# Print the status of data
alignment.status
Status OK indicates that processing has finished successfully, but you will also find other statuses. They are given with two-letter abbreviations. To understand their meanings, check the status reference. When processing is done, all outputs are written to disk and you can inspect them:
# See process output
alignment.output
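Since processing is asynchronous, a script usually has to wait before reading the outputs. Below is a minimal polling sketch. It is not part of Resolwe SDK: wait_for_data is a hypothetical helper that only relies on the update() method and status attribute shown above, and it assumes that 'ER' is the abbreviation for a failed run. FakeData is a stand-in object so that the example runs without a server.

```python
import time


def wait_for_data(data, poll_interval=10.0, timeout=3600.0, sleep=time.sleep):
    """Refresh `data` until its status is final or the timeout is reached.

    Assumption: 'OK' and 'ER' are the only final statuses of interest.
    """
    waited = 0.0
    while True:
        data.update()  # fetch the latest state from the server
        if data.status in ("OK", "ER"):
            return data.status
        if waited >= timeout:
            raise TimeoutError(f"Still in status {data.status!r} after {waited} s")
        sleep(poll_interval)
        waited += poll_interval


class FakeData:
    """Stand-in for a data object; steps through a list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self.status = None

    def update(self):
        self.status = next(self._statuses)


final = wait_for_data(FakeData(["WT", "PR", "OK"]), poll_interval=0)
print(final)
```

With a real data object, the call would look like wait_for_data(alignment); a poll interval of several seconds avoids sending needless requests to the server.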
Until now, we have used the run() method twice: to upload reads (yes, uploading files is just a matter of using an upload process) and to run the alignment. You can check the full signature of the run() method.
Run workflows
Typical data analysis is often a sequence of processes. Raw data or initial input is analysed by running a process on it that outputs some data. This data is fed as input into another process that produces another set of outputs, which is again fed into another process, and so on. Sometimes this sequence is so commonly used that one wants to simplify its execution. This can be done with a so-called “workflow”. Workflows are special processes that run a stack of processes. On the outside, they look exactly the same as a normal process and have a process slug, inputs, and so on. For example, we can run the workflow “General RNA-seq pipeline” on our reads:
# Run a workflow
res.run(
    slug='workflow-bbduk-star-featurecounts-qc',
    input={
        'reads': reads,
        'genome': res.data.get('resdk-example-genome-index'),
        'annotation': res.data.get('resdk-example-annotation'),
        'rrna_reference': res.data.get('resdk-example-rrna-index'),
        'globin_reference': res.data.get('resdk-example-globin-index'),
    }
)
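The idea of a workflow as a stack of processes, where each output becomes the next input, can be illustrated with plain Python functions. This is only a toy sketch of the concept and says nothing about how Resolwe implements workflows. All names here (make_workflow, trim, drop_short, count) are hypothetical.

```python
from functools import reduce


def make_workflow(*steps):
    """Combine steps into one callable that feeds each output to the next step."""

    def workflow(data):
        return reduce(lambda result, step: step(result), steps, data)

    return workflow


def trim(reads):
    """Toy step: strip leading and trailing 'N' characters from each read."""
    return [read.strip("N") for read in reads]


def drop_short(reads):
    """Toy step: keep only reads with at least four bases."""
    return [read for read in reads if len(read) >= 4]


def count(reads):
    """Toy step: report how many reads are left."""
    return len(reads)


toy_pipeline = make_workflow(trim, drop_short, count)
print(toy_pipeline(["NNACGTN", "NAN", "ACGTAC"]))
```

From the caller's side, toy_pipeline behaves like a single step, which mirrors how a workflow is run with one res.run() call.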
Solving problems
Sometimes the data object will not have an “OK” status. In such a case, it is helpful to be able to check what went wrong (and where). The stdout() method on data objects can help: it returns the standard output of the data object (as a string). The output is long but exceedingly useful for debugging. You can also inspect the info, warning and error logs.
# Update the data object to get the most recent info
alignment.update()
# Print process' standard output
print(alignment.stdout())
# Access process' execution information
alignment.process_info
# Access process' execution warnings
alignment.process_warning
# Access process' execution errors
alignment.process_error
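When checking several data objects, it can be handy to gather the status and the three logs into one short report. The sketch below is a hypothetical helper, not part of Resolwe SDK. It assumes that process_info, process_warning and process_error each hold a list of messages, and it uses a stand-in object with made-up messages so that it runs without a server.

```python
from types import SimpleNamespace


def summarize_problems(data):
    """Return a text report with the status and all logged messages."""
    lines = [f"status: {data.status}"]
    logs = (
        ("info", "process_info"),
        ("warning", "process_warning"),
        ("error", "process_error"),
    )
    for label, attribute in logs:
        # Assumption: each attribute is a list of messages (or missing/empty).
        for message in getattr(data, attribute, None) or []:
            lines.append(f"{label}: {message}")
    return "\n".join(lines)


# Stand-in for a failed data object; the messages are made up for illustration.
failed = SimpleNamespace(
    status="ER",
    process_info=[],
    process_warning=["Low number of reads."],
    process_error=["Alignment failed."],
)
print(summarize_problems(failed))
```

With a real object, call alignment.update() first so that the report reflects the latest state on the server.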