
A package for identifying the translated ORF using ribosome profiling data

Detect translated ORFs using ribosome-profiling data
====================================================

*RiboCode* is a very simple but high-quality computational algorithm to
identify genome-wide translated ORFs using ribosome-profiling data.

Dependencies:
-------------

- pysam

- pyfasta

- h5py

- Biopython

- Numpy

- Scipy

- matplotlib

- setuptools
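
Before installing, it can help to confirm that the dependencies above are importable. The following is a minimal sketch, not part of *RiboCode*: the helper name ``missing_modules`` is hypothetical, and the list uses the usual import names of these packages (Biopython imports as ``Bio``)::

    import importlib

    # Import names for the dependencies listed above (Biopython imports as "Bio").
    REQUIRED_MODULES = ["pysam", "pyfasta", "h5py", "Bio",
                        "numpy", "scipy", "matplotlib", "setuptools"]

    def missing_modules(names):
        """Return the names from `names` that fail to import."""
        missing = []
        for name in names:
            try:
                importlib.import_module(name)
            except ImportError:
                missing.append(name)
        return missing

    if __name__ == "__main__":
        missing = missing_modules(REQUIRED_MODULES)
        print("Missing modules: " + (", ".join(missing) if missing else "none"))

Any name reported as missing can then be installed with ``pip`` before installing *RiboCode*.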

Installation
------------

*RiboCode* can be installed like any other Python package. Here are some
common ways:

* Install from PyPI::

    pip install RiboCode

* Install from a local archive::

    pip install RiboCode-*.tar.gz

If you do not have administrator permission, install *RiboCode* in your own directory by adding the
option ``--user`` to the installation commands. Then add ``~/.local/bin/`` to the ``PATH`` variable, and
``~/.local/lib/`` to the ``PYTHONPATH`` variable. For example, if you are using the bash shell, add
the following lines to your ``~/.bashrc`` file::

    export PATH=$PATH:$HOME/.local/bin/
    export PYTHONPATH=$HOME/.local/lib/python2.7

Then source your ``~/.bashrc`` file with this command::

    source ~/.bashrc

Tutorial to analyze ribosome-profiling data and run *RiboCode*
--------------------------------------------------------------

Here, we use the `HEK293 dataset`_ as an example.
Throughout this tutorial, replace the file paths and names with those of your own files.

1. Required files

The genome FASTA file and the GTF annotation file can be downloaded from:


http://www.gencodegenes.org

or from:

http://asia.ensembl.org/info/data/ftp/index.html

http://useast.ensembl.org/info/data/ftp/index.html

For example, the files required in this tutorial can be downloaded from the following URLs:

GTF: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

FASTA: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz

The raw Ribo-seq fastq file can be downloaded using the fastq-dump tool from `SRA_Toolkit`_::

    fastq-dump -A SRR1630831

2. Trim adapter sequences from the Ribo-seq reads

Use the cutadapt program: https://cutadapt.readthedocs.io/en/stable/installation.html

Example::

    cutadapt -m 20 --match-read-wildcards -a (Adapter sequence) -o (Trimmed fastq file) (Input fastq file)

The adapter sequences have already been trimmed from this dataset, so this step can be skipped here.

3. Remove rRNA-derived reads

Align the trimmed reads to rRNA sequences using Bowtie, then keep the unaligned reads for the next step.

Bowtie program: http://bowtie-bio.sourceforge.net/index.shtml

rRNA sequences: an "rRNA.fa" file is provided in the data folder of this package.

Example::

    bowtie-build rRNA.fa rRNA
    bowtie -p 8 --norc --un un_aligned.fastq rRNA -q SRR1630831.fastq HEK293_rRNA.align

4. Align the clean reads to the reference genome

Use the STAR program: https://github.com/alexdobin/STAR

Example:

(1). Build index::

    STAR --runThreadN 8 --runMode genomeGenerate --genomeDir hg19_STARindex \
         --genomeFastaFiles hg19_genome.fa --sjdbGTFfile gencode.v19.annotation.gtf

(2). Alignment::

    STAR --outFilterType BySJout --runThreadN 8 --outFilterMismatchNmax 2 --genomeDir hg19_STARindex \
         --readFilesIn un_aligned.fastq --outFileNamePrefix HEK293 \
         --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts \
         --outFilterMultimapNmax 1 --seedSearchStartLmaxOverLread 0.5 \
         --outFilterMatchNmin 16 --alignIntronMax 1

5. Run *RiboCode* to identify translated ORFs

(1). Prepare the transcripts annotation files::

    prepare_transcripts -g gencode.v19.annotation.gtf -f hg19_genome.fa -o RiboCode_annot

(2). Select the length range of the RPF reads and identify the P-site locations::

    metaplots -a RiboCode_annot -r HEK293Aligned.toTranscriptome.out.bam

This step will generate a PDF file, which plots the aggregate profiles of the distance between the 5'-end of reads and the annotated start codons or stop codons.

Users can select the read lengths which show strong 3-nt periodicity and identify the P-site locations for each length.
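
To make "strong 3-nt periodicity" concrete, the sketch below computes, for reads of one length, the fraction of 5'-end positions falling in each frame relative to the annotated start codon. It is an illustration only, not the computation performed by ``metaplots``. The function names and the toy offsets are hypothetical, and reading the P-site offset from the most frequent 5'-end position upstream of the start codon is a common convention assumed here rather than something stated by this package::

    from collections import Counter

    def frame_fractions(distances):
        """Fraction of reads in each of the three frames.

        `distances` holds, for reads of one length, the position of each
        read's 5' end relative to the annotated start codon (negative means
        upstream). The frame is that position modulo 3.
        """
        if not distances:
            raise ValueError("no reads given")
        counts = Counter(d % 3 for d in distances)
        total = float(len(distances))
        return [counts.get(frame, 0) / total for frame in range(3)]

    def psite_offset(distances):
        """Assumed convention: the P-site offset is the distance from the most
        frequent 5'-end position to the first base of the start codon."""
        most_common_position, _ = Counter(distances).most_common(1)[0]
        return -most_common_position

    # Toy positions for one read length (made-up numbers, not real data).
    toy = [-12, -12, -9, -12, -6, -11, -12, -3, -10, -12]
    print(frame_fractions(toy))
    print(psite_offset(toy))

A read length whose fractions are dominated by a single frame shows the kind of periodicity to look for in the PDF, whereas fractions close to one third each suggest that the length should be excluded.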

(3). Detect translated ORFs using the ribosome-profiling data::

    RiboCode -a RiboCode_annot -c config.txt -l no -o RiboCode_ORFs_result

Specify the bam file and the P-site parameters in :download:`config.txt <data/config.txt>`. Please refer to the example file in the data folder.

**Explanation of final result files**

*RiboCode* generates two text files.
The "(output file name).txt" file contains the information on predicted ORFs in each
transcript. The "(output file name)_collapsed.txt" file combines the ORFs with the
same stop codon in different transcript isoforms: the one harboring the most
upstream in-frame ATG is chosen.

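
The collapsing rule can be sketched as follows. This is a simplified illustration, not the code used by *RiboCode*: the record fields ``chrom`` and ``strand`` and both function names are hypothetical, ``gstart`` is assumed to be the genomic position of the start codon on either strand, and genomic position is used as a stand-in for "most upstream", which ignores splicing differences between isoforms::

    def is_upstream(a, b):
        """True if ORF `a` starts upstream of ORF `b` on their shared strand."""
        if a["strand"] == "+":
            return a["gstart"] < b["gstart"]
        return a["gstart"] > b["gstart"]

    def collapse_by_stop(orfs):
        """Keep, for each (chrom, strand, gstop) group, the most upstream ORF."""
        best = {}
        for orf in orfs:
            key = (orf["chrom"], orf["strand"], orf["gstop"])
            current = best.get(key)
            if current is None or is_upstream(orf, current):
                best[key] = orf
        return list(best.values())

    # Toy records (made-up coordinates): two isoforms per stop codon.
    toy_orfs = [
        {"id": "a", "chrom": "chr1", "strand": "+", "gstart": 100, "gstop": 400},
        {"id": "b", "chrom": "chr1", "strand": "+", "gstart": 40, "gstop": 400},
        {"id": "c", "chrom": "chr1", "strand": "-", "gstart": 900, "gstop": 600},
        {"id": "d", "chrom": "chr1", "strand": "-", "gstart": 960, "gstop": 600},
    ]
    print(sorted(orf["id"] for orf in collapse_by_stop(toy_orfs)))
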
Selected columns of the result files:

- ORF_ID: The identifier of the ORF
- ORF_type: The type of the ORF. The following ORF categories are reported: "annotated" (overlapping the annotated
  CDS, with the same stop codon as the annotated CDS), "uORF" (upstream of the annotated CDS, not overlapping
  the annotated CDS), "dORF" (downstream of the annotated CDS, not overlapping the annotated CDS),
  "Overlap_uORF" (upstream of the annotated CDS, overlapping the annotated CDS), "Overlap_dORF"
  (downstream of the annotated CDS, overlapping the annotated CDS), "Internal" (inside the annotated
  CDS, but in a different frame relative to the annotated CDS), "new" (in non-coding genes and non-coding
  transcripts of coding genes).
- ORF_tstart, ORF_tstop: The start and end of the ORF in the RNA transcript (1-based coordinates)
- ORF_gstart, ORF_gstop: The start and end of the ORF in the genome (1-based coordinates)
- pval_frame0_vs_frame1: p-value of the Wilcoxon test for the difference between frame 0 and frame 1
- pval_frame0_vs_frame2: p-value of the Wilcoxon test for the difference between frame 0 and frame 2
- pval_combined: Combined p-value of pval_frame0_vs_frame1 and pval_frame0_vs_frame2
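
As a hedged example of working with these columns, the sketch below selects ORFs whose ``pval_combined`` falls below a cutoff. It assumes the result file is tab-delimited with a header row, which should be checked against an actual output file. The function name ``significant_orfs`` and the cutoff of 0.05 are illustrative choices, not recommendations from this package::

    import csv
    import io

    def significant_orfs(handle, cutoff=0.05):
        """Return the rows whose pval_combined is below `cutoff`.

        Assumes `handle` yields tab-delimited text with a header row.
        """
        reader = csv.DictReader(handle, delimiter="\t")
        return [row for row in reader if float(row["pval_combined"]) < cutoff]

    # Made-up rows standing in for a result file.
    example = io.StringIO(
        u"ORF_ID\tORF_type\tpval_combined\n"
        u"orf_1\tannotated\t1e-12\n"
        u"orf_2\tuORF\t0.3\n"
    )
    hits = significant_orfs(example)
    print([row["ORF_ID"] for row in hits])

For a real run, replace ``example`` with an open file handle for the result file.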

(4). (Optional) Plot the P-site density of a predicted ORF

Users can plot the P-site density of a predicted ORF using the ``plot_orf_density`` command, as in the example below::

    plot_orf_density -a RiboCode_annot -c config.txt -t (transcript_id) \
                     -s (ORF_gstart) -e (ORF_gstop)

For any questions, please contact:
----------------------------------

Zhengtao Xiao (xzt13@mails.tsinghua.edu.cn)

Rongyao Huang (THUhry12@163.com)

.. _SRA_Toolkit: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
.. _HEK293 dataset: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1630831