Running DegNorm
DegNorm is a command line interface (CLI) tool. You can access it through the degnorm
command and the MPI-enabled
distributed version is degnorm_mpi
.
Inputs
You only need two types of files to supply degnorm
- .bam files (0-indexed) and a .gtf file (1-indexed) if you have samtools
in your $PATH
. If you do not have samtools
, you will also need bam index files to accompany your .bam files.
1. (Optional) index files (.bai)
Before running DegNorm, you will need to sort (and optionally, index) them with samtools sort
. This command just re-orders reads
by chromosome and starting index, so that they can be more useful to DegNorm later on.
To sort an alignment file, suppose it's called S1.bam
, you can sort it with the following command:
# sort alignment files (necessary for indexing them)
samtools sort S1.bam -o S1_sorted.bam
DegNorm requires sorted .bam files because it really needs bam index files. If you don't create them
first, DegNorm will run samtools index
to create an index file for each input alignment file:
# create alignment index files
samtools index S1_sorted.bam S1_sorted.bai
Within the .bam files reads must be 0-indexed.
2. Paired or single-end aligned and sorted reads (.bam)
At least two alignment files (.bam) must be specified, as inter-sample degradation normalization cannot happen on a standalone
RNA-Seq sample. We refer to the total number of samples as p
.
If .bai files are not submitted, degnorm
will look for .bai files named after the .bam files only with the ".bai" extension.
If no such file is found, degnorm
will attempt to build one with the samtools index
command. This will only work if the .bam files are sorted.
Instead of specifying individual .bam and .bai files, you can just specify --bam-dir
, a path to a directory holding the relevant .bam and .bai files.
With --bam-dir
, it is assumed that the .bai files are named according to the .bam files, only with the ".bai" extension.
You can supply paired reads files or single end reads files.
Argument | Required? | Meaning |
---|---|---|
--bam-files |
If neither --warm-start-dir nor --bam-dir are specified |
Set of individual .bam files |
--bai-files |
If samtools is not in the $PATH and neither --warm-start-dir nor --bam-dir are specified |
Set of individual .bai files. If specified, they must be in the order corresponding to --bam-files . |
--bam-dir |
If neither --warm-start-dir nor --bam-files are specified |
Directory containing .bam and .bai files for a pipeline run. It is assumed the .bai files have the same name as the .bam files, just with a different extension. |
3. Genome annotation file (.gtf)
DegNorm needs a .gtf file in order to construct the total transcript and compute all per-gene coverage score curves. It is assumed that the positions in the .gtf file are 1-indexed and that the file abides by the standard .gtf conventions.
Argument | Required? | Meaning |
---|---|---|
-g , --genome-annotation |
Yes | .gtf file for relevant genome. |
Start and end positions with the .gtf file must be 1-indexed. Additionally, the .gtf file must have the 9 standard .gtf fields. Here is an example of a .gtf file we use to test degnorm
:
seqname | source | feature | start | end | score | strand | frame | attribute |
---|---|---|---|---|---|---|---|---|
chr1 | unknown | exon | 323892 | 324060 | . | + | . | gene_id "LOC100133331"; gene_name "LOC100133331"; transcript_id "NR_028327_1"; tss_id "TSS14025"; |
chr1 | unknown | exon | 324288 | 324345 | . | + | . | gene_id "LOC100133331"; gene_name "LOC100133331"; transcript_id "NR_028327_1"; tss_id "TSS14025"; |
chr1 | unknown | exon | 324439 | 326938 | . | + | . | gene_id "LOC100133331"; gene_name "LOC100133331"; transcript_id "NR_028327_1"; tss_id "TSS14025"; |
Specifically, the attribute
field must, at a minimum, contain either a gene_name
or gene_id
attribute so that we can pair exons to uniquely-identified genes.
Also, .gtf records will be filtered down to those whose feature
= "exon".
Using a warm start directory
Loading multiple .bam files, parsing a .gtf file, and computing per-gene cross-sample coverage matrices and read counts, can take some time, but this is a one-time
preprocessing cost given a specific set of RNA-Seq samples. If you run the degnorm
pipeline once, you can leverage
the stored coverage, read counts, and parsed transcript data from an existing DegNorm output directory to start a a new DegNorm run where all of the
preprocessing is completed ahead of time. When using --warm-start-dir
you do not need to supply a .gtf file.
Argument | Required? | Meaning |
---|---|---|
--warm-start-dir |
No | A directory holding the contents of a successful previous degnorm pipeline run. When specified, there is no need to supply a .gtf file. |
Runtime and algorithmic parameters
Argument | Required? | Meaning |
---|---|---|
-o , --output-dir |
No | Output directory. Default is current working directory. Directory will be created if it doesn't currently. If not supplied or if directory already exists, DegNorm creates a new output directory degnorm_mmddYYYY_HHMMSS . |
--plot-genes |
No | Names of genes for which to render coverage plots. Sequence of explictly stated gene names or a .txt file containing one gene name per line. |
-d , --downsample-rate |
No | Integer downsampling rate. Systematic samples of a coverage matrix are used to speed up NMF iterations. |
--nmf-iter |
No | Number of iterations per NMF-OA approximation. The higher the more accurate the approximation, but the more costly in terms of time. |
--iter |
No | Number of whole DegNorm iterations. Default is 5. |
--minimax-coverage |
No | Minimum cross-sample maximum coverage for a gene before it is included in the DegNorm pipeline. Can be used to exclude relatively low-coverage genes. |
-s , --skip-baseline-selection |
No | Flag to skip baseline selection, will greatly speed up DegNorm iterations. |
--non-unique-alignments |
No | Flag, allow non-uniquely mapped reads. Otherwise, DegNorm only keeps reads with NH (number of hits) == 1 (default behavior). |
-p , --proc-per-node |
No | Integer number of processes to spawn per compute node. The more the better. Defaults to a lonely 1. |
Example usage
Required: sorting, indexing alignment files
Remember - prior to running degnorm
, you must sort
your aligned reads and create index files (.bai files).
Suppose we have set of .bam files in our current working directory that we would like to send through DegNorm, the following
commands will sort the .bam files and create .bai files, all with samtools
:
# sort and index alignment files in current working directory.
for FILE in ./*.bam
do
echo 'sorting '$FILE
samtools sort $FILE -o ${FILE/.bam/}'_sorted.bam'
echo 'indexing '${FILE/.bam/}'_sorted.bam'
samtools index ${FILE/.bam/}'_sorted.bam' ${FILE/.bam/}'_sorted.bai'
done
Launching a full degnorm
run "the verbose way"
Next, after the alignment files have been sorted (and optionally indexed), let's run degnorm
"the verbose way" by enumerating
each .bam and .bai file individually. Suppose we have two sorted alignment files, S1_sorted.bam
and S2_sorted.bam
, and that we want
to save the results of the pipeline to a new directory called degnorm_output
.
# run degnorm on two samples on 20 cores, using 100 NMF iterations per gene
degnorm --bam-files S1_sorted.bam S2_sorted.bam \
--bai-files S1_sorted.bai S2_sorted.bai \
-g human.gtf \
-p 20 \
-o degnorm_output
If we did not specify any --bai-files
, then DegNorm would automatically search for files "S1_sorted.bai" and "S2_sorted.bai" because they're named after
the alignment files, "S1_sorted.bam" and "S2_sorted.bam." If those index files cannot be found, they will be created in the same directory
containing the input .bam files, only if samtools
is in your $PATH
.
Launching a full degnorm
run "the easy way"
It's probably just easier to supply degnorm
with an input data directory containing all of our .bam files, instead of each .bam file individually.
Suppose that data directory is located at degnorm_data/GBM
.
Let's also use 5 DegNorm iterations, 50 NMF iterations per gene per NMF iteration, and route output to a directory besides the current working directory.
degnorm --bam-dir degnorm_data/GBM \
-g human.gtf \
-p 20 \
--nmf-iter 50 \
-o degnorm_output
Using a warm start directory
After one pipeline run, we could start degnorm
from the output directory resulting from the first run - this directory is known as the "warm start" directory. This time, let's exclude genes with a maximum coverage
(across all samples) less than 20, and run DegNorm with only 3 iterations. In this example, the DegNorm_102118_081045
warm start directory would have been generated from a prior DegNorm run.
degnorm --warm-start-dir degnorm_output/DegNorm_102118_081045 \
-p 20 \
-o degnorm_output \
--minimax-coverage 20 \
--iter 3
Launching distributed runs with MPI
Let's speed this up by distributing DegNorm's workload over multiple servers with degnorm_mpi
!
If the mpi4py
python package has been installed (meaning that MPICH is available in your compute environment).
You can still specify the number of cores to use on each node with -p
.
# run degnorm_mpi using 4 nodes ("-n 4")
mpiexec -n 4 degnorm_mpi \
--warm-start-dir degnorm_output/DegNorm_GBM_102018 \
-p 20 \
-o degnorm_output \
--minimax-coverage 20
--nmf-iter 100
The functionality of degnorm_mpi
is the same as the single-node degnorm
command, so you can continue using warm start directories just like you could with degnorm
.
If you are unfamiliar with MPI, you may find Wes Kendall's tutorials very enlightening.
Example job submission script
Below is an example of a shell script containing the contents of a job you might submit to a job manager (e.g. MOAB). Note that additional steps would be required to convert this script into a SLURM job submission script.
Output
Raw and estimated coverage data are stored in separate .pkl
files, one file per chromosome. See the posthoc analysis
documentation for helper functions to access coverage data.
A gene that has 100% missing coverage (no coverage found in any RNA-seq sample) will not be run through DegNorm for numerical stability purposes and will therefore not have a
coverage matrix stored in any of the .pkl
files.
In addition to per-gene coverage matrices, degnorm
will produce raw and degradation-adjusted read counts, a matrix of degradation index scores,
a matrix describing which genes were sent through the baseline selection procedure and when, along with various summary graphics and a pipeline summary report.
The report will render as an .html file, but if you have pandoc
installed
it will be converted to a .pdf.
├── degnorm.log
├── degradation_index_scores.csv
├── ran_baseline_selection.csv # matrix of Booleans: which genes go thru baseline selection
├── gene_exon_metadata.csv # chromosome, gene, exon relationship data
├── read_counts.csv # raw read counts
├── adjusted_read_counts.csv # degradation normalized read counts
├── chr1
│ ├── GAPDH_coverage.png
│ ├── <more coverage plots>
│ ├── coverage_matrices_chr1.pkl # (gene, raw coverage matrix) pairs in serialized python dictionary.
│ └── estimated_coverage_matrices_chr1.pkl # (gene, estimated coverage matrix) pairs in serialized python dictionary.
├── chr2
│ ├── coverage_matrices_chr2.pkl
│ └── estimated_coverage_matrices_chr2.pkl
├── chr3
│ ├── coverage_matrices_chr3.pkl
│ └── estimated_coverage_matrices_chr3.pkl
├── chr4
│ ├── GAPDH_coverage.png
│ ├── coverage_matrices_chr4.pkl
│ └── estimated_coverage_matrices_chr4.pkl
├── <more chromosome directories>
│ ├── <more raw coverage matrix .pkl files>
│ └── <more estimated coverage matrix .pkl files>
└── report
├── degnorm_summary.pdf # (or .html if pandoc not available)
├── di_boxplots.png
├── di_correlation.png
└── di_heatmap.png