The degnorm.data_access module
After a DegNorm pipeline run, the output directory holds a large amount of raw and estimated coverage curve data, stored as dictionaries in .pkl files. Because we can't know before a run which genes a researcher will be interested in, we provide a couple of easy-to-use functions for loading and visualizing the coverage data on demand.
By default, degnorm creates coverage curve plots for the genes with the five highest and five lowest mean degradation index scores. To access coverage data for other genes, use the get_coverage_data and/or get_coverage_plots functions.
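Before loading anything, it can help to confirm which .pkl files a run actually left behind. The sketch below uses only the Python standard library; list_pickle_files is a hypothetical helper, not part of degnorm, and it assumes nothing about the directory layout beyond the .pkl extension.

```python
from pathlib import Path


def list_pickle_files(degnorm_dir):
    """Return sorted paths of every .pkl file found under degnorm_dir.

    Hypothetical helper for illustration; not part of the degnorm package.
    """
    root = Path(degnorm_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"No such directory: {root}")
    # Search recursively: the exact layout of the output directory is not
    # assumed here, only that coverage data is stored in .pkl files.
    return sorted(root.rglob("*.pkl"))
```

Calling list_pickle_files(degnorm_dir) on a real output directory returns a list of pathlib.Path objects that you can print or filter.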
get_coverage_data
Should you need the raw or DegNorm-estimated coverage matrices generated by a degnorm run, the get_coverage_data function can load coverage data for individual genes or for every gene, with the option to save those matrices to individual .txt files. Note: DegNorm only provides coverage for genes that were determined to have nonzero coverage in at least one sample.
from degnorm.data_access import get_coverage_data
degnorm_dir = '/DegNorm_09022018_214247'
# pass one gene name or a list of gene names to load a dictionary of coverage matrices
cov_dat = get_coverage_data('TMEM229B', degnorm_dir=degnorm_dir)
You can automatically save those coverage matrices as .txt files by specifying the save_dir argument. Raw and estimated coverage matrices are stored in per-chromosome subdirectories as tall DataFrames with sample identifiers serving as the headers. By setting save_dir=degnorm_dir, you can write the .txt files back to the original DegNorm output directory.
save_dir = 'FFvsFFPE_data'
# save one gene's coverage matrices to .txt files
cov_dat = get_coverage_data('TMEM229B', degnorm_dir=degnorm_dir, save_dir=save_dir)
# save all genes' coverage data to .txt files
cov_dat = get_coverage_data('all', degnorm_dir=degnorm_dir, save_dir=save_dir)
The returned object cov_dat is a dictionary keyed by gene name. Each gene's value is itself a dictionary with raw and estimate entries, each holding a pandas.DataFrame of coverage.
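To make the nested structure concrete, here is a minimal sketch that walks a dictionary shaped like cov_dat and records the dimensions of each matrix. summarize_coverage is a hypothetical helper, not part of degnorm, and it assumes the inner keys are the strings 'raw' and 'estimate', as the description above suggests.

```python
def summarize_coverage(cov_dat):
    """Map each gene to the shapes of its raw and estimate matrices.

    Hypothetical helper for illustration; not part of the degnorm package.
    Assumes cov_dat looks like {gene: {'raw': df, 'estimate': df}}, where
    each df exposes a .shape attribute (as pandas.DataFrame does).
    """
    summary = {}
    for gene, matrices in cov_dat.items():
        summary[gene] = {
            kind: matrices[kind].shape for kind in ("raw", "estimate")
        }
    return summary
```

Comparing the raw and estimate shapes for a gene is a quick sanity check that both matrices cover the same samples and positions.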
get_coverage_plots
Should you need coverage plots in addition to the ones generated during a DegNorm pipeline run, get_coverage_plots uses the coverage matrix data in the .pkl files to make new pre- and post-DegNorm coverage curve plots. Use it the same way as get_coverage_data: pass one gene, a list of multiple genes, or the string 'all' to plot every gene's coverage. If you don't save the plots (that is, you omit the save_dir argument), the function returns a list of plots.
from degnorm.data_access import get_coverage_plots
plots = get_coverage_plots(['SDF4', 'TMEM229B'], degnorm_dir=degnorm_dir)
plots[0].show()
Most pipeline runs involve thousands of genes, so rendering a plot for every gene will likely take some time.
You can also save the resulting plots by setting the save_dir argument to the path of the directory where you want them saved. Each gene's plot is saved in a chromosome-level directory. You can still use genes='all' to generate a coverage plot for every gene.
out = get_coverage_plots('all', degnorm_dir=degnorm_dir, save_dir=degnorm_dir)
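Because plotting every gene can be slow, one option is to work through your own gene list in smaller batches. The sketch below is a generic, standard-library batching helper; batched is a hypothetical name and not part of degnorm.

```python
from itertools import islice


def batched(genes, size):
    """Yield successive lists of at most `size` gene names.

    Hypothetical helper for illustration; not part of the degnorm package.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(genes)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each batch is a plain list of gene names, so it can be passed as the first argument to get_coverage_plots along with degnorm_dir and save_dir, letting you check the saved plots as each batch finishes.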