The degnorm.data_access module

Once you've run the DegNorm pipeline and obtained an output directory, a large amount of raw and estimated coverage curve data is stored in dictionaries saved as .pkl files. Because we can't know before a DegNorm run which genes a researcher will be interested in, we provide two easy-to-use functions to grab and visualize the coverage data on the fly.

By default, degnorm will create coverage curve plots for the genes with the top-5 and bottom-5 mean degradation index scores. Should you want access to more genes' coverage data, use the get_coverage_data and/or get_coverage_plots functions.


get_coverage_data

Should you need the raw or DegNorm-estimated coverage matrices generated by a degnorm run, the get_coverage_data function can load up individual genes' or every gene's coverage data with the option to save those matrices to individual .txt files. Note: DegNorm only provides coverage for genes that were determined to have nonzero coverage in at least one sample.

from degnorm.data_access import get_coverage_data

degnorm_dir = '/DegNorm_09022018_214247'

# pass one or many gene names, load up coverage matrix dictionary
cov_dat = get_coverage_data('TMEM229B'
                            , degnorm_dir=degnorm_dir)

You can automatically save those coverage matrices as .txt files by specifying the save_dir argument. Raw and estimated coverage matrices are stored in per-chromosome subdirectories as tall DataFrames with sample identifiers serving as the headers. By setting save_dir=degnorm_dir, you can write the .txt files back to the original DegNorm output directory.

save_dir='FFvsFFPE_data'

cov_dat = get_coverage_data('TMEM229B'
                            , degnorm_dir=degnorm_dir
                            , save_dir=save_dir)

# save all genes' coverage data to .txt files
cov_dat = get_coverage_data('all'
                            , degnorm_dir=degnorm_dir
                            , save_dir=save_dir)
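
To check what was written, the saved files can be listed by subdirectory. The following is a minimal sketch using only the standard library. It assumes only what is described above (one subdirectory per chromosome, files ending in .txt) and makes no assumption about individual file names. The helper list_saved_coverage is hypothetical and not part of degnorm.

```python
from collections import defaultdict
from pathlib import Path


def list_saved_coverage(save_dir):
    """Map each subdirectory name (assumed: one per chromosome) to its .txt file names."""
    files_by_chrom = defaultdict(list)
    # '*/*.txt' matches .txt files exactly one directory level below save_dir.
    for txt_path in sorted(Path(save_dir).glob('*/*.txt')):
        files_by_chrom[txt_path.parent.name].append(txt_path.name)
    return dict(files_by_chrom)
```

Calling list_saved_coverage(save_dir) after the snippet above should return one entry per chromosome subdirectory. If save_dir points at the DegNorm output directory itself, any unrelated .txt files already sitting in its subdirectories would be listed as well.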

The returned object cov_dat is a dictionary keyed by gene name. Each gene's value is itself a dictionary holding that gene's raw and estimated coverage matrices as pandas.DataFrame objects.
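
A quick way to inspect this nested structure is to walk it and record each matrix's dimensions. The sketch below assumes only that every inner value exposes a .shape attribute, as pandas.DataFrame does. The inner key names used here ('raw' and 'estimate') are an assumption, and the stand-in objects exist only so the example runs without a DegNorm output directory. summarize_coverage is a hypothetical helper, not part of degnorm.

```python
from types import SimpleNamespace


def summarize_coverage(cov_dat):
    """Return (gene, matrix name, n_rows, n_cols) for every matrix in cov_dat."""
    rows = []
    for gene, matrices in cov_dat.items():
        for name, mat in matrices.items():
            n_rows, n_cols = mat.shape
            rows.append((gene, name, n_rows, n_cols))
    return rows


# Stand-in with the same nesting as cov_dat; real values are pandas.DataFrames.
# The inner key names 'raw' and 'estimate' are assumed, not confirmed.
example = {
    'TMEM229B': {
        'raw': SimpleNamespace(shape=(1200, 4)),
        'estimate': SimpleNamespace(shape=(1200, 4)),
    }
}

for row in summarize_coverage(example):
    print(row)
```

With a real cov_dat, the same function can be called directly, since DataFrames provide .shape. Checking the actual inner keys first (for example with list(cov_dat['TMEM229B'].keys())) avoids relying on the assumed names.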

get_coverage_plots

Should you need coverage plots in addition to the ones generated during a DegNorm pipeline run, get_coverage_plots uses the coverage matrix data in the .pkl files to make new pre- and post-DegNorm coverage curve plots. Use it the same way as get_coverage_data: pass one gene, a list of multiple genes, or the string 'all' to plot every gene's coverage. If you do not supply the save_dir argument, the plots are not saved and the function returns them as a list.

from degnorm.data_access import get_coverage_plots

plots = get_coverage_plots(['SDF4', 'TMEM229B']
                           , degnorm_dir=degnorm_dir)
plots[0].show()

Most pipeline runs involve thousands of genes, so rendering a plot for every gene will likely take some time.
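
One way to keep rendering time manageable is to plot genes in small batches rather than all at once. The sketch below shows a generic batching helper. chunked is a hypothetical name, not part of degnorm, and the batch size is arbitrary. The commented-out call mirrors the get_coverage_plots usage shown above.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


genes = ['SDF4', 'TMEM229B']  # replace with your own gene names

for batch in chunked(genes, size=1):
    # With degnorm installed, each batch could be plotted in turn, e.g.:
    # plots = get_coverage_plots(batch, degnorm_dir=degnorm_dir)
    print('would plot:', batch)
```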

You can also save the resulting plots by setting the save_dir argument to the path of the directory where you want them written. Each gene's plot is saved in a chromosome-level subdirectory. You can still use genes='all' to generate a coverage plot for every gene.

out = get_coverage_plots('all'
                         , degnorm_dir=degnorm_dir
                         , save_dir=degnorm_dir)