Tools to work with cytosine coverage files

Bismark’s coverage2cytosine tools creates very useful cytosine coverage reports that give the coverage of methylated and unmethylated reads at each cytosine. The class CytosineCoverageFile provides various tools for extractin information of interest from these files.

Import a coverage file

To import a coverage file to memory give a path to a cytosine2coverage-format file from Bismark. standard format is to have no header (this will be added), but seven columns giving chromosome, base-pair of each cytosine position on the chromosome, strand (+/-), number of methylated reads, number of unmethylated reads, sequence context (CG, CHG or CHH) and tricnucleotide context (for example, for CHG methylation this could be CAG, CCG or CTG).

import methlab as ml    path="tests/test_data/test_coverage2cytosine.txt.gz"
c2c = CytosineCoverageFile(path)
# Access the whole file, if you really must (it's big!)
c2c.file

Conversion rate on each chromosome

Calculate overall methylation on each chromosome separately. This is especially useful for calculating conversion rates by quantifying methylation on the chloroplast.

# Return proportion of methylation on each chromosome
c2c.conversion_rate()

You can also return raw read counts instead of proportions:

c2c.conversion_rate(return_proportion = False)

Here is an example of the output for chromosome 1 only. Notice that the format is the proportion of methylated rather than unmethylated cytosines, so it isn’t really conversion rate in that the strict sense.

    id   context      meth  unmethylated  ncytosines  n_reads
    Chr1      CG  0.356164      0.643836          82       73
    Chr1     CHG  0.333333      0.666667          68       54
    Chr1     CHH  0.144715      0.855285         818      615
    Chr1   total  0.179245      0.820755         968      742

Methylation in windows

Count the number of methylated and unmethylated reads and number of cytosines in CG, CHG and CHH contexts in windows of fixed size across the genome.

# Methylation in 150-bp windows
windows = c2c.methylation_in_windows(150)

Here is an example of the output. You can see that windows are indxed by their start position.

        chr   start context  meth  unmethylated  ncytosines
    0   Chr1      0      CG    22             4          30
    1   Chr1      0     CHG    15             8          24
    2   Chr1      0     CHH    48           228         273
    3   Chr1      0   total    85           240         327
    4   Chr1   1000      CG     0             9          14
    5   Chr1   1000     CHG     2            17          28
    6   Chr1   1000     CHH    14           115         275
    7   Chr1   1000   total    16           141         317
    8   Chr1   2000      CG     4            34          38
    9   Chr1   2000     CHG     1            11          16
    10  Chr1   2000     CHH    27           183         270
    11  Chr1   2000   total    32           228         324

Methylation over annotated features

Count the number of methylated and unmethylated reads and number of cytosines in CG, CHG and CHH contexts in each of multiple annotated features (e.g. genes, TEs). Usually such information would be described by a file in .bed or .gff format, but to keep it general we just use vectors of start and stop coordinates that are easily extracted from such files.

Here is an example using the first ten lines of the TAIR10 annotation file

gff_file = pd.read_csv(
    "tests/test_data/test_TAIR10_GFF3_genes_transposons.gff",
    sep="\t",
    names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    ).iloc[1:9] # Skip the first row, because it defines the whole chromosome

# Example coverage file
meth_counts = c2c.methylation_over_features(
    chr = gff_file['seqid'],
    start = gff_file['start'],
    stop = gff_file['end']
    )

Here is the output for the first three features.

            id   context  meth  unmethylated  ncytosines
    0   feature0      CG     2            14          20
    1   feature0     CHG     0             4           8
    2   feature0     CHH     6            78         106
    3   feature0   total     8            96         134
    4   feature1      CG     2            14          20
    5   feature1     CHG     0             4           8
    6   feature1     CHH     6            78         106
    7   feature1   total     8            96         134
    8   feature2      CG     2            13          18
    9   feature2     CHG     0             4           6
    10  feature2     CHH     3            53          64
    11  feature2   total     5            70          88

This shows the generic default output for the feature name (the id column) because the there are no useful IDs to use in the GFF file. You can optionally supply a vector of names using the argument names.