Discrete methylation states

We wish to infer whether some region or window of DNA is in one of several discrete states. For example, one model of plant methylation is that DNA is

  • Unmethylated (no converted reads in any sequence context)

  • Gene-body-like, or CG-only methylated (converted reads on CG sites, but not CHG or CHH sites)

  • TE-like methylated (converted sites in CG, CHG and CHH contexts)

This is challenging when there are substantial non-conversion errors, especially when these vary across the genome. Fortunately, for this task we do not need to accurately estimate true methylation level. Instead, we need only ask whether the data are consistent with non-conversion errors alone, or with the cumulative effect of errors plus some additional process.

Note that it is also possible that there may be false conversion errors (cytosines that are observed as converted to thymine, but are truly unmethylated). The current implementation can handle non-conversion errors only.

Data

Functions from CytosineCoverageFile will give you data that looks something like this, summarising feature name, sequence context, numbers of converted and unconverted reads, and total number of cytosines.

              id context  unconverted  converted  ncytosines
    0   feature0      CG     2            14          20
    1   feature0     CHG     0             4           8
    2   feature0     CHH     6            78         106
    3   feature0   total     8            96         134
    4   feature1      CG     2            14          20
    5   feature1     CHG     0             4           8
    6   feature1     CHH     6            78         106
    7   feature1   total     8            96         134
    8   feature2      CG     2            13          18
    9   feature2     CHG     0             4           6
    10  feature2     CHH     3            53          64
    11  feature2   total     5            70          88

Errors as a beta distribution

We assume non-conversion errors vary across the genome, such that we can’t know the precise error rate at any single region. However, we do have a good idea of the distribution of error rates across the genome, and model this as a beta distribution. Whereas a binomial distribution describes the probability of observing binary data given some mean, a beta distribution describes the uncertainty in that mean.

A beta distribution is described by shape parameters a and b. These can be estimated by method-of-moments based on the mean and variance of a samples from a beta distribution. This code imports an example file from the output of CytosineCoverageFile.methylation_over_features() estimated in windows from control DNA known to be unmethylated, and estimates mean non-conversion in each window.

import pandas as pd
import methlab as ml
read_counts = pd.read_csv("tests/test_data/gene_read_counts.csv")
# Subset to include only rows showing total counts for all cytosines
read_counts = read_counts.loc[read_counts['context'] == "total"]
# Estimate the proportion of unconverted reads
read_counts['n'] = read_counts['unconverted'] + read_counts['converted']
read_counts['theta'] = read_counts['unconverted'] / read_counts['n']

Estimate the mean error rates within each window, and the variance between windows

mu = read_counts['theta'].mean()
sigma = read_counts['theta'].var()

This ignores differences in coverage and number of cytosines between windows. You might want to include those as weights.

estimate_beta_parameters() uses these values to estimate the shape parameters of the beta distribution.

ab_errors = ml.estimate_beta_parameters(mu, sigma)

It returns a tuple of two numbers corresponding to the a and b parameters.

Methylation state

We can use the table of read counts and the shape of the beta distribution to calculate likelihoods that each feature/window in the table is unmethylated, gene-body-like methylated or TE-like methylated.

ml.methylation_state(read_counts, ab_errors)

Here is the output as a Pandas dataframe:

              id  coverage  unmethylated    CG-only    TE-like
    0  AT1G01010       746     -5.562562  -5.935396  -8.910764
    1  AT1G01020       840     -4.486843  -5.526963  -9.010684
    2  AT1G01030       882     -4.962632  -6.173149  -9.211579
    3  AT1G01040      2068     -8.321224  -7.392753 -10.439014
    4  AT1G01046        25     46.701561  19.947363 -33.561034
    5  AT1G01050       634     -3.166360  -3.806465  -8.671619
    6  AT1G01060      1156     -8.514358  -8.252310  -9.607966
    7  AT1G01070      1052     -6.517405  -7.524795  -9.311916
    8  AT1G01073        34    -29.364506  -5.335205  -3.258661
    9  AT1G01080       643     -5.348341  -6.894794  -8.887364

The last three columns give (log) likelihoods that each gene is in each of the three states. If you include the argument return_probabilities=True log likelihoods are converted to probabilities that sum to one. If you include the argument hard_calls=True an additional column is added giving the most likely state. This can be convenient, but be aware that generally you probably want to quantify the evidence for each hypothesis rather than make hard calls like this, which is why it defaults to False.