Divergence Reference

API Documentation for the divergence submodule

Jensen-Shannon Divergence

Functions for calculating the Jensen-Shannon divergence between two sampled distributions

metworkpy.divergence.js_divergence_functions.js_divergence(p: ArrayLike, q: ArrayLike, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, permutation_rng: Generator | int | None = None, permutation_estimation_method: Literal['kernel', 'empirical'] = 'empirical', n_neighbors: int = 5, discrete: bool = False, jitter: float | None = None, jitter_seed: int | None = None, distance_metric: float | str = 'euclidean', clip: bool = False) → float | DivergenceResult

Calculate the Jensen-Shannon divergence between two distributions represented by samples p and q

Parameters:

p (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If p is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
q (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If q is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
calculate_pvalue (bool, default=False) – Whether the p-value should be calculated using a permutation test
alternative ('less', 'greater', or 'two-sided', default='greater') – The alternative hypothesis to use, see metworkpy.utils.permutation.permutation_test
permutations (int, default=500) – The number of permutations to use when calculating the p-value
permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.
permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating p-value, either an empirical cdf, or a gaussian_kde
n_neighbors (int) – Number of neighbors to use for estimating the divergence. Will attempt to coerce into an integer. Must be at least 1. Default 5.
discrete (bool) – Whether the samples are from discrete distributions
jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None, no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to p, and the second element is the standard deviation of the noise added to q.
jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise
distance_metric (Union[str, float]) – Metric to use for computing distance between points in p and q, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.
clip (bool, default=False) – Whether or not to clip the divergence values at 0.0

Returns:

The Jensen-Shannon divergence between p and q

Return type:

float or DivergenceResult
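
As an illustration of the quantities documented above, the following minimal sketch computes a plug-in Jensen-Shannon divergence for discrete samples and a permutation p-value for the 'greater' alternative. It uses only the Python standard library and is not metworkpy's implementation: the helper names discrete_js_divergence and permutation_pvalue, and the (1 + count) / (1 + permutations) p-value formula, are assumptions made for this example.

```python
import math
import random
from collections import Counter


def discrete_js_divergence(p_sample, q_sample):
    """Plug-in Jensen-Shannon divergence (in nats) between two discrete samples."""
    p_counts, q_counts = Counter(p_sample), Counter(q_sample)
    n_p, n_q = len(p_sample), len(q_sample)
    divergence = 0.0
    for value in set(p_counts) | set(q_counts):
        p = p_counts[value] / n_p  # empirical probability under p
        q = q_counts[value] / n_q  # empirical probability under q
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log(q / m)
    return divergence


def permutation_pvalue(p_sample, q_sample, permutations=500, seed=None):
    """Return the observed divergence and a one-sided ('greater') p-value."""
    rng = random.Random(seed)
    observed = discrete_js_divergence(p_sample, q_sample)
    pooled = list(p_sample) + list(q_sample)
    n_p = len(p_sample)
    at_least_as_large = 0
    for _ in range(permutations):
        # Shuffle the pooled samples and split them into two pseudo-groups
        rng.shuffle(pooled)
        permuted = discrete_js_divergence(pooled[:n_p], pooled[n_p:])
        if permuted >= observed:
            at_least_as_large += 1
    # Assumed convention: count the observed value as one of the permutations
    return observed, (1 + at_least_as_large) / (1 + permutations)


p_sample = [0] * 40 + [1] * 10
q_sample = [0] * 10 + [1] * 40
divergence, pvalue = permutation_pvalue(p_sample, q_sample, permutations=200, seed=0)
```

With natural logarithms the Jensen-Shannon divergence is bounded above by ln 2, which is reached when the two samples share no values.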

Kullback-Leibler Divergence

Function for calculating the Kullback-Leibler divergence between two probability distributions based on samples from those distributions.

Calculate the Kullback-Leibler divergence between two distributions represented by samples p and q

Parameters:

p (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If p is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
q (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If q is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
calculate_pvalue (bool, default=False) – Whether the p-value should be calculated using a permutation test
alternative ('less', 'greater', or 'two-sided', default='greater') – The alternative hypothesis to use, see metworkpy.utils.permutation.permutation_test
permutations (int, default=9999) – The number of permutations to use when calculating the p-value
permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.
permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating p-value, either an empirical cdf, or a gaussian_kde
n_neighbors (int) – Number of neighbors to use for estimating the divergence. Will attempt to coerce into an integer. Must be at least 1. Default 5.
discrete (bool) – Whether the samples are from discrete distributions
jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None, no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to p, and the second element is the standard deviation of the noise added to q.
jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise
distance_metric (Union[str, float]) – Metric to use for computing distance between points in p and q, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.
clip (bool, default=False) – Whether or not to clip the divergence values at 0.0

Returns:

The Kullback-Leibler divergence between p and q

Return type:

float

Notes

This function is not symmetric: p is treated as the ‘true’ distribution and q as an approximating distribution. If you want a symmetric measure, try the Jensen-Shannon divergence.
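
The parameters n_neighbors, distance_metric, jitter, and clip suggest a nearest-neighbor estimator. As a hedged illustration, the sketch below implements one well-known k-nearest-neighbor estimator of the Kullback-Leibler divergence (Wang, Kulkarni and Verdú) with NumPy and the Euclidean metric. Whether metworkpy uses this exact estimator is an assumption, and the function name knn_kl_divergence is hypothetical. The example also shows the asymmetry described in the note above.

```python
import numpy as np


def knn_kl_divergence(p, q, n_neighbors=5, clip=False):
    """k-NN estimate of KL(p || q) from samples (Euclidean distances, brute force)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # One-dimensional input is treated as (n_samples, 1)
    if p.ndim == 1:
        p = p.reshape(-1, 1)
    if q.ndim == 1:
        q = q.reshape(-1, 1)
    n, d = p.shape
    m = q.shape[0]
    # Pairwise distances within p, and from each point of p to each point of q
    dist_pp = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    dist_pq = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    # Column 0 of the sorted within-p distances is the point itself (distance 0),
    # so column n_neighbors is the k-th neighbor excluding the point
    rho = np.sort(dist_pp, axis=1)[:, n_neighbors]
    nu = np.sort(dist_pq, axis=1)[:, n_neighbors - 1]
    estimate = d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
    # The estimate can be slightly negative for similar samples; clip if requested
    return max(float(estimate), 0.0) if clip else float(estimate)


rng = np.random.default_rng(0)
narrow = rng.normal(0.0, 1.0, size=500)
wide = rng.normal(0.0, 3.0, size=500)

forward = knn_kl_divergence(narrow, wide)  # narrow treated as the 'true' distribution
reverse = knn_kl_divergence(wide, narrow)  # wide treated as the 'true' distribution
```

In an estimator of this kind, tied sample values produce zero distances and therefore undefined logarithms, which is the situation a jitter option guards against.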

Group Divergence

Submodule containing functions which will calculate the divergence between two dataframes for groups of columns

metworkpy.divergence.group_divergence.calculate_divergence_grouped(dataset1: DataFrame, dataset2: DataFrame, divergence_groups: dict[str, list[Hashable]], divergence_type: Literal['kl', 'js'] = 'kl', calculate_pvalue: bool = False, processes: int = 1, **kwargs) → Series | Tuple[Series, Series]

Calculate the divergence between data in two dataframes for a set of groups of columns

Parameters:

dataset1 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
dataset2 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
divergence_groups (dict of str to list of Hashable) – The groups to calculate divergence for, indexed by name of the group, with values of lists of features that belong to the group (the feature names must match names of columns in the dataframes)
divergence_type ('kl' or 'js') – The type of divergence to calculate, either kl for Kullback-Leibler (default) or js for Jensen-Shannon
calculate_pvalue (bool) – Whether to calculate the p-value of the divergence using permutation testing
processes (int) – The number of processes to use for the calculation
kwargs – Keyword arguments passed into the divergence method function either kl_divergence or js_divergence depending on divergence_type

Returns:

divergence – A pandas Series indexed by group name, with values representing the divergence of that group between the two dataframes. If calculate_pvalue is True, a tuple of two pandas Series is returned instead, the first holding the divergence results and the second the p-values.

Return type:

pd.Series or tuple of (pd.Series, pd.Series)

Notes

The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager

metworkpy.divergence.group_divergence.calculate_reaction_neighborhood_divergence(model: Model, dataset1: DataFrame, dataset2: DataFrame, divergence_type: Literal['kl', 'js'] = 'kl', directed: bool = False, nodes_to_remove: list[str] | None = None, radius: int = 2, calculate_pvalue: bool = False, processes: int = 1, **kwargs)

Calculate the divergence between data in two dataframes for the neighborhood of each reaction in a model

Parameters:

model (cobra.Model) – Cobra model used to create the reaction network in which the reaction neighborhoods are found
dataset1 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
dataset2 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
divergence_type ('kl' or 'js') – The type of divergence to calculate, either kl for Kullback-Leibler (default) or js for Jensen-Shannon
directed (bool) – Whether the reaction network created to find reaction neighborhoods should be directed or not
nodes_to_remove (list[str] | None) – List of any metabolites or reactions that should be removed from the final network. This can be used to remove metabolites that participate in a large number of reactions, but are not desired in downstream analysis such as water, or ATP, or pseudo reactions like biomass. Each metabolite/reaction should be the string ID associated with them in the cobra model.
radius (int) – The radius determining the sizes of the neighborhoods
calculate_pvalue (bool) – Whether to calculate the p-value of the divergence difference using permutation testing
processes (int) – The number of processes to use for the calculation
kwargs – Keyword arguments passed into the divergence method function either kl_divergence or js_divergence depending on divergence_type

Returns:

divergence – A pandas Series indexed by group name, with values representing the divergence of that group between the two dataframes. If calculate_pvalue is True, a tuple of two pandas Series is returned instead, the first holding the divergence results and the second the p-values.

Return type:

pd.Series or tuple of (pd.Series, pd.Series)

Notes

The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
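
To make the roles of radius and nodes_to_remove concrete, the following sketch builds a toy reaction graph and collects neighborhoods with a breadth-first search. It is a simplified illustration using only the standard library: the helpers build_reaction_graph and neighborhood and the toy network are hypothetical, and how metworkpy actually constructs the network and measures the radius is an assumption here.

```python
from collections import deque


def build_reaction_graph(reaction_metabolites, nodes_to_remove=()):
    """Connect two reactions when they share a metabolite that was not removed."""
    removed = set(nodes_to_remove)
    kept = [r for r in reaction_metabolites if r not in removed]
    adjacency = {reaction: set() for reaction in kept}
    for first in kept:
        for second in kept:
            shared = reaction_metabolites[first] & reaction_metabolites[second]
            if first != second and shared - removed:
                adjacency[first].add(second)
    return adjacency


def neighborhood(adjacency, start, radius):
    """Return every node within `radius` edges of `start` (breadth-first search)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand past the radius
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Toy network: each reaction maps to the metabolites it involves
toy_network = {
    "R1": {"A", "B", "h2o"},
    "R2": {"B", "C"},
    "R3": {"C", "D"},
    "R4": {"D", "E", "h2o"},
}
with_water = build_reaction_graph(toy_network)
without_water = build_reaction_graph(toy_network, nodes_to_remove=["h2o"])
```

In this toy network, water links R1 directly to R4; removing it leaves only the chain R1, R2, R3, R4, so the radius-1 neighborhood of R1 shrinks. Each neighborhood would then serve as one group of columns for the divergence calculation.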

Knockout Divergence

Determine the divergence in the network caused by a gene knock-out

metworkpy.divergence.ko_divergence_functions.ko_divergence(model: Model, target_networks: list[str] | dict[str, list[str]], genes_to_ko: Iterable[str] | None = None, divergence_type: Literal['js', 'kl'] = 'kl', calculate_pvalue: bool = False, sample_count: int = 1000, progress_bar: bool = False, use_unperturbed_as_true: bool = True, sampler_seed: Generator | int | None = None, sampler_kwargs: dict[str, Any] | None = None, processes: int = 1, **kwargs) → DataFrame | Tuple[DataFrame, DataFrame]

Determine the impacts of gene knock-outs on different target reaction or gene networks

Parameters:

model (cobra.Model) – Base cobra model to test effects of gene knockouts on
target_networks (list[str] | dict[str, list[str]]) – Target networks to investigate the impact of the gene knock-outs on. Can be a list (specifying a single network) or a dict of lists (specifying one network per key). Entries in the lists can be either reaction or gene ids; gene ids will be translated into reaction ids using the model. If a list is passed, the target network is named target_network in the returned dataframe; if a dict is passed, the keys are used as the column names.
genes_to_ko (Iterable[str], optional) – List of genes to investigate impact of their knock-out, defaults to all genes in the model
divergence_type ('kl' or 'js', default='kl') – Which metric to use for divergence, can be ‘kl’ for Kullback-Leibler (default) or ‘js’ for Jensen-Shannon
calculate_pvalue (bool, default=False) – Whether to calculate the significance value for the divergence
sample_count (int) – The number of samples to take in order to estimate the divergence
progress_bar (bool) – Whether a progress bar is desired
use_unperturbed_as_true (bool, default=True) – Which distribution to use as the “True” distribution (the P distribution) when estimating divergence between the perturbed (that is the model with a gene knock-out) and the unperturbed (model prior to the gene knock-out) flux samples. Doesn’t impact Jensen-Shannon as that is symmetric, but will modify the Kullback-Leibler divergence.
sampler_seed (None or int or np.random.Generator, optional) – Seed used for sampling in order to create reproducible results. Can be a numpy generator (in which case it is used directly) or an integer (in which case it is used to seed a numpy generator).
sampler_kwargs (dict of str to Any) – Arguments passed to the sample method of COBRApy, see COBRApy Documentation
processes (int, default=1) – Number of processes to use for this function, passed to the sampler and also used as the number of processes for calculating the divergence for the different groups. Note that if you want a different number of processes for the sampler, you can use the sampler_kwargs dictionary.
**kwargs – Keyword arguments passed to the divergence method

Returns:

Dataframe with index of genes, and columns representing the different target networks. Values represent the divergence of a particular target network between the unperturbed model and the model following the gene knock-out.

Return type:

pd.DataFrame or tuple of (pd.DataFrame, pd.DataFrame)
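
The handling of target_networks and use_unperturbed_as_true described above can be sketched in plain Python. This is an illustration of the documented behavior, not the library's code: normalize_target_networks, order_samples, and the gene_to_reactions mapping are hypothetical names, and the mapping stands in for the gene-to-reaction lookup that the real function performs with the cobra model.

```python
def normalize_target_networks(target_networks, gene_to_reactions):
    """Turn a list or dict of networks into a dict of reaction-id lists."""
    if not isinstance(target_networks, dict):
        # A bare list is a single network with the default name
        target_networks = {"target_network": list(target_networks)}
    normalized = {}
    for name, entries in target_networks.items():
        reactions = []
        for entry in entries:
            # Gene ids expand to their reactions; other entries are kept as reaction ids
            for reaction in gene_to_reactions.get(entry, [entry]):
                if reaction not in reactions:
                    reactions.append(reaction)
        normalized[name] = reactions
    return normalized


def order_samples(unperturbed, perturbed, use_unperturbed_as_true=True):
    """Return (p, q), where p is the sample treated as the 'true' distribution."""
    if use_unperturbed_as_true:
        return unperturbed, perturbed
    return perturbed, unperturbed


gene_to_reactions = {"g1": ["R2", "R3"]}  # hypothetical gene-to-reaction lookup
single = normalize_target_networks(["R1", "g1"], gene_to_reactions)
named = normalize_target_networks({"glycolysis": ["g1"], "tca": ["R4"]}, gene_to_reactions)
```

The keys of the normalized dict correspond to the columns of the returned dataframe, with one row per knocked-out gene.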