Divergence Reference

API Documentation for the divergence submodule

Jensen-Shannon Divergence

Functions for calculating the Jensen-Shannon divergence between two sampled distributions

metworkpy.divergence.js_divergence_functions.js_divergence(p: ArrayLike, q: ArrayLike, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, permutation_rng: Generator | int | None = None, permutation_estimation_method: Literal['kernel', 'empirical'] = 'empirical', n_neighbors: int = 5, discrete: bool = False, jitter: float | None = None, jitter_seed: int | None = None, distance_metric: float | str = 'euclidean', clip: bool = False) float | DivergenceResult

Calculate the Jensen-Shannon divergence between two distributions represented by samples p and q

Parameters:
  • p (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If p is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.

  • q (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If q is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.

  • calculate_pvalue (bool, default=False) – Whether the p-value should be calculated using a permutation test

  • alternative ('less', 'greater', or 'two-sided', default='greater') – The alternative hypothesis to use, see metworkpy.utils.permutation.permutation_test

  • permutations (int, default=500) – The number of permutations to use when calculating the p-value

  • permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.

  • permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating p-value, either an empirical cdf, or a gaussian_kde

  • n_neighbors (int) – Number of neighbors to use for the nearest-neighbor divergence estimate. Will attempt to coerce into an integer. Must be at least 1. Default 5.

  • discrete (bool) – Whether the samples are from discrete distributions

  • jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None, no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to p, and the second element is the standard deviation of the noise added to q.

  • jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise

  • distance_metric (Union[str, float]) – Metric to use for computing distance between points in p and q, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.

  • clip (bool, default=False) – Whether or not to clip the divergence values at 0.0

Returns:

The Jensen-Shannon divergence between p and q

Return type:

float

See also

B. C. Ross, “Mutual Information between Discrete and Continuous Data Sets” in PLoS ONE, vol. 9, no. 2, e87357, 2014.
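The interaction between discrete samples, calculate_pvalue, permutations, and alternative='greater' can be illustrated with a dependency-free sketch. It is not metworkpy's implementation: the plug-in estimator, the helper names empirical_js and permutation_pvalue, and the add-one correction in the p-value are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def empirical_js(p_sample, q_sample):
    """Plug-in Jensen-Shannon divergence (in nats) between two discrete samples."""
    p_counts, q_counts = Counter(p_sample), Counter(q_sample)
    n_p, n_q = len(p_sample), len(q_sample)
    js = 0.0
    for value in set(p_counts) | set(q_counts):
        p = p_counts[value] / n_p
        q = q_counts[value] / n_q
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            js += 0.5 * p * math.log(p / m)
        if q > 0:
            js += 0.5 * q * math.log(q / m)
    return js


def permutation_pvalue(p_sample, q_sample, permutations=500, seed=None):
    """One-sided ('greater') permutation p-value for the observed divergence."""
    rng = random.Random(seed)
    observed = empirical_js(p_sample, q_sample)
    pooled = list(p_sample) + list(q_sample)
    n_p = len(p_sample)
    at_least_as_large = 0
    for _ in range(permutations):
        # Shuffling the pooled data breaks any real difference between p and q
        rng.shuffle(pooled)
        if empirical_js(pooled[:n_p], pooled[n_p:]) >= observed:
            at_least_as_large += 1
    # Assumed add-one correction: keeps the estimate away from exactly zero
    return observed, (at_least_as_large + 1) / (permutations + 1)


balanced = ["a", "b"] * 50            # 50% "a", 50% "b"
skewed = ["a"] * 90 + ["b"] * 10      # 90% "a", 10% "b"
js_value, p_value = permutation_pvalue(balanced, skewed, permutations=200, seed=0)
```

With the 'greater' alternative, a small p-value indicates that the observed divergence is larger than most divergences obtained after randomly reassigning the pooled samples to p and q.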

metworkpy.divergence.js_divergence_functions.js_divergence_array(p: DataFrame | ndarray, q: DataFrame | ndarray, axis: int = 1, processes: int = 1, **kwargs) Series | ndarray[Tuple[int], dtype[float32 | float64]] | Tuple[Series | ndarray[Tuple[int], dtype[float32 | float64]], Series | ndarray[Tuple[int], dtype[float32 | float64]]]

Calculate the Jensen-Shannon divergence between arrays along the specified axis.

Parameters:
  • p (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.

  • q (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.

  • axis (int, default=1) – Axis to slice along to get the arrays representing samples from the distributions to calculate the divergence between. For example, axis=1 specifies that the 2-dimensional p and q arrays will be sliced along the columns. The size of p and q along this axis must match.

  • processes (int) – Number of processes to use when calculating the divergence (default 1)

  • kwargs – Keyword arguments are passed to the js_divergence function

Returns:

Array with length equal to the size of p and q along the specified axis, the ith value representing the divergence between the ith slice of p and the ith slice of q along that axis. If both p and q are numpy ndarrays, this returns an ndarray with shape (ncols,). If either p or q is a pandas DataFrame, this returns a pandas Series with the same index as the columns of the DataFrame (p takes priority if the column names differ).

Return type:

Array1D or Tuple of Array1D, Array1D

Notes

If either p or q is a pandas DataFrame, both must be, and their indices along the specified axis must be the same.
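The column-wise pairing described for axis=1 can be sketched without any array library. The helper divergence_by_column and the stand-in statistic mean_shift are hypothetical and only show how the ith column of p is paired with the ith column of q; the sketch also assumes that the number of samples (rows) may differ between p and q, since only the size along the sliced axis is required to match.

```python
def divergence_by_column(p_rows, q_rows, divergence):
    """Apply `divergence` to matching columns of two row-major tables."""
    p_columns = list(zip(*p_rows))  # transpose: one tuple per column
    q_columns = list(zip(*q_rows))
    if len(p_columns) != len(q_columns):
        raise ValueError("p and q must have the same number of columns")
    return [divergence(p_col, q_col) for p_col, q_col in zip(p_columns, q_columns)]


def mean_shift(p_col, q_col):
    """Stand-in statistic (NOT a divergence): absolute difference of the means."""
    return abs(sum(p_col) / len(p_col) - sum(q_col) / len(q_col))


# 3 samples of 2 features in p, 4 samples of the same 2 features in q
p = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
q = [[1.0, 40.0], [2.0, 50.0], [3.0, 60.0], [2.0, 50.0]]
result = divergence_by_column(p, q, mean_shift)  # one value per column
```

The result has one entry per column, which mirrors the (ncols,) shape described for the returned array.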

Kullback-Leibler Divergence

Function for calculating the Kullback-Leibler divergence between two probability distributions based on samples from those distributions.

metworkpy.divergence.kl_divergence_functions.kl_divergence(p: ArrayLike, q: ArrayLike, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, permutation_rng: Generator | int | None = None, permutation_estimation_method: Literal['kernel', 'empirical'] = 'empirical', n_neighbors: int = 5, discrete: bool = False, jitter: float | None = None, jitter_seed: int | None = None, distance_metric: float | str = 'euclidean', clip: bool = False) float | DivergenceResult

Calculate the Kullback-Leibler divergence between two distributions represented by samples p and q

Parameters:
  • p (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If p is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.

  • q (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If q is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.

  • calculate_pvalue (bool, default=False) – Whether the p-value should be calculated using a permutation test

  • alternative ('less', 'greater', or 'two-sided', default='greater') – The alternative to use, see metworkpy.utils.permutation.permutation_test

  • permutations (int, default=500) – The number of permutations to use when calculating the p-value

  • permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.

  • permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating p-value, either an empirical cdf, or a gaussian_kde

  • n_neighbors (int) – Number of neighbors to use for the nearest-neighbor divergence estimate. Will attempt to coerce into an integer. Must be at least 1. Default 5.

  • discrete (bool) – Whether the samples are from discrete distributions

  • jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None, no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to p, and the second element is the standard deviation of the noise added to q.

  • jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise

  • distance_metric (Union[str, float]) – Metric to use for computing distance between points in p and q, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.

  • clip (bool, default=False) – Whether or not to clip the divergence values at 0.0

Returns:

The Kullback-Leibler divergence between p and q

Return type:

float

Notes

  • This function is not symmetric: p is treated as the ‘true’ distribution, and q as an approximating distribution. If you want a symmetric measure, try the Jensen-Shannon divergence.
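The asymmetry can be checked on a small example with known probabilities. This sketch uses the closed-form definitions on probability vectors rather than the sample-based estimator documented here; the helper names kl and js are illustrative only.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p = [0.9, 0.1]
q = [0.5, 0.5]
kl_pq = kl(p, q)  # p as the 'true' distribution, roughly 0.368 nats
kl_qp = kl(q, p)  # roles swapped, roughly 0.511 nats
```

Swapping the arguments changes the Kullback-Leibler value, while js(p, q) and js(q, p) agree.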

See also

1. Q. Wang, S. R. Kulkarni and S. Verdu, “Divergence Estimation for Multidimensional Densities Via k-Nearest-Neighbor Distances” in IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2392-2405, May 2009, doi: 10.1109/TIT.2009.2016060.

Method for estimating the mutual information between samples from two continuous distributions based on nearest-neighbor distances.
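A minimal sketch of a k-nearest-neighbor Kullback-Leibler estimator in the spirit of the Wang, Kulkarni and Verdu reference above, restricted to one-dimensional samples and written in plain Python. With rho_k(i) the distance from the ith point of p to its kth nearest neighbor among the other points of p, and nu_k(i) the distance to its kth nearest neighbor in q, the estimate is (d/n) * sum(log(nu_k(i) / rho_k(i))) + log(m / (n - 1)), where n and m are the sample sizes and d is the dimension. Whether metworkpy uses exactly this form is an assumption, and knn_kl_1d is a hypothetical name.

```python
import math
import random


def knn_kl_1d(p_sample, q_sample, n_neighbors=5):
    """k-nearest-neighbor estimate of KL(P || Q) in nats for 1-D samples."""
    n, m = len(p_sample), len(q_sample)
    if n <= n_neighbors or m < n_neighbors:
        raise ValueError("samples are too small for the requested n_neighbors")
    total = 0.0
    for i, x in enumerate(p_sample):
        # Distance from x to its k-th nearest neighbor among the OTHER p points
        within = sorted(abs(x - other) for j, other in enumerate(p_sample) if j != i)
        rho = within[n_neighbors - 1]
        # Distance from x to its k-th nearest neighbor among the q points
        across = sorted(abs(x - y) for y in q_sample)
        nu = across[n_neighbors - 1]
        total += math.log(nu / rho)
    # The dimension factor d equals 1 for one-dimensional samples
    return total / n + math.log(m / (n - 1))


rng = random.Random(0)
p = [rng.gauss(0.0, 1.0) for _ in range(300)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(300)]     # same distribution as p
q_shifted = [rng.gauss(3.0, 1.0) for _ in range(300)]  # mean shifted by 3

estimate_same = knn_kl_1d(p, q_same)
estimate_shifted = knn_kl_1d(p, q_shifted)
```

Because this is an estimate from finite samples, it can come out slightly negative when the two distributions are close, even though the true divergence is never negative. A clipping option such as clip addresses that case by flooring the value at 0.0.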

metworkpy.divergence.kl_divergence_functions.kl_divergence_array(p: DataFrame | ndarray, q: DataFrame | ndarray, axis: int = 1, processes: int = 1, **kwargs) Series | ndarray[Tuple[int], dtype[float32 | float64]] | Tuple[Series | ndarray[Tuple[int], dtype[float32 | float64]], Series | ndarray[Tuple[int], dtype[float32 | float64]]]

Calculate the Kullback-Leibler divergence between two arrays along the specified axis.

Parameters:
  • p (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.

  • q (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.

  • axis (int, default=1) – Axis to slice along to get the arrays representing samples from the distributions to calculate the divergence between. For example, axis=1 specifies that the 2-dimensional p and q arrays will be sliced along the columns. The size of p and q along this axis must match.

  • processes (int) – Number of processes to use when calculating the divergence (default 1)

  • kwargs – Keyword arguments are passed to the kl_divergence function

Returns:

Array with length equal to the size of p and q along the specified axis, the ith value representing the divergence between the ith slice of p and the ith slice of q along that axis. If both p and q are numpy ndarrays, this returns an ndarray with shape (ncols,). If either p or q is a pandas DataFrame, this returns a pandas Series with the same index as the columns of the DataFrame (p takes priority if the column names differ).

Return type:

Array1D or Tuple of Array1D, Array1D

Notes

If either p or q is a pandas DataFrame, both must be, and their indices along the specified axis must be the same.

Group Divergence

Submodule containing functions which will calculate the divergence between two dataframes for groups of columns

metworkpy.divergence.group_divergence.calculate_divergence_grouped(dataset1: DataFrame, dataset2: DataFrame, divergence_groups: dict[str, list[Hashable]], divergence_type: Literal['kl', 'js'] = 'kl', calculate_pvalue: bool = False, processes: int = 1, **kwargs) Series | Tuple[Series, Series]

Calculate the divergence between data in two dataframes for a set of groups of columns

Parameters:
  • dataset1 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features

  • dataset2 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features

  • divergence_groups (dict of str to list of Hashable) – The groups to calculate divergence for, indexed by name of the group, with values of lists of features that belong to the group (the feature names must match names of columns in the dataframes)

  • divergence_type ('kl' or 'js') – The type of divergence to calculate, either kl for Kullback-Leibler (default) or js for Jensen-Shannon

  • processes (int) – The number of processes to use for the calculation

  • kwargs – Keyword arguments passed into the divergence method function either kl_divergence or js_divergence depending on divergence_type

Returns:

divergence – A pandas series indexed by group name, with values representing the divergence of that group between the two dataframes. If calculate_pvalue is True, then instead returns a tuple, of two pandas Series, the first being the divergence results, and the second being the p-values.

Return type:

pd.Series or tuple of pd.Series,pd.Series

Notes

The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
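The shape of divergence_groups and of the grouped result can be sketched in plain Python. The datasets here are dictionaries of column name to values rather than DataFrames, the group and feature names are made up, and centroid_distance is a stand-in statistic rather than a real divergence; in the documented function the per-group value would come from kl_divergence or js_divergence.

```python
import math


def group_rows(dataset, features):
    """Rows of `dataset` restricted to `features`: (n_samples, n_features)."""
    return list(zip(*(dataset[feature] for feature in features)))


def centroid_distance(p_rows, q_rows):
    """Stand-in statistic (NOT a divergence): distance between group centroids."""
    p_center = [sum(col) / len(col) for col in zip(*p_rows)]
    q_center = [sum(col) / len(col) for col in zip(*q_rows)]
    return math.dist(p_center, q_center)


def grouped_statistic(dataset1, dataset2, divergence_groups, statistic):
    """One value per group, computed on that group's columns in both datasets."""
    return {
        name: statistic(group_rows(dataset1, features), group_rows(dataset2, features))
        for name, features in divergence_groups.items()
    }


# Hypothetical feature and group names, two samples per dataset
dataset1 = {"r1": [0.0, 2.0], "r2": [1.0, 1.0], "r3": [5.0, 7.0]}
dataset2 = {"r1": [3.0, 5.0], "r2": [5.0, 5.0], "r3": [6.0, 6.0]}
groups = {"glycolysis": ["r1", "r2"], "tca": ["r3"]}
result = grouped_statistic(dataset1, dataset2, groups, centroid_distance)
```

Each group is treated as a single multi-dimensional sample, so a group with two features yields rows with two entries, matching the (n_samples, n_dimensions) shape expected by the divergence functions.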

metworkpy.divergence.group_divergence.calculate_reaction_neighborhood_divergence(model: Model, dataset1: DataFrame, dataset2: DataFrame, divergence_type: Literal['kl', 'js'] = 'kl', directed: bool = False, nodes_to_remove: list[str] | None = None, radius: int = 2, calculate_pvalue: bool = False, processes: int = 1, **kwargs)

Calculate the divergence between data in two dataframes for the reaction neighborhoods of a model

Parameters:
  • dataset1 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features

  • dataset2 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features

  • divergence_groups (dict of str to list of Hashable) – The groups to calculate divergence for, indexed by name of the group, with values of lists of features that belong to the group (the feature names must match names of columns in the dataframes)

  • divergence_type ('kl' or 'js') – The type of divergence to calculate, either kl for Kullback-Leibler (default) or js for Jensen-Shannon

  • directed (bool) – Whether the reaction network created to find reaction neighborhoods should be directed or not

  • nodes_to_remove (list[str] | None) – List of any metabolites or reactions that should be removed from the final network. This can be used to remove metabolites that participate in a large number of reactions, but are not desired in downstream analysis such as water, or ATP, or pseudo reactions like biomass. Each metabolite/reaction should be the string ID associated with them in the cobra model.

  • radius (int) – The radius determining the sizes of the neighborhoods

  • calculate_pvalue (bool) – Whether to calculate the p-value of the divergence difference using permutation testing

  • processes (int) – The number of processes to use for the calculation

  • kwargs – Keyword arguments passed into the divergence method function either kl_divergence or js_divergence depending on divergence_type

Returns:

divergence – A pandas series indexed by group name, with values representing the divergence of that group between the two dataframes. If calculate_pvalue is True, then instead returns a tuple, of two pandas Series, the first being the divergence results, and the second being the p-values.

Return type:

pd.Series or tuple of pd.Series,pd.Series

Notes

The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
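How radius and nodes_to_remove shape a neighborhood can be sketched with a breadth-first search over a toy network. The adjacency dictionary, the reaction IDs, and the choice to include each reaction in its own neighborhood are assumptions for illustration; how metworkpy builds the network from the cobra model and measures distance is not shown here.

```python
from collections import deque


def neighborhoods(adjacency, radius=2, nodes_to_remove=None):
    """Map each node to the sorted nodes reachable within `radius` steps."""
    removed = set(nodes_to_remove or [])
    result = {}
    for start in adjacency:
        if start in removed:
            continue
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == radius:
                continue  # do not expand beyond the requested radius
            for neighbor in adjacency.get(node, ()):
                if neighbor not in removed and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
        result[start] = sorted(seen)
    return result


# Hypothetical linear pathway: R1 - R2 - R3 - R4
chain = {"R1": ["R2"], "R2": ["R1", "R3"], "R3": ["R2", "R4"], "R4": ["R3"]}
radius_one = neighborhoods(chain, radius=1)
radius_two = neighborhoods(chain, radius=2)
without_r2 = neighborhoods(chain, radius=2, nodes_to_remove=["R2"])
```

Removing a highly connected node before the search disconnects the neighborhoods that would otherwise be joined through it, which is the motivation given for nodes_to_remove.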

Knockout Divergence

Determine the divergence in the network caused by a gene knock out

metworkpy.divergence.ko_divergence_functions.ko_divergence(model: Model, target_networks: list[str] | dict[str, list[str]], genes_to_ko: Iterable[str] | None = None, divergence_type: Literal['js', 'kl'] = 'kl', calculate_pvalue: bool = False, sample_count: int = 1000, progress_bar: bool = False, use_unperturbed_as_true: bool = True, sampler_seed: Generator | int | None = None, sampler_kwargs: dict[str, Any] | None = None, processes: int = 1, **kwargs) DataFrame | Tuple[DataFrame, DataFrame]

Determine the impacts of gene knock-outs on different target reaction or gene networks

Parameters:
  • model (cobra.Model) – Base cobra model to test effects of gene knockouts on

  • target_networks (list[str] | dict[str, list[str]]) – Target networks to investigate the impact of the gene knock-outs on. Can be a list or a dict of lists. If a dict, the keys name the networks and the lists specify their members. If a list, it is treated as a single network. Entries in the lists can be either reaction or gene ids; gene ids are translated into reaction ids using the model. If a list is passed, the target network is named target_network in the returned dataframe; if a dict is passed, the keys are used as the column names.

  • genes_to_ko (Iterable[str], optional) – List of genes to investigate impact of their knock-out, defaults to all genes in the model

  • divergence_type ('kl' or 'js', default='kl') – Which metric to use for divergence, can be ‘kl’ for Kullback-Leibler (default) or ‘js’ for Jensen-Shannon

  • calculate_pvalue (bool, default=False) – Whether to calculate the significance value for the divergence

  • sample_count (int) – The number of samples to take in order to estimate the divergence

  • progress_bar (bool) – Whether a progress bar is desired

  • use_unperturbed_as_true (bool, default=True) – Which distribution to use as the “True” distribution (the P distribution) when estimating divergence between the perturbed (that is, the model with a gene knock-out) and the unperturbed (model prior to the gene knock-out) flux samples. Doesn’t impact Jensen-Shannon as that is symmetric, but will modify the Kullback-Leibler divergence.

  • sampler_seed (None or int or np.Generator, optional) – Seed used for sampling in order to create reproducible results, can be a numpy generator (in which case it is used directly), or an integer (in which case it is used to seed a numpy generator).

  • sampler_kwargs (dict of str to Any) – Arguments passed to the sample method of COBRApy, see COBRApy Documentation

  • processes (int, default=1) – Number of processes to use for this function, passed to the sampler and also used as the number of processes for calculating the divergence for the different groups. Note that if you want a different number of processes for the sampler, you can use the sampler_kwargs dictionary.

  • **kwargs – Keyword arguments passed to the divergence method

Returns:

Dataframe with index of genes, and columns representing the different target networks. Values represent the divergence of a particular target network between the unperturbed model and the model following the gene knock-out.

Return type:

pd.DataFrame
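Two of the conventions described above, the naming of target_networks and the role of use_unperturbed_as_true, can be sketched as small helpers. Both function names and the example IDs are hypothetical, and the translation of gene ids into reaction ids is not shown.

```python
def normalize_target_networks(target_networks):
    """Return a dict of network name to list of ids for either accepted form."""
    if isinstance(target_networks, dict):
        return {name: list(ids) for name, ids in target_networks.items()}
    # A bare list is a single network with the default column name
    return {"target_network": list(target_networks)}


def order_samples(unperturbed, perturbed, use_unperturbed_as_true=True):
    """Return (p, q), where p plays the role of the 'true' distribution."""
    if use_unperturbed_as_true:
        return unperturbed, perturbed
    return perturbed, unperturbed


from_list = normalize_target_networks(["PFK", "PGI"])
from_dict = normalize_target_networks({"upper": ["PFK"], "lower": ["PYK"]})
p, q = order_samples("wild-type samples", "knock-out samples")
p_swapped, q_swapped = order_samples(
    "wild-type samples", "knock-out samples", use_unperturbed_as_true=False
)
```

The ordering only matters for the Kullback-Leibler divergence, since the Jensen-Shannon divergence gives the same value either way.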