Divergence Reference
API Documentation for the divergence submodule
Jensen-Shannon Divergence
Functions for calculating the Jensen-Shannon divergence between two sampled distributions
- metworkpy.divergence.js_divergence_functions.js_divergence(p: ArrayLike, q: ArrayLike, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, permutation_rng: Generator | int | None = None, permutation_estimation_method: Literal['kernel', 'empirical'] = 'empirical', n_neighbors: int = 5, discrete: bool = False, jitter: float | None = None, jitter_seed: int | None = None, distance_metric: float | str = 'euclidean', clip: bool = False) float | DivergenceResult
Calculate the Jensen-Shannon divergence between two distributions represented by samples p and q
- Parameters:
p (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If p is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
q (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If q is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
calculate_pvalue (bool, default=False) – Whether the p-value should be calculated using a permutation test
alternative ('less', 'greater', or 'two-sided', default='greater') – The alternative hypothesis to use, see metworkpy.utils.permutation.permutation_test
permutations (int, default=500) – The number of permutations to use when calculating the p-value
permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.
permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating p-value, either an empirical cdf, or a gaussian_kde
n_neighbors (int) – Number of neighbors to use for the nearest-neighbor estimate of the divergence. Will attempt to coerce into an integer. Must be at least 1. Default 5.
discrete (bool) – Whether the samples are from discrete distributions
jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None, no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to the first array (p), and the second element is the standard deviation of the noise added to the second array (q).
jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise
distance_metric (Union[str, float]) – Metric to use for computing distance between points in p and q, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.
clip (bool, default=False) – Whether or not to clip the divergence values at 0.0
- Returns:
The Jensen-Shannon divergence between p and q
- Return type:
float
See also
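1. B. C. Ross, “Mutual Information between Discrete and Continuous Data Sets” in PLoS ONE, vol. 9, no. 2, e87357, 2014, doi: 10.1371/journal.pone.0087357.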
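The quantities described above can be illustrated with a minimal, standard-library sketch. It is not metworkpy's implementation: it uses a simple plug-in (frequency count) estimate for discrete samples, and the helper names js_divergence_discrete and permutation_pvalue, as well as the add-one correction in the p-value, are assumptions made for illustration. It shows what a Jensen-Shannon divergence measures and how a permutation test with the 'greater' alternative can attach an empirical p-value to it.

```python
import math
import random
from collections import Counter


def js_divergence_discrete(p_samples, q_samples):
    """Plug-in Jensen-Shannon divergence (in nats) between two discrete samples."""
    p_counts, q_counts = Counter(p_samples), Counter(q_samples)
    n_p, n_q = len(p_samples), len(q_samples)
    total = 0.0
    for value in set(p_counts) | set(q_counts):
        p = p_counts[value] / n_p
        q = q_counts[value] / n_q
        m = 0.5 * (p + q)  # mixture distribution M = (P + Q) / 2
        if p > 0:
            total += 0.5 * p * math.log(p / m)
        if q > 0:
            total += 0.5 * q * math.log(q / m)
    return total


def permutation_pvalue(p_samples, q_samples, permutations=500, seed=None):
    """Empirical p-value for the 'greater' alternative (illustrative only)."""
    rng = random.Random(seed)
    observed = js_divergence_discrete(p_samples, q_samples)
    pooled = list(p_samples) + list(q_samples)
    n_p = len(p_samples)
    at_least_as_large = 0
    for _ in range(permutations):
        # Shuffle the pooled samples and split them into two pseudo-samples
        rng.shuffle(pooled)
        permuted = js_divergence_discrete(pooled[:n_p], pooled[n_p:])
        if permuted >= observed:
            at_least_as_large += 1
    # Assumption: add-one correction so the estimate is never exactly zero
    return observed, (at_least_as_large + 1) / (permutations + 1)


p = ["a"] * 40 + ["b"] * 10
q = ["a"] * 10 + ["b"] * 40
divergence, pvalue = permutation_pvalue(p, q, permutations=200, seed=0)
print(divergence, pvalue)
```

In this sketch the divergence is symmetric in its arguments and bounded between 0 and ln 2, and shuffling the pooled samples approximates the distribution of the divergence under the hypothesis that p and q come from the same distribution.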
- metworkpy.divergence.js_divergence_functions.js_divergence_array(p: DataFrame | ndarray, q: DataFrame | ndarray, axis: int = 1, processes: int = 1, **kwargs) Series | ndarray[Tuple[int], dtype[float32 | float64]] | Tuple[Series | ndarray[Tuple[int], dtype[float32 | float64]], Series | ndarray[Tuple[int], dtype[float32 | float64]]]
Calculate the Jensen-Shannon divergence between arrays along the specified axis.
- Parameters:
p (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.
q (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.
axis (int, default=1) – Axis to slice along to get the arrays representing samples from the distributions to calculate the divergence between. For example, axis=1 specifies that 2-dimensional p and q arrays will be sliced along the columns. The size of p and q along this axis must match.
processes (int) – Number of processes to use when calculating the divergence (default 1)
kwargs – Keyword arguments are passed to the js_divergence function
- Returns:
Array with length equal to the shape along the axis in p and q, the ith value representing the divergence between the ith slice along the specified axis of p and q. If both p and q are numpy ndarrays, this returns an ndarray with shape (ncols,). If either p or q is a pandas DataFrame, then returns a pandas Series with the same index as the columns in the DataFrame (p takes priority if the column names differ).
- Return type:
Array1D or Tuple of Array1D, Array1D
Notes
If either p or q is a pandas DataFrame, both must be, and their indices along the specified axis must be the same.
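The column-wise slicing described for axis=1 can be sketched with plain Python lists. This is only an illustration of the slicing pattern: toy_divergence is a hypothetical stand-in (absolute difference of means), not a real divergence, and divergence_by_column is not part of metworkpy.

```python
from statistics import mean


def toy_divergence(p_column, q_column):
    """Hypothetical stand-in for a divergence: absolute difference of means."""
    return abs(mean(p_column) - mean(q_column))


def divergence_by_column(p_rows, q_rows, divergence=toy_divergence):
    """Apply `divergence` to each pair of matching columns of two row-major tables."""
    p_columns = list(zip(*p_rows))  # transpose rows into columns
    q_columns = list(zip(*q_rows))
    if len(p_columns) != len(q_columns):
        raise ValueError("p and q must have the same number of columns")
    return [divergence(pc, qc) for pc, qc in zip(p_columns, q_columns)]


# Rows are samples, columns are features; in this sketch the number of
# rows may differ between the tables, but the number of columns may not.
p_rows = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
q_rows = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
result = divergence_by_column(p_rows, q_rows)
print(result)  # [0.5, 4.0]
```

The result has one entry per column, matching the (ncols,) shape described in the Returns section above.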
Kullback-Leibler Divergence
Function for calculating the Kullback-Leibler divergence between two probability distributions based on samples from those distributions.
- metworkpy.divergence.kl_divergence_functions.kl_divergence(p: ArrayLike, q: ArrayLike, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, permutation_rng: Generator | int | None = None, permutation_estimation_method: Literal['kernel', 'empirical'] = 'empirical', n_neighbors: int = 5, discrete: bool = False, jitter: float | None = None, jitter_seed: int | None = None, distance_metric: float | str = 'euclidean', clip: bool = False) float | DivergenceResult
Calculate the Kullback-Leibler divergence between two distributions represented by samples p and q
- Parameters:
p (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If p is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
q (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If q is one dimensional, it will be reshaped to (n_samples,1). If it is not a np.ndarray, this function will attempt to coerce it into one.
calculate_pvalue (bool, default=False) – Whether the p-value should be calculated using a permutation test
alternative ('less', 'greater', or 'two-sided', default='greater') – The alternative hypothesis to use, see metworkpy.utils.permutation.permutation_test
permutations (int, default=500) – The number of permutations to use when calculating the p-value
permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.
permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating p-value, either an empirical cdf, or a gaussian_kde
n_neighbors (int) – Number of neighbors to use for the nearest-neighbor estimate of the divergence. Will attempt to coerce into an integer. Must be at least 1. Default 5.
discrete (bool) – Whether the samples are from discrete distributions
jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None, no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to the first array (p), and the second element is the standard deviation of the noise added to the second array (q).
jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise
distance_metric (Union[str, float]) – Metric to use for computing distance between points in p and q, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.
clip (bool, default=False) – Whether or not to clip the divergence values at 0.0
- Returns:
The Kullback-Leibler divergence between p and q
- Return type:
float
Notes
This function is not symmetric: p is treated as the ‘true’ distribution, and q as an approximating distribution. If you want a symmetric measure, try the Jensen-Shannon divergence.
See also
1. Q. Wang, S. R. Kulkarni and S. Verdu, “Divergence Estimation for Multidimensional Densities Via k-Nearest-Neighbor Distances” in IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2392-2405, May 2009, doi: 10.1109/TIT.2009.2016060.
Method for estimating the divergence between two continuous distributions from samples, based on k-nearest-neighbor distances.
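The k-nearest-neighbor estimator from the reference above can be sketched for one-dimensional samples using only the standard library. This is a simplified illustration of the published formula, (d/n) Σ log(ν_k(i)/ρ_k(i)) + log(m/(n−1)), and not necessarily what metworkpy does internally. Here ρ_k(i) is the distance from the ith sample of p to its kth nearest neighbor among the other samples of p, and ν_k(i) is its distance to the kth nearest sample of q. The function names are hypothetical.

```python
import math
import random


def kth_smallest_distance(x, points, k, skip_index=None):
    """Distance from x to its k-th nearest point, optionally skipping one index."""
    distances = sorted(
        abs(x - y) for j, y in enumerate(points) if j != skip_index
    )
    return distances[k - 1]


def knn_kl_divergence_1d(p_samples, q_samples, k=5):
    """k-NN estimate of KL(P || Q) for 1-D samples (d = 1), in nats."""
    n, m = len(p_samples), len(q_samples)
    total = 0.0
    for i, x in enumerate(p_samples):
        # rho: k-th neighbor distance within p, excluding the point itself
        rho = kth_smallest_distance(x, p_samples, k, skip_index=i)
        # nu: k-th neighbor distance from the same point to the q samples
        nu = kth_smallest_distance(x, q_samples, k)
        total += math.log(nu / rho)
    return total / n + math.log(m / (n - 1))


rng = random.Random(0)
p = [rng.gauss(0.0, 1.0) for _ in range(400)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(400)]
q_shifted = [rng.gauss(3.0, 1.0) for _ in range(400)]

kl_same = knn_kl_divergence_1d(p, q_same)        # close to zero
kl_shifted = knn_kl_divergence_1d(p, q_shifted)  # clearly positive
print(kl_same, kl_shifted)
```

Because this is an estimate from finite samples, the value for two samples from the same distribution can come out slightly negative, which is the situation the clip parameter addresses. Tied values would give a zero distance and break the logarithm, which is the situation the jitter parameter addresses.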
- metworkpy.divergence.kl_divergence_functions.kl_divergence_array(p: DataFrame | ndarray, q: DataFrame | ndarray, axis: int = 1, processes: int = 1, **kwargs) Series | ndarray[Tuple[int], dtype[float32 | float64]] | Tuple[Series | ndarray[Tuple[int], dtype[float32 | float64]], Series | ndarray[Tuple[int], dtype[float32 | float64]]]
Calculate the Kullback-Leibler divergence between two arrays along the specified axis.
- Parameters:
p (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.
q (ArrayInput) – Sample array, where slices along the specified axis represent the distributions to calculate the divergence between.
axis (int, default=1) – Axis to slice along to get the arrays representing samples from the distributions to calculate the divergence between. For example, axis=1 specifies that 2-dimensional p and q arrays will be sliced along the columns. The size of p and q along this axis must match.
processes (int) – Number of processes to use when calculating the divergence (default 1)
kwargs – Keyword arguments are passed to the kl_divergence function
- Returns:
Array with length equal to the shape along the axis in p and q, the ith value representing the divergence between the ith slice along the specified axis of p and q. If both p and q are numpy ndarrays, this returns an ndarray with shape (ncols,). If either p or q is a pandas DataFrame, then returns a pandas Series with the same index as the columns in the DataFrame (p takes priority if the column names differ).
- Return type:
Array1D or Tuple of Array1D, Array1D
Notes
If either p or q is a pandas DataFrame, both must be, and their indices along the specified axis must be the same.
Group Divergence
Submodule containing functions which will calculate the divergence between two dataframes for groups of columns
- metworkpy.divergence.group_divergence.calculate_divergence_grouped(dataset1: DataFrame, dataset2: DataFrame, divergence_groups: dict[str, list[Hashable]], divergence_type: Literal['kl', 'js'] = 'kl', calculate_pvalue: bool = False, processes: int = 1, **kwargs) Series | Tuple[Series, Series]
Calculate the divergence between data in two dataframes for a set of groups of columns
- Parameters:
dataset1 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
dataset2 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
divergence_groups (dict of str to list of Hashable) – The groups to calculate divergence for, indexed by name of the group, with values of lists of features that belong to the group (the feature names must match names of columns in the dataframes)
divergence_type ('kl' or 'js') – The type of divergence to calculate, either kl for Kullback-Leibler (default) or js for Jensen-Shannon
processes (int) – The number of processes to use for the calculation
kwargs – Keyword arguments passed into the divergence method function either kl_divergence or js_divergence depending on divergence_type
- Returns:
divergence – A pandas series indexed by group name, with values representing the divergence of that group between the two dataframes. If calculate_pvalue is True, then instead returns a tuple, of two pandas Series, the first being the divergence results, and the second being the p-values.
- Return type:
pd.Series or tuple of pd.Series,pd.Series
Notes
The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
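The grouping logic described above can be sketched without pandas. This is an illustration of the data flow only: toy_group_divergence is a hypothetical stand-in (Euclidean distance between column means) rather than a KL or JS estimate, and the feature and group names are invented for the example.

```python
import math
from statistics import mean


def toy_group_divergence(p_rows, q_rows):
    """Hypothetical stand-in: Euclidean distance between the column means."""
    p_means = [mean(column) for column in zip(*p_rows)]
    q_means = [mean(column) for column in zip(*q_rows)]
    return math.dist(p_means, q_means)


def divergence_by_group(dataset1, dataset2, groups, divergence=toy_group_divergence):
    """dataset1/dataset2: lists of {feature: value} rows; groups: {name: [features]}."""
    results = {}
    for name, features in groups.items():
        # Select only the columns belonging to this group from each dataset
        p_rows = [[row[f] for f in features] for row in dataset1]
        q_rows = [[row[f] for f in features] for row in dataset2]
        results[name] = divergence(p_rows, q_rows)
    return results


dataset1 = [
    {"r1": 0.0, "r2": 0.0, "r3": 1.0},
    {"r1": 2.0, "r2": 0.0, "r3": 1.0},
]
dataset2 = [
    {"r1": 4.0, "r2": 4.0, "r3": 1.0},
    {"r1": 4.0, "r2": 4.0, "r3": 1.0},
]
groups = {"group_a": ["r1", "r2"], "group_b": ["r3"]}
results = divergence_by_group(dataset1, dataset2, groups)
print(results)  # {'group_a': 5.0, 'group_b': 0.0}
```

Each group yields one value computed jointly over all of its features, so the output is indexed by group name, as in the Returns description above.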
- metworkpy.divergence.group_divergence.calculate_reaction_neighborhood_divergence(model: Model, dataset1: DataFrame, dataset2: DataFrame, divergence_type: Literal['kl', 'js'] = 'kl', directed: bool = False, nodes_to_remove: list[str] | None = None, radius: int = 2, calculate_pvalue: bool = False, processes: int = 1, **kwargs)
Calculate the divergence between data in two dataframes for reaction neighborhoods in a model's reaction network
- Parameters:
model (cobra.Model) – The cobra model used to build the reaction network from which the reaction neighborhoods are found
dataset1 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
dataset2 (pd.DataFrame) – Datasets to calculate the divergence between, rows should represent different samples, and columns should represent different features
divergence_type ('kl' or 'js') – The type of divergence to calculate, either kl for Kullback-Leibler (default) or js for Jensen-Shannon
directed (bool) – Whether the reaction network created to find reaction neighborhoods should be directed or not
nodes_to_remove (list[str] | None) – List of any metabolites or reactions that should be removed from the final network. This can be used to remove metabolites that participate in a large number of reactions, but are not desired in downstream analysis such as water, or ATP, or pseudo reactions like biomass. Each metabolite/reaction should be the string ID associated with them in the cobra model.
radius (int) – The radius determining the sizes of the neighborhoods
calculate_pvalue (bool) – Whether to calculate the p-value of the divergence difference using permutation testing
processes (int) – The number of processes to use for the calculation
kwargs – Keyword arguments passed into the divergence method function either kl_divergence or js_divergence depending on divergence_type
- Returns:
divergence – A pandas series indexed by group name, with values representing the divergence of that group between the two dataframes. If calculate_pvalue is True, then instead returns a tuple, of two pandas Series, the first being the divergence results, and the second being the p-values.
- Return type:
pd.Series or tuple of pd.Series,pd.Series
Notes
The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
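The roles of radius and nodes_to_remove can be illustrated with a small breadth-first search. The sketch rests on an assumption that is not confirmed by this reference: two reactions are treated as adjacent when they share a metabolite, and a neighborhood is every reaction within radius steps of a starting reaction. The function reaction_neighborhoods and the toy network are hypothetical, and only the undirected case is shown.

```python
from collections import deque


def reaction_neighborhoods(reaction_metabolites, radius=2, nodes_to_remove=None):
    """Map each reaction to the sorted reactions within `radius` steps of it."""
    removed = set(nodes_to_remove or [])
    # Drop removed reactions and removed metabolites before building the graph
    kept = {
        rxn: {met for met in mets if met not in removed}
        for rxn, mets in reaction_metabolites.items()
        if rxn not in removed
    }
    # Assumption: reactions are adjacent when they share at least one metabolite
    adjacency = {rxn: set() for rxn in kept}
    for rxn_a, mets_a in kept.items():
        for rxn_b, mets_b in kept.items():
            if rxn_a != rxn_b and mets_a & mets_b:
                adjacency[rxn_a].add(rxn_b)

    neighborhoods = {}
    for start in kept:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if distance[current] == radius:
                continue  # do not expand beyond the radius
            for neighbor in adjacency[current]:
                if neighbor not in distance:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
        neighborhoods[start] = sorted(distance)
    return neighborhoods


toy_network = {
    "R1": {"A", "B", "h2o"},
    "R2": {"B", "C", "h2o"},
    "R3": {"C", "D", "h2o"},
    "R4": {"D", "E", "h2o"},
}
with_water = reaction_neighborhoods(toy_network, radius=1)
without_water = reaction_neighborhoods(toy_network, radius=1, nodes_to_remove=["h2o"])
print(with_water["R1"])     # ['R1', 'R2', 'R3', 'R4']
print(without_water["R1"])  # ['R1', 'R2']
```

With the shared metabolite left in, every reaction is a direct neighbor of every other, so the neighborhoods carry no local information. Removing it restores the chain structure, which is the motivation given above for nodes_to_remove. The resulting dictionary has the same shape as the divergence_groups argument of calculate_divergence_grouped.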
Knockout Divergence
Determine the divergence in the network caused by a gene knock out
- metworkpy.divergence.ko_divergence_functions.ko_divergence(model: Model, target_networks: list[str] | dict[str, list[str]], genes_to_ko: Iterable[str] | None = None, divergence_type: Literal['js', 'kl'] = 'kl', calculate_pvalue: bool = False, sample_count: int = 1000, progress_bar: bool = False, use_unperturbed_as_true: bool = True, sampler_seed: Generator | int | None = None, sampler_kwargs: dict[str, Any] | None = None, processes: int = 1, **kwargs) DataFrame | Tuple[DataFrame, DataFrame]
Determine the impacts of gene knock-outs on different target reaction or gene networks
- Parameters:
model (cobra.Model) – Base cobra model to test effects of gene knockouts on
target_networks (list[str] | dict[str, list[str]]) – Target networks to investigate the impact of the gene knock-outs on. Can be a list or a dict of lists. If a dict, the keys are used to name the networks and the lists specify the networks. If a list, it is treated as a single network. Entries in the lists can be either reaction or gene ids. Gene ids will be translated into reaction ids using the model. If a list is passed, the name of the target network in the returned dataframe will be target_network; if a dict is passed, the keys are used as the column names.
genes_to_ko (Iterable[str], optional) – List of genes to investigate impact of their knock-out, defaults to all genes in the model
divergence_type ('kl' or 'js', default='kl') – Which metric to use for divergence, can be ‘kl’ for Kullback-Leibler (default) or ‘js’ for Jensen-Shannon
calculate_pvalue (bool, default=False) – Whether to calculate the significance value for the divergence
sample_count (int) – The number of samples to take in order to estimate the divergence
progress_bar (bool) – Whether a progress bar is desired
use_unperturbed_as_true (bool, default=True) – Which distribution to use as the “True” distribution (the P distribution) when estimating divergence between the perturbed (that is, the model with a gene knock-out) and the unperturbed (model prior to the gene knock-out) flux samples. Doesn’t impact Jensen-Shannon as that is symmetric, but will modify the Kullback-Leibler divergence.
sampler_seed (None or int or np.random.Generator, optional) – Seed used for sampling in order to create reproducible results, can be a numpy generator (in which case it is used directly), or an integer (in which case it is used to seed a numpy generator).
sampler_kwargs (dict of str to Any) – Arguments passed to the sample method of COBRApy, see COBRApy Documentation
processes (int, default=1) – Number of processes to use for this function, passed to the sampler and also used as the number of processes for calculating the divergence for the different groups. Note that if you want a different number of processes for the sampler, you can use the sampler_kwargs dictionary.
**kwargs – Keyword arguments passed to the divergence method
- Returns:
Dataframe with index of genes, and columns representing the different target networks. Values represent the divergence of a particular target network between the unperturbed model and the model following the gene knock-out.
- Return type:
pd.DataFrame
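The effect of use_unperturbed_as_true and the layout of the result can be sketched with small discrete distributions. This is not the library's workflow, which samples fluxes from a cobra model: the distributions, the gene and network names, and the helper functions below are all hypothetical, and the divergence is an exact discrete KL rather than an estimate from samples.

```python
import math


def kl_discrete(p, q):
    """KL(P || Q) in nats for discrete distributions given as {state: probability}.

    Assumes q is positive wherever p is positive.
    """
    return sum(pv * math.log(pv / q[state]) for state, pv in p.items() if pv > 0)


def ko_divergence_table(unperturbed, knockouts, use_unperturbed_as_true=True):
    """unperturbed: {network: dist}; knockouts: {gene: {network: dist}}."""
    table = {}
    for gene, perturbed in knockouts.items():
        table[gene] = {}
        for network, base in unperturbed.items():
            if use_unperturbed_as_true:
                p, q = base, perturbed[network]  # unperturbed is the P distribution
            else:
                p, q = perturbed[network], base  # perturbed is the P distribution
            table[gene][network] = kl_discrete(p, q)
    return table


unperturbed = {"network_1": {"low": 0.5, "high": 0.5}}
knockouts = {
    "gene_a": {"network_1": {"low": 0.9, "high": 0.1}},  # knock-out shifts the fluxes
    "gene_b": {"network_1": {"low": 0.5, "high": 0.5}},  # knock-out has no effect
}
forward = ko_divergence_table(unperturbed, knockouts, use_unperturbed_as_true=True)
reverse = ko_divergence_table(unperturbed, knockouts, use_unperturbed_as_true=False)
print(forward["gene_a"]["network_1"], reverse["gene_a"]["network_1"])
```

The outer keys correspond to the index of genes and the inner keys to the target network columns described in the Returns section. Swapping which distribution plays the role of P changes the Kullback-Leibler value for gene_a, while a knock-out with no effect gives zero either way.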