Information Reference

API Documentation for the information submodule

Mutual Information Functions

Module for finding mutual information between two sampled distributions using nearest neighbor methods. Includes methods to compute mutual information between two continuous distributions, between two discrete distributions, and between a continuous and discrete distribution.
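For two discrete samples, mutual information can be estimated directly from the joint frequency table. The following is a textbook plug-in sketch in pure Python, included only to make the quantity concrete. It is not necessarily the estimator metworkpy uses, and the function name discrete_mutual_information is hypothetical.

```python
import math
from collections import Counter


def discrete_mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete samples."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of samples")
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in count_xy.items():
        # p(x, y) * ln(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[xv] * count_y[yv]))
    return mi


# Identical binary samples share ln(2) nats; unrelated ones share none.
mi_identical = discrete_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_unrelated = discrete_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```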

metworkpy.information.mutual_information_functions.mutual_information(x: ArrayLike, y: ArrayLike, discrete_x: bool = False, discrete_y: bool = False, n_neighbors: int = 5, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, permutation_rng: Generator | int | None = None, permutation_estimation_method: Literal['kernel', 'empirical'] = 'empirical', jitter: None | float = None, jitter_seed: None | int = None, metric_x: str | float = 'euclidean', metric_y: str | float = 'euclidean', clip: bool = False) → float | Tuple[float, float]
Parameters:
  • x (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If x is one dimensional, it will be reshaped to (n_samples, 1). If it is not a np.ndarray, this function will attempt to coerce it into one.

  • y (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If y is one dimensional, it will be reshaped to (n_samples, 1). If it is not a np.ndarray, this function will attempt to coerce it into one.

  • discrete_x (bool) – Whether x is discrete or continuous

  • discrete_y (bool) – Whether y is discrete or continuous

  • n_neighbors (int) – Number of neighbors to use for computing mutual information. Will attempt to coerce into an integer. Must be at least 1. Default 5.

  • calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test

  • alternative ('less', 'greater', or 'two-sided') – The alternative to use, passed to metworkpy.utils.permutation.permutation_test

  • permutations (int) – The number of permutations to use when calculating the p-value

  • permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.

  • permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating the p-value from the permuted values, either an empirical CDF ("empirical") or a Gaussian kernel density estimate, gaussian_kde ("kernel")

  • jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to the x array, the second element is the standard deviation added to the y array.

  • jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise

  • metric_x (Union[str, float]) – Metric to use for computing distance between points in x, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.

  • metric_y (Union[str, float]) – Metric to use for computing distance between points in y, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.

  • clip (bool) – Whether to ensure the mutual information is non-negative
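
Tied values are awkward for nearest-neighbor estimators, because several points can sit at exactly the same distance. The sketch below illustrates the idea behind jitter and jitter_seed: zero-mean Gaussian noise with the given standard deviation is added to a continuous sample. The helper add_jitter is hypothetical and is not metworkpy's internal code.

```python
import random


def add_jitter(sample, jitter, jitter_seed=None):
    """Return a copy of `sample` with Gaussian noise of std `jitter` added."""
    if jitter is None:
        return list(sample)  # no noise requested
    rng = random.Random(jitter_seed)
    return [value + rng.gauss(0.0, jitter) for value in sample]


tied = [1.0, 1.0, 1.0, 2.0, 2.0]
noisy = add_jitter(tied, jitter=1e-6, jitter_seed=42)
```

The noise should be small relative to the spread of the data, so that ties are broken without changing which points are neighbors.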

Returns:

The mutual information between x and y, or if calculate_pvalue is True, a tuple of the mutual information and the p-value

Return type:

float or tuple of float,float
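
The p-value comes from a permutation test: one sample is shuffled repeatedly to break any dependence, and the observed statistic is compared with the shuffled values. The sketch below shows that idea for the 'greater' alternative with an empirical estimate. The helper names, the +1 correction, and the use of absolute covariance as a cheap stand-in for mutual information are all assumptions made for illustration; metworkpy.utils.permutation.permutation_test may differ in its details.

```python
import random


def abs_covariance(x, y):
    """Stand-in dependence statistic (NOT mutual information)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x))


def permutation_pvalue(x, y, statistic, permutations=500, seed=None):
    """Empirical p-value for the 'greater' alternative."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # destroys any pairing between x and y
        if statistic(x, shuffled) >= observed:
            at_least_as_large += 1
    # +1 in numerator and denominator keeps the p-value strictly positive
    return observed, (at_least_as_large + 1) / (permutations + 1)


data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(100)]
y = [value + 0.5 * data_rng.gauss(0.0, 1.0) for value in x]  # depends on x
observed, p_value = permutation_pvalue(x, y, abs_covariance, permutations=200, seed=0)
```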

Notes

  • The metrics can either be provided as a float greater than 1 representing the Minkowski p-norm, or a string representing the name of a metric such as ‘Manhattan’, ‘Chebyshev’, or ‘Euclidean’.

  • For scalar samples (samples from a 1-D distribution), all the metrics are the same.

  • In the case of two continuous distributions, the distance in the z space (i.e. the joint (X,Y) space) is determined by the maximum norm (||z - z'|| = max{||x - x'||, ||y - y'||}), see [1] for more details.

  • Always returns a value in nats (i.e. mutual information is calculated using the natural logarithm).
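
The named metrics are special cases of the Minkowski p-norm: Manhattan is p = 1, Euclidean is p = 2, and Chebyshev is the limit p → ∞. The sketch below, which uses a hypothetical helper minkowski_distance, shows that they differ in two dimensions but coincide for scalar samples.

```python
import math


def minkowski_distance(a, b, p):
    """Minkowski p-norm distance between two points given as sequences."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if math.isinf(p):
        return max(diffs)  # Chebyshev (maximum) norm
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Two-dimensional points: the metrics give different distances.
a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(a, b, 1)
euclidean = minkowski_distance(a, b, 2)
chebyshev = minkowski_distance(a, b, math.inf)

# Scalar points: every p reduces to |u - v|.
scalar = [minkowski_distance((1.5,), (4.0,), p) for p in (1, 2, math.inf)]
```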

See also

  1. Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.

    Method for estimating mutual information between samples from two continuous distributions.

  2. Ross, B. C. (2014). Mutual Information between Discrete and Continuous Data Sets. PLoS ONE, 9(2), e87357.

    Method for estimating mutual information between a sample from a discrete distribution and a sample from a continuous distribution.
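
A minimal sketch of the first estimator from Kraskov et al. [1] for two scalar continuous samples, written in pure Python with a brute-force neighbor search. It is meant only to illustrate the maximum-norm distance in the joint space and the result in nats. It is not metworkpy's implementation, and the helper names are hypothetical.

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at a positive integer n."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def ksg_mutual_information(x, y, n_neighbors=5):
    """KSG (algorithm 1) estimate, in nats, for scalar samples x and y."""
    n = len(x)
    if len(y) != n or not 1 <= n_neighbors < n:
        raise ValueError("need equal-length samples and 1 <= n_neighbors < n")
    total = 0.0
    for i in range(n):
        # Distance in the joint space is the larger of the two marginal distances.
        joint = sorted(
            max(abs(x[i] - x[j]), abs(y[i] - y[j])) for j in range(n) if j != i
        )
        eps = joint[n_neighbors - 1]  # distance to the k-th nearest neighbor
        # Count points strictly closer than eps in each marginal space.
        n_x = sum(1 for j in range(n) if j != i and abs(x[i] - x[j]) < eps)
        n_y = sum(1 for j in range(n) if j != i and abs(y[i] - y[j]) < eps)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(n_neighbors) + digamma_int(n) - total / n


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y_independent = [rng.gauss(0.0, 1.0) for _ in range(200)]
y_dependent = [value + 0.1 * rng.gauss(0.0, 1.0) for value in x]

mi_independent = ksg_mutual_information(x, y_independent)
mi_dependent = ksg_mutual_information(x, y_dependent)
```

The estimate for independent samples can come out slightly negative, which is the situation the clip parameter addresses.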

Mutual Information Networks

Functions for computing the Mutual Information Network for a Metabolic Model

metworkpy.information.mutual_information_network.create_grouped_mi_network(dataset: T, groups: Iterable[Hashable | int64] | dict[Hashable, Iterable[Hashable | int64]], calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, cutoff: float | None = None, cutoff_quantile: float | None = None, cutoff_significance: float | None = None, processes: int = -1, progress_bar: bool = False, **kwargs) → Graph

Calculate all pairwise values of mutual information between groups of columns in a dataset

Parameters:
  • dataset (ArrayLike or DataFrame or NDArray) – The dataset to calculate grouped pairwise mutual information values for, should be a 2-dimensional array or DataFrame

  • groups (Iterable of indices or dict of Hashable to Iterable of indices) – The groups of columns in the dataset to calculate the mutual information between. Can be an iterable, in which case the groups will be named in order ‘0’, ‘1’, etc., or a dict in which case the groups will be named by the dict key. The definition of the groups themselves should be the indices of the columns in the dataset. If dataset is a numpy array, these should be ints, and if dataset is a pandas DataFrame, these will be passed to pandas.Index.get_indexer of the columns Index.

  • calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test

  • alternative ('less', 'greater', or 'two-sided') – The alternative to use

  • permutations (int) – The number of permutations to use when calculating the p-value

  • cutoff (float, optional) – Lower bound for mutual information, all values smaller than this are set to 0

  • cutoff_quantile (float, optional) – Lower bound for mutual information as a quantile, must be a value between 0 and 1 representing the quantile to use as a cutoff. Any values below this quantile will be set to 0.

  • cutoff_significance (float, optional) – Upper bound for the significance of the mutual information, any mutual information values with p-values above this cutoff will have their mutual information set to 0. Requires that calculate_pvalue is True.

  • processes (int, default=-1) – The number of processes to use for calculating the pairwise mutual information

  • progress_bar (bool, default=False) – Whether a progress bar is desired

  • kwargs – Keyword arguments passed into the mutual_information function

Returns:

The mutual information network, with a node for each group and edges between groups weighted by the mutual information between them. If calculate_pvalue is True, each edge also has a ‘p-value’ attribute holding the result of the permutation test for significance of the mutual information.

Return type:

nx.Graph
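
The sketch below shows how a symmetric group-by-group mutual information table maps onto weighted edges of this kind. It builds a plain edge list in the (u, v, attribute dict) form accepted by networkx's Graph.add_edges_from. The helper name, the 'weight' attribute key, the choice to skip zero-valued pairs, and the example values are illustrative assumptions rather than documented metworkpy behaviour.

```python
def mi_table_to_edges(names, mi, pvalues=None):
    """Turn a symmetric MI table (nested lists) into weighted edge tuples."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only: undirected
            if mi[i][j] == 0:
                continue  # assumption: pairs zeroed by a cutoff get no edge
            attrs = {"weight": mi[i][j]}
            if pvalues is not None:
                attrs["p-value"] = pvalues[i][j]
            edges.append((names[i], names[j], attrs))
    return edges


names = ["glycolysis", "tca", "ppp"]  # made-up group names
mi = [
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.3],
    [0.0, 0.3, 0.0],
]
edges = mi_table_to_edges(names, mi)
# With networkx installed: graph = nx.Graph(); graph.add_edges_from(edges)
```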

Notes

The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager

metworkpy.information.mutual_information_network.mi_network_adjacency_matrix(samples: T, **kwargs) → T | Tuple[T, T]

Create a Mutual Information Network Adjacency matrix from flux samples. Uses kth nearest neighbor method for estimating mutual information.

Parameters:
  • samples (ArrayLike or DataFrame or NDArray) – ArrayLike containing the samples, columns should represent different reactions while rows should represent different samples

  • kwargs – Keyword arguments passed to the mi_pairwise function

Returns:

mutual_information_ – The mutual information adjacency matrix, will share a type with the ArrayLike passed in, and be a square symmetrical array, with the value at the ith row, jth column representing the mutual information between the ith and jth columns of the input samples dataset

Return type:

ArrayLike or DataFrame or NDArray

See also

mi_pairwise

Function wrapped by this function
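
The layout of the adjacency matrix can be sketched as follows: rows of the input are samples, columns are reactions, and entry (i, j) holds a symmetric statistic computed between columns i and j. The helper names are hypothetical, absolute covariance stands in for the nearest-neighbor mutual information estimate, and leaving the diagonal at zero is an assumption.

```python
def abs_covariance(x, y):
    """Stand-in symmetric statistic (NOT mutual information)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x))


def pairwise_matrix(samples, statistic):
    """Symmetric matrix of statistic(col_i, col_j) for every pair of columns."""
    columns = list(zip(*samples))  # transpose: one tuple per reaction
    n = len(columns)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = statistic(columns[i], columns[j])
            matrix[i][j] = value
            matrix[j][i] = value  # mirror, since the statistic is symmetric
    return matrix


# Three samples (rows) of three reaction fluxes (columns), made-up values.
samples = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 4.0],
]
adjacency = pairwise_matrix(samples, abs_covariance)
```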

metworkpy.information.mutual_information_network.mi_pairwise(dataset: T, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, cutoff: float | None = None, cutoff_quantile: float | None = None, cutoff_significance: float | None = None, processes: int = -1, progress_bar: bool = False, **kwargs) → T | Tuple[T, T]

Calculate all pairwise values of mutual information for columns in dataset

Parameters:
  • dataset (ArrayLike or DataFrame or NDArray) – The dataset to calculate pairwise mutual information values for, should be a 2-dimensional array or DataFrame

  • calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test

  • alternative ('less', 'greater', or 'two-sided') – The alternative to use

  • permutations (int) – The number of permutations to use when calculating the p-value

  • cutoff (float, optional) – Lower bound for mutual information, all values smaller than this are set to 0

  • cutoff_quantile (float, optional) – Lower bound for mutual information as a quantile, must be a value between 0 and 1 representing the quantile to use as a cutoff. Any values below this quantile will be set to 0.

  • cutoff_significance (float, optional) – Upper bound for the significance of the mutual information, any mutual information values with p-values above this cutoff will have their mutual information set to 0. Requires that calculate_pvalue is True.

  • processes (int, default=-1) – The number of processes to use for calculating the pairwise mutual information

  • progress_bar (bool, default=False) – Whether a progress bar is desired

  • kwargs – Keyword arguments passed into the mutual_information function

Returns:

The mutual information between every pair of columns in dataset. If dataset is a numpy NDArray or an ArrayLike, will return a numpy NDArray. If dataset is a pandas DataFrame, will return a pandas DataFrame. If calculate_pvalue is True, will instead return a tuple of the appropriate array type, with the first element being the mutual information array, and the second being the p-values

Return type:

DataFrame or NDArray or Tuple of DataFrame or NDArray
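
The effect of cutoff and cutoff_significance can be sketched on plain nested lists: values below the mutual information cutoff, or with p-values above the significance cutoff, are set to 0. The helper apply_cutoffs and the example numbers are hypothetical. A cutoff_quantile would reduce to the same logic once the quantile has been converted to an absolute value.

```python
def apply_cutoffs(mi, pvalues=None, cutoff=None, cutoff_significance=None):
    """Zero out entries of an MI table that fail the given cutoffs."""
    if cutoff_significance is not None and pvalues is None:
        raise ValueError("cutoff_significance requires p-values")
    result = [row[:] for row in mi]  # copy so the input is left untouched
    for i, row in enumerate(result):
        for j, value in enumerate(row):
            too_small = cutoff is not None and value < cutoff
            not_significant = (
                cutoff_significance is not None
                and pvalues[i][j] > cutoff_significance
            )
            if too_small or not_significant:
                row[j] = 0.0
    return result


mi = [
    [0.0, 0.5, 0.05],
    [0.5, 0.0, 0.2],
    [0.05, 0.2, 0.0],
]
pvalues = [
    [1.0, 0.01, 0.6],
    [0.01, 1.0, 0.2],
    [0.6, 0.2, 1.0],
]
filtered = apply_cutoffs(mi, pvalues, cutoff=0.1, cutoff_significance=0.05)
```

Here only the 0.5 entry survives: 0.05 falls below the cutoff, and 0.2 has a p-value above the significance threshold.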

Notes

The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager

metworkpy.information.mutual_information_network.mi_pairwise_grouped(dataset: T, groups: Iterable[Hashable | int64] | dict[Hashable, Iterable[Hashable | int64]], calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, cutoff: float | None = None, cutoff_quantile: float | None = None, cutoff_significance: float | None = None, processes: int = -1, progress_bar: bool = False, **kwargs) → DataFrame | tuple[DataFrame, DataFrame]

Calculate all pairwise values of mutual information between groups of columns in a dataset

Parameters:
  • dataset (ArrayLike or DataFrame or NDArray) – The dataset to calculate grouped pairwise mutual information values for, should be a 2-dimensional array or DataFrame

  • groups (Iterable of indices or dict of Hashable to Iterable of indices) –

    The groups of columns in the dataset to calculate the mutual information between. Can be an iterable, in which case the groups will be named in order ‘0’, ‘1’, etc., or a dict in which case the groups will be named by the dict key. The definition of the groups themselves should be the indices of the columns in the dataset. If dataset is a numpy array, these should be ints, and if dataset is a pandas DataFrame, these will be passed to pandas.Index.get_indexer of the columns Index.

  • calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test

  • alternative ('less', 'greater', or 'two-sided') – The alternative to use

  • permutations (int) – The number of permutations to use when calculating the p-value

  • cutoff (float, optional) – Lower bound for mutual information, all values smaller than this are set to 0

  • cutoff_quantile (float, optional) – Lower bound for mutual information as a quantile, must be a value between 0 and 1 representing the quantile to use as a cutoff. Any values below this quantile will be set to 0.

  • cutoff_significance (float, optional) – Upper bound for the significance of the mutual information, any mutual information values with p-values above this cutoff will have their mutual information set to 0. Requires that calculate_pvalue is True.

  • processes (int, default=-1) – The number of processes to use for calculating the pairwise mutual information

  • progress_bar (bool, default=False) – Whether a progress bar is desired

  • kwargs – Keyword arguments passed into the mutual_information function

Returns:

The mutual information between each pair of groups. If calculate_pvalue is False, will be a single pd.DataFrame with a column and row for each group. If calculate_pvalue is True, will instead return a tuple of two pd.DataFrame objects, with the first element being the mutual information values, and the second being the p-values.

Return type:

DataFrame or Tuple of DataFrame
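
The two accepted forms of groups can be sketched as follows. A plain iterable is named by position, a dict keeps its keys, and each group selects a block of columns with shape (n_samples, n_dimensions). The helper names are hypothetical, integer column indices are assumed (the numpy case), and naming positional groups with the strings '0', '1', and so on is one reading of the description above.

```python
def normalize_groups(groups):
    """Map group names to lists of column indices."""
    if isinstance(groups, dict):
        return {name: list(cols) for name, cols in groups.items()}
    # Plain iterable: name the groups by position ('0', '1', ...)
    return {str(i): list(cols) for i, cols in enumerate(groups)}


def select_columns(rows, cols):
    """Extract the given column indices from a row-major table."""
    return [[row[c] for c in cols] for row in rows]


positional = normalize_groups([[0, 1], [2]])
named = normalize_groups({"a": (0,), "b": (1, 2)})

rows = [[1, 2, 3], [4, 5, 6]]  # two samples, three columns
block = select_columns(rows, positional["0"])  # multi-column sample for group '0'
```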

Notes

The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager