Information Reference
API Documentation for the information submodule
Mutual Information Functions
Module for finding mutual information between two sampled distributions using nearest neighbor methods. Includes methods to compute mutual information between two continuous distributions, between two discrete distributions, and between a continuous and discrete distribution.
- metworkpy.information.mutual_information_functions.mutual_information(x: ArrayLike, y: ArrayLike, discrete_x: bool = False, discrete_y: bool = False, n_neighbors: int = 5, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, permutation_rng: Generator | int | None = None, permutation_estimation_method: Literal['kernel', 'empirical'] = 'empirical', jitter: None | float = None, jitter_seed: None | int = None, metric_x: str | float = 'euclidean', metric_y: str | float = 'euclidean', clip: bool = False) → float | Tuple[float, float]
- Parameters:
x (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If x is one dimensional, it will be reshaped to (n_samples, 1). If it is not a np.ndarray, this function will attempt to coerce it into one.
y (ArrayLike) – Array representing sample from a distribution, should have shape (n_samples, n_dimensions). If y is one dimensional, it will be reshaped to (n_samples, 1). If it is not a np.ndarray, this function will attempt to coerce it into one.
discrete_x (bool) – Whether x is discrete or continuous
discrete_y (bool) – Whether y is discrete or continuous
n_neighbors (int) – Number of neighbors to use for computing mutual information. Will attempt to coerce into an integer. Must be at least 1. Default 5.
calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test
alternative ('less', 'greater', or 'two-sided') – The alternative to use, passed to metworkpy.utils.permutation.permutation_test
permutations (int) – The number of permutations to use when calculating the p-value
permutation_rng (np.random.Generator or int, Optional) – A numpy random generator to use for sampling, or an int to seed the default generator.
permutation_estimation_method ({"kernel", "empirical"}, default="empirical") – Method to use for estimating p-value, either an empirical cdf, or a gaussian_kde
jitter (Union[None, float, tuple[float,float]]) – Amount of noise to add to avoid ties. If None no noise is added. If a float, that is the standard deviation of the random noise added to the continuous samples. If a tuple, the first element is the standard deviation of the noise added to the x array, the second element is the standard deviation added to the y array.
jitter_seed (Union[None, int]) – Seed for the random number generator used for adding noise
metric_x (Union[str, float]) – Metric to use for computing distance between points in x, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.
metric_y (Union[str, float]) – Metric to use for computing distance between points in y, can be “Euclidean”, “Manhattan”, or “Chebyshev”. Can also be a float representing the Minkowski p-norm.
clip (bool) – Whether to ensure the mutual information is non-negative
- Returns:
The mutual information between x and y, or if calculate_pvalue is True, a tuple of the mutual information and the p-value
- Return type:
float or tuple of float,float
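The permutation test behind calculate_pvalue can be pictured with a small standard-library sketch. This is not metworkpy's implementation: the plug-in estimator discrete_mi, the helper permutation_pvalue, and the (count + 1) / (permutations + 1) convention are all assumptions chosen for illustration, and the sketch covers only the 'greater' alternative with an empirical estimate.

```python
import math
import random
from collections import Counter


def discrete_mi(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (xi, yi), c in count_xy.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[xi] * count_y[yi]))
    return mi


def permutation_pvalue(x, y, permutations=500, seed=None):
    """Empirical p-value for the 'greater' alternative (hypothetical helper)."""
    rng = random.Random(seed)
    observed = discrete_mi(x, y)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # shuffling y breaks any dependence on x
        if discrete_mi(x, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one convention (an assumption) keeps the estimate strictly positive
    return observed, (at_least_as_large + 1) / (permutations + 1)


x = [0, 1] * 20
y = list(x)  # y is fully determined by x
mi, p_value = permutation_pvalue(x, y, permutations=200, seed=0)
```

Because y copies x, the observed value is ln(2) nats and almost no shuffle can match it, so the estimated p-value is small.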
Notes
The metrics can either be provided as a float greater than 1 representing the Minkowski p-norm, or a string representing the name of a metric such as ‘Manhattan’, ‘Chebyshev’, or ‘Euclidean’.
For scalar samples (samples from a 1-D distribution), all the metrics are the same.
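A short sketch of why the metric choice does not matter for scalar samples, using only the standard library. The function minkowski_distance is a hypothetical helper written for this illustration and is not part of metworkpy.

```python
import math


def minkowski_distance(a, b, p):
    """Minkowski p-norm distance between two equal-length points."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if math.isinf(p):
        return max(diffs)  # Chebyshev is the p -> infinity limit
    return sum(d ** p for d in diffs) ** (1.0 / p)


# p = 1 is Manhattan, p = 2 is Euclidean, p = inf is Chebyshev
a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(a, b, 1)
euclidean = minkowski_distance(a, b, 2)
chebyshev = minkowski_distance(a, b, math.inf)

# For scalar samples every choice of p reduces to |a - b|
scalar_distances = [
    minkowski_distance((1.5,), (4.0,), p) for p in (1, 2, 3.5, math.inf)
]
```

In two dimensions the three metrics disagree, while in one dimension they all give the absolute difference.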
In the case of two continuous distributions, the distance in the z space (i.e. the joint (X,Y) space) is determined by the maximum norm (||z-z`|| = max{||x-x`||, ||y-y`||}); see Kraskov et al. (2004), listed under See also, for more details.
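The maximum norm in the joint space can be written out directly. The sketch below assumes Euclidean distances within the x and y spaces, and joint_distance is a hypothetical name used only here.

```python
import math


def joint_distance(z1, z2):
    """Max-norm distance in the joint space; each z is an (x, y) pair of points."""
    (x1, y1), (x2, y2) = z1, z2
    dx = math.dist(x1, x2)  # Euclidean distance within the x space
    dy = math.dist(y1, y2)  # Euclidean distance within the y space
    return max(dx, dy)


z_a = ((0.0, 0.0), (1.0,))
z_b = ((3.0, 4.0), (2.5,))
d = joint_distance(z_a, z_b)  # dx = 5.0 and dy = 1.5, so the larger one wins
```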
Always returns the value in nats (i.e. mutual information is calculated using the natural logarithm).
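If bits are preferred, a value in nats can be converted by dividing by ln(2). The helper name nats_to_bits is an assumption for this example and is not provided by the module.

```python
import math


def nats_to_bits(mi_nats):
    """Convert an information quantity from nats to bits."""
    return mi_nats / math.log(2)


# A fair binary variable carries ln(2) nats, which is exactly one bit
one_bit = nats_to_bits(math.log(2))
```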
See also
- Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
Method for estimating mutual information between samples from two continuous distributions.
- Ross, B. C. (2014). Mutual Information between Discrete and Continuous Data Sets. PLoS ONE, 9(2), e87357.
Method for estimating mutual information between a sample from a discrete distribution and a sample from a continuous distribution.
Mutual Information Networks
Functions for computing the Mutual Information Network for a Metabolic Model
- metworkpy.information.mutual_information_network.create_grouped_mi_network(dataset: T, groups: Iterable[Hashable | int64] | dict[Hashable, Iterable[Hashable | int64]], calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, cutoff: float | None = None, cutoff_quantile: float | None = None, cutoff_significance: float | None = None, processes: int = -1, progress_bar: bool = False, **kwargs) → Graph
Calculate all pairwise values of mutual information between groups of columns in a dataset
- Parameters:
dataset (ArrayLike or DataFrame or NDArray) – The dataset to calculate grouped pairwise mutual information values for, should be a 2-dimensional array or DataFrame
groups (Iterable of indices or dict of Hashable to Iterable of indices) – The groups of columns in the dataset to calculate the mutual information between. Can be an iterable, in which case the groups will be named in order ‘0’, ‘1’, etc., or a dict in which case the groups will be named by the dict key. The definition of the groups themselves should be the indices of the columns in the dataset. If dataset is a numpy array, these should be ints, and if dataset is a pandas DataFrame, these will be passed to pandas.Index.get_indexer of the columns Index.
calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test
alternative ('less', 'greater', or 'two-sided') – The alternative to use
permutations (int) – The number of permutations to use when calculating the p-value
cutoff (float, optional) – Lower bound for mutual information, all values smaller than this are set to 0
cutoff_quantile (float, optional) – Lower bound for mutual information as a quantile, must be a value between 0 and 1 representing the quantile to use as a cutoff. Any values below this quantile will be set to 0.
cutoff_significance (float, optional) – Upper bound for the significance of the mutual information, any mutual information values with p-values above this cutoff will have their mutual information set to 0. Requires that calculate_pvalue is True.
processes (int, default=-1) – The number of processes to use for calculating the pairwise mutual information
progress_bar (bool, default=False) – Whether a progress bar is desired
kwargs – Keyword arguments passed into the mutual_information function
- Returns:
The mutual information network with a node for each group, and edges between groups with weights corresponding to the mutual information between them, and if calculate_pvalue is True another edge attribute ‘p-value’ which is the result of the permutation test for significance of the mutual information.
- Return type:
nx.Graph
Notes
The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
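The two accepted forms of the groups argument can be pictured as follows. This is a sketch of the naming rule described above, not the library's code: normalize_groups is a hypothetical helper, the use of string labels for unnamed groups is an assumption, and the reaction identifiers are made up.

```python
from itertools import combinations


def normalize_groups(groups):
    """Return a dict of group name -> list of column indices (hypothetical helper)."""
    if isinstance(groups, dict):
        return {name: list(cols) for name, cols in groups.items()}
    # Unnamed groups are labelled by position; string labels are an assumption
    return {str(i): list(cols) for i, cols in enumerate(groups)}


unnamed = normalize_groups([[0, 1], [2, 3, 4]])
named = normalize_groups({"glycolysis": ["PGI", "PFK"], "tca": ["CS", "ACONT"]})

# Each unordered pair of groups corresponds to one potential edge in the network
candidate_edges = list(combinations(named, 2))
```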
- metworkpy.information.mutual_information_network.mi_network_adjacency_matrix(samples: T, **kwargs) → T | Tuple[T, T]
Create a Mutual Information Network Adjacency matrix from flux samples. Uses kth nearest neighbor method for estimating mutual information.
- Parameters:
samples (ArrayLike or DataFrame or NDArray) – ArrayLike containing the samples, columns should represent different reactions while rows should represent different samples
kwargs – Keyword arguments passed to the mi_pairwise function
- Returns:
mutual_information_ – The mutual information adjacency matrix, will share a type with the ArrayLike passed in, and be a square symmetrical array, with the value at the ith row, jth column representing the mutual information between the ith and jth columns of the input samples dataset
- Return type:
ArrayLike or DataFrame or NDArray
See also
mi_pairwise – Function wrapped by this function
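The shape of the returned adjacency matrix can be illustrated with a standard-library sketch. Both helpers are hypothetical: pairwise_matrix only shows the square, symmetric layout, and abs_covariance is a stand-in statistic rather than a mutual information estimator. How the library treats the diagonal is not stated here, so this sketch simply computes it like any other entry.

```python
def pairwise_matrix(columns, statistic):
    """Square symmetric matrix of statistic(col_i, col_j) (hypothetical helper)."""
    n = len(columns)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = statistic(columns[i], columns[j])
            matrix[i][j] = value
            matrix[j][i] = value  # the measure is symmetric, so mirror it
    return matrix


def abs_covariance(a, b):
    """Stand-in for a mutual information estimator, used only for illustration."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return abs(sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a))


# Rows are samples, columns are reactions; transpose to iterate over columns
samples = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 5.9, 0.6], [4.0, 8.2, 0.5]]
columns = [list(col) for col in zip(*samples)]
adjacency = pairwise_matrix(columns, abs_covariance)
```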
- metworkpy.information.mutual_information_network.mi_pairwise(dataset: T, calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, cutoff: float | None = None, cutoff_quantile: float | None = None, cutoff_significance: float | None = None, processes: int = -1, progress_bar: bool = False, **kwargs) → T | Tuple[T, T]
Calculate all pairwise values of mutual information for columns in dataset
- Parameters:
dataset (ArrayLike or DataFrame or NDArray) – The dataset to calculate pairwise mutual information values for, should be a 2-dimensional array or DataFrame
calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test
alternative ('less', 'greater', or 'two-sided') – The alternative to use
permutations (int) – The number of permutations to use when calculating the p-value
cutoff (float, optional) – Lower bound for mutual information, all values smaller than this are set to 0
cutoff_quantile (float, optional) – Lower bound for mutual information as a quantile, must be a value between 0 and 1 representing the quantile to use as a cutoff. Any values below this quantile will be set to 0.
cutoff_significance (float, optional) – Upper bound for the significance of the mutual information, any mutual information values with p-values above this cutoff will have their mutual information set to 0. Requires that calculate_pvalue is True.
processes (int, default=-1) – The number of processes to use for calculating the pairwise mutual information
progress_bar (bool, default=False) – Whether a progress bar is desired
kwargs – Keyword arguments passed into the mutual_information function
- Returns:
The mutual information between every pair of columns in dataset. If dataset is a numpy NDArray or an ArrayLike, will return a numpy NDArray. If dataset is a pandas DataFrame, will return a pandas DataFrame. If calculate_pvalue is True, will instead return a tuple of the appropriate array type, with the first element being the mutual information array, and the second being the p-values
- Return type:
DataFrame or NDArray or Tuple of DataFrame or NDArray
Notes
The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
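The effect of cutoff and cutoff_significance described in the parameters can be sketched on plain nested lists. The helper apply_cutoffs is hypothetical and only mirrors the documented rules: values below cutoff become 0, and values whose p-value exceeds cutoff_significance become 0. The numbers are invented for the example.

```python
def apply_cutoffs(mi, p_values=None, cutoff=None, cutoff_significance=None):
    """Zero out entries that fail either threshold (hypothetical helper)."""
    if cutoff_significance is not None and p_values is None:
        raise ValueError("cutoff_significance requires p-values")
    result = [row[:] for row in mi]  # copy so the input is left untouched
    for i, row in enumerate(mi):
        for j, value in enumerate(row):
            if cutoff is not None and value < cutoff:
                result[i][j] = 0.0
            if cutoff_significance is not None and p_values[i][j] > cutoff_significance:
                result[i][j] = 0.0
    return result


mi = [[0.0, 0.8, 0.05], [0.8, 0.0, 0.3], [0.05, 0.3, 0.0]]
p_values = [[1.0, 0.002, 0.6], [0.002, 1.0, 0.2], [0.6, 0.2, 1.0]]
filtered = apply_cutoffs(mi, p_values, cutoff=0.1, cutoff_significance=0.05)
```

Only the 0.8 entries survive: 0.05 falls below the cutoff, and 0.3 has a p-value above the significance threshold.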
- metworkpy.information.mutual_information_network.mi_pairwise_grouped(dataset: T, groups: Iterable[Hashable | int64] | dict[Hashable, Iterable[Hashable | int64]], calculate_pvalue: bool = False, alternative: Literal['less', 'greater', 'two-sided'] = 'greater', permutations: int = 500, cutoff: float | None = None, cutoff_quantile: float | None = None, cutoff_significance: float | None = None, processes: int = -1, progress_bar: bool = False, **kwargs) → DataFrame | tuple[DataFrame, DataFrame]
Calculate all pairwise values of mutual information between groups of columns in a dataset
- Parameters:
dataset (ArrayLike or DataFrame or NDArray) – The dataset to calculate grouped pairwise mutual information values for, should be a 2-dimensional array or DataFrame
groups (Iterable of indices or dict of Hashable to Iterable of indices) – The groups of columns in the dataset to calculate the mutual information between. Can be an iterable, in which case the groups will be named in order ‘0’, ‘1’, etc., or a dict in which case the groups will be named by the dict key. The definition of the groups themselves should be the indices of the columns in the dataset. If dataset is a numpy array, these should be ints, and if dataset is a pandas DataFrame, these will be passed to pandas.Index.get_indexer of the columns Index.
calculate_pvalue (bool) – Whether to calculate a p-value for the mutual information using a permutation test
alternative ('less', 'greater', or 'two-sided') – The alternative to use
permutations (int) – The number of permutations to use when calculating the p-value
cutoff (float, optional) – Lower bound for mutual information, all values smaller than this are set to 0
cutoff_quantile (float, optional) – Lower bound for mutual information as a quantile, must be a value between 0 and 1 representing the quantile to use as a cutoff. Any values below this quantile will be set to 0.
cutoff_significance (float, optional) – Upper bound for the significance of the mutual information, any mutual information values with p-values above this cutoff will have their mutual information set to 0. Requires that calculate_pvalue is True.
processes (int, default=-1) – The number of processes to use for calculating the pairwise mutual information
progress_bar (bool, default=False) – Whether a progress bar is desired
kwargs – Keyword arguments passed into the mutual_information function
- Returns:
The mutual information between each pair of groups. If calculate_pvalue is False, will be a single pd.DataFrame with a column and row for each group. If calculate_pvalue is True, will instead return a tuple of pd.DataFrame objects, with the first element being the mutual information array, and the second being the p-values.
- Return type:
DataFrame or Tuple of DataFrame
Notes
The parallelization uses joblib, and so can be configured with joblib’s parallel_config context manager
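A quantile-based cutoff such as cutoff_quantile can be sketched as below. Two details are assumptions rather than documented behavior: the quantile is taken over the off-diagonal entries only, and it uses linear interpolation between order statistics. Both helper names are hypothetical, and the matrix values are invented.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def apply_quantile_cutoff(matrix, cutoff_quantile):
    """Zero entries below the given quantile of the off-diagonal values."""
    n = len(matrix)
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    threshold = quantile(off_diagonal, cutoff_quantile)
    return [[0.0 if v < threshold else v for v in row] for row in matrix]


group_mi = [[0.0, 0.8, 0.05], [0.8, 0.0, 0.3], [0.05, 0.3, 0.0]]
filtered = apply_quantile_cutoff(group_mi, 0.5)
```

With a quantile of 0.5 the threshold is the median off-diagonal value, so the weakest pair of groups is set to 0 while the others are kept.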