AutoEncoder Utils¶
The utils package provides utility functions for data processing, visualization, and sequence handling in the AutoEncoder module.
Processing Module¶
The processing module contains functions for data preprocessing, normalization, and transformation.
- mango_autoencoder.utils.processing.abs_mean(data: numpy.ndarray | pandas.Series | polars.Series) numpy.floating | float ¶
Calculate the mean of absolute values in the data.
Computes the arithmetic mean of the absolute values of all elements in the input data. Supports NumPy arrays, Pandas Series, and Polars Series.
- Parameters:
data (Union[np.ndarray, pd.Series, pl.Series]) – Input data for which to calculate absolute mean
- Returns:
Mean of absolute values
- Return type:
Union[np.floating, float]
- Raises:
TypeError – If data type is not supported
- Example:
>>> import numpy as np
>>> import pandas as pd
>>> data = np.array([-2, -1, 0, 1, 2])
>>> abs_mean(data)
1.2
>>> series = pd.Series([-5, 3, -1, 4])
>>> abs_mean(series)
3.25
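For NumPy input, the computation reduces to a single expression (a sketch of the idea, not necessarily the exact implementation):
np.mean(np.abs(data))  # -> 1.2 for the array above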
- mango_autoencoder.utils.processing.reintroduce_nans(self, df: pandas.DataFrame, id: str) pandas.DataFrame ¶
Reintroduce NaN values back into the dataset based on stored coordinates.
When preparing datasets for autoencoder training, NaN values are typically removed. This function restores the original NaN values to their correct positions using the stored NaN coordinates for the specified dataset ID.
- Parameters:
df (pd.DataFrame) – Data to reintroduce NaNs back into
id (str) – Identifier of the dataset (use “global” when there is only one dataset)
- Returns:
Data with reintroduced NaNs in their original positions
- Return type:
pd.DataFrame
- Raises:
ValueError – If “id” not in self._nan_coordinates
- Example:
>>> # Assuming self._nan_coordinates contains stored NaN positions
>>> df_clean = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})
>>> df_with_nans = reintroduce_nans(df_clean, "dataset_1")
>>> # NaN values are restored based on stored coordinates
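Conceptually, the restoration writes NaN back at each stored coordinate. A minimal sketch, assuming _nan_coordinates maps each dataset ID to (row, column) pairs (the actual stored structure may differ):
import numpy as np

for row, col in nan_coordinates["dataset_1"]:  # hypothetical structure
    df_with_nans.loc[row, col] = np.nan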
- mango_autoencoder.utils.processing.id_pivot(df: pandas.DataFrame, id: str) pandas.DataFrame ¶
Select subset of data based on ID and pivot to time-series format.
Extracts data for a specific ID and transforms it from long format (with feature, time_step, value columns) to wide format with time_step as rows and features as columns, preserving the original feature order.
- Parameters:
df (pd.DataFrame) – Data in long format with columns: id, feature, time_step, value, data_split
id (str) – Identifier of the dataset (use “global” when there is only one dataset)
- Returns:
Data subset pivoted to wide format with time_step as index and features as columns
- Return type:
pd.DataFrame
- Raises:
ValueError – If df does not have required columns or no data found for ID
- Example:
>>> df_long = pd.DataFrame({
...     "id": ["A", "A", "A", "A"],
...     "feature": ["temp", "humidity", "temp", "humidity"],
...     "time_step": [0, 0, 1, 1],
...     "value": [25.5, 60.0, 26.0, 58.0],
...     "data_split": ["train", "train", "train", "train"]
... })
>>> df_pivoted = id_pivot(df_long, "A")
>>> # Result: time_step as index, temp and humidity as columns
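The transformation is essentially a filtered pandas pivot. A minimal sketch of the idea (the actual implementation also validates columns before pivoting):
subset = df_long[df_long["id"] == "A"]
wide = subset.pivot(index="time_step", columns="feature", values="value")
# Reorder columns to match first appearance in the long data
wide = wide[list(dict.fromkeys(subset["feature"]))]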
- mango_autoencoder.utils.processing.save_csv(data: pandas.DataFrame, save_path: str, filename: str, save_index: bool = False, decimals: int = 4, compression: str = 'infer', logger_msg: str = 'standard') None ¶
Save a DataFrame as a CSV file with configurable formatting options.
Saves a pandas DataFrame to a CSV file with options for index inclusion, decimal precision, and compression. Creates the directory if it doesn’t exist and provides logging feedback on the save operation.
- Parameters:
data (pd.DataFrame) – DataFrame to save to CSV
save_path (str) – Directory path where the CSV file will be saved
filename (str) – Name of the CSV file (must end with .csv or .csv.zip)
save_index (bool) – Whether to include the DataFrame index in the CSV
decimals (int) – Number of decimal places for floating point numbers
compression (str) – Type of compression to use (‘infer’, ‘gzip’, ‘bz2’, etc.)
logger_msg (str) – Custom logger message or ‘standard’ for default message
- Returns:
None
- Return type:
None
- Raises:
ValueError – If filename doesn’t end with .csv or .csv.zip
Exception – If file saving fails
- Example:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1.123456, 2.789012], "B": [3.456789, 4.012345]})
>>> save_csv(df, "./output", "data.csv", decimals=2)
>>> # Saves data.csv with 2 decimal places in ./output/ directory
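Ignoring validation and logging, the call above behaves roughly like this plain pandas sequence (a sketch):
import os

os.makedirs("./output", exist_ok=True)
df.round(2).to_csv(os.path.join("./output", "data.csv"), index=False, compression="infer")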
- mango_autoencoder.utils.processing.time_series_split(data: numpy.ndarray, train_size: float, val_size: float, test_size: float) Tuple[numpy.ndarray, numpy.ndarray | None, numpy.ndarray | None] ¶
Split time series data into training, validation, and test sets sequentially.
Performs a sequential split of time series data maintaining temporal order, which is crucial for time series analysis. The data is split according to the specified proportions, with training data coming first, followed by validation and test data.
- Parameters:
data (np.ndarray) – Time series data array to split (samples x features)
train_size (float) – Proportion of the dataset for training (0.0 to 1.0)
val_size (float) – Proportion of the dataset for validation (0.0 to 1.0)
test_size (float) – Proportion of the dataset for testing (0.0 to 1.0)
- Returns:
Tuple containing (training_data, validation_data, test_data)
- Return type:
Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]
- Raises:
ValueError – If sizes are None or their sum is not 1.0
- Example:
>>> import numpy as np
>>> data = np.random.randn(1000, 5)  # 1000 samples, 5 features
>>> train, val, test = time_series_split(data, 0.7, 0.2, 0.1)
>>> print(f"Train: {train.shape}, Val: {val.shape}, Test: {test.shape}")
Train: (700, 5), Val: (200, 5), Test: (100, 5)
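The sequential split amounts to simple index-based slicing (a sketch; boundary rounding in the actual implementation may differ):
n = len(data)
n_train, n_val = int(n * 0.7), int(n * 0.2)
train = data[:n_train]                    # earliest samples
val = data[n_train:n_train + n_val]       # next block
test = data[n_train + n_val:]             # most recent samples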
- mango_autoencoder.utils.processing.convert_data_to_numpy(data: Any) Tuple[numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], List[str]] ¶
Convert various data formats to numpy arrays for autoencoder processing.
Handles conversion of pandas DataFrames, polars DataFrames, numpy arrays, and tuples of these types to numpy format. Extracts feature names when available and ensures consistency across tuple elements.
- Parameters:
data (Any) – Input data that can be pandas DataFrame, polars DataFrame, numpy array, or tuple of these types
- Returns:
Tuple containing (converted_data, feature_names)
- Return type:
Tuple[Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]], List[str]]
- Raises:
ValueError – If data type is not supported or tuple elements have different feature names
- Example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> array, features = convert_data_to_numpy(df)
>>> print(f"Array shape: {array.shape}, Features: {features}")
Array shape: (3, 2), Features: ['A', 'B']
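The conversion is essentially a type dispatch. A simplified sketch of the DataFrame cases (tuple handling and validation omitted):
import pandas as pd
import polars as pl

if isinstance(data, pd.DataFrame):
    array, features = data.to_numpy(), list(data.columns)
elif isinstance(data, pl.DataFrame):
    array, features = data.to_numpy(), data.columns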
- mango_autoencoder.utils.processing.denormalize_data(data: numpy.ndarray, normalization_method: str | None, min_x: numpy.ndarray | None = None, max_x: numpy.ndarray | None = None, mean_: numpy.ndarray | None = None, std_: numpy.ndarray | None = None) numpy.ndarray ¶
Denormalize data back to its original scale using stored normalization parameters.
Reverses the normalization process applied during training, converting normalized data back to its original scale. Supports minmax and zscore normalization methods with their respective parameter sets.
- Parameters:
data (np.ndarray) – Normalized data to denormalize (samples x features)
normalization_method (Optional[str]) – Method used for original normalization (‘minmax’, ‘zscore’, or None)
min_x (Optional[np.ndarray]) – Minimum values used for minmax normalization (per feature)
max_x (Optional[np.ndarray]) – Maximum values used for minmax normalization (per feature)
mean_ (Optional[np.ndarray]) – Mean values used for zscore normalization (per feature)
std_ (Optional[np.ndarray]) – Standard deviation values used for zscore normalization (per feature)
- Returns:
Denormalized data in original scale
- Return type:
np.ndarray
- Raises:
ValueError – If normalization method is invalid
TypeError – If required parameters are not numpy arrays
- Example:
>>> import numpy as np
>>> normalized_data = np.array([[0.0, 0.5], [1.0, 1.0]])  # Minmax normalized
>>> min_vals = np.array([10, 20])
>>> max_vals = np.array([30, 40])
>>> original_data = denormalize_data(normalized_data, "minmax", min_vals, max_vals)
>>> print(original_data)
[[10. 30.]
 [30. 40.]]
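Both inverses are the standard ones; a sketch, assuming per-feature parameter arrays:
# minmax inverse: x = x_norm * (max - min) + min
original = normalized_data * (max_vals - min_vals) + min_vals
# zscore inverse: x = x_norm * std + mean
# original = normalized_data * std_ + mean_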
- mango_autoencoder.utils.processing.normalize_data_for_training(x_train: numpy.ndarray, x_val: numpy.ndarray, x_test: numpy.ndarray, normalization_method: str | None) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, Dict[str, Any]] ¶
Normalize training, validation, and test data using the specified method.
Applies normalization to all three datasets using parameters computed from the training data only. This prevents data leakage, since validation and test sets never influence the normalization statistics. Handles constant columns safely by avoiding division by zero.
- Parameters:
x_train (np.ndarray) – Training data to normalize (samples x features)
x_val (np.ndarray) – Validation data to normalize (samples x features)
x_test (np.ndarray) – Test data to normalize (samples x features)
normalization_method (Optional[str]) – Method to use (‘minmax’, ‘zscore’, or None)
- Returns:
Tuple containing (normalized_train, normalized_val, normalized_test, normalization_params)
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray, Dict[str, Any]]
- Raises:
ValueError – If normalization method is invalid
- Example:
>>> import numpy as np
>>> train = np.array([[1, 10], [2, 20], [3, 30]])
>>> val = np.array([[4, 40], [5, 50]])
>>> test = np.array([[6, 60]])
>>> norm_train, norm_val, norm_test, params = normalize_data_for_training(train, val, test, "minmax")
>>> print(f"Normalized train shape: {norm_train.shape}")
Normalized train shape: (3, 2)
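A sketch of the minmax branch, fitting statistics on the training data only and guarding constant columns against division by zero (variable names here are assumptions, not the actual keys of the returned dict):
import numpy as np

min_x, max_x = train.min(axis=0), train.max(axis=0)
range_x = np.where(max_x > min_x, max_x - min_x, 1.0)  # constant columns divide by 1
norm_train = (train - min_x) / range_x
norm_val = (val - min_x) / range_x    # validation and test reuse training statistics
norm_test = (test - min_x) / range_x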
- mango_autoencoder.utils.processing.normalize_data(data: numpy.ndarray, normalization_method: str | None, normalization_values: Dict[str, Any]) Tuple[numpy.ndarray, Dict[str, numpy.ndarray]] ¶
Normalize data using provided normalization parameters or compute new ones.
Normalizes data using either pre-computed normalization parameters or new parameters computed from the data itself. Supports minmax and zscore normalization methods with safe handling of constant columns.
- Parameters:
data (np.ndarray) – Data to normalize (samples x features)
normalization_method (Optional[str]) – Method to use (‘minmax’, ‘zscore’, or None)
normalization_values (Dict[str, Any]) – Dictionary containing normalization parameters
- Returns:
Tuple containing (normalized_data, normalization_parameters)
- Return type:
Tuple[np.ndarray, Dict[str, np.ndarray]]
- Raises:
ValueError – If normalization method is invalid or required parameters are missing
- Example:
>>> import numpy as np
>>> data = np.array([[1, 10], [2, 20], [3, 30]])
>>> norm_data, params = normalize_data(data, "minmax", {})
>>> print(f"Normalized data shape: {norm_data.shape}")
Normalized data shape: (3, 2)
- mango_autoencoder.utils.processing.normalize_data_for_prediction(data: numpy.ndarray, normalization_method: str | None, min_x: numpy.ndarray | None = None, max_x: numpy.ndarray | None = None, mean_: numpy.ndarray | None = None, std_: numpy.ndarray | None = None, feature_to_check: int | List[int] | None = None, feature_to_check_filter: bool = False) numpy.ndarray ¶
Normalize new data for prediction using stored normalization parameters.
Normalizes new data for prediction using either pre-computed normalization parameters from training or new parameters computed from the input data. Supports feature filtering for selective normalization of specific features.
- Parameters:
data (np.ndarray) – New data to normalize for prediction (samples x features)
normalization_method (Optional[str]) – Method to use for normalization (‘minmax’, ‘zscore’, or None)
min_x (Optional[np.ndarray]) – Minimum values for minmax normalization (per feature)
max_x (Optional[np.ndarray]) – Maximum values for minmax normalization (per feature)
mean_ (Optional[np.ndarray]) – Mean values for zscore normalization (per feature)
std_ (Optional[np.ndarray]) – Standard deviation values for zscore normalization (per feature)
feature_to_check (Optional[Union[int, List[int]]]) – Feature indices to apply normalization to
feature_to_check_filter (bool) – Whether to filter features before normalization
- Returns:
Normalized data ready for prediction
- Return type:
np.ndarray
- Raises:
ValueError – If normalization method is invalid
- Example:
>>> import numpy as np
>>> new_data = np.array([[4, 40], [5, 50]])
>>> min_vals = np.array([1, 10])
>>> max_vals = np.array([3, 30])
>>> norm_data = normalize_data_for_prediction(new_data, "minmax", min_vals, max_vals)
>>> print(f"Normalized data shape: {norm_data.shape}")
Normalized data shape: (2, 2)
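Applying stored training parameters is a direct reuse of the training formula. For the minmax example above (a sketch):
norm = (new_data - min_vals) / (max_vals - min_vals)
# -> [[1.5, 1.5], [2.0, 2.0]]; values above 1.0 are expected when new data
#    falls outside the training range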
- mango_autoencoder.utils.processing.apply_padding(data: numpy.ndarray, reconstructed: numpy.ndarray, context_window: int, time_step_to_check: int | List[int]) numpy.ndarray ¶
Apply padding to reconstructed data to match original data shape.
Handles the padding of reconstructed data to match the original data shape, taking into account the context window and the specific time step being predicted. This is essential for maintaining temporal alignment in time series reconstruction.
- Parameters:
data (np.ndarray) – Original dataset with shape (num_samples, num_features)
reconstructed (np.ndarray) – Predicted values with shape (num_samples - context_window, num_features)
context_window (int) – Size of the context window used for prediction
time_step_to_check (Union[int, List[int]]) – Time step to predict within the window (0 to context_window-1)
- Returns:
Padded reconstructed dataset with shape matching the original data
- Return type:
np.ndarray
- Raises:
ValueError – If time_step_to_check is not within valid range
- Example:
>>> import numpy as np
>>> original = np.random.randn(100, 5)  # 100 samples, 5 features
>>> reconstructed = np.random.randn(90, 5)  # 90 samples after context window
>>> padded = apply_padding(original, reconstructed, 10, 0)
>>> print(f"Padded shape: {padded.shape}")
Padded shape: (100, 5)
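For a single time_step_to_check, the padding is conceptually a NaN fill around the reconstructed block. A rough sketch (the actual offset logic and fill value may differ):
import numpy as np

padded = np.full(original.shape, np.nan)
offset = 0  # position of the checked time step within each window
padded[offset:offset + reconstructed.shape[0]] = reconstructed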
- mango_autoencoder.utils.processing.handle_id_columns(data: numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], id_columns: str | int | List[str] | List[int] | None, features_name: List[str] | None, context_window: int | None) Tuple[numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], numpy.ndarray | None, Dict[str, numpy.ndarray], List[int]] ¶
Handle ID column processing for data grouping and validation.
Processes data to extract ID columns for grouping while ensuring each ID has sufficient samples for the specified context window. Removes ID columns from the data and creates mappings for grouped processing.
- Parameters:
data (Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]]) – Data to process, can be single array or tuple of arrays (train, val, test)
id_columns (Union[str, int, List[str], List[int], None]) – Column(s) to use for grouping data by IDs
features_name (Optional[List[str]]) – List of feature names for column identification
context_window (Optional[int]) – Context window size for the model
- Returns:
Tuple containing:
- Processed data (with ID columns removed)
- ID mapping array
- Dictionary with grouped data by ID
- List of column indices that were ID columns
- Return type:
Tuple[Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]], Optional[np.ndarray], Dict[str, np.ndarray], List[int]]
- Raises:
ValueError – If id_columns format is invalid or minimum samples per ID is less than context_window
- Example:
>>> import numpy as np
>>> data = np.array([[1, 2, 3], [1, 4, 5], [2, 6, 7]])  # First column is ID
>>> processed, ids, grouped, indices = handle_id_columns(data, 0, ["id", "feat1", "feat2"], 2)
>>> print(f"Processed shape: {processed.shape}, IDs: {ids}")
Processed shape: (3, 2), IDs: ['1' '1' '2']
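The grouping step is conceptually a per-ID partition of the rows after dropping the ID column. A sketch (the validation against context_window is omitted):
import numpy as np

id_values = data[:, 0].astype(str)
features = data[:, 1:]
grouped = {uid: features[id_values == uid] for uid in np.unique(id_values)}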
Plots Module¶
The plots module provides visualization functions for model evaluation and analysis.
- mango_autoencoder.utils.plots.plot_loss_history(train_loss: List[float], val_loss: List[float], save_path: str)¶
Plot training and validation loss history over epochs.
Creates an interactive Plotly line chart showing the progression of training and validation losses during model training. The plot is saved as an HTML file in the specified directory.
- Parameters:
train_loss (List[float]) – Training loss values for each epoch
val_loss (List[float]) – Validation loss values for each epoch
save_path (str) – Directory path where the plot HTML file will be saved
- Returns:
None
- Return type:
None
- Example:
>>> train_losses = [0.5, 0.3, 0.2, 0.15, 0.1]
>>> val_losses = [0.6, 0.4, 0.25, 0.18, 0.12]
>>> plot_loss_history(train_losses, val_losses, "./plots")
>>> # Saves loss_history.html in ./plots/ directory
- mango_autoencoder.utils.plots.plot_actual_and_reconstructed(df_actual: pandas.DataFrame, df_reconstructed: pandas.DataFrame, save_path: str, feature_labels: List[str] | None = None)¶
Plot actual vs reconstructed values for each feature and save to specified folder.
Creates comprehensive visualizations comparing original data with autoencoder reconstructions. Supports different data structures including ID-based data and dataset splits (train/validation/test). Generates multiple plot types including separate views, overlapped views, and combined feature plots.
- Parameters:
df_actual (pd.DataFrame) – DataFrame containing actual/original values
df_reconstructed (pd.DataFrame) – DataFrame containing reconstructed values from autoencoder
save_path (str) – Directory path where plots will be saved as HTML files
feature_labels (Optional[List[str]]) – Optional list of labels for each feature column
- Returns:
None
- Return type:
None
- Example:
>>> import pandas as pd
>>> actual_df = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})
>>> reconstructed_df = pd.DataFrame({"feature1": [1.1, 1.9, 3.1], "feature2": [3.9, 5.2, 5.8]})
>>> plot_actual_and_reconstructed(actual_df, reconstructed_df, "./plots", ["sensor1", "sensor2"])
- mango_autoencoder.utils.plots.plot_reconstruction_iterations(original_data: numpy.ndarray, reconstructed_iterations: dict, save_path: str, feature_labels: List[str] | None = None, id_iter: str | None = None)¶
Plot the original data with missing values and iterative reconstruction progress.
Creates detailed visualizations showing the progression of NaN value reconstruction across multiple iterations. Displays original data (with NaNs), intermediate reconstruction iterations, and final reconstruction results for each feature.
- Parameters:
original_data (np.ndarray) – 2D numpy array (features x timesteps) with the original data including NaNs
reconstructed_iterations (dict) – Dictionary mapping iteration numbers to 2D numpy arrays containing reconstructions
save_path (str) – Directory path where plots will be saved as HTML files
feature_labels (Optional[List[str]]) – Optional list of labels for each feature
id_iter (Optional[str]) – Optional identifier to distinguish plots when working with multiple datasets
- Returns:
None
- Return type:
None
- Example:
>>> import numpy as np
>>> original = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> iterations = {1: np.array([[1, 2, 3], [4, 5, 6]]), 2: np.array([[1, 2.1, 3], [4, 5, 5.9]])}
>>> plot_reconstruction_iterations(original, iterations, "./plots", ["feature1", "feature2"])
- mango_autoencoder.utils.plots.create_error_analysis_dashboard(error_df: pandas.DataFrame, save_path: str | None = None, filename: str = 'error_analysis_dashboard.html', show: bool = True, height: int = 1000, width: int = 1200, template: str = 'plotly_white') plotly.graph_objects.Figure ¶
Create an interactive dashboard for comprehensive error analysis with multiple plots.
Generates a multi-panel dashboard containing bar plots of mean errors by feature, box plots showing error distributions, and correlation heatmaps between features. Provides comprehensive visualization for understanding reconstruction error patterns.
- Parameters:
error_df (pd.DataFrame) – DataFrame containing reconstruction error data (samples x features)
save_path (Optional[str]) – Optional directory path to save the dashboard HTML file
filename (str) – Name of the HTML file to save
show (bool) – Whether to display the dashboard in browser
height (int) – Height of the figure in pixels
width (int) – Width of the figure in pixels
template (str) – Plotly template for styling (e.g., ‘plotly_white’, ‘ggplot2’)
- Returns:
Plotly figure object containing the dashboard
- Return type:
go.Figure
- Example:
>>> import pandas as pd
>>> error_data = pd.DataFrame({"feature1": [0.1, 0.2, 0.3], "feature2": [0.2, 0.1, 0.4]})
>>> dashboard = create_error_analysis_dashboard(error_data, save_path="./plots")
- mango_autoencoder.utils.plots.boxplot_reconstruction_error(reconstruction_error_df: pandas.DataFrame, save_path: str | None = None, filename: str = 'reconstruction_error_boxplot.html', show: bool = False, height: int | None = None, width: int | None = None, template: str = 'plotly_white', xaxis_tickangle: int = -45, color_palette: List[str] | None = None) plotly.graph_objects.Figure ¶
Generate and optionally save a boxplot for reconstruction error analysis using Plotly.
Creates interactive boxplots showing the distribution of reconstruction errors across features and optionally across dataset splits (train/validation/test). Provides statistical insights into error patterns and outliers.
- Parameters:
reconstruction_error_df (pd.DataFrame) – DataFrame with reconstruction error values (samples x features)
save_path (Optional[str]) – Optional directory path to save the plot HTML file
filename (str) – Name of the HTML file to save
show (bool) – Whether to display the plot in browser
height (Optional[int]) – Height of the figure in pixels (None for auto-sizing)
width (Optional[int]) – Width of the figure in pixels (None for auto-sizing)
template (str) – Plotly template for styling (e.g., ‘plotly_white’, ‘ggplot2’)
xaxis_tickangle (int) – Angle for x-axis labels in degrees
color_palette (Optional[List[str]]) – Optional list of colors for data splits
- Returns:
Plotly figure object containing the boxplot
- Return type:
go.Figure
- Example:
>>> import pandas as pd
>>> error_df = pd.DataFrame({
...     "feature1": [0.1, 0.2, 0.3, 0.4],
...     "feature2": [0.2, 0.1, 0.4, 0.3],
...     "data_split": ["train", "train", "val", "val"]
... })
>>> boxplot = boxplot_reconstruction_error(error_df, save_path="./plots")
- mango_autoencoder.utils.plots.create_actual_vs_reconstructed_plot(df_actual: pandas.DataFrame, df_reconstructed: pandas.DataFrame, save_path: str | None = None, filename: str = 'actual_vs_reconstructed.html', show: bool = True, height: int | None = None, width: int | None = None, template: str = 'plotly_white') plotly.graph_objects.Figure ¶
Create an interactive plot comparing actual and reconstructed values.
Generates a comprehensive line plot showing the comparison between original data and autoencoder reconstructions across all features. Combines data with type indicators for clear visualization of reconstruction quality.
- Parameters:
df_actual (pd.DataFrame) – DataFrame containing actual/original values
df_reconstructed (pd.DataFrame) – DataFrame containing reconstructed values from autoencoder
save_path (Optional[str]) – Optional directory path to save the plot HTML file
filename (str) – Name of the HTML file to save
show (bool) – Whether to display the plot in browser
height (Optional[int]) – Height of the figure in pixels
width (Optional[int]) – Width of the figure in pixels
template (str) – Plotly template for styling (e.g., ‘plotly_white’, ‘ggplot2’)
- Returns:
Plotly figure object containing the comparison plot
- Return type:
go.Figure
- Example:
>>> import pandas as pd
>>> actual_df = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})
>>> reconstructed_df = pd.DataFrame({"feature1": [1.1, 1.9, 3.1], "feature2": [3.9, 5.2, 5.8]})
>>> plot = create_actual_vs_reconstructed_plot(actual_df, reconstructed_df, save_path="./plots")
- mango_autoencoder.utils.plots.plot_corrected_data(actual_data_df: pandas.DataFrame, autoencoder_output_df: pandas.DataFrame, anomaly_mask: pandas.DataFrame, save_path: str | None = None, filename: str = 'corrected_data_plot.html', show: bool = False, height: int | None = None, width: int | None = None, template: str = 'plotly_white', color_palette: List[str] | None = None) plotly.graph_objects.Figure ¶
Plot original sensor data, autoencoder reconstruction, and corrected (replaced) anomaly points.
Creates comprehensive visualizations showing the data correction process by displaying original data, autoencoder reconstructions, and specifically highlighting points that were identified as anomalies and replaced with reconstructed values.
- Parameters:
actual_data_df (pd.DataFrame) – Original sensor data including the context window
autoencoder_output_df (pd.DataFrame) – Autoencoder output after context window removal
anomaly_mask (pd.DataFrame) – DataFrame of boolean values indicating where values were categorized as anomalies
save_path (Optional[str]) – Optional directory path to save the plot as an HTML file
filename (str) – Filename to use if saving the plot
show (bool) – Whether to display the plot in a browser window
height (Optional[int]) – Optional height of the figure in pixels
width (Optional[int]) – Optional width of the figure in pixels
template (str) – Plotly template for figure styling (e.g., ‘plotly_white’, ‘ggplot2’)
color_palette (Optional[List[str]]) – Optional list of colors to use for plotting each sensor
- Returns:
Plotly figure object with actual, reconstructed, and corrected data traces
- Return type:
go.Figure
- Raises:
ValueError – If input DataFrames have mismatched lengths, columns, or invalid structure
Exception – If an error occurs during plotting or file saving
- Example:
>>> import pandas as pd >>> actual_df = pd.DataFrame({"sensor1": [1, 2, 3, 4], "sensor2": [5, 6, 7, 8]}) >>> reconstructed_df = pd.DataFrame({"sensor1": [1.1, 1.9, 3.1, 3.9], "sensor2": [5.2, 5.8, 7.1, 7.9]}) >>> mask_df = pd.DataFrame({"sensor1": [False, True, False, True], "sensor2": [True, False, True, False]}) >>> plot = plot_corrected_data(actual_df, reconstructed_df, mask_df, save_path="./plots")
- mango_autoencoder.utils.plots.plot_anomaly_proportions(anomaly_mask: pandas.DataFrame, save_path: str | None = None, filename: str = 'anomaly_proportions_plot.html', show: bool = False, height: int | None = None, width: int | None = None, template: str = 'plotly_white', color_palette: List[str] | None = None, xaxis_tickangle: int = -45) plotly.graph_objects.Figure ¶
Generate a bar chart showing anomaly proportions by sensor and data split.
Creates interactive bar charts displaying the proportion of anomalies for each sensor across different dataset splits (train/validation/test). Anomaly proportions are calculated as the number of anomalies divided by the total number of observations for each sensor, providing insights into data quality patterns.
- Parameters:
anomaly_mask (pd.DataFrame) – DataFrame containing boolean anomaly mask (True = anomaly) and a ‘data_split’ column
save_path (Optional[str]) – Optional directory path to save the output HTML plot
filename (str) – Filename for the saved plot (HTML format)
show (bool) – Whether to display the plot interactively in browser
height (Optional[int]) – Optional height of the figure in pixels
width (Optional[int]) – Optional width of the figure in pixels
template (str) – Plotly layout template to use (e.g., ‘plotly_white’, ‘ggplot2’)
color_palette (Optional[List[str]]) – Optional list of color hex codes or names to use per data split
xaxis_tickangle (int) – Angle for x-axis labels in degrees
- Returns:
Plotly Figure object containing the bar chart
- Return type:
go.Figure
- Raises:
ValueError – If required columns are missing or data is invalid
Exception – If plot creation or file saving fails
- Example:
>>> import pandas as pd
>>> mask_df = pd.DataFrame({
...     "sensor1": [True, False, True, False],
...     "sensor2": [False, True, False, True],
...     "data_split": ["train", "train", "val", "val"]
... })
>>> plot = plot_anomaly_proportions(mask_df, save_path="./plots")
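The plotted proportions reduce to a grouped boolean mean (a sketch):
proportions = mask_df.groupby("data_split").mean()
# train: sensor1 0.5, sensor2 0.5; val: sensor1 0.5, sensor2 0.5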
Sequences Module¶
The sequences module provides functions for handling time series sequences and transformations.
- mango_autoencoder.utils.sequences.time_series_to_sequence(data: numpy.ndarray | pandas.DataFrame | polars.DataFrame, context_window: int, val_data: numpy.ndarray | pandas.DataFrame | polars.DataFrame | None = None, test_data: numpy.ndarray | pandas.DataFrame | polars.DataFrame | None = None) numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] ¶
Convert time series data into sequences of fixed length for RNN-based models.
Transforms time series data into overlapping sequences suitable for training recurrent neural networks. Handles both single dataset and multiple dataset scenarios with proper temporal continuity between train/validation/test splits.
This function handles two main cases:
1. Single dataset: converts a single time series into sequences.
2. Multiple datasets: converts train, validation, and test datasets into sequences, ensuring continuity between splits by prepending the last context_window - 1 rows of the previous split to the next one.
- Parameters:
data (Union[np.ndarray, pd.DataFrame, pl.DataFrame]) – Time series data (training data in case of multiple datasets)
context_window (int) – Length of each time window/sequence
val_data (Optional[Union[np.ndarray, pd.DataFrame, pl.DataFrame]]) – Validation dataset (optional, required for multiple datasets)
test_data (Optional[Union[np.ndarray, pd.DataFrame, pl.DataFrame]]) – Test dataset (optional, required for multiple datasets)
- Returns:
Either:
- A single array of shape (n_sequences, context_window, n_features) for a single dataset
- A tuple of three arrays (train_sequences, val_sequences, test_sequences) for multiple datasets
- Return type:
Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]]
- Raises:
ValueError – If inputs are not of valid types, context_window is invalid, or dataset lengths are insufficient
- Example:
>>> import numpy as np
>>> # Single dataset case
>>> data = np.random.randn(100, 5)  # 100 time steps, 5 features
>>> sequences = time_series_to_sequence(data, 10)
>>> print(f"Single dataset sequences shape: {sequences.shape}")
Single dataset sequences shape: (91, 10, 5)
>>> # Multiple datasets case
>>> train = np.random.randn(70, 5)
>>> val = np.random.randn(20, 5)
>>> test = np.random.randn(10, 5)
>>> train_seq, val_seq, test_seq = time_series_to_sequence(train, 10, val, test)
>>> print(f"Train: {train_seq.shape}, Val: {val_seq.shape}, Test: {test_seq.shape}")
Train: (61, 10, 5), Val: (20, 10, 5), Test: (10, 10, 5)
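The single-dataset case is an ordinary stride-1 sliding window, which matches the 91 = 100 - 10 + 1 sequence count above. A sketch using NumPy (the actual implementation may build windows differently):
from numpy.lib.stride_tricks import sliding_window_view

# (100, 5) -> (91, 10, 5): one window per starting index
sequences = sliding_window_view(data, window_shape=10, axis=0).transpose(0, 2, 1)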