AutoEncoder Utils

The utils package provides utility functions for data processing, visualization, and sequence handling in the AutoEncoder module.

Processing Module

The processing module contains functions for data preprocessing, normalization, and transformation.

mango_autoencoder.utils.processing.abs_mean(data: numpy.ndarray | pandas.Series | polars.Series) → numpy.floating | float

Calculate the mean of absolute values in the data.

Computes the arithmetic mean of the absolute values of all elements in the input data. Supports NumPy arrays, Pandas Series, and Polars Series.

Parameters:

data (Union[np.ndarray, pd.Series, pl.Series]) – Input data for which to calculate absolute mean

Returns:

Mean of absolute values

Return type:

Union[np.floating, float]

Raises:

TypeError – If data type is not supported

Example:
>>> import numpy as np
>>> import pandas as pd
>>> data = np.array([-2, -1, 0, 1, 2])
>>> abs_mean(data)
1.2
>>> series = pd.Series([-5, 3, -1, 4])
>>> abs_mean(series)
3.25
mango_autoencoder.utils.processing.reintroduce_nans(self, df: pandas.DataFrame, id: str) → pandas.DataFrame

Reintroduce NaN values back into the dataset based on stored coordinates.

When preparing datasets for autoencoder training, NaN values are typically removed. This function restores the original NaN values to their correct positions using the stored NaN coordinates for the specified dataset ID.

Parameters:
  • df (pd.DataFrame) – Data into which to reintroduce NaNs

  • id (str) – Identifier of the dataset (“global” when there is only one dataset)

Returns:

Data with reintroduced NaNs in their original positions

Return type:

pd.DataFrame

Raises:

ValueError – If “id” not in self._nan_coordinates

Example:
>>> # Assuming self._nan_coordinates contains stored NaN positions
>>> df_clean = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})
>>> df_with_nans = reintroduce_nans(df_clean, "dataset_1")
>>> # NaN values are restored based on stored coordinates
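
A minimal sketch of the underlying idea, assuming NaN positions are stored per dataset ID as (row, column) pairs; the storage format and the nan_coordinates name below are illustrative, not the library's internal representation:

>>> import numpy as np
>>> nan_coordinates = {"dataset_1": [(0, "feature2"), (2, "feature1")]}
>>> restored = df_clean.astype(float)
>>> for row, col in nan_coordinates["dataset_1"]:
...     restored.loc[row, col] = np.nan
>>> restored
   feature1  feature2
0       1.0       NaN
1       2.0       5.0
2       NaN       6.0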
mango_autoencoder.utils.processing.id_pivot(df: pandas.DataFrame, id: str) → pandas.DataFrame

Select subset of data based on ID and pivot to time-series format.

Extracts data for a specific ID and transforms it from long format (with feature, time_step, value columns) to wide format with time_step as rows and features as columns, preserving the original feature order.

Parameters:
  • df (pd.DataFrame) – Data in long format with columns: id, feature, time_step, value, data_split

  • id (str) – Identifier of the dataset (“global” when there is only one dataset)

Returns:

Data subset pivoted to wide format with time_step as index and features as columns

Return type:

pd.DataFrame

Raises:

ValueError – If df does not have required columns or no data found for ID

Example:
>>> df_long = pd.DataFrame({
...     "id": ["A", "A", "A", "A"],
...     "feature": ["temp", "humidity", "temp", "humidity"],
...     "time_step": [0, 0, 1, 1],
...     "value": [25.5, 60.0, 26.0, 58.0],
...     "data_split": ["train", "train", "train", "train"]
... })
>>> df_pivoted = id_pivot(df_long, "A")
>>> # Result: time_step as index, temp and humidity as columns
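
The same transformation can be expressed directly with pandas; a minimal sketch (not the library implementation) that also restores the original feature order, since pivot sorts columns alphabetically:

>>> subset = df_long[df_long["id"] == "A"]
>>> wide = subset.pivot(index="time_step", columns="feature", values="value")
>>> wide[subset["feature"].unique()]
feature    temp  humidity
time_step
0          25.5      60.0
1          26.0      58.0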
mango_autoencoder.utils.processing.save_csv(data: pandas.DataFrame, save_path: str, filename: str, save_index: bool = False, decimals: int = 4, compression: str = 'infer', logger_msg: str = 'standard') → None

Save a DataFrame as a CSV file with configurable formatting options.

Saves a pandas DataFrame to a CSV file with options for index inclusion, decimal precision, and compression. Creates the directory if it doesn’t exist and provides logging feedback on the save operation.

Parameters:
  • data (pd.DataFrame) – DataFrame to save to CSV

  • save_path (str) – Directory path where the CSV file will be saved

  • filename (str) – Name of the CSV file (must end with .csv or .csv.zip)

  • save_index (bool) – Whether to include the DataFrame index in the CSV

  • decimals (int) – Number of decimal places for floating point numbers

  • compression (str) – Type of compression to use (‘infer’, ‘gzip’, ‘bz2’, etc.)

  • logger_msg (str) – Custom logger message or ‘standard’ for default message

Returns:

None

Return type:

None

Raises:
  • ValueError – If filename doesn’t end with .csv or .csv.zip

  • Exception – If file saving fails

Example:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1.123456, 2.789012], "B": [3.456789, 4.012345]})
>>> save_csv(df, "./output", "data.csv", decimals=2)
>>> # Saves data.csv with 2 decimal places in ./output/ directory
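
A minimal sketch of the equivalent pandas call, assuming rounding is applied before writing (illustrative only, not the library implementation):

>>> import os
>>> os.makedirs("./output", exist_ok=True)
>>> df.round(2).to_csv(os.path.join("./output", "data.csv"), index=False, compression="infer")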
mango_autoencoder.utils.processing.time_series_split(data: numpy.ndarray, train_size: float, val_size: float, test_size: float) → Tuple[numpy.ndarray, numpy.ndarray | None, numpy.ndarray | None]

Split time series data into training, validation, and test sets sequentially.

Performs a sequential split of time series data maintaining temporal order, which is crucial for time series analysis. The data is split according to the specified proportions, with training data coming first, followed by validation and test data.

Parameters:
  • data (np.ndarray) – Time series data array to split (samples x features)

  • train_size (float) – Proportion of the dataset for training (0.0 to 1.0)

  • val_size (float) – Proportion of the dataset for validation (0.0 to 1.0)

  • test_size (float) – Proportion of the dataset for testing (0.0 to 1.0)

Returns:

Tuple containing (training_data, validation_data, test_data)

Return type:

Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]

Raises:

ValueError – If sizes are None or their sum is not 1.0

Example:
>>> import numpy as np
>>> data = np.random.randn(1000, 5)  # 1000 samples, 5 features
>>> train, val, test = time_series_split(data, 0.7, 0.2, 0.1)
>>> print(f"Train: {train.shape}, Val: {val.shape}, Test: {test.shape}")
Train: (700, 5), Val: (200, 5), Test: (100, 5)
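
A minimal sketch of the sequential split under the documented constraint that the three proportions sum to 1.0 (illustrative only, not the library implementation):

>>> n = len(data)
>>> n_train, n_val = int(n * 0.7), int(n * 0.2)
>>> train, val, test = data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
>>> train.shape, val.shape, test.shape
((700, 5), (200, 5), (100, 5))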
mango_autoencoder.utils.processing.convert_data_to_numpy(data: Any) → Tuple[numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], List[str]]

Convert various data formats to numpy arrays for autoencoder processing.

Handles conversion of pandas DataFrames, polars DataFrames, numpy arrays, and tuples of these types to numpy format. Extracts feature names when available and ensures consistency across tuple elements.

Parameters:

data (Any) – Input data that can be pandas DataFrame, polars DataFrame, numpy array, or tuple of these types

Returns:

Tuple containing (converted_data, feature_names)

Return type:

Tuple[Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]], List[str]]

Raises:

ValueError – If data type is not supported or tuple elements have different feature names

Example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> array, features = convert_data_to_numpy(df)
>>> print(f"Array shape: {array.shape}, Features: {features}")
Array shape: (3, 2), Features: ['A', 'B']
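
A minimal sketch of the conversion branches for a single (non-tuple) input; the fallback feature names for plain numpy arrays are an assumption for illustration:

>>> import polars as pl
>>> def to_numpy_sketch(data):
...     if isinstance(data, pd.DataFrame):
...         return data.to_numpy(), list(data.columns)
...     if isinstance(data, pl.DataFrame):
...         return data.to_numpy(), data.columns
...     if isinstance(data, np.ndarray):
...         return data, [f"feature_{i}" for i in range(data.shape[1])]
...     raise ValueError(f"Unsupported data type: {type(data)}")
>>> to_numpy_sketch(df)[1]
['A', 'B']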
mango_autoencoder.utils.processing.denormalize_data(data: numpy.ndarray, normalization_method: str | None, min_x: numpy.ndarray | None = None, max_x: numpy.ndarray | None = None, mean_: numpy.ndarray | None = None, std_: numpy.ndarray | None = None) → numpy.ndarray

Denormalize data back to its original scale using stored normalization parameters.

Reverses the normalization process applied during training, converting normalized data back to its original scale. Supports minmax and zscore normalization methods with their respective parameter sets.

Parameters:
  • data (np.ndarray) – Normalized data to denormalize (samples x features)

  • normalization_method (Optional[str]) – Method used for original normalization (‘minmax’, ‘zscore’, or None)

  • min_x (Optional[np.ndarray]) – Minimum values used for minmax normalization (per feature)

  • max_x (Optional[np.ndarray]) – Maximum values used for minmax normalization (per feature)

  • mean_ (Optional[np.ndarray]) – Mean values used for zscore normalization (per feature)

  • std_ (Optional[np.ndarray]) – Standard deviation values used for zscore normalization (per feature)

Returns:

Denormalized data in original scale

Return type:

np.ndarray

Raises:
  • ValueError – If normalization method is invalid

  • TypeError – If required parameters are not numpy arrays

Example:
>>> import numpy as np
>>> normalized_data = np.array([[0.0, 0.5], [1.0, 1.0]])  # Minmax normalized
>>> min_vals = np.array([10, 20])
>>> max_vals = np.array([30, 40])
>>> original_data = denormalize_data(normalized_data, "minmax", min_vals, max_vals)
>>> print(original_data)
[[10. 30.]
 [30. 40.]]
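
Both supported methods invert the standard transforms; a short worked sketch (the zscore parameters below are made-up values for illustration):

>>> # minmax:  x_norm = (x - min) / (max - min)  =>  x = x_norm * (max - min) + min
>>> normalized_data * (max_vals - min_vals) + min_vals
array([[10., 30.],
       [30., 40.]])
>>> # zscore:  x_norm = (x - mean) / std  =>  x = x_norm * std + mean
>>> normalized_data * np.array([2.0, 4.0]) + np.array([5.0, 6.0])
array([[ 5.,  8.],
       [ 7., 10.]])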
mango_autoencoder.utils.processing.normalize_data_for_training(x_train: numpy.ndarray, x_val: numpy.ndarray, x_test: numpy.ndarray, normalization_method: str | None) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, Dict[str, Any]]

Normalize training, validation, and test data using the specified method.

Applies normalization to all three datasets using parameters computed from the training data only, which prevents data leakage from the validation and test sets. Constant columns are handled safely to avoid division by zero.

Parameters:
  • x_train (np.ndarray) – Training data to normalize (samples x features)

  • x_val (np.ndarray) – Validation data to normalize (samples x features)

  • x_test (np.ndarray) – Test data to normalize (samples x features)

  • normalization_method (Optional[str]) – Method to use (‘minmax’, ‘zscore’, or None)

Returns:

Tuple containing (normalized_train, normalized_val, normalized_test, normalization_params)

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray, Dict[str, Any]]

Raises:

ValueError – If normalization method is invalid

Example:
>>> import numpy as np
>>> train = np.array([[1, 10], [2, 20], [3, 30]])
>>> val = np.array([[4, 40], [5, 50]])
>>> test = np.array([[6, 60]])
>>> norm_train, norm_val, norm_test, params = normalize_data_for_training(train, val, test, "minmax")
>>> print(f"Normalized train shape: {norm_train.shape}")
Normalized train shape: (3, 2)
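
A minimal sketch of the fit-on-train-only logic for minmax, including safe handling of constant columns (illustrative only, not the library implementation):

>>> min_x, max_x = train.min(axis=0), train.max(axis=0)
>>> scale = np.where(max_x - min_x == 0, 1.0, max_x - min_x)  # avoid division by zero
>>> norm_train = (train - min_x) / scale
>>> norm_val = (val - min_x) / scale    # validation and test reuse training statistics
>>> norm_test = (test - min_x) / scale
>>> norm_train
array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])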
mango_autoencoder.utils.processing.normalize_data(data: numpy.ndarray, normalization_method: str | None, normalization_values: Dict[str, Any]) → Tuple[numpy.ndarray, Dict[str, numpy.ndarray]]

Normalize data using provided normalization parameters or compute new ones.

Normalizes data using either pre-computed normalization parameters or computes new parameters from the data itself. Supports minmax and zscore normalization methods with safe handling of constant columns.

Parameters:
  • data (np.ndarray) – Data to normalize (samples x features)

  • normalization_method (Optional[str]) – Method to use (‘minmax’, ‘zscore’, or None)

  • normalization_values (Dict[str, Any]) – Dictionary containing normalization parameters

Returns:

Tuple containing (normalized_data, normalization_parameters)

Return type:

Tuple[np.ndarray, Dict[str, np.ndarray]]

Raises:

ValueError – If normalization method is invalid or required parameters are missing

Example:
>>> import numpy as np
>>> data = np.array([[1, 10], [2, 20], [3, 30]])
>>> norm_data, params = normalize_data(data, "minmax", {})
>>> print(f"Normalized data shape: {norm_data.shape}")
Normalized data shape: (3, 2)
mango_autoencoder.utils.processing.normalize_data_for_prediction(data: numpy.ndarray, normalization_method: str | None, min_x: numpy.ndarray | None = None, max_x: numpy.ndarray | None = None, mean_: numpy.ndarray | None = None, std_: numpy.ndarray | None = None, feature_to_check: int | List[int] | None = None, feature_to_check_filter: bool = False) → numpy.ndarray

Normalize new data for prediction using stored normalization parameters.

Normalizes new data for prediction using either pre-computed normalization parameters from training or computes new parameters from the input data. Supports feature filtering for selective normalization of specific features.

Parameters:
  • data (np.ndarray) – New data to normalize for prediction (samples x features)

  • normalization_method (Optional[str]) – Method to use for normalization (‘minmax’, ‘zscore’, or None)

  • min_x (Optional[np.ndarray]) – Minimum values for minmax normalization (per feature)

  • max_x (Optional[np.ndarray]) – Maximum values for minmax normalization (per feature)

  • mean_ (Optional[np.ndarray]) – Mean values for zscore normalization (per feature)

  • std_ (Optional[np.ndarray]) – Standard deviation values for zscore normalization (per feature)

  • feature_to_check (Optional[Union[int, List[int]]]) – Feature indices to apply normalization to

  • feature_to_check_filter (bool) – Whether to filter features before normalization

Returns:

Normalized data ready for prediction

Return type:

np.ndarray

Raises:

ValueError – If normalization method is invalid

Example:
>>> import numpy as np
>>> new_data = np.array([[4, 40], [5, 50]])
>>> min_vals = np.array([1, 10])
>>> max_vals = np.array([3, 30])
>>> norm_data = normalize_data_for_prediction(new_data, "minmax", min_vals, max_vals)
>>> print(f"Normalized data shape: {norm_data.shape}")
Normalized data shape: (2, 2)
mango_autoencoder.utils.processing.apply_padding(data: numpy.ndarray, reconstructed: numpy.ndarray, context_window: int, time_step_to_check: int | List[int]) → numpy.ndarray

Apply padding to reconstructed data to match original data shape.

Handles the padding of reconstructed data to match the original data shape, taking into account the context window and the specific time step being predicted. This is essential for maintaining temporal alignment in time series reconstruction.

Parameters:
  • data (np.ndarray) – Original dataset with shape (num_samples, num_features)

  • reconstructed (np.ndarray) – Predicted values with shape (num_samples - context_window, num_features)

  • context_window (int) – Size of the context window used for prediction

  • time_step_to_check (Union[int, List[int]]) – Time step to predict within the window (0 to context_window-1)

Returns:

Padded reconstructed dataset with shape matching the original data

Return type:

np.ndarray

Raises:

ValueError – If time_step_to_check is not within valid range

Example:
>>> import numpy as np
>>> original = np.random.randn(100, 5)  # 100 samples, 5 features
>>> reconstructed = np.random.randn(90, 5)  # 90 samples after context window
>>> padded = apply_padding(original, reconstructed, 10, 0)
>>> print(f"Padded shape: {padded.shape}")
Padded shape: (100, 5)
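
One way to restore the original shape is to pad the reconstruction with placeholder rows split around it according to time_step_to_check; the NaN fill values below are an assumption for illustration, not necessarily what the library inserts:

>>> pad_before = 0                      # time_step_to_check in this example
>>> pad_after = original.shape[0] - reconstructed.shape[0] - pad_before
>>> padded_sketch = np.vstack([
...     np.full((pad_before, original.shape[1]), np.nan),
...     reconstructed,
...     np.full((pad_after, original.shape[1]), np.nan),
... ])
>>> padded_sketch.shape
(100, 5)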
mango_autoencoder.utils.processing.handle_id_columns(data: numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], id_columns: str | int | List[str] | List[int] | None, features_name: List[str] | None, context_window: int | None) → Tuple[numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], numpy.ndarray | None, Dict[str, numpy.ndarray], List[int]]

Handle ID column processing for data grouping and validation.

Processes data to extract ID columns for grouping while ensuring each ID has sufficient samples for the specified context window. Removes ID columns from the data and creates mappings for grouped processing.

Parameters:
  • data (Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]]) – Data to process, can be single array or tuple of arrays (train, val, test)

  • id_columns (Union[str, int, List[str], List[int], None]) – Column(s) to use for grouping data by IDs

  • features_name (Optional[List[str]]) – List of feature names for column identification

  • context_window (Optional[int]) – Context window size for the model

Returns:

Tuple containing:

  • Processed data (with ID columns removed)

  • ID mapping array

  • Dictionary with grouped data by ID

  • List of column indices that were ID columns

Return type:

Tuple[Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]], Optional[np.ndarray], Dict[str, np.ndarray], List[int]]

Raises:

ValueError – If id_columns format is invalid or minimum samples per ID is less than context_window

Example:
>>> import numpy as np
>>> data = np.array([[1, 2, 3], [1, 4, 5], [2, 6, 7]])  # First column is ID
>>> processed, ids, grouped, indices = handle_id_columns(data, 0, ["id", "feat1", "feat2"], 2)
>>> print(f"Processed shape: {processed.shape}, IDs: {ids}")
Processed shape: (3, 2), IDs: ['1' '1' '2']
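
A minimal sketch of the grouping step described above (illustrative only, not the library implementation):

>>> ids_col = data[:, 0].astype(str)
>>> feature_cols = np.delete(data, 0, axis=1)
>>> grouped_sketch = {uid: feature_cols[ids_col == uid] for uid in np.unique(ids_col)}
>>> {uid: rows.shape for uid, rows in grouped_sketch.items()}
{'1': (2, 2), '2': (1, 2)}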

Plots Module

The plots module provides visualization functions for model evaluation and analysis.

mango_autoencoder.utils.plots.plot_loss_history(train_loss: List[float], val_loss: List[float], save_path: str)

Plot training and validation loss history over epochs.

Creates an interactive Plotly line chart showing the progression of training and validation losses during model training. The plot is saved as an HTML file in the specified directory.

Parameters:
  • train_loss (List[float]) – Training loss values for each epoch

  • val_loss (List[float]) – Validation loss values for each epoch

  • save_path (str) – Directory path where the plot HTML file will be saved

Returns:

None

Return type:

None

Example:
>>> train_losses = [0.5, 0.3, 0.2, 0.15, 0.1]
>>> val_losses = [0.6, 0.4, 0.25, 0.18, 0.12]
>>> plot_loss_history(train_losses, val_losses, "./plots")
# Saves loss_history.html in ./plots/ directory
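
A rough sketch of the kind of figure produced, using Plotly directly; the trace names and output path below are illustrative, not the exact ones written by the function:

>>> import plotly.graph_objects as go
>>> fig = go.Figure()
>>> fig.add_trace(go.Scatter(y=train_losses, mode="lines", name="train loss"))
>>> fig.add_trace(go.Scatter(y=val_losses, mode="lines", name="validation loss"))
>>> fig.update_layout(xaxis_title="Epoch", yaxis_title="Loss")
>>> fig.write_html("./plots/loss_history.html")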
mango_autoencoder.utils.plots.plot_actual_and_reconstructed(df_actual: pandas.DataFrame, df_reconstructed: pandas.DataFrame, save_path: str, feature_labels: List[str] | None = None)

Plot actual vs reconstructed values for each feature and save to specified folder.

Creates comprehensive visualizations comparing original data with autoencoder reconstructions. Supports different data structures including ID-based data and dataset splits (train/validation/test). Generates multiple plot types including separate views, overlapped views, and combined feature plots.

Parameters:
  • df_actual (pd.DataFrame) – DataFrame containing actual/original values

  • df_reconstructed (pd.DataFrame) – DataFrame containing reconstructed values from autoencoder

  • save_path (str) – Directory path where plots will be saved as HTML files

  • feature_labels (Optional[List[str]]) – Optional list of labels for each feature column

Returns:

None

Return type:

None

Example:
>>> import pandas as pd
>>> actual_df = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})
>>> reconstructed_df = pd.DataFrame({"feature1": [1.1, 1.9, 3.1], "feature2": [3.9, 5.2, 5.8]})
>>> plot_actual_and_reconstructed(actual_df, reconstructed_df, "./plots", ["sensor1", "sensor2"])
mango_autoencoder.utils.plots.plot_reconstruction_iterations(original_data: numpy.ndarray, reconstructed_iterations: dict, save_path: str, feature_labels: List[str] | None = None, id_iter: str | None = None)

Plot the original data with missing values and iterative reconstruction progress.

Creates detailed visualizations showing the progression of NaN value reconstruction across multiple iterations. Displays original data (with NaNs), intermediate reconstruction iterations, and final reconstruction results for each feature.

Parameters:
  • original_data (np.ndarray) – 2D numpy array (features x timesteps) with the original data including NaNs

  • reconstructed_iterations (dict) – Dictionary mapping iteration numbers to 2D numpy arrays containing reconstructions

  • save_path (str) – Directory path where plots will be saved as HTML files

  • feature_labels (Optional[List[str]]) – Optional list of labels for each feature

  • id_iter (Optional[str]) – Optional identifier to distinguish plots when working with multiple datasets

Returns:

None

Return type:

None

Example:
>>> import numpy as np
>>> original = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> iterations = {1: np.array([[1, 2, 3], [4, 5, 6]]), 2: np.array([[1, 2.1, 3], [4, 5, 5.9]])}
>>> plot_reconstruction_iterations(original, iterations, "./plots", ["feature1", "feature2"])
mango_autoencoder.utils.plots.create_error_analysis_dashboard(error_df: pandas.DataFrame, save_path: str | None = None, filename: str = 'error_analysis_dashboard.html', show: bool = True, height: int = 1000, width: int = 1200, template: str = 'plotly_white') → plotly.graph_objects.Figure

Create an interactive dashboard for comprehensive error analysis with multiple plots.

Generates a multi-panel dashboard containing bar plots of mean errors by feature, box plots showing error distributions, and correlation heatmaps between features. Provides comprehensive visualization for understanding reconstruction error patterns.

Parameters:
  • error_df (pd.DataFrame) – DataFrame containing reconstruction error data (samples x features)

  • save_path (Optional[str]) – Optional directory path to save the dashboard HTML file

  • filename (str) – Name of the HTML file to save

  • show (bool) – Whether to display the dashboard in browser

  • height (int) – Height of the figure in pixels

  • width (int) – Width of the figure in pixels

  • template (str) – Plotly template for styling (e.g., ‘plotly_white’, ‘ggplot2’)

Returns:

Plotly figure object containing the dashboard

Return type:

go.Figure

Example:
>>> import pandas as pd
>>> error_data = pd.DataFrame({"feature1": [0.1, 0.2, 0.3], "feature2": [0.2, 0.1, 0.4]})
>>> dashboard = create_error_analysis_dashboard(error_data, save_path="./plots")
mango_autoencoder.utils.plots.boxplot_reconstruction_error(reconstruction_error_df: pandas.DataFrame, save_path: str | None = None, filename: str = 'reconstruction_error_boxplot.html', show: bool = False, height: int | None = None, width: int | None = None, template: str = 'plotly_white', xaxis_tickangle: int = -45, color_palette: List[str] | None = None) → plotly.graph_objects.Figure

Generate and optionally save a boxplot for reconstruction error analysis using Plotly.

Creates interactive boxplots showing the distribution of reconstruction errors across features and optionally across dataset splits (train/validation/test). Provides statistical insights into error patterns and outliers.

Parameters:
  • reconstruction_error_df (pd.DataFrame) – DataFrame with reconstruction error values (samples x features)

  • save_path (Optional[str]) – Optional directory path to save the plot HTML file

  • filename (str) – Name of the HTML file to save

  • show (bool) – Whether to display the plot in browser

  • height (Optional[int]) – Height of the figure in pixels (None for auto-sizing)

  • width (Optional[int]) – Width of the figure in pixels (None for auto-sizing)

  • template (str) – Plotly template for styling (e.g., ‘plotly_white’, ‘ggplot2’)

  • xaxis_tickangle (int) – Angle for x-axis labels in degrees

  • color_palette (Optional[List[str]]) – Optional list of colors for data splits

Returns:

Plotly figure object containing the boxplot

Return type:

go.Figure

Example:
>>> import pandas as pd
>>> error_df = pd.DataFrame({
...     "feature1": [0.1, 0.2, 0.3, 0.4],
...     "feature2": [0.2, 0.1, 0.4, 0.3],
...     "data_split": ["train", "train", "val", "val"]
... })
>>> boxplot = boxplot_reconstruction_error(error_df, save_path="./plots")
mango_autoencoder.utils.plots.create_actual_vs_reconstructed_plot(df_actual: pandas.DataFrame, df_reconstructed: pandas.DataFrame, save_path: str | None = None, filename: str = 'actual_vs_reconstructed.html', show: bool = True, height: int | None = None, width: int | None = None, template: str = 'plotly_white') → plotly.graph_objects.Figure

Create an interactive plot comparing actual and reconstructed values.

Generates a comprehensive line plot showing the comparison between original data and autoencoder reconstructions across all features. Combines data with type indicators for clear visualization of reconstruction quality.

Parameters:
  • df_actual (pd.DataFrame) – DataFrame containing actual/original values

  • df_reconstructed (pd.DataFrame) – DataFrame containing reconstructed values from autoencoder

  • save_path (Optional[str]) – Optional directory path to save the plot HTML file

  • filename (str) – Name of the HTML file to save

  • show (bool) – Whether to display the plot in browser

  • height (Optional[int]) – Height of the figure in pixels

  • width (Optional[int]) – Width of the figure in pixels

  • template (str) – Plotly template for styling (e.g., ‘plotly_white’, ‘ggplot2’)

Returns:

Plotly figure object containing the comparison plot

Return type:

go.Figure

Example:
>>> import pandas as pd
>>> actual_df = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})
>>> reconstructed_df = pd.DataFrame({"feature1": [1.1, 1.9, 3.1], "feature2": [3.9, 5.2, 5.8]})
>>> plot = create_actual_vs_reconstructed_plot(actual_df, reconstructed_df, save_path="./plots")
mango_autoencoder.utils.plots.plot_corrected_data(actual_data_df: pandas.DataFrame, autoencoder_output_df: pandas.DataFrame, anomaly_mask: pandas.DataFrame, save_path: str | None = None, filename: str = 'corrected_data_plot.html', show: bool = False, height: int | None = None, width: int | None = None, template: str = 'plotly_white', color_palette: List[str] | None = None) → plotly.graph_objects.Figure

Plot original sensor data, autoencoder reconstruction, and corrected (replaced) anomaly points.

Creates comprehensive visualizations showing the data correction process by displaying original data, autoencoder reconstructions, and specifically highlighting points that were identified as anomalies and replaced with reconstructed values.

Parameters:
  • actual_data_df (pd.DataFrame) – Original sensor data including the context window

  • autoencoder_output_df (pd.DataFrame) – Autoencoder output after context window removal

  • anomaly_mask (pd.DataFrame) – DataFrame of boolean values indicating where values were categorized as anomalies

  • save_path (Optional[str]) – Optional directory path to save the plot as an HTML file

  • filename (str) – Filename to use if saving the plot

  • show (bool) – Whether to display the plot in a browser window

  • height (Optional[int]) – Optional height of the figure in pixels

  • width (Optional[int]) – Optional width of the figure in pixels

  • template (str) – Plotly template for figure styling (e.g., ‘plotly_white’, ‘ggplot2’)

  • color_palette (Optional[List[str]]) – Optional list of colors to use for plotting each sensor

Returns:

Plotly figure object with actual, reconstructed, and corrected data traces

Return type:

go.Figure

Raises:
  • ValueError – If input DataFrames have mismatched lengths, columns, or invalid structure

  • Exception – If an error occurs during plotting or file saving

Example:
>>> import pandas as pd
>>> actual_df = pd.DataFrame({"sensor1": [1, 2, 3, 4], "sensor2": [5, 6, 7, 8]})
>>> reconstructed_df = pd.DataFrame({"sensor1": [1.1, 1.9, 3.1, 3.9], "sensor2": [5.2, 5.8, 7.1, 7.9]})
>>> mask_df = pd.DataFrame({"sensor1": [False, True, False, True], "sensor2": [True, False, True, False]})
>>> plot = plot_corrected_data(actual_df, reconstructed_df, mask_df, save_path="./plots")
mango_autoencoder.utils.plots.plot_anomaly_proportions(anomaly_mask: pandas.DataFrame, save_path: str | None = None, filename: str = 'anomaly_proportions_plot.html', show: bool = False, height: int | None = None, width: int | None = None, template: str = 'plotly_white', color_palette: List[str] | None = None, xaxis_tickangle: int = -45) → plotly.graph_objects.Figure

Generate a bar chart showing anomaly proportions by sensor and data split.

Creates interactive bar charts displaying the proportion of anomalies for each sensor across different dataset splits (train/validation/test). Anomaly proportions are calculated as the number of anomalies divided by the total number of observations for each sensor, providing insights into data quality patterns.

Parameters:
  • anomaly_mask (pd.DataFrame) – DataFrame containing boolean anomaly mask (True = anomaly) and a ‘data_split’ column

  • save_path (Optional[str]) – Optional directory path to save the output HTML plot

  • filename (str) – Filename for the saved plot (HTML format)

  • show (bool) – Whether to display the plot interactively in browser

  • height (Optional[int]) – Optional height of the figure in pixels

  • width (Optional[int]) – Optional width of the figure in pixels

  • template (str) – Plotly layout template to use (e.g., ‘plotly_white’, ‘ggplot2’)

  • color_palette (Optional[List[str]]) – Optional list of color hex codes or names to use per data split

  • xaxis_tickangle (int) – Angle for x-axis labels in degrees

Returns:

Plotly Figure object containing the bar chart

Return type:

go.Figure

Raises:
  • ValueError – If required columns are missing or data is invalid

  • Exception – If plot creation or file saving fails

Example:
>>> import pandas as pd
>>> mask_df = pd.DataFrame({
...     "sensor1": [True, False, True, False],
...     "sensor2": [False, True, False, True],
...     "data_split": ["train", "train", "val", "val"]
... })
>>> plot = plot_anomaly_proportions(mask_df, save_path="./plots")
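
The proportion described above reduces to a per-split mean of the boolean mask; a small worked sketch (not the library implementation):

>>> mask_df.groupby("data_split").mean()
            sensor1  sensor2
data_split
train           0.5      0.5
val             0.5      0.5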

Sequences Module

The sequences module provides functions for handling time series sequences and transformations.

mango_autoencoder.utils.sequences.time_series_to_sequence(data: numpy.ndarray | pandas.DataFrame | polars.DataFrame, context_window: int, val_data: numpy.ndarray | pandas.DataFrame | polars.DataFrame | None = None, test_data: numpy.ndarray | pandas.DataFrame | polars.DataFrame | None = None) → numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Convert time series data into sequences of fixed length for RNN-based models.

Transforms time series data into overlapping sequences suitable for training recurrent neural networks. Handles both single dataset and multiple dataset scenarios with proper temporal continuity between train/validation/test splits.

This function can handle two main cases:

  1. Single dataset: converts a single time series into sequences.

  2. Multiple datasets: converts train, validation, and test datasets into sequences, ensuring continuity between splits by prepending the last context_window - 1 rows of the previous split to the next one.

Parameters:
  • data (Union[np.ndarray, pd.DataFrame, pl.DataFrame]) – Time series data (training data in case of multiple datasets)

  • context_window (int) – Length of each time window/sequence

  • val_data (Optional[Union[np.ndarray, pd.DataFrame, pl.DataFrame]]) – Validation dataset (optional, required for multiple datasets)

  • test_data (Optional[Union[np.ndarray, pd.DataFrame, pl.DataFrame]]) – Test dataset (optional, required for multiple datasets)

Returns:

Either:

  • Single array of shape (n_sequences, context_window, n_features) for a single dataset

  • Tuple of three arrays (train_sequences, val_sequences, test_sequences) for multiple datasets

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]]

Raises:

ValueError – If inputs are not of valid types, context_window is invalid, or dataset lengths are insufficient

Example:
>>> import numpy as np
>>> # Single dataset case
>>> data = np.random.randn(100, 5)  # 100 time steps, 5 features
>>> sequences = time_series_to_sequence(data, 10)
>>> print(f"Single dataset sequences shape: {sequences.shape}")
Single dataset sequences shape: (91, 10, 5)
>>> # Multiple datasets case
>>> train = np.random.randn(70, 5)
>>> val = np.random.randn(20, 5)
>>> test = np.random.randn(10, 5)
>>> train_seq, val_seq, test_seq = time_series_to_sequence(train, 10, val, test)
>>> print(f"Train: {train_seq.shape}, Val: {val_seq.shape}, Test: {test_seq.shape}")
Train: (61, 10, 5), Val: (20, 10, 5), Test: (10, 10, 5)
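
The sliding-window construction and the split-continuity rule can be sketched directly (illustrative only, not the library implementation):

>>> cw = 10
>>> def to_sequences(arr):
...     return np.stack([arr[i:i + cw] for i in range(arr.shape[0] - cw + 1)])
>>> to_sequences(train).shape                                   # 70 rows -> 61 windows
(61, 10, 5)
>>> # Continuity: prepend the last cw - 1 rows of train before windowing val
>>> to_sequences(np.vstack([train[-(cw - 1):], val])).shape
(20, 10, 5)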