AutoEncoder API Reference

This page provides the complete API reference for the AutoEncoder class and its methods.

AutoEncoder Class

class mango_autoencoder.autoencoder.AutoEncoder

Bases: object

Autoencoder model for time series data reconstruction and anomaly detection.

An autoencoder is a neural network that learns to compress and reconstruct data. This implementation is designed specifically for time series data, allowing for sequence-based encoding and decoding with various architectures (LSTM, GRU, RNN).

The model is highly configurable but comes with sensible defaults for quick training and profiling. It supports data normalization, masking for missing values, and various training options, including early stopping and checkpointing.

Parameters:
  • TRAIN_SIZE (float) – Proportion of data used for training (default: 0.8)

  • VAL_SIZE (float) – Proportion of data used for validation (default: 0.1)

  • TEST_SIZE (float) – Proportion of data used for testing (default: 0.1)

Example:
>>> import numpy as np
>>> data = np.random.randn(1000, 5)  # toy data: 1000 samples, 5 features
>>> autoencoder = AutoEncoder()
>>> autoencoder.build_model(context_window=10, data=data,
...     time_step_to_check=[9], feature_to_check=[0],
...     hidden_dim=32, form="lstm")
>>> autoencoder.train()
>>> results = autoencoder.reconstruct()

TRAIN_SIZE = 0.8
VAL_SIZE = 0.1
TEST_SIZE = 0.1

property nan_coordinates: Dict

Get the NaN coordinates from the input data.

The dictionary maps each ID (one key per id_columns group, or "global" if ID columns are not used) to a NumPy array of shape (n, 2), where each row contains the [row_index, column_index] of a NaN found in the input data.

Returns:

Dictionary mapping id to array of NaN positions

Return type:

Dict
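
For reference, the (n, 2) layout described above matches what numpy.argwhere produces on a NaN mask; a minimal illustration (not the library's internal code):

>>> import numpy as np
>>> data = np.array([[1.0, np.nan], [np.nan, 4.0]])
>>> np.argwhere(np.isnan(data))  # rows of [row_index, column_index]
array([[0, 1],
       [1, 0]])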

property seed: int | None

Get the seed for reproducibility (all random generators).

Returns:

Seed for reproducibility

Return type:

Optional[int]

property save_path: str | None

Get the path where model artifacts will be saved.

Returns:

Path to save model or None if not set

Return type:

Optional[str]

property form: str

Get the encoder/decoder architecture type.

Returns:

Architecture type ('lstm', 'gru', 'rnn', or 'dense')

Return type:

str

property time_step_to_check: List[int] | None

Get the time step indices to check during reconstruction.

Returns:

Time step indices

Return type:

Optional[List[int]]

property context_window: int | None

Get the context window size.

Returns:

Context window size or None if not initialized

Return type:

Optional[int]
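
For intuition, the context window is what turns 2D tabular input of shape (samples, features) into 3D sequences of shape (windows, context_window, features), as described under build_and_train() below. A minimal sliding-window sketch (not the library's internal code):

>>> import numpy as np
>>> data = np.arange(12, dtype=float).reshape(6, 2)  # 6 samples, 2 features
>>> window = 3
>>> sequences = np.stack([data[i:i + window] for i in range(len(data) - window + 1)])
>>> sequences.shape  # (windows, context_window, features)
(4, 3, 2)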

property feature_weights: List[float] | None

Get the feature weights.

Returns:

Feature weights

Return type:

Optional[List[float]]

property features_name: List[str] | None

Get the feature names used for training the model.

Returns:

List of feature names or None if not set

Return type:

Optional[List[str]]

property data: numpy.ndarray | None

Get the data used for training the model.

Returns:

Data used for training the model

Return type:

Optional[np.ndarray]

property feature_to_check: List[int] | None

Get the feature index or indices to check during reconstruction.

Returns:

Feature index or indices to check

Return type:

Optional[List[int]]

property normalize: bool

Get the normalization flag.

Returns:

Normalization flag

Return type:

bool

property normalization_method: str | None

Get the normalization method.

Returns:

Normalization method

Return type:

Optional[str]

property hidden_dim: int | List[int] | None

Get the hidden dimensions.

Returns:

Hidden dimensions

Return type:

Optional[Union[int, List[int]]]

property bidirectional_encoder: bool

Get the bidirectional encoder flag.

Returns:

Bidirectional encoder flag

Return type:

bool

property bidirectional_decoder: bool

Get the bidirectional decoder flag.

Returns:

Bidirectional decoder flag

Return type:

bool

property activation_encoder: str | None

Get the activation function for the encoder.

Returns:

Activation function

Return type:

Optional[str]

property activation_decoder: str | None

Get the activation function for the decoder.

Returns:

Activation function

Return type:

Optional[str]

property verbose: bool

Get the verbose flag.

Returns:

Verbose flag

Return type:

bool

property train_size: float

Get the training set size proportion.

Returns:

Training set size (0.0-1.0)

Return type:

float

property val_size: float

Get the validation set size proportion.

Returns:

Validation set size (0.0-1.0)

Return type:

float

property test_size: float

Get the test set size proportion.

Returns:

Test set size (0.0-1.0)

Return type:

float

property num_layers: int

Get the number of layers in the encoder/decoder architecture.

Returns:

Number of layers (0 if hidden_dim is not set)

Return type:

int

property id_data: numpy.ndarray | None

Get the ID data used for grouping time series.

Returns:

ID data array or None if not using ID-based processing

Return type:

Optional[np.ndarray]

property id_data_dict: Dict[str, numpy.ndarray] | None

Get the ID data dictionary mapping IDs to their respective datasets.

Returns:

Dictionary mapping ID strings to numpy arrays or None if not using ID-based processing

Return type:

Optional[Dict[str, np.ndarray]]

property id_data_mask: numpy.ndarray | None

Get the ID data mask.

Returns:

ID data mask array or None if not set

Return type:

Optional[np.ndarray]

property id_data_dict_mask: Dict[str, numpy.ndarray] | None

Get the ID data mask dictionary.

Returns:

Dictionary mapping ID strings to mask arrays or None if not set

Return type:

Optional[Dict[str, np.ndarray]]

property id_columns_indices: List[int]

Get the indices of the ID columns used for grouping data.

Returns:

List of column indices that contain ID information

Return type:

List[int]

property use_mask: bool

Get the use_mask flag indicating whether masking is enabled for missing values.

Returns:

True if masking is enabled, False otherwise

Return type:

bool

property custom_mask: numpy.ndarray | None

Get the custom mask for missing values.

Returns:

Custom mask array or None if not set

Return type:

Optional[np.ndarray]
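
A common way to build such a mask is to mark observed entries with 1 and missing entries with 0, matching the mask convention used by masked_weighted_mse() below; a minimal sketch (assuming a NaN-based mask suits your data):

>>> import numpy as np
>>> data = np.array([[1.0, np.nan], [3.0, 4.0]])
>>> mask = (~np.isnan(data)).astype(float)  # 1 = observed, 0 = missing
>>> mask
array([[1., 0.],
       [1., 1.]])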

property mask_train: numpy.ndarray

Get the training mask for missing values.

Returns:

Training mask array

Return type:

np.ndarray

property mask_val: numpy.ndarray

Get the validation mask for missing values.

Returns:

Validation mask array

Return type:

np.ndarray

property mask_test: numpy.ndarray

Get the test mask for missing values.

Returns:

Test mask array

Return type:

np.ndarray

property imputer: DataImputer | None

Get the data imputer used for handling missing values.

Returns:

DataImputer instance or None if not set

Return type:

Optional[DataImputer]

property shuffle: bool

Get the shuffle flag indicating whether training data should be shuffled.

Returns:

True if shuffling is enabled, False otherwise

Return type:

bool

property shuffle_buffer_size: int | None

Get the shuffle buffer size for training data shuffling.

Returns:

Buffer size for shuffling or None if not set

Return type:

Optional[int]

property x_train_no_shuffle: numpy.ndarray

Get the unshuffled training data for reconstruction purposes.

Returns:

Training data without shuffling applied

Return type:

np.ndarray

property checkpoint: int

Get the checkpoint value.

Returns:

Number of epochs between checkpoints (0 to disable)

Return type:

int

property model_optimizer: keras.src.optimizers.Adam | keras.src.optimizers.SGD | keras.src.optimizers.RMSprop | keras.src.optimizers.Adagrad | keras.src.optimizers.Adadelta | keras.src.optimizers.Adamax | keras.src.optimizers.Nadam

Get the model’s optimizer.

Returns:

The optimizer instance

Return type:

Union[Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam]

classmethod load_from_pickle(path: str) → AutoEncoder

Load an AutoEncoder model from a pickle file.

Parameters:

path (str) – Path to the pickle file containing the saved model

Returns:

An instance of AutoEncoder with loaded parameters

Return type:

AutoEncoder

Raises:
  • FileNotFoundError – If the pickle file does not exist

  • ValueError – If the pickle file format is invalid

  • RuntimeError – If there’s an error loading the model

static create_folder_structure(folder_structure: List[str]) → None

Create a folder structure if it does not exist.

Parameters:

folder_structure (List[str]) – List of folder paths to create

Returns:

None

Return type:

None
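
Usage is straightforward; the folder names below are placeholders:

>>> AutoEncoder.create_folder_structure(["results/models", "results/plots"])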

build_model(context_window: int, data: numpy.ndarray | pandas.DataFrame | polars.DataFrame | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], time_step_to_check: int | List[int], feature_to_check: int | List[int], hidden_dim: int | List[int], form: str = 'lstm', bidirectional_encoder: bool = False, bidirectional_decoder: bool = False, activation_encoder: str | None = None, activation_decoder: str | None = None, normalize: bool = False, normalization_method: str = 'minmax', optimizer: str = 'adam', batch_size: int = 32, save_path: str | None = None, verbose: bool = False, feature_names: List[str] | None = None, feature_weights: List[float] | None = None, shuffle: bool = False, shuffle_buffer_size: int | None = None, use_mask: bool = False, custom_mask: Any = None, imputer: DataImputer | None = None, train_size: float = 0.8, val_size: float = 0.1, test_size: float = 0.1, id_columns: str | int | List[str] | List[int] | None = None, use_post_decoder_dense: bool = False, seed: int | None = 42) → None

Build the Autoencoder model with specified configuration.

Parameters:
  • context_window (int) – Size of the context window for sequence transformation

  • form (str) – Type of encoder architecture to use

  • data (Union[np.ndarray, pd.DataFrame, pl.DataFrame, Tuple[np.ndarray, np.ndarray, np.ndarray]]) – Input data for model training. Can be a single numpy array, pandas DataFrame, or polars DataFrame for an automatic train/val/test split, or a tuple of three arrays/DataFrames for predefined splits

  • time_step_to_check (Union[int, List[int]]) – Index or indices of time steps to check in prediction

  • feature_to_check (Union[int, List[int]]) – Index or indices of features to check in prediction

  • hidden_dim (Union[int, List[int]]) – Dimensions of hidden layers. Can be single int or list of ints

  • bidirectional_encoder (bool) – Whether to use bidirectional layers in encoder

  • bidirectional_decoder (bool) – Whether to use bidirectional layers in decoder

  • activation_encoder (Optional[str]) – Activation function for encoder layers

  • activation_decoder (Optional[str]) – Activation function for decoder layers

  • normalize (bool) – Whether to normalize input data

  • normalization_method (str) – Method for data normalization ('minmax' or 'zscore')

  • optimizer (str) – Name of optimizer to use for training

  • batch_size (int) – Size of batches for training

  • save_path (Optional[str]) – Directory path to save model checkpoints

  • verbose (bool) – Whether to print detailed information during training

  • feature_names (Optional[List[str]]) – Custom names for features

  • feature_weights (Optional[List[float]]) – Weights for each feature in loss calculation

  • shuffle (bool) – Whether to shuffle training data

  • shuffle_buffer_size (Optional[int]) – Size of buffer for shuffling

  • use_mask (bool) – Whether to use masking for missing values

  • custom_mask (Any) – Custom mask for missing values

  • imputer (Optional[DataImputer]) – Instance of DataImputer for handling missing values

  • train_size (float) – Proportion of data for training (0-1)

  • val_size (float) – Proportion of data for validation (0-1)

  • test_size (float) – Proportion of data for testing (0-1)

  • id_columns (Union[str, int, List[str], List[int], None]) – Column(s) to use for grouping data

  • use_post_decoder_dense (bool) – Whether to add dense layer after decoder

  • seed (Optional[int]) – Seed for reproducibility (sets all random generators)

Raises:
  • NotImplementedError – If form='dense' is specified

  • ValueError – If invalid parameters are provided

Returns:

None

Return type:

None

Example:
>>> autoencoder = AutoEncoder()
>>> data = np.random.randn(1000, 5)  # 1000 samples, 5 features
>>> autoencoder.build_model(
...     context_window=10,
...     data=data,
...     time_step_to_check=[5, 7],
...     feature_to_check=[0, 1, 2],
...     hidden_dim=[64, 32],
...     form="lstm",
...     normalize=True
... )

train(epochs: int = 100, checkpoint: int = 10, use_early_stopping: bool = True, patience: int = 10) → None

Train the model using the train and validation datasets and save the best model.

Parameters:
  • epochs (int) – Number of epochs to train the model

  • checkpoint (int) – Number of epochs to save a checkpoint (0 to disable)

  • use_early_stopping (bool) – Whether to use early stopping or not

  • patience (int) – Number of epochs without improvement to wait before stopping the training

Returns:

None

Return type:

None
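
A typical call after build_model(), with illustrative values:

>>> autoencoder.train(
...     epochs=200,
...     checkpoint=0,  # disable intermediate checkpoints
...     use_early_stopping=True,
...     patience=15,
... )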

reconstruct(save_path: str | None = None, reconstruction_diagnostic: bool = False) → pandas.DataFrame

Reconstruct the data using the trained model and plot the actual and reconstructed values.

Parameters:
  • save_path (Optional[str]) – Path to save reconstruction results, plots, and diagnostics

  • reconstruction_diagnostic (bool) – If True, shows and optionally saves reconstruction error data and plots

Returns:

Reconstruction results

Return type:

pd.DataFrame
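
For example, to reconstruct the train/validation/test data and save diagnostics (the path is a placeholder):

>>> results = autoencoder.reconstruct(
...     save_path="results/reconstruction",
...     reconstruction_diagnostic=True,
... )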

save(save_path: str | None = None, filename: str = 'model.pkl') → None

Save the model (Keras model + training parameters) into a single .pkl file.

Parameters:
  • save_path (Optional[str]) – Path to save the model

  • filename (str) – Name of the file to save the model

Raises:

Exception – If there’s an error saving the model

Returns:

None

Return type:

None
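
Together with load_from_pickle(), this enables a simple save/load round trip (paths are placeholders):

>>> autoencoder.save(save_path="models", filename="model.pkl")
>>> restored = AutoEncoder.load_from_pickle("models/model.pkl")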

build_and_train(context_window: int, data: numpy.ndarray | pandas.DataFrame | polars.DataFrame | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], time_step_to_check: int | List[int], feature_to_check: int | List[int], hidden_dim: int | List[int], form: str = 'lstm', bidirectional_encoder: bool = False, bidirectional_decoder: bool = False, activation_encoder: str | None = None, activation_decoder: str | None = None, normalize: bool = False, normalization_method: str = 'minmax', optimizer: str = 'adam', batch_size: int = 32, save_path: str | None = None, verbose: bool = False, feature_names: List[str] | None = None, feature_weights: List[float] | None = None, shuffle: bool = False, shuffle_buffer_size: int | None = None, use_mask: bool = False, custom_mask: Any = None, imputer: DataImputer | None = None, train_size: float = 0.8, val_size: float = 0.1, test_size: float = 0.1, id_columns: str | int | List[str] | List[int] | None = None, epochs: int = 100, checkpoint: int = 10, use_early_stopping: bool = True, patience: int = 10, use_post_decoder_dense: bool = False, seed: int | None = 42) → AutoEncoder

Build and train the Autoencoder model in a single step.

This method combines the functionality of build_model() and train() methods, allowing for a more streamlined workflow.

Parameters:
  • context_window (int) – Context window for the model used to transform tabular data into sequence data (2D tensor to 3D tensor)

  • data (Union[np.ndarray, pd.DataFrame, pl.DataFrame, Tuple[np.ndarray, np.ndarray, np.ndarray]]) – Data to train the model. Can be a single numpy array, pandas DataFrame, or polars DataFrame for an automatic train/val/test split, or a tuple of three arrays/DataFrames for predefined splits

  • time_step_to_check (Union[int, List[int]]) – Time steps to check for the autoencoder

  • feature_to_check (Union[int, List[int]]) – Features to check in the autoencoder

  • form (str) – Type of encoder, one of 'dense', 'rnn', 'gru' or 'lstm'

  • hidden_dim (Union[int, List[int]]) – Number of hidden dimensions in the internal layers

  • bidirectional_encoder (bool) – Whether to use bidirectional LSTM in encoder

  • bidirectional_decoder (bool) – Whether to use bidirectional LSTM in decoder

  • activation_encoder (Optional[str]) – Activation function for the encoder layers

  • activation_decoder (Optional[str]) – Activation function for the decoder layers

  • normalize (bool) – Whether to normalize the data

  • normalization_method (str) – Method to normalize the data: 'minmax' or 'zscore'

  • optimizer (str) – Optimizer to use for training

  • batch_size (int) – Batch size for training

  • save_path (Optional[str]) – Folder path to save model checkpoints

  • verbose (bool) – Whether to log model summary and training progress

  • feature_names (Optional[List[str]]) – List of feature names to use

  • feature_weights (Optional[List[float]]) – List of feature weights for loss scaling

  • shuffle (bool) – Whether to shuffle the training dataset

  • shuffle_buffer_size (Optional[int]) – Buffer size for shuffling

  • use_mask (bool) – Whether to use a mask for missing values

  • custom_mask (Any) – Custom mask to use for missing values

  • imputer (Optional[DataImputer]) – Imputer to use for missing values

  • train_size (float) – Proportion of dataset for training

  • val_size (float) – Proportion of dataset for validation

  • test_size (float) – Proportion of dataset for testing

  • id_columns (Union[str, int, List[str], List[int], None]) – Column(s) to process data by groups

  • epochs (int) – Number of epochs for training

  • checkpoint (int) – Number of epochs between model checkpoints

  • use_early_stopping (bool) – Whether to use early stopping

  • patience (int) – Number of epochs without improvement before early stopping

  • use_post_decoder_dense (bool) – Whether to use a dense layer after the decoder

  • seed (Optional[int]) – Seed for reproducibility (sets all random generators)

Returns:

Self for method chaining

Return type:

AutoEncoder
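
Because the method returns the instance, building, training, and reconstruction can be chained; a sketch with illustrative values:

>>> import numpy as np
>>> data = np.random.randn(1000, 5)
>>> results = (
...     AutoEncoder()
...     .build_and_train(
...         context_window=10,
...         data=data,
...         time_step_to_check=[9],
...         feature_to_check=[0, 1],
...         hidden_dim=[64, 32],
...         form="lstm",
...         normalize=True,
...         epochs=50,
...     )
...     .reconstruct()
... )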

reconstruct_new_data(data: numpy.ndarray | pandas.DataFrame | polars.DataFrame, iterations: int = 1, id_columns: str | int | List[str] | List[int] | None = None, save_path: str | None = None, reconstruction_diagnostic: bool = False) → Dict[str, pandas.DataFrame]

Predict and reconstruct unknown data, iterating over NaN values to improve predictions. Uses stored context_window, normalization parameters, and the trained model.

Parameters:
  • data (Union[np.ndarray, pd.DataFrame, pl.DataFrame]) – Input data (numpy array, pandas DataFrame, or polars DataFrame)

  • iterations (int) – Number of reconstruction iterations used to iteratively refine NaN predictions (default: 1)

  • id_columns (Optional[Union[str, int, List[str], List[int]]]) – Column(s) that define IDs to process reconstruction separately

  • save_path (Optional[str]) – Path to save reconstruction results, plots, and diagnostics

  • reconstruction_diagnostic (bool) – If True, shows and optionally saves reconstruction error data and plots

Returns:

Dictionary with reconstructed data per ID (or “global” if no ID)

Return type:

Dict[str, pd.DataFrame]

Raises:

ValueError – If no model is loaded or if id_columns format is invalid
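
For example, to impute NaN gaps in unseen data with a trained model (values are illustrative):

>>> import numpy as np
>>> new_data = np.random.randn(500, 5)
>>> new_data[10:20, 2] = np.nan  # introduce a gap to reconstruct
>>> reconstructed = autoencoder.reconstruct_new_data(new_data, iterations=3)
>>> reconstructed["global"]  # no id_columns, so a single "global" entry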

prepare_datasets(data: numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], context_window: int, normalize: bool, id_iter: str | int | None = None) → bool

Prepare the datasets for the model training and testing.

Parameters:
  • data (Union[np.ndarray, Tuple[np.ndarray, np.ndarray, np.ndarray]]) – Data to train the model. It can be a single numpy array containing the whole dataset, from which train, validation, and test splits are created, or a tuple of three numpy arrays: one for training, one for validation, and one for testing.

  • context_window (int) – Context window for the model

  • normalize (bool) – Whether to normalize the data or not

  • id_iter (Optional[Union[str, int]]) – ID of the iteration

Returns:

True if the datasets are prepared successfully

Return type:

bool

Raises:

ValueError – If data format is invalid or if NaNs are present when use_mask is False

concatenate_by_id() → None

Concatenate datasets by ID. This method combines the training, validation, and test datasets for each ID into a single dataset. It also concatenates the masks if they are used.

Returns:

None

Return type:

None

static masked_weighted_mse(y_true: tensorflow.Tensor, y_pred: tensorflow.Tensor, time_step_to_check: int | List[int], feature_to_check: int | List[int], feature_weights: tensorflow.Tensor | None = None, mask: tensorflow.Tensor | None = None) → tensorflow.Tensor

Compute Mean Squared Error (MSE) with optional masking and feature weights.

Parameters:
  • y_true (tf.Tensor) – Ground truth values with shape (batch_size, seq_length, num_features)

  • y_pred (tf.Tensor) – Predicted values with shape (batch_size, seq_length, num_features)

  • time_step_to_check (Union[int, List[int]]) – Time step index or indices to check

  • feature_to_check (Union[int, List[int]]) – Feature index or indices to check

  • feature_weights (Optional[tf.Tensor]) – Feature weights

  • mask (Optional[tf.Tensor]) – Optional binary mask with shape (batch_size, seq_length, num_features); 1 for observed values, 0 for missing values

Returns:

Masked and weighted MSE loss value

Return type:

tf.Tensor
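
The core computation can be sketched in plain TensorFlow: select the checked time steps and features, square the errors, apply the feature weights and mask, and average over observed entries. This is an illustrative re-implementation under those assumptions, not the library's exact code:

>>> import tensorflow as tf
>>> def masked_weighted_mse_sketch(y_true, y_pred, time_steps, features,
...                                feature_weights=None, mask=None):
...     # Restrict tensors to the time steps and features being checked
...     y_true = tf.gather(tf.gather(y_true, time_steps, axis=1), features, axis=2)
...     y_pred = tf.gather(tf.gather(y_pred, time_steps, axis=1), features, axis=2)
...     err = tf.square(y_true - y_pred)
...     if feature_weights is not None:
...         # Broadcast per-feature weights over batch and time dimensions
...         err = err * tf.reshape(feature_weights, (1, 1, -1))
...     if mask is None:
...         return tf.reduce_mean(err)
...     mask = tf.gather(tf.gather(mask, time_steps, axis=1), features, axis=2)
...     # Average only over observed entries (mask == 1), guarding against
...     # an all-missing batch
...     return tf.reduce_sum(err * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)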