src.dfencoder package

Submodules

src.dfencoder.autoencoder module

src.dfencoder.autoencoder.ohe(input_vector, dim, device='cpu')[source]

Does one-hot encoding of input vector.
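
As a rough illustration of what one-hot encoding produces, the sketch below maps integer category indices to rows containing a single 1. The name one_hot and the list-of-lists return type are assumptions made for this example; the real ohe function takes a device argument, so it most likely works on tensors instead.

```python
def one_hot(indices, dim):
    """Return one row per index, with a 1 at that index and 0 elsewhere."""
    rows = []
    for idx in indices:
        if not 0 <= idx < dim:
            raise ValueError(f"index {idx} is outside [0, {dim})")
        row = [0] * dim
        row[idx] = 1
        rows.append(row)
    return rows


# Three samples drawn from a column with three possible categories.
encoded = one_hot([0, 2, 1], dim=3)
```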

src.dfencoder.autoencoder.compute_embedding_size(n_categories)[source]

Applies a standard formula to choose the number of embedding dimensions to use in a given embedding layer.

n_categories is the number of unique categories in a column.
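
This page does not state the formula. One widely used rule of thumb, popularized by the fastai library, is min(600, round(1.6 * n_categories ** 0.56)). The sketch below assumes that rule, so the constants and the function name embedding_size_rule are illustrative and may differ from what compute_embedding_size actually does.

```python
def embedding_size_rule(n_categories, max_size=600):
    """Grow the embedding width sublinearly with the number of categories."""
    # Assumed constants (1.6, 0.56, cap of 600); they are not given on this page.
    return int(min(max_size, round(1.6 * n_categories ** 0.56)))


# Small columns get narrow embeddings; very large ones hit the cap.
sizes = {n: embedding_size_rule(n) for n in (2, 10, 1_000_000)}
```

The sublinear exponent keeps high-cardinality columns from dominating the input width, and the cap bounds the size of any single embedding table.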

class src.dfencoder.autoencoder.CompleteLayer(in_dim, out_dim, activation=None, dropout=None, *args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Implements a layer with a linear transformation and optional activation and dropout.

interpret_activation(act=None)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class src.dfencoder.autoencoder.AutoEncoder(encoder_layers=None, decoder_layers=None, encoder_dropout=None, decoder_dropout=None, encoder_activations=None, decoder_activations=None, activation='relu', min_cats=10, swap_p=0.15, lr=0.01, batch_size=256, eval_batch_size=1024, optimizer='adam', amsgrad=False, momentum=0, betas=(0.9, 0.999), dampening=0, weight_decay=0, lr_decay=None, nesterov=False, verbose=False, device=None, logger='basic', logdir='logdir/', project_embeddings=True, run=None, progress_bar=True, n_megabatches=1, scaler='standard', eps=1e-06, *args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

get_scaler(name)[source]
init_numeric(df)[source]
init_cats(df)[source]
init_binary(df)[source]
init_features(df)[source]
build_inputs()[source]
build_outputs(dim)[source]
prepare_df(df)[source]

Prepares a copy of the input dataframe and returns that copy.

build_optimizer()[source]
build_model(df)[source]

Takes a pandas dataframe as input. Builds autoencoder model.

Returns the dataframe after making changes.

compute_targets(df)[source]
encode_input(df)[source]

Handles raw df inputs. Passes categories through embedding layers.

compute_outputs(x)[source]
encode(x, layers=None)[source]
decode(x, layers=None)[source]
forward(df)[source]

Runs the forward pass. Takes a pandas dataframe as input.

compute_loss(num, bin, cat, target_df, logging=False, _id=False)[source]
do_backward(mse, bce, cce)[source]
compute_baseline_performance(in_, out_)[source]

Baseline performance is computed by generating a strong prediction for the identity function (predicting input==output) with a swapped (noisy) input, and computing the loss against the unaltered original data.

This should be roughly the loss we expect when the encoder degenerates into the identity function solution.

Returns net loss on baseline performance computation (sum of all losses).
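
The idea can be sketched for a single numeric column: corrupt the input, treat the corrupted input itself as the prediction, and score it against the clean data. Everything here (swap_column, identity_baseline_mse, the use of mean squared error only) is a simplification assumed for illustration; the method above sums several losses, as its description says.

```python
import random


def swap_column(column, likelihood, rng):
    """Replace each value, with probability `likelihood`, by a value from a random row."""
    return [rng.choice(column) if rng.random() < likelihood else v for v in column]


def identity_baseline_mse(column, likelihood=0.15, seed=0):
    """MSE of 'predicting' the noisy input itself against the clean original."""
    rng = random.Random(seed)
    noisy = swap_column(column, likelihood, rng)
    # An identity model outputs its (noisy) input unchanged.
    prediction = noisy
    return sum((p - t) ** 2 for p, t in zip(prediction, column)) / len(column)


clean = [float(i) for i in range(100)]
baseline = identity_baseline_mse(clean)
no_noise = identity_baseline_mse(clean, likelihood=0.0)
```

A trained model whose loss is no better than this baseline has probably learned to copy its input instead of denoising it.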

fit(df, epochs=1, val=None)[source]

Does training.

train_epoch(n_updates, input_df, df, pbar=None)[source]

Run regular epoch.

train_megabatch_epoch(n_updates, df)[source]

Run epoch doing ‘megabatch’ updates, preprocessing data in large chunks.

get_representation(df, layer=0)[source]

Computes latent feature vector from hidden layer given input dataframe.

The layer argument (int) specifies which layer to get. By default (layer=0), returns the “encoding” layer.

layer < 0 counts layers back from the encoding layer. layer > 0 counts layers forward from the encoding layer.
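
The relative indexing can be pictured with a small sketch. It assumes that the encoding layer is the output of the last encoder layer and that activations are recorded in forward order; the function select_layer_output and the layer names are hypothetical and are not part of this module.

```python
def select_layer_output(activations, n_encoder_layers, layer=0):
    """Pick one recorded activation, indexed relative to the encoding layer."""
    # Assumed position of the encoding layer: output of the last encoder layer.
    encoding_pos = n_encoder_layers - 1
    pos = encoding_pos + layer
    if not 0 <= pos < len(activations):
        raise IndexError(f"layer={layer} is outside the network")
    return activations[pos]


# Hypothetical network: 3 encoder layers followed by 2 decoder layers.
recorded = ["enc0", "enc1", "enc2", "dec0", "dec1"]
encoding = select_layer_output(recorded, n_encoder_layers=3)
one_back = select_layer_output(recorded, n_encoder_layers=3, layer=-1)
one_forward = select_layer_output(recorded, n_encoder_layers=3, layer=1)
```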

get_deep_stack_features(df)[source]

Records and outputs all internal representations of input df as row-wise vectors. Output is a 2-d array with len() == len(df).

get_anomaly_score(df)[source]

Returns a per-row loss of the input dataframe. Does not corrupt inputs.
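
A per-row loss means the reconstruction error is kept separate for each row instead of being averaged over the batch. The sketch below shows this for numeric columns only, using mean squared error; the name per_row_mse and the hand-written reconstructions are assumptions standing in for real model output.

```python
def per_row_mse(inputs, reconstructions):
    """Return one reconstruction error per row (mean squared error over columns)."""
    scores = []
    for row, recon in zip(inputs, reconstructions):
        scores.append(sum((a - b) ** 2 for a, b in zip(row, recon)) / len(row))
    return scores


rows = [[1.0, 2.0], [3.0, 4.0]]
recon = [[1.0, 2.0], [0.0, 4.0]]  # the second row is reconstructed poorly
scores = per_row_mse(rows, recon)
```

Rows with a high score are the ones the model reconstructs worst, which is why such a score can be used to flag anomalies.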

decode_to_df(x, df=None)[source]

Runs input embeddings through decoder and converts outputs into a dataframe.

df_predict(df)[source]

Runs end-to-end model. Interprets output and creates a dataframe. Outputs dataframe with same shape as input containing model predictions.

training: bool

src.dfencoder.dataframe module

class src.dfencoder.dataframe.EncoderDataFrame(*args, **kwargs)[source]

Bases: pandas.core.frame.DataFrame

swap(likelihood=0.15)[source]

Performs random swapping of data. Each value has a probability, given by the likelihood argument, of being randomly replaced with a value from a different row.

Returns a copy of the dataframe with equal size.
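
The swapping described above can be sketched without pandas by treating a table as a dict that maps column names to lists. The function swap_noise and its seed parameter are hypothetical; this is one plausible reading of the description, not the code behind swap.

```python
import random


def swap_noise(table, likelihood=0.15, seed=None):
    """Return a noisy copy of `table` (a dict mapping column name to a list of values)."""
    rng = random.Random(seed)
    noisy = {}
    for name, column in table.items():
        n_rows = len(column)
        new_column = list(column)  # copy, so the input is left untouched
        for i in range(n_rows):
            if n_rows > 1 and rng.random() < likelihood:
                # Draw a donor row that is different from row i.
                j = rng.randrange(n_rows - 1)
                if j >= i:
                    j += 1
                new_column[i] = column[j]
        noisy[name] = new_column
    return noisy


table = {"age": [23, 35, 41, 58], "city": ["NYC", "LA", "SF", "NYC"]}
noisy = swap_noise(table, likelihood=0.5, seed=0)
```

Because replacements come from the same column, every corrupted value is still a plausible value for that feature, which keeps the noise realistic for both numeric and categorical data.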

src.dfencoder.scalers module

class src.dfencoder.scalers.StandardScaler[source]

Bases: object

Implements standard (mean/std) scaling.
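
Mean/std scaling subtracts the mean and divides by the standard deviation, and the inverse undoes both steps. The class below is a hypothetical stand-in written with the standard library; whether the real scaler uses the population or the sample standard deviation, and how it treats constant columns, is not stated here.

```python
import statistics


class SimpleStandardScaler:
    """Hypothetical stand-in: scale values to zero mean and unit standard deviation."""

    def fit(self, x):
        self.mean = statistics.fmean(x)
        # Population standard deviation (an assumption); fall back to 1.0
        # so that constant columns do not cause a division by zero.
        self.std = statistics.pstdev(x) or 1.0
        return self

    def transform(self, x):
        return [(v - self.mean) / self.std for v in x]

    def inverse_transform(self, x):
        return [v * self.std + self.mean for v in x]

    def fit_transform(self, x):
        return self.fit(x).transform(x)


values = [2.0, 4.0, 6.0, 8.0]
scaler = SimpleStandardScaler()
scaled = scaler.fit_transform(values)
restored = scaler.inverse_transform(scaled)
```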

fit(x)[source]
transform(x)[source]
inverse_transform(x)[source]
fit_transform(x)[source]
class src.dfencoder.scalers.GaussRankScaler[source]

Bases: object

So-called “Gauss Rank” scaling. Forces a transformation, uses bins to perform inverse mapping.

Relies on sklearn's QuantileTransformer.
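
The general rank-to-Gaussian idea can be shown with the standard library alone. Note that this sketch does not use QuantileTransformer and has no inverse mapping; it is a simpler, swapped-in technique, and the name gauss_rank is an assumption for this example.

```python
from statistics import NormalDist


def gauss_rank(values):
    """Map values to standard-normal quantiles according to their rank."""
    n = len(values)
    # Indices of the values in ascending order; ties are broken by position.
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n lies strictly inside (0, 1), so inv_cdf is finite.
        out[i] = standard_normal.inv_cdf((rank + 0.5) / n)
    return out


skewed = [1.0, 2.0, 4.0, 8.0, 1000.0]
transformed = gauss_rank(skewed)
```

Only the ordering of the inputs matters, so the outlier 1000.0 ends up no further from the centre than the smallest value does on the other side.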

fit(x)[source]
transform(x)[source]
inverse_transform(x)[source]
fit_transform(x)[source]
class src.dfencoder.scalers.NullScaler[source]

Bases: object

fit(x)[source]
transform(x)[source]
inverse_transform(x)[source]
fit_transform(x)[source]

Module contents