Breaking Down ETSformer

Paul Bruffett
3 min read · Sep 21, 2022

Refactoring a novel time-series Transformer Architecture.

The implementation covered below can be found here.

ETSformer, an algorithm developed by Salesforce, combines a neural Transformer architecture with exponential smoothing to build a time-series forecasting model with robust results. It is implemented in PyTorch.

The implementation on the Salesforce GitHub is great, but it is spread across many files and scripts with layers of abstraction between them. I refactored it into a reference implementation using the Weather dataset that can be applied to any Pandas DataFrame. In this article I’ll talk a bit about the code and the algorithm’s components.

Let’s work backwards from the ETSformer model class.

We can see the model contains three primary components: an embedding layer, an encoder, and a decoder. The encoder itself can be decomposed into modules for deriving level, growth, and seasonal trends.

We’ll break it down in the order of forward steps:

  • Transforming — this mostly adds noise to the training data and will be skipped.
  • Embeddings
  • Encoder
  • Decoder
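The overall flow can be sketched as a minimal skeleton. This is my own simplification, not the reference code: the component names, the encoder taking both the embedded residual and the raw input, and the final recombination of level, growth, and season are assumptions based on the description above.

```python
import torch
from torch import nn

class ETSformerSketch(nn.Module):
    """Hypothetical skeleton of the forward order: embed, encode into
    level/growth/season, decode (dampen) and recombine for the forecast."""
    def __init__(self, embedding, encoder, decoder):
        super().__init__()
        self.embedding = embedding  # conv1d-based embedding
        self.encoder = encoder      # extracts level, growths, seasons
        self.decoder = decoder      # dampens growth/season over the horizon

    def forward(self, x):
        res = self.embedding(x)                         # embed raw inputs
        level, growths, seasons = self.encoder(res, x)  # trend components
        growth, season = self.decoder(growths, seasons) # dampened horizons
        return level[:, -1:] + growth + season          # recombined forecast
```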

Embedding

The intent of the embedding layer is to avoid manually engineering time-dependent features like month-of-year or day-of-week flagging. The embedding layer and frequency attention module should uncover these patterns.

The embedding layer consists of a 1D conv filter whose input and output channel sizes are set by two of the model parameters, c_in and d_model. In the forward pass (line 10), the tensor is permuted from [batch, sequence_length, in_channels] to [batch, in_channels, sequence_length]. This transposed tensor is fed into the convolution layer, and the output is truncated with the final 2 steps removed ([…, :-2]). In this example the conv output would be [32, 862, 14]; we truncate the final 2 steps added by padding in order to align to the original sequence length of 12.
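A small sketch of this permute → conv → truncate pattern, using illustrative sizes of my own choosing (the kernel size of 3 and circular padding of 2 are assumptions that reproduce the "2 extra steps" behavior described above):

```python
import torch
from torch import nn

batch, seq_len, c_in, d_model = 4, 12, 7, 16  # illustrative sizes only

# padding=2 with kernel_size=3 leaves 2 extra steps on the output
conv = nn.Conv1d(in_channels=c_in, out_channels=d_model,
                 kernel_size=3, padding=2, padding_mode='circular')

x = torch.randn(batch, seq_len, c_in)   # [batch, seq, features]
out = conv(x.permute(0, 2, 1))          # -> [batch, d_model, seq + 2]
out = out[..., :-2]                     # drop the 2 padded steps
```

After truncation, `out` is back to the original sequence length: `[batch, d_model, seq_len]`.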

Encoder

The encoder extracts and returns the key trend components.

We can see that there are four primary custom layers that make up the Encoder:

  • Feedforward
  • Growth
  • Level
  • Season or Fourier

Seasonality is derived first (line 27), with the residual then used to extract the growth. The remaining residual is normalized and passed through a feedforward network.
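That ordering can be sketched as a single encoder layer. This is my own simplification with hypothetical component interfaces, not the reference layer; it just makes the subtract-season, subtract-growth, normalize-and-feedforward sequence concrete:

```python
import torch
from torch import nn

class EncoderLayerSketch(nn.Module):
    """Hypothetical encoder layer: seasonality is removed first, growth is
    estimated on the residual, and the rest flows through norm + feedforward."""
    def __init__(self, season, growth, ff, d_model):
        super().__init__()
        self.season, self.growth, self.ff = season, growth, ff
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, res):
        season = self.season(res)
        res = res - season               # deseasonalized residual
        growth = self.growth(res)
        res = self.norm1(res - growth)   # detrended, normalized residual
        res = self.norm2(res + self.ff(res))
        return res, growth, season
```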

Feedforward

The feedforward layer builds a set of linear layers with dropout.
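A minimal version of such a block might look like the following; the sizes and the choice of sigmoid activation are illustrative assumptions, not the reference configuration:

```python
import torch
from torch import nn

class Feedforward(nn.Module):
    """Position-wise feedforward block: expand, activate, project back,
    with dropout after each linear layer."""
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.Sigmoid(),            # activation is configurable in practice
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```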

Season

The season layer utilizes Fourier transforms from PyTorch to extract the seasonal signals from the inputs.

The results from the embedding (conv1d) layer are fed into the forward pass; this is the x passed to forward. Its shape is [batch, model_dimensions, sequence_length]. torch.fft.rfft extracts the frequency spectrum, and the top K frequencies (signals) are selected, giving us [batch, k, sequence_length].

This is passed to the extrapolation step, which derives the phase and amplitude from the FFT-transformed input and reshapes it.

The result is passed back, with the seasonal trends then removed from the residual before it is handed to the growth layer.
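The core idea, selecting the K dominant frequencies and reconstructing a seasonal signal from them, can be sketched in a few lines. This is a simplified stand-in for the season layer (the function name and the mask-based filtering are mine; the reference works with the phase and amplitude explicitly so it can extrapolate beyond the window):

```python
import torch

def fourier_topk(x, k=2):
    """Keep only the k largest-amplitude frequencies of each signal along
    the last dimension, and reconstruct the seasonal estimate from them."""
    freqs = torch.fft.rfft(x, dim=-1)        # complex spectrum
    amp = freqs.abs()
    idx = amp.topk(k, dim=-1).indices        # indices of dominant frequencies
    mask = torch.zeros_like(amp)
    mask.scatter_(-1, idx, 1.0)              # 1 at kept bins, 0 elsewhere
    return torch.fft.irfft(freqs * mask, n=x.size(-1), dim=-1)
```

For a pure sinusoid, keeping the single dominant bin recovers the input almost exactly.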

Growth

The growth layer reshapes the inputs before applying exponential smoothing. The exponential smoothing layer, the model’s namesake, is invoked in the growth layer and provides most of the custom logic.

The exponential smoothing layer generates a tensor that increments by 1 and runs from 0 to nheads in length. This provides the basis against which self.weight, alpha, is applied. The tensor is then rearranged and returned as [batch, sequence, nheads, model_dimensions].
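The exponential smoothing weights themselves follow the classic recurrence: each step back in time is discounted by another factor of (1 − alpha). A simplified scalar sketch (the layer itself learns one alpha per head and applies this across the batch):

```python
import torch

def exp_smoothing_weights(alpha, t):
    """Weights w_j = alpha * (1 - alpha)^(t - 1 - j) for j = 0..t-1,
    so the most recent step carries the largest weight."""
    powers = torch.arange(t).flip(0).float()  # t-1, ..., 1, 0
    return alpha * (1 - alpha) ** powers
```

For alpha = 0.5 over three steps this gives [0.125, 0.25, 0.5]: the newest observation dominates.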

The output from the growth layer is sent through normalization layers and another Feedforward layer before being sent to the decoder.

The growth and season are passed to the decoder.

Decoder

The decoder dampens the seasonal and growth components, passing the dampened values back to be recombined with the level for the output prediction.
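A hypothetical sketch of the damping idea for the growth component: the last growth term is carried forward over the forecast horizon with a cumulatively decaying weight. The function name, shapes, and damping value are my own illustrative choices:

```python
import torch

def damp(growth_last, horizon, damping=0.9):
    """Extrapolate the final growth term over `horizon` steps, with each
    step's contribution shrunk by an accumulating damping factor."""
    factors = damping ** torch.arange(1, horizon + 1).float()
    return growth_last * factors.cumsum(0).unsqueeze(-1)  # [horizon, d]
```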

Training

Data is loaded into a DataFrame before being wrapped in a data loader. This loader performs feature engineering, including scaling the data and creating time windows for both the features and the targets, and makes them available as an iterator.
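The windowing part can be sketched as a small PyTorch Dataset. This is an illustrative stand-in for the provider described above (scaling omitted), with names and window parameters of my own choosing:

```python
import torch
from torch.utils.data import Dataset

class WindowDataset(Dataset):
    """Sliding windows over a 2-D array of (already scaled) features:
    each item is (seq_len input steps, the pred_len steps that follow)."""
    def __init__(self, data, seq_len, pred_len):
        self.data = torch.as_tensor(data, dtype=torch.float32)
        self.seq_len, self.pred_len = seq_len, pred_len

    def __len__(self):
        return len(self.data) - self.seq_len - self.pred_len + 1

    def __getitem__(self, i):
        x = self.data[i : i + self.seq_len]                                 # feature window
        y = self.data[i + self.seq_len : i + self.seq_len + self.pred_len]  # target window
        return x, y
```

Wrapped in a `torch.utils.data.DataLoader`, this yields batched (features, targets) pairs for the training loop.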

Once the data has been loaded and wrapped in a provider, we are ready to train. Training is broken into a function that runs one iteration: it performs the transformation and inference, returning results that are used to calculate gradients in the case of training, and loss in the case of evaluation.

As a final note, the model has many parameters that can be tuned.

These parameters can be broken down into those which are model hyperparameters and those that are dataset or input dependent.


Paul Bruffett

Enterprise Architect specializing in data and analytics.