Effortless Time Series Feature Generation: A Practical Approach for Simple Models


Among all the steps involved in time series analysis and forecasting, feature generation is one of the most important. Time series features capture the trend, seasonality, lags, or rolling statistics that help a model comprehend temporal dynamics.

In this post, we will demonstrate some practical and simple techniques for time series feature generation. We will avoid the complications of deep learning and focus on effective, simple approaches to creating features that empower simpler models to deliver robust, explainable predictions.

How to generate features: feature engineering or deep learning?

While deep learning models can automatically learn complex representations from raw time series data and can be trained with little or no feature transformation, more classical models such as linear regression and decision trees rely heavily on well-engineered features. These models perform best when they are fed meaningful features extracted from the raw data; feature engineering is thus an essential ingredient for good model performance and explainability.

The good news is that you don’t need sophisticated AI to generate meaningful time series features. Tools and libraries built around deterministic functional transformations make it possible to extract them with very little effort: efficient, interpretable methods that are ideal for simple machine learning models.

Python libraries for generating time series features

A few libraries automate the process of extracting features from time series data. Among the best known are Catch22, which provides 22 carefully curated time series features, and tsfresh, a comprehensive library for extracting a wide range of time series features. While these tools are extremely powerful, they can be very computationally intensive and are therefore less suited to very large data sets or the real-time applications common in industry.

In this post, we will focus on Functime, a lightweight and efficient open-source library for fast feature generation, built on Rust and seamlessly integrated with Polars. Functime offers an excellent balance between computational speed and flexibility, making it ideal for scenarios where simplicity and performance are crucial. It is optimized to allow effortless calculation of statistical features, including lags, over a desired periodicity. With Functime, features for simpler models can be generated quickly, without the complexity and overhead of more elaborate tools.

Use case: Predicting the shutdown of a water pump

For this use case, we aim to predict potential shutdowns of a water pump using time series data. The data set consists of raw sensor readings collected at regular intervals from 52 sensors, together with a timestamp and a status label (machine_status). Each sensor records a specific aspect of the pump’s operation, such as pressure, temperature, or flow rate.

The dataset is structured as follows:

timestamp       # timestamp of each observation
sensor_00       # readings from sensor 00
...
sensor_51       # readings from sensor 51
machine_status  # operational status of the pump (NORMAL, BROKEN, or RECOVERING)

The data for our example comes from the use case of the same name on Kaggle.

import os
from pathlib import Path

import polars as pl

ts_sensor_path = Path(os.path.abspath("")).parents[1] / "data" / "ts_data" / "ts_sensor_data.csv"
ts_sensor_data = pl.read_csv(source=ts_sensor_path)
ts_sensor_data.head()
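One caveat worth checking before the aggregation step: `group_by_dynamic` requires a temporal index column, and `read_csv` may load the timestamps as plain strings. A minimal sketch of the cast, assuming a `YYYY-MM-DD HH:MM:SS` format and using placeholder rows instead of the real file:

```python
import polars as pl

# Placeholder rows standing in for the real CSV contents
ts_sensor_data = pl.DataFrame({"timestamp": ["2018-04-01 00:00:00", "2018-04-01 00:01:00"]})

# Cast the string column to a proper Datetime so group_by_dynamic can use it
ts_sensor_data = ts_sensor_data.with_columns(
    pl.col("timestamp").str.to_datetime("%Y-%m-%d %H:%M:%S")
)
print(ts_sensor_data.schema)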

Create features with Functime

With Functime, we take on one of the challenges of modeling: the generation of time series features. The tool computes statistical features over specified time windows and helps transform the original data into a set of intuitive characteristics. For this demonstration, we calculated statistical features such as the absolute maximum and the root mean square over six-hour windows of data from each sensor. A total of 312 features (six per sensor) were computed from the 52 sensors in just 73 milliseconds, showcasing the efficiency of the feature generation process. This rapid computation makes it feasible to handle real-time or high-frequency sensor data streams without significant computational effort. The computed features can be used directly as input for classical models such as linear regression, random forests, or gradient-boosted trees.

from functime.feature_extractors import FeatureExtractor  # registers the .ts namespace on Polars expressions

def generate_features_for_timeseries(column_name: str) -> dict:
    """Build a dict of named Functime feature expressions for one sensor column."""
    ts = pl.col(column_name).ts
    return {
        f"mean_n_absolute_max_{column_name}": ts.mean_n_absolute_max(n_maxima=3),
        f"range_over_mean_{column_name}": ts.range_over_mean(),
        f"root_mean_square_{column_name}": ts.root_mean_square(),
        f"first_location_of_maximum_{column_name}": ts.first_location_of_maximum(),
        f"last_location_of_maximum_{column_name}": ts.last_location_of_maximum(),
        f"absolute_maximum_{column_name}": ts.absolute_maximum(),
    }

sensor_columns = [col for col in ts_sensor_data.columns if col not in ("timestamp", "machine_status")]

# One named expression per (feature, sensor) pair: 6 features x 52 sensors = 312 columns
new_features = {
    feature_name: calculation
    for sensor_column in sensor_columns
    for feature_name, calculation in generate_features_for_timeseries(sensor_column).items()
}

# Aggregate all feature expressions over 6-hour windows, per machine status
timeseries_features = (
    ts_sensor_data.group_by_dynamic(
        index_column="timestamp",
        every="6h",
        group_by="machine_status",
        start_by="window",
    )
    .agg(**new_features)
)
timeseries_features.head()

The following graph shows how well our features correlate with our targets. These time series features provide valuable insights into the behavior of the sensors and their relationship to the pump status (NORMAL, BROKEN, or RECOVERING). For example, we can see that the pump is broken when the root mean square value of sensor 48 is close to 0. We also expect that higher absolute maximum values for sensors 3, 4, and 11 increase the probability that the pump is in a NORMAL state.

Model predictions with our time series features

Using the computed features, we built a predictive model to classify the water pump’s operational status.
Leveraging SelectKBest, we select the 30 most important features based on ANOVA F-statistics. As a base model, we chose a HistGradientBoostingClassifier, which is robust to unbalanced classes and provides good predictions out of the box. This streamlined approach, powered by lightweight feature generation, shows how classical models can deliver high-quality predictions when combined with well-engineered time series features.
Our key insights for model use and forecast evaluation:
  • Feature generation leverages the Rust-based functime and Polars data processing libraries, which make it possible to work with large data sets even on a simple notebook.
  • The model handles class imbalances effectively, achieving high metrics across all categories. This demonstrates the strength of HistGradientBoostingClassifier combined with well-crafted time series features.
  • Minor performance dips for the RECOVERING class indicate possible improvements, such as fine-tuning the model or including additional features tailored to transition states.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X = timeseries_features[timeseries_features.columns[2:]]
y = timeseries_features["machine_status"]

# Keep the 30 features with the highest ANOVA F-statistics
selector = SelectKBest(score_func=f_classif, k=30).set_output(transform="pandas")
X_selected = selector.fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, random_state=42, stratify=y
)
model = HistGradientBoostingClassifier(random_state=42, class_weight="balanced")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred)
print(report)
              precision    recall  f1-score   support

      BROKEN       1.00      1.00      1.00         2
      NORMAL       0.99      1.00      1.00       346
  RECOVERING       0.96      0.92      0.94        26

    accuracy                           0.99       374
   macro avg       0.98      0.97      0.98       374
weighted avg       0.99      0.99      0.99       374

With streamlined feature generation, the model demonstrates exceptionally promising performance across all classes.

Our interpretation of the model performance with the generated time series features:
  • BROKEN: The model makes perfect predictions, with precision, recall, and F1 score of 1.00, but may not be very reliable as there are only two examples (support = 2).
  • NORMAL: The model is almost perfect, with 99% precision and 100% recall, showing that almost all normal examples were correctly identified.
  • RECOVERING: There was a slight drop in performance (F1 score = 0.94) due to some false negatives, suggesting that improvements are possible through feature engineering or hyperparameter tuning.
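To pin down where the RECOVERING dips come from, a confusion matrix is more informative than the aggregate report. A minimal sketch with made-up labels standing in for the real `y_test`/`y_pred` from the model above:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels illustrating the lookup; in practice these come from the fitted model
y_true_demo = ["NORMAL", "RECOVERING", "RECOVERING", "NORMAL", "BROKEN"]
y_pred_demo = ["NORMAL", "RECOVERING", "NORMAL", "NORMAL", "BROKEN"]

labels = ["BROKEN", "NORMAL", "RECOVERING"]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
# Rows are true labels, columns are predictions, both ordered as `labels`;
# cm[2, 1] counts RECOVERING windows misclassified as NORMAL.
print(cm)
```

Inspecting which windows land in that off-diagonal cell is a natural starting point for features tailored to the transition states.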

Conclusion: the Functime package as a real help for time series features

The Python package Functime can really make life easier by creating time series features in a matter of seconds. For our model predicting the operability of water pumps, the performance was already very promising without any time-consuming fine-tuning. Another advantage of automated feature creation is, of course, that no feature is accidentally forgotten and the procedure can easily be repeated with new or extended data.


Mark Willhoughby

Data Scientist
