ECGAN - A framework for reproducible research on ECG data
Enforcing reproducible and consistent research can be difficult. It is, however, a crucial requirement to quickly iterate through novel ideas. Reproducibility is not always possible, for example when using medical data that is not publicly available, but a pipeline to test new datasets and investigate existing open source datasets can be very useful. In joint work with the research group Data Mining in Medicine of Ludwig-Maximilians-Universität München, we have built a framework focusing on generating and classifying time series data. Our proposed framework, ECGAN, focuses on electrocardiographic (ECG) data. It allows the automatic download and preprocessing of various commonly used datasets and strives to make training and evaluation of models as deterministic as possible using an easy-to-use configuration that can be generated automatically. The design allows the easy addition of novel datasets, preprocessing methods and machine learning models, currently focusing on deep learning models using PyTorch. This post targets an audience familiar with the basic terminology of machine learning.
Motivation
Comparing scientific work is restricted by different - and often equally valid - problem settings. These might depend on the available data, a specific use case or given requirements. Another factor limiting comparability is the lack of reproducibility of previous approaches, e.g. due to missing information on hyperparameters or the use of proprietary code.
The reasons for this vary, ranging from a lack of time or confidence to fraud. A core task is thus to build structures that facilitate the integration of new datasets, preprocessing methods, models or ideas into the scientific community. This can help diminish some of the previously mentioned problems: the required time is reduced, since one can focus on the subset relevant for a given research question instead of the total workflow, and reproducing experiments with a single command allows faster and easier development, increases confidence and makes fraud more difficult. ECGAN attempts to offer reproducibility as well as well-documented and readable code, which simplifies working through novel ideas.
We will first introduce a quick example of its usage before taking a glance at the architecture and finishing with a short discussion of its use. While we will (directly or indirectly) mention some issues we encountered during the creation of this framework, this blog post is not a deep dive into technical details.
Usage
The basic usage is designed to be very simple, and ECGAN tries to ease the problems mentioned above.
Setup and first working example
Let's start with a quick example and the basic steps required to get started before shining some light on the most relevant parameters. The setup is simple: make sure to install Python 3.8 as well as pip and - for the sake of your sanity - consider using a tool to isolate your environment, such as virtualenv. Afterwards:
```sh
# Install ECGAN via the setup steps in the Makefile, then (example values):
ecgan-init ENTITY PROJECT EXPERIMENT_NAME -d mitbih_beats -m rnn -o rnn_config.yml
ecgan-preprocess rnn_config.yml
ecgan-train rnn_config.yml
```
If you use Windows, you might need to use Chocolatey or manually execute the setup steps specified in the Makefile. That's it!
But what is happening here, what are those parameters?
When generating the config using `ecgan-init`, the `-d` flag defines the dataset used (see supported datasets), the `-m` flag the respective model (see supported models), and `-o` sets the name of the resulting YAML config file.
The first three arguments, `ENTITY`, `PROJECT` and `EXPERIMENT_NAME`, are related to the experimental structure of the framework, which is intended to be used by individuals as well as teams. Both often utilize an advanced tracking tool such as Weights and Biases. In our configuration, `ENTITY` is the name of the user/team, while `PROJECT` and `EXPERIMENT_NAME` are used to organize your runs. While Weights and Biases is the only third-party tracking tool currently supported, you can easily add your own experiment tracking tool by implementing a simple interface, the BaseTracker class. The default configuration saves your information locally in a subdirectory based on `ENTITY/PROJECT/EXPERIMENT_NAME`, but we really recommend taking a look at Weights and Biases or a similar tracking tool!
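To give an impression of how small such a tracker can be, here is a minimal, self-contained sketch. Note that it does not use ECGAN's actual `BaseTracker` interface - the constructor and method names below are assumptions for illustration; check the `BaseTracker` class for the abstract methods you actually need to implement.

```python
# Minimal sketch of a file-based tracker. The method names are hypothetical;
# a real ECGAN tracker would subclass BaseTracker and implement its
# abstract methods instead.
import json
from pathlib import Path


class JsonFileTracker:
    """Toy tracker that appends metrics to a JSON lines file."""

    def __init__(self, entity: str, project: str, experiment_name: str):
        # Mirrors the ENTITY/PROJECT/EXPERIMENT_NAME directory layout.
        self.run_dir = Path(entity) / project / experiment_name
        self.run_dir.mkdir(parents=True, exist_ok=True)
        self.log_file = self.run_dir / "metrics.jsonl"

    def log_metrics(self, metrics: dict, step: int) -> None:
        # One JSON object per line keeps the file easy to parse later.
        with self.log_file.open("a") as f:
            f.write(json.dumps({"step": step, **metrics}) + "\n")


tracker = JsonFileTracker("my_team", "ecg_project", "baseline_rnn")
tracker.log_metrics({"train_loss": 0.42}, step=1)
```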
By default, the described setup will run without additional setup of external tools. However, some options and components require further setup: this includes the tracking tool, saving models in AWS S3 buckets, and downloading data from Kaggle. The latter will likely be the most relevant: several datasets are downloaded using the Kaggle API, including a very relevant baseline dataset, the beatwise processed MITBIH arrhythmia dataset (`-d mitbih_beats`).
`ecgan-init` generates a configuration file (here we named it `rnn_config.yml`) which specifies parameters for preprocessing as well as training. You can easily change the default contents of the file before continuing the process. These parameters include experiment settings such as your tracker, preprocessing settings like the target sequence length and the corresponding up-/downsampling algorithms, and model settings like the optimizer with its hyperparameters. Based on these settings, `ecgan-preprocess` downloads and prepares the data, which is then used as input to train a model using `ecgan-train`. The resulting model can be used for arbitrary additional tasks such as anomaly detection. The next section contains more information on the specific outputs of the various steps!
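Since the configuration is plain YAML, you can also inspect or adjust it programmatically before preprocessing. A small sketch, assuming a hypothetical key layout (open the generated file to see the actual key names):

```python
# Sketch: tweak the generated config before running ecgan-preprocess.
# The keys used here ("preprocessing", "target_sequence_length") are
# hypothetical - consult the generated YAML file for the real layout.
import yaml  # requires PyYAML

with open("rnn_config.yml") as f:
    config = yaml.safe_load(f)

# Example tweak: set the target sequence length used during resampling.
config["preprocessing"]["target_sequence_length"] = 160

with open("rnn_config.yml", "w") as f:
    yaml.safe_dump(config, f)
```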
Architecture
- Initialization and Configuration (`ecgan-init`)
  - Basis for preprocessing and training.
  - Easy configuration via a YAML file, controlling a variety of (hyper)parameters.
  - The generated config can depend on the options chosen upon creation, e.g. different optimal transformations for different datasets.
  - Goal/Output: A deterministic description to reproduce experiments as well as possible. In the long term, we also aim to provide good hyperparameters in the default configuration. This means that you can simply generate the config, run it and have a decent baseline for your problem at hand.
- Preprocessing (`ecgan-preprocess`)
  - Based on the config, a dataset is downloaded to a specified data directory (`./data/` by default) into its own subdirectory (`data/DATASET/raw`). This folder contains the raw data as downloaded from the dataset source.
  - Afterwards, heavy preprocessing steps such as resampling and data cleaning are performed, and the resulting data is saved into a new directory in a predefined format for ECGAN: `data/DATASET/processed/data.pkl` as well as `data/DATASET/processed/label.pkl`.
  - Goal/Output: A dataset divided into data and labels which can later be loaded into the framework for training and evaluation. Operations which require few computational resources and/or change frequently are NOT yet applied!
- Training (`ecgan-train`)
  - Data is loaded and preprocessed based on the training and model configuration. This includes frequently changed steps such as data transformations, data scaling based on a network's output function (e.g. when using a tanh instead of a sigmoid output layer) as well as the number and size of cross-validation folds.
  - Based on the configuration, training of the selected architecture is started. We currently focus on deep learning architectures used for anomaly detection. This includes some RNN/CNN baseline models for binary classification, but the current focus lies on generative models which are subsequently used for one-class anomaly detection.
  - Goal/Output: Evaluation metrics of the training (graphical representations as well as numeric values), saved models, and the information required to reproduce the data splits using index lists of data points.
- Evaluation (e.g. `ecgan-detect`)
  - This component is a placeholder for any task that requires a trained model. In practice, we currently focus on anomaly detection in a very broad sense. The procedure is generally as follows:
    - Generate a new configuration specifying the task, trained model and tracking information (e.g. `ecgan-detect -i MODEL_IDENTIFICATION ENTITY PROJECT EXPERIMENT_NAME -o evaluation_config.yml`).
    - Start the detection using `ecgan-detect evaluation_config.yml`. All information required to reproduce the experimental setting is loaded into memory, including the trained model, which is then used to carry out the evaluation.
In the case of anomaly detection, the evaluation is based on specific `AnomalyDetector`s which can be exchanged by the user. Example: using various generative anomaly detection approaches, we generate new samples, calculate anomaly scores based on parameters saved during validation, and retrieve measures for the anomaly detection task itself, including the F1 score, the phi coefficient and the AUROC. Furthermore, model-specific variables (e.g. the distribution of samples drawn from the latent space) are calculated, and other frequent tasks related to generative models or anomaly detection are carried out, e.g. embedding the generated data using UMAP trained on the real train/validation data.
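To make the detector abstraction more tangible, the following standalone sketch scores samples by reconstruction error and thresholds the score. ECGAN's `AnomalyDetector` classes are richer and tied to trained modules; the interface below is a deliberately simplified assumption.

```python
# Standalone sketch of reconstruction-based anomaly detection.
# ECGAN's AnomalyDetector classes are more involved; this simplified
# interface only illustrates the underlying idea.
import torch


class ReconstructionDetector:
    def __init__(self, model: torch.nn.Module, threshold: float):
        self.model = model          # e.g. a trained autoencoder
        self.threshold = threshold  # calibrated during validation

    @torch.no_grad()
    def anomaly_score(self, x: torch.Tensor) -> torch.Tensor:
        # Per-sample mean squared reconstruction error.
        reconstruction = self.model(x)
        return ((x - reconstruction) ** 2).mean(dim=(1, 2))

    def detect(self, x: torch.Tensor) -> torch.Tensor:
        # 1 = anomalous sample, 0 = normal sample.
        return (self.anomaly_score(x) > self.threshold).long()


# Usage with a toy stand-in model on batches of shape (batch, seq_len, channels):
toy_model = torch.nn.Linear(1, 1)
detector = ReconstructionDetector(toy_model, threshold=0.5)
labels = detector.detect(torch.randn(8, 160, 1))
```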
Supported Datasets
While we focus on ECG data, the framework supports the easy addition of further time-series datasets. Some of these datasets are already implemented to be used as reference datasets in evaluations.
The dataset can be set using the `-d` flag during initialization. Currently available options include various processed versions of the MITBIH dataset as well as data from the PTB and Shaoxing datasets. Additional datasets which do not focus on ECG data include the Wafer and CMU MoCap datasets. We currently suggest using the beatwise processed MITBIH data from Kachuee et al. 2018 (`-d mitbih_beats`), which requires access to the Kaggle API. If you want to avoid setting up Kaggle API access for your first experiments, you can use the Wafer dataset (`-d wafer`).
More information on the datasets and their sources can be found in the ECGAN docs - Supported datasets.
Note: To ensure compatibility with all models, it is recommended to make sure that your target sequence length is compatible with the model, especially for generative models. While compatibility with arbitrary lengths could easily be achieved by adding fully connected layers, several of the architectures we use do not support arbitrary sequence lengths. Currently, the best way is to make sure that your sequence length is divisible by 32. To achieve this, change the preprocessing config: say your original sequence length is 167, making your target sequence length 160. Simply set `RESAMPLING_ALGORITHM: lttb` with `TARGET_SEQUENCE_LENGTH: 160` before performing the preprocessing. Similarly, if you would like to upsample the data, linear interpolation is often sufficient, e.g. when going from a sequence length of 152 to 160. This is achieved by setting `RESAMPLING_ALGORITHM: interpolate`, once again with `TARGET_SEQUENCE_LENGTH: 160`.
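To make the upsampling case concrete, the sketch below shows what a linear interpolation from 152 to 160 samples amounts to. This is a plain NumPy illustration of the concept, not ECGAN's implementation (which is configured entirely via the YAML file):

```python
# Illustration: linear-interpolation upsampling from 152 to 160 samples.
import numpy as np

original = np.sin(np.linspace(0, 2 * np.pi, 152))  # toy stand-in for a heartbeat
old_grid = np.linspace(0.0, 1.0, num=152)
new_grid = np.linspace(0.0, 1.0, num=160)

upsampled = np.interp(new_grid, old_grid, original)
assert upsampled.shape == (160,)
assert 160 % 32 == 0  # the new sequence length is divisible by 32
```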
Supported Models
The current focus lies on deep generative models, including various GAN-based models (e.g. RGAN, DCGAN), autoencoder-based models (autoencoder, variational autoencoder) and combinations of both (BeatGAN, β-VAEGAN). Such generative models are intended to be used in a one-class classification setting. Additionally, we support more traditional supervised deep learning classification models such as RNNs and CNNs. More information on the models and their sources can be found in the ECGAN docs - Supported Models.
Why the hassle?
To allow the flexibility one would like to have, additional layers of complexity are required. Using Python to build large frameworks further leads to a variety of technical challenges which can be avoided using other programming languages. Some of our solutions to these challenges, such as nested dataclasses, might be discussed in a future blog post!
Experimenting and getting familiar with ECGAN will take you some time. Why should you invest your time? Here are some good reasons:
- Easy retrieval and preprocessing of public datasets: Most of the typical preprocessing pipeline can be set via configuration files. Even if you do not want to train the models supported by ECGAN, you can easily retrieve raw and preprocessed datasets as inputs for your own models (see the loading sketch after this list). Furthermore, you can quickly update the preprocessing in the respective procedures: let's say you have a new idea for a preprocessing step (maybe removing some data? Augmenting some data?). Add the step itself, and you can run various models and compare their performance across ablations of this specific preprocessing step alone.
- Simple addition of new datasets or models: You want to evaluate the performance of the supported models on your own dataset, or evaluate your novel model on various time series datasets? The implementation is simple and fast for such standard procedures. This currently holds mainly for the typical deep learning evaluation scenario using cross-validation across multiple epochs and will be adapted for models with different training procedures, e.g. OLS.
- Easy way to publish your results: Apart from offering an easily configurable pipeline, the framework focuses on reproducible research, making it easy for reviewers to reproduce the results.
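As an example of reusing the preprocessed data outside of ECGAN, the sketch below loads the pickled arrays produced by `ecgan-preprocess`. The paths follow the layout described in the architecture section; the exact structure of the pickled objects is an assumption to verify against the docs.

```python
# Sketch: load ECGAN's preprocessed data for your own pipeline.
# Paths follow the data/DATASET/processed/ layout described above; the
# exact structure of the pickled objects should be verified in the docs.
import pickle
from pathlib import Path

processed = Path("data") / "mitbih_beats" / "processed"

with (processed / "data.pkl").open("rb") as f:
    data = pickle.load(f)
with (processed / "label.pkl").open("rb") as f:
    labels = pickle.load(f)

print(type(data), type(labels))
```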
Publish your results
Let’s say you evaluated your new {dataset, processing method, model} and want to allow easy inspection of all the results from training and anomaly detection for reviewers.
- Implement your novel approach which includes easy configurability via configuration files.
- Create a directory containing your training and anomaly detection configs: While the visualizations of third-party tools are often preferable, training configs should currently be set to the local tracker to allow reproduction without the need to sign up for such tools. Another option is to use an anonymous account for tracking tools and set the project to be publicly visible.
- Make your model publicly available - either by including it in your code or by publishing the models using your tracking tool - and use this information in your anomaly detection configuration. If implemented correctly, your model will be downloaded automatically in the background, the evaluation will be executed and the reviewer does not require any additional setup.
It is best to highlight your own contributions, e.g. by creating a merge request against the cloned version so the changes you have made are easy to see. To do so, you could use an anonymous GitHub account to avoid conflicts with double-blind reviews. If this is not possible, make sure to describe the changes you have made or explain how reviewers can inspect them. We have prepared a short example video below.
Summary
ECGAN provides an easy-to-use way to develop novel machine learning models in a standardized pipeline suitable for a large variety of models and datasets. We focus on datasets typical in academia, up to several million samples; the framework is not necessarily suitable for billions of series. The framework is still in an experimental phase but runs out of the box for many common use cases. Extensive documentation is available and continuously improved - inside as well as outside the code. Various choices relevant for reproduction, such as random seeds, can be set via configuration files, and models can be run on CPU or GPU. Parallelized experiments use deterministic operations whenever possible. Minor differences are still possible due to various hardware/software issues, and we are happy about any suggestions and improvements. Please get in touch with any requests or questions!
ECGAN is worth considering for several parts of your workflow - it offers pointers and means to easily retrieve and preprocess (ECG) datasets as well as a powerful core to train and evaluate deep learning models in this domain. We plan to significantly expand ECGAN in the future, including models not based on deep learning, a stronger focus on classification, the support of more datasets and preprocessing steps, and utilizing AutoML for fair and public comparisons of methods. Stay tuned!
We would like to express our gratitude towards the Bavarian Research Foundation for supporting this project under grant AZ-1419-20.