Bi-directional DB-converter
This GitHub repository contains the tool described in the paper titled Self-Supervised Generative AI Enables Conversion of Two Non-Overlapping Cohorts by Das, S. et al. The tool implements a self-supervised deep learning architecture leveraging category theory to convert data between two cohorts with distinct data structures.
Functionality
The system takes inputs of source (DB1) and target (DB2) data in train, dev, and test splits, and outputs a trained model containing weights and biases for both:
Forward DB-converter ($m$, referred to as the mapper in the code)
Backward DB-converter ($i$, referred to as the inverter in the code)
After training, the model processes test sets to generate:
Converted-DB1
Reconverted-DB1
Converted-DB2
Reconverted-DB2
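The round trip these outputs describe can be sketched with fixed linear maps standing in for the learned converters. This is purely illustrative: the real $m$ and $i$ are trained neural networks, and the feature sizes here are made up.

```python
import numpy as np

# Illustrative only: fixed matrices stand in for the learned converters.
rng = np.random.default_rng(0)
d1, d2, n = 5, 3, 10            # hypothetical DB1/DB2 feature sizes, sample count
M = rng.normal(size=(d1, d2))   # stands in for the forward DB-converter m
I = np.linalg.pinv(M)           # stands in for the backward DB-converter i

x_test = rng.normal(size=(n, d1))        # DB1 test set
converted_db1 = x_test @ M               # DB1 mapped into DB2's feature space
reconverted_db1 = converted_db1 @ I      # mapped back into DB1's feature space
print(converted_db1.shape, reconverted_db1.shape)  # (10, 3) (10, 5)
```

The same pattern applied to the DB2 test set yields Converted-DB2 and Reconverted-DB2.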
How to use DB-converter
We deploy this app in three possible ways:
1. GitHub (you need to set up your own environment)
2. Google Colab (no environment setup needed)
3. Docker image (no environment setup needed)
Clone Repo
git clone https://github.com/Mycheaux/DB-conv.git
Solution 1: Directly run this Github repo
Environment Set up:
We currently provide both CPU and GPU (NVIDIA) support. The app is tested on a Mac M1 CPU environment and a Linux GPU cluster.
If you are on Windows and encounter path errors due to / vs \, you can try the following fix: create a batch script that automatically translates paths with / into \ before passing them to tools that require backslashes. For example:
@echo off
set "input_path=%1"
set "converted_path=%input_path:/=\%"
echo %converted_path%
If you prefer not to deal with Windows path quirks, you can run the repository in a Unix-like environment such as Windows Subsystem for Linux (WSL) or Git Bash.
Python Libraries:
We assume you already have Anaconda or Miniconda; if not, see https://www.anaconda.com/download or https://www.anaconda.com/docs/getting-started/miniconda/main for installation instructions.
- Best, most general way:
conda create --name db-conv python=3.12 -y
conda activate db-conv
conda install pip -y
conda install numpy scipy pandas -y
conda install -c conda-forge pyyaml
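Before moving on, you can sanity-check the base packages with a quick import (a minimal check; extend the module list to match everything you installed):

```shell
# Quick import check for the base packages (run inside the activated env).
# Extend the list (e.g. add scipy, yaml) to match what you installed.
python -c "import numpy, pandas; print('core stack ok')"
```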
Now, install either the CPU or the GPU version of PyTorch.
CPU version:
conda install pytorch=2.5.1 torchvision torchaudio cpuonly -c pytorch # CPU-only
GPU version:
pip3 install torch==2.6.0 # GPU-only; installed via pip because PyTorch no longer publishes conda packages
Now, install the Lightning API and Weights & Biases:
conda install lightning -c conda-forge
conda install wandb -c conda-forge
This method generally works (tested on a Mac M1 2021 and on a Linux server). If it doesn't, try the following options to ensure you have the exact PyTorch versions, depending on the availability of CUDA devices.
- In general, for any CPU environment, you should first create a fresh conda environment using
conda create --name db-conv python=3.12 -y
conda activate db-conv
conda install pip -y
and install all packages
pip install -r cpu_requirements.txt
If this method breaks because pip can't find the right versions, try the conda alternative suggested above.
- In general, for any device with an available CUDA-capable NVIDIA GPU, you should first create a fresh conda environment using
conda create --name db-conv python=3.12 -y
conda activate db-conv
conda install pip -y
pip install -r gpu_requirements.txt
If this method breaks because pip can't find the right versions, try the conda alternative suggested above.
- If you are on a Mac M1 (2021) and encounter dependency problems with the general method, try:
conda env create -f m1cpu_environment.yml
Then from the activated environment
pip freeze > m1cpu_requirements.txt # From within activated environment
If this method breaks because pip can't find the right versions, try the conda alternative suggested above.
Weights and Biases API:
You need a Weights & Biases account to monitor your model training. Create a free account at https://wandb.ai/site and find your API key in your settings after logging in. When the app starts, it will ask for your API key to log in, and the app will create loss-function plots during training.
Configs:
- The data path and the file names of DB1's train, dev, and test splits (referred to as x_train, x_val, x_test) and of DB2's train, dev, and test splits (referred to as y_train, y_val, y_test) are given in `data_path.yml`. Alternatively, you can replace/create the input files in the `data/preprocessed` folder with the names `za_train.npy`, `za_val.npy`, `za_test.npy` for x_train, x_val, x_test, and `zb_train.npy`, `zb_val.npy`, `zb_test.npy` for y_train, y_val, y_test. Only `.npy` format is accepted here.
- `config.yaml` has the most commonly tuned hyperparameters plus options for the output folder's name and experiment-name details. For testing the app, the total number of epochs `num_epochs` is set to a low number, but for real training the value was 5000; the model-saving frequency is likewise set low for testing and is usually kept at 100.
- `architecture.yaml` has the input and output size details of the architecture. Make sure your data, the architectures of $m$ (referred to as mapper in the code) and $i$ (referred to as inverter in the code) given in `src/model.py`, and the details in this config all match.
- `advanced_config.yaml` has additional hyperparameters that are rarely changed.
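As an illustration, `data_path.yml` might look like the following. The key names below are assumptions based on the description above, not the shipped file; check the repository's own `data_path.yml` for the exact keys.

```yaml
# Hypothetical sketch of data_path.yml; the actual keys may differ.
data_path: data/preprocessed/
x_train: za_train.npy
x_val: za_val.npy
x_test: za_test.npy
y_train: zb_train.npy
y_val: zb_val.npy
y_test: zb_test.npy
```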
Data:
The code expects all data in either .npy or .csv format. If your data is in another format, here is how to convert it to .npy:
- Read your data into a pandas DataFrame, e.g. `df = pd.read_excel('your_file.xlsx', skiprows=1)`.
- Convert it to a NumPy array: `df_numpy = df.to_numpy()`.
- Save it: `np.save('your_file_path/your_file.npy', df_numpy)`.
*Note: For a .csv file, we assume the first row contains column names, which is why it is skipped. If you need that row, either add a dummy first row or remove the argument `skiprows=1` from the `load_data` function in `src/data_loader.py` and `test/test_data_loader.py`.
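A self-contained version of these steps, using an in-memory CSV in place of a real file (replace the StringIO with your actual file path):

```python
import io
import numpy as np
import pandas as pd

# In-memory CSV so the example runs anywhere; use your real path in practice.
csv_text = "age,score\n34,0.7\n29,0.4\n41,0.9\n"
df = pd.read_csv(io.StringIO(csv_text))  # first row is treated as column names
arr = df.to_numpy()                      # DataFrame -> NumPy array
np.save('example.npy', arr)              # writes example.npy to disk

loaded = np.load('example.npy')
print(loaded.shape)  # (3, 2)
```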
The sample sizes of x_train, x_val, and x_test should match those of y_train, y_val, and y_test, respectively. If not, you can preprocess them by subsampling the bigger database to the size of the smaller one, e.g. for the train split:
smaller_size = min(x_train.shape[0], y_train.shape[0])  # repeat per split
x_train_subsampled = x_train[:smaller_size]
y_train_subsampled = y_train[:smaller_size]
The feature size of x_train, x_val, and x_test should be the same with each other (all coming from the same database DB1). The feature size of y_train, y_val, and y_test should be the same as each other (all coming from the same database DB2). Here is an example to split the data into train, dev, and test sets if your data is not already split.
import numpy as np
# Assuming x and y are NumPy arrays
x = np.load('path_to_x.npy') # Replace with your actual file path
y = np.load('path_to_y.npy') # Replace with your actual file path
# Ensure x and y have the same sample size
assert x.shape[0] == y.shape[0], "x and y must have the same number of samples!"
# Shuffle the data (optional, but recommended)
indices = np.arange(x.shape[0])
np.random.shuffle(indices)
x = x[indices]
y = y[indices]
# Split the data into 80% train, 10% dev, 10% test
train_size = int(0.8 * len(x))
dev_size = int(0.1 * len(x))
x_train, x_dev, x_test = x[:train_size], x[train_size:train_size+dev_size], x[train_size+dev_size:]
y_train, y_dev, y_test = y[:train_size], y[train_size:train_size+dev_size], y[train_size+dev_size:]
# Save the datasets
np.save('x_train.npy', x_train)
np.save('x_dev.npy', x_dev)
np.save('x_test.npy', x_test)
np.save('y_train.npy', y_train)
np.save('y_dev.npy', y_dev)
np.save('y_test.npy', y_test)
print("Data successfully split into train, dev, and test sets!")
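The shape requirements above can be verified with a few assertions. Synthetic arrays stand in for your saved split files here; in practice, `np.load` each file instead.

```python
import numpy as np

# Synthetic stand-ins; in practice, np.load your saved split files instead.
x_train, x_val, x_test = np.zeros((80, 5)), np.zeros((10, 5)), np.zeros((10, 5))
y_train, y_val, y_test = np.zeros((80, 3)), np.zeros((10, 3)), np.zeros((10, 3))

# All DB1 splits must share one feature size; likewise for DB2.
assert x_train.shape[1] == x_val.shape[1] == x_test.shape[1]
assert y_train.shape[1] == y_val.shape[1] == y_test.shape[1]
# Corresponding splits must have matching sample counts across DB1 and DB2.
assert x_train.shape[0] == y_train.shape[0]
assert x_val.shape[0] == y_val.shape[0]
assert x_test.shape[0] == y_test.shape[0]
print("splits consistent")
```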