Running Synergos Basic Locally

This chapter guides you through running Synergos with its core components (mainly Synergos TTP and Synergos Worker) locally on your computer. This configuration of Synergos with only the core components is referred to as Synergos Basic.

This is useful when you want to quickly test the federated learning setup before training a model in a fully federated manner (with hyper-parameter tuning and model tracking).

If you wish to run Synergos Basic in a distributed mode, please refer to Running Synergos Basic in distributed mode. If you wish to run a more complex configuration, e.g. running with multiple federated grids as a cluster with the support of extended components (this configuration is referred to as Synergos Cluster), please refer to Running Synergos Cluster in distributed mode.

In this chapter, a three-party federated grid will be created, with one party acting as TTP and the other two as workers.

Setting up

1.1 Preparing data

On your local machine, prepare two separate datasets (in separate folders), one for each worker. Preprocess the data and format the directories as instructed here.

1.2 Installing Synergos

Follow the instructions here to install Synergos. Since you are running the Synergos Basic configuration, install the Synergos Driver python package, Synergos TTP and Synergos Worker on your local computer.

1.3 Setting up the environment

Launch the TTP and worker containers

Launch the worker and TTP Docker containers on your computer before beginning the configuration process. Volume mounts are defined for each container, and the --name parameter is the identifier for each particular running container. Replace <path-to-dataset> with the correct path for your local environment.

Run these workers in two separate terminals.

Worker 1
docker container run -v <path-to-dataset>/data1:/worker/data \
    -v <path-to-dataset>/outputs_1:/worker/outputs \
    --name worker_1 synergos_worker:v0.1.0 \
    --logging_variant basic
Worker 2
docker container run -v <path-to-dataset>/data2:/worker/data \
    -v <path-to-dataset>/outputs_2:/worker/outputs \
    --name worker_2 synergos_worker:v0.1.0 \
    --logging_variant basic
TTP

In the command below, volume mounts are provided for both the TTP data and outputs folders; modify the local paths according to your setup. The command also links to the two workers launched above using the --link parameter.

docker run -p 0.0.0.0:5000:5000 -p 5678:5678 -p 8020:8020 \
    -v <path-to-dataset>/ttp_data:/orchestrator/data \
    -v <path-to-dataset>/ttp_outputs:/orchestrator/outputs \
    -v <path-to-dataset>/mlflow_test:/orchestrator/mlflow \
    --name ttp --link worker_1 --link worker_2 synergos_ttp:v0.1.0 \
    --logging_variant basic -c

In this example, the local environment's port 5000 is mapped to the TTP container's port 5000. You can select any other available local port to map to the TTP's port 5000. This port mapping allows the transmission of commands from the Synergos Driver to the TTP.

The output for both workers and TTP should look like this if it ran successfully:

2020-10-01 08:05:04,176 -  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

Retrieve network information for different parties

When both the TTP and the workers are running on the same local computer, you must determine which IP addresses the worker containers are listening on.

First, list the running containers.

$docker ps
CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS             PORTS                                                                        NAMES
d7694e689102        synergos_ttp:v0.1.0      "python ./main.py --…"   3 minutes ago       Up 3 minutes        0.0.0.0:5000->5000/tcp, 0.0.0.0:5678->5678/tcp, 0.0.0.0:8020->8020/tcp, 0.0.0.0:8080->8080/tcp   ttp
72e98f3721fe        synergos_worker:v0.1.0   "python ./main.py --…"   3 minutes ago       Up 3 minutes        5000/tcp, 8020/tcp                                                                               worker_2
dd83e7a42713        synergos_worker:v0.1.0   "python ./main.py --…"   4 minutes ago       Up 4 minutes        5000/tcp, 8020/tcp                                                                               worker_1

Then use the docker inspect command on each of the worker container IDs to get the network information for that container. In the code example below, dd83e7a42713 is the container ID of worker_1.

$docker inspect dd83e7a42713
[
    {
        "Id": "dd83e7a4271363d872060d9e8e46176047e332ffbde833d06e45c464f155f519",
        "Created": "2020-09-29T06:26:35.40834721Z",
        "Path": "python",
        "Args": [
            "./main.py",
            "--help"
        ],
        "State": {

Near the end of the information displayed, look for the Networks block. Within that block, locate the IPAddress parameter; its value will be used in the configuration script. Retrieve this address for both workers.

            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "fb78aaea34b613b08c556663b589f1974b5941d26bb557e5387c030f92c210f4",
                    "EndpointID": "a1b5dc0381e2e924eee73170bc89bb104ecc66ad25353e8f8205d86d30607e52",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:ac:11:00:02",
                    "DriverOpts": null
                }

In this example, the containers corresponding to different parties have the following addresses:

  • worker_1 (for Worker 1) address: 172.17.0.2
  • worker_2 (for Worker 2) address: 172.17.0.3
  • ttp (for TTP) address: 172.17.0.4
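
If you prefer to script this lookup rather than reading the JSON by hand, a minimal sketch is shown below. It assumes the containers sit on Docker's default bridge network (as in the output above) and that the Docker CLI is on your PATH; the container_ip helper is illustrative and not part of this guide's script.

import json
import subprocess

# Illustrative helper (not part of script.py): read a container's bridge IP
# programmatically instead of scanning the `docker inspect` output by hand.
def container_ip(container_name: str) -> str:
    raw = subprocess.check_output(["docker", "inspect", container_name])
    details = json.loads(raw)[0]
    return details["NetworkSettings"]["Networks"]["bridge"]["IPAddress"]

print(container_ip("worker_1"))  # e.g. 172.17.0.2
print(container_ip("worker_2"))  # e.g. 172.17.0.3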

Preparing the Synergos script

As mentioned here, there are three phases in users' interaction with Synergos. During these phases, the necessary meta-data is supplied by the users, i.e. both the orchestrator and the participants. In this guide, all the meta-data supplied by the different parties for the different phases is included in a single script. However, in a real setup, different parties can run separate scripts to supply their respective information.

As you are running Synergos Basic, after all the meta-data has been collected, the script(s) leverage the Synergos Driver package to send information and instructions to the TTP, which then coordinates with the workers to complete the federated learning process.

Phase 1: Registration

In this phase, the orchestrator and the participants provide different information. The orchestrator defines the Collaboration, Project, Experiment, and Run. The participants register their intention to join the collaboration and project with the orchestrator, submitting information about the compute resources and data they are using.

Create a Python script, named script.py. Let's start by importing the Synergos Driver package.

from synergos import Driver

host = "0.0.0.0"
port = 5000

driver = Driver(host=host, port=port)

The port variable shown above should be the port on your local environment which you have mapped to the TTP container's port 5000.
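
For example, if you had launched the TTP container with -p 5001:5000 instead (a hypothetical variation of the command above), the driver would be initialised against local port 5001:

from synergos import Driver

# Hypothetical variation: the TTP container was started with `-p 5001:5000`,
# so the Driver targets local port 5001, which maps to the TTP's port 5000.
driver = Driver(host="0.0.0.0", port=5001)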

1A. Orchestrator creates a collaboration

collab_task = driver.collaborations
collab_task.create("collaboration_1") #collaboration_1 is the ID of the collaboration

1B. Orchestrator creates a project

There are two kinds of action currently supported - "classify" if you are building a classification model, or "regress" for a regression model.

driver.projects.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    action="classify",
    incentives={
        'tier_1': [],
        'tier_2': [],
        'tier_3': []
    }
)

incentives is related to the Contribution & Reward block, which is still under development. It is ignored in the current version of Synergos.

1C. Orchestrator creates an experiment

In this section, you, as the orchestrator, create an experiment and define the model architecture to be built. The model is defined as a list of dictionaries - each dictionary represents a layer, and the order of elements in the list corresponds to the order of the model's layers. Examples are shown below for a simple model with tabular data and a CNN model with image data.

A simple model with tabular data

driver.experiments.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    expt_id="experiment_1",
    model=[
        {
            "activation": "sigmoid",
            "is_input": True,
            "l_type": "Linear",
            "structure": {
                "bias": True,
                "in_features": 18,
                "out_features": 1
            }
        }
    ]
)

A CNN model with image data

driver.experiments.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    expt_id="experiment_1",
    model = [
        {
            "activation": "relu",
            "is_input": True,
            "l_type": "Conv2d",
            "structure": {
                "in_channels": 1,
                "out_channels": 4,
                "kernel_size": 3,
                "stride": 1,
                "padding": 1
            }
        },
        {
            "activation": None,
            "is_input": False,
            "l_type": "Flatten",
            "structure": {}
        },
        {
            "activation": "sigmoid",
            "is_input": False,
            "l_type": "Linear",
            "structure": {
                "bias": True,
                "in_features": 4 * 32 * 32,
                "out_features": 1
            }
        }
    ]
)

When defining each layer in the model argument, the parameters are:

  • activation - Any activation function found in PyTorch's torch.nn.functional module
  • is_input - Indicates whether the current layer is an input layer. An input layer is considered a "wobbly" layer, meaning that its in_features may be modified automatically to accommodate changes in input structure post-alignment (see the sketch after this list).
  • l_type - Type of layer to be used, which can be found in PyTorch's torch.nn module.
  • structure - Any input parameters accepted in the layer class specified in l_type
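
To make the layer ordering and the "wobbly" input layer concrete, below is a sketch of a hypothetical two-layer tabular model (the experiment ID and layer sizes are illustrative, not taken from this guide):

driver.experiments.create(
    collab_id="collaboration_1",
    project_id="project_1",
    expt_id="experiment_2",   # illustrative experiment ID
    model=[
        {
            "activation": "relu",
            "is_input": True,   # "wobbly" layer: in_features may be adjusted post-alignment
            "l_type": "Linear",
            "structure": {
                "bias": True,
                "in_features": 18,
                "out_features": 8
            }
        },
        {
            "activation": "sigmoid",
            "is_input": False,
            "l_type": "Linear",
            "structure": {
                "bias": True,
                "in_features": 8,
                "out_features": 1
            }
        }
    ]
)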

Note: If you want to use FedGKT as the federated aggregation algorithm, the model has to be defined in a different manner due to implementation nuances. In this case, each activation layer in the neural network has to be defined as its own element in the model parameter list, and the type of activation is defined in that element's l_type parameter. Suitable activations can be found in PyTorch's torch.nn module. For example, a layer originally defined in model as:

model=[
  {
      "activation": "sigmoid",
      "is_input": True,
      "l_type": "Linear",
      "structure": {
          "bias": True,
          "in_features": 18,
          "out_features": 1
      }
  }
]

will need to be defined in the following form for FedGKT.

# model definition for FedGKT
model=[
  {
      "activation": None,
      "is_input": True,
      "l_type": "Linear",
      "structure": {
          "bias": True,
          "in_features": 18,
          "out_features": 1
      }
  },
  {
      "activation": None,
      "is_input": False,
      "l_type": "Sigmoid",
      "structure": {}
  }
]

1D. Orchestrator creates a run

A run corresponds to a specific set of hyper-parameter values.

driver.runs.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    expt_id="experiment_1",
    run_id="run_1",
    rounds=2,
    epochs=1,
    base_lr=0.0005,
    max_lr=0.005,
    criterion="L1Loss",
    optimizer="SGD",
    lr_scheduler="CyclicLR"
)

This method first requires the run identification parameters. It then requires the following arguments:

  • rounds: number of federated aggregation rounds
  • epochs: number of epochs run by each worker during each round
  • criterion: loss function
  • lr_scheduler: torch scheduler module (optional)
  • optimizer: torch optim module (optional)

Further keyword arguments are required to pass the model and loss hyper-parameter values to use. These keywords must match the argument names in PyTorch for each module used. In the code block above, base_lr and max_lr are arguments for the torch CyclicLR scheduler module used.

Most of the loss functions found in PyTorch's torch.nn module are supported as criterion, except MarginRankingLoss, CosineEmbeddingLoss, TripletMarginLoss and CTCLoss.
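
As an illustration of how these keyword arguments map to PyTorch modules, the hypothetical second run below (run_2, with illustrative values) passes momentum and weight_decay, which match parameters of torch.optim.SGD, alongside base_lr and max_lr for the CyclicLR scheduler. This sketch assumes the driver routes each keyword to the module whose signature it matches, as described above.

driver.runs.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    expt_id="experiment_1",
    run_id="run_2",          # a second run with a different set of hyper-parameter values
    rounds=3,
    epochs=1,
    base_lr=0.0005,          # matches torch.optim.lr_scheduler.CyclicLR's "base_lr"
    max_lr=0.005,            # matches torch.optim.lr_scheduler.CyclicLR's "max_lr"
    criterion="BCELoss",
    optimizer="SGD",
    lr_scheduler="CyclicLR",
    momentum=0.9,            # matches torch.optim.SGD's "momentum"
    weight_decay=0.01        # matches torch.optim.SGD's "weight_decay"
)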

1E. Participants register to the collaboration created above

It is now the participants' turn to supply information.

First, let's create individual participants.

driver.participants.create(
    participant_id="participant_1",
)

driver.participants.create(
    participant_id="participant_2",
)

1F. Participants declare the compute resource and data they are using

After this, each participant provides information about its compute resources. When adding a compute resource, provide information about host, port, and f_port. The host parameter is set to the IPAddress retrieved from the container network information above. port is the port used to communicate with other parties during the federated learning process to exchange intermediate training results, while f_port is the port used to receive commands from the Synergos Driver, e.g. to dismantle the grid after training completes. The compute resource a participant declares is registered by calling create(). When calling create(), provide the corresponding collab_id, project_id, and participant_id.

registration_task = driver.registrations

registration_task.add_node(
    host='172.17.0.2',
    port=8020,
    f_port=5000,
    log_msgs=True,
    verbose=True
)
registration_task.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    participant_id="participant_1",
    role="guest"
)

registration_task.add_node(
    host='172.17.0.3',
    port=8020,
    f_port=5000,
    log_msgs=True,
    verbose=True
)
registration_task.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    participant_id="participant_2",
    role="host"
)

Although participants contribute data, they do not expose it to other parties. Instead, they declare the tags of their data for the project they registered to above. For more information on how to define data tags, refer to this.

driver.tags.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    participant_id="participant_1",
    train=[["train"]], # data used in training
    evaluate=[["evaluate"]], # data used to evaluate model performance
    predict = [["predict"]] # data used for prediction/inference
)

driver.tags.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    participant_id="participant_2",
    train=[["train"]],
    evaluate=[["evaluate"]],
    predict=[["predict"]]
)

Phase 2: Training

Once all the meta-data has been collected in the previous phase, you can start the federated training.

2A. Perform feature alignment to dynamically configure multiple datasets and models for cross-party compatibility

In machine learning, one-hot encoding is usually applied to categorical variables. In federated learning, since different parties do not expose data to one another, one-hot encoding is done locally without knowledge of the other parties' data. Due to the issue of non-IID data, it is possible that the features from different parties will not align after one-hot encoding.

To illustrate this point, consider an example where a few hospitals collaboratively train a federated model to predict mortality in ICUs. One of the predictor features used in the model is a patient's ethnicity. Assume that 5 ethnicities are recorded in total across all the patients served by these hospitals. One of the hospitals, however, has no patients from a particular ethnicity due to its geographic location. If each hospital applies one-hot encoding locally before federated training starts, all the hospitals will have 5 ethnicity-related features, except that one hospital, which has only 4. This causes errors in federated learning.

Feature alignment is therefore important. Synergos aligns the datasets, inputs and outputs to achieve proper symmetry across all participants before federated training starts.

driver.alignments.create(
    collab_id = "collaboration_1",
    project_id="project_1"
)

2B. Start training

Once feature alignment has been completed, training can be started.

model_resp = driver.models.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    expt_id="experiment_1",
    run_id="run_1"
)

Phase 3: Evaluation

In Synergos, the orchestrator is not allowed to see the different parties' data, even though it is the one coordinating the federated training. Only the participants are allowed to interact with the prediction results and compare them with their own ground truth to derive the performance evaluation.

3A. Performance evaluation

The participants conduct performance evaluation of the federated model with the evaluate data they declared previously in Step 1F. The local evaluations by the individual participants are then sent to the orchestrator, which aggregates them to derive the final evaluation.

driver.validations.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    expt_id="experiment_1",
    run_id="run_1"
)

3B. Perform prediction

The participants are also allowed to submit new data that was not declared for evaluation or training previously in Step 1F. The trained global model is used to obtain inferences on this new data. Do note that the predict tag declared here will override the one in Step 1F, if any.

driver.predictions.create(
    collab_id = "collaboration_1",
    tags={"project_1": [["predict"]]},
    participant_id="participant_1",
    project_id="project_1",
    expt_id="experiment_1",
    run_id="run_1"
)

Running the script

Now that the script contains all the information and instructions needed to run the complete federated learning process, you can proceed to run it. The instructions will be sent to the TTP, which then dispatches them to all of the workers.

#Activate <synergos_env> virtual environment
conda activate <synergos_env>

#Navigate into the repository
cd ./synergos

#Run script.py
python script.py

The output for both workers and TTP should look like this if it is run successfully:

Worker

2020-10-01 09:31:10,238 - 172.17.0.4 - - [01/Oct/2020 09:31:10] "POST /worker/terminate/collaboration_1/project_1/experiment_1/run_1 HTTP/1.1" 200 -

TTP

2020-10-01 09:31:10,268 - 172.17.0.1 - - [01/Oct/2020 09:31:10] "POST /ttp/evaluate/participants/participant_1/predictions/collaboration_1/project_1/experiment_1/run_1 HTTP/1.1" 200 -
