Guide to Prepare Data for Synergos

This guide will provide instructions on preparing data for federated training in Synergos. A Workers may contribute data for training, evaluation or prediction. The Worker container expects data to be mounted to and located at the /worker/data directory. This guide specifies the format required.

Three types of data are supported: tabular, image and text. Each dataset must have an accompanying metadata.json file in its directory, specifying two items: datatype and operations

    {

        "datatype": "tabular", # tabular, image or text
        "operations": {...}
    }

    Supported Operation options for Tabular data:
        seed: int = 42
        boost_iter: int = 100
        thread_count: int = None

    Supported Operation options for Image data:
        use_grayscale: bool = True
        use_alpha: bool = False
        use_deepaugment: bool = True

    Supported Operation options for Text data:
        max_df: int = 30000
        max_features: int = 1000
        strip_accents: str = 'unicode'
        keep_html: bool = False
        keep_contractions: bool = False
        keep_punctuations: bool = False
        keep_numbers: bool = False
        keep_stopwords: bool = False
        spellcheck: bool = True
        lemmatize: bool = True

Operations are automatically applied to participating datasets, with the default values shown above. Participants may specify custom values as required.

In addition, each type of data has further requirements:

I. Tabular Data

In current version of Synergos, all tabular data must be stored as a .csv file and have a schema.json file declared alongside it, which contains all the datatype mappings of each feature of the dataset.

Example of a tabular dataset in a directory:

  example/
  |-- metadata.json
  |-- tabular_data.csv
  |-- schema.json
  ...

Content of the example schema.json file

  {
      "age": "int32",
      "sex": "category",
      "cp": "category",
      "trestbps": "int32",
      "chol": "int32",
      "fbs": "category",
      "restecg": "category",
      "thalach": "int32",
      "exang": "category",
      "oldpeak": "float64",
      "slope": "category",
      "ca": "category",
      "thal": "category",
      "target": "category"
  }

Content of the data file in .csv format:

For classification models, the target variable must be named target and has type category. The target variable must take numeric values to represent each class label. String labels are not supported in current version.
For prediction tasks, the target variable must be named target and be numeric.
It is strongly recommended that users should minimise the presence of null values in the data. Features with large percentage of null values should be dropped. This is because null values will incur greater computation cost when Synergos is conducting alignment of data across multiple parties before federated training starts.

II. Image Data

Store all images for one task in one folder. In each folder, it has metadata.json and mapping.csv, alongside the image files.
Common image types (eg. .png, .gif, .jpg etc.) are supported.
The file mapping.csv maps each image file in the folder to its label, with two fields image and target. image contains the image file names; target contains the true class/category of the corresponding image file.

MNIST dataset as an example:

  example/
  |-- mapping.csv
  |-- metadata.json
  |-- 31369.png
  |-- 31392.png
  |-- 31410.png
  |-- 31512.png
  |-- 31513.png
  |-- 31529.png
  |-- 31537.png
  |-- 31570.png
  ...

Sample MNIST images

Corresponding mapping.csv for this dataset shall be:

  image, target
  31369.png, 0
  31392.png, 0
  31410.png, 0
  31512.png, 2
  31513.png, 2
  31529.png, 1
  31537.png, 1
  31570.png, 1

III. Text Data

In current version, corpora must be declared as one or more .csv files with two fields (i.e. ['text', 'target'])
The target variable must be named target and have type category.

Example text data:

  ```
  example/
  |-- metadata.json
  |-- text_corpus.csv
  ...
  ```

content of the example text_corpus.csv

Guide to Declare Data Tags

Although participants are contributing data, they are not exposing data to other parties. So they are declaring tags of their data to the project they registered above. When starting a training, evaluation or prediction jobs, the specific data to be used is declared using data tags.

In Synergos, usually all the data a worker container has is at the /worker/data mount point. Data tag is a Python list, whose elements are all the sub-directory names visited if traversing from /worker/data/ to the sub-directory where the chosen dataset resides (where its corresponding metadata.json resides).

For example, whem starting a worker container with docker container run, it is specified with a flag -v /tabular:/worker/data. And in the corresponding VM file system, the folder /tabular has the following structure.

/tabular
    /2019
        /train
            metadata.json
            tabular_data.csv
            schema.json
        /evaluate
            metadata.json
            tabular_data.csv
            schema.json
    /2020
        /train
            metadata.json
            tabular_data.csv
            schema.json
        /evaluate
            metadata.json
            tabular_data.csv
            schema.json
        /predict
            metadata.json
            tabular_data.csv
            schema.json

Now, for this given worker, to perform training using data in folder /tabular/2019/train and evaluation using data in folder /tabular/2019/evaluate, specify the following tags in the Synergos Driver script:

 driver.tags.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    participant_id="participant_1",
    train=[["2019", "train"]], # data used in training
    evaluate=[["2019", "evaluate"]], # data used to evaluate model performance
)

or the following JSON string in the Synergos GUI:

{
    "train": [["2019", "train"]],
    "evaluate": [["2019", "evaluate"]]
}

To perform training and evaluation using data from both 2019 and 2020, specify the following tags in the Synergos Driver script:

 driver.tags.create(
    collab_id = "collaboration_1",
    project_id="project_1",
    participant_id="participant_1",
    train=[["2019", "train"],
            ["2020", "train"]], # data used in training
    evaluate=[["2019", "evaluate"],
            ["2020", "train"]], # data used to evaluate model performance
)

or the following JSON string in the Synergos GUI:

{
    "train": [
        ["2019", "train"],
        ["2020", "train"]
    ],
    "evaluate": [
        ["2019", "evaluate"],
        ["2020", "train"]
    ]
}

Guide to prepare data