Guide to Prepare Data for Synergos
This guide will provide instructions on preparing data for federated training in Synergos. A Workers may contribute data for training, evaluation or prediction. The Worker container expects data to be mounted to and located at the /worker/data directory. This guide specifies the format required.
Three types of data are supported: tabular, image and text.
Each dataset must have an accompanying metadata.json
file in its directory, specifying two items: datatype
and operations
{
"datatype": "tabular", # tabular, image or text
"operations": {...}
}
Supported Operation options for Tabular data:
seed: int = 42
boost_iter: int = 100
thread_count: int = None
Supported Operation options for Image data:
use_grayscale: bool = True
use_alpha: bool = False
use_deepaugment: bool = True
Supported Operation options for Text data:
max_df: int = 30000
max_features: int = 1000
strip_accents: str = 'unicode'
keep_html: bool = False
keep_contractions: bool = False
keep_punctuations: bool = False
keep_numbers: bool = False
keep_stopwords: bool = False
spellcheck: bool = True
lemmatize: bool = True
Operations are automatically applied to participating datasets, with the default values shown above. Participants may specify custom values as required.
In addition, each type of data has further requirements:
I. Tabular Data
In current version of Synergos, all tabular data must be stored as a
.csv
file and have aschema.json
file declared alongside it, which contains all the datatype mappings of each feature of the dataset.- Example of a tabular dataset in a directory:
example/ |-- metadata.json |-- tabular_data.csv |-- schema.json ...
- Content of the example schema.json file
{ "age": "int32", "sex": "category", "cp": "category", "trestbps": "int32", "chol": "int32", "fbs": "category", "restecg": "category", "thalach": "int32", "exang": "category", "oldpeak": "float64", "slope": "category", "ca": "category", "thal": "category", "target": "category" }
- Content of the data file in
.csv
format:
- Example of a tabular dataset in a directory:
For classification models, the target variable must be named
target
and has typecategory
. Thetarget
variable must take numeric values to represent each class label. String labels are not supported in current version.For prediction tasks, the target variable must be named
target
and be numeric.It is strongly recommended that users should minimise the presence of null values in the data. Features with large percentage of null values should be dropped. This is because null values will incur greater computation cost when Synergos is conducting alignment of data across multiple parties before federated training starts.
II. Image Data
- Store all images for one task in one folder. In each folder, it has
metadata.json
andmapping.csv
, alongside the image files. - Common image types (eg.
.png
,.gif
,.jpg
etc.) are supported. - The file
mapping.csv
maps each image file in the folder to its label, with two fieldsimage
andtarget
.image
contains the image file names;target
contains the true class/category of the corresponding image file. MNIST dataset as an example:
example/ |-- mapping.csv |-- metadata.json |-- 31369.png |-- 31392.png |-- 31410.png |-- 31512.png |-- 31513.png |-- 31529.png |-- 31537.png |-- 31570.png ...
Corresponding
mapping.csv
for this dataset shall be:image, target 31369.png, 0 31392.png, 0 31410.png, 0 31512.png, 2 31513.png, 2 31529.png, 1 31537.png, 1 31570.png, 1
III. Text Data
- In current version, corpora must be declared as one or more
.csv
files with two fields (i.e.['text', 'target']
) - The target variable must be named
target
and have typecategory
. Example text data:
``` example/ |-- metadata.json |-- text_corpus.csv ... ```
content of the example text_corpus.csv
Guide to Declare Data Tags
Although participants are contributing data, they are not exposing data to other parties. So they are declaring tags
of their data to the project they registered above. When starting a training, evaluation or prediction jobs, the specific data to be used is declared using data tags.
In Synergos, usually all the data a worker container has is at the /worker/data
mount point. Data tag is a Python list, whose elements are all the sub-directory names visited if traversing from /worker/data/
to the sub-directory where the chosen dataset resides (where its corresponding metadata.json
resides).
For example, whem starting a worker container with docker container run
, it is specified with a flag -v /tabular:/worker/data
. And in the corresponding VM file system, the folder /tabular
has the following structure.
/tabular
/2019
/train
metadata.json
tabular_data.csv
schema.json
/evaluate
metadata.json
tabular_data.csv
schema.json
/2020
/train
metadata.json
tabular_data.csv
schema.json
/evaluate
metadata.json
tabular_data.csv
schema.json
/predict
metadata.json
tabular_data.csv
schema.json
Now, for this given worker, to perform training using data in folder /tabular/2019/train
and evaluation using data in folder /tabular/2019/evaluate
, specify the following tags in the Synergos Driver script:
driver.tags.create(
collab_id = "collaboration_1",
project_id="project_1",
participant_id="participant_1",
train=[["2019", "train"]], # data used in training
evaluate=[["2019", "evaluate"]], # data used to evaluate model performance
)
or the following JSON string in the Synergos GUI:
{
"train": [["2019", "train"]],
"evaluate": [["2019", "evaluate"]]
}
To perform training and evaluation using data from both 2019
and 2020
, specify the following tags in the Synergos Driver script:
driver.tags.create(
collab_id = "collaboration_1",
project_id="project_1",
participant_id="participant_1",
train=[["2019", "train"],
["2020", "train"]], # data used in training
evaluate=[["2019", "evaluate"],
["2020", "train"]], # data used to evaluate model performance
)
or the following JSON string in the Synergos GUI:
{
"train": [
["2019", "train"],
["2020", "train"]
],
"evaluate": [
["2019", "evaluate"],
["2020", "train"]
]
}