# Datasets

## Create a compatible dataset

To use this package, you need to provide the datasets yourself. Only two formats are accepted so far:

- HDF5 file (`.h5`). These files are used for binary datasets. The file must contain a `"samples"` key referring to a `num_samples x num_dimensions` float matrix. Optionally, a `"labels"` key can hold a vector of numerical labels associated with the samples.
- FASTA file (`.fasta`). These files are used for protein datasets.

An example of the code used to create a valid HDF5 dataset in Python:

```python
import h5py
import numpy as np

num_samples = 100
num_dimensions = 27
# The "samples" matrix has shape (num_samples, num_dimensions)
data = np.random.randn(num_samples, num_dimensions)

# begin optional
num_classes = 3
# One integer label per sample
labels = np.random.randint(0, num_classes, size=(num_samples,))
# end optional

with h5py.File("my_dataset.h5", "w") as f:
    f["samples"] = data
    # begin optional
    f["labels"] = labels
    # end optional
```

## Command line arguments

When running a script that requires a dataset, the following flags are available:

- `-d` or `--data`: The path to the dataset (e.g. `-d ./data/dataset.h5`).
- `--subset_labels`: For datasets with labels, selects only the samples matching the specified labels. For example, setting `-d MNIST --subset_labels 0 1` will load the $0$ and $1$ digits of the MNIST dataset (what we refer to as the `MNIST-01` dataset). If specified for a dataset without labels, the full dataset is selected.
- `--train_size`: The proportion of the dataset to use as the training dataset. It can range from $0$ to $1$, and the default is $0.6$.
- `--test_size`: Same as above, but for the test dataset. Defaults to `1 - train_size`.
- `--use_weights`: Compute the weights for protein sequences.
- `--alphabet`: One of `{protein,rna,dna}`. Depends on the type of FASTA file you are using.
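
To make the meaning of `--subset_labels`, `--train_size` and `--test_size` more concrete, here is a minimal NumPy sketch of the selection and split they describe. It is not the package's actual implementation: the helper name `subset_and_split`, the random shuffling, the `seed` parameter and the rounding of split sizes are all assumptions made for illustration.

```python
import numpy as np


def subset_and_split(samples, labels=None, subset_labels=None,
                     train_size=0.6, test_size=None, seed=0):
    """Hypothetical helper mimicking --subset_labels, --train_size, --test_size."""
    samples = np.asarray(samples)

    # Keep only the samples whose label is requested.
    # Without labels, the full dataset is kept, as described for the flag.
    if labels is not None and subset_labels is not None:
        mask = np.isin(np.asarray(labels), subset_labels)
        samples = samples[mask]

    # The test proportion defaults to the remainder of the dataset.
    if test_size is None:
        test_size = 1.0 - train_size
    if train_size + test_size > 1.0 + 1e-12:
        raise ValueError("train_size + test_size must not exceed 1")

    # Assumption: samples are shuffled before splitting.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(samples))

    n_train = int(round(train_size * len(samples)))
    n_test = min(int(round(test_size * len(samples))), len(samples) - n_train)

    train = samples[perm[:n_train]]
    test = samples[perm[n_train:n_train + n_test]]
    return train, test


# Usage: 100 samples with labels 0, 1, 2; keep only labels 0 and 1.
samples = np.random.randn(100, 27)
labels = np.arange(100) % 3
train, test = subset_and_split(samples, labels, subset_labels=[0, 1])
print(train.shape, test.shape)
```

With the default proportions, the selected samples are divided 60/40 between the training and test sets; passing an explicit `test_size` smaller than `1 - train_size` leaves the remaining samples unused.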