Datasets
Create a compatible dataset
To use this package, you need to provide the datasets. Only two formats are accepted so far:
* HDF5 file (.h5). These files are used for binary datasets. The file should contain a "samples" key referring to a num_samples x num_dimensions float matrix. Optionally, a "labels" key can be set with a vector of numerical labels associated with the samples.
* FASTA file (.fasta). These files are used for protein datasets.
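No FASTA example is given here, so the following is only a minimal sketch, assuming the standard FASTA layout (a header line starting with > followed by the sequence). The file name, the identifiers and the toy sequences are made up for illustration, and whether the package expects aligned sequences is not stated in this section.

```python
# Hypothetical toy records: identifiers and residues are illustrative only.
sequences = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MKTAYLAKQR",
    "seq3": "MRTAYIAKQK",
}


def write_fasta(path, records):
    """Write a mapping of identifier -> sequence as a FASTA file."""
    with open(path, "w") as f:
        for name, seq in records.items():
            f.write(f">{name}\n{seq}\n")


def read_fasta(path):
    """Read a FASTA file back into a dict, joining wrapped sequence lines."""
    records = {}
    name = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                name = line[1:]
                records[name] = ""
            elif name is not None:
                records[name] += line
    return records


write_fasta("my_dataset.fasta", sequences)
```

Reading the file back with read_fasta is a quick way to confirm that every record was written as intended before passing the path to a script.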
An example of Python code that creates a valid HDF5 dataset:
import h5py
import numpy as np
num_samples = 100
num_dimensions = 27
data = np.random.randn(num_samples, num_dimensions)
# begin optional
num_classes = 3
labels = np.random.randint(0, num_classes, size=(num_samples,))
# end optional
with h5py.File("my_dataset.h5", "w") as f:
    f["samples"] = data
    # begin optional
    f["labels"] = labels
    # end optional
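Before handing a file to a script, it can help to check that it follows the layout described above. The helper below is a sketch, not part of the package: check_dataset is a hypothetical name, and the checks encode only what this section states (a 2-D float "samples" matrix and an optional numerical "labels" vector with one entry per sample). It accepts an open h5py.File or any mapping from key to array, so a plain dict stands in for the file in the demo.

```python
import numpy as np


def check_dataset(store):
    """Validate the dataset layout and return (num_samples, num_dimensions).

    `store` is an open h5py.File, or any mapping from key to array-like.
    """
    if "samples" not in store:
        raise KeyError('missing required "samples" key')
    samples = np.asarray(store["samples"])
    if samples.ndim != 2:
        raise ValueError(f'"samples" must be 2-D, got shape {samples.shape}')
    if not np.issubdtype(samples.dtype, np.floating):
        raise TypeError(f'"samples" must be float, got {samples.dtype}')
    if "labels" in store:  # labels are optional
        labels = np.asarray(store["labels"])
        if labels.shape != (samples.shape[0],):
            raise ValueError('"labels" must hold one entry per sample')
        if not np.issubdtype(labels.dtype, np.number):
            raise TypeError(f'"labels" must be numerical, got {labels.dtype}')
    return samples.shape


# A dict stands in for the HDF5 file so the sketch runs without touching disk.
# With a real file: with h5py.File("my_dataset.h5", "r") as f: check_dataset(f)
demo = {
    "samples": np.random.randn(100, 27),
    "labels": np.random.randint(0, 3, size=(100,)),
}
shape = check_dataset(demo)
```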
Command line arguments
When running a script that requires a dataset, the following flags are available:
* -d or --data: The path to the dataset (e.g. -d ./data/dataset.h5).
* --subset_labels: For datasets with labels, selects only the samples whose label matches one of the specified values. For example, -d MNIST --subset_labels 0 1 loads the \(0\) and \(1\) digits of the MNIST dataset (what we refer to as the MNIST-01 dataset). If specified for a dataset without labels, the full dataset is selected.
* --train_size: The proportion of the dataset to use as the training dataset. It can go from \(0\) to \(1\), and the default is \(0.6\).
* --test_size: The proportion of the dataset to use as the test dataset. Defaults to \(1 -\) train size.
* --use_weights: Compute the weights for protein sequences.
* --alphabet: One of {protein,rna,dna}. Depends on the type of FASTA file you are using.
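The combined effect of --subset_labels, --train_size and --test_size can be illustrated with a short sketch. This is not the package's implementation: select_and_split is a hypothetical helper, and the shuffling, the seed and the rounding of split sizes are assumptions made only so that the example runs.

```python
import numpy as np


def select_and_split(samples, labels=None, subset_labels=None,
                     train_size=0.6, test_size=None, seed=0):
    """Filter by label, then split into train and test parts (illustrative)."""
    if test_size is None:
        test_size = 1.0 - train_size  # documented default
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size + test_size must not exceed 1")
    # Without labels the full dataset is kept, as described above.
    if labels is not None and subset_labels is not None:
        mask = np.isin(labels, subset_labels)
        samples = samples[mask]
    n = len(samples)
    order = np.random.default_rng(seed).permutation(n)  # assumed shuffle
    n_train = int(round(train_size * n))
    n_test = min(int(round(test_size * n)), n - n_train)
    train = samples[order[:n_train]]
    test = samples[order[n_train:n_train + n_test]]
    return train, test


samples = np.random.randn(100, 27)
labels = np.arange(100) % 3  # toy labels 0, 1, 2

# Full dataset with the default 0.6 / 0.4 split.
train_all, test_all = select_and_split(samples, labels)

# Analogue of --subset_labels 0 1: keep only samples labelled 0 or 1.
train_01, test_01 = select_and_split(samples, labels, subset_labels=[0, 1])
```

In this sketch the train and test parts never overlap, because both are taken from disjoint slices of a single permutation.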