Datasets

Create a compatible dataset

To use this package, you need to provide a dataset. Two formats are currently accepted:

  • HDF5 file (.h5). This format is used for binary datasets. The file must contain a "samples" key referring to a num_samples x num_dimensions float matrix. Optionally, a "labels" key can hold a vector of numerical labels associated with the samples.

  • FASTA file (.fasta). This format is used for protein datasets.
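For the FASTA case, the following is a minimal sketch of the general FASTA layout: a header line starting with ">" followed by the sequence itself. The sequences, identifiers, and helper names (write_fasta, read_fasta) are hypothetical and only illustrate the file format; the package's own parsing rules are not described here and may be stricter.

```python
# Hypothetical toy sequences, for illustration only.
records = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MKTAHIAKQR",
}


def write_fasta(path, records):
    """Write {identifier: sequence} pairs, one FASTA record each."""
    with open(path, "w") as f:
        for name, seq in records.items():
            f.write(f">{name}\n{seq}\n")


def read_fasta(path):
    """Read a FASTA file back into an {identifier: sequence} dict."""
    sequences = {}
    name = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                name = line[1:]  # header line: start a new record
                sequences[name] = ""
            elif name is not None:
                sequences[name] += line  # sequence may span several lines
    return sequences


write_fasta("my_dataset.fasta", records)
loaded = read_fasta("my_dataset.fasta")
```

Reading the file back with read_fasta is a quick way to confirm that every record has a header and a non-empty sequence before passing the file to a script.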

An example of Python code that creates a valid HDF5 dataset:

import h5py
import numpy as np

num_samples = 100
num_dimensions = 27

data = np.random.randn(num_samples, num_dimensions)

# begin optional
num_classes = 3
labels = np.random.randint(0, num_classes, size=(num_samples,))
# end optional

with h5py.File("my_dataset.h5", "w") as f:
   f["samples"] = data

   # begin optional
   f["labels"] = labels
   # end optional
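Before passing an HDF5 file to a script, it can help to check that it follows the layout described above. The sketch below is one way to do that with h5py; check_dataset is a hypothetical helper, not part of the package, and the checks only mirror the requirements stated in this section (a 2-D float "samples" matrix and, if present, one label per sample).

```python
import os
import tempfile

import h5py
import numpy as np


def check_dataset(path):
    """Return (num_samples, num_dimensions) after basic layout checks."""
    with h5py.File(path, "r") as f:
        if "samples" not in f:
            raise KeyError('missing required "samples" key')
        samples = f["samples"]
        if samples.ndim != 2:
            raise ValueError('"samples" must be a 2-D matrix')
        if not np.issubdtype(samples.dtype, np.floating):
            raise TypeError('"samples" must have a float dtype')
        if "labels" in f:
            # Labels are optional, but there must be one per sample.
            if f["labels"].shape != (samples.shape[0],):
                raise ValueError('"labels" must have length num_samples')
        return samples.shape


# Demonstration on a small temporary file with the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with h5py.File(path, "w") as f:
        f["samples"] = np.random.randn(100, 27)
        f["labels"] = np.random.randint(0, 3, size=(100,))
    shape = check_dataset(path)
```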

Command line arguments

Scripts that require a dataset accept the following flags:

  • -d or --data The path to the dataset (e.g. -d ./data/dataset.h5).

  • --subset_labels For datasets with labels, selects only the samples whose label is in the given list. For example, setting -d MNIST --subset_labels 0 1 will load the \(0\) and \(1\) digits of the MNIST dataset (what we refer to as the MNIST-01 dataset). If specified for a dataset without labels, the full dataset is selected.

  • --train_size The proportion of the dataset to use as training dataset. It can go from \(0\) to \(1\) and the default is \(0.6\).

  • --test_size Same as above, but for the test dataset. Defaults to \(1 -\) the value of --train_size.

  • --use_weights Compute the weights for protein sequences.

  • --alphabet One of {protein,rna,dna}, depending on the type of FASTA file you are using.
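To make the defaults above concrete, here is a rough argparse sketch of how these flags could be declared and how the test size follows from the training size. It is not the package's actual parser: the argument types, the required -d flag, the integer labels, the store_true behaviour of --use_weights, and the names build_parser and resolve_sizes are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(description="Dataset flags (sketch)")
    parser.add_argument("-d", "--data", required=True,
                        help="path to the dataset")
    # Assumption: labels are integers and several can be listed.
    parser.add_argument("--subset_labels", nargs="*", type=int, default=None)
    parser.add_argument("--train_size", type=float, default=0.6)
    # None means "derive from train_size" (see resolve_sizes).
    parser.add_argument("--test_size", type=float, default=None)
    # Assumption: a boolean switch that is off unless given.
    parser.add_argument("--use_weights", action="store_true")
    parser.add_argument("--alphabet", choices=["protein", "rna", "dna"],
                        default=None)
    return parser


def resolve_sizes(args):
    """Apply the documented default: test_size = 1 - train_size."""
    if not 0.0 <= args.train_size <= 1.0:
        raise ValueError("--train_size must be between 0 and 1")
    if args.test_size is None:
        return args.train_size, 1.0 - args.train_size
    return args.train_size, args.test_size


# Equivalent to: -d MNIST --subset_labels 0 1
args = build_parser().parse_args(["-d", "MNIST", "--subset_labels", "0", "1"])
train_size, test_size = resolve_sizes(args)
```

With only -d and --subset_labels given, the training proportion stays at its default of 0.6 and the test proportion is derived from it.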