Dataset

DataAttribute

class matorage.DataAttribute(name, type, shape, itemsize=0)[source]

Class defining a single data attribute (its name, type, and shape).

Parameters
  • name (string, required) – data attribute name.

  • type (string, required) – data attribute type. Choose one of string, bool, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64.

  • shape (tuple, required) – data attribute shape. For example, if you specify a shape of (2, 2), you can store arrays of shape (B, 2, 2), where B is the batch size.

  • itemsize (integer, optional, defaults to 0) – item size in bytes for a string-type attribute. Must be set when type is string.

Examples:

>>> from matorage import DataAttribute
>>> attribute = DataAttribute('array', 'uint8', (2, 2))
>>> attribute.name
'array'
>>> attribute.shape
(2, 2)
>>> attribute.type
UInt8Atom(shape=(), dflt=0)
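
For a string-type attribute, itemsize must be set. A minimal sketch of such an attribute, following the constructor described above (the names here are illustrative):

>>> str_attribute = DataAttribute('label', 'string', (1,), itemsize=32)
>>> str_attribute.name
'label'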
to_dict()[source]

Serializes this instance to a Python dictionary.

Returns

Dictionary of all the attributes that make up this configuration instance

Return type

Dict[str, any]

Examples:

>>> from matorage import DataAttribute
>>> attribute = DataAttribute('array', 'uint8', (2, 2))
>>> attribute.to_dict()
{'name': 'array', 'type': 'uint8', 'shape': (2, 2)}

DataConfig

class matorage.DataConfig(**kwargs)[source]

Dataset configuration class. This class extends StorageConfig.

Parameters
  • endpoint (string, required) – S3 object storage endpoint, or a NAS folder path if a NAS setting is used.

  • access_key (string, optional, defaults to None) – access key for the object storage endpoint. (Optional for anonymous access.)

  • secret_key (string, optional, defaults to None) – secret key for the object storage endpoint. (Optional for anonymous access.)

  • secure (boolean, optional, defaults to False) – set this value to True to enable secure (HTTPS) access. (Unlike the original MinIO, this defaults to False.)

  • max_object_size (integer, optional, defaults to 10MB) – data is split into object files of at most max_object_size bytes before being stored.

  • dataset_name (string, required) – dataset name.

  • attributes (list, required) – list of DataAttribute instances (or (name, type, shape) tuples) describing the data attributes.

  • additional (dict, optional, defaults to {}) – parameters for additional description of the dataset. The keys and values of the dictionary can be specified freely.

  • compressor (dict, optional, defaults to {"complevel" : 0, "complib" : "zlib"}) –

    Data compressor options, given as a dict with complevel and complib keys. For further reference, read PyTables' Filters.

    • complevel (integer, defaults to 0) : compression level (0–9). The larger the number, the stronger the compression.

    • complib (string, defaults to 'zlib') : compression library. Choose one of zlib, lzo, bzip2, blosc.

Examples:

from matorage import DataConfig, DataAttribute
data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)

data_config.to_json_file('data_config.json')
data_config2 = DataConfig.from_json_file('data_config.json')

If you have a NAS (network-attached storage) setup, you can save and load faster by using a NAS folder path as the endpoint.

Examples:

from matorage import DataConfig

# NAS example
data_config = DataConfig(
    endpoint='~/shared',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)
to_dict()[source]

Serializes this instance to a Python dictionary.

Returns

Dictionary of all the attributes that make up this configuration instance

Return type

Dict[str, any]
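
A minimal usage sketch, assuming the data_config instance built in the example above (the exact keys of the returned dictionary depend on the configuration):

config_dict = data_config.to_dict()
print(config_dict)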

classmethod from_json_file(json_file)[source]

Constructs a Config from the path to a JSON file of parameters.

Parameters

json_file (string) – Path to the JSON file containing the parameters.

Returns

An instance of a configuration object

Return type

DataConfig

property get_length

Get the length of the dataset described by this DataConfig

Returns

length of dataset

Return type

integer
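
A minimal usage sketch of this property (the value reflects how many samples have been saved under this configuration):

print(data_config.get_length)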

DataSaver

class matorage.DataSaver(config, multipart_upload_size=5242880, num_worker_threads=4, inmemory=False, refresh=False)[source]

This class must be created independently per process. Each independent process uses multiple threads to upload to storage and generates unique metadata information when the upload is complete. Data is appended to a file; when the file exceeds a certain size, it is pushed to the upload queue, closed, and a new file is created. After saving, you should disconnect the data saver.

To make this procedure easier to understand, it is written below as pseudo-code.

per_one_batch_data_size = array_size // num_batch
per_one_file_batch_size = max_object_size // per_one_batch_data_size
for batch_idx in range(num_batch):
    if get_current_stored_batch_size() < per_one_file_batch_size:
        file.append(data[batch_idx])
    else:
        file.close()               # full file is pushed to the upload queue
        file = open_new_file()
        file.append(data[batch_idx])
# finally, all remaining files are closed and uploaded

Note

  • Deep Learning Framework Type : All (pure Python is also possible)

  • All processes should call the constructors of this class independently.

  • After data save is over, you must disconnect through the disconnect function.

Parameters
  • config (matorage.DataConfig, required) – a DataConfig instance object.

  • multipart_upload_size (integer, optional, defaults to 5 * 1024 * 1024) – size of each part for multipart upload of an incompletely uploaded object. You can sync files faster with multipart upload in MinIO, because MinIO clients use multi-threading, which improves IO speed regardless of Python's Global Interpreter Lock (GIL).

  • num_worker_threads (integer, optional, defaults to 4) – number of backend storage workers for uploads and downloads.

  • inmemory (boolean, optional, defaults to False) – if True, the HDF5_CORE driver is used, so the temporary file for uploading to or downloading from backend storage (such as MinIO) is kept in memory instead of on disk. Keep in mind that memory is fast because it avoids disk IO, but it is not always the better choice. With the default (False), the HDF5_SEC2 driver is used on POSIX systems (HDF5_WINDOWS on Windows).

  • refresh (boolean, optional, defaults to False) – if True, all existing data is erased and overwritten.

Single Process example

Examples:

import numpy as np
from tqdm import tqdm
from matorage import DataConfig, DataSaver

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

data_saver = DataSaver(config=data_config)
row = 100
data = np.random.randint(0, 255, size=(64, 3, 224, 224), dtype=np.uint8)  # match the 'array' attribute dtype

for _ in tqdm(range(row)):
    data_saver({
        'array' : data
    })

data_saver.disconnect()
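
For faster saving when the temporary files fit in memory, a hedged sketch using the inmemory and refresh options described above:

data_saver = DataSaver(config=data_config, inmemory=True, refresh=True)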
__call__(datas, filetype=False)[source]
Parameters
  • datas (Dict[str, numpy.ndarray] or Dict[str, str], required) – if filetype is False, datas is of type Dict[str, numpy.ndarray], where each value is a numpy.ndarray of shape (B, *) and B is the batch size. If filetype is True, datas is of type Dict[str, str], where each value is a file path.

  • filetype (boolean, optional) – indicates whether the data to be added to this bucket is a plain file.

Examples:

data_saver = DataSaver(config=data_config)
data_saver({
    'image' : np.random.rand(16, 28, 28),
    'target' : np.random.rand(16)
})

When used as shown below with filetype=True, the file data is saved under a key named <bucket_name>/raw_image.

Examples:

data_saver = DataSaver(config=data_config)
data_saver({
    'raw_image' : 'test.jpg'
}, filetype=True)
print(data_config.get_filetype_list)
property get_downloaded_dataset

Get local paths of the downloaded dataset in local storage

Returns

local path of downloaded datasets

Return type

list
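
A minimal usage sketch of this property:

print(data_saver.get_downloaded_dataset)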

disconnect()[source]

Disconnect the data saver: close all opened files and upload them to backend storage. Must be called after saving to store data safely.

Examples:

data_saver = DataSaver(config=data_config)
data_saver({
    'image' : np.random.rand(16, 28, 28),
    'target' : np.random.rand(16)
})
data_saver.disconnect()

torch.Dataset

class matorage.torch.Dataset(config, **kwargs)[source]

Dataset class for PyTorch.

This class is customized for PyTorch datasets and operates by the following procedure.

  1. The _object_file_mapper maps each MinIO object (key) to its downloaded local path (value). When a MinIO object is downloaded, it is recorded in _object_file_mapper.

  2. We read _object_file_mapper and download only new objects that are not recorded there yet.

  3. __getitem__ returns numpy data from the local files by data index.

Parameters
  • config (matorage.DataConfig, required) – dataset configuration.

  • num_worker_threads (int, optional, defaults to 4) – number of backend storage workers for uploads and downloads.

  • clear (boolean, optional, defaults to True) – delete all files stored on local storage after the program finishes.

  • cache_folder_path (str, optional, defaults to ~/.matorage) – cache folder path used to check which files have been completely downloaded.

  • index (boolean, optional, defaults to False) – setting for index mode.

Examples:

from matorage import DataConfig
from matorage.torch import Dataset
from torch.utils.data import DataLoader

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

dataset = Dataset(config=data_config, clear=True)

# iterative mode
for array in DataLoader(dataset):
    print(array)

# index mode
print(dataset[0])
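
A hedged sketch of batching with a standard DataLoader; the shape in the comment assumes the single 'array' attribute configured above:

loader = DataLoader(dataset, batch_size=64, num_workers=2, shuffle=True)
for batch in loader:
    # each batch is assumed to contain the 'array' attribute batched to (64, 3, 224, 224)
    pass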

tensorflow.Dataset

class matorage.tensorflow.Dataset(config, batch_size=1, **kwargs)[source]

Dataset class for TensorFlow.

This class is customized for TensorFlow datasets and operates by the following procedure.

  1. The _object_file_mapper maps each MinIO object (key) to its downloaded local path (value). When a MinIO object is downloaded, it is recorded in _object_file_mapper.

  2. We read _object_file_mapper and download only new objects that are not recorded there yet.

  3. If TensorFlow v2 (>= 2.2.0) is used, tfio.IODataset.from_hdf5 with parallel interleaving is used for faster loading.

Parameters
  • config (matorage.DataConfig, required) – dataset configuration.

  • num_worker_threads (int, optional, defaults to 4) – number of backend storage workers for uploads and downloads.

  • clear (boolean, optional, defaults to True) – delete all files stored on local storage after the program finishes.

  • cache_folder_path (str, optional, defaults to ~/.matorage) – cache folder path used to check which files have been completely downloaded.

  • index (boolean, optional, defaults to False) – setting for index mode.

  • batch_size (integer, optional, defaults to 1) – how many samples per batch to load.

  • shuffle (boolean, optional, defaults to False) – set to True to have the data reshuffled at every epoch.

  • seed (integer, optional, defaults to 0) – random seed used for shuffling if shuffle=True.

Examples:

from matorage import DataConfig
from matorage.tensorflow import Dataset

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

dataset = Dataset(config=data_config, clear=True)

# iterative mode
for array in dataset.dataloader:
    print(array)

# index mode
print(dataset[0])
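
A hedged sketch using the batching and shuffling options described above:

dataset = Dataset(config=data_config, batch_size=32, shuffle=True, seed=42, clear=True)
for array in dataset.dataloader:
    print(array)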
property filenames

Get filenames (absolute file paths) in local storage

Returns

filenames (absolute file paths) in local storage

Return type

list
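
A minimal usage sketch of this property:

print(dataset.filenames)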

property dataloader

Get an iterable dataloader

Returns

an iterable tf.data.Dataset

Return type

InterleaveDataset