Dataset

DataAttribute

class matorage.DataAttribute(name, type, shape, itemsize=0)[source]

Class defining a single data attribute (its name, type, and shape).

Parameters
  • name (string, required) – data attribute name.

  • type (string, required) – data attribute type. Choose one of string, bool, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64.

  • shape (tuple, required) – data attribute shape. For example, if you specify a shape of (2, 2), you can store arrays of shape (B, 2, 2), where B is the batch size.

  • itemsize (integer, optional, defaults to 0) – item size in bytes for a string-type attribute. Must be set when type is string.

Examples:

>>> from matorage import DataAttribute
>>> attribute = DataAttribute('array', 'uint8', (2, 2))
>>> attribute.name
'array'
>>> attribute.shape
(2, 2)
>>> attribute.type
UInt8Atom(shape=(), dflt=0)
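
For a string-type attribute, itemsize must be set. A minimal sketch of such an attribute, following the constructor described above (the names here are illustrative):

>>> str_attribute = DataAttribute('label', 'string', (1,), itemsize=32)
>>> str_attribute.name
'label'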
to_dict()[source]

Serializes this instance to a Python dictionary.

Returns

Dictionary of all the attributes that make up this configuration instance

Return type

Dict[str, any]

Examples:

>>> from matorage import DataAttribute
>>> attribute = DataAttribute('array', 'uint8', (2, 2))
>>> attribute.to_dict()
{'name': 'array', 'type': 'uint8', 'shape': (2, 2)}

DataConfig

class matorage.DataConfig(**kwargs)[source]

Dataset configuration class. This class extends StorageConfig.

Parameters
  • endpoint (string, required) – S3 object storage endpoint, or a NAS folder path if a NAS setting is used.

  • access_key (string, optional, defaults to None) – access key for the object storage endpoint. (Optional for anonymous access.)

  • secret_key (string, optional, defaults to None) – secret key for the object storage endpoint. (Optional for anonymous access.)

  • secure (boolean, optional, defaults to False) – set this value to True to enable secure (HTTPS) access. (Unlike the original MinIO, this defaults to False.)

  • max_object_size (integer, optional, defaults to 10MB) – data is split into object files of at most max_object_size bytes before being stored.

  • dataset_name (string, required) – dataset name.

  • attributes (list, required) – list of DataAttribute instances (or (name, type, shape) tuples) describing the data attributes.

  • additional (dict, optional, defaults to {}) – parameters for additional description of the dataset. The keys and values of the dictionary can be specified freely.

  • compressor (dict, optional, defaults to {"complevel" : 0, "complib" : "zlib"}) –

    Data compressor options, given as a dict with complevel and complib keys. For further reference, read PyTables' Filters.

    • complevel (integer, defaults to 0) : compression level (0–9). The larger the number, the stronger the compression.

    • complib (string, defaults to 'zlib') : compression library. Choose one of zlib, lzo, bzip2, blosc.

Examples:

from matorage import DataConfig, DataAttribute
data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)

data_config.to_json_file('data_config.json')
data_config2 = DataConfig.from_json_file('data_config.json')

If you have a NAS (network-attached storage) setup, you can save and load faster by using a NAS folder path as the endpoint.

Examples:

from matorage import DataConfig

# NAS example
data_config = DataConfig(
    endpoint='~/shared',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)
to_dict()[source]

Serializes this instance to a Python dictionary.

Returns

Dictionary of all the attributes that make up this configuration instance

Return type

Dict[str, any]
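
A minimal usage sketch, assuming the data_config instance built in the example above (the exact keys of the returned dictionary depend on the configuration):

config_dict = data_config.to_dict()
print(config_dict)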

classmethod from_json_file(json_file)[source]

Constructs a Config from the path to a JSON file of parameters.

Parameters

json_file (string) – Path to the JSON file containing the parameters.

Returns

An instance of a configuration object

Return type

DataConfig

property get_length

Get the length of the dataset described by this DataConfig

Returns

length of dataset

Return type

integer
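
A minimal usage sketch of this property (the value reflects how many samples have been saved under this configuration):

print(data_config.get_length)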

DataSaver

class matorage.DataSaver(config, multipart_upload_size=5242880, num_worker_threads=4, inmemory=False, refresh=False)[source]

This class must be created independently per process. Each independent process uses multiple threads to upload to storage and generates unique metadata information when the upload is complete. Data is appended to a file; when the file exceeds a certain size, it is pushed to the upload queue, closed, and a new file is created. After saving, you should disconnect the data saver.

To make this procedure easier to understand, it is written below as pseudo-code.

per_one_batch_data_size = array_size // num_batch
per_one_file_batch_size = max_object_size // per_one_batch_data_size
for batch_idx in range(num_batch):
    if get_current_stored_batch_size() < per_one_file_batch_size:
        file.append(data[batch_idx])
    else:
        file.close()               # full file is pushed to the upload queue
        file = open_new_file()
        file.append(data[batch_idx])
# finally, all remaining files are closed and uploaded

Note

  • Deep Learning Framework Type : All (pure Python is also possible)

  • All processes should call the constructors of this class independently.

  • After data save is over, you must disconnect through the disconnect function.

Parameters
  • config (matorage.DataConfig, required) – a DataConfig instance object.

  • multipart_upload_size (integer, optional, defaults to 5 * 1024 * 1024) – size of each part for multipart upload of an incompletely uploaded object. You can sync files faster with multipart upload in MinIO, because MinIO clients use multi-threading, which improves IO speed regardless of Python's Global Interpreter Lock (GIL).

  • num_worker_threads (integer, optional, defaults to 4) – number of backend storage workers for uploads and downloads.

  • inmemory (boolean, optional, defaults to False) – if True, the HDF5_CORE driver is used, so the temporary file for uploading to or downloading from backend storage (such as MinIO) is kept in memory instead of on disk. Keep in mind that memory is fast because it avoids disk IO, but it is not always the better choice. With the default (False), the HDF5_SEC2 driver is used on POSIX systems (HDF5_WINDOWS on Windows).

  • refresh (boolean, optional, defaults to False) – if True, all existing data is erased and overwritten.

Single Process example

Examples:

import numpy as np
from tqdm import tqdm
from matorage import DataConfig, DataSaver

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

data_saver = DataSaver(config=data_config)
row = 100
data = np.random.randint(0, 255, size=(64, 3, 224, 224), dtype=np.uint8)  # match the 'array' attribute dtype

for _ in tqdm(range(row)):
    data_saver({
        'array' : data
    })

data_saver.disconnect()
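
For faster saving when the temporary files fit in memory, a hedged sketch using the inmemory and refresh options described above:

data_saver = DataSaver(config=data_config, inmemory=True, refresh=True)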
__call__(datas, filetype=False)[source]
Parameters
  • datas (Dict[str, numpy.ndarray] or Dict[str, str], required) – if filetype is False, datas is of type Dict[str, numpy.ndarray], where each value is a numpy.ndarray of shape (B, *) and B is the batch size. If filetype is True, datas is of type Dict[str, str], where each value is a file path.

  • filetype (boolean, optional) – indicates whether the data to be added to this bucket is a plain file.

Examples:

data_saver = DataSaver(config=data_config)
data_saver({
    'image' : np.random.rand(16, 28, 28),
    'target' : np.random.rand(16)
})

When used as shown below with filetype=True, the file data is saved under a key named <bucket_name>/raw_image.

Examples:

data_saver = DataSaver(config=data_config)
data_saver({
    'raw_image' : 'test.jpg'
}, filetype=True)
print(data_config.get_filetype_list)
property get_downloaded_dataset

Get local paths of the downloaded dataset in local storage

Returns

local path of downloaded datasets

Return type

list
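
A minimal usage sketch of this property:

print(data_saver.get_downloaded_dataset)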

disconnect()[source]

Disconnect the data saver: close all opened files and upload them to backend storage. Must be called after saving to store data safely.

Examples:

data_saver = DataSaver(config=data_config)
data_saver({
    'image' : np.random.rand(16, 28, 28),
    'target' : np.random.rand(16)
})
data_saver.disconnect()

torch.Dataset

class matorage.torch.Dataset(config, **kwargs)[source]

Dataset class for PyTorch.

This class is customized for PyTorch datasets and operates by the following procedure.

  1. The _object_file_mapper maps each MinIO object (key) to its downloaded local path (value). When a MinIO object is downloaded, it is recorded in _object_file_mapper.

  2. We read _object_file_mapper and download only new objects that are not recorded there yet.

  3. __getitem__ returns numpy data from the local files by data index.

Parameters
  • config (matorage.DataConfig, required) – dataset configuration.

  • num_worker_threads (int, optional, defaults to 4) – number of backend storage workers for uploads and downloads.

  • clear (boolean, optional, defaults to True) – delete all files stored on local storage after the program finishes.

  • cache_folder_path (str, optional, defaults to ~/.matorage) – cache folder path used to check which files have been completely downloaded.

  • index (boolean, optional, defaults to False) – setting for index mode.

Examples:

from matorage import DataConfig
from matorage.torch import Dataset
from torch.utils.data import DataLoader

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

dataset = Dataset(config=data_config, clear=True)

# iterative mode
for array in DataLoader(dataset):
    print(array)

# index mode
print(dataset[0])
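
A hedged sketch of batching with a standard DataLoader; the shape in the comment assumes the single 'array' attribute configured above:

loader = DataLoader(dataset, batch_size=64, num_workers=2, shuffle=True)
for batch in loader:
    # each batch is assumed to contain the 'array' attribute batched to (64, 3, 224, 224)
    pass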

tensorflow.Dataset

class matorage.tensorflow.Dataset(config, batch_size=1, **kwargs)[source]

Dataset class for TensorFlow.

This class is customized for TensorFlow datasets and operates by the following procedure.

  1. The _object_file_mapper maps each MinIO object (key) to its downloaded local path (value). When a MinIO object is downloaded, it is recorded in _object_file_mapper.

  2. We read _object_file_mapper and download only new objects that are not recorded there yet.

  3. If TensorFlow v2 (>= 2.2.0) is used, tfio.IODataset.from_hdf5 with parallel interleaving is used for faster loading.

Parameters
  • config (matorage.DataConfig, required) – dataset configuration.

  • num_worker_threads (int, optional, defaults to 4) – number of backend storage workers for uploads and downloads.

  • clear (boolean, optional, defaults to True) – delete all files stored on local storage after the program finishes.

  • cache_folder_path (str, optional, defaults to ~/.matorage) – cache folder path used to check which files have been completely downloaded.

  • index (boolean, optional, defaults to False) – setting for index mode.

  • batch_size (integer, optional, defaults to 1) – how many samples per batch to load.

  • shuffle (boolean, optional, defaults to False) – set to True to have the data reshuffled at every epoch.

  • seed (integer, optional, defaults to 0) – random seed used for shuffling if shuffle=True.

Examples:

from matorage import DataConfig
from matorage.tensorflow import Dataset

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

dataset = Dataset(config=data_config, clear=True)

# iterative mode
for array in dataset.dataloader:
    print(array)

# index mode
print(dataset[0])
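
A hedged sketch using the batching and shuffling options described above:

dataset = Dataset(config=data_config, batch_size=32, shuffle=True, seed=42, clear=True)
for array in dataset.dataloader:
    print(array)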
property filenames

Get filenames (absolute file paths) in local storage

Returns

filenames (absolute file paths) in local storage

Return type

list
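
A minimal usage sketch of this property:

print(dataset.filenames)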

property dataloader

Get an iterable dataloader

Returns

an iterable tf.data.Dataset

Return type

InterleaveDataset