Dataset¶
DataAttribute¶
class matorage.DataAttribute(name, type, shape, itemsize=0)[source]¶
DataAttribute class.
- Parameters
  - name (string, required) – Data attribute name.
  - type (string, required) – Data attribute type. Choose from string, bool, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64.
  - shape (tuple, required) – Data attribute shape. For example, if you specify a shape of (2, 2), you can store arrays of shape (B, 2, 2).
  - itemsize (integer, optional, defaults to 0) – Item size in bytes for a string-type attribute. Must be set when type is string.
Examples:
>>> from matorage import DataAttribute
>>> attribute = DataAttribute('array', 'uint8', (2, 2))
>>> attribute.name
'array'
>>> attribute.shape
(2, 2)
>>> attribute.type
UInt8Atom(shape=(), dflt=0)
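Because string-type attributes require itemsize, a string attribute is declared as in the sketch below. This is illustrative only: the shape and the itemsize value of 32 are assumptions, with itemsize read as the maximum byte length of each stored string.

>>> from matorage import DataAttribute
>>> # itemsize is required for 'string': here, at most 32 bytes per item
>>> attribute = DataAttribute('label', 'string', (1,), itemsize=32)
>>> attribute.name
'label'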
to_dict()[source]¶
Serializes this instance to a Python dictionary.
- Returns
  Dictionary of all the attributes that make up this configuration instance.
- Return type
  Dict[str, any]
Examples:
>>> from matorage import DataAttribute
>>> attribute = DataAttribute('array', 'uint8', (2, 2))
>>> attribute.to_dict()
{'name': 'array', 'type': 'uint8', 'shape': (2, 2)}
DataConfig¶
class matorage.DataConfig(**kwargs)[source]¶
Dataset configuration class. This class extends StorageConfig.
- Parameters
  - endpoint (string, required) – S3 object storage endpoint, or a NAS folder path if you use a NAS setting.
  - access_key (string, optional, defaults to None) – Access key for the object storage endpoint. (Optional if you need anonymous access.)
  - secret_key (string, optional, defaults to None) – Secret key for the object storage endpoint. (Optional if you need anonymous access.)
  - secure (boolean, optional, defaults to False) – Set this value to True to enable secure (HTTPS) access. (Defaults to False, unlike the original MinIO.)
  - max_object_size (integer, optional, defaults to 10MB) – Each object file is split into chunks of at most max_object_size and stored.
  - dataset_name (string, required) – Dataset name.
  - attributes (list, required) – List of DataAttribute entries describing the data attributes.
  - additional (dict, optional, defaults to {}) – Parameters for additional description of the dataset. The keys and values of the dictionary can be specified freely.
  - compressor (dict, optional, defaults to {"complevel" : 0, "complib" : "zlib"}) – Data compressor option: a dict with complevel and complib as keys. For further reference, read PyTables' Filters documentation.
    - complevel (integer, defaults to 0) – Compression level (0–9). The larger the number, the stronger the compression.
    - complib (string, defaults to 'zlib') – Compression library. Choose from zlib, lzo, bzip2, blosc.
Examples:
from matorage import DataConfig, DataAttribute

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)
data_config.to_json_file('data_config.json')
data_config2 = DataConfig.from_json_file('data_config.json')
If you have a NAS (network-attached storage) setup, you can save and load faster by using a NAS folder path as the endpoint.
Examples:
from matorage import DataConfig

# NAS example
data_config = DataConfig(
    endpoint='~/shared',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)
to_dict()[source]¶
Serializes this instance to a Python dictionary.
- Returns
  Dictionary of all the attributes that make up this configuration instance.
- Return type
  Dict[str, any]
classmethod from_json_file(json_file)[source]¶
Constructs a config from the path to a JSON file of parameters.
- Parameters
  - json_file (string) – Path to the JSON file containing the parameters.
- Returns
  An instance of a configuration object.
- Return type
  DataConfig
property get_length¶
Get the length of the dataset in DataConfig.
- Returns
  Length of the dataset.
- Return type
  integer
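A brief usage sketch (assuming the configuration file from the example above and a dataset that has already been written; the printed value depends on what was saved):

from matorage import DataConfig

data_config = DataConfig.from_json_file('data_config.json')
# Total number of stored samples.
print(data_config.get_length)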
DataSaver¶
class matorage.DataSaver(config, multipart_upload_size=5242880, num_worker_threads=4, inmemory=False, refresh=False)[source]¶
This class must be created independently per process. Each process uses multiple threads to upload to storage and generates unique metadata when an upload completes. Data is appended to a file; when the file exceeds a certain size, it is pushed to the upload queue and closed, and a new file is created. After saving, you should disconnect the data saver.
To make this procedure easier to understand, it is written below as pseudo-code.
per_one_batch_data_size = array_size // num_batch
per_one_file_batch_size = max_object_size // per_one_batch_data_size
for batch_idx in range(num_batch):
    if get_current_stored_batch_size() < per_one_file_batch_size:
        file.append(data[batch_idx])
    else:
        file_closing()
        new_file is opened
        new_file.append(data[batch_idx])
All files are closed.
Note
Deep learning framework type: All (pure Python is also possible)
All processes should call the constructor of this class independently.
After saving is finished, you must disconnect through the disconnect function.
- Parameters
  - config (matorage.DataConfig, required) – A DataConfig instance object.
  - multipart_upload_size (integer, optional, defaults to 5 * 1024 * 1024) – Size of each incompletely uploaded object part. You can sync files faster with multipart upload in MinIO, because MinIO clients use multi-threading, which improves IO speed regardless of Python's Global Interpreter Lock (GIL).
  - num_worker_threads (integer, optional, defaults to 4) – Number of backend storage workers for uploading or downloading.
  - inmemory (boolean, optional, defaults to False) – If True, the HDF5_CORE driver is used, so the temporary file for uploading to or downloading from backend storage, such as MinIO, is kept in memory rather than on disk. Keep in mind that memory is fast because it avoids disk IO, but it is not always the better choice. With the default (False), the HDF5_SEC2 driver is used on POSIX systems (or HDF5_WINDOWS on Windows).
  - refresh (boolean, optional, defaults to False) – If True, all existing data is erased and overwritten.
Single Process example
Examples:
import numpy as np
from tqdm import tqdm
from matorage import DataConfig, DataSaver

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

data_saver = DataSaver(config=data_config)
row = 100
data = np.random.rand(64, 3, 224, 224)

for _ in tqdm(range(row)):
    data_saver({
        'array' : data
    })
data_saver.disconnect()
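Because every process must construct its own DataSaver (see the note above), a multi-process run can follow the sketch below. The worker function and the two-process layout are illustrative assumptions, not part of the matorage API:

import numpy as np
from multiprocessing import Process
from matorage import DataConfig, DataSaver

def worker(data_config):
    # Each process creates its own DataSaver independently.
    data_saver = DataSaver(config=data_config)
    data_saver({'array': np.random.rand(64, 3, 224, 224)})
    # Each process disconnects its own saver when done.
    data_saver.disconnect()

if __name__ == '__main__':
    data_config = DataConfig(
        endpoint='127.0.0.1:9000',
        access_key='minio',
        secret_key='miniosecretkey',
        dataset_name='array_test',
        attributes=[
            ('array', 'uint8', (3, 224, 224)),
        ]
    )
    processes = [Process(target=worker, args=(data_config,)) for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()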
__call__(datas, filetype=False)[source]¶
- Parameters
  - datas (Dict[str, numpy.ndarray] or Dict[str, str], required) – If filetype is False, datas is of type Dict[str, numpy.ndarray], where each value is a numpy.ndarray of shape (B, *), B being the batch size. If filetype is True, datas is of type Dict[str, str], where each value is a file path.
  - filetype (boolean, optional) – Indicates whether the data to be added to this bucket is a plain file.
Examples:
data_saver = DataSaver(config=data_config)
data_saver({
    'image' : np.random.rand(16, 28, 28),
    'target' : np.random.rand(16)
})
When used as shown below, filetype data is saved with a key called <bucket_name>/raw_image.
Examples:
data_saver = DataSaver(config=data_config)
data_saver({
    'raw_image' : 'test.jpg'
}, filetype=True)
print(data_config.get_filetype_list)
property get_downloaded_dataset¶
Get local paths of the downloaded dataset in local storage.
- Returns
  Local paths of the downloaded datasets.
- Return type
  list
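A minimal usage sketch (an assumption about typical use: the property is read after data has been saved, and the returned paths depend on your local cache):

data_saver = DataSaver(config=data_config)
data_saver({
    'image' : np.random.rand(16, 28, 28),
    'target' : np.random.rand(16)
})
# Local file paths backing the dataset on this machine.
print(data_saver.get_downloaded_dataset)
data_saver.disconnect()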
disconnect()[source]¶
Disconnects the data saver: closes all opened files and uploads them to backend storage. Must be called after saving to store data safely.
Examples:
data_saver = DataSaver(config=data_config)
data_saver({
    'image' : np.random.rand(16, 28, 28),
    'target' : np.random.rand(16)
})
data_saver.disconnect()
torch.Dataset¶
class matorage.torch.Dataset(config, **kwargs)[source]¶
Dataset class for the PyTorch Dataset.
This class is customized for PyTorch datasets, so it operates by the following procedure:

- The _object_file_mapper manages the MinIO object name as key and the downloaded local path as value. When a MinIO object is downloaded, it is recorded in _object_file_mapper.
- We read _object_file_mapper and download only the new objects that are not in it.
- __getitem__ fetches numpy data from local storage by data index.
- Parameters
  - config (matorage.DataConfig, required) – Dataset configuration.
  - num_worker_threads (int, optional, defaults to 4) – Number of backend storage workers for uploading or downloading.
  - clear (boolean, optional, defaults to True) – Delete all files stored in local storage after the program finishes.
  - cache_folder_path (str, optional, defaults to ~/.matorage) – Cache folder path used to check which files have finished downloading.
  - index (boolean, optional, defaults to False) – Setting for index mode.
Examples:
from matorage import DataConfig
from matorage.torch import Dataset
from torch.utils.data import DataLoader

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

dataset = Dataset(config=data_config, clear=True)

# iterative mode
for array in DataLoader(dataset):
    print(array)

# index mode
print(dataset[0])
tensorflow.Dataset¶
class matorage.tensorflow.Dataset(config, batch_size=1, **kwargs)[source]¶
Dataset class for the TensorFlow Dataset.
This class is customized for TensorFlow datasets, so it operates by the following procedure:

- The _object_file_mapper manages the MinIO object name as key and the downloaded local path as value. When a MinIO object is downloaded, it is recorded in _object_file_mapper.
- We read _object_file_mapper and download only the new objects that are not in it.
- On TensorFlow v2 (>= 2.2.0), we use tfio.IODataset.from_hdf5 with parallel interleave for faster loading.
- Parameters
  - config (matorage.DataConfig, required) – Dataset configuration.
  - num_worker_threads (int, optional, defaults to 4) – Number of backend storage workers for uploading or downloading.
  - clear (boolean, optional, defaults to True) – Delete all files stored in local storage after the program finishes.
  - cache_folder_path (str, optional, defaults to ~/.matorage) – Cache folder path used to check which files have finished downloading.
  - index (boolean, optional, defaults to False) – Setting for index mode.
  - batch_size (integer, optional, defaults to 1) – How many samples per batch to load.
  - shuffle (boolean, optional, defaults to False) – Set to True to have the data reshuffled at every epoch.
  - seed (integer, optional, defaults to 0) – Random seed used to shuffle the sampler if shuffle=True.
Examples:
from matorage import DataConfig
from matorage.tensorflow import Dataset

data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='minio',
    secret_key='miniosecretkey',
    dataset_name='array_test',
    attributes=[
        ('array', 'uint8', (3, 224, 224)),
    ]
)

dataset = Dataset(config=data_config, clear=True)

# iterative mode
for array in dataset.dataloader:
    print(array)

# index mode
print(dataset[0])
property filenames¶
Get filenames (absolute file paths) in local storage.
- Returns
  Filenames (absolute file paths) in local storage.
- Return type
  list
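A brief usage sketch (the actual paths depend on your cache folder):

dataset = Dataset(config=data_config, clear=True)
# Absolute paths of the HDF5 files downloaded to local storage.
print(dataset.filenames)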
property dataloader¶
Get an iterative dataloader.
- Returns
  Iterative tf.data.Dataset.
- Return type
  InterleaveDataset