BasicDataLoaders
Julia package providing a simple data loader to train machine learning systems.
The source code of the project is available on GitHub.
Authors
Lucas Ondel, Brno University of Technology, 2020
Installation
The package can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:
pkg> add BasicDataLoaders
API
The package provides a simple data loader object:
BasicDataLoaders.DataLoader — Type
struct DataLoader
data
batchsize
end
Constructor
DataLoader(data[, batchsize = 1, preprocess = x -> x,
preprocess_element = x -> x])
where data is a sequence of elements to iterate over, batchsize is the size of each batch, preprocess is a user-defined function applied to each batch, and preprocess_element is a user-defined function applied to each element of a batch. By default, preprocess and preprocess_element are both the identity function.
When iterating, the final batch may have a size smaller than batchsize.
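The batching behaviour described above can be illustrated without the package. The following is a minimal sketch in plain Julia, not the package's implementation: sketch_batches is a hypothetical helper built on Base.Iterators.partition, and it assumes that preprocess_element is applied to every element before preprocess is applied to the whole batch.

```julia
# Hypothetical helper, NOT part of BasicDataLoaders: it only mimics the
# batching behaviour described above with Base.Iterators.partition.
function sketch_batches(data; batchsize = 1, preprocess = identity,
                        preprocess_element = identity)
    # Assumption: each element is transformed first, then the whole batch.
    return [preprocess(map(preprocess_element, collect(chunk)))
            for chunk in Iterators.partition(data, batchsize)]
end

batches = sketch_batches(1:10, batchsize = 3)
println(batches)           # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
println(length.(batches))  # [3, 3, 3, 1]
```

The last batch holds the single remaining element because 10 is not a multiple of 3.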
DataLoader supports the iteration and indexing interfaces and, consequently, it can be used in distributed for loops.
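To show why indexing matters for parallel loops, here is a minimal sketch of a batch container that supports both interfaces. MiniLoader is a hypothetical type written for this illustration and is not the package's DataLoader. Once batches can be addressed by index, a loop can run over the range of batch indices, which a parallel loop can then split among workers.

```julia
# Hypothetical type, NOT the package's DataLoader: a minimal sketch of a
# batch container that supports both iteration and indexing.
struct MiniLoader{T<:AbstractVector}
    data::T
    batchsize::Int
end

# Number of batches; the last one may be smaller than `batchsize`.
Base.length(dl::MiniLoader) = cld(length(dl.data), dl.batchsize)

# Indexing: return the i-th batch.
function Base.getindex(dl::MiniLoader, i::Integer)
    1 <= i <= length(dl) || throw(BoundsError(dl, i))
    start = (i - 1) * dl.batchsize + 1
    stop = min(i * dl.batchsize, length(dl.data))
    return dl.data[start:stop]
end
Base.firstindex(dl::MiniLoader) = 1
Base.lastindex(dl::MiniLoader) = length(dl)

# Iteration: walk over the batch indices in order.
Base.iterate(dl::MiniLoader, i = 1) = i > length(dl) ? nothing : (dl[i], i + 1)

dl = MiniLoader(1:10, 3)
# Index-based loop: each batch is fetched independently of the others.
total = sum(sum(dl[i]) for i in 1:length(dl))
println(total)  # 55
```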
Because it is very common for data loaders to load data from disk, the package also provides two convenience functions to easily read and write files:
BasicDataLoaders.save — Function
save(path, obj)
Write obj to the file path in the BSON format. Intermediate directories are created if they do not exist. If path does not end with the extension ".bson", the extension is appended to the output path. The function returns the type of the saved object. See load to read the file back.
BasicDataLoaders.load — Function
load(path)
Load a Julia object saved in path with the function save. If path does not end with the extension ".bson", the extension is appended to the input path.
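The path handling described for save and load can be sketched with the standard library alone. This is only an illustration under stated assumptions: sketch_save, sketch_load and with_ext are hypothetical names, and the Serialization standard library stands in for BSON, so the files written here are not real BSON files.

```julia
using Serialization  # standard library, used here as a stand-in for BSON

# Hypothetical helpers, NOT the package's implementation.
# Append ".bson" only when the path does not already end with it.
with_ext(path) = endswith(path, ".bson") ? path : path * ".bson"

function sketch_save(path, obj)
    path = with_ext(path)
    mkpath(dirname(path))   # create intermediate directories if needed
    serialize(path, obj)    # stand-in for writing BSON
    return typeof(obj)      # mirror the documented return value
end

sketch_load(path) = deserialize(with_ext(path))

dir = mktempdir()
sketch_save(joinpath(dir, "a", "b", "batch1"), [1, 2, 3])
println(sketch_load(joinpath(dir, "a", "b", "batch1.bson")))  # [1, 2, 3]
```

Because the extension is normalized on both sides, the same object can be read back whether or not the caller spells out ".bson".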
Examples
Here is a complete example that simply prints the batches:
julia> using BasicDataLoaders
julia> dl = DataLoader(1:10, batchsize = 3)
DataLoader{UnitRange{Int64}}
data: UnitRange{Int64}
batchsize: 3
julia> for batch in dl println(batch) end
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
[10]
Here is another example that computes the sum of all even numbers between 2 and 200 inclusive:
julia> using BasicDataLoaders
julia> dl = DataLoader(1:100, batchsize = 10, preprocess = x -> 2*x)
DataLoader{UnitRange{Int64}}
data: UnitRange{Int64}
batchsize: 10
julia> sum(sum(batch) for batch in dl)
10100
Finally, here is an example simulating loading data from files. In practice, you can replace the printing function with the load function.
julia> using BasicDataLoaders
julia> files = ["file1.bson", "file2.bson", "file3.bson"]
3-element Array{String,1}:
"file1.bson"
"file2.bson"
"file3.bson"
julia> dl = DataLoader(files, batchsize = 2, preprocess = x -> println("load and merge files $x"))
DataLoader{Array{String,1}}
data: Array{String,1}
batchsize: 2
julia> for batch in dl println("do something on this batch") end
load and merge files ["file1.bson", "file2.bson"]
do something on this batch
load and merge files ["file3.bson"]
do something on this batch