BasicDataLoaders
Julia package providing a simple data loader to train machine learning systems.
The source code of the project is available on GitHub.
Authors
Lucas Ondel, Brno University of Technology, 2020
Installation
The package can be installed with the Julia package manager. From the Julia REPL, type ]
to enter the Pkg REPL mode and run:
pkg> add BasicDataLoaders
API
The package provides a simple data loader object:
BasicDataLoaders.DataLoader - Type
struct DataLoader
data
batchsize
end
Constructor
DataLoader(data[, batchsize = 1, preprocess = x -> x, preprocess_element = x -> x])
where data is a sequence of elements to iterate over, batchsize is the size of each batch, preprocess is a user-defined function applied to each batch, and preprocess_element is a user-defined function applied to each element of a batch. By default, preprocess and preprocess_element are the identity function.
When iterating, the final batch may be smaller than batchsize.
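The batching behaviour described above can be sketched with Base Julia alone. This is not the package's implementation: sketch_batches is a hypothetical helper, and applying preprocess_element before preprocess is an assumption made for illustration.

```julia
# Hypothetical helper, NOT part of BasicDataLoaders: it only mimics the
# batching behaviour described in the text, using Base Julia.
# Assumption: `preprocess_element` runs on each element first, then
# `preprocess` runs on the resulting batch.
function sketch_batches(data, batchsize;
                        preprocess = identity,
                        preprocess_element = identity)
    # Iterators.partition yields consecutive chunks of at most `batchsize`
    # elements; the last chunk holds whatever is left over.
    return [preprocess(map(preprocess_element, collect(chunk)))
            for chunk in Iterators.partition(data, batchsize)]
end

batches = sketch_batches(1:10, 3)
println(batches)
println(length(last(batches)))  # size of the final, shorter batch
```

With 10 elements and a batch size of 3, the last batch holds the single leftover element, which matches the remark about the final batch.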
DataLoader supports the iteration and indexing interfaces and, consequently, can be used in distributed for loops.
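To illustrate why indexing matters for distributed loops, here is a minimal sketch that uses the Distributed standard library. A plain vector of batches stands in for the data loader (an assumption, so that the snippet runs without the package); with a real DataLoader, the package would presumably also need to be loaded on every worker process.

```julia
using Distributed

# Stand-in for a data loader: an indexable collection of batches.
# This is an assumption for illustration, not a DataLoader object.
batches = collect(Iterators.partition(1:100, 10))

# @distributed splits the index range of `batches` among the available
# processes and combines the per-batch results with (+).
# With no extra worker processes, the loop runs on the current process.
total = @distributed (+) for batch in batches
    sum(batch)
end

println(total)
```

The loop body only needs each batch, while the macro relies on indexing to hand out contiguous slices of the collection to each process.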
Because it is very common for data loaders to load data from disk, the package also provides two convenience functions to easily read and write files:
BasicDataLoaders.save - Function
save(path, obj)
Write obj to the file path in the BSON format. Intermediate directories are created if they do not exist. If path does not end with the extension ".bson", the extension is appended to the output path. The function returns the type of the object saved. See load to load this file again.
BasicDataLoaders.load - Function
load(path)
Load a Julia object saved in path with the function save. If path does not end with the extension ".bson", the extension is appended to the input path.
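The path handling described for save and load can be sketched as follows. Both helpers are hypothetical and use only Base Julia; they illustrate the documented rules (append ".bson" when missing, create intermediate directories) and do not read or write BSON data.

```julia
# Hypothetical helpers, NOT the package's code: they only reproduce the
# path rules described in the text.

# Append ".bson" unless the path already ends with it.
with_bson_extension(path) = endswith(path, ".bson") ? path : path * ".bson"

# Normalise the path and create any missing intermediate directories.
function prepare_output_path(path)
    fullpath = with_bson_extension(path)
    dir = dirname(fullpath)
    isempty(dir) || mkpath(dir)  # nothing to create for a bare file name
    return fullpath
end

root = mktempdir()  # temporary directory so the sketch leaves no files behind
out = prepare_output_path(joinpath(root, "features", "utt1"))
println(out)
```

Under these rules, saving to "features/utt1" and loading from "features/utt1" or "features/utt1.bson" all refer to the same file.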
Examples
Here is a complete example that simply prints the batches:
julia> using BasicDataLoaders
julia> dl = DataLoader(1:10, batchsize = 3)
DataLoader{UnitRange{Int64}}
data: UnitRange{Int64}
batchsize: 3
julia> for batch in dl println(batch) end
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
[10]
Here is another example that computes the sum of all even numbers from 2 to 200 inclusive:
julia> using BasicDataLoaders
julia> dl = DataLoader(1:100, batchsize = 10, preprocess = x -> 2*x)
DataLoader{UnitRange{Int64}}
data: UnitRange{Int64}
batchsize: 10
julia> sum(sum(batch) for batch in dl)
10100
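As a cross-check that does not depend on the package, the same value can be obtained directly in Base Julia: doubling each of 1, ..., 100 gives the even numbers 2, ..., 200, so the result is twice the sum of 1 to 100.

```julia
# Sum the even numbers 2, 4, ..., 200 directly.
direct = sum(2:2:200)

# Equivalent closed form: 2 * (1 + 2 + ... + 100) = 2 * (100 * 101 / 2).
closed_form = 2 * div(100 * 101, 2)

println(direct)
println(closed_form)
```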
Finally, here is an example simulating loading data from files. In practice, you can replace the printing function with the load function.
julia> using BasicDataLoaders
julia> files = ["file1.bson", "file2.bson", "file3.bson"]
3-element Array{String,1}:
"file1.bson"
"file2.bson"
"file3.bson"
julia> dl = DataLoader(files, batchsize = 2, preprocess = x -> println("load and merge files $x"))
DataLoader{Array{String,1}}
data: Array{String,1}
batchsize: 2
julia> for batch in dl println("do something on this batch") end
load and merge files ["file1.bson", "file2.bson"]
do something on this batch
load and merge files ["file3.bson"]
do something on this batch