BasicDataLoaders

Julia package providing a simple data loader to train machine learning systems.

The source code of the project is available on GitHub.

Authors

Lucas Ondel, Brno University of Technology, 2020

Installation

The package can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add BasicDataLoaders

API

The package provides a simple data loader object:

BasicDataLoaders.DataLoader (Type)
struct DataLoader
    data
    batchsize
end

Constructor

DataLoader(data[, batchsize = 1, preprocess = x -> x,
           preprocess_element = x -> x])

where data is a sequence of elements to iterate over, batchsize is the size of each batch, preprocess is a user-defined function applied to each batch, and preprocess_element is a user-defined function applied to each element of a batch. By default, preprocess and preprocess_element are both the identity function.

Warning

When iterating, the final batch may have a size smaller than batchsize.
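
The shorter final batch follows from cutting the data into consecutive chunks of at most batchsize elements. Here is a minimal sketch of that chunking, written with Iterators.partition from Julia's Base rather than with the package itself. The package's internal implementation may differ, and the variable names are chosen for illustration only.

```julia
# Illustration only: uses Base's Iterators.partition, not BasicDataLoaders internals.
data = 1:10
batchsize = 3

# Cut the data into consecutive chunks of at most `batchsize` elements.
batches = [collect(chunk) for chunk in Iterators.partition(data, batchsize)]

# Every batch is full except possibly the last one.
println(length.(batches))   # prints [3, 3, 3, 1]

# The number of batches is the ceiling of length(data) / batchsize.
println(length(batches) == cld(length(data), batchsize))   # prints true
```

If the training code requires fixed-size batches, the last batch has to be handled explicitly, for instance by skipping it or padding it.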

Note

DataLoader supports the iteration and indexing interfaces and, consequently, it can be used in distributed for loops.
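
To make concrete what supporting both interfaces involves, here is a minimal sketch of a toy batch container. ToyLoader and its fields are hypothetical names introduced for this illustration; this is not the package's DataLoader and makes no claim about how the package is implemented. Indexing lets a batch be fetched by position without going through the earlier ones, which is what allows the work to be split across workers.

```julia
# Hypothetical toy type for illustration; not the package's DataLoader.
struct ToyLoader{T<:AbstractVector}
    data::T
    batchsize::Int
end

# Number of batches, counting a possibly shorter final one.
Base.length(tl::ToyLoader) = cld(length(tl.data), tl.batchsize)
Base.firstindex(tl::ToyLoader) = 1
Base.lastindex(tl::ToyLoader) = length(tl)

# Indexing interface: return the i-th batch.
function Base.getindex(tl::ToyLoader, i::Integer)
    1 <= i <= length(tl) || throw(BoundsError(tl, i))
    start = (i - 1) * tl.batchsize + 1
    stop = min(i * tl.batchsize, length(tl.data))
    return tl.data[start:stop]
end

# Iteration interface, defined in terms of indexing.
Base.iterate(tl::ToyLoader, i = 1) = i > length(tl) ? nothing : (tl[i], i + 1)

tl = ToyLoader(collect(1:10), 3)
println(tl[4])                        # prints [10]
println(tl[end] == tl[4])             # prints true
println(sum(sum(b) for b in tl))      # prints 55
```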

Because it is very common for data loaders to load data from disk, the package also provides two convenience functions to easily read and write files:

BasicDataLoaders.save (Function)
save(path, obj)

Write obj to the file path in the BSON format. Intermediate directories are created if they do not exist. If path does not end with the extension ".bson", the extension is appended to the output path. The function returns the type of the object saved. See load to read the file back.

BasicDataLoaders.load (Function)
load(path)

Load a Julia object saved at path with the function save. If path does not end with the extension ".bson", the extension is appended to the input path.
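
As a sketch of how the two functions fit together, the snippet below saves an array and reads it back. It relies only on the behavior described above. The temporary directory, the "stats" subdirectory and the "means" file name are hypothetical choices for the illustration, and the functions are called with their module-qualified names so that the snippet does not depend on whether they are exported.

```julia
using BasicDataLoaders

# Hypothetical location for this illustration; the "stats" directory does not exist yet.
path = joinpath(mktempdir(), "stats", "means")

# No ".bson" extension is given, so the file is written to path * ".bson".
BasicDataLoaders.save(path, [1.0, 2.0, 3.0])
println(isfile(path * ".bson"))

# The extension is appended on load as well, so the same path can be reused.
obj = BasicDataLoaders.load(path)
println(obj)
```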


Examples

Here is a complete example that simply prints the batches:

julia> using BasicDataLoaders

julia> dl = DataLoader(1:10, batchsize = 3)
DataLoader{UnitRange{Int64}}
  data: UnitRange{Int64}
  batchsize: 3

julia> for batch in dl println(batch) end
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
[10]

Here is another example that computes the sum of all even numbers from 2 to 200 inclusive:

julia> using BasicDataLoaders

julia> dl = DataLoader(1:100, batchsize = 10, preprocess = x -> 2*x)
DataLoader{UnitRange{Int64}}
  data: UnitRange{Int64}
  batchsize: 10

julia> sum(sum(batch) for batch in dl)
10100

Finally, here is an example simulating loading data from files. In practice, you can replace the printing function with the load function.

julia> using BasicDataLoaders

julia> files = ["file1.bson", "file2.bson", "file3.bson"]
3-element Array{String,1}:
 "file1.bson"
 "file2.bson"
 "file3.bson"

julia> dl = DataLoader(files, batchsize = 2, preprocess = x -> println("load and merge files $x"))
DataLoader{Array{String,1}}
  data: Array{String,1}
  batchsize: 2

julia> for batch in dl println("do something on this batch") end
load and merge files ["file1.bson", "file2.bson"]
do something on this batch
load and merge files ["file3.bson"]
do something on this batch
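
The same pattern with real files could look like the sketch below, where the printing function is replaced by one that calls load on every path of the batch and concatenates the results. The temporary directory, the file contents and the helper load_and_merge are hypothetical and serve only the illustration. The sketch assumes that the object returned by load compares equal to the array that was saved.

```julia
using BasicDataLoaders

# Write three small arrays to a temporary directory (hypothetical data).
dir = mktempdir()
files = [joinpath(dir, "file$i.bson") for i in 1:3]
for (i, path) in enumerate(files)
    BasicDataLoaders.save(path, fill(i, 2))
end

# Hypothetical helper: load every file of a batch and concatenate the arrays.
load_and_merge(paths) = reduce(vcat, BasicDataLoaders.load.(paths))

dl = DataLoader(files, batchsize = 2, preprocess = load_and_merge)

# Each batch is now the merged content of up to two files.
for batch in dl
    println(batch)
end
```

Doing the loading inside preprocess means a file is read only when its batch is requested, so the whole data set never has to be held in memory at once.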