Feature extraction
Loading an audio file
To extract any type of speech features, you need the audio signal stored in an Array-like object and the sampling rate in Hertz. SpeechFeatures does not provide a way to load these two elements from audio files directly, but several Julia packages can do this. In this tutorial, we will use WAV.jl. For the rest of the tutorial, we assume that you have installed the WAV.jl package in your Julia environment.
First of all, as an example, we download an audio file from the TIMIT corpus. In the Julia REPL type:
julia> run(`wget https://catalog.ldc.upenn.edu/desc/addenda/LDC93S1.wav`)
Now, we load the audio waveform:
julia> using WAV
julia> channels, srate = wavread("LDC93S1.wav", format = "double")
Here, channels is an NxC matrix, where N is the length of the audio in samples and C is the number of channels. Since TIMIT is recorded in mono, it has only one channel. format = "double" indicates that the signals in channels will be encoded in double precision and that each sample of the signal will lie between -1.0 and 1.0.
The wavread function also accepts format = "native", which returns the data in the format in which it is stored in the WAV file. We discourage its use, because extracting the features from an integer-encoded signal rather than a floating-point-encoded one can lead to drastically different output.
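To make the scaling issue concrete, here is a minimal sketch in plain Julia that does not call WAV.jl. The sample values are made up, and dividing by 2^15 is one common normalization convention assumed for illustration; it is not necessarily the exact rule WAV.jl applies. The sketch shows that the log energy of the same signal shifts by a large constant depending on the encoding.

```julia
# Hypothetical 16-bit samples, as a "native" read of a 16-bit PCM file might return them.
native = Int16[0, 8192, -16384, 32767]

# Assumed normalization convention: divide by 2^15 so every sample falls in [-1, 1].
normalized = Float64.(native) ./ 32768

# Log energy, a quantity that many speech features build on.
logenergy(x) = log(sum(abs2, x))

# Convert to Float64 first so that squaring the integer samples cannot overflow.
offset = logenergy(Float64.(native)) - logenergy(normalized)

# The offset is approximately 2 * log(32768), regardless of the signal content.
println("log-energy offset between encodings: ", offset)
```

Any feature that involves a logarithm of the spectrum or of the energy inherits an offset of this kind, which is why a consistent encoding matters.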
We get the signal from the channels matrix:
julia> x = channels[:, 1]
As a sanity check, we print the sampling rate and duration of the signal:
julia> println("sampling freq: $srate Hz\nduration: $(round(length(x) / srate, digits=2)) s")
sampling freq: 16000.0 Hz
duration: 2.92 s
and we plot the waveform:
julia> using Plots
julia> pyplot()
julia> t = range(0, length(x) / srate, length=length(x))
julia> plot(t, x, size = (1000, 300), xlabel = "time (seconds)", legend = false)
Extracting the features
All the different types of features supported by this package follow the same extraction scheme:
- create the feature extractor object with a specific configuration
- send the signal(s) to this extractor to get the features.
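The two-step scheme above maps naturally onto a callable Julia object. The sketch below uses a hypothetical FrameEnergy extractor, which is not part of SpeechFeatures and is only meant to illustrate the pattern of configuring an object first and then piping a signal into it.

```julia
# Hypothetical extractor, used only to illustrate the two-step pattern;
# it is not part of SpeechFeatures.
struct FrameEnergy
    framelen::Int   # frame length in samples
    step::Int       # hop size in samples
end

# Keyword constructor holding the configuration.
FrameEnergy(; framelen = 400, step = 160) = FrameEnergy(framelen, step)

# Making the object callable lets us write `x |> extractor`.
function (f::FrameEnergy)(x::AbstractVector{<:Real})
    starts = 1:f.step:(length(x) - f.framelen + 1)
    return [sum(abs2, view(x, s:s + f.framelen - 1)) for s in starts]
end

energy = FrameEnergy(framelen = 4, step = 2)   # step 1: configure the extractor
x = Float64[1, 1, 1, 1, 2, 2, 2, 2]
fea = x |> energy                               # step 2: send the signal to it
println(fea)
```

Because the configuration lives in the object, the same extractor can be reused on many signals and combined with further processing steps through the pipe operator.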
SpeechFeatures provides the following feature extractors:
Extractor | Constructor | Description |
---|---|---|
Log magnitude spectrum | LogMagnitudeSpectrum([options]) | Logarithm of the magnitude of the Short Term Fourier Transform (STFT) |
Log Mel Spectrum | LogMelSpectrum([options]) | Logarithm of the STFT transformed via a mel-spaced filter bank. |
Mel Frequency Cepstral Coefficients (MFCCs) | MFCC([options]) | Classical MFCC features |
As an example, we will use the popular Mel Frequency Cepstral Coefficients (MFCC) features. First we create the extractor with the default configuration:
julia> mfcc = MFCC()
and then, we extract and plot the features from our TIMIT sample:
julia> fea = x |> mfcc
Here is the list of possible options for each extractor:
Option name | Default | Supported by | Description |
---|---|---|---|
removedc | true | all | Remove the DC (direct current) component from the signal. |
dithering | true | all | Add Gaussian white noise with dithering standard deviation. |
srate | 16000 | all | Sampling rate in Hz of the input signal |
frameduration | 0.025 | all | Frame duration in seconds. |
framestep | 0.011 | all | Frame step (hop size) in seconds. |
preemphasis | 0.97 | all | Preemphasis filter coefficient. |
windowfn | SpeechFeatures.HannWindow | all | Windowing function (others are HammingWindow or RectangularWindow ). |
windowpower | 0.85 | all | Sharpening exponent of the window. |
nfilters | 26 | LogMelSpectrum, MFCC | Number of filters in the filter bank. |
lofreq | 80 | LogMelSpectrum, MFCC | Low cut-off frequency in Hz for the filter bank. |
hifreq | 7600 | LogMelSpectrum, MFCC | High cut-off frequency in Hz for the filter bank. |
addenergy | true | MFCC | Append the per-frame energy to the features. |
nceps | 12 | MFCC | Number of cepstral coefficients. |
liftering | 22 | MFCC | Liftering coefficient. |
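To give a feel for what the timing and filter bank options in the table above mean in practice, the following sketch converts the default values into sample counts and filter edge frequencies. It is plain Julia and does not call SpeechFeatures. The rounding rule, the absence of padding, and the particular mel-scale formula are assumptions made for illustration; the package may make different choices.

```julia
# Values taken from the "Default" column of the options table.
srate, frameduration, framestep = 16000, 0.025, 0.011
nfilters, lofreq, hifreq = 26, 80, 7600

# Frame length and hop size in samples (rounding to the nearest sample is an assumption).
framelen = round(Int, srate * frameduration)
hop = round(Int, srate * framestep)

# Number of complete frames in a signal of n samples, assuming no padding.
nframes(n) = n < framelen ? 0 : 1 + div(n - framelen, hop)

# One widely used mel-scale formula (assumed here) and its inverse.
hz2mel(f) = 2595 * log10(1 + f / 700)
mel2hz(m) = 700 * (10^(m / 2595) - 1)

# nfilters triangular filters need nfilters + 2 edge points, equally spaced in mel.
edges = mel2hz.(range(hz2mel(lofreq), hz2mel(hifreq), length = nfilters + 2))

println("frame length: $framelen samples, hop: $hop samples")
println("frames in one second of audio: ", nframes(srate))
println("first and last filter edges (Hz): ", first(edges), ", ", last(edges))
```

The edge points are dense at low frequencies and sparse at high frequencies, which is the purpose of spacing the filters on the mel scale between lofreq and hifreq.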
Deltas and mean normalization
The delta and acceleration coefficients (i.e. "double deltas") can be computed by chaining the feature extraction with the delta feature extractor:
julia> Δ_ΔΔ = DeltaCoeffs(order = 2, deltawin = 2)
julia> fea = x |> mfcc |> Δ_ΔΔ
The order parameter is the order of the delta coefficients, i.e. order = 2 means that the first and second delta (acceleration) coefficients will be computed. deltawin is the length of the delta window.
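For readers unfamiliar with delta coefficients, here is a minimal sketch of the classical regression formula, written in plain Julia without calling SpeechFeatures. Several choices are assumptions for illustration: the features are laid out as a DxT matrix with one column per frame, deltawin counts frames on each side of the current frame, and the first and last frames are repeated at the edges. The helper names deltas and adddeltas are hypothetical, and DeltaCoeffs may differ in any of these respects.

```julia
# Regression-based deltas over `deltawin` frames on each side of the current frame.
# `fea` is assumed to be a D x T matrix with one column per frame.
function deltas(fea::AbstractMatrix{<:Real}; deltawin::Int = 2)
    D, T = size(fea)
    out = zeros(Float64, D, T)
    denom = 2 * sum(n^2 for n in 1:deltawin)
    for t in 1:T, n in 1:deltawin
        # Repeat the first/last frame beyond the edges (an assumption).
        ahead = clamp(t + n, 1, T)
        behind = clamp(t - n, 1, T)
        out[:, t] .+= n .* (fea[:, ahead] .- fea[:, behind]) ./ denom
    end
    return out
end

# Stack the static features with deltas up to the requested order.
function adddeltas(fea; order::Int = 2, deltawin::Int = 2)
    blocks = [Float64.(fea)]
    for _ in 1:order
        push!(blocks, deltas(blocks[end]; deltawin = deltawin))
    end
    return vcat(blocks...)
end

ramp = reshape(collect(1.0:10.0), 1, :)   # one feature rising by 1 per frame
full = adddeltas(ramp)
println(size(full))   # static, delta and acceleration rows stacked vertically
```

On a linear ramp, the first-order deltas away from the edges equal the slope and the second-order deltas vanish, which is a convenient sanity check for any delta implementation.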
Similarly, to remove the mean of the utterance you can add one more element to the chain:
julia> mnorm = MeanNorm()
julia> fea = x |> mfcc |> Δ_ΔΔ |> mnorm
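As a rough illustration of what utterance-level mean normalization does, the sketch below subtracts the per-dimension mean over all frames. It assumes a DxT feature matrix with one column per frame, and meannorm is a hypothetical helper written for this example, not the package's MeanNorm implementation.

```julia
using Statistics

# Subtract the mean of each feature dimension, computed over the whole utterance.
# `fea` is assumed to be a D x T matrix with one column per frame.
meannorm(fea::AbstractMatrix{<:Real}) = fea .- mean(fea, dims = 2)

fea = [1.0  2.0  3.0;
       10.0 20.0 30.0]
normalized = meannorm(fea)

# Each row of the result now averages to zero.
println(mean(normalized, dims = 2))
```

Removing the mean in the cepstral domain is commonly used to reduce the effect of stationary channel characteristics, such as the microphone response.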