In this tutorial, we’ll look at audio processing in Python using the librosa library as an example.
What is librosa? Librosa is a Python package for music and audio analysis. It provides the building blocks needed to extract information from music and audio files.
Installing librosa in Python
Let’s install the library using the pip command:
pip install librosa
For this example I downloaded an mp3 file from https://www.bensound.com/ and converted it to ogg for convenience. Download a short ogg file (any music file in ogg format will do):
import librosa

y, sr = librosa.load('bensound-happyrock.ogg')
Processing audio as a time series
The load function reads the ogg file as a time series, where:
- y is the time series, represented as a NumPy array.
- sr is the sample rate: the number of samples per second of audio.
By default the audio is mixed down to mono and resampled to 22050 Hz during loading. This behavior can be changed with additional parameters of the load function.
Extracting features from an audio file
An audio sample has several important rhythmic attributes. Rhythm is the fundamental notion; the others are either refinements of it or related to it:
- Tempo: the rate at which patterns repeat, measured in beats per minute (BPM). Music at 120 BPM has 120 beats every minute.
- Beat: a basic unit of time; the pulse that is "beaten out" in a song.
- Measure (bar): a logical grouping of beats. Usually there are 3 or 4 beats in a measure, although other variations are possible.
- Division: common in audio editing software; a beat is subdivided into notes of equal length, for example eighth notes, triplets, or quadruplets.
- Rhythm: the overall pattern of musical sounds in time, formed by all of the notes together.
Tempo and beats can be extracted from audio:
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print(tempo)
print(beat_frames)
89.10290948275862 [ 3 40 75 97 132 153 183 211 246 275 303 332 361 389 ... 4438 4466]
Mel-frequency cepstral coefficients (MFCC)
Mel-frequency cepstral coefficients are among the most important features in audio processing. The MFCC is a matrix of values that captures the timbral aspects of a musical instrument: for example, the difference between the sound of a metal and a wooden guitar. Other metrics do not capture this difference, and MFCCs come closest to what humans actually perceive.
import seaborn as sns  # pip install seaborn matplotlib
from matplotlib import pyplot as plt

mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=8192, n_mfcc=12)
mfcc_delta = librosa.feature.delta(mfcc)
sns.heatmap(mfcc_delta)
plt.show()
Here we create a heat map of the MFCC delta data. A chromagram of the same audio can be plotted the same way:
chromagram = librosa.feature.chroma_cqt(y=y, sr=sr)
sns.heatmap(chromagram)
plt.show()
These are just the basics of what you can extract from audio data for machine learning algorithms. The librosa documentation contains many more advanced examples.