
Volume 8, Issue 5, May – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Music Genre Detection using Machine Learning Algorithms

Karan Rathi¹, Manas Bisht²
¹,²Department of Computer Science and Engineering, Sharda University, SET, Greater Noida, UP (India)

Abstract:- Music genre classification is one example of content-based analysis of music signals. Historically, human-engineered features were employed to automate this process, and 61% accuracy was attained on the 10-genre classification task. Even so, this falls short of the 70% accuracy that humans achieve on the same task. Here, we suggest a novel approach that combines understanding of the neurophysiology of the auditory system with research on human perception in the classification of musical genres. The technique involves training a straightforward convolutional neural network (CNN) to categorise a brief portion of the music input. The genre of the song is then identified by breaking it up into manageable chunks and combining the CNN's predictions from each individual chunk. The filters learned by the CNN match the spectro-temporal receptive field (STRF) in humans, and after training, this approach reaches human-level (70%) accuracy.
I. INTRODUCTION

Music plays a very important role in people's lives. Music brings like-minded people together and is the glue that holds communities together. A music genre is a category or classification of music that shares common characteristics such as musical style, instrumentation, rhythm, melody, and cultural and historical context. Examples of music genres include rock, pop, hip-hop, classical, jazz, blues, country, electronic, folk, and many others. Each genre is defined by a set of conventions that distinguish it from other genres and often has a dedicated fan base and industry infrastructure. The boundaries between genres can sometimes be blurred, and new genres can emerge through a fusion of existing ones or by incorporating elements of different styles. Music genre detection is the process of automatically identifying the genre of a piece of music using algorithms and machine learning techniques. The goal is to classify a piece of music into one or more predefined categories based on its acoustic features, such as timbre, rhythm, harmony, and melody. Music genre detection is used in various applications such as music recommendation systems, music streaming platforms, and content-based music retrieval systems. The process typically involves analyzing the audio signal with signal processing techniques to extract relevant features, which are then fed into machine learning models trained on labeled datasets. The models learn to recognize patterns and associations between the extracted features and the corresponding genre labels, which allows them to classify new, unlabeled pieces of music into the appropriate genre category.

The genres of music that different communities write or even just listen to can be used to identify them. Different groups and communities listen to various types of music. A piece's genre is a key characteristic that distinguishes it from other music, yet a casual listener cannot always recognize the genre right away. Because the distinctions between many genres can be hazy, classifying them is a particularly challenging job. For instance, in a test using a 10-way forced-choice problem, college students were able to classify music with 70% accuracy after hearing it for just 3 seconds, and the accuracy remained constant for longer excerpts [1]. Additionally, the amount of tagged data is often significantly lower than the data's dimension. For instance, even though the GTZAN dataset used in this work has only 1000 audio tracks, each track is 30 seconds long and has a sample rate of 22,050 Hz.

II. LITERATURE REVIEW

Numerous research papers on the classification of musical genres have extensively employed this kind of methodology. Multiple spectrograms obtained from audio recordings are used as inputs to a CNN, and their patterns are extracted by 2D convolutional layers with appropriate filter and kernel sizes [9]. Spectrograms are paired with CNNs because the model is good at identifying picture details [8]. Lau proposed applying a Convolutional Neural Network (CNN) model to a preprocessed GTZAN dataset. Each song's extracted Mel-Frequency Cepstral Coefficient (MFCC) spectrogram was included in the dataset. Additionally, the feature descriptions for the 3-second and 30-second audio excerpts were included in a separate .csv file [8]. Then, using Keras, he created a CNN architecture with 5 convolutional blocks. Each block contained a convolutional layer with a 3x3 filter and a 1x1 stride, a max-pooling layer with a 2x2 window size and a 2x2 stride, and a Rectified Linear Unit (ReLU) activation; the final layer output the probabilities for the 10 music genres, and the genre with the highest probability was picked as the input's classification label [8]. Twenty MFCCs were trained on 30-second and 3-second pieces of music, three CNN models were built on spectrograms, and a classification test was run on the test sets after training [8]. As Lau noted, there was a problem with the training datasets because the 3-second dataset did not match the number of genres in the sample: some genres featured fewer or more
samples than the standard (1000) [10]. Short-Time Fourier Transform (STFT) spectrograms, which are composed of different sequences of spectrogram vectors across time, were used by Yu et al. to establish their CNN method [10]. In their paper, two datasets were mentioned: Extended Ballroom and GTZAN. Yu et al. separated each song from both datasets into 18 smaller 3-second parts with 50% overlap, increasing the dataset size for each genre label by 18 times over the original [10]. The STFT spectrograms were examined with an analysis size of 513x128, and the train-validate-test ratio was 8:1:1 [10]. In order to capture discrete audio properties reflected in the STFT spectrograms and lessen source loss, pooling kernels and convolution filters were designed in small sizes in the first few layers of the CNN model [10]. Athulya and Sindhu came up with the idea of building a 2D Convolutional Neural Network (CNN). They extracted the audio samples from the GTZAN dataset into several types of spectrograms using the Librosa tool. These spectrograms served as binary inputs for the 2D CNN model developed with the Keras framework; the layers were also created using the TensorFlow framework [6]. A 2D convolutional layer with input dimensions of 128x128x1 was used. The inputs to the max-pooling layer, which operated on a matrix half the size of the input layer, were represented by a 2D NumPy array [6]. The overall number of convolutional layers was 5, each with a max-pooling layer, a stride of 2, and a 2x2 kernel size. Next, the output from each layer was fed into a fully connected layer whose inputs took the form of a flattened, shrinking matrix [6]. The SoftMax function, included at the end of the output layer, produced the probability output. The architecture achieved 94% accuracy. Similarly, Nandy and Agrawal suggested a 2D CNN with a 1D kernel based on spectrograms generated from audio snippets in the Free Music Archive (FMA) dataset. The model produced an output of a 5000-length vector from an input dimension of 500x1500. Convolution layer blocks, a batch normalization layer, an activation layer, and, where practical, a max-pooling layer were all included in the construction of the CNN. The 2D CNN model was trained, validated, and tested with an 80:10:10 split and a dropout parameter of 0.5 [1]. The model beat previous models from comparable research articles, performing with an accuracy rate of 76.2% and a log-loss of 0.7543. An F1-score greater than 0.7 indicated that the model was performing increasingly well at categorizing musical genres.

III. METHODOLOGY

A. Pre-Processing
We used the FMA medium dataset, which contains 25,000 songs from 8 different genres, each compressed to 30 seconds. Based on the metadata, we divided the unsorted dataset into only four genres, namely Hip-Hop, Rock, Pop, and Folk. Each audio clip was fed into our programme, and its spectral centroid, spectral bandwidth, spectral roll-off, MFCC (Mel-Frequency Cepstral Coefficients), zero-crossing rate, and RMSE (Root Mean Square Energy) features were extracted and saved into a new CSV file, thus obtaining a set of tagged data.
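A minimal sketch of this feature-extraction step is shown below, assuming librosa for audio loading and analysis; the folder layout (fma_clips/<genre>/), the column names, and the features.csv output name are illustrative placeholders rather than the exact files used in this work.

```python
import csv
import os

import librosa
import numpy as np

GENRES = ["Hip-Hop", "Rock", "Pop", "Folk"]

def extract_features(path, n_mfcc=20):
    """Load one clip and compute the per-clip mean of each feature."""
    y, sr = librosa.load(path, duration=30)
    feats = {
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr),
        "spectral_bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr),
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr),
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(y),
        "rmse": librosa.feature.rms(y=y),
    }
    row = [float(np.mean(v)) for v in feats.values()]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    row += [float(np.mean(c)) for c in mfcc]   # one mean per MFCC coefficient
    return row

with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["spectral_centroid", "spectral_bandwidth", "rolloff",
                     "zero_crossing_rate", "rmse"]
                    + [f"mfcc{i+1}" for i in range(20)] + ["genre"])
    for genre in GENRES:
        for name in os.listdir(os.path.join("fma_clips", genre)):
            row = extract_features(os.path.join("fma_clips", genre, name))
            writer.writerow(row + [genre])
```

Averaging each feature over the whole clip keeps one fixed-length row per track, which is the input format the classifiers described next expect.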

B. Machine Learning Techniques
We used several models, such as KNN, SVM, Naïve Bayes, Decision Tree, and NN; each is summarised below, and a training sketch follows the list.

 KNN (K-Nearest Neighbours):
This machine learning algorithm can be applied to both classification and regression tasks. K-Nearest Neighbours uses the labels of a predetermined number of nearby data points to predict the class that the target data point belongs to. We utilised KNN, a conceptually straightforward yet incredibly effective technique, to train our model.

 SVM (Support Vector Machine):
This supervised learning approach is used for both regression and classification. The basic goal of SVM is to find a hyperplane in an N-dimensional space that clearly separates the data points.

 Naïve Bayes:
This model makes predictions based on probability and is built on Bayes' theorem. It can also be used to solve classification problems.

 Decision Tree:
The decision tree is a non-parametric supervised learning approach used for classification and regression applications. It is organised hierarchically, with a root node, branches, internal nodes, and leaf nodes.

 NN (Neural Networks):
Artificial neural networks (ANNs), a subset of machine learning, are the foundation of deep learning algorithms. Their design is influenced by the structure of the human brain, mimicking how signals are sent and received between neurons.
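A minimal training-and-comparison sketch for these classifiers on the extracted features, reusing the hypothetical features.csv from the previous snippet; the hyperparameters shown are common defaults, not the tuned values behind the results in Section IV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("features.csv")
X, y = df.drop(columns=["genre"]), df["genre"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Feature scaling matters for KNN, SVM and the neural network.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", C=1.0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "NN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```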

C. Data Set
We used the dataset called FMA medium (fma_medium.zip), which consists of 25,000 tracks of 30 s each across 16 unbalanced genres (22 GiB). FMA stands for Free Music Archive.

IV. EXPERIMENTAL RESULT

 SVM:

Table 1 SVM

 SVM with Oversampling Techniques:

Table 2 SVM with Oversampling Techniques

 SVM with Undersampling Techniques:

Table 3 SVM with Undersampling Techniques

 KNN:

Table 4 KNN

 KNN with Oversampling Techniques:

Table 5 KNN with Oversampling Techniques

 KNN with Undersampling Techniques:

Table 6 KNN with Undersampling Techniques

 TREE:

Table 7 TREE

 TREE with Oversampling Techniques:

Table 8 TREE with Oversampling Techniques

 TREE with Undersampling Techniques:

Table 9 TREE with Undersampling Techniques

 NAÏVE BAYES:

Table 10 NAÏVE BAYES

 NAÏVE BAYES with Oversampling Techniques:

Table 11 NAÏVE BAYES with Oversampling Techniques

 NAÏVE BAYES with Undersampling Techniques:

Table 12 NAÏVE BAYES with Undersampling Techniques

 Random Forest:

Table 13 Random Forest

 Random Forest with Oversampling Techniques:

Table 14 Random Forest with Oversampling Techniques

 Random Forest with Undersampling Techniques:

Table 15 Random Forest with Undersampling Techniques

 Oversampling and Undersampling Used:
The following resampling techniques were used; a brief code sketch follows the list.

 SYNTHETIC MINORITY OVERSAMPLING (SMOTE): This statistical method is employed to evenly increase the number of examples in a dataset. SMOTE creates new synthetic instances from existing minority-class examples.
 SMOTE-NC: It is used to generate synthetic data to oversample a minority target class in an imbalanced dataset.
 ADASYN: The major benefits of this technique, which creates synthetic data, are duplicating minority data and producing extra data for "harder to learn" examples.
 BORDERLINE-SMOTE: This algorithm classifies any minority observation as a noise point if all its neighbours are of the majority class; such an observation is ignored while creating synthetic data.
 K-MEANS SMOTE: This is an oversampling method for class-imbalanced data. It aids classification by generating minority-class samples in safe and crucial areas of the input space.
 SVM SMOTE: This is a variant of the SMOTE algorithm which uses an SVM to detect which samples to use for generating new synthetic samples.
 CLUSTER CENTROIDS: This method undersamples the majority class by substituting the cluster centroids of a K-Means algorithm for clusters of majority samples.
 CONDENSED NEAREST NEIGHBOUR: Condensed nearest neighbour, also known as the Hart algorithm, is designed to reduce the data set for k-NN classification. It selects a set of prototypes U from the training data such that 1-NN with U can classify the examples almost as accurately as 1-NN does with the whole dataset.
 EDITED NEAREST NEIGHBOUR (ENN): This method works by first finding the K nearest neighbours of each observation, then checking whether the majority class among those neighbours is the same as the observation's class or not.
 REPEATED EDITED NEAREST NEIGHBOUR: This method repeats the ENN algorithm several times; it undersamples based on the repeated edited nearest neighbour.
 ALLKNN: It removes all examples from the dataset that were classified incorrectly.
 INSTANCE HARDNESS THRESHOLD: This is an undersampling method that was built to tackle imbalanced classification.
 NEARMISS: This refers to a group of undersampling techniques that choose examples depending on how close majority-class and minority-class examples are to one another.
 ONE-SIDED SELECTION: The Condensed Nearest Neighbour (CNN) rule and Tomek Links are two undersampling techniques that are combined to create One-Sided Selection, or OSS. The CNN approach is used to eliminate redundant examples from the interior of the majority-class density, whereas the Tomek Links method is used to eliminate noisy examples on the class boundary.
 RANDOM UNDERSAMPLER: It undersamples the majority class(es) by randomly picking samples with or without replacement.
 TOMEK LINKS: Tomek links are pairs of instances of opposite classes that are each other's nearest neighbours.
 RANDOM OVERSAMPLER: Machine learning uses random oversampling to balance unbalanced datasets, in which one class contains significantly fewer examples than the other(s). This may result in a biased model that does not adequately represent the minority class. Random oversampling addresses this by duplicating examples from the minority class until the dataset is balanced.
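Most of the samplers listed above are available in the imbalanced-learn (imblearn) library and can be placed in front of any classifier from Section III. The sketch below shows just one oversampling and one undersampling combination, reusing the X_train/y_train arrays from the earlier training snippet; it is an illustration, not the full experimental grid behind the tables above.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Resampling happens only on the training folds inside the pipeline,
# which avoids leaking synthetic samples into the evaluation data.
smote_svm = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("svm", SVC(kernel="rbf")),
])
print(cross_val_score(smote_svm, X_train, y_train, cv=5).mean())

# The same pattern works for undersampling, e.g. Edited Nearest Neighbours.
enn_svm = Pipeline([
    ("enn", EditedNearestNeighbours(n_neighbors=3)),
    ("svm", SVC(kernel="rbf")),
])
print(cross_val_score(enn_svm, X_train, y_train, cv=5).mean())
```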

V. FUTURE WORK

 There are several future directions that can be explored to improve the model's performance. Here are some possible approaches:

 Data Augmentation:
One way to improve the performance of a genre detection model is to increase the size of the training dataset by generating additional examples from the existing ones through data augmentation techniques. For example, the audio signals can be randomly time-stretched, pitch-shifted, or filtered to create variations of the same piece of music that can help the model learn more robust representations of the genre features.
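A minimal sketch of such waveform-level augmentation with librosa; the ±10% stretch range, ±2-semitone shift range, and the example file name are illustrative assumptions.

```python
import numpy as np
import librosa

def augment(y, sr, rng=np.random.default_rng(0)):
    """Return a randomly time-stretched and pitch-shifted copy of a clip."""
    rate = rng.uniform(0.9, 1.1)       # +/-10% tempo change
    steps = rng.uniform(-2.0, 2.0)     # up to 2 semitones up or down
    y_aug = librosa.effects.time_stretch(y, rate=rate)
    y_aug = librosa.effects.pitch_shift(y_aug, sr=sr, n_steps=steps)
    return y_aug

y, sr = librosa.load("example_clip.mp3", duration=30)
extra_examples = [augment(y, sr) for _ in range(4)]  # 4 new variants per clip
```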

 Feature Engineering:
Another way to improve the performance of a genre detection model is to extract more informative features from the audio signals that capture the essential characteristics of each genre. This can be achieved by using more sophisticated signal processing techniques or by incorporating domain-specific knowledge about music theory and composition into the feature extraction process.
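As one possible direction, librosa already exposes several higher-level descriptors (chroma, spectral contrast, tonnetz, tempo) that could be appended to the feature vector from Section III; whether they actually help is untested here, so the snippet below is only a starting point.

```python
import numpy as np
import librosa

def extra_features(y, sr):
    """Harmony- and rhythm-oriented descriptors, averaged over the clip."""
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # pitch-class energy
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # peak/valley contrast
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)            # global tempo estimate
    return np.hstack([chroma.mean(axis=1), contrast.mean(axis=1),
                      tonnetz.mean(axis=1), [tempo]])
```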

 Ensemble Methods:
Ensemble methods combine the outputs of multiple models to make more accurate predictions than any single model alone. By training multiple genre detection models with different architectures, hyperparameters, or training data, and then combining their outputs through voting, averaging, or stacking, we can leverage the diversity of the models' predictions to improve the overall accuracy of the ensemble.
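A small sketch of this idea using soft voting over three classifiers, reusing the training split from the Section III snippet; the choice of member models is illustrative.

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Soft voting averages the members' predicted class probabilities,
# so each model must output probabilities (hence SVC(probability=True)).
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=1)),
        ("svm", SVC(kernel="rbf", probability=True)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```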

 Transfer Learning:
Transfer learning involves reusing pre-trained models that were originally trained on large datasets for related tasks to improve the performance of a new model with limited training data. By fine-tuning a pre-trained model on a smaller genre detection dataset, we can leverage the pre-existing knowledge captured by the model to improve its accuracy on the target task.
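One hedged way to apply this with Keras is to treat mel-spectrograms as images and fine-tune an ImageNet-pretrained backbone; the MobileNetV2 backbone, the 128x128x3 input size, and the 4-class head below are assumptions for illustration, not a configuration evaluated in this paper.

```python
import tensorflow as tf

# Frozen pretrained backbone; only the new classification head is trained at first.
base = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4, activation="softmax"),  # 4 genres in our subset
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# After the head converges, unfreeze the top of the backbone and
# continue training with a much smaller learning rate (fine-tuning).
```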

 Hybrid Approaches:
Hybrid approaches combine multiple techniques from the above methods to create more sophisticated genre detection models. For example, a hybrid model could use a pre-trained deep learning model for feature extraction, followed by a support vector machine (SVM) classifier trained on augmented data, and then an ensemble method to
combine the predictions of multiple SVM models with
different hyperparameters.
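A rough sketch of such a pipeline, under the assumption that spectrogram_images_train/test are precomputed spectrogram "images" and that a frozen MobileNetV2 base (as in the transfer-learning sketch) serves as a fixed feature extractor; none of this was evaluated in the present work.

```python
import tensorflow as tf
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

# Frozen ImageNet backbone used purely as a fixed feature extractor.
base = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), include_top=False, weights="imagenet")
base.trainable = False
extractor = tf.keras.Sequential([base, tf.keras.layers.GlobalAveragePooling2D()])

# Placeholder arrays: spectrogram_images_* and labels must be prepared beforehand.
train_feats = extractor.predict(spectrogram_images_train)  # shape (n_train, 1280)
test_feats = extractor.predict(spectrogram_images_test)

# Ensemble of SVMs with different hyperparameters, combined by soft voting.
hybrid = VotingClassifier(
    estimators=[(f"svm{i}", SVC(C=c, probability=True))
                for i, c in enumerate((0.1, 1.0, 10.0))],
    voting="soft",
)
hybrid.fit(train_feats, y_train)
print(hybrid.score(test_feats, y_test))
```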

VI. CONCLUSION

In this paper, we found the KNN algorithm with Edited Nearest Neighbour undersampling and only one neighbour to be far more accurate than the remaining algorithms and their oversampled and undersampled variants. Due to our limited computational resources and time, we were not able to run a neural network on our dataset.
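For concreteness, a minimal sketch of this best-performing configuration with imbalanced-learn, reusing the feature matrix and split from Section III; the ENN settings are library defaults, since the exact parameters are not spelled out above.

```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.neighbors import KNeighborsClassifier

# Undersample the majority classes with ENN, then classify with 1-NN.
best_model = Pipeline([
    ("enn", EditedNearestNeighbours()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
best_model.fit(X_train, y_train)
print(best_model.score(X_test, y_test))
```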

REFERENCES

[1]. Chatziagapi, A., Paraskevopoulos, G., Sgouropoulos, D., Pantazopoulos, G., Nikandrou, M., Giannakopoulos, T., ... & Narayanan, S. (2019, September). Data Augmentation Using GANs for Speech Emotion Recognition. In Interspeech (pp. 171-175).
[2]. Biswas, R., & Ghattamaraju, N. (2019). An effective analysis of deep learning based approaches for audio based feature extraction and its visualization. Multimedia Tools and Applications, 78, 23949-23972.
[3]. Lu, Y. C., Wu, C. W., Lerch, A., & Lu, C. T. (2016, August). Automatic Outlier Detection in Music Genre Datasets. In ISMIR (pp. 101-107).
[4]. Fell, M., & Sporleder, C. (2014, August). Lyrics-based analysis and classification of music. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (pp. 620-631).
[5]. Van Mieghem, L. C. F. (2020). Music Genre Detection with Neural Networks.
