Stroke Classification Model for Hand Percussion Digital Musical Instrument
F. I. Aranda a, J. Jaimovich a, L. Cendoyya a, S. E. Floody a, J. P. Cortés b and V. M. Espinoza a
a Departamento de Sonido, Facultad de Artes, Universidad de Chile, Santiago, Chile. b MGH Voice Center, Massachusetts General Hospital, Boston, USA. Corresponding author: F. I. Aranda, email: felipe.aranda@ug.uchile.cl
Abstract— This project develops a classification model for hand strokes using a percussive gesture interface for musical applications. The aim is to implement a percussion Digital Musical Instrument that classifies percussive hand strokes according to their spectral characteristics. The prototype is based on a USB audio interface and a piezoelectric sensor that captures the vibrations of the hand’s impact on a solid surface (a wooden desk) for three types of strokes. The prototype has three stages of analysis and processing: 1) audio signal onset detection, 2) audio signal feature extraction and 3) percussive stroke multiclass classification through Machine Learning algorithms. An audio dataset of 900 strokes (300 of each type) was randomly divided in a 70/30 proportion for training and testing the model, respectively. Each signal was filtered to estimate the onset time, and audio features (Mel Frequency Cepstral Coefficients) were computed to fit the Machine Learning algorithms. Three supervised learning models were trained in MATLAB’s Classification Learner App: Support Vector Machine (SVM) Medium Gaussian, Subspace k-Nearest Neighbors, and Linear Discriminant. On the testing dataset, the SVM model obtained a classification rate of 97.6% ± 1.74%, a sensitivity of 97.6% ± 1.74% and a specificity of 98.8% ± 1.10% for the three classes. The results show the potential of these models for designing percussion Digital Musical Instruments.
Keywords— digital musical instrument; hand percussion; machine learning; digital signal processing.
1. INTRODUCTION
Musicians are able to recognise the subtle differences in timbre produced by different playing techniques on an instrument [1]. In hand percussion, these differences constitute the language that musicians use to express themselves through the instrument. Classifying these timbral/spectral differences between strokes can benefit the creation and design of Digital Musical Instruments (DMIs) for percussionists. A subset of DMIs are percussion digital instruments, defined by Wanderley et al. in [2] as instruments that include a gestural interface (or gestural controller unit) separate from a sound generation unit. Both units are independent and related by mapping strategies. Several percussion DMIs [3, 4, 5] have been designed to satisfy different musical needs.
The most common hand percussion DMIs are the so-called ‘launchpads’: a matrix or grid of rubber-based pads that tend to be colour-coded and velocity-sensitive, in addition to offering MIDI capabilities [4] (e.g., the Novation Launchpad Pro (2015), shown in Fig. 1). The main problem with these controllers is that percussionists are required to develop new playing skills instead of using the musical skills they have developed throughout their careers.
Fig. 1: Novation Launchpad Pro
There are several models that allow the classification of percussive gestures [1, 6, 7], and a few DMIs have already been designed based on their principles. The HandSolo [4], published in 2016, is a timbre-based hand drum controller that allows the use of natural hand-drumming strokes based on a timbre recognition classifier algorithm fully programmed in PureData. In HandSolo, the strokes decided upon were those with the widest difference in frequency content, which would be the easiest to differentiate algorithmically [4]. One can approach the study of gestures in a musical context either by analyzing the possible functions of a gesture during performance or by analyzing the physical properties of the gestures taking place [2]. HandSolo’s gesture selection corresponds to the first approach described above. In this model, stroke selection is based on the performer’s actual gestures, so that percussionists’ natural drumming skills can be used to perform with computer music interfaces.
2. METHODS
A prototype based on a piezoelectric sensor that captures the vibration of the hand’s impact on a solid surface (e.g., a table or desk) and a USB audio interface was explored in this work. Three different hand strokes from the djembe-style drum were used: bass (low-pitched), tone (medium-pitched) and slap (high-pitched) [8], as shown in Fig. 2. An audio database of 900 strokes (300 of each type), executed by the same musician, was randomly divided in a 70/30 proportion for training and testing, respectively.
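A random 70/30 division of this kind can be reproduced with a stratified hold-out partition. The following sketch is a minimal illustration only; the variable names (labels, trainIdx, testIdx) are hypothetical and not taken from the original implementation.

% Minimal sketch of the 70/30 hold-out split (variable names are hypothetical).
% 'labels' is a 900x1 vector with the class of each stroke (0, 1 or 2).
rng(1);                                   % fix the seed for reproducibility
c = cvpartition(labels, 'HoldOut', 0.3);  % stratified 70/30 partition by class
trainIdx = training(c);                   % logical index of training strokes
testIdx  = test(c);                       % logical index of testing strokes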
Mel Frequency Cepstral Coefficients (MFCCs) were obtained from the audio signals. MFCCs are derived from the cepstral representation of an audio clip; cepstral features are mainly used in pitch detection. MFCCs represent the short-time power spectrum of an audio clip based on the discrete cosine transform of the log power spectrum on a nonlinear mel scale. In MFCCs the frequency bands are equally spaced on the mel scale, which mimics the human auditory system very closely [9].
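As an illustration, the coefficients and their frame-to-frame differences can be computed with MATLAB’s Audio Toolbox mfcc function; the sketch below is a minimal example, and the file name and parameter choices are assumptions rather than the exact settings used in this work.

% Minimal MFCC sketch (MATLAB Audio Toolbox); file name and parameters are illustrative.
[x, fs] = audioread('stroke_example.wav');             % one 0.048 s stroke segment (hypothetical file)
[coeffs, delta] = mfcc(x, fs, 'LogEnergy', 'Ignore');  % MFCCs and their frame-to-frame differences
featureVector = [mean(coeffs, 1), mean(delta, 1)];     % one feature row per stroke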
The main algorithm was programmed (offline) in MATLAB in three main parts: audio signal onset detection, audio signal feature extraction and percussive stroke classification through Machine Learning (ML) algorithms. The audio signals were adapted separately for each stage. In the first stage, the signals were filtered with a band-pass filter with cutoff frequencies of 200 Hz and 8 kHz to improve the algorithm’s detection capabilities. MATLAB’s ‘findpeaks’ function was used to retrieve the peaks’ positions, as represented in Fig. 3. In the feature extraction stage, the selected, unfiltered signals were broken down into audio files of 0.048 s each, containing the onset-offset information of each stroke. Each stroke was labeled with an ordinal number corresponding to the stroke’s class. The MFCC audio features were computed from each stroke, together with the difference of the coefficients from one frame of data to the next. The coefficients of each stroke were arranged into a matrix A of n × m dimensions, where n is the number of coefficients (features) and m the number of strokes. This process was carried out separately for the training and the testing data; the only difference is that the class label row was added to the training matrix. A feature normalization stage was applied to both feature matrices, normalizing each row of features using its mean and standard deviation to avoid data snooping [10]. MATLAB’s ‘Classification Learner’ app was used to train three supervised learning classification algorithms: Support Vector Machine (SVM) Medium Gaussian, Subspace k-Nearest Neighbor (kNN) and Linear Discriminant. Each model was trained using the training set, performing a multiclass, one-versus-all classification task. A 5-fold cross-validation procedure was applied in every algorithm’s training process.
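The onset detection and segmentation steps described above can be summarized in the following sketch; the filter call, peak thresholds and file name are illustrative assumptions rather than the exact values used in the prototype.

% Onset detection and segmentation sketch; thresholds and file names are illustrative assumptions.
[x, fs] = audioread('strokes_take.wav');                       % raw mono piezoelectric recording (hypothetical file)
xf = bandpass(x, [200 8000], fs);                              % 200 Hz - 8 kHz band-pass filter
[~, onsets] = findpeaks(abs(xf), ...
    'MinPeakHeight', 0.1, 'MinPeakDistance', round(0.2*fs));   % one peak per stroke
segLen = round(0.048 * fs);                                    % 0.048 s per stroke segment
strokes = zeros(segLen, numel(onsets));
for k = 1:numel(onsets)
    strokes(:, k) = x(onsets(k) : onsets(k) + segLen - 1);     % unfiltered segment at each onset
end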
Fig. 3: Example of the onset detection process on the filtered piezoelectric sensor signal
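As a scripted counterpart of the Classification Learner workflow described above, the SVM training and 5-fold cross-validation could also be reproduced programmatically. The sketch below is an assumed equivalent using fitcecoc with a Gaussian-kernel SVM learner, not the code exported by the app, and the variable names are hypothetical.

% Assumed programmatic equivalent of the app workflow (not the exported app code).
% Xtrain/Xtest are (strokes x features) matrices; Ytrain/Ytest hold the class labels.
svmTemplate = templateSVM('KernelFunction', 'gaussian', 'KernelScale', 'auto');
mdl = fitcecoc(Xtrain, Ytrain, 'Learners', svmTemplate, 'Coding', 'onevsall');
cvMdl = crossval(mdl, 'KFold', 5);                                % 5-fold cross-validation
fprintf('CV accuracy: %.2f %%\n', 100 * (1 - kfoldLoss(cvMdl)));
Ypred = predict(mdl, Xtest);                                      % classify the held-out 30 %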
Three performance metrics were extracted from the models’ classification tests: the algorithm’s success rate, the true positive rate (TPR) or sensitivity, and the true negative rate (TNR) or specificity.
3. RESULTS
The results of each multiclass classification (see Fig. 4) are summarized in a structure known as a confusion matrix [11]: predicted classes are in the columns and true classes are in the rows, and true positives are highlighted in blue. Class ‘0’ corresponds to the slap stroke, class ‘1’ to the bass stroke and class ‘2’ to the tone stroke. The matrices shown in Fig. 4 were generated by MATLAB’s Classification Learner app as the models were trained.
Fig.4: (A) SVM Medium Gaussian’s confusion matrix. (B) kNN Subspace’s confusion matrix. (C) Linear Discriminant’s confusion matrix.
Performance metrics were calculated from the information in the confusion matrices. The success rate is the average, over the three classes, of the percentage of correctly classified strokes of each class relative to the total number of strokes of that class. Sensitivity, or true positive rate (TPR), measures the fraction of positive examples that are correctly labeled. Specificity, or true negative rate (TNR), measures the fraction of negative examples that are correctly labeled [11].
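For a three-class, one-versus-all evaluation these metrics follow directly from each confusion matrix; the sketch below illustrates the per-class computation and the averaging across classes (the variable names are hypothetical).

% Per-class metrics from a 3x3 confusion matrix C (rows = true class, columns = predicted class).
C = confusionmat(Ytest, Ypred);                % e.g., from the held-out test strokes
nClasses = size(C, 1);
TPR = zeros(1, nClasses);
TNR = zeros(1, nClasses);
for k = 1:nClasses
    TP = C(k, k);
    FN = sum(C(k, :)) - TP;
    FP = sum(C(:, k)) - TP;
    TN = sum(C(:)) - TP - FN - FP;
    TPR(k) = TP / (TP + FN);                   % sensitivity for class k
    TNR(k) = TN / (TN + FP);                   % specificity for class k
end
meanTPR = mean(TPR);                           % average sensitivity across the three classes
meanTNR = mean(TNR);                           % average specificity across the three classes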
TABLE I. ALGORITHM PERFORMANCE METRICS

Performance Metric        SVM            kNN            LD
Success Rate (%)          97.62 ± 1.74   96.51 ± 1.79   97.31 ± 1.74
TPR = Sensitivity (%)     97.61 ± 1.74   95.45 ± 0.16   97.32 ± 1.74
TNR = Specificity (%)     98.81 ± 1.10   98.25 ± 1.54   98.66 ± 1.57
High success rates were achieved for all three models. The classifiers were able to classify the timbres for the given task with an accuracy greater than 95%. The SVM Medium Gaussian classifier was the most consistent in this case, scoring the best results on all performance metrics.
4. DISCUSSION
The overall results of this work, using different ML models, are encouraging. However, some limitations and future work need to be highlighted. It should be noted that this is a case study: only one surface (a small wooden table) was used to generate the dataset, and the strokes were selected arbitrarily and executed by a single musician. Testing different strokes and surfaces may be beneficial for implementing a more generalized model. The next steps for this model are the incorporation of additional hand-drumming strokes, the inclusion of stroke velocity sensitivity, the implementation on an embedded platform (e.g., Bela or Teensy) and the mapping of the system’s output to a communication protocol such as MIDI or OSC, allowing communication with most computer music interfaces and software.
5. CONCLUSION
The hand-stroke classification performance for DMI applications presented in this work shows potential to improve the interaction between percussionists and computer music interfaces using common and cost-effective audio equipment.
Applying signal processing together with ML models improves stroke prediction, according to the classification performance indicators. Further research with additional gestures and the implementation on embedded platforms (e.g., Bela/Teensy) is under development.

REFERENCES

[1] A. Tindale, A. Kapur, G. Tzanetakis and I. Fujinaga, (2004) “Retrieval of percussion gestures using timbre classification techniques,” Proceedings of the Fifth International Conference on Music Information Retrieval (ISMIR), Barcelona, Spain.
[2] M. M. Wanderley and P. Depalle, (2004) “Gestural control of sound synthesis,” Proceedings of the Institute of Electrical and Electronics Engineers (IEEE).
[3] Wavedrum Global Edition [online]. [Accessed 13 June 2020]. Available from: https://www.korg.com/cl/products/drums/wavedrum_global_edition/
[4] K. Jathal and T. Park, (2016) “The HandSolo: A Hand Drum Controller for Natural Rhythm Entry and Production,” Proceedings of the 16th International Conference on New Interfaces for Musical Expression (NIME), Brisbane, Australia.
[5] Aerodrums [online]. [Accessed 13 June 2020]. Available from: https://aerodrums.com/home/
[6] M. Krzyzaniak and G. Paine, (2015) “Realtime classification of hand-drum strokes,” Proceedings of the 15th International Conference on New Interfaces for Musical Expression (NIME), Baton Rouge, USA.
[7] L. Turchet, A. McPherson and M. Barthet, (2018) “Co-design of a Smart Cajón,” Journal of the Audio Engineering Society, vol. 66, no. 4, pp. 220-230.
[8] The Djembe, Drum Africa [online]. [Accessed 26 October 2020]. Available from: http://www.drumafrica.co.uk/articles/the-djembe/
[9] G. Sharma, K. Umapathy and S. Krishnan, (2020) “Trends in audio signal feature extraction methods,” Applied Acoustics, vol. 158.
[10] Y. Abu-Mostafa, M. Magdon-Ismail and H. T. Lin, (2012) “Learning from Data,” AMLBook, New York, USA.
[11] J. Davis and M. Goadrich, (2006) “The relationship between Precision-Recall and ROC curves,” Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, USA.