The following plot (a) shows the classification results for all experiments based on the best-performing classifier, a linear SVM. An alternative ordering can be seen in (b), where results for the same task are grouped together.
(a)
(b)
The following figure shows the overall average results for each embedding using the linear SVM. OpenL3 embeddings outperform all other embeddings and obtain results comparable to the baseline model:
Table III compares early fusion with late fusion. Apart from Task 6, late fusion performs best in all experiments. Table IV compares different OpenL3 embeddings. Embeddings trained on environmental sounds perform consistently worse than those trained on music. Table V compares OpenL3 embeddings with 512 values against larger embeddings with 6144 values. Apart from Task 1, all tasks show better results with the larger embeddings. However, larger embeddings are more computationally expensive.
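The difference between the two fusion strategies compared in Table III can be illustrated with a small sketch. It assumes that early fusion means concatenating the embedding vectors before a single classifier, and that late fusion means averaging the class probabilities of separately trained classifiers. The exact fusion rule is not stated here, so both definitions are assumptions, and all function names and numbers are hypothetical.

```python
# Hypothetical sketch: the fusion rules below are assumptions,
# not necessarily the exact method used in the experiments.

def early_fusion(embeddings):
    """Concatenate several embedding vectors into one feature vector."""
    fused = []
    for vec in embeddings:
        fused.extend(vec)
    return fused


def late_fusion(probabilities):
    """Average class-probability vectors from separate classifiers.

    Returns (index of the winning class, averaged probabilities).
    """
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models
            for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: mean[c])
    return best, mean


# Toy values for illustration only.
emb_a = [0.1, 0.2]
emb_b = [0.3, 0.4, 0.5]
fused_features = early_fusion([emb_a, emb_b])  # one vector for one classifier

probs_a = [0.6, 0.4]  # output of a classifier trained on embedding A
probs_b = [0.2, 0.8]  # output of a classifier trained on embedding B
label, mean_probs = late_fusion([probs_a, probs_b])
```

With early fusion the classifier sees one longer feature vector, whereas with late fusion each embedding keeps its own classifier and only the decisions are combined.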
In this section, the confusion matrices with mean file-wise accuracy for all tasks are shown.
Task 1 - Ensemble Size Classification in Music | Task 2 - Musical Instrument Family Recognition |
Task 3 - Speech Music Classification | Task 4 - Classification of Operational States in Electric Engines |
Task 5 - Metal Surface Classification | Task 6 - Plastic Material Classification |
Task 6 - Plastic Material Classification (Early Fusion) | Task 6 - Plastic Material Classification (Late Fusion) |
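File-wise evaluation of the kind reported in the confusion matrices above could be computed roughly as follows. This is a minimal sketch, assuming that each file is split into segments and that the segment predictions of a file are combined by majority vote. The aggregation rule is an assumption, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sketch: majority voting over segments is an assumption.

def file_wise_predictions(segment_preds):
    """Map each file id to the most frequent label among its segments.

    segment_preds: iterable of (file_id, predicted_label) pairs.
    """
    votes = defaultdict(Counter)
    for file_id, label in segment_preds:
        votes[file_id][label] += 1
    return {fid: counts.most_common(1)[0][0] for fid, counts in votes.items()}


def confusion_matrix(true_labels, pred_labels, classes):
    """Count matrix with rows = true class and columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for fid, true in true_labels.items():
        matrix[index[true]][index[pred_labels[fid]]] += 1
    return matrix


# Toy data for illustration only.
segments = [("f1", "speech"), ("f1", "speech"), ("f1", "music"),
            ("f2", "music"), ("f2", "music"),
            ("f3", "speech"), ("f3", "music"), ("f3", "music")]
truth = {"f1": "speech", "f2": "music", "f3": "speech"}

file_preds = file_wise_predictions(segments)
cm = confusion_matrix(truth, file_preds, ["speech", "music"])
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(truth)
```

The diagonal of `cm` counts correctly classified files, so the file-wise accuracy is the diagonal sum divided by the number of files.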
The baseline model for Task 2 (Musical Instrument Family Recognition) is shown here.
The first plot shows the actual model used for this publication. In contrast to the original model (below), each convolutional block has only half of the layers (two of the original four convolutional blocks have been removed). Batch normalization is applied after each block. Dropout has been moved to the end of the model to avoid overfitting. Finally, the sigmoid activation in the output layer has been replaced with a softmax activation, since the application is a single-label task.
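The reason for preferring softmax in a single-label setting can be seen in a small numerical sketch. The logits below are made-up values, and the helper functions are written out here for illustration only rather than taken from the model code.

```python
import math


def sigmoid(logits):
    """Independent per-class scores in (0, 1); they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Scores forming a probability distribution over exclusive classes."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # toy output-layer activations
sig = sigmoid(logits)
soft = softmax(logits)
```

The sigmoid scores are independent and can indicate several classes at once, which suits multi-label problems. The softmax scores sum to one, which matches a task where each example belongs to exactly one class.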
The plot below describes the original model from Han et al.: Y. Han, J. Kim, and K. Lee, “Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, 2016.
The following image shows the baseline model used for Task 1 and Task 3. It is a CNN with four convolutional layers, each with a 3x3 kernel, followed by a fully connected (dense) layer. Max pooling and dropout are applied after each block, and a softmax activation is used for the output layer. The number of filters starts at 16 and doubles in each convolutional block.
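The filter doubling described above can be made concrete with a short calculation. This is a sketch under two assumptions that the text does not confirm: the input has a single channel (for example a spectrogram), and each block contains one convolutional layer with bias terms. The function names are hypothetical.

```python
def filter_plan(n_blocks=4, first_filters=16):
    """Number of filters per block, doubling from block to block."""
    return [first_filters * 2 ** i for i in range(n_blocks)]


def conv_params(in_channels, out_channels, kernel=3):
    """Trainable weights of a 2-D convolution with bias: (k*k*in + 1) * out."""
    return (kernel * kernel * in_channels + 1) * out_channels


filters = filter_plan()
in_channels = 1  # assumption: single-channel spectrogram input
params = []
for out_channels in filters:
    params.append(conv_params(in_channels, out_channels))
    in_channels = out_channels  # output channels feed the next block
```

Under these assumptions the four blocks use 16, 32, 64, and 128 filters, and most of the convolutional weights sit in the last block.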