Southern right whale picture by Brian Skerry
The continued increase in the storage capacity and battery life of acoustic recorders is allowing industry and academia to collect passive acoustic monitoring (PAM) data over ever larger spatial and temporal scales. With this glut of new data comes a problem – what do we do with all the hugely complex data collected by acoustic recorders? How do we accurately extract the highly variable vocalisations of our target species from dynamic and complex soundscapes? In the past, manual or semi-manual validation of data might have been an option (humans are still the best at pattern recognition), but with such large quantities of data this is no longer feasible. We need accurate automated algorithms, but over the past decades developing such algorithms – ones that are applicable to a wide range of environments, species and soundscape contexts – has proven extremely difficult… until recently.
Supervised machine learning is when you feed an algorithm training data (i.e. data labelled with the correct answer to the problem we are trying to solve) and it automatically constructs a classifier model that can then be used to analyse new, unlabelled data. Deep learning is a subset of machine learning which uses artificial neural networks to train classifiers. Deep learning is a popular buzzword today, and it has real benefits for acoustics compared to previous machine learning methods:
- Deep learning is scalable, i.e. its accuracy increases with the amount of training data you feed it. You can also “top up” classifiers with new training data if required.
- Another important aspect of deep learning algorithms is that they automatically extract features from data. An older machine learning algorithm might have required a list of hand-picked features, such as peak frequency, length, amplitude, etc., but a deep learning algorithm can ingest raw spectrogram images or even waveforms and work out the best features to extract for classification itself (in reality this is a little more complicated – see below).
- The final, and perhaps most important, aspect is that deep learning methods are now used everywhere – in advertising, in your photo apps, in the military, by NASA, etc. – which has meant that a massive ecosystem of code and services has been developed around deep learning technologies. This means we have free access to state-of-the-art deep learning tech (from Microsoft, Google, etc.) which has probably had more R&D spending in the last few years than the sum total of all funding in bioacoustics research, ever.
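To make the “training data in, classifier out” idea concrete, here is a deliberately tiny sketch of supervised learning. It is not deep learning – just a nearest-centroid classifier in pure NumPy – and the two features and all the numbers are invented for illustration:

```python
import numpy as np

# Toy "training data": feature vectors (peak frequency in kHz, duration in ms)
# labelled with the correct answer (1 = target call, 0 = noise transient).
X_train = np.array([[1.2, 900.0], [1.1, 850.0],   # upcall-like signals
                    [8.0, 40.0],  [9.5, 30.0]])   # broadband noise transients
y_train = np.array([1, 1, 0, 0])

def train_centroids(X, y):
    """'Training' here is just computing the mean feature vector per class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(model, x):
    """Label a new, unlabelled sample by its nearest class centroid."""
    return min(model, key=lambda label: np.linalg.norm(x - model[label]))

model = train_centroids(X_train, y_train)
print(predict(model, np.array([1.3, 880.0])))  # near the call class
```

A deep network replaces the hand-picked features and the centroid rule with learned representations, but the labelled-data-in, model-out workflow is the same.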
In the context of practical application to PAM, deep learning allows us to train highly accurate classifiers which can cope with large variation in the temporal and spectral properties of signals within complex soundscapes. In fact, the approach is so good at acoustic analysis that it’s solving age-old problems that acoustic researchers have been tackling (with little progress) for decades; one example is automatically detecting right whale calls (Shiu et al. 2020).
However, whilst deep learning has the potential to provide extremely powerful algorithms, there are caveats. Running deep learning classifiers can be very processor intensive compared to simpler detection and classification methods; they require large quantities of training data; and, like all automated algorithms, they can still be vulnerable to unexpected inconsistencies in data for which they have not been trained. In addition, despite the numerous research papers, huge code ecosystem and hype around this undoubtedly effective approach to automated acoustic analysis, training and then running deep learning classifiers is still not straightforward or accessible, usually requiring coding in Python. These technical barriers restrict the uptake of deep learning to specialised research groups, and thus its impact so far in marine acoustics has been limited.
PAMGuard and Deep Learning
PAMGuard (open-source software for passive acoustics) has always been about making the latest signal processing algorithms, for real-time and post-processing use, available and accessible to researchers. The modular structure of PAMGuard means that any new module can integrate with existing acoustic workflows, i.e. a new module capable of running deep learning models can take advantage of PAMGuard’s data management system, displays and real-time functionality, and could provide a powerful and accessible tool for running deep learning models.
A year ago, as part of a postdoc at Aarhus University, Denmark, I started looking into whether this was feasible and quickly decided it was not, mainly due to one massive hurdle: every deep learning model is different. Specifically, they are coded using different libraries, accept different types of input data and have different output formats. Creating the coding architecture around that was not something achievable for one postdoc (or even many)… that was, until Amazon stepped in.
Whether or not you like giant, non-tax-paying and often morally dubious tech companies, you have to hand it to them: they write some fantastic development tools. One such tool is Amazon’s Deep Java Library – long story short, it allows any deep learning model to be loaded using just a few lines of Java code. PAMGuard is written in Java, and so the Deep Java Library was the perfect framework for creating a new deep learning module.
So – the tools now existed to load and run any deep learning model easily, but there was still an issue – how do we figure out what the acoustic input data should look like? There’s perhaps a misconception that deep learning algorithms just work by accepting raw waveforms or spectrograms – in reality, however, the accuracy of deep learning models is greatly improved by applying a set of transforms to the raw data; this might be cropping a spectrogram, normalising, removing noise, etc. For PAMGuard’s deep learning module to work effectively, it had to replicate these steps in Java before passing data to a trained deep learning model. But every model is different (and uses different transforms), so how to deal with this without requiring hard coding in Java for every new model? Fortunately for me, Amazon wasn’t the only one who had come up with some useful deep learning tools. Two Python coding libraries (AnimalSpot and Ketos) had recently been released, each with a comprehensive framework for training deep learning models. Both provided a relatively easy-to-use coding framework to allow researchers to clean up their spectrograms and then train deep learning models. Crucially for PAMGuard, though, the models produced by these frameworks were all in the same format and contained metadata describing the type of input required. This may all sound a bit technical, but the upshot was that a Java library could now be created which could guarantee compatibility with any model trained in AnimalSpot or Ketos without any additional coding required!
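The key idea – a model shipping with metadata describing its own input transforms, which a generic runner can replay – can be sketched as follows. This is purely illustrative: the transform names and metadata layout are invented, not the actual AnimalSpot/Ketos format or PAMGuard’s Java implementation:

```python
import numpy as np

# Hypothetical metadata stored alongside a trained model: an ordered list of
# (transform name, parameters) describing how to prepare its input.
model_metadata = [
    ("spectrogram_crop", {"f_min_hz": 50, "f_max_hz": 1000}),
    ("normalise", {}),
]

def crop(spec, freqs, f_min_hz, f_max_hz):
    """Keep only the frequency rows of the spectrogram inside the band."""
    keep = (freqs >= f_min_hz) & (freqs <= f_max_hz)
    return spec[keep, :], freqs[keep]

def normalise(spec):
    """Zero-mean, unit-variance normalisation."""
    return (spec - spec.mean()) / (spec.std() + 1e-12)

def apply_transforms(spec, freqs, metadata):
    """A generic runner: applies whatever chain the metadata specifies,
    so no model-specific code is ever needed."""
    for name, params in metadata:
        if name == "spectrogram_crop":
            spec, freqs = crop(spec, freqs, **params)
        elif name == "normalise":
            spec = normalise(spec)
    return spec

freqs = np.linspace(0, 2000, 64)   # Hz values of 64 toy frequency bins
spec = np.random.rand(64, 128)     # 64 freq bins x 128 time bins
out = apply_transforms(spec, freqs, model_metadata)
print(out.shape)
```

Because the chain is data, not code, any model trained in a compliant framework arrives ready to run.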
The Raw Deep Learning Module
The first stage in developing the deep learning module was to create a library of spectrogram transforms which replicated those in both AnimalSpot and Ketos, plus some other commonly used transforms. I won’t go into detail, but the majority of the work is carried out by a new Java library, JPAM, created for the project but kept separate from PAMGuard so that it can easily be used elsewhere.
Next was to define some feature limits. What will this module do and not do? Most acoustic deep learning approaches so far have involved segmenting acoustic data into discrete chunks, applying the relevant data transforms to each chunk and then passing the chunks to a deep learning model for prediction values (i.e. the probability that a chunk of sound contains a target vocalisation). There are other approaches, but it was decided that this classifier would only accept raw acoustic data and use the segmentation approach described above.
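The segmentation approach can be sketched in a few lines. Everything here is a toy: the “model” just returns clipped RMS energy rather than a real network prediction, and the sample rate and segment sizes are made up:

```python
import numpy as np

def segment(waveform, seg_len, hop):
    """Split audio into fixed-length chunks with a hop (all in samples)."""
    starts = range(0, len(waveform) - seg_len + 1, hop)
    return [waveform[s:s + seg_len] for s in starts]

def toy_model(chunk):
    """Stand-in for a trained network: returns a 'prediction value' in [0, 1].
    Here it is just clipped RMS energy, NOT a real classifier."""
    rms = np.sqrt(np.mean(chunk ** 2))
    return float(min(rms, 1.0))

fs = 1000                        # Hz, toy sample rate
wav = np.zeros(fs * 4)           # 4 s of silence...
wav[fs:2 * fs] = 0.5             # ...with a 1 s "call" in the middle
# 1 s segments, 0.5 s hop -> one prediction value per overlapping chunk
scores = [toy_model(c) for c in segment(wav, seg_len=fs, hop=fs // 2)]
print([round(s, 2) for s in scores])
```

The chunk containing the “call” scores highest; in the real module, chunks above a user-set threshold become detections.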
The next stage was designing the module. JPAM handled most of the deep learning and transform heavy lifting, so creating the module was mainly a case of plumbing the right bits and pieces into PAMGuard and creating a UI. The main UI was built with JavaFX and is fairly straightforward: just a few controls to let users select the segment size and load a model. Models from AnimalSpot and Ketos automatically load their settings, so there’s not much more for a user to do than select the framework they are using, browse to the deep learning model file, define a minimum prediction threshold and then run through their data in PAMGuard. When running on raw acoustic data, the module continually segments the data, and the raw waveforms from any segments which pass the threshold are saved to PAMGuard files. Users can then view and export the results in PAMGuard viewer mode.
After a lot of coding, questions on GitHub and, of course, coffee, a prototype module was created and seemed to be working pretty well. However, it soon became apparent that there were two glaring problems… it could only run AnimalSpot and Ketos models, and it was SLOW.
A generic framework
It was never going to be possible to allow users to run absolutely any model; however, many of the transforms applied to segments are fairly generic and common, so a user should be able to replicate many of the transform sets required for a model using the existing transforms available. So, a third framework option was created – the so-called “Generic Model”. Using this, users can import any model and then manually define the input transforms using an additional UI. A preview of the transforms and their final input shape is available, potentially allowing an experienced user to import a non-AnimalSpot/Ketos model and get it to work in PAMGuard. To make life easier for anyone else using these models, the transforms can be saved to a file and then loaded up in another instance of PAMGuard. We tested this approach on a right whale classifier (Shiu et al. 2020) – also see the tutorial.
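Conceptually, the saved transform file is just an ordered, serialisable description of the chain, so it can travel between PAMGuard instances. A minimal sketch (the transform names and JSON layout are invented for illustration, not PAMGuard’s actual file format):

```python
import json
import os
import tempfile

# Hypothetical transform chain a user might set up in the "Generic Model" UI.
transforms = [
    {"name": "spectrogram", "fft_len": 256, "hop": 128},
    {"name": "freq_crop", "f_min_hz": 50, "f_max_hz": 500},
    {"name": "normalise_row_sum"},
]

path = os.path.join(tempfile.gettempdir(), "generic_model_transforms.json")
with open(path, "w") as f:
    json.dump(transforms, f, indent=2)   # save from one PAMGuard instance...

with open(path) as f:
    loaded = json.load(f)                # ...and load in another

print(loaded == transforms)
```

Because the chain round-trips losslessly, a second user gets an identical input pipeline without touching any code.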
Using “pre-detections” to increase processing speed
Deep learning is useful for all species, but it is slow! On a typical Intel consumer chip (without a graphics card), each prediction for a segment takes around 100 ms (around 5-10 ms if you have an NVIDIA graphics card or Apple M1 chip). As long as the segment hop is greater than 100 ms, the deep learning module will run in real time. That’s fine for a 2-second segment (e.g. right whales), but what about higher-frequency species with short calls, like bats and toothed whales (both of which I study)? Many animal calls are shorter than 100 ms, and typically we want to run at 10× real time for analysis of acoustic recordings and stable real-time operation. Thankfully the answer to this problem (without buying a lot of expensive hardware) was fairly straightforward – allow the input to the module to come from “dumb” detectors as well as raw sound data. For example, the click detector in PAMGuard detects all transients in a defined filter band. Typically the detected transients will be less than 1% of the total raw sound data. If these are input into the deep learning module, they can be segmented and predictions applied in the same way as raw data. Running a “dumb” detector at a high false-positive rate means most calls/clicks are detected and there is still a huge data reduction. You therefore get the advantage of more accurate deep learning classifiers without the large processing-time overhead. An example of this is in the bat tutorial.
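The data-reduction argument is easy to demonstrate. In this sketch a deliberately permissive energy detector (a stand-in for PAMGuard’s click detector, with made-up numbers throughout) flags 10 ms windows above an RMS threshold; only those windows would then be passed to the slow deep learning model:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 10_000
wav = rng.normal(0, 0.01, fs * 10)       # 10 s of low-level noise
for s in (2.0, 5.5, 7.25):               # three short 10 ms "clicks"
    i = int(s * fs)
    wav[i:i + fs // 100] += 0.5

def dumb_detector(wav, win, thresh):
    """Flag fixed windows whose RMS exceeds a (deliberately low) threshold."""
    hits = []
    for s in range(0, len(wav) - win + 1, win):
        if np.sqrt(np.mean(wav[s:s + win] ** 2)) > thresh:
            hits.append(wav[s:s + win])
    return hits

win = fs // 100                          # 10 ms windows
hits = dumb_detector(wav, win, thresh=0.05)
reduction = len(hits) * win / len(wav)   # fraction of data reaching the model
print(len(hits), f"{reduction:.2%}")
```

Here the slow classifier only ever sees a fraction of a percent of the raw audio, which is exactly why pre-detections make real-time deep learning feasible on short calls.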
Improvements to PAMGuard
Of course, making a whole new module and making it work well opens a whole Pandora’s box of potential updates and improvements to PAMGuard. Here is a list of features introduced with the new deep learning module.
> New display engine for spectrograms and waveforms in the Time Base Display
We used bats as an example case to build the module. Bat researchers often use a spectrogram and a waveform plot to look at calls. The Time Base Display in PAMGuard could show spectrograms from raw data, but not spectrograms or waveforms of detections. This seems easy to implement, but plotting the spectrograms and waveforms of potentially thousands of detections at different temporal scales is quite difficult. For example, imagine plotting a few thousand click detections, each with around 1000 samples… that’s a lot of points to plot! Making the display seamlessly transition between showing waveforms and spectrograms zoomed right out and zoomed right in (so you can see the individual samples) required a whole new plotting library.
The Time Base Display underwent some major changes to allow users to quickly scroll through waveforms and spectrograms of detected calls and deep learning detections. Users can now seamlessly scale from showing a few milliseconds to hours of data.
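For anyone curious, a standard trick behind this kind of multi-scale waveform plotting is min-max decimation: when zoomed out, draw only the minimum and maximum sample per screen pixel, so the plot stays visually faithful with far fewer points. A sketch of the idea (not PAMGuard’s actual plotting code):

```python
import numpy as np

def minmax_decimate(wav, n_pixels):
    """Reduce a waveform to (min, max) per screen pixel: a zoomed-out plot
    of these 2 * n_pixels values looks identical to plotting every sample."""
    bins = np.array_split(wav, n_pixels)
    return np.array([(b.min(), b.max()) for b in bins])

# 100k-sample toy waveform reduced to a 500-pixel-wide envelope
wav = np.sin(np.linspace(0, 200 * np.pi, 100_000))
envelope = minmax_decimate(wav, n_pixels=500)
print(envelope.shape)
```

When the user zooms in far enough that a pixel covers only a few samples, the display switches to drawing the raw samples directly.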
> Symbol Manager
PAMGuard can display data in a multitude of ways, for example as a stem or scatter plot, spectrogram or waveform. Data are also plotted across different displays, such as the map. There has been a unified symbol manager in PAMGuard for some years – this allows users to colour symbols consistently across displays, so that a detected dolphin whistle, for example, is the same colour plotted on a spectrogram as it is plotted on the map.
The new symbol manager means that users can plot data based on a number of properties (e.g. peak frequency, event, classification, etc.) – in this case the detected transients are plotted on an amplitude (dB) axis as scatter points and colour-coded by deep learning species probability.
> Air Mode
PAMGuard was designed for the marine environment, but it can be just as useful in terrestrial studies. It’s super annoying when dBs are referenced to 1 µPa instead of 20 µPa, microphones are called hydrophones, depth is height, etc. “Air Mode” changes the UI to reflect terrestrial acoustics and fixes all these things (yes, I did go through all of PAMGuard and change “hydrophone” to getRecieverName()).
> MATLAB tools for extracting deep learning data
The deep learning module creates its own detections, which are saved in PAMGuard binary files. The MATLAB library was updated so that these can be loaded, e.g. to plot right whale detections.
The next step is for other folks to test this module! There will inevitably be bugs, so if you find one please do send a bug report here.
- This module gives folk the ability to run any deep learning model in PAMGuard – but it’s only really useful if there are actually models available for people to use. The tutorials demonstrate right whale and Danish bat species models, but Ketos and AnimalSpot can be used to train a model for pretty much any species. It would be great if folk made these models open source and available so others can use them! A deep learning model for ADDs and military sonar might be fun… anyone?
- More frameworks – have a great open source framework for training acoustic models? Get in touch if you would like it added to PAMGuard.
- Once this module has been tested and (if it) proves useful, the next logical step in making deep learning models more accessible will be to allow users to train models within PAMGuard itself. That way users could, for example, mark out a number of detected clicks, categorise them to species and then train a deep learning model. That, however, is a big interdisciplinary job and would require some serious funding to get right. For now, AnimalSpot and Ketos are great and I encourage folk to check them out.
Availability and tutorials
The deep learning module will be available in the 2.01.06 release of PAMGuard. If you can’t wait until then, here is a link to a new installer and jar file. First use the installer to install PAMGuard and then copy the jar file into the PAMGuard programs folder (it should overwrite the current jar file). Make sure any previous versions of PAMGuard are uninstalled.
Deep learning is a super powerful tool and there are some great acoustics-focused libraries for training models. PAMGuard now provides a module to run these models in real time or in post-processing, allowing anyone to deploy deep learning models on their acoustic data. So let’s get training and testing some more deep learning models (preferably using Ketos or AnimalSpot) and remember to make them open source and available so the whole bioacoustics and conservation community benefits!
Thanks to Marie Roch and her group for sharing their right whale model.