Psychoacoustics modelling and the recognition of silence in recorded speech

Wilson, Derek

Please use this identifier to cite or link to this item: http://theses.ncl.ac.uk/jspui/handle/10443/5733

Title:	Psychoacoustics modelling and the recognition of silence in recorded speech
Authors:	Wilson, Derek
Issue Date:	2018
Publisher:	Newcastle University
Abstract:	Over many years, a variety of computer models intended to capture the essential differences between silence and speech have been investigated; nonetheless, research into a different audio model may provide fresh insight. Inspired by the unsurpassed human capability to differentiate between silence and speech under virtually any conditions, a dynamic psychoacoustics model was evaluated within a two-stage binary speech/silence non-linear classification system. The model has a temporal resolution an order of magnitude greater than that of the typical Mel Frequency Cepstral Coefficients model, and it implements simultaneous masking around the most powerful harmonic in each of 24 Bark frequency bands. The first classification stage (deterministic) provided training data for the second stage (heuristic), which was implemented using a Deep Neural Network (DNN). It is authoritatively asserted in the literature, in the context of speech processing and DNNs, that performance improvements obtained with a ‘standard’ speech corpus do not always generalise. Accordingly, six new test cases were recorded; as this corpus implicitly included frequency normalisation, it was feasible to assess whether the solution generalised, and all of the test cases could be successfully processed by any of the six trained DNNs. In other tests, the performance of the two-stage silence/speech classifier was found to exceed that of the silence/speech classifiers discussed in the Literature Review. However, the Split Sample Technique for neural net training did not always identify the optimal trained network; to correct this, an additional step in the training process was devised and tested.
Overall, the results conclusively demonstrate that the combination of the dynamic psychoacoustics model with the two-stage binary speech/silence non-linear classification system provides a viable alternative to existing methods of detecting silence in speech.
Description:	PhD Thesis
URI:	http://hdl.handle.net/10443/5733
Appears in Collections:	School of Computing

Files in This Item:

File	Description	Size	Format
Wilson D 2018.pdf		4.75 MB	Adobe PDF
dspacelicence.pdf		43.82 kB	Adobe PDF
