
A Python-based system for automatic word segmentation in speech using ML models like SVM, MLP, and RNN.

AimiliosKourpas/sound-signal-processing


🎵 Speech and Sound Signal Processing - Word Segmentation 🎵

University: University of Piraeus

Department: Informatics

Course: Speech and Sound Signal Processing


🎯 Project Overview

In this project, we designed a system that takes a spoken sentence and segments it into words 🗣️✂️.
The system detects where each word starts and ends without knowing in advance how many words there are, assuming only a short silence gap between words.

We also wrote a program that plays each detected word separately.
Finally, the system estimates the speaker's average pitch (fundamental frequency).
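As a rough sketch of how an average pitch could be estimated: the autocorrelation approach below and the `estimate_average_pitch` helper are illustrative assumptions, not necessarily the method the project uses.

```python
import numpy as np

def estimate_average_pitch(signal, sr, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced signal.

    Minimal autocorrelation sketch; for real speech you would frame the
    signal and average over voiced frames only.
    """
    # Autocorrelation of the zero-mean signal (non-negative lags only).
    x = signal - np.mean(signal)
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Restrict the lag search to the plausible pitch range.
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sr / best_lag

# Example: a pure 220 Hz tone should come out close to 220 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
pitch = estimate_average_pitch(tone, sr)
```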


โš™๏ธ Classifiers Implemented

The following classifiers were trained and evaluated:

  • Least Squares (LSQ)
  • Support Vector Machine (SVM)
  • Multilayer Perceptron (MLP), a three-layer neural network
  • Recurrent Neural Network (RNN)

⚡ Important Rules

  • Programming Language: Python 3.12.4 🐍
  • ❗ No CNNs, no web services, no transfer learning allowed.
  • Deliverables: PDF documentation, source code (source2023.zip), auxiliary files (auxiliary2023.zip).

๐Ÿ” How It Works

The system performs binary classification:
✅ Speech (foreground) vs ❌ Non-speech (background).

Main Steps:

  1. Extract Mel spectrograms 🎶 from sliding windows of the audio.
  2. Classify each window as speech or non-speech.
  3. Apply a median filter to smooth out small errors 🧹.
  4. Find the boundaries between words based on the cleaned-up predictions 🧩.
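Steps 3 and 4 above can be sketched as follows; the kernel size and the toy prediction stream are assumptions for illustration (step 1 would produce the per-frame features, e.g. via librosa, and step 2 the raw 0/1 predictions):

```python
import numpy as np

def smooth_and_segment(frame_preds, kernel=5):
    """Median-filter per-frame speech predictions (step 3), then return the
    (start, end) frame indices of each detected word (step 4).
    """
    k = kernel // 2
    padded = np.pad(frame_preds, k, mode="edge")
    smoothed = np.array([np.median(padded[i:i + kernel])
                         for i in range(len(frame_preds))])
    # A word boundary is wherever the smoothed label flips.
    segments, start = [], None
    for i, label in enumerate(smoothed):
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(smoothed)))
    return segments

# A noisy prediction stream: two words, each with a one-frame classifier error.
preds = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0])
segments = smooth_and_segment(preds)
```

The median filter removes the isolated flips inside each word, so exactly two segments survive.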

🧠 Classifier Details

๐Ÿ–Š๏ธ Least Squares (LSQ) & Support Vector Machine (SVM)

  • Simple linear models that output a continuous score.
  • The score is then thresholded to obtain binary speech/non-speech predictions.
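The least-squares variant can be sketched in a few lines of NumPy; the toy 2-D features below stand in for the real Mel-spectrogram vectors (an illustrative assumption):

```python
import numpy as np

# Least-squares "classifier": fit a linear map from features to +/-1 targets
# in closed form, then threshold the continuous output at 0.
rng = np.random.default_rng(0)
X_speech = rng.normal(loc=2.0, size=(100, 2))   # stand-in "speech" frames
X_noise = rng.normal(loc=-2.0, size=(100, 2))   # stand-in "background" frames
X = np.vstack([X_speech, X_noise])
X = np.hstack([X, np.ones((len(X), 1))])        # bias column
y = np.concatenate([np.ones(100), -np.ones(100)])

w, *_ = np.linalg.lstsq(X, y, rcond=None)       # closed-form least squares
pred = np.where(X @ w > 0, 1, -1)               # threshold the continuous score
accuracy = np.mean(pred == y)
```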

🧮 Multilayer Perceptron (MLP)

  • Two hidden layers: 128 and 64 neurons with ReLU activation ⚡.
  • Output layer: a single neuron with Sigmoid activation 🧠.
  • Trained with Binary Cross-Entropy Loss.
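A Keras sketch of the architecture described above; the dropout rates and the feature size `N_MELS` are assumptions, as they are not stated here:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_MELS = 40  # assumed per-frame feature size; the real value depends on extract_features

# MLP as described: 128- and 64-unit ReLU hidden layers, one sigmoid output,
# binary cross-entropy loss, with Dropout against overfitting.
model = keras.Sequential([
    layers.Input(shape=(N_MELS,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

probs = model.predict(np.zeros((4, N_MELS)), verbose=0)  # one probability per frame
```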

๐Ÿ” Recurrent Neural Network (RNN)

  • Built using TensorFlow's SimpleRNN layers ๐Ÿ”.
  • Processes sequences of frames to capture temporal dynamics.
  • Outputs one probability per time frame.
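A minimal sketch of such a model; the unit count and feature size are assumptions, but the shape of the output (one probability per time frame) matches the description:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_MELS = 40  # assumed per-frame feature size

# SimpleRNN over the frame sequence, one sigmoid probability per time step.
model = keras.Sequential([
    layers.Input(shape=(None, N_MELS)),           # variable-length frame sequences
    layers.SimpleRNN(64, return_sequences=True),  # keep one output per frame
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

probs = model.predict(np.zeros((2, 50, N_MELS)), verbose=0)
```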

📂 Datasets

Foreground (Speech) 🗣️:

  • Common Voice Corpus Delta Segment 18.0 (as of 6/19/2024) 📚.

Background (Noise / Non-speech) 🔇:

  • ESC-50 dataset (Harvard Dataverse) 🎧.

(Only ~150 folders were selected to keep the dataset manageable.)


๐Ÿ› ๏ธ Training

  • ๐Ÿง‘โ€๐Ÿซ MLP: trained with Dropout layers to prevent overfitting.
  • ๐Ÿ›ก๏ธ SVM: trained with LinearSVC from scikit-learn.
  • ๐Ÿ”ข LSQ: trained using simple matrix operations.
  • ๐Ÿ” RNN: trained using SimpleRNN layers to model sequence data.

🧪 Testing and Evaluation

  • 🎯 Tested on three WAV files: 5 seconds, 10 seconds, and 20 seconds long.
  • 📜 Each test file has:
    • a .txt file with the ground-truth words.
    • a .json file with the ground-truth timestamps.

Testing Process:

  1. Load the test WAV file.
  2. Extract Mel spectrogram features.
  3. Predict using all models.
  4. Post-process with median filtering.
  5. Detect speech segments.
  6. Compare predictions to ground-truth annotations 📝.
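The comparison in step 6 could be done by matching predicted and ground-truth word intervals; the IoU metric and the 0.5 threshold below are assumptions, since only "compare" is specified:

```python
def interval_iou(pred, true):
    """Intersection-over-union of two (start, end) time intervals in seconds."""
    inter = max(0.0, min(pred[1], true[1]) - max(pred[0], true[0]))
    union = (pred[1] - pred[0]) + (true[1] - true[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(predicted, ground_truth, iou_threshold=0.5):
    """Count predicted word segments whose best ground-truth overlap
    exceeds the IoU threshold."""
    matched = 0
    for p in predicted:
        if any(interval_iou(p, t) >= iou_threshold for t in ground_truth):
            matched += 1
    return matched

# Two correct detections plus one spurious short segment.
predicted = [(0.10, 0.55), (0.90, 1.40), (2.00, 2.10)]
ground_truth = [(0.12, 0.50), (0.95, 1.45)]
hits = match_segments(predicted, ground_truth)
```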

๐Ÿ“ Project Structure

๐Ÿ“„ File ๐Ÿ“œ Description
train.py Training script for all models
test.py Testing and evaluation script

🛒 Libraries Used

  • os 🗂️: File operations
  • numpy ➗: Math operations
  • json 📄: Handling annotation files
  • librosa 🎶: Audio processing
  • joblib 💾: Model saving/loading
  • scikit-learn 📚: ML algorithms (SVM, preprocessing)
  • tensorflow.keras 🤖: Neural networks (MLP, RNN)

🔥 Key Functions

In train.py

  • load_train_audio_clips(limit=None): Load training audio.
  • extract_features(audio_clip): Get Mel spectrograms.
  • pad_features(features, expected_frames): Pad/truncate features.
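A sketch of what `pad_features` might look like; the zero-padding choice and the (n_mels, n_frames) layout are assumptions:

```python
import numpy as np

def pad_features(features, expected_frames):
    """Pad with zeros or truncate a (n_mels, n_frames) Mel spectrogram
    so every clip yields a fixed number of frames."""
    n_mels, n_frames = features.shape
    if n_frames >= expected_frames:
        return features[:, :expected_frames]   # truncate long clips
    pad = np.zeros((n_mels, expected_frames - n_frames))
    return np.hstack([features, pad])          # zero-pad short clips

short_clip = np.ones((40, 30))
long_clip = np.ones((40, 120))
```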
  • Train and save models (MLP, SVM, LSQ, RNN).

In test.py

  • Load audio and ground-truth.
  • Predict frame-by-frame speech probability.
  • Smooth with median filter.
  • Detect segments and compare results.

✨ Final Thoughts

This project built a speech segmentation system that works without prior knowledge of the word count.
It relies on traditional machine learning and simple RNNs, with no heavy neural models such as CNNs and no external APIs 🌐🚫.

