In this project, we designed a system that takes a spoken sentence and segments it into words.
The system detects when each word starts and ends without knowing in advance how many words there are, assuming only a short silence gap between words.
Additionally, we created a program to play each detected word separately.
Finally, the system estimates the average pitch (fundamental frequency) of the speaker.
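As an illustration of average-pitch estimation, here is a minimal autocorrelation-based sketch in numpy. The method and all parameters (frame length, pitch search range, the 220 Hz test tone) are illustrative assumptions, not taken from the project's actual implementation:

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency of one frame via autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for the pitch search
    lag = lo + np.argmax(corr[lo:hi])         # strongest periodicity in range
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)  # synthetic 220 Hz "speaker"
f0 = estimate_pitch(tone[:2048], sr)  # close to 220 Hz
```

Averaging such per-frame estimates over all voiced frames gives the speaker's average pitch.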
The following classifiers were trained and evaluated:
- Least Squares (LSQ)
- Support Vector Machine (SVM)
- Multilayer Perceptron (MLP), a three-layer neural network
- Recurrent Neural Network (RNN)
- Programming Language: Python 3.12.4
- Constraints: no CNNs, no web services, no transfer learning allowed.
- Deliverables: PDF documentation, source code (source2023.zip), auxiliary files (auxiliary2023.zip).
The system performs binary classification: speech (foreground) vs. non-speech (background).
Main Steps:
- Extract Mel spectrograms from sliding windows of the audio.
- Classify each window as speech or non-speech.
- Apply a median filter to smooth out small errors.
- Find the boundaries between words based on the cleaned-up predictions.
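The last two steps can be sketched in plain numpy. The frame labels and the filter width below are made-up illustrative values:

```python
import numpy as np

def median_smooth(preds, k=5):
    """Median-filter a binary frame sequence to remove isolated flips (k odd)."""
    pad = k // 2
    padded = np.pad(preds, pad, mode="edge")
    return np.array([np.median(padded[i:i + k])
                     for i in range(len(preds))]).astype(int)

def find_segments(preds):
    """Return (start, end) frame indices of contiguous speech runs."""
    diff = np.diff(np.concatenate(([0], preds, [0])))
    starts = np.where(diff == 1)[0]
    ends = np.where(diff == -1)[0]
    return list(zip(starts, ends))

raw = np.array([0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0])
smooth = median_smooth(raw, k=5)     # the isolated flips at indices 4 and 12 vanish
segments = find_segments(smooth)     # two clean word segments remain
```

Each (start, end) pair can then be converted to seconds via the frame hop length to obtain word boundaries.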
- Simple models that output a continuous value.
- These outputs are then thresholded to obtain binary speech/non-speech predictions.
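For the LSQ case, the fit-then-threshold idea can be sketched with nothing but matrix operations. The toy 2-D features and the 0.5 threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy features: 2-D points, label 1 ("speech") if x + y > 0
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

# Least-squares fit of a linear model with a bias column
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Continuous outputs are thresholded at 0.5 for binary decisions
pred = (Xb @ w >= 0.5).astype(int)
accuracy = (pred == y).mean()
```

Despite being a regression model, the thresholded LSQ output acts as a linear classifier.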
- Two hidden layers: 128 and 64 neurons with ReLU activation.
- Output layer: a single neuron with sigmoid activation.
- Trained with binary cross-entropy loss.
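A Keras sketch of this architecture might look as follows. The layer sizes, activations, and loss match the description above; the input dimension (number of Mel bands) and the dropout rate are assumptions:

```python
import tensorflow as tf

N_MELS = 64  # assumed Mel-band count; not specified in the project text

# Two hidden layers (128 and 64 units, ReLU), dropout against overfitting,
# and a single sigmoid output trained with binary cross-entropy.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_MELS,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),   # dropout rate assumed
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The model maps one spectrogram frame to one speech probability, so it is applied independently to every sliding window.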
- Built using TensorFlow's SimpleRNN layers.
- Processes sequences of frames to capture temporal dynamics.
- Outputs one probability per time frame.
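A minimal Keras sketch of such a model is shown below. SimpleRNN and the per-frame sigmoid output follow the description above; the hidden size and the feature dimension are assumptions:

```python
import tensorflow as tf

N_MELS = 64   # assumed feature dimension per frame

# SimpleRNN over a variable-length frame sequence; return_sequences=True
# yields one output (and hence one speech probability) per time frame.
rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, N_MELS)),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),  # hidden size assumed
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
out = rnn(tf.zeros((1, 100, N_MELS)))  # a batch of one 100-frame clip
```

Unlike the MLP, the RNN sees neighboring frames, so its per-frame probabilities already incorporate temporal context.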
Foreground (Speech):
- Common Voice Corpus Delta Segment 18.0 (as of 6/19/2024).
Background (Noise / Non-speech):
- ESC-50 dataset (Harvard Dataverse).
(Selected only ~150 folders to keep things manageable.)
- MLP: trained with Dropout layers to prevent overfitting.
- SVM: trained with LinearSVC from scikit-learn.
- LSQ: trained using simple matrix operations.
- RNN: trained using SimpleRNN layers to model sequence data.
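The SVM variant, using LinearSVC from scikit-learn as stated above, can be sketched on toy data. The synthetic features and the scaling step are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# Hypothetical toy features: label 1 ("speech") if the first coordinate is positive
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

# Standardize features, then fit a linear SVM
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X, y)
acc = clf.score(X, y)
```

LinearSVC outputs a signed decision value per frame; thresholding it at zero gives the binary speech/non-speech label.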
- Tested on three WAV files: 5 seconds, 10 seconds, and 20 seconds long.
- Each test file has:
  - a .txt file with ground-truth words.
  - a .json file with ground-truth timestamps.
Testing Process:
- Load the test WAV file.
- Extract Mel spectrogram features.
- Predict using all models.
- Post-process with median filtering.
- Detect speech segments.
- Compare predictions to ground-truth annotations.
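One way to compare a detected segment against a ground-truth timestamp is intersection-over-union of the two time intervals. The metric choice and the JSON schema below are illustrative assumptions, not the project's documented format:

```python
import json

def segment_iou(pred, truth):
    """Intersection-over-union of two (start, end) time intervals in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = max(pred[1], truth[1]) - min(pred[0], truth[0])
    return inter / union if union > 0 else 0.0

# Hypothetical ground-truth schema: list of {"word", "start", "end"} records
truth = json.loads('[{"word": "hello", "start": 0.3, "end": 0.8}]')
pred = (0.35, 0.85)  # a detected segment, slightly shifted
iou = segment_iou(pred, (truth[0]["start"], truth[0]["end"]))
```

A detected segment can then be counted as correct when its IoU with some ground-truth word exceeds a chosen threshold (e.g. 0.5).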
| File | Description |
|---|---|
| train.py | Training script for all models |
| test.py | Testing and evaluation script |
- os: file operations
- numpy: math operations
- json: handling annotation files
- librosa: audio processing
- joblib: model saving/loading
- scikit-learn: ML algorithms (SVM, preprocessing)
- tensorflow.keras: neural networks (MLP, RNN)
- load_train_audio_clips(limit=None): load training audio.
- extract_features(audio_clip): get Mel spectrograms.
- pad_features(features, expected_frames): pad/truncate features.
- Train and save models (MLP, SVM, LSQ, RNN).
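For instance, pad_features could be implemented along these lines; the (n_mels, n_frames) array layout and zero-padding are assumptions about the project's internals:

```python
import numpy as np

def pad_features(features, expected_frames):
    """Zero-pad or truncate a (n_mels, n_frames) array to a fixed frame count."""
    n_mels, n_frames = features.shape
    if n_frames >= expected_frames:
        return features[:, :expected_frames]   # truncate long clips
    pad = np.zeros((n_mels, expected_frames - n_frames))
    return np.hstack([features, pad])          # zero-pad short clips

f = np.ones((64, 50))          # a hypothetical 50-frame Mel spectrogram
padded = pad_features(f, 80)   # padded out to 80 frames
```

Fixing the frame count this way lets clips of different lengths share one feature matrix for training.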
- Load audio and ground-truth.
- Predict frame-by-frame speech probability.
- Smooth with median filter.
- Detect segments and compare results.
This project built a speech segmentation system that works without prior knowledge of the word count.
It uses traditional machine learning and simple RNNs, without heavy neural network models such as CNNs or any external APIs.