3 Implementation

In Phase 2 we finalized the preprocessing pipeline, which standardized inputs to fixed durations (6-second and 20-second segments) and established strategies for class balancing. Phase 3 then turned to the model architectures. Earlier tests had shown that the Temporal Convolutional Network (TCN) outperformed LSTM-based models on our sequential data, with better gradient stability and faster computation. Building on this, our main task was to implement the planned comparison between the standard TCN and the spiking TCN-SNN in PyTorch.

We began by building the standard TCN as a baseline. To respect the sequential nature of audio, we implemented causal convolutions that allow the model to use only past and current information, simulating real-time lung sound analysis. We also applied dilated convolutions so the network could capture both short-duration crackles and long-duration wheezes without requiring computationally heavy recurrent structures. The network consists of three convolutional blocks with channel sizes of 32, 64, and 128, which progressively extract higher-level features from MFCCs or Mel-Spectrograms.
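The baseline described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: only the causal, dilated design and the 32/64/128 channel sizes come from the text. The kernel size of 3, the dilations of 1, 2, and 4, one convolution per block, 40 input coefficients, 4 output classes, and mean pooling over time are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t depends only on inputs at times <= t."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        # Pad only on the left by (k - 1) * d so no future frame is visible.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))


class TCN(nn.Module):
    """Three causal blocks with 32, 64, and 128 channels (sizes from the text)."""

    def __init__(self, n_features=40, n_classes=4, kernel_size=3,
                 channels=(32, 64, 128)):
        super().__init__()
        layers, in_ch = [], n_features
        for i, out_ch in enumerate(channels):
            # Assumed dilation schedule: 1, 2, 4 (doubles with each block).
            layers.append(CausalConv1d(in_ch, out_ch, kernel_size, dilation=2 ** i))
            layers.append(nn.ReLU())
            in_ch = out_ch
        self.blocks = nn.Sequential(*layers)
        self.head = nn.Linear(in_ch, n_classes)

    def forward(self, x):
        h = self.blocks(x)                 # (batch, 128, time)
        return self.head(h.mean(dim=-1))   # assumed: average over time, then classify


model = TCN()
logits = model(torch.randn(2, 40, 100))    # 2 clips, 40 coefficients, 100 frames
```

Because each layer pads only on the left, the sequence length is preserved and changing a later frame cannot alter the features computed for earlier frames, which is the property that makes the model usable in a streaming setting.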

Next, we developed the spiking TCN (TCN-SNN). Unlike the standard TCN, this model communicates via discrete spikes rather than continuous activations. To ensure a fair comparison, we kept the TCN backbone identical in both models so that any performance difference could be attributed to the spiking mechanism itself. We used Leaky Integrate-and-Fire (LIF) neurons with a short simulation window of six time steps to reduce latency. This keeps inference cost low while still mimicking key aspects of biological spiking behavior. Because the spike function is a hard threshold with no useful gradient, we applied surrogate gradients during training so that the network could be optimized with standard backpropagation.
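To make the spiking mechanism concrete, the sketch below shows one common way to write an LIF layer with a surrogate gradient in plain PyTorch. Only the LIF neuron type and the six-step window come from the text. The fast-sigmoid surrogate, its slope of 10, the decay factor `beta = 0.9`, the threshold of 1.0, the soft reset, and the names `SpikeFn` and `lif_rate` are assumptions for illustration, and the project may well have used a dedicated SNN library instead.

```python
import torch


class SpikeFn(torch.autograd.Function):
    """Hard threshold in the forward pass, smooth surrogate in the backward pass."""

    @staticmethod
    def forward(ctx, v_minus_threshold):
        ctx.save_for_backward(v_minus_threshold)
        return (v_minus_threshold > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        slope = 10.0  # assumed steepness of the fast-sigmoid surrogate
        return grad_output / (1.0 + slope * v.abs()) ** 2


def lif_rate(current, num_steps=6, beta=0.9, threshold=1.0):
    """Drive LIF neurons with a constant input and return their mean firing rate."""
    v = torch.zeros_like(current)
    spikes = []
    for _ in range(num_steps):
        v = beta * v + current                 # leaky integration
        s = SpikeFn.apply(v - threshold)       # emit 0/1 spikes
        v = v - s.detach() * threshold         # soft reset after a spike
        spikes.append(s)
    return torch.stack(spikes).mean(dim=0)     # fraction of steps that fired


current = torch.tensor([0.0, 0.3, 1.5], requires_grad=True)
rate = lif_rate(current)
rate.sum().backward()                          # gradients flow via the surrogate
```

In a TCN-SNN built this way, a function like `lif_rate` would stand in for the ReLU after each convolution, with the backbone left unchanged, and the spike rate over the six steps would serve as the layer's output.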

With both architectures defined, we standardized the training process using the Adam optimizer with a learning rate of 1e-3 and Cross-Entropy Loss. We also included evaluation metrics such as accuracy, confusion matrices, and ROC curves to track not only whether the models learned, but how they learned. This is particularly important for testing whether the TCN-SNN handles background noise more effectively, potentially resulting in fewer false positives.
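A shared training and evaluation routine along these lines might look like the sketch below. The Adam optimizer, the learning rate of 1e-3, and the cross-entropy loss come from the text. The helper names, the stand-in linear model, the random data, the batch size, and the two-class setup are hypothetical placeholders so that the example runs on its own.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, criterion):
    """Run one pass over the data and return the mean training loss."""
    model.train()
    total_loss, n_seen = 0.0, 0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * y.size(0)
        n_seen += y.size(0)
    return total_loss / n_seen


@torch.no_grad()
def confusion_matrix(model, loader, n_classes):
    """Rows are true classes, columns are predicted classes."""
    model.eval()
    cm = torch.zeros(n_classes, n_classes, dtype=torch.long)
    for x, y in loader:
        pred = model(x).argmax(dim=1)
        for t, p in zip(y.tolist(), pred.tolist()):
            cm[t, p] += 1
    return cm


torch.manual_seed(0)
# Stand-in model and random data; either architecture would be dropped in here.
model = nn.Sequential(nn.Flatten(), nn.Linear(40 * 50, 2))
data = [(torch.randn(8, 40, 50), torch.randint(0, 2, (8,))) for _ in range(4)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

losses = [train_one_epoch(model, data, optimizer, criterion) for _ in range(5)]
cm = confusion_matrix(model, data, n_classes=2)
accuracy = cm.diag().sum().item() / cm.sum().item()
```

Running both models through the same two functions keeps the comparison fair, and the off-diagonal entries of `cm` show directly how many false positives each model produces.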

We then ran initial experiments to identify the best input configuration. With Mel-Spectrograms, both models struggled, with accuracy close to chance even after extensive tuning. The high dimensionality of Mel-Spectrograms likely overwhelmed the models with background noise. In contrast, MFCCs produced a dramatic improvement, reaching a peak accuracy of 93%, likely because they compress spectral information and reduce interference from non-essential background sounds. We also compared input durations and found that 20-second segments performed better than 6-second slices. Longer segments capture multiple breathing cycles, giving the TCN sufficient context to model long-term dependencies, while shorter segments often lacked enough information for accurate classification.
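To give a sense of how much more temporal context a 20-second input provides, the sketch below pads or crops a recording to a fixed duration and counts the resulting feature frames. The 16 kHz sample rate and the hop length of 512 samples are hypothetical values, not settings from this project, and the frame count assumes a centered short-time transform, which yields `1 + n_samples // hop_length` frames.

```python
def fix_duration(samples, sample_rate, seconds):
    """Crop or zero-pad a sequence of samples to exactly `seconds` of audio."""
    target = int(sample_rate * seconds)
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def num_frames(seconds, sample_rate=16000, hop_length=512):
    """Number of feature frames for a centered short-time transform."""
    n_samples = int(seconds * sample_rate)
    return 1 + n_samples // hop_length


short_frames = num_frames(6)    # 188 frames under the assumed settings
long_frames = num_frames(20)    # 626 frames under the assumed settings
```

Under these assumptions the 20-second input gives the network more than three times as many frames per example, which is consistent with the observation that it spans several breathing cycles.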

Based on these results, we chose to proceed with MFCC features and 20-second inputs. However, we observed some overfitting in later training epochs, so the next phase will focus on systematic hyperparameter tuning, including adjusting the number of layers, kernel sizes, and regularization settings. These adjustments will be applied to both the standard TCN and the TCN-SNN to ensure a fair and rigorous comparison between the two architectures.