1 Background of the Study
1.1 Chronic Respiratory Diseases
Chronic respiratory diseases (CRDs) continue to impose a substantial health burden. The Global Burden of Disease Study 2021 estimated over 55 million incident cases, predominantly asthma and chronic obstructive pulmonary disease (COPD). The study highlights complex regional patterns in morbidity and mortality and attributes 4.4 million deaths to CRDs in 2021, although age-standardized mortality rates have declined over the past decades, reflecting some progress in disease management (Momtazmanesh et al., 2023a). Within this global context, Southeast Asia is of particular concern: CRDs account for approximately 12% of all deaths in the region. COPD and asthma contribute most to premature mortality, and environmental and socioeconomic factors, such as high exposure to outdoor air pollution and to household air pollution from biomass fuel use, pose significant risks in many countries (WHO Southeast Asia Region, 2019). In the Philippines specifically, lung disease remains a critical public health issue; a study in Nueva Ecija found that 20.8% of adults aged 40 and older have COPD, with significant associations with biomass fuel exposure and smoking (Idolor et al., 2011). The financial impact is also considerable, as shown by recent research documenting substantial out-of-pocket expenses for Filipino patients hospitalized for acute COPD exacerbations (Ang & Fernandez, 2024).
Fundamentally, chronic respiratory diseases encompass a wide range of disorders that impair respiratory and pulmonary functions, significantly affecting an individual’s breathing and oxygen exchange. These diseases include chronic obstructive pulmonary disease (COPD), asthma, interstitial lung disease, lung infections such as pneumonia, and lung cancer. Common symptoms include breathlessness, chronic cough (either productive or dry), wheezing, chest pain, and sometimes sputum production. Additional symptoms may involve fatigue, fever, and reduced exercise tolerance, which vary depending on the specific disease and its severity. The underlying causes are multifactorial, including environmental exposures like smoking and pollution, infections, and occupational hazards, all contributing to inflammation, airway obstruction, or lung tissue scarring (Singh, 2016; Mayo Clinic, n.d.; NIEHS, n.d.). Identification of abnormal lung sounds such as crackles and wheezes during auscultation plays a vital role in clinical evaluation. Crackles often indicate fluid or fibrosis in the lungs, while wheezes suggest airway narrowing or obstruction. These sounds help in early diagnosis, differentiation between diseases, and monitoring treatment response, making auscultation a cornerstone of respiratory assessment (Zimmerman, 2023). Together, these perspectives underscore the urgent need for targeted public health interventions addressing environmental exposures, healthcare access, and disease management to mitigate the ongoing burden of chronic respiratory diseases globally and regionally.
1.2 Conventional Diagnostic Approaches and Real-World Challenges
In contemporary clinical practice, physicians still rely heavily on traditional physical examination techniques—such as auscultation, percussion, palpation, and vocal resonance—as primary and accessible tools for assessing lung function. Despite their widespread use, these methods have important limitations that can reduce diagnostic accuracy, even when performed by experienced clinicians.
A major limitation of auscultation is its low sensitivity. A meta-analysis of 34 studies involving adult patients with acute pulmonary conditions found that lung auscultation had a pooled sensitivity of only 37% and a specificity of 89% (Arts et al., 2020). This indicates that auscultation may fail to detect a substantial number of true respiratory pathologies, limiting its reliability as a stand-alone diagnostic tool.
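To make these pooled figures concrete, the short sketch below computes sensitivity and specificity from confusion-matrix counts. The counts are hypothetical and were chosen only to reproduce the 37% and 89% figures; they are not data from Arts et al. (2020).

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of patients with the pathology who are correctly flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of patients without the pathology who are correctly cleared."""
    return true_neg / (true_neg + false_pos)


# Hypothetical cohort: 100 patients with pathology, 100 without.
tp, fn = 37, 63   # auscultation detects 37 and misses 63 true cases
tn, fp = 89, 11   # auscultation clears 89 and falsely flags 11 healthy cases

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
print(f"missed cases per 100 patients with pathology: {fn}")
```

Under these assumed counts, roughly six in ten true pathologies would go undetected by auscultation alone, which is what a low sensitivity paired with a high specificity implies in practice.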
Auscultation is also less accurate in mechanically ventilated patients. In a study of 200 post–cardiac surgery patients, two independent examiners (blinded to mechanical measurements) performed chest auscultation. They correctly identified decreased or absent breath sounds or crackles in only 34% of cases for examiner A and 42% for examiner B. Sensitivities were 25.1% and 36.4%, respectively, while specificities were moderately higher at 68.3% and 63.4% (Xavier et al., 2019). These findings demonstrate that auscultation may not reliably reflect underlying lung function in such patients.
Interobserver variability further limits reliability. In a longitudinal study of patients with fibrotic interstitial lung disease, nine respiratory physicians independently assessed crackles at baseline and 12 months. Agreement on the presence of crackles yielded a Fleiss’ κ of 0.57 (95% CI: 0.55–0.58), and agreement on changes in crackle intensity over time was lower (κ = 0.42, 95% CI: 0.41–0.43) (Sgalla et al., 2024). Although individual physicians were more consistent over time (intra-rater κ = 0.79–0.87), the moderate agreement between different physicians highlights persistent subjectivity in interpretation.
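For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss' κ is computed for several raters. The rating table is hypothetical (nine raters judging whether crackles are present or absent in four recordings) and is not taken from Sgalla et al. (2024).

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    raters who assigned subject i to category j. Every subject must be
    rated by the same number of raters."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement for each subject, then averaged.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_subject) / n_subjects

    # Agreement expected by chance from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical table: columns are [crackles present, crackles absent].
example_ratings = [
    [9, 0],  # all nine raters hear crackles
    [9, 0],
    [0, 9],  # all nine raters hear none
    [5, 4],  # raters split almost evenly
]
kappa = fleiss_kappa(example_ratings)
print(f"Fleiss' kappa = {kappa:.3f}")
```

A single contested recording out of four is enough to pull κ well below 1 in this toy table, which illustrates how sensitive the statistic is to disagreement on borderline cases.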
Terminology inconsistencies also contribute to diagnostic challenges. A survey of staff physicians, residents, and medical students found that only approximately 63% of staff physicians and 69% of residents correctly identified crackles, while many used incorrect terms. The study concluded that insufficient auscultation skill, rather than personal preference, was a major factor (Vasquez & Ruiz, 2020). Lack of standardized terminology can lead to miscommunication and misinterpretation among clinicians.
Other physical examination signs also have limitations. A review of patients presenting with dyspnea found that features such as asymmetric chest expansion, diminished breath sounds, egophony, bronchophony, and tactile fremitus may assist in diagnosing pneumonia or pleural effusion. However, for early-stage chronic obstructive pulmonary disease (COPD), no single physical sign demonstrated high accuracy (Shellenberger et al., 2017). Many signs are particularly insensitive in early or mild disease.
Spirometry remains an essential tool in the traditional assessment of lung function. It provides objective measurements of airflow, including forced expiratory volume in one second (FEV₁), forced vital capacity (FVC), and the FEV₁/FVC ratio, which are critical for diagnosing and staging obstructive lung diseases such as asthma and COPD (Singh et al., 2025; Agustí et al., 2023). While spirometry provides reproducible and quantitative data that physical examination alone cannot deliver, its accuracy depends on proper technique and patient cooperation. Additionally, it may be difficult to perform in acute or critically ill patients and in resource-limited settings where equipment or trained personnel are unavailable.
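As a small worked example of the ratio mentioned above, the sketch below computes FEV₁/FVC and compares it with the fixed 0.70 post-bronchodilator threshold used in the GOLD report. The patient values are hypothetical, and the check is illustrative only; it is not a diagnostic procedure.

```python
GOLD_FIXED_RATIO = 0.70  # post-bronchodilator FEV1/FVC threshold (GOLD)


def fev1_fvc_ratio(fev1_litres: float, fvc_litres: float) -> float:
    """Ratio of forced expiratory volume in 1 s to forced vital capacity."""
    if fvc_litres <= 0:
        raise ValueError("FVC must be positive")
    return fev1_litres / fvc_litres


def suggests_obstruction(fev1_litres: float, fvc_litres: float,
                         threshold: float = GOLD_FIXED_RATIO) -> bool:
    """True when the ratio falls below the chosen threshold."""
    return fev1_fvc_ratio(fev1_litres, fvc_litres) < threshold


# Hypothetical measurements in litres.
ratio = fev1_fvc_ratio(1.8, 3.0)
print(f"FEV1/FVC = {ratio:.2f}, below threshold: {suggests_obstruction(1.8, 3.0)}")
```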
Overall, traditional respiratory assessment methods face several limitations—such as the low sensitivity of auscultation, high subjectivity, variable clinician interpretation, and the reduced diagnostic value of physical signs in early or subtle disease. These constraints make it difficult to reliably detect faint or transient abnormalities, particularly early-stage crackles and wheezes that may signal evolving pulmonary pathology. As a result, there is increasing motivation to explore automated analysis systems capable of providing more objective, sensitive, and reproducible respiratory sound interpretation.
1.3 Existing Studies on Respiratory Sound Classification and Research Gaps
Automated respiratory sound classification has progressed substantially through deep learning, enabling improved detection of abnormal lung sounds such as crackles, wheezes, and mixed adventitious sounds. A recent state-of-the-art contribution is the work of Kim et al. (2025), who developed an enhanced respiratory sound classification system combining Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory networks (BiLSTMs). Their architecture processed mel-spectrogram inputs extracted from multichannel digital stethoscope recordings and used focal loss to address dataset imbalance. Evaluated on four respiratory sound categories (normal, crackles, wheezes, and rhonchi), their model achieved an accuracy of 85.7%, surpassing both medical trainees and subspecialty fellows. Their work highlights the importance of temporal modeling (via LSTM layers), spatial feature extraction (via CNN layers), and the benefits of multi-channel auscultation.
Datasets have played a critical role in advancing this field. The ICBHI 2017 Respiratory Sound Database remains the global benchmark, containing 6,898 respiratory cycles labeled as normal, crackle, wheeze, or both (ICBHI Challenge, 2017). Despite its widespread use, the dataset suffers from severe class imbalance, particularly in the mixed "both" category: 3,642 normal cycles, 1,864 crackles, 886 wheezes, and 506 mixed sounds, which together account for all 6,898 cycles. The more recent dataset by Huang et al. (2023), collected via an intelligent digital stethoscope, is likewise imbalanced. These distribution issues adversely affect model generalizability, especially in minority classes where sensitivity remains low even in high-performing models.
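One common way to counter this kind of imbalance is to weight each class inversely to its frequency. The sketch below applies that idea to the cycle counts quoted above; the weighting formula, w_c = N / (K · n_c), is an illustrative choice and not a method attributed to any of the cited studies.

```python
# Cycle counts per class as quoted in the text.
class_counts = {"normal": 3642, "crackle": 1864, "wheeze": 886, "both": 506}


def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by N / (K * n_c), so rarer classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(class_counts)
for label, weight in weights.items():
    share = class_counts[label] / sum(class_counts.values())
    print(f"{label:8s} share={share:6.1%} weight={weight:.3f}")
```

These weights could be passed to a weighted loss function so that an error on a rare "both" cycle costs more than an error on a common normal cycle.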
Existing deep-learning methodologies vary in feature representation and architectural design. A large body of work has centered on time–frequency representations, especially MFCCs and mel-spectrograms, which are then fed into CNNs or CRNNs. Rocha et al. (2020) showed that combining handcrafted features with CNN-based spectrogram learning improves classification reliability. Wang et al. (2024) systematically compared different input representations and found that mel-spectrograms consistently outperform raw audio in traditional CNN setups—although they inevitably lose micro-temporal details essential for detecting fine crackles (lasting only a few milliseconds).
Motivated by these limitations, several researchers have turned toward raw waveform–based models that process unaltered respiratory audio. Early studies (Abduh et al., 2018; Perna et al., 2018; Kochetov et al., 2018) demonstrated that 1D CNNs trained directly on waveforms can effectively detect crackles and wheezes without hand-designed features. These models capture detailed temporal signatures, including amplitude spikes and short transient events that spectrograms often smooth out. More recently, Temporal Convolutional Networks (TCNs) have gained traction. Fernando et al. (2021, 2022) showed that dilated causal convolutions can capture inhalation–exhalation dynamics, transient crackles, and sequential respiratory patterns with high temporal fidelity. Their work emphasized interpretability and robustness, demonstrating improved performance under varied breathing patterns.
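The dilated causal convolution at the heart of a TCN can be illustrated in a few lines of plain Python. This is a simplified sketch, not the implementation of Fernando et al.: it uses one convolution per level, a hand-picked kernel, and a toy impulse signal, all of which are assumptions made for illustration.

```python
def causal_dilated_conv(signal: list[float], kernel: list[float],
                        dilation: int) -> list[float]:
    """y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so each
    output depends only on the current and past inputs (causality).
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            index = t - k * dilation
            if index >= 0:
                acc += weight * signal[index]
        output.append(acc)
    return output


def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Input samples seen by one output when one convolution is stacked
    per dilation level."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Toy input: a single transient (impulse) at sample 5, like a brief crackle.
impulse = [0.0] * 12
impulse[5] = 1.0

response = causal_dilated_conv(impulse, kernel=[1.0, 1.0, 1.0], dilation=2)
field = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])
print(response)
print(f"receptive field with dilations 1, 2, 4, 8: {field} samples")
```

The impulse never influences outputs before sample 5, and doubling the dilation at each level lets the receptive field grow quickly without adding parameters.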
Beyond medical-specific models, general-purpose lightweight raw-audio architectures such as LEAN (Choudhary et al., 2023) have introduced efficient waveform encoders that can be adapted to medical audio tasks. Although LEAN has not been directly applied to respiratory sounds, its dual-branch design (raw waveform + log-mel) provides valuable insights for building computationally efficient models.
Despite these advancements, several research gaps persist. First, many studies—including Kim et al. (2025)—focus on spectrogram-based features, leaving raw waveform models underexplored, especially in multi-class respiratory tasks. Second, most studies evaluate one model architecture, limiting fair comparison across architectures under unified training conditions. Third, although Kim et al. (2025) and others use data augmentation (e.g., noise injection, pitch shifting), explicit robustness testing across different recording conditions, devices, and auscultation sites remains limited. Fourth, severe class imbalance persists across datasets, and while techniques like focal loss or oversampling partially mitigate this, minority-class performance remains significantly lower than for normal or crackle classes. Finally, few studies systematically analyze class-wise error patterns, which are crucial for understanding real-world misclassification risks.
Given these limitations, the present study aims to systematically compare multiple architectures—including raw waveform models (RawNet-inspired, TCN variants) and spectrogram-based baselines—under a unified pipeline. This includes evaluating class-wise performance, implementing imbalance-handling strategies, and conducting controlled robustness evaluations. The study is conducted entirely in a notebook environment for feasibility within the project timeline.
1.4 Raw Waveform–Based Deep Learning Architectures for Respiratory Sound Analysis
Automated respiratory sound analysis plays a critical role in the early detection of pulmonary abnormalities such as crackles and wheezes, which are indicative of conditions including pneumonia, chronic obstructive pulmonary disease, and heart failure. Traditional approaches to lung sound classification have predominantly relied on time–frequency representations, particularly mel‑spectrograms, as input to convolutional neural networks or recurrent architectures. Such representations effectively summarize the spectral content over time, enabling models to learn frequency‑based features efficiently. Spectrogram-based models have achieved strong performance in prior research and are widely accepted for sound classification tasks (Choudhary et al., 2023). However, these representations inherently involve transformations such as the short-time Fourier transform, which may lead to the loss of subtle temporal and phase information essential for accurately identifying brief or transient events such as fine crackles.
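The temporal trade-off of the short-time Fourier transform can be shown with simple arithmetic. In the sketch below, the sample rate, window length, hop length, and the 5 ms crackle duration are all assumed values chosen for illustration; none of them comes from the cited studies.

```python
def duration_ms(n_samples: int, sample_rate_hz: int) -> float:
    """Duration in milliseconds covered by n_samples at a given rate."""
    return 1000.0 * n_samples / sample_rate_hz


# Assumed analysis settings (hypothetical, for illustration only).
SAMPLE_RATE_HZ = 16_000
WINDOW_SAMPLES = 1024   # STFT analysis window
HOP_SAMPLES = 256       # step between successive frames
CRACKLE_MS = 5.0        # assumed duration of a fine crackle

window_ms = duration_ms(WINDOW_SAMPLES, SAMPLE_RATE_HZ)
hop_ms = duration_ms(HOP_SAMPLES, SAMPLE_RATE_HZ)
crackle_samples = round(CRACKLE_MS * SAMPLE_RATE_HZ / 1000)
crackle_share_of_window = CRACKLE_MS / window_ms

print(f"window = {window_ms:.1f} ms, hop = {hop_ms:.1f} ms")
print(f"a {CRACKLE_MS:.0f} ms crackle spans {crackle_samples} raw samples")
print(f"but fills only {crackle_share_of_window:.1%} of one STFT window")
```

Under these assumptions the crackle occupies dozens of distinct samples in the raw waveform yet is averaged into a window more than ten times its length, which is the loss of micro-temporal detail described above.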
To address these limitations, the study focuses on raw audio waveform input, which preserves the complete temporal structure of lung sounds. Learning directly from raw audio enables the model to capture both short-duration crackles and longer-duration wheezes without relying on feature engineering or spectrogram approximations. This underexplored approach offers an opportunity to evaluate how much temporal fidelity and detail can be leveraged to improve the accuracy of abnormal lung sound detection.
Among deep learning architectures suitable for raw audio, three models were identified for experimentation:
RawNet – an end-to-end network originally developed for speaker verification, capable of learning embeddings directly from waveforms (Jung et al., 2019). Its architecture includes one-dimensional convolutional layers, residual connections, and feature‑map scaling (MR‑RawNet) to extract temporal features at multiple resolutions (Jung et al., 2020).
Temporal Convolutional Networks (TCNs) – employ dilated causal convolutions and residual connections to model long-range temporal dependencies efficiently. Unlike recurrent networks, TCNs can process sequences in parallel while preserving causality, which is critical when detecting sequential patterns in breathing cycles. TCNs have demonstrated strong performance in lung sound event detection, identifying inhalation, exhalation, crackles, and wheezes with high accuracy (Fernando et al., 2021, 2022).
LEAN (Light and Efficient Audio Network) – represents a lightweight yet effective approach, incorporating a wave encoder for raw audio alongside a pre‑trained log‑mel branch fused through cross‑attention mechanisms (Choudhary et al., 2023).
The combination of these three architectures allows the study to explore complementary modeling strategies for raw audio: RawNet provides end-to-end embedding extraction with multi-resolution temporal modeling, TCN captures long-range dependencies and interpretable temporal patterns, and LEAN balances efficiency with temporal and spectral feature fusion. By experimenting with these models, the study can assess not only the accuracy of abnormal lung sound detection but also how different architectures preserve and exploit temporal structure in raw audio — an aspect that remains underexplored compared with spectrogram-based approaches.
Objectives of the Study
To compare multiple deep-learning architectures for four-class respiratory sound classification, with emphasis on models that learn directly from raw lung-sound waveforms, while using time‑frequency representations as secondary benchmarks to identify the most effective model design.
To examine how hyperparameter-tuning strategies and imbalance‑handling techniques (such as class-balancing methods or augmentation approaches) influence the performance of the selected raw-audio model, with the goal of improving macro‑averaged F1-score and sensitivity for minority classes such as "wheeze" and "both".
To evaluate the generalizability of the optimized model using a hold-out test set, assessing its class-wise performance, error tendencies, and robustness to determine its suitability for real-world or clinical application.
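The evaluation metrics named in these objectives can be sketched as follows. The confusion matrix is hypothetical and serves only to show how per-class sensitivity (recall) and the macro-averaged F1-score are derived; it does not represent results from this study.

```python
def per_class_metrics(confusion: list[list[int]],
                      labels: list[str]) -> dict[str, dict[str, float]]:
    """Precision, recall (sensitivity) and F1 for each class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(row[i] for row in confusion) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics: dict[str, dict[str, float]]) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


labels = ["normal", "crackle", "wheeze", "both"]
# Hypothetical test-set confusion matrix (rows: true, columns: predicted).
confusion = [
    [90, 5, 3, 2],
    [10, 35, 2, 3],
    [6, 2, 10, 2],
    [4, 4, 2, 5],
]

results = per_class_metrics(confusion, labels)
score = macro_f1(results)
for label in labels:
    print(f"{label:8s} recall={results[label]['recall']:.2f} "
          f"f1={results[label]['f1']:.2f}")
print(f"macro F1 = {score:.3f}")
```

Because the macro average gives the rare "both" class the same weight as the normal class, weak minority-class performance lowers the score even when overall accuracy looks acceptable.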
References
Abduh, M., Moussavi, Z., & Heo, G. (2018). Automatic crackle detection using end-to-end deep learning. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2474–2477. https://doi.org/10.1109/EMBC.2018.8512801
Agustí, A., Celli, B. R., Criner, G. J., Halpin, C. F., Anzueto, A., Barnes, P., … Vogelmeier, C. F. (2023). Global Initiative for Chronic Obstructive Lung Disease 2023 Report: GOLD Executive Summary. American Journal of Respiratory and Critical Care Medicine, 207(7), 819–837. https://doi.org/10.1164/rccm.202301-0106pp
Ang, B. W., & Fernandez, L. (2024). A prospective study on direct out-of-pocket expenses of hospitalized patients with acute exacerbation of chronic obstructive pulmonary disease in a Philippine tertiary care center. BMC Pulmonary Medicine, 24(1), 184. https://doi.org/10.1186/s12890-024-03011-y
Arts, L., Lim, E. H. T., van de Ven, P. M., Heunks, L., & Tuinman, P. R. (2020). The diagnostic accuracy of lung auscultation in adult patients with acute pulmonary pathologies: a meta‑analysis. Scientific Reports, 10(1), 7347. https://doi.org/10.1038/s41598-020-64405-6
Choudhary, A., Karthik, C. R., Lakshmi, P. S., & Kumar, S. (2023). LEAN: Light and Efficient Audio Classification Network. arXiv preprint. https://arxiv.org/abs/2305.12712
Huang, Y., Li, L., Luo, Y., Xiong, Z., Li, H., Luo, J., & Wei, M. (2023). Establishment of a respiratory sound database using a digital intelligent stethoscope. Military Medical Research, 10(1), 27.
ICBHI Challenge. (2017). ICBHI 2017 Respiratory Sound Database.
Jung, J.-W., Heo, H.-S., Kim, J.-H., Shim, H.-J., & Yu, H.-J. (2019). RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. Interspeech 2019. https://www.isca-archive.org/interspeech_2019/jung19b_interspeech.html
Jung, J.-W., Kim, S.-B., Shim, H.-J., Kim, J.-H., & Yu, H.-J. (2020). Improved RawNet with feature map scaling for text‑independent speaker verification using raw waveforms. Interspeech 2020. https://www.isca-archive.org/interspeech_2020/jung20c_interspeech.pdf
Rocha, B. M., Filos, D., & Pereira, J. M. (2020). A respiratory sound classification system: Combining feature-based and deep-learning approaches. Sensors, 21(1), 57.
Srivastava, A., Gupta, S., & Sharma, R. (2025). Deep-learning-based respiratory sound analysis for detection of chronic obstructive pulmonary disease. Journal of Clinical Medicine.
Tsai, C.-F., Hsieh, C.-Y., & Hung, C.-H. (2023). Improving respiratory sound classification using Capsule Networks with MFCC representations. Microsystem Technologies.
Tzeng, Y., Huang, Y., & Chen, L. (2025). Noise‑robust deep learning methods for respiratory sound analysis: Challenges and opportunities. Electronics, 14, 2794.
Wang, X., Li, Z., Zhang, L., & Zhou, Y. (2024). Performance evaluation of deep-learning models for lung sound classification under different input representations. EURASIP Journal on Advances in Signal Processing, 2024(1), 11.
Xavier, G., Melo‑Silva, C. A., Santos, C. E. V. G., & Amado, V. M. (2019). Accuracy of chest auscultation in detecting abnormal respiratory mechanics in the immediate postoperative period after cardiac surgery. Jornal Brasileiro de Pneumologia, 45(5), e20180032. https://doi.org/10.1590/1806-3713/e20180032
Yu, Z., Zhang, F., Liu, H., & Zhao, X. (2025). Advances and challenges in respiratory sound analysis using artificial intelligence. Electronics, 14(14), 2794.