Improving Emotional Speech Conversion Using an Efficient Energy Function

Article type: Research article

Authors

1 PhD Student, Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran

2 Professor, Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran

3 Assistant Professor, Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran

10.22034/abmir.2026.24063.1196

Abstract

Changing the emotion conveyed in speech is an important topic in audio and speech processing. One notable challenge is computing accurate values for the main features: energy, pitch, and duration. Although effective methods have been proposed, existing energy-computation methods consider only the amplitude parameter and therefore do not, on their own, perform well for modeling speech prosody. This study shows how a new energy function can be used to model emotional speech conversion. The proposed energy computation is based on the sensitivity of the human auditory system at different frequencies. In this method, energy is computed from the distance between extrema points, which is related to both amplitude and frequency. To evaluate its performance, an emotional speech conversion system using the proposed energy function was implemented on the Persian ESD emotional speech database and compared with conventional energy-computation methods. Experimental results from a CMOS listening test show that the proposed function improves the quality of the generated emotional speech to a satisfactory degree.
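The contrast drawn above, between an amplitude-only energy and one that depends on the distance between extrema, can be illustrated with a minimal Python sketch. This is not the authors' function, whose exact definition is not given here. The names `short_time_energy`, `local_extrema`, `extrema_distance_energy`, and `sine`, and the choice of Euclidean distance in the (time, amplitude) plane, are assumptions made only for illustration.

```python
import math


def sine(freq_hz, sample_rate, n_samples, amplitude=1.0):
    """Generate a sampled sinusoid (illustrative test signal)."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]


def short_time_energy(frame):
    """Conventional short-time energy: sum of squared sample amplitudes."""
    return sum(x * x for x in frame)


def local_extrema(frame):
    """Indices of interior samples where the slope changes sign."""
    return [
        i
        for i in range(1, len(frame) - 1)
        if (frame[i] - frame[i - 1]) * (frame[i + 1] - frame[i]) < 0
    ]


def extrema_distance_energy(frame, sample_rate):
    """Hypothetical measure (an assumption, not the paper's formula):
    total Euclidean distance between consecutive extrema, with time
    measured in seconds and amplitude in signal units."""
    idx = local_extrema(frame)
    return sum(
        math.hypot((j - i) / sample_rate, frame[j] - frame[i])
        for i, j in zip(idx, idx[1:])
    )


# Two tones with equal amplitude but different frequency (assumed 8 kHz rate).
low_tone = sine(100, 8000, 800)
high_tone = sine(400, 8000, 800)
```

Under these assumptions the two tones have essentially the same short-time energy, because only amplitude enters that sum, while the extrema-based measure is larger for the higher-frequency tone, because more extrema occur per unit time.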

Title [English]

Improving the pitch of emotional speech using an efficient energy function

Authors [English]

  • Majid Nikzar 1
  • Hassan Khotanlou 2
  • Mirhossein Dezfoulian 3
1 PhD Student, Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran
2 Professor, Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran
3 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran
Abstract [English]

An important issue in speech processing is how to add or change the emotion of speech. One of the key challenges is calculating the exact values of the main features, including energy, pitch, and duration. This research shows that energy feature extraction can be improved by using a suitable energy function. The proposed method for energy calculation is based on the sensitivity of the human hearing system at different frequencies. In this method, the energy is calculated from the distance between the extrema points in the speech signal, which is related to both amplitude and frequency. To evaluate its effectiveness, a simple emotional speech conversion system was implemented using the proposed energy function on the Persian ESD emotional speech dataset, and the results were compared with the conventional energy function. The experimental results, based on the CMOS assessment, show that the proposed method produces better results than the state-of-the-art methods.
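As a rough illustration of how a CMOS (comparative mean opinion score) result is usually aggregated, the sketch below averages listener ratings on the common 7-point comparison scale from -3 to +3 and reports a normal-approximation 95% confidence half-width. The function name `cmos` and the example ratings are hypothetical and are not taken from the paper's experiment.

```python
import math
import statistics


def cmos(scores):
    """Return (mean comparative score, 95% confidence half-width).

    Each score is one listener's rating of system A relative to system B
    on a scale from -3 (much worse) to +3 (much better).
    """
    if len(scores) < 2:
        raise ValueError("at least two ratings are needed")
    for s in scores:
        if not -3 <= s <= 3:
            raise ValueError(f"rating out of range: {s}")
    mean = statistics.fmean(scores)
    # Normal approximation; a t-distribution would be more careful for few listeners.
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Made-up ratings for illustration only.
example_ratings = [1, 2, 0, 1, -1, 2, 1]
mean_score, ci_half_width = cmos(example_ratings)
```

A positive mean indicates that listeners preferred the first system on average; whether the preference is reliable depends on how the confidence interval compares with zero.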

Keywords [English]

  • Emotional Speech Transformation
  • Speech Synthesis
  • Speech Prosody Modeling
  • Sound Intensity
  • Energy Spectrum