Keywords: Speech, NLP, Data Science
The scarcity of digital products available in Hebrew is due, to a large extent, to the lack of Hebrew data corpora for training machine learning models. In particular, speech-to-text and text-to-speech services are seldom available in commercial products and not available at all for research purposes. Moreover, the next generation of NLP models will bypass the transcription stage and offer textless NLP based directly on speech and audio vocalizations.
This challenge has created a demand for modeling building blocks and data sets to be prepared in advance, so that automatic speech transcription, speech generation, and other spoken-language modeling products that support Hebrew can be developed in the near future.
The researchers are currently creating the first Modern Hebrew speech data corpus, with transcription and synchronization at the sentence level, which will enable the construction of speech recognition, modeling, and synthesis systems.
The data corpus will comprise 1,200 hours of speech including read speech, spontaneous speech, and clean expressive speech (reading, emotional and conversational speech).
The researchers are developing a dedicated recording system and software to enable rapid adjustment during recording, transcription, and automatic data synchronization.
Implementing a complete, commercial-quality system for Hebrew speech recognition and pronunciation (niqqud: a system of diacritical signs used to represent vowels or to distinguish between alternative pronunciations of letters of the Hebrew alphabet) will enable the automatic transcription of untagged information in the future.
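To illustrate why niqqud matters for pronunciation, the short Python sketch below shows two differently pronounced words that collapse to the same letter sequence once the diacritics are removed. It is only an illustration and is not part of the researchers' system; the helper name `strip_niqqud` is hypothetical, and it assumes that niqqud can be identified as Unicode nonspacing marks (category `Mn`) within the Hebrew block.

```python
import unicodedata


def strip_niqqud(text: str) -> str:
    """Remove Hebrew combining marks (niqqud and cantillation), keeping the letters.

    Hypothetical helper for illustration: it drops every nonspacing mark
    (Unicode category "Mn") in the range U+0591..U+05C7.
    """
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


# Two words with the same letters (samekh, pe, resh) but different niqqud.
sefer = "\u05E1\u05B5\u05E4\u05B6\u05E8"        # "sefer" (book): tsere, segol
sapar = "\u05E1\u05B7\u05E4\u05BC\u05B8\u05E8"  # "sapar" (barber): patah, dagesh, qamats

# Without niqqud, both collapse to the same three-letter string.
bare = strip_niqqud(sefer)
same_skeleton = bare == strip_niqqud(sapar)
```

Because everyday Hebrew text is usually written without niqqud, a speech system has to recover the intended vocalization from context, which is the kind of ambiguity the parenthetical definition above describes.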
The transcription and speech production systems will be free for academic institutions and will be provided under license to companies in the industry. The research group is open to academic collaborations and commercial use of the data set.