Application
The scarcity of digital products available in Hebrew is due, to a large extent, to the lack of Hebrew data corpora for training machine learning models. Specifically, speech-to-text and text-to-speech services are seldom available in commercial products and are not available at all for research purposes. Moreover, the next generation of NLP models will bypass the transcription stage and offer textless NLP based on speech and audio vocalizations.
This challenge has created a demand for model building blocks and datasets to be prepared in advance, so that automatic speech transcription, speech generation, and other spoken language modeling products can be developed for Hebrew in the near future.
Our Innovation
Our technology addresses the challenge of diacritic-free Hebrew Text-to-Speech (TTS) synthesis. It is designed to handle modern Hebrew text without requiring diacritic prediction, making it suitable for a wide range of applications. The system can be used in voice assistants, audiobook creation, accessibility tools, and other scenarios where high-quality Hebrew speech synthesis is required.
Our approach employs a Language Modeling (LM) technique that operates on discrete speech representations. It uses word-piece tokenization for text processing, which allows better handling of the complexities of the Hebrew language. The system leverages large-scale weakly supervised data, comprising approximately 5,000 hours of in-the-wild recordings. The method combines Auto-Regressive (AR) and Non-Auto-Regressive (NAR) models for efficient token prediction, resulting in high-quality speech output. Additionally, the system features rapid speaker adaptation through acoustic prompting, enhancing its versatility and usability.
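To illustrate the word-piece tokenization step mentioned above, the following is a minimal sketch of greedy longest-match-first word-piece splitting, the inference scheme commonly used by BERT-style tokenizers. It is an illustration only: the actual tokenizer and vocabulary used by the system are not described here, and the names `wordpiece_tokenize`, `tokenize`, and `TOY_VOCAB`, as well as the tiny vocabulary itself, are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into vocabulary pieces, longest match first.

    Pieces that do not start the word carry a "##" prefix, following
    the common WordPiece convention.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Tokenize whitespace-separated, diacritic-free text into word pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical toy vocabulary (assumption, not the system's real vocabulary).
TOY_VOCAB = {"ה", "ספר", "##ספר", "##ים"}

if __name__ == "__main__":
    # The undiacritized word for "the books" is split into three pieces.
    print(tokenize("הספרים", TOY_VOCAB))
```

In this sketch the text is consumed exactly as written, without diacritics, and any word that cannot be covered by the vocabulary maps to a single unknown token. In a real system the vocabulary would be learned from data rather than listed by hand.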
Advantages
- Diacritic-free operation, eliminating errors from diacritic prediction
- Superior context handling for ambiguous pronunciations
- Improved scalability and utilization of large-scale data
- Enhanced performance in naturalness and content preservation
- Flexible speaker adaptation capabilities
- End-to-end approach combining text processing and speech generation
- Outperforms baseline methods (MMS and Overflow) in objective metrics and human evaluations
- Open-source contribution, fostering further research
- Paper: https://arxiv.org/pdf/2407.12206
- Demo & code: https://github.com/slp-rl/HebTTS
Opportunity
This diacritic-free Hebrew TTS system addresses a significant market gap for high-quality Hebrew speech synthesis. It offers commercialization potential across various industries and can be extended to languages with similar challenges. Our business strategy includes:
1) Providing model improvements and support based on user feedback, ensuring continuous enhancement of the system.
2) Offering improved versions to existing customers at preferential rates, incentivizing long-term partnerships.
3) Facilitating knowledge transfer through tutorials, enabling companies to optimize such models in-house.
Opportunities exist for collaboration with tech companies, for the development of specialized applications in education, media, and accessibility, and for further research to enhance controllability and inference speed. The open-source nature of the project encourages ongoing advancements in the field of speech technology.