Dzongkha Translation

Limitations

The Dzongkha translation model was trained by fine-tuning the NLLB-200-distilled-600M model and has shown promising results. However, there are some limitations to consider:
1. Limited Training Data: Translation quality depends heavily on a diverse and substantial training set; however, the Dzongkha model was trained on only a 30K-sentence parallel corpus, owing in part to the lack of a high-end training system.
2. Domain-specific Translation: While the model performs reasonably well on general Dzongkha translation tasks, it may be less reliable for specific domains or specialized terminology, such as vocabulary used in hospitals, parliament, or courts.
3. Contextual Understanding: The model handles sentence structure and grammar to some extent, but as input length grows it may fail to capture the broader context of a conversation or text, and it can struggle with phrasal verbs and idioms.
4. Limited Tokens: The base NLLB model supports sequences of up to 512 tokens, but the Dzongkha model was fine-tuned with a maximum length of 128 tokens. Consequently, it may not translate longer sentences accurately, and longer inputs also increase inference latency (see the inference sketch after this list).
5. Out-of-vocabulary Words: Users may enter words and phrases that were not seen during training. When it encounters such out-of-vocabulary words, the model may produce a suboptimal translation or fail to translate them accurately.
Overall, while the fine-tuning of the NLLB-200-distilled-600M model for Dzongkha translation is a significant step forward, these limitations need to be acknowledged and addressed to ensure more robust and reliable translations in the future.
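
The following is a minimal inference sketch, not the project's published code. It assumes the fine-tuned checkpoint is saved locally at a hypothetical path, and it shows how the 128-token limit from point 4 shows up in practice. eng_Latn and dzo_Tibt are the FLORES-200 language codes NLLB uses for English and for Dzongkha (Tibetan script).

    # Minimal sketch using Hugging Face transformers; "./dzo-nllb-finetuned"
    # is a hypothetical local path, not a published checkpoint name. The base
    # model is facebook/nllb-200-distilled-600M.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    checkpoint = "./dzo-nllb-finetuned"  # hypothetical fine-tuned checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    text = "Good morning."
    # Truncate to 128 tokens to match the fine-tuning setup described above;
    # anything beyond that limit is simply cut off before translation.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

    # Force the decoder to start in Dzongkha; dzo_Tibt is its FLORES-200 code.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("dzo_Tibt"),
        max_length=128,
    )
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

Swapping src_lang and the forced BOS token gives the Dzongkha-to-English direction, subject to the same 128-token limit.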

Developers

Karma Wangchuk

Guide

Karma Wangchuk is an Associate Lecturer in the Information Technology Department, College of Science and Technology. He holds a Master of Engineering in Computer Engineering with specializations in image processing, NLP, big data, and machine learning.

Dodrup Wangchuk Sherpa

Developer

Dodrup Wangchuk Sherpa is a student at the College of Science and Technology, pursuing an undergraduate degree in Information Technology (2019-2023).

Thinley Norbu

Developer

Thinley Norbu is a student at the College of Science and Technology, pursuing an undergraduate degree in Information Technology (2019-2023).

Sonam Yangchen

Developer

Sonam Yangchen is a student at the College of Science and Technology, pursuing an undergraduate degree in Information Technology (2019-2023).