This page lists resources for Odia text found on the internet that are probably open source.

Monolingual Corpus

  • Many NLP tasks require a large monolingual corpus.
  • As the name suggests, a monolingual corpus stores text in only one language, in this case Odia.
  • An existing monolingual corpus is available in the IndicNLP corpus, with 6.94 million sentences comprising 107 million tokens.
  • See also this resource list by Dr. Shantipriya Parida for further monolingual corpora in Odia.
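
The sentence and token counts quoted above can be reproduced approximately for any monolingual file. The following is a minimal sketch, assuming one sentence per line and whitespace tokenization (both are assumptions, not the method used by the IndicNLP corpus). The helper names `odia_ratio` and `corpus_stats` and the `min_ratio` threshold are hypothetical; the only fixed fact used is that the Odia (Oriya) Unicode block spans U+0B00 to U+0B7F.

```python
# Odia (Oriya) Unicode block: U+0B00 to U+0B7F.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F


def odia_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Odia block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    odia = sum(1 for c in chars if ODIA_START <= ord(c) <= ODIA_END)
    return odia / len(chars)


def corpus_stats(lines, min_ratio=0.5):
    """Count sentences and whitespace tokens among lines that are mostly Odia.

    Assumes one sentence per line; `min_ratio` is an illustrative threshold.
    """
    sentences = 0
    tokens = 0
    for line in lines:
        line = line.strip()
        if line and odia_ratio(line) >= min_ratio:
            sentences += 1
            tokens += len(line.split())
    return {"sentences": sentences, "tokens": tokens}
```

Passing an open file object (for example `corpus_stats(open(path, encoding="utf-8"))`, where `path` is whatever corpus file you downloaded) streams the file line by line, so memory use stays small even for large corpora.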

Parallel Corpus

| Dataset | License | Needs cleaning? | Estimated number of pairs | Note |
|---|---|---|---|---|
| Wikipedia Data dump | CC BY-SA 3.0 | Yes | 2,00,000+ | Pairs can be taken from the dumps. (Reference). |
| OdiEnCorp 1.0 | CC BY-NC-SA 4.0 | Yes | 1,00,000+ | Mostly Bible data. |
| OdiaNLP corpus | CC BY-NC 4.0 | No | 80,437 | Generic En-Or pairs collected from many sources with multiple licenses. |
| OdiEnCorp 2.0 | CC BY-NC-SA 4.0 | - | 94,000+ | Quality not checked. |
| TDIL | Freeware | Yes | 52,000+ | Technical term strings. |
| Press Information Bureau | CC-BY-4.0 | Yes | 58,461 | Sentences are aligned. Contains Mann Ki Baat pairs too. |
| Mann ki baat | CC-BY-4.0 | Yes | 38,359 | High-quality translation and much more. However, cleaning needs to be done. |
| IndoWordnet | CC BY-SA 4.0 | - | 30,000+ | Corpus quality needs to be checked. |
| OPUS | Multiple | - | 6,00,000+ | Consists of the following corpora: Wikimedia, GNOME, Mozilla, etc. |
| Samanantar | CC BY-SA 4.0 | - | 10,00,000+ | One of the best available corpora as of Dec 2021. |
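
Several of the datasets above are marked as needing cleaning. What that involves depends on the dataset, but a first pass often looks like the sketch below. It assumes the pairs are already loaded as `(english, odia)` tuples; the function name `clean_pairs`, the `max_len_ratio` parameter, and the specific checks are illustrative assumptions, not a procedure prescribed by any of the listed corpora.

```python
def clean_pairs(pairs, max_len_ratio=3.0):
    """Yield (english, odia) pairs that pass a few simple sanity checks."""
    seen = set()
    for en, od in pairs:
        # Collapse runs of whitespace and trim both sides.
        en, od = " ".join(en.split()), " ".join(od.split())
        if not en or not od:
            continue  # one side is empty
        if not any(0x0B00 <= ord(c) <= 0x0B7F for c in od):
            continue  # target side contains no Odia characters
        n_en, n_od = len(en.split()), len(od.split())
        if max(n_en, n_od) / min(n_en, n_od) > max_len_ratio:
            continue  # token counts differ too much; likely misaligned
        if (en, od) in seen:
            continue  # exact duplicate
        seen.add((en, od))
        yield en, od
```

The function is a generator, so `list(clean_pairs(raw_pairs))` materializes the result. The length-ratio threshold of 3.0 is a guess; inspect what it rejects on a sample before applying it to a whole corpus.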

Odia Transliteration

  • The Odia Wikimedia community has developed an open-source Unicode converter that transliterates text from various languages into Odia.

    Languages supported
    • The following languages are currently supported:
      • Ahamiya (Assamese)
      • Bengali
      • Santali
      • Hindi
      • Gujarati
      • Roman
      • Urdu
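
To give a feel for how script-to-script transliteration can work, here is a minimal sketch of one well-known technique: shifting code points between Unicode blocks. The Devanagari, Bengali, Gujarati, and Odia blocks are laid out in parallel, so most letters sit at the same offset within their block. This is not the Wikimedia converter's implementation, which is not described here; the function name `to_odia` and the `BLOCK_START` table are hypothetical. The approach only covers scripts with a parallel layout, so Roman, Urdu, and Santali would need a different method, and it does not handle language-specific spelling conventions.

```python
import unicodedata

# First code point of each source script's Unicode block.
BLOCK_START = {
    "hindi": 0x0900,     # Devanagari
    "bengali": 0x0980,   # Bengali script (also used for Assamese)
    "gujarati": 0x0A80,  # Gujarati
}
ODIA_START = 0x0B00      # Odia (Oriya) block
BLOCK_SIZE = 0x80        # each of these blocks spans 128 code points


def to_odia(text: str, source: str) -> str:
    """Shift characters from the source script's block into the Odia block."""
    start = BLOCK_START[source]
    out = []
    for ch in text:
        offset = ord(ch) - start
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(ODIA_START + offset)
            # Keep the original character when Odia has no assigned
            # code point at that offset.
            if unicodedata.name(candidate, None) is not None:
                ch = candidate
        out.append(ch)
    return "".join(out)
```

For example, `to_odia("कमल", "hindi")` maps each Devanagari letter to the Odia letter at the same offset. Characters outside the source block, such as Latin letters and spaces, pass through unchanged.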

Additional Resources

To cite this resource list, please use:

    author       = {Soumendra Kumar Sahoo},
    title        = {Text resources by Odia NLP},
    howpublished = {\url{}},
    year         = {2021}

Last update: 2023-03-27