Text
This page lists Odia text resources found on the internet that appear to be open source.
Monolingual Corpus
- Many NLP tasks require a large monolingual corpus.
- As the name suggests, a monolingual corpus contains text in a single language, in this case Odia.
- An existing monolingual corpus is available in the IndicNLP corpus, with 6.94 million sentences comprising 107 million tokens.
- For further Odia monolingual corpora, see this resource list by Dr. Shantipriya Parida.
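Sentence and token counts like the ones quoted above can be approximated for any plain-text corpus. The following is a minimal sketch, assuming sentences end with the danda (`।`), `?` or `!` and that tokens are whitespace-separated; the tokenization actually used for the IndicNLP figures may differ, and `corpus_stats` is a hypothetical helper name.

```python
import re

# Assumption: Odia sentences end with the danda "।" (U+0964), "?" or "!".
SENTENCE_END = re.compile(r"[।?!]+")


def corpus_stats(lines):
    """Return (sentence_count, token_count) for an iterable of text lines."""
    sentences = 0
    tokens = 0
    for line in lines:
        for sentence in SENTENCE_END.split(line):
            words = sentence.split()  # naive whitespace tokenization
            if words:  # skip empty fragments left over by the split
                sentences += 1
                tokens += len(words)
    return sentences, tokens


sample = ["ମୁଁ ଓଡ଼ିଆ କହେ। ତୁମେ କେମିତି ଅଛ?", ""]
print(corpus_stats(sample))
```

For a corpus stored on disk, an open file object can be passed directly, since the function only iterates over lines.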
Parallel Corpus
Dataset | License | Needs cleaning? | Estimated number of pairs | Note |
---|---|---|---|---|
Wikipedia Data dump | CC BY-SA 3.0 | Yes | 2,00,000+ | Pairs can be taken from the dumps. (Reference). |
OdiEnCorp 1.0 | CC BY-NC-SA 4.0 | Yes | 1,00,000+ | Mostly Bible data |
OdiaNLP corpus | CC BY-NC 4.0 | No | 80,437 | Generic En-Or pairs collected from many sources with multiple licenses. |
OdiEnCorp 2.0 | CC BY-NC-SA 4.0 | - | 94,000+ | Quality not checked. |
TDIL | Freeware | Yes | 52,000+ | Strings of technical terms. |
Press Information Bureau | CC BY 4.0 | Yes | 58,461 | Sentences are aligned. Contains Mann Ki Baat pairs too. |
Mann Ki Baat | CC BY 4.0 | Yes | 38,359 | High-quality translations, but cleaning is still needed. |
IndoWordnet | CC BY-SA 4.0 | - | 30,000+ | Corpus quality needs to be checked. |
OPUS | Multiple | - | 6,00,000+ | Consists of the following corpora: Wikimedia, GNOME, Mozilla, etc. |
Samanantar | CC BY-SA 4.0 | - | 10,00,000+ | One of the best available corpora as of Dec 2021. |
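Several datasets in the table are marked as needing cleaning. The following is a minimal sketch of a first cleaning pass over English-Odia pairs. It is not the procedure used by any of the listed datasets. The helper names, the `max_ratio` threshold, and the choice of filters (empty sides, target text without Odia script, extreme length ratios, exact duplicates) are illustrative assumptions.

```python
import unicodedata

# Unicode block assigned to the Odia (Oriya) script.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F


def has_odia(text):
    """True if the text contains at least one character from the Odia block."""
    return any(ODIA_START <= ord(ch) <= ODIA_END for ch in text)


def clean_pairs(pairs, max_ratio=3.0):
    """Filter (english, odia) pairs; max_ratio is an assumed, tunable threshold."""
    seen = set()
    cleaned = []
    for en, od in pairs:
        # Normalize Unicode form and collapse runs of whitespace.
        en = " ".join(unicodedata.normalize("NFC", en).split())
        od = " ".join(unicodedata.normalize("NFC", od).split())
        if not en or not od:
            continue  # one side is empty
        if not has_odia(od):
            continue  # target side contains no Odia script
        n_en, n_od = len(en.split()), len(od.split())
        if max(n_en, n_od) / min(n_en, n_od) > max_ratio:
            continue  # lengths too different, likely misaligned
        if (en, od) in seen:
            continue  # exact duplicate
        seen.add((en, od))
        cleaned.append((en, od))
    return cleaned


raw = [
    ("Hello  world", "ନମସ୍କାର ବିଶ୍ୱ"),
    ("Hello world", "ନମସ୍କାର ବିଶ୍ୱ"),  # duplicate once whitespace is collapsed
    ("", "ନମସ୍କାର"),                    # empty source side
    ("Good morning", "Good morning"),   # untranslated target
]
print(len(clean_pairs(raw)))
```

Filters like these only catch obvious noise. Translation quality itself still needs manual or model-based checking.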
Odia Transliteration
- The Odia Wikimedia community has developed an open-source Unicode converter that transliterates text from various other languages into Odia.
Languages supported
- The following languages are currently supported:
  - Assamese (Ahamiya)
  - Bengali
  - Santali
  - Hindi
  - Gujarati
  - Roman
  - Urdu
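To illustrate what script-to-script transliteration involves, here is a minimal sketch. It is not the Odia Wikimedia community's converter, whose implementation is not described here. It assumes the simple code-point offset technique, which works approximately because the Unicode blocks for Devanagari, Bengali, Gujarati and Odia share an ISCII-derived layout. The function name `to_odia` and the script keys are hypothetical.

```python
import unicodedata

# Start of each source script's Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,  # used for Hindi
    "bengali": 0x0980,     # used for Bengali and Assamese
    "gujarati": 0x0A80,
}
ODIA_START = 0x0B00
# Assumption: only map letters, vowel signs and digits (first 0x70 positions);
# the tail of each block holds script-specific symbols that do not line up.
MAPPED_RANGE = 0x70


def to_odia(text, source_script):
    """Shift characters of the source block to the same position in the Odia block."""
    base = BLOCK_START[source_script]
    out = []
    for ch in text:
        offset = ord(ch) - base
        if 0 <= offset < MAPPED_RANGE:
            candidate = chr(ODIA_START + offset)
            # Keep the mapping only if the target code point is an assigned Odia character.
            if unicodedata.name(candidate, "").startswith("ORIYA"):
                out.append(candidate)
                continue
        out.append(ch)  # leave everything else (Latin, punctuation, danda) unchanged
    return "".join(out)


print(to_odia("कमल", "devanagari"))
```

This approach cannot cover Roman, Urdu or Santali input, whose scripts do not share this layout and would need explicit mapping tables. Even for the aligned scripts, a production converter needs per-script exceptions.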
Additional Resources
- Odia-NLP-Resource-Catalog by Dr. Shantipriya Parida
- A Catalog of resources for Indian language NLP by AI4Bharat team
- OpenOdia: Tools for the Odia language
To cite this resource list, please use:
@misc{OdiaNLP,
author = {Soumendra Kumar Sahoo},
title = {Text resources by Odia NLP},
howpublished = {\url{https://www.mte2o.com/}},
year = {2021}
}
Last update:
2023-03-27