OdiaNLP

English-Odia Parallel corpus

Overview

80,437 parallel text pairs, each an English text followed by its Odia translation, can be downloaded from our NMT model repo.
Parallel pairs have been collected from many sources by many volunteers.
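
As a rough illustration of working with pairs where each English text is followed by its Odia translation, the sketch below groups alternating lines into (English, Odia) tuples. The alternating-line layout and the `read_pairs` helper are assumptions made for this example; the actual file format in the NMT model repo may differ.

```python
from typing import Iterable, List, Tuple


def read_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Group non-empty lines two at a time into (english, odia) pairs."""
    # Assumption: English and Odia lines alternate; blank lines are ignored.
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Tiny in-memory sample standing in for a downloaded file.
sample = ["Hello", "ନମସ୍କାର", "", "Thank you", "ଧନ୍ୟବାଦ"]
pairs = read_pairs(sample)
for english, odia in pairs:
    print(english, "->", odia)
```

Raising on an odd number of lines is a deliberate choice here: a silently dropped line would shift every later pair out of alignment.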

Sources

Wikipedia Data dump
Open Parallel Corpus
OdiEnCorp 1.0
TDIL - Technical strings, 52,000 pairs (data needs to be cleaned)
Global Voices - 328 sentence pairs
Mann ki baat - 1000+ pairs
Twitter:DoctorBabu - Around 100 Botanical terms En-Or pairs
Rupesh Ranjan Panda - Around 300 generic En-Or pairs
Krishna Kabi - 186 En-Or pairs
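
Some of the sources above still need cleaning before use. The following is a minimal sketch of one possible cleaning pass, not the project's actual procedure: it normalizes whitespace, drops pairs with an empty English side or no Odia-script characters on the Odia side, and removes duplicates. The helper names `has_odia` and `clean_pairs` are hypothetical, and the script check assumes the Odia Unicode block (U+0B00 to U+0B7F).

```python
from typing import Iterable, List, Tuple


def has_odia(text: str) -> bool:
    """Return True if the text has at least one character in the Odia Unicode block."""
    return any("\u0b00" <= ch <= "\u0b7f" for ch in text)


def clean_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Normalize whitespace, drop unusable pairs, and de-duplicate (order preserved)."""
    seen = set()
    kept = []
    for english, odia in pairs:
        # Collapse runs of whitespace and trim both sides.
        english = " ".join(english.split())
        odia = " ".join(odia.split())
        # Skip pairs with no English text or no Odia-script characters.
        if not english or not has_odia(odia):
            continue
        # Treat English case differences as duplicates (an assumption).
        key = (english.lower(), odia)
        if key in seen:
            continue
        seen.add(key)
        kept.append((english, odia))
    return kept


raw = [
    ("  Hello ", "ନମସ୍କାର"),
    ("hello", "ନମସ୍କାର"),        # duplicate after normalization
    ("Leaf", "leaf"),             # Odia side is not in Odia script
    ("", "ଧନ୍ୟବାଦ"),              # missing English side
    ("Thank  you", "ଧନ୍ୟବାଦ"),
]
cleaned = clean_pairs(raw)
print(len(raw), "raw pairs ->", len(cleaned), "cleaned pairs")
```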

Additional links

Extracting Parallel-text pairs from Wikipedia

Odia Monolingual corpus

Monolingual Odia data has been extracted from Wikipedia.
You can use this repo to fetch the latest dataset.
A ready-made monolingual corpus (~17,000 Wikipedia articles), created by Gaurav, can be found on Kaggle.

Odia dictionary

The dictionary data has been extracted from Odia Purnachandra Bhashakosha.
The source code repository for the dataset is OdiaNLP/dictionary.

Last update: 2023-03-27