Possible projects¶

This list contains proposals for projects that can be started for the Odia language. It was created with students, research scholars, and hobbyists in mind, who can learn and carry out these projects with only a little knowledge of the domain.

Monolingual corpus
Language detector
Word tokenizer
Word2vec preparation
Word similarity
Sentence tokenizer
Random sentence generator
Sentence similarity
Stemming
Synonym
Part of Speech tagging
Dependency parse tree creation
Lemmatization
Sentiment analysis
Text classification
Co-reference resolution
Word(s) based sentence creation
Random quotation generator
Author specific Quotation generation
Syntactically sentence correction
Correct punctuation marks
Autocomplete
Spell corrector
Sentence completion
Automatic speech recognition (major project)
Text Summarization
Optical Character Recognition

Monolingual corpus¶

Please check the resources page for existing corpus details.

Language detector¶

Given a text string, detect its language. The detector should identify whether the given text is in Odia.
An existing language detector can be found in the OpenOdia project.
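
As a starting point, Odia text can be recognized from its script alone, because Odia characters occupy their own Unicode block (U+0B00 to U+0B7F). The sketch below is a minimal illustration of that idea and is not the OpenOdia implementation; the function names and the 0.5 threshold are assumptions chosen for the example.

```python
# Minimal script-based check: what share of the letters in a string
# belongs to the Odia (Oriya) Unicode block, U+0B00..U+0B7F?
ODIA_START, ODIA_END = 0x0B00, 0x0B7F


def is_odia_char(ch: str) -> bool:
    """Return True if the character lies in the Odia Unicode block."""
    return ODIA_START <= ord(ch) <= ODIA_END


def odia_ratio(text: str) -> float:
    """Fraction of letter-like characters that are Odia characters."""
    # Odia vowel signs are combining marks, so they are counted through
    # the block check; letters of other scripts via str.isalpha().
    relevant = [ch for ch in text if is_odia_char(ch) or ch.isalpha()]
    if not relevant:
        return 0.0
    return sum(is_odia_char(ch) for ch in relevant) / len(relevant)


def is_odia(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: mostly Odia characters means Odia."""
    return odia_ratio(text) >= threshold


print(is_odia("ଓଡ଼ିଆ ଭାଷା"))
print(is_odia("Hello world"))
```

A script check like this cannot tell Odia apart from other languages written in the Odia script, and it says nothing about romanized Odia, so it is best treated as a first filter.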

Existing Algorithms¶

Google language detector

Part of speech tagging¶

Given a sentence, find the parts of speech present in that sentence.
A part of speech can be a verb, noun, adjective, pronoun, preposition, etc.

Stemming¶

Initial rough list of suffixes

["ଲେ", "ଠୁ", "ର", "ରେ", "ଟି", "ଟେ", "ଟା", "ୁଥିଲେ"]
["ଥିଲେ", "ଥିଲ", "ଥିଲୁ", "ଥିଲି", "ଉଛେ", "ଉଛ", "ଉଛୁ", "ଉଛି", "ଇଛେ", "ଇଛ", "ଇଛୁ", "ଇଛି", "ଅଛେ", "ଅଛ", "ଅଛୁ", "ଅଛି", "ସିଛେ", "ସିଛ", "ସିଛୁ", "ସିଛି", "ଅନ୍ତେ", "ଅନ୍ତ", "ଅନ୍ତୁ", "ଅନ୍ତି", "ଇଲେ", "ଇଲ", "ଇଲୁ", "ଇଲି", "ଇବେ", "ଇବ", "ଇବୁ", "ଇବି", "ଥିବେ", "ଥିବ", "ଥିବୁ", "ଥିବି", "ଟାକୁ", "ଟାକେ", "ଟିର", "ଟିରେ", "ଟିଏ", "ମାନେ", "ଗୁଡ଼ା"]
["ଗୁଡ଼ାଏ", "ଗୁଡ଼ାକ"]

Largest substring approach¶

One approach is to remove the longest matching suffix from each word, as shown in this code by Mohit for the Hindi language.
In Odia, however, stripping a fixed set of suffixes can discard critical information from the sentence.
For example, suffixes like uthilu and uthibe indicate the tense of the sentence, that is, whether it is in the past, present, or future.
There will also be exceptions throughout the process, so a generic set of suffixes cannot be used for stemming.
Therefore, a better method needs to be found.
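
For reference, the longest-suffix approach can be sketched in a few lines. This is an illustrative sketch only, not Mohit's code: `SUFFIXES` is a small subset of the rough suffix list in this section, and the function name and the `min_stem_len` guard are assumptions added for the example.

```python
# A few suffixes taken from the rough list in this section (subset only).
SUFFIXES = ["ଲେ", "ଠୁ", "ର", "ରେ", "ଟି", "ଟେ", "ଟା", "ଥିଲେ", "ମାନେ"]


def strip_longest_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest suffix that matches the end of the word.

    min_stem_len (counted in code points) is a hypothetical guard that
    keeps very short words from being reduced to almost nothing.
    """
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No suffix matched: return the word unchanged.
    return word


print(strip_longest_suffix("ପିଲାମାନେ"))
```

Note that stripping a suffix such as "ଥିଲେ" also removes the tense marker, which is the information loss described above.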

Existing work¶

A Suffix Stripping Algorithm for Odia Stemmer, by Utkal University, with 88% accuracy.
Design of Lightweight Stemmer for Odia Derivational Suffixes, by Government College of Engineering, Keonjhar, with 85% accuracy.

Odia text summarization¶

An existing extractive, word-frequency-based text summarizer is implemented in the OpenOdia project.
Extractive text summarization
Automatic Text Summarization for Oriya Language
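
The idea behind an extractive, word-frequency-based summarizer can be sketched as follows: count how often each word occurs, score every sentence by the average frequency of its words, and keep the top-scoring sentences in their original order. This is a minimal sketch under those assumptions and is not the OpenOdia implementation; the function names, the punctuation set, and the length normalization are illustrative choices, and no stop-word removal is applied.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split after the Odia danda or common end punctuation."""
    parts = re.split(r"(?<=[।?!.])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Whitespace tokens with basic punctuation excluded."""
    return re.findall(r"[^\s।?!.,;:]+", sentence)


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    # Word frequencies over the whole document.
    freq = Counter(w for words in tokenized for w in words)

    def score(words):
        # Average word frequency, so long sentences are not favoured.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Calling `summarize(text, max_sentences=3)` on an Odia paragraph returns at most three of its sentences; a list of Odia stop words would likely improve the scores.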

Optical Character Recognition of Odia script¶

Automatic recognition of printed Oriya script

Last update: 2023-03-27