Possible projects¶
This list contains project proposals for the Odia language. It was compiled with students, research scholars, and hobbyists in mind: with a little knowledge of the domain, they can learn what they need and carry out these projects.
- Monolingual corpus
- Language detector
- Word tokenizer
- Word2vec preparation
- Word similarity
- Sentence tokenizer
- Random sentence generator
- Sentence similarity
- Stemming
- Synonym
- Part of Speech tagging
- Dependency parse tree creation
- Lemmatization
- Sentiment analysis
- Text classification
- Co-reference resolution
- Word(s) based sentence creation
- Random quotation generator
- Author-specific quotation generation
- Syntactic sentence correction
- Correct punctuation marks
- Autocomplete
- Spell corrector
- Sentence completion
- Automatic speech recognition (major project)
- Text Summarization
- Optical Character Recognition
Monolingual corpus¶
- Please check the resources page for existing corpus details.
Language detector¶
- Given a text string, detect its language. At a minimum, it should identify whether the text is written in Odia.
- An existing language detector can be found in the OpenOdia project.
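A minimal sketch of script-based detection is given here as a starting point. It is not the OpenOdia implementation. It assumes that text counts as Odia when most of its letters and combining marks fall in the Unicode "Oriya" block (U+0B00 to U+0B7F). The function names `odia_ratio` and `is_odia` and the 0.5 threshold are hypothetical choices made for illustration.

```python
import unicodedata

# Unicode "Oriya" block, which holds the Odia script.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F


def odia_ratio(text: str) -> float:
    """Return the share of letters and combining marks that are Odia."""
    # Keep letters (L*) and marks (M*); vowel signs are marks, not letters.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    odia = sum(1 for ch in relevant if ODIA_START <= ord(ch) <= ODIA_END)
    return odia / len(relevant)


def is_odia(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical rule: the text is Odia if the ratio reaches the threshold."""
    return odia_ratio(text) >= threshold


if __name__ == "__main__":
    print(is_odia("ଓଡ଼ିଆ ଭାଷା"))
    print(is_odia("Hello world"))
```

A script check like this only separates Odia from languages written in other scripts. A detector that must cover many languages would likely need a statistical model on top of it.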
Existing Algorithms¶
Part of speech tagging¶
Given a sentence, identify the part of speech of each word in it.
A part of speech can be a verb, noun, adjective, pronoun, preposition, etc.
Stemming¶
Initial rough corpus of suffixes
- ["ଲେ", "ଠୁ", "ର", "ରେ", "ଟି", "ଟେ", "ଟା",
- ୁଥିଲେ
- ["ଥିଲେ", "ଥିଲ", "ଥିଲୁ", "ଥିଲି", "ଉଛେ", "ଉଛ", "ଉଛୁ", "ଉଛି", "ଇଛେ", "ଇଛ", "ଇଛୁ", "ଇଛି", "ଅଛେ", "ଅଛ", "ଅଛୁ", "ଅଛି", "ସିଛେ", "ସିଛ", "ସିଛୁ", "ସିଛି", "ଅନ୍ତେ", "ଅନ୍ତ", "ଅନ୍ତୁ", "ଅନ୍ତି", "ଇଲେ", "ଇଲ", "ଇଲୁ", "ଇଲି", "ଇବେ", "ଇବ", "ଇବୁ", "ଇବି", "ଥିବେ", "ଥିବ", "ଥିବୁ", "ଥିବି", "ଟାକୁ", "ଟାକେ", "ଟିର", "ଟିରେ", "ଟିଏ", "ମାନେ", "ଗୁଡ଼ା"]
- ["ଗୁଡ଼ାଏ", "ଗୁଡ଼ାକ",]
Largest substring approach¶
- One option is longest-suffix removal: strip the longest matching suffix from each word, as shown in this code by Mohit for the Hindi language.
- In Odia, however, stripping a fixed set of suffixes can remove critical information from the sentence.
- For example, suffixes such as uthilu and uthibe indicate the tense of the sentence (past, present, or future).
- Similarly, there will be exceptions throughout the process, so a single generic set of suffixes cannot be used for stemming.
- Therefore, a better method needs to be found.
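The longest-suffix idea can be sketched as follows. This is an illustrative sketch, not the referenced Hindi code. The suffixes are a small subset of the rough list above, and the name `strip_longest_suffix` and the `min_stem_len` guard are assumptions added for the example.

```python
# A small subset of the rough suffix list, used only for illustration.
SUFFIXES = ["ଥିଲେ", "ଥିଲି", "ମାନେ", "ଟିରେ", "ଟିର", "ରେ", "ଟି", "ର"]


def strip_longest_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len code points."""
    # Try longer suffixes first so "ଟିରେ" wins over "ରେ".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No suffix matched (or the stem would be too short): return the word as is.
    return word


if __name__ == "__main__":
    for w in ["ବହିଟିରେ", "ଯାଇଥିଲେ", "ଘର"]:
        print(w, "->", strip_longest_suffix(w))
```

Stripping "ଥିଲେ" from "ଯାଇଥିଲେ" leaves "ଯାଇ" and discards the tense marker, which is the kind of information loss described above. The length guard keeps a short word such as "ଘର" from being reduced to a single character.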
Existing work¶
- A Suffix Stripping Algorithm for Odia Stemmer by Utkal University, with 88% accuracy.
- Design of Lightweight stemmer for Odia Derivational Suffixes by Government College of Engineering, Keonjhar, with 85% accuracy.
Odia text summarization¶
- An existing extractive word-frequency based text summarizer is implemented in the OpenOdia project.
- Extractive text summarization
- Automatic Text Summarization for Oriya Language
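For orientation, a word-frequency extractive summarizer can be sketched in a few lines. This is not the OpenOdia code. It assumes that sentences end with the danda "।" (or "?" / "!"), that words are separated by whitespace, and that a sentence is scored by the average corpus frequency of its words. The names `split_sentences` and `summarize` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on the danda (U+0964), '?' or '!' and drop empty pieces."""
    parts = re.split(r"[।?!]", text)
    return [p.strip() for p in parts if p.strip()]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    # Frequency of every whitespace-separated word across the whole text.
    freq = Counter(word for s in sentences for word in s.split())

    def score(sentence):
        words = sentence.split()
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    sample = "ଓଡ଼ିଆ ଭାଷା। ଓଡ଼ିଆ ଲିପି। ଗଛ ଫୁଲ।"
    print(summarize(sample, max_sentences=2))
```

A practical version would likely remove stop words and stem the words before counting, since frequent function words otherwise dominate the scores.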
Optical Character Recognition of Odia script¶
Last update:
2023-03-27