Tools and data for the Armenian language

See also: Requests for Projects, NLP Progress, and NLP Guide

Progress compared to other languages

Eastern Armenian is typically somewhere between the 50th and the 100th language to be supported by the top companies that launch products in many languages. As with many other languages, early technical challenges often centered on encoding non-ASCII characters, but Unicode has mostly solved that.
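As a small illustration of the encoding point, the sketch below checks whether text is written in the Armenian script and shows that it round-trips through UTF-8. It relies on the Unicode Armenian block (U+0530 to U+058F) plus the five Armenian ligatures in Alphabetic Presentation Forms (U+FB13 to U+FB17). The function names are hypothetical and chosen only for this example.

```python
def is_armenian_char(ch: str) -> bool:
    """Return True if ch falls in a Unicode range used for Armenian."""
    cp = ord(ch)
    # Main Armenian block, plus the Armenian ligatures U+FB13..U+FB17.
    return 0x0530 <= cp <= 0x058F or 0xFB13 <= cp <= 0xFB17


def armenian_ratio(text: str) -> float:
    """Share of alphabetic characters in text that are Armenian."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_armenian_char(ch) for ch in letters) / len(letters)


sample = "Հայերեն"  # "Armenian (language)"
encoded = sample.encode("utf-8")

# Armenian letters each take two bytes in UTF-8.
print(len(sample), len(encoded))
# Decoding gives back the original string unchanged.
print(encoded.decode("utf-8") == sample)
print(armenian_ratio(sample), armenian_ratio("Yerevan"))
```

A ratio like this can serve as a cheap first filter when collecting Armenian text from mixed-script sources, although it cannot tell Eastern from Western Armenian.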

TODO: compare to English, Russian, Hebrew, Georgian

Raw corpora

Unlabelled datasets.

Text

Wikipedia
TODO

Image

For optical character recognition
TODO

Audio

For speech recognition
TODO

Word embeddings

fastText publishes pre-trained embeddings for 157 languages, including Armenian, trained on Wikipedia and Common Crawl.
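Embeddings of this kind are commonly distributed in a plain-text `.vec` format: a header line with the vocabulary size and the dimension, followed by one line per word with its space-separated vector components. The sketch below reads that format without any third-party library and compares two words by cosine similarity. The helper names `load_vec` and `cosine` are hypothetical, and the three toy vectors are made up for illustration; they are not real fastText values.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read word vectors from a text .vec stream into a dict."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a real .vec file (values are invented).
toy = io.StringIO("3 2\nհայ 1.0 0.0\nլեզու 0.0 1.0\nհայերեն 1.0 1.0\n")
vectors = load_vec(toy)
print(cosine(vectors["հայ"], vectors["հայերեն"]))
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`, with `limit` set to keep only the first lines of the file and so bound memory use.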

A word analogy task is available, along with GloVe, fastText, SkipGram and CBOW embeddings pre-trained on Wikipedia, news articles and blog posts.
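One common way to score a word analogy question ("a is to b as c is to ?") is the vector-offset method, often called 3CosAdd: pick the word whose vector is closest by cosine similarity to b - a + c, excluding the three question words. Whether the task above is evaluated exactly this way is an assumption. The sketch below uses invented two-dimensional toy vectors, and `solve_analogy` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Invented toy vectors, not taken from any trained model.
toy_vectors = {
    "տղամարդ": [1.0, 0.0],  # man
    "կին": [1.0, 1.0],      # woman
    "թագավոր": [3.0, 0.0],  # king
    "թագուհի": [3.0, 1.0],  # queen
    "խնձոր": [0.0, 3.0],    # apple
}

answer = solve_analogy(toy_vectors, "տղամարդ", "կին", "թագավոր")
print(answer)
```

Accuracy on an analogy set is then the fraction of questions for which the returned word matches the expected one.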

Syntax parsing

https://github.com/UniversalDependencies/UD_Armenian-ArmTDP
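Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated fields, comment lines starting with `#`, and a blank line between sentences. The following is a minimal reader for that format, assuming no third-party parser is available. The sample sentence is hand-written for illustration and is not taken from the treebank, so its annotation should not be treated as authoritative. `parse_conllu` is a hypothetical helper name.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.rstrip("\n")
        if not line.strip():
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip malformed lines
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges and empty nodes
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written toy sentence ("I love Armenia"), not from the treebank.
rows = [
    ("1", "Ես", "ես", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "սիրում", "սիրել", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "եմ", "եմ", "AUX", "_", "_", "2", "aux", "_", "_"),
    ("4", "Հայաստանը", "Հայաստան", "PROPN", "_", "_", "2", "obj", "_", "_"),
]
sample = "# sent_id = toy-1\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

for token in parse_conllu(sample)[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

The `upos` field carries the part-of-speech tag, while `head` and `deprel` together encode the dependency tree, so the same reader covers both uses of the treebank.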

Part-of-speech tagging

https://github.com/UniversalDependencies/UD_Armenian-ArmTDP

Named-entity recognition

https://github.com/ispras-texterra/pioner

Speech recognition

There is a proprietary dataset at UCom. Google has a working system trained on search queries read aloud.

TODO: open dataset

Speech synthesis

TODO

Optical character recognition

TODO

Classification

Sentiment analysis: TODO

Spelling correction

TODO

Grammar correction

TODO

Translation

Parallel corpora: TODO

Unsupervised translation: TODO

Transliteration

https://github.com/YerevaNN/translit-rnn

https://github.com/deeplanguageclass/fairseq-transliteration/

Variant translation between Western and Eastern Armenian

TODO