Inspired by YC's Request for Startups, here is a list of potential problems for students, researchers, engineers, businesses and competitions.
They could be realised as research, datasets, libraries, applications or products.
Models like BERT or GPT-3 rely on legacy libraries for charset detection, language detection and sentence segmentation to turn raw data crawled from the web into rows of training data. These libraries are imperfect, especially for long-tail languages, and their quirks affect the accuracy of the downstream models. Ironically, they are often rules-based. Today it makes sense to try a deep learning approach to these foundational problems.
Deep-learning-based charset detection / encoding detection
The incumbent is Mozilla's chardet (and ports to Node and Python, like https://pypi.org/project/chardet/).
It is not great.
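For reference, a minimal sketch of how the incumbent is used today (the detected encoding and confidence vary by chardet version and input; ambiguous single-byte encodings are a classic failure mode):

```python
import chardet

# Bytes that are valid Windows-1252 but easily confused with other
# single-byte encodings; exactly the case where legacy detectors wobble.
raw = "naïve café".encode("windows-1252")

result = chardet.detect(raw)  # {'encoding': ..., 'confidence': ..., 'language': ...}
print(result)

# Fall back to UTF-8 with replacement characters if detection fails outright.
text = raw.decode(result["encoding"] or "utf-8", errors="replace")
print(text)
```

A learned replacement could keep this same bytes-in, encoding-label-out interface.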
Deep-learning-based language detection / language identification
The incumbent is Google Chrome's cld3 (and ports to Node and Python).
Facebook did this in 2017 with fastText (an approach that had already been proposed in our community), but it is still not great.
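A minimal sketch of the fastText baseline, assuming the pre-trained lid.176.bin model has been downloaded from https://fasttext.cc/docs/en/language-identification.html:

```python
import fasttext

model = fasttext.load_model("lid.176.bin")

# Top-3 guesses for a short Armenian sentence ("Hello world");
# short inputs and long-tail languages are where identification tends to slip.
labels, probs = model.predict("Բարեւ աշխարհ", k=3)
for label, prob in zip(labels, probs):
    print(label.replace("__label__", ""), round(float(prob), 3))
```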
Deep-learning-based sentence segmentation / sentence splitting
This is key for the self-supervised pre-training of models like BERT.
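To see why this is harder than it looks, here is NLTK's Punkt (an unsupervised statistical segmenter, a typical incumbent) on input with abbreviations; depending on your NLTK version the resource may be named "punkt_tab" instead:

```python
import nltk

nltk.download("punkt", quiet=True)  # newer NLTK releases use "punkt_tab"

text = ("Dr. Smith went to Washington. He arrived at 9 a.m. and left at noon. "
        "Abbreviations like 'Dr.' and 'a.m.' are classic boundary errors.")

for sentence in nltk.sent_tokenize(text):
    print(repr(sentence))
```

A deep-learning segmenter would need to beat this across many languages, not just English.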
One model that can take text, image or audio and output text, image or audio
Output multiple translations
Natural language and programming language
In 2021, products like GitHub Copilot and Replit launched code generation features. This is still a largely unsolved set of problems that are difficult even to evaluate (one published evaluation approach is sketched below), but for which there is plenty of training data:
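On the evaluation point: one published approach is functional correctness, where n samples are drawn per problem, run against unit tests, and summarised with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021). A minimal sketch of that estimator:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem
    c: samples that passed the unit tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(n=200, c=7, k=10))  # expected fraction of size-10 draws containing a pass
```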
See also:
Note:
We encourage:
We discourage:
See: Tools and data for the Armenian language