Client
Language corpora are becoming increasingly relevant across human language technology. World-leading universities, research groups, and NLP and speech technology companies use corpora of vetted linguistic data for purposes ranging from studying the dynamics of language development to training NLP models.
Oxford University Press (OUP) is known as a creator of reliable human language technology tools and services. For years, OUP has been developing and using its New Monitoring Corpus (NMC) building software, which harvests RSS feeds and collects data. To date, NMC has stored more than 26 million documents with around 9 million tokens.
However, the latest technologies make it possible to improve on these numbers and extend the capabilities of corpora. With this in mind, OUP decided to create a system that could collect more data without running into performance limits.
The main goal was to build a platform that scales in both performance and functionality. To address this challenge, OUP chose cloud technology, which is not a standard approach to building a corpus. As a result, the Super Corpus Platform proved to be a powerful system that exceeded all expectations.