Vitalii Radchenko
22 Sep 2019
Big Data and Data Analytics

Best practices for NLP pipelines and reproducible research

Many companies now solve a range of NLP problems (classification, chatbots, clustering, question answering, etc.), and as experience has accumulated, effective pipeline designs have emerged.
First, I will focus on established NLP best practices (AllenNLP) and our own experience. I will explain how to structure a pipeline and what each component should do: how to format incoming data correctly, how to iterate over datasets, what the vocabulary should look like, how to prepare data, etc.
Next, I will present our solution and best practices for reproducible research (DVC): how it works and what its advantages and disadvantages are.
To close, I will show how all of this works in practice, using one of our in-house tasks as an example.