This page describes the multitask English model designed to solve several classification tasks simultaneously. The model is used in the DeepPavlov Dream dialogue system.
Specifically, the model addresses the following tasks: sentiment classification, emotion classification, toxicity classification, factoid classification, MIDAS intent classification, DeepPavlov topic classification, CoBot topic classification, CoBot dialogact topic classification, and CoBot dialogact intent classification.
Using a single backbone model for all of these tasks (instead of many separate models) saves computational resources.
Here you can find the full list of classes for all these tasks.
The model uses plain linear layers on top of the distilbert-base-cased backbone. This architecture is explained in the DeepPavlov manual (supported since v1.1.1).
The workings of this model are also examined in this paper.
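Conceptually, the architecture is just a set of per-task linear heads over one shared encoder output. Below is a minimal NumPy sketch, not the actual DeepPavlov implementation: the encoder is stubbed with a random vector, the class counts come from the dataset table below, and the factoid head is omitted because its class count is not listed there.

```python
import numpy as np

# Per-task output sizes (class counts from the dataset table on this page).
TASK_CLASSES = {
    "sentiment": 3, "emotions": 7, "midas": 15, "topics": 25,
    "toxicity": 8, "cobot_topics": 22, "cobot_da_topics": 11, "cobot_da_intents": 11,
}
HIDDEN = 768  # distilbert hidden size

rng = np.random.default_rng(0)
# One plain linear head (W, b) per task on top of the shared [CLS] representation.
heads = {t: (rng.normal(size=(HIDDEN, n)), np.zeros(n)) for t, n in TASK_CLASSES.items()}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_all(cls_vector):
    """Run every task head on the same shared encoder output."""
    return {t: softmax(cls_vector @ W + b) for t, (W, b) in heads.items()}

# Stub for the distilbert [CLS] vector; in the real model this comes from the backbone.
probs = predict_all(rng.normal(size=HIDDEN))
```

Because the expensive backbone forward pass is computed once and the heads are tiny, adding a task costs almost nothing at inference time.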
The model can be accessed either via an annotator (http://0.0.0.0:8087/model) or a post-annotator.
Examples of external calls to this model can be found here.
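For illustration, an external call to the annotator endpoint might look like the sketch below. The payload key `sentences` and the response format are assumptions here; the linked examples show the real schema.

```python
import json
import urllib.request

ANNOTATOR_URL = "http://0.0.0.0:8087/model"  # annotator endpoint from this page

def build_request(sentences):
    """Build the JSON payload; the 'sentences' field name is an assumption."""
    return json.dumps({"sentences": sentences}).encode("utf-8")

def call_annotator(sentences):
    """POST utterances to the annotator and return the parsed JSON response."""
    req = urllib.request.Request(
        ANNOTATOR_URL,
        data=build_request(sentences),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

# call_annotator(["i love you", "this is awful"])  # requires a running Dream deployment
```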
The new multitask “9-in-1” model was trained on the following single-label datasets:
| Task | Class number | Dataset | Training samples | Notes |
| --- | --- | --- | --- | --- |
| Sentiment classification | 3 | DynaSent (r1+r2) | 94k | In previous multitask models SST (8k samples) was used, which led to overfitting. |
| Emotion classification | 7 | go_emotions | 42k | Emotions in this dataset were Ekman-grouped; only single-label samples were used. |
| MIDAS intent classification | 15 | MIDAS | ≈9k | The head is single-label; only semantic classes were used. |
| DeepPavlov Topic classification | 25 | DeepPavlov Topics | 1.8m | Class names for this classifier and for the CoBot replacement classifiers differ, so special mapping functions were added for every such topic. (Example) |
| Toxicity classification | 8 | Kaggle dataset | 170k | Converted to single-label. |
| CoBot topics | 22 | private DREAM-2 dataset | 216k | The most frequent miscellaneous class was removed. |
| CoBot dialogact topics | 11 | private DREAM-2 dataset | 127k | The most frequent miscellaneous class was removed. |
| CoBot dialogact intents | 11 | private DREAM-2 dataset | 316k | Same procedure as for CoBot dialogact topics. |
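The Ekman grouping mentioned for go_emotions collapses the fine-grained labels into 6 basic emotions plus neutral, which gives the 7 classes above. The sketch below shows such a mapping in abridged form (the label lists for joy and sadness are shortened here; see the Ekman mapping published with the go_emotions dataset for the full table):

```python
# Abridged Ekman grouping for go_emotions fine-grained labels.
EKMAN_GROUPS = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["joy", "amusement", "gratitude", "love", "optimism"],  # abridged
    "sadness": ["sadness", "disappointment", "grief", "remorse"],  # abridged
    "surprise": ["surprise", "realization", "confusion", "curiosity"],
    "neutral": ["neutral"],
}

# Invert to a fine-label -> Ekman-class lookup used when relabeling the dataset.
TO_EKMAN = {fine: coarse for coarse, fines in EKMAN_GROUPS.items() for fine in fines}

assert len(EKMAN_GROUPS) == 7  # 7 classes, as in the table above
```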
NB: Various attempts were made to improve the architecture, such as using task-specific tokens concatenated with the CLS token or in place of it; these proved unsuccessful.
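The task-token variant can be sketched as follows (an illustrative NumPy mock-up, not the experimental code; recall that this variant proved unsuccessful):

```python
import numpy as np

HIDDEN = 768   # distilbert hidden size
N_TASKS = 9
rng = np.random.default_rng(1)

# One learned embedding per task (the "task-specific token").
task_embeddings = rng.normal(size=(N_TASKS, HIDDEN))

def head_input(cls_vector, task_id, mode="concat"):
    """Variant inputs to a task head: task token concatenated with CLS, or replacing it."""
    if mode == "concat":                 # [CLS; task] -> 2*HIDDEN features
        return np.concatenate([cls_vector, task_embeddings[task_id]])
    return task_embeddings[task_id]      # mode "replace": task token instead of CLS

x = head_input(rng.normal(size=HIDDEN), task_id=0)
```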
However, increasing the batch size to 640 (for the distilled model) or 320 (for the ordinary model) yielded improvements.
We measured the metrics of the Multitask model for both the bert-base-uncased and distilbert-base-uncased backbones. For the latter, we explored training MIDAS with and without dialogue history, and settled on MIDAS without history to further cut computational time.
Training the model without history from scratch barely impacted overall performance and yielded some improvement for MIDAS (Setting 3 vs. the with-history setting). It also allowed us to halve the maximum sequence length and to cache only a single model prediction, which decreased the prediction time from 0.73 sec to 0.55 sec.
| Task/model | Dataset modification | Train size | Setting 1: Singletask, distilbert-base-uncased | Setting 2: Multitask, distilbert-base-uncased | Setting 3: Multitask, distilbert-base-uncased | Setting 4: Singletask, bert-base-uncased | Setting 5: Multitask, bert-base-uncased |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Use history in MIDAS training data | | | yes | yes | no | yes | yes |
| Emotion classification (go_emotions) | converted to multi-class | 39.5k | 70.47/70.30 | 68.18/67.86 | 67.59/67.32 | 71.48/71.16 | 67.27/67.23 |
| Toxic classification (Kaggle) | converted to single label | 1.62m | 94.53/93.64 | 93.84/93.5 | 93.86/93.41 | 94.54/93.15 | 93.94/93.4 |
| Sentiment classification (DynaSent, r1+r2) | no | 94k | 74.75/74.63 | 72.55/72.21 | 72.22/71.9 | 75.95/75.88 | 75.65/75.62 |
| Factoid classification (Yahoo) | no | 3.6k | 81.69/81.66 | 81.02/81.07 | 80.0/79.86 | 84.41/84.44 | 80.34/80.09 |
| MIDAS classification | only semantic classes | 7.1k | 80.53/79.81 (with history) | 72.73/71.56 (with history); 62.26/60.68 (without history) | 73.69/73.26 (without history) | 82.3/82.03 (with history) | 77.01/76.38 (with history) |
| DeepPavlov Topics classification | no | 1.8m | 87.48/87.43 | 86.98/86.9 | 87.01/87.05 | 88.09/88.1 | 87.43/87.47 |
| CoBot topics classification | converted to single label, no history, removed 1 widespread misc class | | | | | | |
| CoBot dialogact topics classification | converted to single label, no history, removed 1 widespread misc class | | | | | | |
| CoBot dialogact intents classification | converted to single label, no history | 318k | 77.07/77.7 | 76.83/76.76 | 76.65/76.57 | 77.28/77.72 | 76.96/76.89 |
| GPU memory used, MB | | | 2418×9 = 21762 | 2420 | 2420 | 3499×9 = 31491 | 3501 |
| Test inference time, sec (for the tests) | | | | 0.76 | 0.55 | | ≈1.33 |
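The GPU-memory row reduces to simple arithmetic: nine Singletask models each load their own backbone, while a Multitask model loads it once. A quick check with the numbers from the table (in MB):

```python
N_TASKS = 9

# Per-model GPU memory from the table, in MB.
singletask_distilbert = 2418
multitask_distilbert = 2420
singletask_bert = 3499
multitask_bert = 3501

# Nine separate singletask models each load a full backbone...
assert singletask_distilbert * N_TASKS == 21762
assert singletask_bert * N_TASKS == 31491

# ...while the multitask model pays for the backbone (and its activations) once.
saving_distilbert = singletask_distilbert * N_TASKS - multitask_distilbert
print(f"distilbert saving: {saving_distilbert} MB")  # 19342 MB
```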
To achieve the best trade-off between memory use, inference time, and test metrics, the model from Setting 3 (Multitask, distilbert-base-uncased, batch size 640) is used in DeepPavlov Dream.
If all classification models were treated as Singletask, the 6 separate classifiers (emotion, toxic, sentiment, and 3 CoBot models) would take approximately 14.5 GB of GPU memory (6 × 2418 MB, per the GPU-memory row of the table above).
NB: We don’t consider a topic classifier model, as it is unclear what kind of Singletask topic classifier we would use. For example, if we used a distilbert-like topic classifier, it would take about 2418 MB of GPU memory (per the table above). In this case, replacing all Singletask models would require approximately 16.9 GB of GPU memory (7 × 2418 MB).
Compared to this setting, our Multitask approach (2420 MB) saves approximately 14.5 GB of GPU memory.
This memory usage decrease is achieved by sharing a single transformer backbone across all tasks, with only lightweight task-specific linear heads on top.
The same reasoning applies on CPU: replacing all Singletask models with the Multitask model yields a comparable saving in CPU memory.
Compared to the previous dev version (where a multitask 6-in-1 bert-base model was already used), our Multitask approach saves both GPU memory and inference time.
The inference time decrease happens because the current Multitask model is much faster than the separate MIDAS model (thanks to its transformer-agnostic implementation), and we no longer need to run both.