This page describes the multitask English model designed to solve several classification tasks simultaneously. The model is used in DeepPavlov Dream.
Specifically, the model is used to address the nine tasks listed in the table below.
Using a single backbone model for all tasks (instead of many separate ones) saves computational power.
Here you can find the full list of classes for all these tasks.
The model uses plain linear layers on top of the `distilbert-base-uncased` backbone. This architecture is explained in the DeepPavlov manual (supported since v1.1.1).
The workings of this model are also examined in this paper.
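For illustration, here is a minimal sketch of this architecture in PyTorch, assuming a `distilbert-base-uncased` backbone and showing only a subset of the nine heads. This is not the actual DeepPavlov implementation; the task names and class counts are taken from the dataset table further below.

```python
# A minimal sketch (not the actual DeepPavlov code): one shared DistilBERT
# backbone with a plain linear classification head per task.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Illustrative subset of the nine tasks; class counts follow the dataset table below.
TASK_CLASSES = {"sentiment": 3, "factoid": 2, "emotion": 7}

class MultitaskClassifier(nn.Module):
    def __init__(self, backbone_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # One linear head per task on top of the shared encoder.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n_classes) for task, n_classes in TASK_CLASSES.items()}
        )

    def forward(self, input_ids, attention_mask, task):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = out.last_hidden_state[:, 0]  # first-token ([CLS]) representation
        return self.heads[task](cls_repr)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = MultitaskClassifier()
batch = tokenizer(["What a great movie!"], return_tensors="pt")
with torch.no_grad():
    logits = model(batch["input_ids"], batch["attention_mask"], task="sentiment")
print(logits.shape)  # torch.Size([1, 3])
```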
DeepPavlov Dream uses this model as a part of the `combined_classification` annotator on port 8087.
The model can be accessed either via the annotator endpoint (`http://0.0.0.0:8087/model`) or via the post-annotator endpoint (`http://0.0.0.0:8087/batch_model`).
Examples of external calls to this model can be found here.
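As a rough illustration, a call to the annotator could look like the snippet below. The endpoint URL comes from this page, but the payload key (`sentences`) and the response shape are assumptions; check the external-call examples above for the exact request format.

```python
# Hedged sketch of an HTTP call to the combined_classification annotator.
# The URL is taken from this page; the "sentences" payload key is an assumption.
import requests

ANNOTATOR_URL = "http://0.0.0.0:8087/model"

response = requests.post(
    ANNOTATOR_URL,
    json={"sentences": ["I really love this film, it is amazing!"]},
    timeout=10,
)
response.raise_for_status()
# Expected to return per-task class probabilities for each input sentence.
print(response.json())
```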
The new multitask “9-in-1” model was trained on the following single-label datasets:
Task | Class number | Dataset | Training samples | Notes |
---|---|---|---|---|
Sentiment classification | 3 | DynaSent (r1+r2) | 94k | Previous Multitask models used SST (8k samples), which led to overfitting. |
Factoid classification | 2 | YAHOO | 3.6k | |
Emotion classification | 7 | go_emotions | 42k | Emotions in this dataset were grouped into the Ekman categories, and only single-label samples were used. |
MIDAS intent classification | 15 | MIDAS | ≈9k | The head is single-label; only semantic classes were used. |
DeepPavlov Topic classification | 25 | DeepPavlov Topics | 1.8m | Class names for this classifier and for the CoBot replacement classifiers differ, so special mapping functions were added for every such topic to support this difference. (Example) |
Toxicity classification | 8 | Kaggle dataset | 170k | To make the classifier single-label, a `non_toxic` class was added and set equal to (1 - sum of the toxic probabilities). All probabilities were then normalized, and the class with the maximal probability was selected (see the sketch after this table). |
CoBot topics | 22 | private DREAM-2 dataset | 216k | The most frequent miscellaneous class (`Phatic`) was excluded, and all multi-label examples were converted to the single-label format. These measures made the model less likely to overfit on the misc classes, improving its quality on real-world data. |
CoBot dialogact topics | 11 | private DREAM-2 dataset | 127k | The most frequent misc class (`Other`) was excluded, all multi-label examples were converted to the single-label format, and history support was removed. These measures made the model less likely to overfit, improving its quality on real-world data. |
CoBot dialogact intents | 11 | private DREAM-2 dataset | 316k | The same procedures were applied as for CoBot dialogact topics. |
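The toxicity relabeling described in the table could be sketched as follows. This is my reading of the procedure, and the toxic class names here are illustrative rather than the exact Kaggle label set.

```python
# Sketch of converting multi-label toxicity probabilities to a single label:
# add a non_toxic "probability" equal to 1 minus the sum of the toxic
# probabilities, renormalize, and take the argmax.
import numpy as np

# Illustrative toxic class names (not necessarily the exact Kaggle labels).
TOXIC_CLASSES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate", "sexual_explicit"]

def to_single_label(toxic_probs: np.ndarray) -> str:
    """toxic_probs: per-class toxicity probabilities for one sample."""
    non_toxic = max(0.0, 1.0 - toxic_probs.sum())  # clipping at 0 is a guard added here
    probs = np.append(toxic_probs, non_toxic)
    probs = probs / probs.sum()  # renormalize so the probabilities sum to 1
    labels = TOXIC_CLASSES + ["non_toxic"]
    return labels[int(np.argmax(probs))]

print(to_single_label(np.array([0.1, 0.0, 0.05, 0.0, 0.02, 0.0, 0.0])))  # -> non_toxic
```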
NB: There were various attempts at improving the architecture, such as using task-specific tokens concatenated with the `CLS` token or instead of it; these proved unsuccessful.
However, increasing the batch size from `32` to `640` (for the distilled model) or `320` (for the ordinary model) yielded improvements.
We measured the metrics of the Multitask model for the `bert-base-uncased` and `distilbert-base-uncased` backbones. For the latter, we explored using MIDAS with and without history, and settled on MIDAS without history to additionally cut down on computation time.
Training the model without history from scratch had almost no impact on overall performance and yielded some improvement for MIDAS (Setting 3 vs Setting 2).
It also allowed us to halve the maximum sequence length and cache only one model prediction, which decreased the prediction time from 0.73 sec to 0.55 sec.
Task / model | Dataset modification | Train size | Setting 1: Singletask, distilbert-base-uncased, batch 640 | Setting 2: Multitask, distilbert-base-uncased, batch 640 | Setting 3: Multitask, distilbert-base-uncased, batch 640, all tasks trained without history | Setting 4: Singletask, bert-base-uncased, batch 320 | Setting 5: Multitask, bert-base-uncased, batch 320 |
---|---|---|---|---|---|---|---|
Use history in MIDAS training data | | | yes | yes | no | yes | yes |
Emotion classification (go_emotions) | converted to multi-class | 39.5k | 70.47/70.30 | 68.18/67.86 | 67.59/67.32 | 71.48/71.16 | 67.27/67.23 |
Toxic classification (Kaggle) | converted to single-label | 1.62m | 94.53/93.64 | 93.84/93.5 | 93.86/93.41 | 94.54/93.15 | 93.94/93.4 |
Sentiment classification (DynaSent, r1+r2) | no | 94k | 74.75/74.63 | 72.55/72.21 | 72.22/71.9 | 75.95/75.88 | 75.65/75.62 |
Factoid classification (Yahoo) | no | 3.6k | 81.69/81.66 | 81.02/81.07 | 80.0/79.86 | 84.41/84.44 | 80.34/80.09 |
MIDAS classification | only semantic classes | 7.1k | 80.53/79.81 (with history) | 72.73/71.56 (with history), 62.26/60.68 (without history) | 73.69/73.26 (without history) | 82.3/82.03 (with history) | 77.01/76.38 (with history) |
DeepPavlov Topics classification | no | 1.8m | 87.48/87.43 | 86.98/86.9 | 87.01/87.05 | 88.09/88.1 | 87.43/87.47 |
CoBot topics classification | converted to single-label, no history, removed 1 widespread misc class `Phatic` | 216k | 79.88/79.9 | 77.31/77.36 | 77.45/77.35 | 80.68/80.67 | 78.21/78.22 |
CoBot dialogact topics classification | converted to single-label, no history, removed 1 widespread misc class `Other` | 127k | 76.81/76.71 | 76.92/76.79 | 76.8/76.7 | 77.02/76.97 | 76.86/76.74 |
CoBot dialogact intents classification | converted to single-label, no history | 318k | 77.07/77.7 | 76.83/76.76 | 76.65/76.57 | 77.28/77.72 | 76.96/76.89 |
Total (9-in-1) | | 4218k | 80.36/80.20 | 78.48/78.22 | 78.36/78.15 | 81.31/81.12 | 79.3/79.11 |
GPU memory used, MB | | | 2418*9=21762 | 2420 | 2420 | 3499*9=31491 | 3501 |
Test inference time, sec (for the tests) | | | | 0.76 | 0.55 | | ≈1.33 |
To achieve the best trade-off between memory use, inference time, and test metrics, the model from Setting 3 (Multitask, `distilbert-base-uncased`, batch `640`) is used in DeepPavlov Dream.
If all classification models were treated as Singletask ones, the 6 separate classifiers (the emotion, toxic, and sentiment models plus the 3 CoBot models) would take approximately of GPU memory.
NB: We don't count the topic classifier model here, as it is unclear what kind of Singletask topic classifier we would use. For example, if we used a distilbert-like topic classifier, it would take of GPU memory; in this case, replacing all Singletask models would require of GPU memory.
Compared to this setting, our Multitask approach saves of GPU memory.
This memory usage decrease is achieved by:
- replacing `bert-base-uncased` with `distilbert-base-uncased`;
- serving all tasks with a single shared backbone instead of separate per-task models.
This Multitask model uses on CPU. Following the same approach, the estimated economy is as follows: compared to this setting, our Multitask approach saves of CPU memory.
Compared to the previous dev version (where the multitask 6-in-1 `bert-base` model is already used), our multitask approach saves both memory and inference time.
The inference time decrease happens because the current multitask model (thanks to its transformer-agnostic implementation) is much faster than the separate MIDAS model, and we no longer need to run both of them.