OdeBERT: One-stage Deep-supervised Early-exiting BERT for Fast Inference in User Intent Classification

Abstract
User intent classification is a vital task for identifying a user's essential requirements from the input query in information retrieval systems, question answering systems, and dialogue systems. The pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) has been widely applied to user intent classification. However, BERT is compute-intensive and time-consuming during inference, which often causes latency in real-time applications. To improve the inference efficiency of BERT for user intent classification, this article proposes a new network, one-stage deep-supervised early-exiting BERT (OdeBERT). In addition, a deep supervision strategy is developed that equips the network with internal classifiers and trains them jointly with the backbone in a single stage, improving the classifiers' learning by extracting discriminative category features. Experiments are conducted on publicly available datasets, including ECDT, SNIPS, and FDQuestion. The results show that OdeBERT speeds up inference of the original BERT by up to 12 times with the same performance, outperforming state-of-the-art baseline methods.
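The abstract outlines the core mechanism: internal classifiers attached to the intermediate BERT layers, trained jointly with the backbone under deep supervision in one stage, and used at inference to exit early once a prediction is confident enough. The sketch below illustrates that idea in PyTorch with HuggingFace's `transformers`; the linear exit heads, the plain summed cross-entropy loss, and the entropy-based exit criterion are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel


class EarlyExitBert(nn.Module):
    """BERT backbone with one internal classifier ("exit") after every encoder layer."""

    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        num_layers = self.bert.config.num_hidden_layers
        self.exits = nn.ModuleList(nn.Linear(hidden, num_labels) for _ in range(num_layers))

    def forward(self, input_ids, attention_mask=None):
        # Training-time forward: run all layers and return the logits of every exit.
        out = self.bert(input_ids, attention_mask=attention_mask,
                        output_hidden_states=True)
        # hidden_states[0] is the embedding output; encoder layers start at index 1.
        cls_states = [h[:, 0] for h in out.hidden_states[1:]]
        return [head(h) for head, h in zip(self.exits, cls_states)]


def one_stage_deep_supervised_loss(all_exit_logits, labels):
    # One-stage joint training: every exit (and the shared backbone) is supervised
    # by the gold label at the same time; a plain summed cross entropy is assumed here.
    return sum(F.cross_entropy(logits, labels) for logits in all_exit_logits)


@torch.no_grad()
def predict_with_early_exit(model, input_ids, attention_mask=None, entropy_threshold=0.1):
    # Inference-time early exiting: evaluate each exit as soon as its layer is
    # computed and stop once the prediction entropy drops below the threshold.
    # Assumes one example per batch, as is typical for latency-sensitive inference.
    model.eval()
    if attention_mask is None:
        attention_mask = torch.ones_like(input_ids)
    ext_mask = model.bert.get_extended_attention_mask(attention_mask, input_ids.shape)
    hidden = model.bert.embeddings(input_ids)
    for layer, head in zip(model.bert.encoder.layer, model.exits):
        hidden = layer(hidden, attention_mask=ext_mask)[0]
        probs = F.softmax(head(hidden[:, 0]), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).item()
        if entropy < entropy_threshold:
            break
    return probs.argmax(dim=-1)
```

In this sketch, training backpropagates the summed exit losses through all internal classifiers and the shared backbone in a single pass, while at inference a higher `entropy_threshold` exits earlier (faster but riskier) and a threshold of 0 effectively disables early exiting.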