BERT Model Compression Based on Knowledge Distillation

October 14, 2019

It is as if a teacher taught a student who memorizes only the final answer and learns nothing about the reasoning in between; faced with a new problem, such a student model is more likely to make mistakes. Based on this intuition, the paper proposes a loss function that pushes the student model's hidden-layer representations toward those of the teacher model, giving the student better generalization. The authors call this approach Patient Knowledge Distillation (PKD).
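
As a rough illustration of this idea, the following PyTorch-style sketch (not the authors' released code; the function name and tensor shapes are assumptions) penalizes the distance between L2-normalized [CLS] hidden states of matched student and teacher layers:

```python
import torch.nn.functional as F


def patient_distillation_loss(student_cls, teacher_cls):
    """Squared distance between L2-normalized [CLS] hidden states.

    student_cls, teacher_cls: tensors of shape
    (batch, matched_layers, hidden_dim) holding the [CLS] vectors taken
    from the matched student and teacher layers.
    """
    s = F.normalize(student_cls, p=2, dim=-1)  # unit-length student vectors
    t = F.normalize(teacher_cls, p=2, dim=-1)  # unit-length teacher vectors
    return ((s - t) ** 2).sum(dim=-1).mean()   # average over batch and layers
```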

Zhe Gan: is a senior researcher at Microsoft, primarily working on generative models, visual QA/dialog, machine reading comprehension (MRC), and natural language generation (NLG). He also has broad interests in various machine learning and NLP topics. Zhe received his PhD degree from Duke University in Spring 2018. Before that, he received his Master's and Bachelor's degrees from Peking University in 2013 and 2010, respectively.

Yu Cheng: is a senior researcher at Microsoft. His research is about deep learning in general, with specific interests in model compression, deep generative models, and adversarial learning. He is also interested in solving real-world problems in computer vision and natural language processing. Yu received his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. Before joining Microsoft, he spent three years as a Research Staff Member at IBM Research/MIT-IBM Watson AI Lab.

Over the past year, language-model research has seen many breakthroughs. GPT, for example, can generate sentences realistic enough to pass for human writing [1], while BERT, XLNet, and RoBERTa [2,3,4] have swept the major NLP leaderboards as feature extractors. These models come with a staggering number of parameters, however: BERT-base has 109 million and BERT-large as many as 330 million, which makes them slow to run. To reduce inference time, this paper is among the first to propose a new knowledge distillation [5] method for compressing the model, saving runtime and memory without losing much accuracy. The paper was published at EMNLP 2019.


[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

This article was first published on the WeChat public account 大數據文摘 (Big Data Digest); the views expressed are the author's own.

Jingjing (JJ) Liu: is a Principal Research Manager at Microsoft, leading a research team in NLP and Computer Vision. Her current research interests include Machine Reading Comprehension, Commonsense Reasoning, Visual QA/Dialog, and Text-to-Image Generation. She received her PhD degree in Computer Science from MIT EECS in 2011. She also holds an MBA degree from Judge Business School at the University of Cambridge. Before joining MSR, Dr. Liu was the Director of Product at Mobvoi Inc. and a Research Scientist at MIT CSAIL.

[Table 2]

[1] Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI Blog 1.8 (2019).

Specifically, for sentence-classification tasks, vanilla knowledge distillation usually loses considerable accuracy when used to compress the model. The reason is that the student model learns only the probability distribution finally predicted by the teacher model and completely ignores the representations in the intermediate hidden layers.
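
For reference, that vanilla distillation objective in the sense of Hinton et al. [5] matches only the teacher's output distribution; here is a minimal sketch, where the function name, tensor names, and temperature value are assumptions:

```python
import torch.nn.functional as F


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Vanilla KD: the student matches only the teacher's output distribution."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)          # teacher's soft labels
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Cross-entropy against the soft labels; no hidden-layer information is used.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```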

[Table 1]

In terms of speed, the 6-layer transformer model makes inference nearly twice as fast while shrinking the total parameter count by a factor of 1.64, and the 3-layer model is 3.73 times faster with 2.4 times fewer parameters. Detailed results are shown in Table 2.

Because, for sentence classification, the model's predictions are built on top of the feature representation of the [CLS] token (for instance, by adding two fully connected layers on top of it), the researchers propose a new loss function that also lets the student model learn the teacher's [CLS] representations:
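
A sketch of this patient-distillation loss in approximate notation, where h denotes a [CLS] hidden-state vector, superscripts s and t mark student and teacher, and the sum runs over the matched student-teacher layer pairs (i, j):

\mathcal{L}_{\mathrm{PT}} = \sum_{(i,j)} \left\| \frac{h_i^{s}}{\lVert h_i^{s}\rVert_2} - \frac{h_j^{t}}{\lVert h_j^{t}\rVert_2} \right\|_2^2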

[3] Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding." arXiv preprint arXiv:1906.08237 (2019).

[4] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).

[5] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).


Here M is the number of layers in the student model (e.g., 3 or 6), N is the number of layers in the teacher model (e.g., 12 or 24), h is the hidden-layer representation of the [CLS] token, and i, j index the correspondence between student and teacher hidden layers, illustrated in the figure below. For example, a 6-layer student learning from a 12-layer teacher can learn the representations of the teacher's hidden layers (2, 4, 6, 8, 10) (PKD-skip, left) or of the teacher's last few layers (7, 8, 9, 10, 11) (PKD-last, right). Because the student learns the teacher's predicted probabilities directly at its final layer, the teacher's last hidden layer is skipped.
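
To make the two mapping strategies concrete, here is a small hypothetical helper (not taken from the released repository) that returns the teacher layers matched to a student with M layers under each scheme:

```python
def matched_teacher_layers(M, N, strategy="skip"):
    """Teacher hidden layers matched to the M-1 intermediate student layers.

    The teacher's last hidden layer is left out because the student already
    learns the teacher's output probabilities at that point.
    """
    if strategy == "skip":      # PKD-skip: every (N // M)-th teacher layer
        step = N // M
        return [step * k for k in range(1, M)]
    if strategy == "last":      # PKD-last: the teacher's last M-1 hidden layers
        return list(range(N - M + 1, N))
    raise ValueError("strategy must be 'skip' or 'last'")


# For a 6-layer student and a 12-layer teacher, as in the article:
# matched_teacher_layers(6, 12, "skip") -> [2, 4, 6, 8, 10]
# matched_teacher_layers(6, 12, "last") -> [7, 8, 9, 10, 11]
```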

Siqi Sun: is a Research SDE at Microsoft. He is currently working on commonsense reasoning and knowledge-graph-related projects. Prior to joining Microsoft, he was a PhD student in computer science at TTI Chicago, and before that an undergraduate student in the School of Mathematics at Fudan University.


The code has been open-sourced at: https://github.com/intersun/PKD-for-BERT-Model-Compression

The researchers compared the proposed model against fine-tuning and vanilla knowledge distillation on seven standard sentence-classification datasets. When distilling a 12-layer teacher into a 6-layer or 3-layer student, PKD outperformed both baselines in the vast majority of cases, and on five datasets its accuracy came close to the teacher's: SST-2 (-2.3% accuracy relative to the teacher), QQP (-0.1%), MNLI-m (-2.2%), MNLI-mm (-1.8%), and QNLI (-1.4%). Detailed results are given in Table 1. This further supports the researchers' hypothesis that a student model that learns the hidden-layer representations generalizes better than one that learns only the teacher's predicted probabilities.

