
Language Models are Unsupervised Multitask Learners: Paper Notes

This post records the points I found important after a full read of the GPT-2 paper. Most of it is excerpted directly from the paper; the remaining notes capture my own thoughts while reading, as well as the parts I did not understand. Corrections and pointers are very welcome!

Before diving into the GPT-2 paper, Zhang Junlin's conversational overview 《效果驚人的GPT 2.0模型:它告訴了我們什麼》 (https://zhuanlan.zhihu.com/p/56865533) makes an excellent primer. It is really well written!

Abstract

The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks.

Introduction

  • Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.
  • Multi-task learning
    • The two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively.
    • From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives.
    • Cons: Multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques.

Approach

  • Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution P(output | input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model P(output | input, task).
  • Task conditioning is often implemented at the architecture level, such as with task-specific encoder-decoder networks. However, language provides a flexible way to specify tasks, inputs and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to French, english text, french text). (My note: since training the model is precisely about making it learn language, it should naturally also be able to understand task instructions expressed in language; see the prompt sketch after this list.)
  • Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. The problem instead becomes whether we are able to optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup, but learning is much slower than in explicitly supervised approaches.
  • Training dataset
    • Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible
    • Web scrapes: while these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues (their content is mostly unintelligible). We also do not want to select a subsample of datasets that are most similar to target datasets for some tasks, since we want to avoid making assumptions about the tasks to be performed ahead of time.
    • Instead, we only scraped web pages which have been curated/filtered by humans. As a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
  • Input representation
    • Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences.
    • Our input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.
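A toy sketch of the byte-level BPE idea above. This is my own minimal illustration, not OpenAI's tokenizer (which additionally prevents merges across character categories and ends up with a fixed 50,257-symbol vocabulary): frequent byte sequences collapse into single symbols while rare ones fall back to raw bytes, so any Unicode string can be encoded with no <UNK> token.

```python
# Toy byte-level BPE: learn merges over raw UTF-8 bytes, so every Unicode string is
# representable and frequent sequences become single symbols. A sketch, not GPT-2's code.
from collections import Counter

def merge_pair(seq, a, b):
    """Replace each adjacent occurrence of (a, b) in seq with the merged symbol a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus: str, num_merges: int = 50):
    """Greedily merge the most frequent adjacent pair of symbols, num_merges times."""
    seq = [bytes([b]) for b in corpus.encode("utf-8")]   # start from single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break                                        # nothing left worth merging
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        seq = merge_pair(seq, a, b)
    return merges

def encode(text: str, merges):
    """Apply the learned merges, in learned order, to new text."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merges:
        seq = merge_pair(seq, a, b)
    return seq

merges = train_bpe("low lower lowest low low lower newest newest")
print(encode("lowest newest", merges))   # frequent chunks such as b"low" become one symbol
print(encode("日本語 café", merges))      # unseen Unicode still encodes, byte by byte
```

Word-level behavior for frequent strings and byte-level fallback for everything else is exactly the interpolation the bullet above describes.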
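To make the P(output | input, task) point concrete, here is a small sketch of task conditioning through language, in the spirit of the paper's (translate to french, english text, french text) example; the exact prompt strings are my own illustration, not literal formats from the paper.

```python
# Hypothetical serialization of (task, input, output) triples into one plain-text stream.
# At training time the whole string is ordinary language-modeling data; at inference time
# we stop after the input and let the model generate the output.
def serialize(task: str, source: str, target: str = "") -> str:
    # the task itself is written out in natural language
    return f"({task}, {source}, {target})" if target else f"({task}, {source},"

# Training-style examples:
print(serialize("translate to french", "the cat sat on the mat", "le chat ..."))
print(serialize("answer the question", "who wrote the jungle book?", "rudyard kipling"))

# Zero-shot inference: supply only task and input, then sample the continuation.
print(serialize("translate to french", "good morning"))
```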

Model

  • Changes inside the Transformer block
    • Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network.
    • An additional layer normalization was added after the final self-attention block.
  • Modified initialization

    A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N, where N is the number of residual layers (see the sketch after this list).

  • Vocabulary and context size

    The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens, and a larger batch size of 512 is used.
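A minimal PyTorch sketch of the model changes above: layer norm moved to the sub-block input, an additional final layer norm, and the 1/√N scaling of residual-path weights. The block layout and the reading of "residual layers" as the two projections per block that write into the residual stream (so N = 2 × number of blocks) are my assumptions; the paper does not spell out the counting.

```python
import math
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-LayerNorm Transformer block (GPT-2 style; causal mask omitted for brevity)."""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)        # layer norm moved to the input of the sub-block
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                               # first write into the residual stream
        x = x + self.mlp(self.ln2(x))           # second write into the residual stream
        return x

n_blocks, d_model, n_head = 12, 768, 12
blocks = nn.ModuleList(Block(d_model, n_head) for _ in range(n_blocks))
ln_f = nn.LayerNorm(d_model)                    # additional layer norm after the final block

# Scale residual-path weights by 1/sqrt(N); N counts both writes per block (my assumption).
N = 2 * n_blocks
with torch.no_grad():
    for blk in blocks:
        blk.attn.out_proj.weight *= 1.0 / math.sqrt(N)
        blk.mlp[-1].weight *= 1.0 / math.sqrt(N)
```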

Experiments

  • All models still underfit WebText and held-out perplexity has as of yet improved given more training time.
  • LM
    • The goal is to understand how WebText LMs perform at zero-shot domain transfer on the primary task they are trained for. Since our model operates on a byte level and does not require lossy pre-processing or tokenization, we can evaluate it on any language model benchmark. (My note: de-tokenizers are still needed, because many benchmarks were built from heavily standardized training data containing tokenization artifacts and special symbols such as <UNK>, which are extremely rare in WebText.)
    • Results: four model sizes were trained, with 117M, 345M, 762M, and 1542M parameters. The largest improves the state of the art on 7 of the 8 datasets; the smallest improves it on only 4.
  • Children’s book test
    • Rather than reporting perplexity as an evaluation metric, CBT reports accuracy on an automatically constructed cloze test, where the task is to predict which of 10 possible choices for an omitted word is correct (a candidate-scoring sketch follows this list).
    • Data overlap analysis showed one of the CBT test set books, The Jungle Book by Rudyard Kipling, is in WebText, so we report results on the validation set, which has no significant overlap. (My note: this also clarifies why Wikipedia documents were excluded from WebText. Other models have been trained on Wikipedia corpora, so by skipping it GPT-2 can say with a straight face: I never saw your domain-specific data during training, yet I still compete with you on the same tasks, which shows I am the more general model.)
  • LAMBADA
    • The LAMBADA dataset (Paperno et al., 2016) tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of sentences which require at least 50 tokens of context for a human to successfully predict.
    • Investigating GPT-2’s errors showed most predictions are valid continuations of the sentence, but are not valid final words. This suggests that the LM is not using the additional useful constraint that the word must be the final word of the sentence. Adding a stop-word filter as an approximation to this further increases accuracy.
    • Results: perplexity drops from 99.8 to 8.6, and accuracy rises from 19% to 52.66%.
  • Winograd Schema Challenge
    • This challenge was constructed to measure the capability of a system to perform commonsense reasoning by measuring its ability to resolve ambiguities in text.
    • Results: only the two larger model sizes exceed the previous state of the art.
  • Reading comprehension
    • The Conversational Question Answering dataset (CoQA; Reddy et al., 2018) consists of documents from 7 different domains paired with natural language dialogues between a question asker and a question answerer about the document.
    • Results: not particularly strong; zero-shot GPT-2 reaches about 55 F1, well below the supervised (BERT-based) state of the art of nearly 89 F1.
  • Summarization
    • To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sampling (Fan et al., 2018) with k = 2, which reduces repetition and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100 tokens as the summary (see the decoding sketch after this list).
    • On the commonly reported ROUGE 1,2,L metrics the generated summaries only begin to approach the performance of classic neural baselines and just barely outperforms selecting 3 random sentences from the article.
    • GPT-2’s performance drops by 6.4 points on the aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language.
  • Translation
    • In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format English sentence = french sentence and then after a final prompt of English sentence = we sample from the model with greedy decoding and use the first generated sentence as the translation.
    • Results: GPT-2 achieves 11.5 BLEU (on the WMT-14 French-English test set), outperforming several unsupervised machine translation baselines from Artetxe et al. (2017) and Lample et al. (2017), but still falling well short of the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019).
    • Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText, which detected only 10MB of data in the French language, approximately 500x smaller than the monolingual French corpus common in prior unsupervised machine translation research. (My note: if the model essentially never saw French during training, how is it translating at all? I have not figured this out. Or is the point that even though only a tiny sliver of French slipped past the filtering, that is already enough to get some translation behavior out of a model trained on a huge English corpus?)
  • Question answering
    • Similar to translation, the context of the language model is seeded with example question answer pairs, which helps the model infer the short answer style of the dataset. (My note: I have not fully understood how this seeding works. Is it telling the model what the expected answer format for this kind of question is? If so, what is the mechanism behind it?)
    • GPT-2 answers 5.3 times more questions correctly, suggesting that model capacity has been a major factor in the poor performance of neural systems on this kind of task as of yet. (My note: doesn't this mostly show that GPT-2's sheer capacity gives it a strong memory? If it can "remember" what it saw across that much text, much like memorizing the answers in all of Wikipedia, then of course it can answer a lot of questions.)
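The summarization, translation, and QA setups above all reduce to the same recipe: write the task into the context as text, then decode, either with top-k random sampling (k = 2 for summaries) or greedily (k = 1 for translation and QA). A minimal sketch, assuming a hypothetical lm callable that maps token ids of shape (1, T) to logits of shape (1, T, vocab), plus enc/dec for the byte-level BPE; first_three_sentences and example_pairs are placeholders, not anything from the paper's code.

```python
import torch

def sample_top_k(lm, tokens, max_new_tokens: int = 100, k: int = 2):
    """Top-k random sampling (Fan et al., 2018): keep only the k most likely next
    tokens at each step and sample among them; k = 1 is greedy decoding."""
    for _ in range(max_new_tokens):
        logits = lm(tokens)[:, -1, :]                      # next-token logits, (1, vocab)
        top_vals, top_idx = torch.topk(logits, k)
        probs = torch.softmax(top_vals, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)   # index into the top-k set
        next_tok = top_idx.gather(-1, choice)
        tokens = torch.cat([tokens, next_tok], dim=-1)
    return tokens

# Hypothetical usage (enc/dec, lm, first_three_sentences, example_pairs are placeholders):
#   summary_ids = sample_top_k(lm, enc(article + "\nTL;DR:"), 100, k=2)
#   summary     = first_three_sentences(dec(summary_ids))
#   prompt      = "\n".join(f"{en} = {fr}" for en, fr in example_pairs) + f"\n{new_en} ="
#   translation = dec(sample_top_k(lm, enc(prompt), 100, k=1))   # greedy, keep 1st sentence
```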
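And a sketch of how the cloze-style evaluations above (CBT, and similarly LAMBADA's final-word prediction) can be scored with a plain language model: substitute each candidate, sum token log-probabilities, and pick the best. It reuses the same hypothetical lm/enc interfaces as the previous sketch; the "_____" blank marker and the stop-word comment are illustrative, not the paper's exact procedure.

```python
import torch

def score_candidates(lm, enc, context: str, candidates):
    """Fill the blank with each of the 10 CBT choices and return the one to which the
    LM assigns the highest total log-probability (a sketch, not the paper's code)."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        tokens = torch.tensor([enc(context.replace("_____", cand))])   # (1, T)
        logprobs = torch.log_softmax(lm(tokens)[:, :-1, :], dim=-1)    # predict token t+1 from t
        targets = tokens[:, 1:].unsqueeze(-1)
        score = logprobs.gather(-1, targets).sum().item()
        if score > best_score:
            best, best_score = cand, score
    return best

# For LAMBADA, GPT-2's errors are mostly valid continuations that are not valid *final*
# words; filtering out stop-word predictions approximates the "must end the sentence"
# constraint and further improves accuracy.
```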

Generalization vs. Memorization

  • Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has 3.3% overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems.
  • To study this we created Bloom filters containing 8-grams of WebText training set tokens. To improve recall, strings were normalized to contain only lower-cased alphanumeric words with a single space as a delimiter. These Bloom filters let us calculate, given a dataset, the percentage of 8-grams from that dataset that are also found in the WebText training set.
  • Another potential way of determining whether the performance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. Performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting on WebText in many ways.
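A minimal sketch of the overlap analysis described above, using a tiny hand-rolled Bloom filter. The bit-array size, hash count, and the placeholder corpora are my own choices for illustration; the paper only describes the 8-gram and normalization setup, not its implementation.

```python
import hashlib
import re

class BloomFilter:
    """Tiny Bloom filter: k hashes into an m-bit array; membership tests can give false
    positives but never false negatives, which is fine for estimating overlap."""
    def __init__(self, m_bits: int = 1 << 24, k: int = 5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def normalize(text: str):
    # lower-cased alphanumeric words with a single space as delimiter, as described above
    return re.findall(r"[a-z0-9]+", text.lower())

def eight_grams(words):
    return [" ".join(words[i:i + 8]) for i in range(len(words) - 7)]

train_filter = BloomFilter()
for doc in ["placeholder webtext training document number one about language models"]:
    for gram in eight_grams(normalize(doc)):
        train_filter.add(gram)

test_doc = "placeholder evaluation document that we check against the training set"
test_grams = eight_grams(normalize(test_doc))
overlap = sum(g in train_filter for g in test_grams) / max(len(test_grams), 1)
print(f"{100 * overlap:.1f}% of the test set's 8-grams also appear in the training set")
```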

Discussions

  • Direction 1: keep exploring the two-stage pre-training + fine-tuning paradigm. While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with finetuning.
  • Direction 2: stubbornly stick with a unidirectional language model and still try to beat BERT. Given the prior success of fine-tuning GPT, we plan to investigate fine-tuning on benchmarks such as decaNLP and GLUE, especially since it is unclear whether the additional training data and capacity of GPT-2 is sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT (Devlin et al., 2018). (My note: is this an open challenge to BERT? "I am going to keep using a unidirectional LM, and I will beat you with an even more gigantic Transformer and an endless amount of training data.")

Original paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf