the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions [15], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity considerably, to O(k · n · d + n · d^2). Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

5 Training

This section describes the training regime for our models.

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

5.3 Optimizer

We used the Adam optimizer [17] with β1 = 0.9, β2 = 0.98 and ε = 10^-9. We varied the learning rate over the course of training, according to the formula:

    lrate = d_model^-0.5 · min(step_num^-0.5, step_num · warmup_steps^-1.5)    (3)

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
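As a quick illustration, the schedule in Equation (3) can be written directly as a function of the step number. The sketch below is ours, not the authors' code; warmup_steps = 4000 matches the value stated above, while d_model = 512 is an assumed base-model dimension.

```python
# Minimal sketch of the learning-rate schedule in Equation (3).
# warmup_steps = 4000 as stated in Section 5.3; d_model = 512 is assumed here.

def lr_schedule(step_num: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup for warmup_steps steps, then inverse-square-root decay."""
    step_num = max(step_num, 1)  # guard against step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# The rate rises linearly, peaks at the end of warmup, then decays.
print(lr_schedule(100))      # early in warmup: small rate
print(lr_schedule(4000))     # peak, roughly 7.0e-4 with d_model = 512
print(lr_schedule(100000))   # late in training: decayed rate
```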
5.4 Regularization

We employ three types of regularization during training:

Residual Dropout  We apply dropout [27] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.
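The following is a minimal sketch of where this dropout sits, written with PyTorch-style modules purely for illustration; the helper names and the d_model = 512 dimension are our assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of residual dropout as described
# above. `sublayer` stands for either self-attention or the position-wise
# feed-forward network of a Transformer layer.
import torch
import torch.nn as nn

p_drop = 0.1                    # rate used for the base model
dropout = nn.Dropout(p_drop)
layer_norm = nn.LayerNorm(512)  # d_model = 512 assumed for the base model

def residual_sublayer(x: torch.Tensor, sublayer) -> torch.Tensor:
    # Dropout is applied to the sub-layer output, then the residual connection
    # is added and the result is normalized.
    return layer_norm(x + dropout(sublayer(x)))

def embed_input(tokens: torch.Tensor, embedding: nn.Embedding,
                positional_encoding: torch.Tensor) -> torch.Tensor:
    # Dropout is also applied to the sum of the token embeddings and the
    # positional encodings, in both the encoder and decoder stacks.
    return dropout(embedding(tokens) + positional_encoding[: tokens.size(1)])
```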