The otherwise excellent PyTorch documentation does not fully specify how multilayer bidirectional RNN outputs are packed.
The output of an RNN block is the tuple (output, h_n), where output holds the last (topmost) layer's hidden state for every element of the input sequence, and h_n holds the final hidden state of every layer. There is no built-in way to get every hidden state of every layer.
For a given batch element e_idx, h_n has shape (num_layers * num_directions, batch, hidden) and is structured like so (layer-major, with the direction varying fastest):
h_n[:, e_idx, :] == [
[ layer 0 final forward hidden state ], # corresponds to last element in sequence
[ layer 0 final backward hidden state ], # corresponds to first element in sequence
...
[ layer n final forward hidden state ], # call this `X`. (last element in seq.)
[ layer n final backward hidden state ], # call this `Y`. (first element in seq.)
]
In output, the last layer's forward and backward hidden states are concatenated along the feature dimension, with the forward half coming first. For a network with hidden size H, the forward output at the last element of the sequence is
output[e_idx, -1, 0:H] == X
Likewise, the backward output at the first element of the sequence is
output[e_idx, 0, H:] == Y
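For concreteness, here is a minimal sketch that checks both identities, assuming a 2-layer bidirectional nn.GRU (chosen over an LSTM only to avoid the extra cell state) with batch_first=True:

import torch
import torch.nn as nn

B, T, I, H, L = 4, 7, 5, 3, 2  # batch, seq len, input size, hidden size, num layers
rnn = nn.GRU(I, H, num_layers=L, bidirectional=True, batch_first=True)
output, h_n = rnn(torch.randn(B, T, I))  # output: (B, T, 2*H), h_n: (2*L, B, H)

e_idx = 0
X = h_n[-2, e_idx]  # layer n final forward hidden state
Y = h_n[-1, e_idx]  # layer n final backward hidden state

# Forward half of the last timestep's output is the final forward state,
# and the backward half of the first timestep's output is the final backward state.
assert torch.allclose(output[e_idx, -1, 0:H], X)
assert torch.allclose(output[e_idx, 0, H:], Y)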
Dropout is only applied between stacked layers (to the output of every layer except the last), so it is a no-op for a single-layer RNN. PyTorch warns accordingly:
dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1
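A quick way to see both halves of this, assuming nn.LSTM (the other RNN classes behave the same): a single-layer module with non-zero dropout emits the warning, while a two-layer one applies dropout between layers and stays silent.

import warnings
import torch.nn as nn

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    nn.LSTM(input_size=10, hidden_size=20, num_layers=1, dropout=0.5)  # useless dropout
    print(caught[-1].message)  # the warning quoted above

nn.LSTM(input_size=10, hidden_size=20, num_layers=2, dropout=0.5)  # dropout applied between layers 0 and 1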
The docs say that attention masks should be specified with shape <Nbatch * Nheads> x <target (query) seq len> x <source (key) seq len>. However, they do not specify how the first dimension is ordered. According to this SO answer, it is (batch, head) interleaved: batch-major, with the head index varying fastest.
Sadly, there doesn’t seem to be a way to pass a single per-batch-element mask and have it applied to all heads, which would elide a repeat().
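Assuming the batch-major, head-fastest ordering claimed in that answer (row b * num_heads + h holds batch element b's mask for head h), a per-batch-element mask can be expanded with repeat_interleave rather than a head-major repeat(). A hedged sketch with made-up sizes:

import torch
import torch.nn as nn

N, L, S, E, num_heads = 2, 4, 6, 8, 4
mha = nn.MultiheadAttention(embed_dim=E, num_heads=num_heads, batch_first=True)
query = torch.randn(N, L, E)
key = value = torch.randn(N, S, E)

# One boolean mask per batch element; True means "may not attend".
per_batch_mask = torch.zeros(N, L, S, dtype=torch.bool)
per_batch_mask[1, :, 3:] = True  # e.g. hide trailing key positions of element 1

# Expand to (N * num_heads, L, S): batch-major, head index varying fastest.
attn_mask = per_batch_mask.repeat_interleave(num_heads, dim=0)

out, attn_weights = mha(query, key, value, attn_mask=attn_mask)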
To fully print tensors, t.set_printoptions(profile = "full", linewidth = 100) should be good enough. The full list of print options is found here. Note that unlike numpy, this cannot be used as a context manager; print options need to be explicitly reset to the default via t.set_printoptions(profile = "default").
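Since torch ships no equivalent of numpy's np.printoptions context manager, a small wrapper is easy to write. full_print below is a hypothetical helper, not part of torch; lacking a public getter for the current options, it restores the stock defaults rather than whatever was previously set.

import contextlib
import torch

@contextlib.contextmanager
def full_print(**overrides):
    # Hypothetical helper: switch to "full" printing, then restore the defaults.
    torch.set_printoptions(profile="full", linewidth=100, **overrides)
    try:
        yield
    finally:
        torch.set_printoptions(profile="default")

with full_print():
    print(torch.arange(10_000))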
Why does this work?
t[index][index2] = <something>
Naively, t[index] produces a new tensor, so this chained assignment should not update anything in t, yet it does work. The reason is that basic indexing (an integer or slice index) returns a view that shares t's storage, so the write lands in t; with advanced indexing (a tensor or list index), t[index] really is a copy and the write is silently dropped.
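A quick demonstration of both cases:

import torch

t = torch.zeros(3, 3)

# Basic indexing: t[0] is a view of t's storage, so the write lands in t.
t[0][1] = 5.0
assert t[0, 1] == 5.0

# Advanced indexing: t[torch.tensor([0])] is a copy, so the write is silently lost.
t[torch.tensor([0])][0, 2] = 7.0
assert t[0, 2] == 0.0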
On backpropping subsets of output: https://discuss.pytorch.org/t/backwards-with-only-a-subset-of-the-output-losses/188313/7
See also: https://github.com/pytorch/pytorch/issues/9688
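A minimal sketch of the basic pattern discussed there (not necessarily what the threads recommend): compute a loss on a slice of the output and call backward(); gradients flow only through the selected elements.

import torch
import torch.nn as nn

model = nn.Linear(4, 6)
out = model(torch.randn(8, 4))  # (8, 6)

# Backprop through only the first two output columns; the rest contribute no gradient.
loss = out[:, :2].pow(2).mean()
loss.backward()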