Tags:
  • python
  • deeplearning
  • NN modules

    RNN

    The otherwise excellent PyTorch documentation does not fully specify how the outputs of a multilayer bidirectional RNN are packed.

    The output of an RNN block is the tuple (output, h_n), where output holds the hidden state of the last (outermost) recurrent layer at every time step, and h_n holds the final hidden state of every layer. There is no way to get every layer's hidden state at every time step.

    For a given batch element e_idx, h_n is structured like so

    h_n[:, e_idx, :] == [
      [ layer 0 final forward hidden state ], # corresponds to last element in sequence
      [ layer 0 final backward hidden state ], # corresponds to first element in sequence
      ...
      [ layer n final forward hidden state ],   # call this `X`. (last element in seq.)
      [ layer n final backward hidden state ],  # call this `Y`. (first element in seq.)
    ]  
    

    In output, the outermost layer's forward and backward hidden states are concatenated along the feature dimension, with the forward outputs coming first. For a network with hidden size H, the last element in the sequence has forward output

    output[e_idx, -1, 0:H] == X
    

    Likewise, the first element in the sequence has backward output

    output[e_idx, 0, H:] == Y
    
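
    The layout above can be checked directly; a minimal sketch, assuming batch_first = True and illustrative sizes:

    import torch
    import torch.nn as nn

    B, T, D, H, L = 3, 5, 4, 6, 2  # batch, seq len, input size, hidden size, num layers
    rnn = nn.RNN(D, H, num_layers = L, bidirectional = True, batch_first = True)
    output, h_n = rnn(torch.randn(B, T, D))  # output: (B, T, 2*H), h_n: (2*L, B, H)

    # Last layer, forward direction: final hidden state == forward output at the last step.
    assert torch.allclose(output[:, -1, :H], h_n[-2])
    # Last layer, backward direction: final hidden state == backward output at the first step.
    assert torch.allclose(output[:, 0, H:], h_n[-1])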

    Dropout is only applied between stacked layers (i.e. to the output of every layer except the last), so it does nothing for a single-layer RNN. PyTorch warns:

    dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1
    
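
    A quick way to see the warning for yourself (the sizes below are arbitrary):

    import warnings
    import torch.nn as nn

    with warnings.catch_warnings(record = True) as caught:
        warnings.simplefilter("always")
        nn.RNN(8, 8, num_layers = 1, dropout = 0.5)  # dropout has no inter-layer output to act on
    print(caught[0].message)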

    MultiheadAttention

    The docs say that attention masks should be specified as <Nbatch * Nheads> x <target (query) seq len> x <source (key) seq len>. However, they do not specify how the first dimension is ordered.

    According to this SO answer, it’s ordered as (batch, head): the first dimension is laid out as if reshaped from (Nbatch, Nheads), so all heads for batch element 0 come first, then all heads for batch element 1, and so on.

    Sadly, there doesn’t seem to be a way to pass a single mask shared by all heads, which would elide the per-head expansion sketched below.
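
    Given that batch-major ordering, a per-batch mask has to be expanded with repeat_interleave (rather than repeat) so that each batch element's mask stays aligned with its own heads. A minimal sketch with assumed example sizes:

    import torch
    import torch.nn as nn

    N, nheads, L, S, E = 2, 4, 7, 9, 16
    per_batch_mask = torch.zeros(N, L, S, dtype = torch.bool)  # one mask per batch element

    # The (N * nheads) dimension is batch-major (all heads of element 0 first), so
    # repeat_interleave keeps each element's mask next to that element's heads.
    attn_mask = per_batch_mask.repeat_interleave(nheads, dim = 0)  # (N * nheads, L, S)

    mha = nn.MultiheadAttention(embed_dim = E, num_heads = nheads, batch_first = True)
    q, kv = torch.randn(N, L, E), torch.randn(N, S, E)
    out, weights = mha(q, kv, kv, attn_mask = attn_mask)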

    Printing options

    To fully print tensors,

    t.set_printoptions(profile = "full", linewidth = 100)
    

    should be good enough. The full set of print options is documented under torch.set_printoptions.

    Note that unlike numpy, there is no context-manager form; print options need to be explicitly reset to the default via t.set_printoptions(profile = "default"). A hand-rolled wrapper is sketched below.
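
    For example (a sketch, not part of PyTorch; the wrapper name and arguments are made up):

    from contextlib import contextmanager
    import torch as t

    @contextmanager
    def full_print(linewidth = 100):
        t.set_printoptions(profile = "full", linewidth = linewidth)
        try:
            yield
        finally:
            t.set_printoptions(profile = "default")

    with full_print():
        print(t.arange(10_000))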

    Weirdness

    Why does this work?

    t[index][index2] = <something>
    

    This chained assignment never raises an error, but when index is a tensor or list (advanced indexing) it updates nothing in t: t[index] returns a copy, so the write lands in a temporary that is immediately discarded. With basic indexing (an int or slice), t[index] is a view and the write does reach t.
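
    A small demonstration (the names are illustrative):

    import torch

    t = torch.zeros(3, 3)
    index = torch.tensor([0, 1])  # advanced indexing: t[index] is a copy
    t[index][0] = 99.0            # writes into the temporary copy, which is then discarded
    print(t)                      # still all zeros

    t[0][0] = 99.0                # basic indexing: t[0] is a view, so this does stick
    print(t)                      # t[0, 0] is now 99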

    Backpropagation

    On backpropping subsets of output: https://discuss.pytorch.org/t/backwards-with-only-a-subset-of-the-output-losses/188313/7

    See also: https://github.com/pytorch/pytorch/issues/9688