Tags:
  • python
  • deeplearning
  • NN modules

    RNN

    The otherwise excellent PyTorch documentation does not fully specify how the outputs of a multilayer bidirectional RNN are packed.

    The output of an RNN block is the tuple (output, h_n), where output holds the hidden state of the last (outermost) recurrent layer at every time step, and h_n holds the final hidden state of every layer. There is no way to get every layer's hidden state at every time step.

    For a given batch element e_idx, h_n is structured like so

    h_n[:, e_idx, :] == [
      [ layer 0 final forward hidden state ], # corresponds to last element in sequence
      [ layer 0 final backward hidden state ], # corresponds to first element in sequence
      ...
      [ layer n final forward hidden state ],   # call this `X`. (last element in seq.)
      [ layer n final backward hidden state ],  # call this `Y`. (first element in seq.)
    ]  
    

    In output, the outermost layer's forward and backward hidden states are concatenated along the feature dimension, with the forward outputs coming first. For a network with hidden size H, the last element in the sequence has forward output

    output[e_idx, -1, 0:H] == X
    

    Likewise, the first element in the sequence has backward output

    output[e_idx, 0, H:] == Y
    
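
    The layout above can be checked directly; a minimal sketch, assuming batch_first = True and illustrative sizes:

    import torch
    import torch.nn as nn

    B, T, D, H, L = 3, 5, 4, 6, 2  # batch, seq len, input size, hidden size, num layers
    rnn = nn.RNN(D, H, num_layers = L, bidirectional = True, batch_first = True)
    output, h_n = rnn(torch.randn(B, T, D))  # output: (B, T, 2*H), h_n: (2*L, B, H)

    # Last layer, forward direction: final hidden state == forward output at the last step.
    assert torch.allclose(output[:, -1, :H], h_n[-2])
    # Last layer, backward direction: final hidden state == backward output at the first step.
    assert torch.allclose(output[:, 0, H:], h_n[-1])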

    Dropout is only applied between stacked layers (i.e. to the output of every layer except the last), so it does nothing for a single-layer RNN. PyTorch warns:

    dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1
    
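
    A quick way to see the warning for yourself (the sizes below are arbitrary):

    import warnings
    import torch.nn as nn

    with warnings.catch_warnings(record = True) as caught:
        warnings.simplefilter("always")
        nn.RNN(8, 8, num_layers = 1, dropout = 0.5)  # dropout has no inter-layer output to act on
    print(caught[0].message)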

    MultiheadAttention

    The docs say that attention masks should be specified as <Nbatch * Nheads> x <target (query) seq len> x <source (key) seq len>. However, they do not specify how the first dimension is ordered.

    According to this SO answer, it’s ordered as (batch, head): the first dimension is laid out as if reshaped from (Nbatch, Nheads), so all heads for batch element 0 come first, then all heads for batch element 1, and so on.

    Sadly, there doesn’t seem to be a way to pass a single mask shared by all heads, which would elide the per-head expansion sketched below.
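
    Given that batch-major ordering, a per-batch mask has to be expanded with repeat_interleave (rather than repeat) so that each batch element's mask stays aligned with its own heads. A minimal sketch with assumed example sizes:

    import torch
    import torch.nn as nn

    N, nheads, L, S, E = 2, 4, 7, 9, 16
    per_batch_mask = torch.zeros(N, L, S, dtype = torch.bool)  # one mask per batch element

    # The (N * nheads) dimension is batch-major (all heads of element 0 first), so
    # repeat_interleave keeps each element's mask next to that element's heads.
    attn_mask = per_batch_mask.repeat_interleave(nheads, dim = 0)  # (N * nheads, L, S)

    mha = nn.MultiheadAttention(embed_dim = E, num_heads = nheads, batch_first = True)
    q, kv = torch.randn(N, L, E), torch.randn(N, S, E)
    out, weights = mha(q, kv, kv, attn_mask = attn_mask)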

    Printing options

    To fully print tensors,

    t.set_printoptions(profile = "full", linewidth = 100)
    

    should be good enough. The full set of print options is documented under torch.set_printoptions.

    Note that unlike numpy, there is no context-manager form; print options need to be explicitly reset to the default via t.set_printoptions(profile = "default"). A hand-rolled wrapper is sketched below.
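
    For example (a sketch, not part of PyTorch; the wrapper name and arguments are made up):

    from contextlib import contextmanager
    import torch as t

    @contextmanager
    def full_print(linewidth = 100):
        t.set_printoptions(profile = "full", linewidth = linewidth)
        try:
            yield
        finally:
            t.set_printoptions(profile = "default")

    with full_print():
        print(t.arange(10_000))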

    Weirdness

    Why does this work?

    t[index][index2] = <something>
    

    This chained assignment never raises an error, but when index is a tensor or list (advanced indexing) it updates nothing in t: t[index] returns a copy, so the write lands in a temporary that is immediately discarded. With basic indexing (an int or slice), t[index] is a view and the write does reach t.
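
    A small demonstration (the names are illustrative):

    import torch

    t = torch.zeros(3, 3)
    index = torch.tensor([0, 1])  # advanced indexing: t[index] is a copy
    t[index][0] = 99.0            # writes into the temporary copy, which is then discarded
    print(t)                      # still all zeros

    t[0][0] = 99.0                # basic indexing: t[0] is a view, so this does stick
    print(t)                      # t[0, 0] is now 99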

    Backpropagation

    On backpropping subsets of output: https://discuss.pytorch.org/t/backwards-with-only-a-subset-of-the-output-losses/188313/7

    See also: https://github.com/pytorch/pytorch/issues/9688