[Official course website]

This is a set of lecture notes for the course Neural Networks for Data Science Applications (link), delivered in the Master's Degree in Data Science (Sapienza University of Rome). This is a draft with no pretense of completeness, provided only as auxiliary material for self-study. It covers a superset of the material in the slides, going more or less in depth depending on the topic. A description of the organization of the notes can be found in Lecture 1 - Introduction, while a brief introduction to each chapter can be found below.

Feel free to comment here on Notion to provide feedback on the draft. Many parts are missing; let me know if you want to help complete them.

Notation


As we will see in Lecture 2 - Preliminaries, our fundamental data type for the course is a tensor, which we define as an $n$-dimensional array of objects, typically real-valued numbers. We call $n$ the rank of the tensor (with the necessary apologies to any mathematician reading us). The notation in the notes varies depending on $n$:

  1. a lowercase letter, e.g., $x$, denotes a scalar (rank $0$);
  2. a bold lowercase letter, e.g., $\mathbf{x}$, denotes a vector (rank $1$);
  3. a bold uppercase letter, e.g., $\mathbf{X}$, denotes a matrix (rank $2$);
  4. an uppercase letter, e.g., $X$, denotes a generic tensor of higher rank.
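As a concrete illustration, the sketch below builds one tensor of each rank. It assumes NumPy purely for convenience; any tensor library (e.g., PyTorch) exposes the same `ndim` and `shape` attributes:

```python
import numpy as np

# A minimal sketch of the rank/shape terminology, using NumPy.
x = np.float32(3.0)                             # rank-0 tensor: a scalar x
v = np.zeros(5, dtype=np.float32)               # rank-1 tensor: a vector x ~ (5)
M = np.zeros((4, 5), dtype=np.float32)          # rank-2 tensor: a matrix X ~ (4, 5)
X = np.zeros((2, 32, 32, 3), dtype=np.float32)  # rank-4 tensor: X ~ (b, h, w, 3)

print(X.ndim)   # the rank n: 4
print(X.shape)  # the shape: (2, 32, 32, 3)
```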

We use a variety of indexing strategies, described in more detail in Linear algebra, while additional notation is introduced when necessary. In many cases, fully understanding a method requires understanding precisely the shape of each tensor involved. To denote the shape concisely, we use the following notation:

$$ X \sim(b,h,w,3) $$

This is a rank-$4$ tensor with shape $(b,h,w,3)$. Some dimensions can be pre-specified (e.g., $3$ in this case), while others are denoted by variables. Note that we use the same symbol to denote drawing from a probability distribution, e.g., $\varepsilon \sim \mathcal{N}(0,1)$, but we do this rarely and the meaning of the symbol should always be clear from context. Hence, $\mathbf{x} \sim (d)$ replaces the more common $\mathbf{x} \in \mathbb{R}^d$, and similarly $\mathbf{X} \sim (n,d)$ replaces $\mathbf{X} \in \mathbb{R}^{n \times d}$.
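To make the shape notation concrete, here is a minimal sketch, again assuming NumPy; `assert_shape` is a hypothetical helper written for illustration, not a library function:

```python
import numpy as np

def assert_shape(t, shape):
    """Hypothetical helper: check t against a shape pattern, where None matches any size."""
    assert t.ndim == len(shape), f"expected rank {len(shape)}, got rank {t.ndim}"
    for actual, expected in zip(t.shape, shape):
        assert expected is None or actual == expected, f"shape mismatch: {t.shape}"

X = np.random.rand(8, 32, 32, 3)        # X ~ (b, h, w, 3), here with b=8, h=w=32
assert_shape(X, (None, None, None, 3))  # b, h, w are free; the last dimension must be 3

eps = np.random.randn()  # the other meaning of ~: a single draw from N(0, 1)
```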

Sometimes we want to constrain the elements of a tensor, for which we use a special notation:

  1. $\mathbf{x} \sim \text{Binary}(c)$ denotes a vector of size $c$ with only binary values, $\left\{0,1\right\}$.
  2. $\mathbf{X} \sim \text{Int}_{1,4}(a,b)$ denotes an $a \times b$ matrix with integers in the range $[1,4]$.
  3. $\mathbf{x} \sim \Delta(a)$ denotes a vector belonging to the so-called probability simplex, i.e., $x_i \ge 0$ and $\sum_i x_i = 1$. For tensors with higher rank, e.g., $\mathbf{X} \sim \Delta(n,c)$, we assume the normalization is applied with respect to the last dimension. For example, in this case each row $\mathbf{X}_i$ belongs to the simplex (see the sketch after this list).
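As a sketch of these constraints, assuming NumPy (the `softmax` below is defined inline for illustration), we can map an unconstrained matrix to $\mathbf{P} \sim \Delta(4,3)$ by normalizing over the last dimension:

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the per-row max for numerical stability, then normalize.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Z = np.random.randn(4, 3)  # Z ~ (4, 3), unconstrained real values
P = softmax(Z)             # P ~ Delta(4, 3): each row lies on the simplex
assert np.all(P >= 0) and np.allclose(P.sum(axis=-1), 1.0)

b = (np.random.rand(5) < 0.5).astype(np.int64)  # b ~ Binary(5): values in {0, 1}
```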

Code snippets