독서정리 - 딥러닝을 위한 수학

Contents

Ch1. 도구 소개

Numpy

# seed 초기화
np.random.seed(42)

# data type 명시 가능
a = np.array([1,2,3,4], dtype="uint8")
a.dtype # uint8

# array filled with zeros, ones
np.zeros((3,4)).shape # (3,4)
np.ones((3,4)).size # 12

# arange and linspace and such
a = np.arange(12).reshape((3,4))
a = np.linspace(0, 8, 20) # [0, 8]을 20개로 균등하게 나눈 값들을 반환
'''
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
'''
a[1] # a[1,:]와 동일하며 array([4, 5, 6, 7])
a[1] = [44,55,66,77]
a
'''
array([[ 0,  1,  2,  3],
       [ 44,  55,  66,  77],
       [ 8,  9, 10, 11]])
'''
a[1,::-1] # array([77, 66, 55, 44])

# these are the same (abbreviation for :)
a[1,:,:,:]
a[1,...]

# sampling
n = 1000
np.random.normal(0,1,n) # avg, std, n
np.random.random(n) # [0,1)에서 난수 n개 샘플링
np.random.randint(a, b, n) # [a, b) 범위에서 정수 n개 샘플링

#misc
np.argsort(arr) # 정렬되었을 때 indices를 반환
np.bincount(arr, minlength=n) # arr의 원소 개수 반환, minlength는 반환 리스트의 최소 크기

# save and load
np.save("arr.npy", a)
a = np.load("arr.npy")

SciPy

a = np.random.normal(0,1,1000)
b = np.random.normal(0,0.8,1000)
scipy.stats.ttest_ind(a, b)
# TtestResult(statistic=np.float64(-0.22508693369270907), pvalue=np.float64(0.8219346636644121), df=np.float64(1998.0))

Ch2. 확률

sample space : 한 event의 모든 가능한 결과의 집합
mutually exclusive : 두 event가 동시에 발생하지 않는 경우
independent : 두 event의 발생 확률이 서로 무관한 경우 어떤 event의 결과는 random variable(e.g. X)로 나타낼 수 있고 sample space의 하나의 값을 특정한 확률로 취한다 $P(X=s)=p$
sum rule : mutually exclusive한 사건들에 대해

\[P(A_0\cup A_1\cdots \cup A_n ) = \sum_{i=0}^nP(A_i)\]

disjoint는 교집합이 없다는 의미이다. sample space는 서로 disjoint인 부분집합들로 나눌 수 있는데 그런 식으로 나눈 각 부분집합들을 집합의 partition이라고 한다.

partition $B_i$에 대한 A의 total probability

\[P(A)=\sum_i P(A|B_i)P(B_i)\]

joint probability : 여러 random variable에 대해서 동시에 여러 condition을 만족할 때의 확률 $P(X_i, Y_j, Z_k)$
marginal probability : 하나의 random variable에 대해 나머지 random variable가 가질 수 있는 값을 모두 더한 확률 $P(X_i)=\sum^n_{j=0} (X_i, Y_j)$
chain rule for probability

\[P(X_n,X_{n-1},\dots,X_1)=\prod_{i=1}^nP(X_i|\bigcap^{i-1}_{j=1} X_j)\]

예를 들어, $P(X,Y,Z)=P(X\vert Y,Z)P(Y,Z)=P(X\vert Y,Z)P(Y\vert Z)P(Z)$ 라는 뜻이다.

Probability distribution : 특정한 패턴에 따라 값을 랜덤하게 생성하는 함수
- Discrete distribution
  - Binomial distribution : 주어진 확률 p에 대해서 시행을 n회 했을 때 각 사건의 발생 횟수에 대한 확률 분포 $P(X=k)=\binom nk p^{k}(1-p)^{n-k}$
    - Bernoulli distribution : 시행 횟수가 1일 때의 binomial distribution
  - Poisson distribution : 일정한 시간 간격 동안 평균 발생 횟수가 n일 때 동일한 시간 간격 동안 사건의 발생 횟수에 대한 확률 분포 $P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}$
- Continuous distribution : 특정한 하나의 값이 선택될 확률이 0
  - Uniform distribution
  - Normal distribution(bell curve, gaussian curve)
  - Bimodal distribution : 쌍봉낙타처럼 peak가 2개인 분포
  - Beta distribution
  - Gamma distribution

t = np.random.random(n)
t = np.random.normal(m, std, size=n)
t = np.random.beta(a, b, size=n)
t = np.random.gamma(k, scale, size=n)

np.histogram(t, bins=bins) # 히스토그램의 각 구간을 bin이라고 부른다. 

probability density function : continuous distribution에서 값을 추출할 때 적용되는 확률을 생성하는 함수
Central limit theorem : 원래 확률 분포와 무관하게 여러번 샘플링해서 얻은 표본의 평균에서 생성한 확률 분포는 normal distribution에 가까워진다
Law of large numbers : 표본의 크기가 클수록 표본의 평균이 모평균에 가까워진다
Bayes’ theorem : posterior probability $P(A\vert B)$는 likelihood $P(B\vert A)$에 prior probability $P(A)$를 곱한 다음 evidence $P(B)$로 정규화한 것이다

\[P(A\vert B)=\frac{P(B\vert A)P(A)}{P(B)}=\frac{P(B\vert A)P(A)}{P(B\vert A)P(A)+P(B\vert A^{C})P(A^{C})}\]

$P(A)$에 $P(A\vert B)$를 대입해서 연쇄적으로 베이즈 정리를 적용할 수 있다. 이때 $P(B)$는 marginal probability이므로 새로 계산해서 적용하면 된다.

Naive bayes classifier
label $y$와 feature vector $x=(x_0,x_1,\dots,x_{n-1})$가 있다고 할 때 $x$의 각 원소가 나올 확률이 서로 독립 $P(x)=P(x_0)P(x_1)\cdots P(x_{n-1})$ 이라고 가정하면 (보통 실제로는 아님)

\[P(y\vert x)=\frac{P(x\vert y)P(y)}{P(x)}\]

이때 $P(x\vert y)$는 각 label 일때 $x$가 나왔던 비율이고 분모의 $P(x)$는 $x$를 분류할 때는 모두 동일하기 때문에 계산을 생략해도 된다

Ch3. 통계

data
- categorical data(nominal data) : 순서 없는 데이터
- ordinal data : 순서는 있는데 차이의 의미가 없는 데이터
- interval data : 순서도 있고 차이의 의미도 있지만 영점이 없는 데이터
- ratio data : 순서도 있고 차이의 의미랑 영점도 있는 데이터
mean
- unweighted arithmetic mean
- weighted arithmetic mean ($\sum_i w_i=1$ 일때) : $\bar{x}=\sum_i^{n-1}w_i x_i$
- geometric mean : $^n\sqrt{\prod_i^{n-1} x_i}$
- harmonic mean : $(\frac{1}{n}\sum_i^{n-1} \frac{1}{x_i})^{-1}$
variation
- range : $\max(x) - \min(x)$
- mean deviation : $MD=\frac{1}{n}\sum_i^{n-1} \vert \bar{x}-x_i\vert$
- biased sample variance : $s_n^2=\frac{1}{n}\sum_i^{n-1}(\bar{x}-x_i)^2$
- unbiased sample variance : $s^2=\frac{1}{n-1}\sum_i^{n-1}(\bar{x}-x_i)^2$
- standard deviation : $s$
- median absolute deviation : $MAD=median(\vert median(x)-x_i\vert)$
standard error(SE) : sample들의 평균값 집합의 표준편차이고 표본 평균이 얼마나 모집단 평균을 잘 추정했는지를 뜻한다 $SE=\frac{s}{\sqrt{n}}$
2-quantile = 50th percentile = Q2

q = np.quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])
q = np.percentile(x, [0, 25, 50 ,75, 100])

box plot
- IQR : Q3-Q1 사이 부분
- whisker(flier) : Q3 or Q1 $\pm$ 1.5 IQR

missing data 채울 때는 mean보다는 median이 더 나은 경우가 많다.

x[i] = np.nan
np.isnan(x)

correlation
- pearson correlation coefficient(r) : linear한 연관 관계를 확인할 수 있고 non-linear 한 상관 관계는 0일 때가 있다.
- spearman correlation coefficient($\rho$) : monotonic한 연관 관계를 확인할 수 있고 feature 값을 rank 값으로 바꿔서 사용한다.

d = np.vstack((a, b, c))
np.corrcoef(d) # PCC matrix를 반환

귀무가설(null hypothesis, $H_0$) : 차이가 없다는 가설
대안가설(alternative hypothesis, $H_a$) : 차이가 있다는 가설 (grater, less, two-sided)
hypothesis testing
- t-test : parametric
- Mann-Whitney U test : non-parametric, 데이터 값 자체가 아니라 rank 값을 사용한다.

parametric은 데이터의 분포가 정규분포를 따르는 것을 가정한다.
p-value는 $H_0$이 참이라고 할 때 관측된 통계치가 나타날 확률을 뜻하고 검정값을 기준으로 각 검정법의 확률 분포 함수를 적분해서 얻는다.
p-value가 특정 threshold($\alpha$)보다 작으면 $H_0$를 reject 하고 $H_a$를 accept 할 증거가 된다. 이를 결과가 statistically significant 하다고 말한다. 검정법으로는 $H_0$가 참인지 거짓인지 결정적으로 알 수는 없다.
일반적으로 Mann-whitney U test가 $H_0$를 reject 하고 t-test에서는 reject 못 했으면 Mann-whitney U test의 결과를 받아들이는게 좋다.

confidence interval (CI) : 예를 들어 95%의 CI라고 하면 동일한 방식으로 100번 표본을 뽑아서 신뢰 구간을 구하면 그 중 95개가 실제 모평균을 포함하게 된다는 뜻이다. 그냥 모평균이 CI에 있을 확률이 95%라고 이해해도 된다. p-value가 $\alpha$보다 작으면 $H_0$를 포함하지 않는 $CI_\alpha$가 존재한다.
size effect : 두 데이터 간의 평균 차이를 표준 편차로 나눈 값으로 두 데이터의 평균 차이가 얼마나 의미있는 차이인지 나타낸다. 그냥 평균 차이를 표준화한 지표라고 보면 된다.
- e.g. cohen’s d : 해석하는 사람마다 차이가 있는데 대략 0.2 정도면 effect가 낮고 0.8 정도면 effect가 높다고 보면 된다.

effect size와 p-value : 두 모집단의 평균 차이가 얼마 되지 않는 경우에 각 모집단에서 표본을 뽑아서 두 표본 간의 평균이 같은지 다른지 검정한다고 할때, 표본의 크기가 크면 p-value는 충분히 낮게 나올 수 있지만(i.e. statistically significant) 두 표본 간의 effect size는 낮게 나온다(i.e. 실질적으로 의미있는 정도의 차이가 아닐 수 있다).

Ch4. 선형대수

feature space : 가능한 모든 입력의 집합

x = np.array([4, 3, 7]) 
# array([4, 3, 7]) ; shape = (3, )
x[np.newaxis,:]
# array([[4, 3, 7]]) ; shape = (1, 3)

vector를 이용한 diagonal matrix의 표기법

\[\vec{x}=\begin{bmatrix}x_0\\x_1\\\vdots\\x_n\end{bmatrix},\ \text{diag}(\vec{x})=\begin{bmatrix}x_0&0&\cdots&0\\0&x_1&\cdots&0\\\vdots&\vdots& &\vdots\\0&0&\cdots&x_n\end{bmatrix}\]

block matrix : 더 작은 matrix 들로 구성된 matrix

\[A = \begin{bmatrix}1&2\\3&4\end{bmatrix},\ B = \begin{bmatrix}1&1\\1&1\end{bmatrix}\] \[B = \begin{bmatrix}A&B\\B&A\end{bmatrix}=\begin{bmatrix}1&2&1&1\\3&4&1&1\\1&1&1&2\\1&1&3&4\end{bmatrix}\]

operation
- element-wise
  - hadamard product : $C=A\odot B,\ c_{ij}=a_{ij} b_{ij}$
- transpose : numpy에서 1차원 tensor인 array는 transpose 안 먹힌다
- vector
  - inner product(dot product) : $a\cdot b=\langle a,b\rangle=a^{\top}b=\vert a\vert\vert b\vert \cos\theta$
  - outer product : $a\otimes b=ab^{\top}$
    - cf. cartesian product : $A\times B={(a, b)\ \vert\ a\in A, b\in B}$
  - cross product : $a\times b=\vert a\vert \vert b\vert sin\theta\ \hat{n}$
- matrix
  - matrix multiplication
  - kronecker product(matrix direct product) :

\[A\otimes B = \begin{bmatrix}a_{0,0}B&a_{0,1}B&\cdots&a_{0,{n-1}}B\\ a_{1, 0}B&a_{1, 1}B& & \\ \vdots& &\ddots&\vdots\\ a_{n-1,0}B& &\cdots&a_{n-1,n-1}B\end{bmatrix}\]

np.dot(a, b)
np.outer(a, b)
np.cross(a, b)
np.dot(a, b) # 스칼라 값 가능
np.matmul(a, b) # 스칼라 값 불가능
np.kron(a, b)

dot이나 matmul 연산에서 1차 tensor(그냥 array)끼리 연산하면 스칼라가 나오고 1차 tensor랑 2차 tensor끼리 연산하면 1차 tensor가 나오고 2차 이상끼리 하면 2차 이상끼리 나온다.

\[\begin{bmatrix}x^{\prime} \\y^{\prime}\\1\end{bmatrix}=\begin{bmatrix}a&b&l\\c&d&m\\0&0&1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}\]

affine transformation에서 bias를 matrix 안에 포함시킬 수도 있다.

trace

\[trA=\sum^{n-1}_{i=0}a_{ii}\] \[tr(AB)=tr(BA)\]

np.diag(A) # 1차 tensor 반환
np.trace(A)
np.linalg.matrix_power(A, n) # A**n 반환
np.identity(n) # np.eye(n)랑 동일
np.triu(A) # upper triangular matrix 반환
np.tril(A) # lower triangular matrix 반환

$A$의 $(1,1)$-minor matrix ($A_{11}$)

\[A=\begin{bmatrix}1&2&3\\4&5&6\\7&8&9 \end{bmatrix},\ A_{11}=\begin{bmatrix}1&3\\7&9\end{bmatrix}\]

cofactor $C_{ij}$

\[C_{ij}=(-1)^{i+j+2} det(A_{ij})\]

determinant $A$

\[det(A)=\sum_{j=0}^{n-1}a_{0j}C_{0j}=\sum_{j=0}^{n-1}a_{j0}C_{j0}\]

inverse matrix가 존재하면 invertible matrix(nonsingular matrix, nondegenerate matrix)라고 부르고 없으면 singular matrix라고 한다.

\[AA^{-1}=A^{-1}A=I,\ (AB)^{-1}=B^{-1}A^{-1}\]

np.linalg.inv(A)

symmetric matrix의 inverse matrix가 존재한다고 할 때, 그 inverse matrix는 symmetric matrix다. symmetric matrix 끼리의 연산에서는 교환법칙 $AB=BA$이 성립한다.

orthogonal matrix : $AA^{\top}=A^{\top}A=I,\ A^{\top}=A^{-1}$

복소수 행렬 $U$가 $U^{\ast}U=UU^{\ast}=I$를 충족할 때 $U$를 unitary matrix, $U^{\ast}$를 conjugate transpose(hermitian adjoint)라고 부른다. conjugate transpose는 matrix를 transpose 한 후에 각 성분들에 conjugate 연산(복소켤레)을 한 것이다.

hermitian matrix : 자신의 conjugate transpose와 동일한 matrix

\[\begin{align} Av=\lambda v\ (v\ne \vec{0})\\(A-I\lambda)v=0\\ det(A-I\lambda)=0 \end{align}\]

행렬식을 전개해서 나온 다항식을 characteristic polynomial이라고 하고 characteristic polynomial을 0으로 둬서 만든 방정식을 characteristic equation이라고 한다. triangular matrix나 diagonal matrix는 determinant가 대각성분 성분들 곱한거니까 main diagonal이 eigenvalue들이다.

\[k=x^{\top}Ax, \forall x\ne \vec{0}\]

symmetric matrix의 definiteness
- $k > 0$이면 positive definite $\longrightarrow$ 모든 eigenvalue가 양수
- $k<0$이면 negative definite $\longrightarrow$ 모든 eigenvalue가 음수
- 해당 안되면 indefinite

0을 포함시켜서 positive 또는 negative semidefinite도 가능하다.

eigenvalues, eigenvectors = np.linalg.eig(A) # 반환되는 eigenvector는 unit vector
eigenvalues = np.linalg.eigvals(A)

$p$-norm ($L_p$ norm)
- L1 norm(manhattan distance)
- L2 norm(euclidean distance)

\[\Vert x\Vert_p=(\sum_i\vert x_i\vert^p)^{\frac{1}{p}}\]

$L_{\infty}$ norm (chebyshev distance)

\[L_{\infty}=\max \vert x_i\vert\]

distance는 두 벡터 차에 대해서 norm 계산하는 것이다.

covariance matrix

\[\Sigma_{ij}=\frac{1}{n-1}\sum_{k=0}^{n-1}(x_{ki}-\bar{x}_i)(x_{kj}-\bar{x}_j)\]

mahalanobis distance : 데이터의 분산을 고려한 distance

\[D_M=\sqrt{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)}\]

$\mu$는 벡터의 각 feature 값의 평균값으로 구성된 벡터지만 $\mu$ 대신 공간 상의 임의 벡터를 사용해서 distance를 구할 수 있다. 각 label에 속한 벡터들의 평균 벡터인 centroid vector를 계산해서 임의 벡터가 어떤 label의 centroid vector와 가까운지 계산하는 식으로 nearest centroid classifier를 만들 수 있다. 이상치 검출하는데 mahalanobis distance가 사용될 수 있다.

KL-divergence(Kullback-Leibler divergence, relative entropy) : 두 확률 분포 사이의 distance를 측정

\[D_{KL}(P\Vert Q)=\mathbb{E}[\Delta I_i]=\mathbb{E}[\log p_i-\log q_i]\] \[=\sum_x p_i\log_2(\frac{p_i}{q_i})\]

KL-divergence는 두 확률 분포의 정보량 차이의 기댓값이다. 정보량은 $I_i=-\log{p_i}$이므로 원본 확률 분포 $p$와 비교 대상 확률 분포 $q$에 대해서 KL-divergence는 위 식처럼 정의할 수 있다.
KL-divergence의 단위는 bit이고 자연로그 사용하면 nat($1$ nat = $log_2{e}$ bit) 이다. $D_{KL}$은 symmetric하지 않아서 수학적 의미에서는 distance function은 아니지만 distance 측정에 사용할 수는 있다.
KL-divergence를 symmetric하게 개량한 Jensen-Shannon Divergence라는 것도 있다.

\[\text{JSD}(P,Q)=\frac{1}{2}D_{KL}(P\Vert\frac{P+Q}{2})+\frac{1}{2}D_{KL}(Q\Vert\frac{P+Q}{2})\]

np.cov(X, rowvar=False) # covariance matrix 반환하고 unbiased estimator 이용(n-1)
np.std(X, axis=0, ddof=1) # 표준편차 반환하고 ddof가 1이면 unbiased estimator 이용
scipy.spatial.distance.mahalanobis(a, b, SI) # SI는 inverse covariance matrix
scipy.special.rel_entr(p, q).sum() # p, q는 이산확률분포 1차 텐서

PCA(principal component analysis) : principal component는 data point가 가장 길게 흩어져 있는 방향을 나타내고 이후에 구하는 성분들은 이전에 구한 모든 성분과 수직이다.
- 구하는 절차
  1. 관측값에서 평균을 빼는 mean centering 진행
  2. $\Sigma$(covariance matrix) 구하기
  3. $\Sigma$의 eigenvalue와 eigenvector를 구하고 eigenvalue를 정렬한 다음 큰 값들을 적절히 선택
  4. 선택한 eigenvalue에 대응되는 eigenvector로 변환 행렬 $W$를 만든다
  5. $W$에 기존 데이터를 곱해서 derived variable를 구한다
SVD(singular value decomposition)

\[A=U\Sigma V^{\top}\]

이때 $U, V$는 orthogonal matrix고 $\Sigma$는 대각 성분이 $A^{\top}A$의 positive eigenvalue의 제곱근인 rectangular diagonal matrix다.

SVD로 PCA 구하기 : mean centering한 데이터를 SVD를 하고 $\Sigma$에서 큰 eigenvalue들만 선택해서 나머지 열은 제거하고 $U\Sigma$를 구하면 각 열이 component를 이루는 값이 된다.
SVD로 Moore-Penrose pseudoinverse $A^{+}$ 구하기 :

\[AA^{+}A=A\] \[A^{+}= V\Sigma^{+}U^{\top}\]

$\Sigma^{+}$는 $\Sigma$의 원소에 역수를 취한 값을 원소로 한다.

# PCA
pca = sklearn.decomposition.PCA(n_components=n)
pca.fit(x) # 변환 행렬 형성
d = pca.fit_transform(x) # 데이터 변환
# SVD
u, s, vt = scipy.linalg.svd(A) # 반환된 s는 행렬이 아니고 대각 성분을 담은 1차 tensor

Ch5. 미분과 행렬 미분

secant line : 곡선의 두 점과 만나는 직선
tangent line : 곡선의 한 점에 접하는 직선
stationary point : 접선의 기울기가 0인 point
- local minimum point
- local maximum point
- inflection point(3차원에서는 saddle point)
extremum
- local minimum
- local maximum
rise over run

\[\frac{\Delta y}{\Delta x}=\frac{y_1-y_0}{x_1-x_0}\]

infinitesimal : 무한소
closed form function : 유한한 종류와 개수의 연산으로 이루어진 함수
scalar field : 벡터를 스칼라로 대응시키는 함수 ($f:\mathbb{R}^n\rightarrow\mathbb{R}$)
vector field : 벡터를 벡터로 대응시키는 함수 ($f:\mathbb{R}^n\rightarrow\mathbb{R}^n$)
행렬 미분 표기 방법
- numerator layout : 미분 행렬이 원래 출력이랑 shape이 같게 표현
- denominator layout : 미분 행렬이 입력이랑 shape이 같게 표현

하나의 표기 방법에 transpose를 취하면 다른 표현법이 된다.

gradient (denominator layout으로 표현)

\[\nabla f(\begin{bmatrix}x_0\\x_1\\\vdots\\x_n\end{bmatrix})=\nabla f(x_0,x_1,\dots,x_n)=\begin{bmatrix}\frac{\partial f}{\partial x_0}\\\frac{\partial f}{\partial x_1}\\\vdots\\\frac{\partial f}{\partial x_n}\end{bmatrix}\]

gradient field : vector field의 일종으로 각 점에 해당하는 벡터를 그 점에서의 gradient vector로 대응하는 함수
directional derivative : 방향벡터 $u$에 대한 함수값 변화

\[D_u f(\vec{x})=u\cdot \nabla f(\vec{x})=u^{\top}\nabla f(\vec{x})=\Vert u\Vert \Vert\nabla f(\vec{x})\Vert \cos\theta\]

스칼라를 argument로 받고 벡터나 행렬을 반환하는 함수의 derivative를 tangent vector, tangent matrix라고 한다.

numerator layout에서 행렬 $X$를 argument로 받고 스칼라를 반환하는 함수의 derivative의 형태는 $X^{\top}$이다.

Jacobian matrix : 벡터값 함수 $f$의 벡터 $x$에 대한 derivative

\[J_x=\frac{\partial f}{\partial x}=\begin{bmatrix}\nabla f_0(x)^{\top}\\\nabla f_1(x)^{\top}\\\vdots\\\nabla f_{n-1}(x)^{\top}\end{bmatrix}=\begin{bmatrix}\frac{\partial f_0}{\partial x_0}&\frac{\partial f_0}{\partial x_1}&\cdots&\frac{\partial f_0}{\partial x_{m-1}}\\\frac{\partial f_1}{\partial x_0}&\frac{\partial f_1}{\partial x_1}&\cdots&\frac{\partial f_1}{\partial x_{m-1}}\\\vdots&\vdots& &\vdots\\\frac{\partial f_{n-1}}{\partial x_0}&\frac{\partial f_{n-1}}{\partial x_1}&\cdots&\frac{\partial f_{n-1}}{\partial x_{m-1}}\end{bmatrix}\]

Jacobian matrix는 점 $x_p$ 부근에서 벡터값 함수가 어떻게 변화하는지 나타낸다. Jacobian matrix의 eigenvalue가 모두 양수이면 주변 벡터의 방향성 형태가 unstable node로 봉우리(source)같은 형태고 모두 음수이면 stable node로 구덩이(sink)같은 형태다. 만약 부호가 서로 다르면 saddle point이다. 마지막으로 eigenvalue가 복소수 일때 실수부가 양수면 unstable spiral이고 음수면 stable spiral이다.

Newton’s method로 벡터를 argument로 받는 벡터 함수의 해 구하기

\[x_{n+1}=x_n-\frac{f(x_n)}{f^{\prime}(x_n)}\] \[x_{n+1}=x_n-J^{-1}\vert_{x=x_n} f(x_n)\]

Hessian matrix : scalar field의 second derivative인데 vector field로도 확장할 수 있다

\[H_f=J(\nabla f)=\begin{bmatrix}\frac{\partial^2 f}{\partial x_0^2}&\frac{\partial^2 f}{\partial x_0\partial x_1}&\cdots&\frac{\partial^2 f}{\partial x_0\partial x_{n-1}}\\\frac{\partial^2 f}{\partial x_1\partial x_0}&\frac{\partial^2 f}{\partial x_1^2}&\cdots&\frac{\partial^2 f}{\partial x_1\partial x_{n-1}}\\\vdots&\vdots& &\vdots\\\frac{\partial^2 f}{\partial x_{n-1}\partial x_0}&\frac{\partial^2 f}{\partial x_{n-1}\partial x_1}&\cdots&\frac{\partial^2 f}{\partial x_{n-1}^2}\end{bmatrix}\]

Hessian matrix로 stationary point의 극값 판단에 사용할 수 있다. Hessian matrix의 eigenvalue가 모두 양수면 local minimum point이고 모두 음수이면 local maximum point이고 부호가 다르면 saddle point이다.

vector field에서 Hessian matrix는 3차 tensor다

\[H_f=\begin{bmatrix}H_{f_0}\\H_{f_1}\\\vdots\\H_{f_{m-1}} \end{bmatrix}\]

optimization
- first-order optimization method : first derivative에만 의존하는 방법
- second-order optimization method : second derivative도 사용하는 방법
gradient decent

\[x_{n+1}=x_{n}-\eta f^{\prime}(x_n)\] \[x_{n+1}=x_{n}-\eta \nabla f(x_n)\]

Taylor series approximation을 고려한 Newton’s method

\[\begin{align}f(x)\approx f(a)+f^{\prime}(a)(x-a)+\frac{1}{2}f^{\prime\prime}(a)(x-a)^2\\f(x+\Delta x)\approx f(x)+f^{\prime}(x)\Delta x+\frac{1}{2}f^{\prime\prime}(x)(\Delta x)^2\\\frac{d}{d(\Delta x)}f(x+\Delta x)\approx f^{\prime}(x)+f^{\prime\prime}(x)\Delta x\\f^{\prime}(x+\Delta x)=0,\ \Delta x=-\frac{f^{\prime}(x)}{f^{\prime\prime}(x)} \\x_{n+1}=x_n-\frac{f^{\prime}(x_n)}{f^{\prime\prime}(x_n)}\\x_{n+1}=x_n-H_f^{-1}\vert_{x=x_n}\nabla f(x_n) \end{align}\]

inverse Hessian matrix를 구하는데 걸리는 시간 복잡도가 대략 $\mathcal{O}(k^3)$라서 실제로 사용하기 어렵다.

Ch6. 신경망의 데이터 흐름

Sigmoid function(logistic function)

\[\sigma (x)=\frac{e^x}{1+e^x}=\frac{1}{1+e^{-x}}\]

continuous time convolution

\[(f\ast g)(t)=\int_{-\infty}^{\infty}f(\tau)g(t-\tau)d\tau\]

kernel : $f$ 위로 stride 만큼 이동하는 집합 $g$로 sliding window에 포함된 $f$의 값과 곱해진다
filter : kernel을 쌓은 것

np.convolve(f, g[::-1]), mode=mode) # mode가 'same'이면 zero padding 적용하고 valid면 valid convolution
'''수학적 정의랑 동일하게 g가 reverse 되어서 적용된다'''
scipy.signal.convolve2d(img, k, mode=mode)

pooling layer
- max pooling
- mean pooling

Ch7. 역전파

Loss function

\[\begin{align} L(\theta;x,y)=L(w_1,w_2,\dots,w_n,b_1,b_2,\dots,b_m;x,y)\\\nabla L(\theta;x,y)\end{align}\]

MSE(mean squared error)

\[\text{MSE}=\frac{1}{n}\sum(y-y_{\text{pred}})^2\]

backpropagation

\[\frac{\partial L}{\partial w}=\frac{\partial L}{\partial y}\frac{\partial y}{\partial w}\]

confusion matrix(contingency table) : 한 축은 실제 label로 하고 한 축은 모델이 예측한 label로 나타낸 matrix
- true positive
- false positive : positive로 예상했으나 실제론 negative인 경우
- true negative
- false negative : negative로 예상했으나 실제론 positive인 경우

\[\begin{align} \frac{\partial}{\partial x}(Wx+b)=W^{\top}\\\frac{\partial }{\partial W}(Wx+b)=x^{\top}\\\frac{\partial }{\partial b}(Wx+b)=1 \end{align}\]

computational graph : 데이터 의존 관계와 필요한 연산자를 포함하여 forward pass랑 backward pass의 데이터 흐름을 나타낸 그래프
- symbol-to-number : 데이터가 준비될 때 동적으로 그래프를 생성 e.g. pytorch
- symbol-to-symbol : 그래프를 미리 정적으로 생성 e.g. tensorflow

Ch8. 경사하강법

Vanilla gradient descent

\[\begin{align}W\leftarrow W-\eta \nabla W\\x\leftarrow x-\eta\frac{\partial f}{\partial x} \end{align}\]

batch training : 모든 training data를 한번에 사용해서 학습하는 방법
SGD(stochastic gradient training) : minibatch 단위로 가중치 갱신하는 방법으로 학습 시간도 줄어들고 기울기가 noisy해서 성능이 더 좋아질 수 있다
online learning : sample 하나하나에 gradient descent를 적용하여 학습하는 방법
momentum

\[\begin{align}v\leftarrow\mu v_{old}-\eta \frac{\partial f}{\partial w}\\w\leftarrow w+v \end{align}\]

Nesterov momentum : 수렴이 좀 더 빠를 수도 있다

\[\begin{align}v\leftarrow \mu v_{old}-\eta \nabla f(w+\mu v_{old})\\ w\leftarrow w+v\end{align}\]

RMSprop : decay term $\gamma$를 이용해서 gradient의 running average($m$)을 구한다. nonstationary한 process에서 robust한 알고리즘이어서 reinforcement learning에서 많이 사용된다.
- nonstationary process : 시간에 따라 통계량이 변하는 process
- stationary process

\[\begin{align}m\leftarrow \gamma m_{old}+(1-\gamma)(\frac{\partial f}{\partial w})^2\\ v \leftarrow - \frac{\eta}{\sqrt{m}} \frac{\partial f}{\partial w}\\ x\leftarrow x+v \end{align}\]

RMSprop with Nesterov momentum

\[\begin{align}m\leftarrow \gamma m_{old}+(1-\gamma)(\nabla f(w+\mu v))^2\\ v \leftarrow \mu v_{old} -\frac{\eta}{\sqrt{m}} \nabla f(w+\mu v)\\ w\leftarrow w+v \end{align}\]

AdaGrad : $w_i$는 $w$의 각 성분에 모두 적용한다는 의미이고 $\tau$는 이전까지의 모든 step이다

\[\begin{align} v_i\leftarrow \frac{-\eta}{\sqrt{\sum_{\tau} (\frac{\partial f}{\partial w_i})^2}}\frac{\partial f}{\partial w_i}\\ w\leftarrow w+v\end{align}\]

Adam : $m$은 first moment이고 $v$는 second moment이다. hat 붙은 항들은 bias correction terms으로 각 moment 값을 조정한다. $t$는 현재 timestep이다. $\epsilon$은 0으로 나누는 것을 막기 위해서 넣은 상수다.

\[\begin{align}m\leftarrow \beta_1m+(1-\beta_1)\frac{\partial f}{\partial w}\\ v\leftarrow \beta_2v+(1-\beta_2)(\frac{\partial f}{\partial w})^2\\ \hat{m}\leftarrow \frac{m}{1-\beta_1^t}\\\hat{v}\leftarrow\frac{v}{1-\beta_2^t}\\ w\leftarrow w-\frac{\eta}{\sqrt{\hat{v}}+\epsilon}\hat{m} \end{align}\]

Ch9. 부록

Probability and Statistics
- Probability and Statistics by Michael Evans and Jeffrey Rosenthal
  - A comprehensive textbook approach that’s available for free here: http://www.utstat.toronto.edu/mikevans/jeffrosenthal/book.pdf. Evans and Rosenthal’s book targets readers with exactly the sort of background that this book covers.
- Bayesian Statistics the Fun Way by Will Kurt
  - We discussed Bayes’ theorem in Chapter 3. This book presents Bayesian statistics in an approachable way. Bayesian statistics is strongly related to machine learning, and you’ll eventually encounter it as you progress in your studies.
- Introduction to Probability by Joseph Blitzstein and Jessica Hwang
  - Another well-liked introduction to probability, which includes Monte Carlo modeling.
- Python for Probability, Statistics, and Machine Learning by José Unpingco
  - This book provides an alternative view to the approach I took in this book. It covers slightly different topics, but still using Python and NumPy. The machine learning portion covers what I call “classic machine learning” with some mention of deep learning.
Calculus
- Essential Calculus Skills Practice Workbook by Chris McMullen
  - This popular textbook/workbook covers differentiation and beginning integration. It includes solutions to problems. View this book as a review of Chapter 7 and an introduction to integration.
- Matrix Differential Calculus with Applications in Statistics and Econometrics by Jan Magnus and Heinz Neudecker
  - Considered by some to be the standard matrix calculus reference, providing an in-depth and thorough treatment of the topic.
- The Matrix Cookbook by Kaare Brandt Petersen and Michael Syskind Pedersen
  - A popular reference for matrix calculus that goes beyond what we covered in Chapter 8. You can find it here: https://www2.imm.dtu.dk/pubdb/edoc/imm3274.pdf
Deep Learning
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  - One of the first deep learning–specific textbooks and widely regarded as one of the best. It covers all the essentials but goes rather quickly at times.
- A Matrix Algebra Approach to Artificial Intelligence by Xian-Da Zhang
  - A new book covering both matrices and machine learning, including deep learning. View it as a more mathematical treatment of machine learning.
- Deep Learning: A Visual Approach by Andrew Glassner
  - This book covers the breadth of deep learning, from supervised learning to reinforcement learning—all without mathematics. Use it as a rapid, high level introduction to many topics in the field.

Contents