---
title: "Efficient Change Point detection with COMBSS"
subtitle: "Investigation on the Change in Mean"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
  
## Stating the Problem  
  
Let $y_{1:p}$ be realizations of random variables $Y_{1:p}$. Let $\mu_0,\mu_1,\dots,\mu_k\in\mathbb{R}$ with $Y_{1:\tau_1}\sim_{iid}\mathcal{N}(\mu_0,\sigma^2)$, $Y_{\tau_1+1:\tau_2}\sim_{iid}\mathcal{N}(\mu_1,\sigma^2)$, $\dots$, $Y_{\tau_{k}+1:p}\sim_{iid}\mathcal{N}(\mu_k,\sigma^2)$.   
The problem is to reliably identify the parameters $k$, $\tau_{1:k}$ and $\mu_0,\mu_1,\dots,\mu_k$, ideally with a quantification of uncertainty or an evaluation criterion of some sort.  
  
## Existing Frameworks  
 
**Introduction to theoretical approaches:**  
  
Niu, Y. S., Hao, N., & Zhang, H. (2016). Multiple change-point detection: a selective overview. *Statistical Science*, 611-623.  
  
Approaches range from hypothesis testing to formulating the problem in a regression framework. However, developing sufficiently effective algorithms to apply the theory is where things become considerably more complicated.  
  
**Introduction to Algorithm Evaluation:**  
  
van den Burg, G. J., & Williams, C. K. (2020). An evaluation of change point detection algorithms. *arXiv preprint arXiv:2003.06222.*
  
$\implies$ A variety of approaches exists, based mainly on sequential analysis (online) or maximum-likelihood estimation.

Example:  
  
(1) CUSUM (cumulative sum control chart) with  
  
$S_0=0$ and $S_{n+1}=\max(0,S_n+y_{n+1}-w_n)$ for samples $y_{1:p}$ and assigned weights $w_n$, chosen so that changes in the positive direction are detected (a minimal R sketch follows after these references).  
  
Page, E. S. (1957). On problems in which a change in a parameter occurs at an unknown point. *Biometrika*, 44(1/2), 248-252.

(2) Extension on the idea by Page in  
  
Basseville, M., & Nikiforov, I. V. (1993). *Detection of abrupt changes: theory and application* (Vol. 104). Englewood Cliffs: Prentice Hall.  
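A minimal sketch of the one-sided CUSUM recursion from (1); the weights `w` and the alarm threshold `h` below are illustrative choices, not prescribed by the references above.

```{r}
# One-sided CUSUM: S_0 = 0, S_{n+1} = max(0, S_n + y_{n+1} - w_n);
# an alarm is raised once S_n exceeds an (illustrative) threshold h.
cusum_detect <- function(y, w, h) {
  S <- numeric(length(y) + 1)              # S[1] corresponds to S_0 = 0
  for (n in seq_along(y)) {
    S[n + 1] <- max(0, S[n] + y[n] - w[n])
  }
  list(S = S[-1], alarm = which(S[-1] > h)[1])
}

# Example: upward mean shift after observation 50
set.seed(1)
y_cusum <- c(rnorm(50, 0, 1), rnorm(50, 1, 1))
cusum_detect(y_cusum, w = rep(0.5, 100), h = 5)$alarm
```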
  
## Univariate Mean Change Point Detection as a Regression Problem  
    
Adapted from Niu, Hao & Zhang (2016):  
  
Let $\beta_0=\mathbb{E}[Y_1]$ and $\beta_j=\mathbb{E}[Y_{j+1}]-\mathbb{E}[Y_{j}]$ for $j=1,...,p-1$.  
  
Then $Y_i$ for $i=1,\dots,p$ is given by 
$$Y_i=\sum^{i-1}_{j=0}\beta_j + \epsilon_i$$  
with $\epsilon_i\sim_{iid}\mathcal{N}(0,\sigma^2)$.  
  
In matrix form this can be rewritten as  
$$Y=X\beta+\epsilon$$ 
with $X\in\mathbb{R}^{p\times p}$ being a lower triangular matrix whose entries on and below the diagonal are all $1$.  
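As an illustration, the design matrix and the regression form can be written down directly in R; this is only a sketch, and the assumption is that `genX()` from `functions.R`, used later in these notes, plays a similar role.

```{r}
# Sketch of the design matrix: lower triangular p x p matrix of ones
gen_design <- function(p) {
  X <- matrix(0, p, p)
  X[lower.tri(X, diag = TRUE)] <- 1
  X
}

# Illustrative data with one change point: mean 2 for observations 1-5,
# mean 5 from observation 6 on (beta_0 = 2, beta at the change point = 3)
beta_ex <- c(2, rep(0, 4), 3, rep(0, 4))
set.seed(1)
y_ex <- as.numeric(gen_design(10) %*% beta_ex + rnorm(10, 0, 1))
y_ex
```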
  
The closed form solution for the given univariate change point problem is then given by $\beta_0=\mu_0$, $\beta_{\tau_1}=\mu_1-\mu_0$, $\dots$, $\beta_{\tau_k}=\mu_k-\mu_{k-1}$ and $\beta_j=0$ otherwise.  
  
However, as the parameters are unknown, a regularization approach is applied instead in order to remove all columns of the design matrix $X$ which do not correspond to a change point:  
  
$$||Y-X_s\hat{\beta}_s||+f(\lambda,s)$$  
  
with $s=(s_0, s_1, \dots, s_{p-1})\in\{0,1\}^p$, $s_0=1$ and $s_{\tau_i}=1$ for $i=1,\dots,k$, $X_s=X\cdot\mathrm{diag}(s)$, $\hat{\beta}_s$ the MLE for $X_s$, $||\cdot||$ some norm and $f(\lambda,s)$ some penalty.  
  
Previous work on this approach includes:  
  
Utilizing LASSO regularization with an $\ell_1$-type penalty for variable selection:  
  
Harchaoui, Z., & Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. *Journal of the American Statistical Association*, 105(492), 1480-1493.  
  
And fused LASSO pattern recovery as a generalization of the problem:  
  
Qian, J., & Jia, J. (2016). On stepwise pattern recovery of the fused lasso. *Computational Statistics & Data Analysis*, 94, 221-237.

  
---  
  
Essentially the approach boils down to a set of questions:  
  
(in no specific order)
  
(1) Which form of regularization performs best depending on the type of change point problem?  
  
(2) Is an application of such an approach computationally feasible?  
  
(3) What is the relationship of the (unknown) parameters $\sigma^2$, $k$ and a penalty parameter (like $\lambda$)?  
  
While (3) depends on (1) in a way, (1) itself heavily depends on the exact formulation of the change point problem. For the most general case stated here, the $l_2$-norm is the obvious and likely best choice. However, if we lose the normality assumption, a rank-based norm will be more appropriate. If we extend to the multivariate case, we want to keep the $l_2$-norm but might want to utilize generalized distances $d_i$ instead of $Y_i$.  
  
Additionally, $Y$ should be decomposed to remove trend and cycles or seasonality before being inspected for change points. Also, while not required, $Y$ can be centered with respect to $Y_1$ in order to reduce $s$ by one entry, with $s_0=0$.  
  
### Obtaining the vector s with COMBSS  
  
The algorithm for best subset selection introduced with COMBSS promises fast results for $s$. In a sense, we answer (2) first by utilizing the strength of COMBSS, then look into (1), and from that derive (3).  
  
***Idea:***  
  
Define a continuous cost function $c(\lambda,t)=c_\lambda(t)$ by adapting the norm to the given type of change point problem (univariate/multivariate, known/unknown variance, normal/non-normal) and adding a penalty based on the sparsity of $t\in [0,1]^p$, the non-binary relaxation of $s$.
  
*Example:* (multivariate normal case with known variance)  
Let $d$ be the vector of generalized distances of $y$ and $X$ the design matrix as described above. 
  
$$c_\lambda(t)=\frac1p||d-X_t\hat{\beta}_t||^2_2+\lambda\sum_{i=0}^{p-1} t_i$$
Minimize $c_\lambda(t)$ in $t$ for fixed $\lambda$.
  
Open question: (3) and, accordingly, the selection of $\lambda$.  
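A direct (if naive) R sketch of such a cost function for the univariate case, using an ordinary least-squares fit on the $t$-scaled design; the assumption is that the `loss()` from `functions.R` used further below has a similar form.

```{r}
# Naive sketch of c_lambda(t) = (1/p) * ||y - X_t beta_t||_2^2 + lambda * sum(t),
# with X_t = X diag(t) and beta_t the least-squares estimate on the scaled design.
c_lambda <- function(y, X, t, lambda) {
  Xt <- X %*% diag(t)
  fit <- lm.fit(Xt, y)                   # QR with pivoting handles zeroed-out columns
  sum(fit$residuals^2) / length(y) + lambda * sum(t)
}

# illustrative call, reusing the sketch design and data from above
c_lambda(y_ex, gen_design(10), t = c(1, rep(0, 4), 1, rep(0, 4)), lambda = 0.1)
```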
---  

Nice to have: $X$ always has full rank.

To Do: Cases and Simulation. Choice of $\beta$ for non-mean problems (rank and variance changes).  
  
---  
  
## Notes until 21.10.  
  
```{r}
source(file = "functions.R")
```
  
  
We want to start with a redefinition of our data $Y_{1:p}$. Still keeping the univariate case with one change point at $\tau_1$, we now want to define the data as $Y_i=\mu_i+\Xi_i$ with $\Xi_i\sim_{iid}\mathcal{N}(0,\sigma^2)$ and $\mu_i=\mu_0$ for $i<\tau_1$ and $\mu_i=\mu_1$ otherwise.  
  
Let $t_i=1$ for all $i$, then
$$\hat{\beta}=(\mu_0+\xi_1,\,(\mu_0+\xi_2)-(\mu_0+\xi_1),\,\dots,\,(\mu_1+\xi_{\tau_1})-(\mu_0+\xi_{\tau_1-1}),\,(\mu_1+\xi_{\tau_1+1})-(\mu_1+\xi_{\tau_1}),\,\dots,\,(\mu_1+\xi_p)-(\mu_1+\xi_{p-1}))^T$$  
$$=(\mu_0+\xi_1,\,\xi_2-\xi_1,\,\dots,\,(\mu_1-\mu_0)+(\xi_{\tau_1}-\xi_{\tau_1-1}),\,\xi_{\tau_1+1}-\xi_{\tau_1},\,\dots,\,\xi_p-\xi_{p-1})^T$$
*A proof will follow but is likely to be straightforward.*  
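A quick numerical check of this expression without relying on `functions.R`: with all $t_i=1$ the design is invertible and the least-squares estimate is simply $X^{-1}y$, i.e. the first observation followed by the successive differences of $y$.

```{r}
# with all t_i = 1, beta_hat = X^{-1} y = (y_1, y_2 - y_1, ..., y_p - y_{p-1})
p_chk <- 10
X_chk <- matrix(0, p_chk, p_chk)
X_chk[lower.tri(X_chk, diag = TRUE)] <- 1
set.seed(2)
y_chk <- c(rnorm(5, 0, 1), rnorm(5, 3, 1))   # one change in mean at observation 6
all.equal(as.numeric(solve(X_chk, y_chk)), c(y_chk[1], diff(y_chk)))
```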
  
Sketching the ideal case, with $t_1=t_{\tau_1}=1$, all other $t_i$ equal to $0$ and the corresponding $\tilde{\beta}=(\mu_0,0,\dots,0,\mu_1-\mu_0,0,\dots,0)^T$, the $l_2$-norm is simply given by the squared errors. This is why we want to adopt this notation in the first place, as it leaves us with a simple problem:  
$$l_\lambda(t_{\tilde{\beta}})<l_\lambda(t_{\hat\beta})$$  
$$\iff\frac1p\sum_{i=1}^p\xi_i^2+2\lambda<p\lambda.$$
$$\implies\frac{\sigma^2}p\sum_{i=1}^pZ_i^2<(p-2)\lambda$$  
$$\implies\mathbb{P}\left[\frac{\sigma^2}p\sum_{i=1}^pZ_i^2<(p-2)\lambda\right]=\mathbb{P}\left[\frac{\sigma^2}p\chi^2(p)<(p-2)\lambda\right]\geq0.95$$
$$\iff\frac{\sigma^2}{p(p-2)}\chi_{0.95}^2(p)<\lambda$$
  
And more generally for $k$ change points:  
  
$\frac{\sigma^2}{p(p-(k+1))}\chi_{0.95}^2(p)<\lambda$ with $k=1,\dots,p-2$, since the first data point should not be considered a change point, and if every single point were flagged as a change point the procedure would miss the point in the first place.  
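The resulting lower bound on $\lambda$ is straightforward to evaluate, e.g. (illustrative values for $\sigma^2$, $p$ and $k$):

```{r}
# lower bound sigma^2 / (p * (p - (k + 1))) * chi^2_{0.95}(p) on lambda
lambda_bound <- function(sigma2, p, k) {
  sigma2 / (p * (p - (k + 1))) * qchisq(0.95, df = p)
}
lambda_bound(sigma2 = 25, p = 10, k = 1)   # e.g. sigma = 5 as in the simulations below
```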
  
As of now, $\tilde{\beta}=(\mu_0,0,\dots,0,\mu_1-\mu_0,0,\dots,0)^T$ is what may be considered the ideal solution for $k=1$. However, the method estimating $\tilde{\beta}^*=(\mu_0+\xi_1,0,\dots,0,(\mu_1-\mu_0)+\xi_{\tau_1},0,\dots,0)^T$ instead is also realistic and has to be investigated in simulations first before being proven. In that case the degrees of freedom would simply be reduced by $k+1$, as those errors do not need to be covered by the chi-squared distribution.  

Let us then state the first simulation question as:  
Does the estimation of the optimal beta vector for the change point regression problem lead to an unbiased estimation of the true mean values? For this, not all $t_i$ values can be exactly zero, as the errors need to be kept in some scaled form to counteract their effects on the estimation of $\mu_i$.  
  
Further, we want to state the following assumptions about the effect of changes in $t_i$ on the other $\tilde{\beta}_t$ estimates, in relation to its neighbors, the regime and the whole vector:  
  
**Assumption 1:**  
Will $\tilde{\beta}_t^i=\frac1{t_i}(\xi_i-\xi_{i-1})$ for $i$ not a change point and $t_j\neq0$ for all $j$?  
The idea: As the system of equations has an exact solution as long as no $t_i=0$, the $t_i$ will simply be inverted in the estimation of $\tilde{\beta}_t^i$. This may lead to the requirement of a cutoff for the $t_i$, forcing them to $0$ if they fall below it. The cutoff should be chosen in relation to $\sigma^2$, likely again with a chi-square argument for homogeneous normal errors.  
  
**Assumption 2:**  
Forcing a non-change-point $t_i$ to be zero by "prior knowledge" and all other $t_i$ to be $1$ will necessarily lead to a change in the other parameters. The nature of this change has yet to be confirmed. Two likely candidates are either a weighting in the mean estimate for the regime the point belongs to, or simply a change in the next estimate, i.e. let $i=2$:  
(1) $\tilde{\beta}^1_t=\mu_0 + \frac12(\xi_1+\xi_2)$ 
This would be useful for cancelling out error terms when there are many observations between change points.  
  
(2) $\tilde{\beta}^3_t=(\xi_3-\xi_2)+(\xi_2-\xi_1)=\xi_3-\xi_1$ with $\tilde{\beta}^1_t$ as in the regular estimate. The assumption is simply based on the assumed mechanics of the LS estimate for the regular beta.

**Next Step:** We want to gather information about both assumptions through simulations in order to gain better insight for a formal proof. Especially for ***Assumption 1*** the proof should be trivial after the proof for $\tilde\beta$.  
  
---  
  
## Simulations  
  
**Mean Estimation**  
  
```{r}
set.seed(241022)

# design matrices from functions.R for p = 10 and p = 15
X10 <- genX(10)
X15 <- genX(15)
```


```{r}
# one change point: mean 100 for observations 1-5, mean -100 for 6-10, sigma = 5
y1 <- c(rnorm(5, 100, 5), rnorm(5, -100, 5))
# t selects the first column and the column of the true change point
t1 <- c(1, rep(0, 4), 1, rep(0, 4))

betaH(y1, Xt(X10, t1))   # estimated beta on the t-modified design
loss(y1, t = t1)

# segment means for comparison: mu_0 and mu_1 - mu_0
mean(y1[1:5])
mean(y1[6:10]) - mean(y1[1:5])
```


```{r}
y2 <- c(rnorm(3, 100, 5), rnorm(7, -100, 5))
t2 <- c(1, rep(0, 2), 1, rep(0, 6))

betaH(y2, Xt(X10, t2))
loss(y2, t = t2)

mean(y2[1:3])
mean(y2[4:10]) - mean(y2[1:3])
```


```{r}
y3 <- c(rnorm(1, 100, 5), rnorm(9, -100, 5))
t3 <- c(1, 1, rep(0, 8))

betaH(y3, Xt(X10, t3))
loss(y3, t = t3)

mean(y3[1])
mean(y3[2:10]) - mean(y3[1])
```


```{r}
y4 <- c(rnorm(5, 100, 5),rnorm(5, 0, 5), rnorm(5, -100, 5))
t4 <- c(1, rep(0, 4), 1, rep(0, 4), 1, rep(0, 4))

betaH(y4, Xt(X15, t4))
loss(y4, t = t4)

mean(y4[1:5])
mean(y4[6:10]) - mean(y4[1:5])
mean(y4[11:15]) - mean(y4[6:10])
```


```{r}
y5 <- c(rnorm(5, 100, 5), rnorm(5, -100, 5), rnorm(5, 100, 5))
t5 <- c(1, rep(0, 4), 1, rep(0, 4), 1, rep(0, 4))

betaH(y5, Xt(X15, t5))
loss(y5, t = t5)

mean(y5[1:5])
mean(y5[6:10]) - mean(y5[1:5])
mean(y5[11:15]) - mean(y5[6:10])
```
  
  
Take Away:  
  
  
  
---  
  
**Assumption 1:**  
  
```{r}
# start from the full design (all t_i = 1); t[2] is then shrunk step by step below
y <- y1
t <- rep(1, 10)

betaH(y, Xt(X10, t))
loss(y, t = t)
```
  
```{r}
t[2] <- 0.75

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
  
```{r}
t[2] <- 0.5

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
  
```{r}
t[2] <- 0.25

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```  
  
```{r}
t[2] <- 0.1

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```  
  
```{r}
t[2] <- 0.01

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```  
  
```{r}
t[2] <- 0.001

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```  
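For reference, the value predicted for the second entry under Assumption 1 can be computed from the current `y` and `t` of the chunks above, since for a non-change-point index $\xi_2-\xi_1=y_2-y_1$ (a sketch, not part of the original comparison).

```{r}
# predicted value under Assumption 1 for i = 2: (1/t_2)(xi_2 - xi_1) = (y[2] - y[1]) / t[2]
(y[2] - y[1]) / t[2]
```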
  
Take Away:  
  
  
  
---  
  
**Assumption 2:**  
  
```{r}
betaH(y, X = Xt(X10))
loss(y, t = t)
```
  
```{r}
b <- betaH(y, X = Xt(X10))
b[1] + 0.5 * b[2]
b[3] + 0.5 * b[2]
```
  
```{r}
t <- c(1, 0, rep(1, 8))

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
  
```{r}
b <- betaH(y, X = Xt(X10))
b[1] + 1/3 * (2 * b[2] + b[3])
b[4] + 1/3 * (b[2] + 2* b[3])
```  
  
```{r}
t <- c(1, rep(0, 2), rep(1, 7))

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
  
```{r}
t <- c(1, rep(0, 4), rep(1, 5))

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)

mean(y[1:5])
mean(y[6:10]) - mean(y[1:5])
```
  
  
  
```{r}
t <- c(rep(1,5), 0, rep(1, 4))

betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
  
Take Away:  
  
  
  
---  
  
## Gradient in t  
  
  
In order to ensure smooth and fast convergence of a gradient descent method minimizing $c_\lambda(t)$ in the vector $t$, we want to find a solution for $\frac{\partial c_\lambda(t)}{\partial t_i}$ for all $t_i$ in $t\in[0,1]^p$:  
  
$$\frac{\partial c_\lambda(t)}{\partial t_i}=\frac{\partial}{\partial t_i}(\frac1p||y-X_t\tilde{\beta}_t||^2_2+\lambda\sum_{i=1}^{p} t_i)$$
$$=\frac1p\frac{\partial}{\partial t_i}(y-X_t\tilde{\beta}_t)^T(y-X_t\tilde{\beta}_t)+\frac{\partial}{\partial t_i}\lambda\sum_{i=1}^{p} t_i$$
$$=\frac1p\frac{\partial}{\partial t_i}(y^Ty-2y^TX_t\tilde{\beta}_t+\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t)+\lambda$$
$$=\frac1p(-2\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t)+\lambda.$$
  
Accordingly we want to find closed form solutions for:  
  
$$\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t=\frac{\partial}{\partial t_i}y^TX_t(X_t^TX_t)^{-1}X_t^Ty=\frac{\partial}{\partial t_i}y^TH_ty=\frac{\partial}{\partial t_i}y^T\tilde{y}_t$$  
and  
$$\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t=\frac{\partial}{\partial t_i}((X_t^TX_t)^{-1}X_t^Ty)^T(X_t^TX_t)(X_t^TX_t)^{-1}X_t^Ty$$
$$=\frac{\partial}{\partial t_i}((X_t^TX_t)^{-1}X_t^Ty)^TX_t^Ty=\frac{\partial}{\partial t_i}y^TX_t(X_t^TX_t)^{-1}X_t^Ty=\frac{\partial}{\partial t_i}y^TH_ty=\frac{\partial}{\partial t_i}y^T\tilde{y}_t.$$
  
So in conclusion:  
  
$$\frac{\partial c_\lambda(t)}{\partial t_i}=\frac1p(-2\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t)+\lambda$$
$$=\frac1p(-2\frac{\partial}{\partial t_i}y^T\tilde{y}_t+\frac{\partial}{\partial t_i}y^T\tilde{y}_t)+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^T\tilde{y}_t+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^TH_ty+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\lambda.$$
  
So to find the gradient we may either solve for a closed form solution of the hat matrix $H_t=X_t(X_t^TX_t)^{-1}X_t^T$ given the design matrix $X_t$, or for the expression $X_t\tilde{\beta}_t$. The second one is somewhat more desirable as it also provides an exact solution path $\{\tilde{\beta}_t^{(l)}\}_{l=1}^\infty$.  
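Until such a closed form is available, the gradient can at least be approximated numerically by central finite differences of the cost, e.g. reusing the `c_lambda()` sketch from above (the step size `eps` is an arbitrary choice).

```{r}
# central finite-difference approximation of the gradient of c_lambda(t) in t
num_grad <- function(y, X, t, lambda, eps = 1e-6) {
  sapply(seq_along(t), function(i) {
    tp <- t; tp[i] <- tp[i] + eps
    tm <- t; tm[i] <- tm[i] - eps
    (c_lambda(y, X, tp, lambda) - c_lambda(y, X, tm, lambda)) / (2 * eps)
  })
}

# at an interior point with all t_i = 1 the fit is exact, so the gradient is roughly lambda
num_grad(y_ex, gen_design(10), t = rep(1, 10), lambda = 0.1)
```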
  
In both cases we have to $(1)$ prove that $X_t^TX_t$ is invertible and $(2)$ determine the inverse depending on the vector $t$.  
  
As long as $X_t$ is invertible, either because all $t_i>0$ or through some kind of substitution such as $X_t+\delta(Id - \mathrm{diag}(\lceil t\rceil))$, we have $H_t=X_t(X_t^TX_t)^{-1}X_t^T=X_tX_t^{-1}(X_t^T)^{-1}X_t^T=Id$ and $\tilde{\beta}_t=(X_t^TX_t)^{-1}X_t^Ty=X_t^{-1}y$ with $X_t^{-1}=(X\cdot \mathrm{diag}(t))^{-1}=\mathrm{diag}(1/t)X^{-1}$.  
  
From here it is easy to determine $\det(X)=1$, $\det(X_t)=\prod_{i=1}^pt_i$ and $\det(X_t^TX_t)=\prod_{i=1}^pt_i^2$, which are nonzero provided $t_i>0$ for all $i=1,\dots,p$.  
  
Further, $(X_t^TX_t)_{ij}=((p+1)-\max(i,j))\,t_it_j=\langle x_{t,i},x_{t,j}\rangle$, which is symmetric.  
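These identities can be verified numerically for a small $p$ and a strictly positive $t$ (a sketch with arbitrary illustrative values, independent of `functions.R`).

```{r}
# numerical check of X_t^{-1} = diag(1/t) X^{-1}, det(X_t) = prod(t_i), and
# (X_t^T X_t)_{ij} = (p + 1 - max(i, j)) t_i t_j for strictly positive t
p_v <- 5
X_v <- matrix(0, p_v, p_v)
X_v[lower.tri(X_v, diag = TRUE)] <- 1
t_v <- c(1, 0.5, 0.8, 0.3, 0.9)
Xt_v <- X_v %*% diag(t_v)

all.equal(solve(Xt_v), diag(1 / t_v) %*% solve(X_v))
all.equal(det(Xt_v), prod(t_v))
G_v <- outer(1:p_v, 1:p_v, function(i, j) p_v + 1 - pmax(i, j)) * outer(t_v, t_v)
all.equal(crossprod(Xt_v), G_v)
```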
  
In conclusion up to this point, the initial gradient descent step is determined by the penalization parameter $\lambda$ and can ideally be skipped by simply setting $t_{i^*}\overset{!}{=}0$ for $i^*=\arg\min_{i=1,\dots,p}(y_i)$.  
  
However, this is where things start to get somewhat more complicated.