---
title: "Efficient Change Point detection with COMBSS"
subtitle: "Investigation on the Change in Mean"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Stating the Problem
Let $y_{1:p}$ be realizations of random variables $Y_{1:p}$. Let $\mu_0,\mu_1,\dots,\mu_k\in\mathbb{R}$ with $Y_{1:\tau_1}\sim_{iid}\mathcal{N}(\mu_0,\sigma^2)$, $Y_{\tau_1+1:\tau_2}\sim_{iid}\mathcal{N}(\mu_1,\sigma^2)$, $\dots$, $Y_{\tau_k+1:p}\sim_{iid}\mathcal{N}(\mu_k,\sigma^2)$.
The problem is to reliably identify the parameters $k$, $\tau_{1:k}$ and $\mu_0,\mu_1,\dots,\mu_k$, ideally with a quantification of uncertainty or some form of evaluation criterion.
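A minimal simulation of this model, with illustrative (hypothetical) choices for $p$, $k$, $\tau_{1:k}$, $\mu_{0:k}$ and $\sigma$:

```{r}
# Simulate a piecewise-constant-mean series (all values chosen for illustration only)
set.seed(1)
p_sim     <- 100
tau_sim   <- c(40, 70)                                # change points tau_1, tau_2 (k = 2)
mu_sim    <- c(0, 2, -1)                              # segment means mu_0, mu_1, mu_2
sigma_sim <- 1
segment   <- findInterval(1:p_sim, tau_sim + 1) + 1   # segment index per observation
y_sim     <- rnorm(p_sim, mean = mu_sim[segment], sd = sigma_sim)
plot(y_sim, type = "l", main = "Simulated mean-change data"); abline(v = tau_sim, lty = 2)
```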
## Existing Frameworks
**Introduction to theoretical approaches:**
Niu, Y. S., Hao, N., & Zhang, H. (2016). Multiple change-point detection: a selective overview. *Statistical Science*, 611-623.
Approaches range from hypothesis testing to formulating the problem in a regression framework. However, developing sufficiently effective algorithms to apply the theory is where things become considerably more complicated.
van den Burg, G. J., & Williams, C. K. (2020). An evaluation of change point detection algorithms. *arXiv preprint arXiv:2003.06222.*
$\implies$ A variety of approaches exist, based mainly on sequential analysis (online detection) or maximum-likelihood estimation.
Example:
(1) CUSUM (cumulative sum control chart) with
$S_0=0$ and $S_{n+1}=\max(0,S_n+y_{n+1}-w_n)$ for samples $y_{1:p}$ and assigned weights $w_n$, which detects changes in the positive direction through the choice of $w_n$.
Page, E. S. (1957). On problems in which a change in a parameter occurs at an unknown point. *Biometrika*, 44(1/2), 248-252.
Basseville, M., & Nikiforov, I. V. (1993). *Detection of abrupt changes: theory and application* (Vol. 104). Englewood Cliffs: Prentice Hall.
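For reference, a minimal R sketch of this one-sided CUSUM recursion (the constant reference value `w` and threshold `h` below are illustrative choices, not part of the cited works):

```{r}
# One-sided CUSUM: S_{n+1} = max(0, S_n + y_{n+1} - w); an alarm is raised once S > h.
cusum_detect <- function(y, w, h) {
  S <- numeric(length(y))
  S[1] <- max(0, y[1] - w)
  for (n in 2:length(y)) S[n] <- max(0, S[n - 1] + y[n] - w)
  list(S = S, first_alarm = which(S > h)[1])
}
cusum_detect(c(rnorm(50, 0, 1), rnorm(50, 1, 1)), w = 0.5, h = 5)$first_alarm
```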
## Univariate Mean Change Point Detection as a Regression Problem
Adapted from Niu, Hao & Zhang (2016):
Let $\beta_0=\mathbb{E}[Y_1]$ and $\beta_j=\mathbb{E}[Y_{j+1}]-\mathbb{E}[Y_{j}]$ for $j=1,...,p-1$.
Then $Y_i$ for $i=1,...,p$ is given by
$$Y_i=\sum^{i-1}_{j=0}\beta_j + \epsilon_i$$
with $\epsilon_i\sim_{iid}\mathcal{N}(0,\sigma^2)$.
In matrix form this can be rewritten as
$$Y=X\beta+\epsilon$$
with $X\in\mathbb{R}^{p\times p}$ being a lower triangular matrix whose entries on and below the diagonal are all equal to $1$.
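For concreteness, the design matrix and the resulting piecewise-constant mean structure can be checked directly in R (illustrative values; note that $\beta_0$ corresponds to the first entry of the 1-indexed R vector):

```{r}
# Lower triangular p x p matrix of ones: (X beta)_i = sum of the first i beta entries
p_ex <- 10
X_ex <- matrix(0, p_ex, p_ex)
X_ex[lower.tri(X_ex, diag = TRUE)] <- 1
beta_ex <- c(2, rep(0, 4), 3, rep(0, 4))   # level 2, then a jump of +3 after observation 5
drop(X_ex %*% beta_ex)                     # piecewise-constant mean vector 2,...,2,5,...,5
```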
The closed form solution for the given univariate change point problem is then given by $\beta_0=\mu_0$, $\beta_{\tau_1}=\mu_1-\mu_0$, $\dots$, $\beta_{\tau_k}=\mu_k-\mu_{k-1}$ and $\beta_j=0$ otherwise.
However, as the parameters are unknown, a regularization approach is applied instead, in order to shrink away all columns of the design matrix $X$ which do not correspond to a change point, by minimizing
$$||Y-X_s\hat{\beta}_s||+f(\lambda,s)$$
with $s=(s_0, s_1, \dots, s_{p-1})$, where $s_0=1$, $s_{\tau_i}=1$ for $i=1,\dots,k$ and $s_j=0$ otherwise, $X_s=X\cdot diag(s)$, $\hat{\beta}_s$ the MLE for $X_s$, $||\cdot||$ some norm and $f(\lambda,s)$ some penalty.
Work on this approach has been done previously:
Utilizing LASSO regularization with an $l_1$-type penalty for variable selection:
Harchaoui, Z., & Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. *Journal of the American Statistical Association*, 105(492), 1480-1493.
And fused LASSO pattern recovery as a generalization of the problem:
Qian, J., & Jia, J. (2016). On stepwise pattern recovery of the fused lasso. *Computational Statistics & Data Analysis*, 94, 221-237.
---
Essentially the approach boils down to a set of questions (in no specific order):
(1) Which form of regularization performs best depending on the type of change point problem?
(2) Is an application of such an approach computationally feasible?
(3) What is the relationship of the (unknown) parameters $\sigma^2$, $k$ and a penalty parameter (like $\lambda$)?
While (3) depends on (1) in a way, (1) itself heavily depends on the exact formulation of the change point problem. For the most general case stated here, the $l_2$-norm is the obvious and likely the best choice. However, if we drop the normality assumption, a rank-based norm will be more appropriate. If we extend to the multivariate case, we want to keep the $l_2$-norm but might want to utilize generalized distances $d_i$ instead of $Y_i$.
Additionally, $Y$ should be decomposed to remove trend and cycles or seasonality before being inspected for change points. Also, while not required, $Y$ can be centered with respect to $Y_1$ in order to reduce $s$ by one entry, with $s_0=0$.
### Obtaining the vector s with COMBSS
The algorithm for best subset selection introduced with COMBSS promises fast results for $s$. In a sense we answer (2) first, by utilizing the strength of COMBSS, then look into (1), and from that derive (3).
***Idea:***
Define a continuous cost function $c(\lambda,t)=c_\lambda(t)$ by adapting the norm to the given type of change point problem (univariate/multivariate, known/unknown variance, normal/non-normal) and adding a penalty based on the sparsity of $t\in [0,1]^p$, the non-binary version of $s$.
*Example:* (multivariate normal case with known variance)
Let $d$ be the vector of generalized distances of $y$ and $X$ the design matrix as described above.
$$c_\lambda(t)=\frac1p||d-X_t\hat{\beta}_t||^2_2+\lambda\sum_{i=0}^{p-1} t_i$$
Question: (3) and accordingly the selection of $\lambda$.
To Do: Cases and Simulation. Choice of $\beta$ for non-mean problems (rank and variance changes).
---
## Notes until 21.10.
```{r}
source(file = "functions.R")
```
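`functions.R` is not reproduced here; the chunks below rely on its helpers `genX`, `Xt`, `betaH` and `loss`. As a point of reference, a minimal sketch of what these helpers are *assumed* to compute, based on the definitions above (an assumption about their behaviour, not the actual file, hence `eval=FALSE`):

```{r, eval=FALSE}
# Assumed behaviour of the helpers from functions.R (sketch only, not the original code)
genX  <- function(p) { X <- matrix(0, p, p); X[lower.tri(X, diag = TRUE)] <- 1; X }
Xt    <- function(X, t = rep(1, ncol(X))) X %*% diag(t)       # X_t = X * diag(t)
betaH <- function(y, X) as.vector(MASS::ginv(X) %*% y)        # (minimum-norm) LS estimate
loss  <- function(y, t, X = genX(length(y)), lambda = 0) {    # c_lambda(t); lambda default assumed
  b <- betaH(y, Xt(X, t))
  mean((y - Xt(X, t) %*% b)^2) + lambda * sum(t)
}
```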
We want to start with a redefinition of our data $Y_{1:p}$. Still keeping the univariate case with one change point at $\tau_1$, we now define the data as $Y_i=\mu_i+\Xi_i$ with $\Xi_i\sim_{iid}\mathcal{N}(0,\sigma^2)$, where $\mu_i=\mu_0$ for $i<\tau_1$ and $\mu_i=\mu_1$ otherwise.
Let $t_i=1$ for all $i$, then
$$\hat{\beta}=\big(\mu_0+\xi_1,\ (\mu_0+\xi_2)-(\mu_0+\xi_1),\ \dots,\ (\mu_1+\xi_{\tau_1})-(\mu_0+\xi_{\tau_1-1}),\ (\mu_1+\xi_{\tau_1+1})-(\mu_1+\xi_{\tau_1}),\ \dots,\ (\mu_1+\xi_p)-(\mu_1+\xi_{p-1})\big)^T$$
$$=\big(\mu_0+\xi_1,\ \xi_2-\xi_1,\ \dots,\ (\mu_1-\mu_0)+(\xi_{\tau_1}-\xi_{\tau_1-1}),\ \xi_{\tau_1+1}-\xi_{\tau_1},\ \dots,\ \xi_p-\xi_{p-1}\big)^T$$
*A proof will follow but should be fairly straightforward.*
Sketching the ideal case, with $t_1=t_{\tau_1}=1$ and all other $t_i$ equal to $0$, and the corresponding $\tilde{\beta}=(\mu_0,0,\dots,0,\mu_1-\mu_0,0,\dots,0)^T$, the squared $l_2$-norm of the residuals is simply the sum of squared errors. This is why we want to adopt this notation in the first place, as it presents us with a simple comparison:
$$c_\lambda(t_{\tilde{\beta}})<c_\lambda(t_{\hat\beta})$$
$$\iff\frac1p\sum_{i=1}^p\xi_i^2+2\lambda<p\lambda$$
$$\implies\frac{\sigma^2}p\sum_{i=1}^pZ_i^2<(p-2)\lambda$$
$$\implies\mathbb{P}\Big[\frac{\sigma^2}p\sum_{i=1}^pZ_i^2<(p-2)\lambda\Big]=\mathbb{P}\Big[\frac{\sigma^2}p\chi^2(p)<(p-2)\lambda\Big]\geq0.95$$
$$\iff\frac{\sigma^2}{p(p-2)}\chi_{0.95}^2(p)<\lambda$$
And more generally for $k$ change points:
$\frac{\sigma^2}{p(p-(k+1))}\chi_{0.95}^2(p)<\lambda$ with $k=1,\dots,p-2$, since the first data point should not be considered a change point, and if every single point were flagged as a change point the detection would be pointless in the first place.
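As a quick numerical illustration of this bound (the values of $p$, $k$ and $\sigma^2$ below are arbitrary):

```{r}
# Lower bound sigma^2 / (p * (p - (k + 1))) * chi^2_{0.95}(p) for the penalty lambda
lambda_bound <- function(p, k, sigma2, level = 0.95) {
  sigma2 / (p * (p - (k + 1))) * qchisq(level, df = p)
}
lambda_bound(p = 100, k = 1, sigma2 = 1)
```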
As of now, $\tilde{\beta}=(\mu_0,0,\dots,0,\mu_1-\mu_0,0,\dots,0)^T$ is what may be considered the ideal solution for $k=1$; however, an estimate of $\tilde{\beta}^*=(\mu_0+\xi_1,0,\dots,0,\mu_1+\xi_{\tau_1},0,\dots,0)^T$ by the method is also realistic and has to be investigated in simulations first, to then be proven. In that case the degrees of freedom would simply be reduced by $k+1$, as those errors do not need to be covered by the chi-squared distribution.
Let us state that first simulation question as:
Does estimating the optimal beta vector for the change point regression problem lead to an unbiased estimate of the true mean values? For this, not all $t_i$ values can be exactly zero, as the errors need to be kept in some scalar form to counteract their effects on the estimation of $\mu_i$.
Further, we want to state the following assumptions about how changes in $t_i$ affect the other $\tilde{\beta}_t$ estimates, in relation to their neighbors, their regime and the whole vector:
**Assumption 1:**
Does $\tilde{\beta}_t^i=\frac1{t_i}(\xi_i-\xi_{i-1})$ hold for $i$ not a change point and $t_j\neq0$ for all $j$?
The idea: as the system of equations has an exact solution as long as no $t_i=0$, the $t_i$ will simply be inverted in the estimation of $\tilde{\beta}_t^i$. This may lead to the requirement of a cutoff for the $t_i$, forcing them to $0$ if they fall below it. The cutoff should be chosen in relation to $\sigma^2$, likely also via a chi-square estimate for homogeneous normal errors.
**Assumption 2:**
Forcing a non-change-point $t_i$ to be zero by "prior knowledge", with all other $t_i$ equal to $1$, will necessarily lead to a change in the other parameters. The nature of this change has yet to be confirmed. Two likely candidates are either a weighting in the mean estimate for the regime the point belongs to, or simply a change in the next estimate, i.e. for $i=2$:
(1) $\tilde{\beta}^1_t=\mu_0 + \frac12(\xi_1+\xi_2)$
This would be great to cancel out error terms for large numbers of observations in between change points.
(2) $\tilde{\beta}^3_t=(\xi_3-\xi_2)+(\xi_2-\xi_1)=\xi_3-\xi_1$ with $\tilde{\beta}^1_t$ as in the regular estimate. This assumption is simply based on the assumed mechanics of the LS estimate for the regular beta.
**Next Step:** We want to gather information about both assumptions through simulations, to then have better insight for a formal proof. Especially for ***Assumption 1*** the proof should be trivial once the proof for $\tilde\beta$ is in place.
---
## Simulations
**Mean Estimation**
```{r}
set.seed(241022)
X10 <- genX(10)
X15 <- genX(15)
```
```{r}
y1 <- c(rnorm(5, 100, 5), rnorm(5, -100, 5))   # one change point after observation 5
t1 <- c(1, rep(0, 4), 1, rep(0, 4))             # t selects positions 1 and 6
betaH(y1, Xt(X10, t1))                          # beta estimate for the restricted design
loss(y1, t = t1)
mean(y1[1:5])                                   # compare: first regime mean ...
mean(y1[6:10]) - mean(y1[1:5])                  # ... and the difference of regime means
```
```{r}
y2 <- c(rnorm(3, 100, 5), rnorm(7, -100, 5))
t2 <- c(1, rep(0, 2), 1, rep(0, 6))
betaH(y2, Xt(X10, t2))
loss(y2, t = t2)
mean(y2[1:3])
mean(y2[4:10]) - mean(y2[1:3])
```
```{r}
y3 <- c(rnorm(1, 100, 5), rnorm(9, -100, 5))
t3 <- c(1, 1, rep(0, 8))
betaH(y3, Xt(X10, t3))
loss(y3, t = t3)
mean(y3[1])
mean(y3[2:10]) - mean(y3[1])
```
```{r}
y4 <- c(rnorm(5, 100, 5),rnorm(5, 0, 5), rnorm(5, -100, 5))
t4 <- c(1, rep(0, 4), 1, rep(0, 4), 1, rep(0, 4))
betaH(y4, Xt(X15, t4))
loss(y4, t = t4)
mean(y4[1:5])
mean(y4[6:10]) - mean(y4[1:5])
mean(y4[11:15]) - mean(y4[6:10])
```
```{r}
y5 <- c(rnorm(5, 100, 5), rnorm(5, -100, 5), rnorm(5, 100, 5))
t5 <- c(1, rep(0, 4), 1, rep(0, 4), 1, rep(0, 4))
betaH(y5, Xt(X15, t5))
loss(y5, t = t5)
mean(y5[1:5])
mean(y5[6:10]) - mean(y5[1:5])
mean(y5[11:15]) - mean(y5[6:10])
```
Take Away:
---
**Assumption 1:**
```{r}
y <- y1
t <- rep(1, 10)
betaH(y, Xt(X10, t))
loss(y, t = t)
```
```{r}
t[2] <- 0.75
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.5
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.25
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.1
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.01
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.001
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
Take Away:
---
**Assumption 2:**
```{r}
betaH(y, X = Xt(X10))
loss(y, t = t)
```
```{r}
b <- betaH(y, X = Xt(X10))
b[1] + 0.5 * b[2]
b[3] + 0.5 * b[2]
```
```{r}
t <- c(1, 0, rep(1, 8))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
b <- betaH(y, X = Xt(X10))
b[1] + 1/3 * (2 * b[2] + b[3])
b[4] + 1/3 * (b[2] + 2* b[3])
```
```{r}
t <- c(1, rep(0, 2), rep(1, 7))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t <- c(1, rep(0, 4), rep(1, 5))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
mean(y[1:5])
mean(y[6:10]) - mean(y[1:5])
```
```{r}
t <- c(rep(1,5), 0, rep(1, 4))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
Take Away:
---
## Gradient in t
In order to ensure a smooth and fast convergence of a gradient descent method minimizing over the vector $t$, we want to find a solution for $\frac{\partial c_\lambda(t)}{\partial t_i}$ for all $t_i$ with $t\in[0,1]^p$:
$$\frac{\partial c_\lambda(t)}{\partial t_i}=\frac{\partial}{\partial t_i}\Big(\frac1p||y-X_t\tilde{\beta}_t||^2_2+\lambda\sum_{j=1}^{p} t_j\Big)$$
$$=\frac1p\frac{\partial}{\partial t_i}(y-X_t\tilde{\beta}_t)^T(y-X_t\tilde{\beta}_t)+\frac{\partial}{\partial t_i}\lambda\sum_{j=1}^{p} t_j$$
$$=\frac1p\frac{\partial}{\partial t_i}(y^Ty-2y^TX_t\tilde{\beta}_t+\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t)+\lambda$$
$$=\frac1p\Big(-2\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t\Big)+\lambda.$$
Accordingly, we want to find closed form solutions for:
$$\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t=\frac{\partial}{\partial t_i}y^TX_t(X_t^TX_t)^{-1}X_t^Ty=\frac{\partial}{\partial t_i}y^TH_ty=\frac{\partial}{\partial t_i}y^T\tilde{y}_t$$
and
$$\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t=\frac{\partial}{\partial t_i}((X_t^TX_t)^{-1}X_t^Ty)^T(X_t^TX_t)(X_t^TX_t)^{-1}X_t^Ty$$
$$=\frac{\partial}{\partial t_i}((X_t^TX_t)^{-1}X_t^Ty)^TX_t^Ty=\frac{\partial}{\partial t_i}y^TX_t(X_t^TX_t)^{-1}X_t^Ty=\frac{\partial}{\partial t_i}y^TH_ty=\frac{\partial}{\partial t_i}y^T\tilde{y}_t.$$
So in conclusion:
$$\frac{\partial c_\lambda(t)}{\partial t_i}=\frac1p(-2\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t)+\lambda$$
$$=\frac1p\Big(-2\frac{\partial}{\partial t_i}y^T\tilde{y}_t+\frac{\partial}{\partial t_i}y^T\tilde{y}_t\Big)+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^T\tilde{y}_t+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^TH_ty+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\lambda.$$
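A quick numerical check of the equality $y^TX_t\tilde{\beta}_t=\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t$ used above, reusing `X10`, `y1` and the helpers from the simulation section (and assuming `betaH` returns a least-squares solution even when some $t_i=0$):

```{r}
# Both quadratic forms should coincide by the normal equations, even with some t_i = 0.
t_chk <- c(1, rep(0, 4), 1, rep(0, 4))
X_chk <- Xt(X10, t_chk)
b_chk <- betaH(y1, X_chk)
c(drop(t(y1) %*% X_chk %*% b_chk),
  drop(t(b_chk) %*% t(X_chk) %*% X_chk %*% b_chk))
```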
So to find the gradient we may either solve for a closed form solution of our hat matrix $H_t=X_t(X_t^TX_t)^{-1}X_t^T$ given the design matrix $X_t$, or for the expression $X_t\tilde{\beta}_t$. The second one is somewhat more desirable as it also provides an exact solution path $\{\tilde{\beta}_t^{(l)}\}_{l=1}^\infty$.
In both cases we have to $(1)$ prove that $X_t^TX_t$ is invertible and $(2)$ determine the inverse depending on the vector $t$.
As long as $X_t$ is invertible, either because all $t_i>0$ or through some kind of substitution such as $X_t+\delta(Id - diag(\lceil t\rceil))$, we have $H_t=X_t(X_t^TX_t)^{-1}X_t^T=X_tX_t^{-1}(X_t^T)^{-1}X_t^T=Id$ and $\tilde{\beta}_t=(X_t^TX_t)^{-1}X_t^Ty=X_t^{-1}y$ with $X_t^{-1}=(X\cdot diag(t))^{-1}=diag(1/t)\,X^{-1}$.
From here it is easy to determine $\det(X)=1$, $\det(X_t)=\prod_{i=1}^pt_i$ and $\det(X_t^TX_t)=\prod_{i=1}^pt_i^2$, provided $t_i>0$ for all $i=1,\dots,p$.
Further, $(X_t^TX_t)_{ij}=((p+1)-\max(i,j))\,t_it_j=\langle x_{t,i},x_{t,j}\rangle$, which is symmetric.
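These identities can be verified numerically for a strictly positive $t$ (again reusing `X10` and `Xt`, and assuming `Xt(X, t)` implements $X\cdot diag(t)$ as above; purely a sanity check):

```{r}
# For t_i > 0: H_t = Id, X_t^{-1} = diag(1/t) X^{-1} and det(X_t) = prod(t_i).
t_pos <- seq(0.1, 1, length.out = 10)
X_pos <- Xt(X10, t_pos)
H_pos <- X_pos %*% solve(t(X_pos) %*% X_pos) %*% t(X_pos)
max(abs(H_pos - diag(10)))                               # ~ 0 up to numerical error
max(abs(solve(X_pos) - diag(1 / t_pos) %*% solve(X10)))  # ~ 0
c(det(X_pos), prod(t_pos))                               # equal
```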
In conclusion up to this point, the gradient descent is initially determined by the penalization parameter $\lambda$ and can ideally be skipped by simply setting $t_{i^*}\overset{!}{=}0$ for $i^*=\operatorname{argmin}_{i=1,\dots,p}(y_i)$.
However, this is where things start to get somewhat more complicated.