Nice to have: $X$ always has full rank.
To Do: Cases and Simulation. Choice of $\beta$ for non-mean problems (rank and variance changes).
---
## Notes for 21.10.
```{r}
source(file = "functions.R")
```
We want to start with a redefinition of our data $Y_{1:p}$. Still keeping the univariate case with one change point at $\tau_1$, we now want to define the data as $Y_i=\mu_i+\Xi_i$ with $\Xi_i\sim_{iid}\mathcal{N}(0,\sigma^2)$ and $\mu_i=\mu_0$ for $i<\tau_1$ and $\mu_i=\mu_1$ otherwise.
Let $t_i=1$ for all $i$, then
$$\hat{\beta}=(\mu_0+\xi_1,\ (\mu_0+\xi_2)-(\mu_0+\xi_1),\ \dots,\ (\mu_1+\xi_{\tau_1})-(\mu_0+\xi_{\tau_1-1}),\ (\mu_1+\xi_{\tau_1+1})-(\mu_1+\xi_{\tau_1}),\ \dots,\ (\mu_1+\xi_p)-(\mu_1+\xi_{p-1}))^T$$
$$=(\mu_0+\xi_1,\ \xi_2-\xi_1,\ \dots,\ (\mu_1-\mu_0)+(\xi_{\tau_1}-\xi_{\tau_1-1}),\ \xi_{\tau_1+1}-\xi_{\tau_1},\ \dots,\ \xi_p-\xi_{p-1})^T$$
*A proof will follow but should be pretty straightforward.*
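With the functions sourced above this is immediate to check numerically: with all $t_i=1$ the design is invertible and the estimate is exactly the first-difference vector. A minimal sketch (the data values are chosen only for illustration):
```{r}
# Sketch: with all t_i = 1, the LS estimate equals (y_1, diff(y)) -- the
# first-difference representation derived above.
y0 <- c(rnorm(5, 100, 5), rnorm(5, -100, 5))  # one change at tau_1 = 6
cbind(betaH(y0, Xt(genX(10))), c(y0[1], diff(y0)))
```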
Sketching the ideal case, with $t_1=t_{\tau_1}=1$, all other $t_i$ equal to $0$, and the corresponding $\tilde{\beta}=(\mu_0,0,\dots,0,\mu_1-\mu_0,0,\dots,0)^T$, the $l_2$-norm is simply given by the squared errors. This is why we want to adopt this notation in the first place, as it presents us with a simple problem:
$$l_\lambda(t_{\tilde{\beta}})<l_\lambda(t_{\hat\beta})$$
$$\iff\frac1p\sum_{i=1}^p\xi_i^2+2\lambda<p\lambda.$$
Writing $\xi_i=\sigma Z_i$ with $Z_i\sim_{iid}\mathcal{N}(0,1)$, this becomes
$$\implies\frac{\sigma^2}p\sum_{i=1}^pZ_i^2<(p-2)\lambda$$
$$\implies\mathbb{P}\left[\frac{\sigma^2}p\sum_{i=1}^pZ_i^2<(p-2)\lambda\right]=\mathbb{P}\left[\frac{\sigma^2}p\chi^2(p)<(p-2)\lambda\right]\geq0.95$$
$$\iff\frac{\sigma^2}{p(p-2)}\chi_{0.95}^2(p)<\lambda$$
More generally, for $k$ change points:
$$\frac{\sigma^2}{p(p-(k+1))}\chi_{0.95}^2(p)<\lambda,\qquad k=1,\dots,p-2,$$
since the first data point should not be considered a change point, and if every single point were allowed to suggest a change point the detection would miss its point in the first place.
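This bound is straightforward to evaluate; a minimal sketch, where the helper name `lambdaMin` is ours and not part of `functions.R`:
```{r}
# Sketch: chi-square lower bound on lambda for p observations, k change
# points and error variance sigma2, at the given confidence level.
lambdaMin <- function(p, k, sigma2, level = 0.95) {
  sigma2 / (p * (p - (k + 1))) * qchisq(level, df = p)
}
lambdaMin(p = 10, k = 1, sigma2 = 25)  # sigma = 5 as in the simulations below
```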
As of now, $\tilde{\beta}=(\mu_0,0,\dots,0,\mu_1-\mu_0,0,\dots,0)^T$ is what may be considered the ideal solution for $k=1$; however, an estimation of $\tilde{\beta}^*=(\mu_0+\xi_1,0,\dots,0,(\mu_1-\mu_0)+\xi_{\tau_1},0,\dots,0)^T$ with the method is also realistic and has to be investigated in simulations first, to then be proven. In that case the degrees of freedom would simply be reduced by $k+1$, as those errors don't need to be covered by the chi-squared distribution.
Let's state that first simulation question as:
Does the estimation of the optimal beta vector for the change point regression problem lead to an unbiased estimation of the true mean values? For this, not all $t_i$ values can be exactly zero, as the errors need to be kept in some scaled form to counteract their effect on the estimation of $\mu_i$.
Further, we want to state the following assumptions about how changes in a single $t_i$ affect the other $\tilde{\beta}_t$ estimates, in relation to its neighbors, its regime, and the whole vector:
**Assumption 1:**
Does $\tilde{\beta}_t^i=\frac1{t_i}(\xi_i-\xi_{i-1})$ hold for every $i$ that is not a change point, provided $t_j\neq0$ for all $j$?
The idea: as the system of equations has an exact solution as long as no $t_i=0$, the $t_i$ will simply be inverted in the estimation of $\tilde{\beta}_t^i$. This may lead to the requirement that there be a cutoff for the $t_i$, forcing them to $0$ if they fall below it. The cutoff should be chosen in relation to $\sigma^2$, likely again with a chi-square estimate for homogeneous normal errors.
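A minimal sketch of such a cutoff; the helper name `cutoffT` and the example threshold are ours, and the actual choice of threshold is still open:
```{r}
# Sketch (assumption): hard-threshold small gate values to exactly 0 before
# refitting; how to derive cut from sigma^2 (e.g. via a chi-square quantile)
# is the open question above.
cutoffT <- function(t, cut) {
  ifelse(t < cut, 0, t)
}
cutoffT(c(1, 0.001, 0.8, 0.04), cut = 0.05)
```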
**Assumption 2:**
Forcing a non-change-point $t_i$ to be zero by "previous knowledge", with all other $t_i$ equal to $1$, will necessarily lead to a change in the other parameters. The nature of this change has yet to be confirmed. Two likely candidates are either a weighting in the mean estimate for the regime the point belongs to, or simply a change in the next estimate; i.e., let $i=2$:
(1) $\tilde{\beta}^1_t=\mu_0 + \frac12(\xi_1+\xi_2)$
This would be great for cancelling out error terms when there are large numbers of observations between change points.
(2) $\tilde{\beta}^3_t=(\xi_3-\xi_2)+(\xi_2-\xi_1)=\xi_3-\xi_1$ with $\tilde{\beta}^1_t$ as in the regular estimate. This assumption is simply based on the assumed mechanics of the LS estimate for the regular beta.
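Both candidates can be contrasted against the restricted fit directly; a minimal sketch using the functions sourced above, with a constant mean so that $\mu_0$ cancels in candidate (2):
```{r}
# Sketch: constant mean (no change point), t_2 forced to 0, all other t_i = 1.
# Candidate (1) predicts bR[1] = (y_1 + y_2)/2; candidate (2) predicts
# bR[3] = y_3 - y_1.
yc <- rnorm(10, 100, 5)
bR <- betaH(yc, Xt(genX(10), c(1, 0, rep(1, 8))))
c(bR[1], (yc[1] + yc[2]) / 2)  # candidate (1)
c(bR[3], yc[3] - yc[1])        # candidate (2)
```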
**Next Step:** We want to gather information about both assumptions with simulations, to then have better insight for a formal proof. Especially for ***Assumption 1*** the proof should be trivial once the proof for $\hat\beta$ above is done.
---
## Simulations
**Mean Estimation**
```{r}
set.seed(241022)
X10 <- genX(10)
X15 <- genX(15)
```
```{r}
y1 <- c(rnorm(5, 100, 5), rnorm(5, -100, 5))
t1 <- c(1, rep(0, 4), 1, rep(0, 4))
betaH(y1, Xt(X10, t1))
loss(y1, t = t1)
mean(y1[1:5])
mean(y1[6:10]) - mean(y1[1:5])
```
```{r}
y2 <- c(rnorm(3, 100, 5), rnorm(7, -100, 5))
t2 <- c(1, rep(0, 2), 1, rep(0, 6))
betaH(y2, Xt(X10, t2))
loss(y2, t = t2)
mean(y2[1:3])
mean(y2[4:10]) - mean(y2[1:3])
```
```{r}
y3 <- c(rnorm(1, 100, 5), rnorm(9, -100, 5))
t3 <- c(1, 1, rep(0, 8))
betaH(y3, Xt(X10, t3))
loss(y3, t = t3)
mean(y3[1])
mean(y3[2:10]) - mean(y3[1])
```
```{r}
y4 <- c(rnorm(5, 100, 5),rnorm(5, 0, 5), rnorm(5, -100, 5))
t4 <- c(1, rep(0, 4), 1, rep(0, 4), 1, rep(0, 4))
betaH(y4, Xt(X15, t4))
loss(y4, t = t4)
mean(y4[1:5])
mean(y4[6:10]) - mean(y4[1:5])
mean(y4[11:15]) - mean(y4[6:10])
```
```{r}
y5 <- c(rnorm(5, 100, 5), rnorm(5, -100, 5), rnorm(5, 100, 5))
t5 <- c(1, rep(0, 4), 1, rep(0, 4), 1, rep(0, 4))
betaH(y5, Xt(X15, t5))
loss(y5, t = t5)
mean(y5[1:5])
mean(y5[6:10]) - mean(y5[1:5])
mean(y5[11:15]) - mean(y5[6:10])
```
Take Away:
---
**Assumption 1:**
```{r}
y <- y1
t <- rep(1, 10)
betaH(y, Xt(X10, t))
loss(y, t = t)
```
```{r}
t[2] <- 0.75
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.5
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.25
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.1
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.01
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t[2] <- 0.001
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
Take Away:
---
**Assumption 2:**
```{r}
t <- rep(1, 10)
betaH(y, X = Xt(X10))
loss(y, t = t)
```
```{r}
b <- betaH(y, X = Xt(X10))
b[1] + 0.5 * b[2]
b[3] + 0.5 * b[2]
```
```{r}
t <- c(1, 0, rep(1, 8))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
b <- betaH(y, X = Xt(X10))
b[1] + 1/3 * (2 * b[2] + b[3])
b[4] + 1/3 * (b[2] + 2* b[3])
```
```{r}
t <- c(1, rep(0, 2), rep(1, 7))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
```{r}
t <- c(1, rep(0, 4), rep(1, 5))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
mean(y[1:5])
mean(y[6:10]) - mean(y[1:5])
```
```{r}
t <- c(rep(1,5), 0, rep(1, 4))
betaH(y, X = Xt(X10, t = t))
loss(y, t = t)
```
Take Away:
---
## Gradient in t
In order to ensure smooth and fast convergence of a gradient descent method minimizing over the vector $t$, we want to find a solution for $\frac{\partial c_\lambda(t)}{\partial t_i}$ for all $t_i$ in $t\in[0,1]^p$:
$$\frac{\partial c_\lambda(t)}{\partial t_i}=\frac{\partial}{\partial t_i}\left(\frac1p||y-X_t\tilde{\beta}_t||^2_2+\lambda\sum_{j=1}^{p} t_j\right)$$
$$=\frac1p\frac{\partial}{\partial t_i}(y-X_t\tilde{\beta}_t)^T(y-X_t\tilde{\beta}_t)+\frac{\partial}{\partial t_i}\lambda\sum_{j=1}^{p} t_j$$
$$=\frac1p\frac{\partial}{\partial t_i}(y^Ty-2y^TX_t\tilde{\beta}_t+\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t)+\lambda$$
$$=\frac1p\left(-2\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t\right)+\lambda.$$
Accordingly, we want to find closed-form solutions for:
$$\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t=\frac{\partial}{\partial t_i}y^TX_t(X_t^TX_t)^{-1}X_t^Ty=\frac{\partial}{\partial t_i}y^TH_ty=\frac{\partial}{\partial t_i}y^T\tilde{y}_t$$
and
$$\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t=\frac{\partial}{\partial t_i}((X_t^TX_t)^{-1}X_t^Ty)^T(X_t^TX_t)(X_t^TX_t)^{-1}X_t^Ty$$
$$=\frac{\partial}{\partial t_i}((X_t^TX_t)^{-1}X_t^Ty)^TX_t^Ty=\frac{\partial}{\partial t_i}y^TX_t(X_t^TX_t)^{-1}X_t^Ty=\frac{\partial}{\partial t_i}y^TH_ty=\frac{\partial}{\partial t_i}y^T\tilde{y}_t,$$
using the symmetry of $(X_t^TX_t)^{-1}$.
So in conclusion:
$$\frac{\partial c_\lambda(t)}{\partial t_i}=\frac1p\left(-2\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\frac{\partial}{\partial t_i}\tilde{\beta}_t^T(X_t^TX_t)\tilde{\beta}_t\right)+\lambda$$
$$=\frac1p\left(-2\frac{\partial}{\partial t_i}y^T\tilde{y}_t+\frac{\partial}{\partial t_i}y^T\tilde{y}_t\right)+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^T\tilde{y}_t+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^TH_ty+\lambda=-\frac1p\frac{\partial}{\partial t_i}y^TX_t\tilde{\beta}_t+\lambda.$$
So to find the gradient we may either solve for a closed-form solution of our hat matrix $H_t=X_t(X_t^TX_t)^{-1}X_t^T$ given the design matrix $X_t$, or for the expression $X_t\tilde{\beta}_t$. The second is somewhat more desirable, as we are also provided an exact solution path $\{\tilde{\beta}_t^{(l)}\}_{l=1}^\infty$.
In both cases we have to $(1)$ prove that $X_t^TX_t$ is invertible and $(2)$ determine the inverse depending on the vector $t$.
As long as $X_t$ is an invertible square matrix, either because all $t_i>0$ or via some substitution of the form $X_t+\delta(\mathrm{Id}-\mathrm{diag}(\lceil t\rceil))$, we get $H_t=X_t(X_t^TX_t)^{-1}X_t^T=X_tX_t^{-1}(X_t^T)^{-1}X_t^T=\mathrm{Id}$ and $\tilde{\beta}_t=(X_t^TX_t)^{-1}X_t^Ty=X_t^{-1}y$ with $X_t^{-1}=(X\,\mathrm{diag}(t))^{-1}=\mathrm{diag}(1/t)\,X^{-1}$.
From here it is easy to determine $\det(X)=1$, $\det(X_t)=\prod_{i=1}^pt_i$ and $\det(X_t^TX_t)=\prod_{i=1}^pt_i^2$, provided $t_i>0$ for all $i=1,\dots,p$.
Further, $(X_t^TX_t)_{ij}=\langle x_{t,i},x_{t,j}\rangle=((p+1)-\max(i,j))\,t_it_j$, which is symmetric.
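Both identities are easy to confirm numerically; a minimal sketch with a random gate vector:
```{r}
# Sketch: check det(X) = 1, det(X_t) = prod(t), and the closed form of the
# Gram matrix entries (X_t^T X_t)_ij = (p + 1 - max(i, j)) * t_i * t_j.
t0 <- runif(10, 0.1, 1)
Xt0 <- Xt(genX(10), t0)
c(det(genX(10)), det(Xt0), prod(t0))
max(abs(t(Xt0) %*% Xt0 - (11 - outer(1:10, 1:10, pmax)) * outer(t0, t0)))
```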
In conclusion up to this point, the gradient descent is initially determined solely by the penalization parameter $\lambda$, and this phase can ideally be skipped by simply setting $t_{i^*}\overset{!}{=}0$ for $i^*=\arg\min_{i=1,\dots,p}(y_i)$.
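A finite-difference check makes this concrete: wherever all $t_i>0$ the fit is exact, $H_t=\mathrm{Id}$, and the gradient reduces to the constant $\lambda$. A minimal sketch, with `numGrad` being our hypothetical helper on top of the `loss()` from `functions.R`:
```{r}
# Sketch: central finite difference of c_lambda(t) in coordinate i.
numGrad <- function(y, t, i, lam = 0.1, h = 1e-6) {
  tp <- t; tp[i] <- tp[i] + h
  tm <- t; tm[i] <- tm[i] - h
  (loss(y, t = tp, lam = lam) - loss(y, t = tm, lam = lam)) / (2 * h)
}
yg <- c(rnorm(5, 100, 5), rnorm(5, -100, 5))
tg <- runif(10, 0.5, 1)   # all t_i > 0, so X_t stays invertible
numGrad(yg, tg, i = 3)    # approximately lam = 0.1
```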
However, this is where things start to get somewhat more complicated.
---
## functions.R
For reference, the helper functions sourced at the top of these notes:
```{r}
# genX: p x p design matrix; column i is 0 up to row i - 1 and 1 from row i
# on, so X %*% beta yields the cumulative sums of beta.
genX <- function(p) {
  X <- c()
  for (i in 1:p) {
    X <- cbind(X, c(rep(0, times = i - 1), rep(1, times = p + 1 - i)))
  }
  return(X)
}

# Xt: scale the columns of X by the gate vector t, i.e. X_t = X diag(t).
Xt <- function(X, t = rep(1, times = ncol(X))) {
  return(X %*% diag(x = t))
}

# betaH: least-squares estimate for the (possibly gated) design X. Columns
# zeroed out by t_i = 0 are dropped before solving and their coefficients
# are set back to 0 afterwards.
betaH <- function(y, X) {
  p <- ncol(X)
  zeroCols <- which(colSums(abs(X)) == 0)
  if (length(zeroCols) > 0) {
    Xr <- X[, -zeroCols, drop = FALSE]
    beta <- numeric(p)
    beta[-zeroCols] <- solve(t(Xr) %*% Xr) %*% t(Xr) %*% y
  } else {
    beta <- c(solve(t(X) %*% X) %*% t(X) %*% y)
  }
  return(beta)
}

# loss: penalized objective c_lambda(t) = (1/p) * ||y - X_t betaH_t||^2 + lam * sum(t).
loss <- function(y, t = rep(x = 1, times = length(y)), lam = 0) {
  p <- length(y)
  Xg <- Xt(X = genX(p = p), t = t)
  betaHt <- betaH(y = y, X = Xg)
  pen <- lam * sum(t)
  nm <- sum((y - Xg %*% betaHt)^2)
  return((1 / p) * nm + pen)
}
```