Resources for students in Stat 111 (Spring 2023). Managed by aL Xin.
View files on GitHub awqx/stat111-2023
Sections

The standard setup is that the $(X_i, Y_i)$ are i.i.d. pairs: the components within a pair may be dependent, but the pairs are independent of each other.
When we condition on our data $X$, the model becomes much more sensible, since we can make predictions based on the data we actually observed. The joint distribution of $Y$ on its own is usually unhelpful.
Define the error $U = Y - \mathbb{E}[Y \mid X] = Y - \theta X$, where the model is $\mathbb{E}[Y \mid X] = \theta X$ (linear regression through the origin).
We can use Adam’s law to get $\mathbb{E} [U]$ as
\[\begin{aligned} \mathbb{E}[U] &= \mathbb{E}[\mathbb{E}[U \mid X]] \\ &= \mathbb{E}[\mathbb{E}[Y \mid X] - \theta X] \\ &= \mathbb{E}[0] = 0. \end{aligned}\]We also have
\[UX = XY - \theta X^2 \implies \mathbb{E}[UX] = \mathbb{E}[XY] - \theta \mathbb{E}[X^2].\]By Adam's law,
\[\begin{aligned} \mathbb{E}[UX] &= \mathbb{E}[\mathbb{E}[UX \mid X]] \\ &= \mathbb{E}[X \, \mathbb{E}[U \mid X]] \\ &= \mathbb{E}[X \cdot 0] = 0, \end{aligned}\]which implies
\[\theta = \frac{\mathbb E[XY]}{\mathbb E[X^2]}\]The previous result gives us an MoM estimator (not necessarily the MLE) of
\[\hat\theta = \frac{\sum_{j = 1}^n X_j Y_j}{\sum_{j = 1}^n X_j^2}\]The above is also the MLE, but only under Gaussian assumptions on the errors.
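As a quick sanity check, here is a minimal simulation sketch in Python (not from the notes; the model $X \sim \mathcal{N}(0, 1)$, $U \sim \mathcal{N}(0, \sigma^2)$ independent of $X$, and the names `simulate` and `theta_hat` are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(111)

def simulate(n, theta=2.0, sigma=1.0):
    # One concrete model consistent with the setup: X ~ N(0, 1) and
    # U ~ N(0, sigma^2) independent of X, so Y = theta * X + U
    # and E[Y | X] = theta * X.
    x = rng.normal(size=n)
    y = theta * x + rng.normal(scale=sigma, size=n)
    return x, y

def theta_hat(x, y):
    # MoM estimator: sum of X_j * Y_j over sum of X_j^2.
    return np.sum(x * y) / np.sum(x**2)

x, y = simulate(n=10_000)
print(theta_hat(x, y))  # should land close to the true theta = 2.0
```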
Is $\hat\theta$ unbiased?
We can also write $Y_j = U_j + \theta X_j$, giving us
\[\hat\theta = \frac{\sum_{j = 1}^n X_j (U_j + \theta X_j)}{\sum_{j = 1}^n X_j^2} = \theta + \frac{\sum_{j = 1}^n X_j U_j}{\sum_{j = 1}^n X_j^2},\]which shows that the estimator is unbiased if the second term has expectation 0. This follows from Adam's law: conditioning on $X_1, \dots, X_n$, the numerator has conditional expectation $\sum_{j = 1}^n X_j \mathbb{E}[U_j \mid X_j] = 0$, echoing the fact that $\mathbb{E}[UX] = 0$.
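Continuing the simulation sketch above, we can check this empirically by averaging $\hat\theta$ over many simulated datasets:

```python
# Reuses simulate() and theta_hat() from the sketch above.
estimates = [theta_hat(*simulate(n=50)) for _ in range(20_000)]
print(np.mean(estimates))  # close to theta = 2.0: no systematic bias
```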
What is the asymptotic distribution? We have
\[\sqrt{n} (\hat\theta - \theta) = \sqrt{n} \frac{ \frac{1}{n} \sum_{j = 1}^n X_j U_j} {\frac{1}{n} \sum_{j = 1}^n X_j^2} \overset{d}{\to} \mathcal{N} \left( 0, \frac{\textrm{Var}(X_1 U_1)}{(\mathbb E[X_1^2])^2} \right).\]This follows from the CLT applied to the numerator, the LLN applied to the denominator, and Slutsky's theorem.
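Continuing the same sketch: under the illustrative model above, $\textrm{Var}(X_1 U_1) = \mathbb E[X_1^2] \, \mathbb E[U_1^2] = \sigma^2$ and $\mathbb E[X_1^2] = 1$, so $\sqrt{n}(\hat\theta - \theta)$ should look like $\mathcal{N}(0, \sigma^2)$ for large $n$:

```python
# Reuses simulate() and theta_hat(); true theta = 2.0, sigma = 1.0.
n = 500
z = np.array([np.sqrt(n) * (theta_hat(*simulate(n)) - 2.0)
              for _ in range(10_000)])
print(z.mean(), z.std())  # roughly 0 and 1, matching N(0, sigma^2)
```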
Joe generally dislikes hypothesis testing. Usually, Bayesian methods or constructing a confidence interval are more informative, and the assumptions are clearer.
Hypothesis testing and $p$-values are, unfortunately, very widely used in the sciences. Hypothesis testing is often applied recklessly (or deliberately misused) to find significant results in an experiment.
Let $\Theta$ be the parameter space (capital theta). The parameter space contains our possible values of $\theta$, and $\Theta$ can have any dimension, as appropriate to the problem. We want to test the null hypothesis (denoted $H_0$) that $\theta \in \Theta_0$ versus the alternative hypothesis (denoted $H_1$) that $\theta \in \Theta_1$.
Normally, we assume that $\Theta_0$ and $\Theta_1$ form a partition of $\Theta$. Some usage does not require this, which is strange, since it would leave part of the parameter space covered by neither hypothesis.
Although this setup looks like we're deciding between $H_0$ and $H_1$, that is not quite how we reason about hypotheses: we never accept a hypothesis as true. For example, scientists say they "reject" or "fail to reject" the null rather than "reject/accept" (in this class, you can say "retain" instead of "fail to reject"). If you're interested in making decisions under uncertainty, check out game theory or decision theory.
The Neyman-Pearson framework is the most common approach to hypothesis testing, so you often do not need to declare that you're using it when performing a test.
…
(Bonus stats joke)
A statistician is someone who wants to be wrong exactly 5% of the time.
…
The framework is as follows: fix a significance level $\alpha$ in advance, then use a test whose Type I error rate, $\Pr(\textrm{reject } H_0 \mid H_0)$, is at most $\alpha$.
Note that the framework only focuses on Type I error and $\alpha$.
Type II error is often important, too. We could also consider situations where we would like to minimize $\Pr(\textrm{retain } H_0 \mid H_1)$, the Type II error rate; its complement is the power of the test.
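As an illustrative sketch (none of this setup is from the notes): for a two-sided $z$-test of a Normal mean with known $\sigma$, the power against a given alternative can be computed in closed form:

```python
import numpy as np
from scipy import stats

alpha, n, sigma = 0.05, 25, 1.0
delta = 0.5  # true mean minus null mean under the alternative
c = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value

# Under this alternative, T = sqrt(n) * (Ybar - mu_0) / sigma is
# N(sqrt(n) * delta / sigma, 1), so power = P(|T| > c).
shift = np.sqrt(n) * delta / sigma
power = stats.norm.sf(c - shift) + stats.norm.cdf(-c - shift)
print(power)  # Type II error rate is 1 - power
```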
When testing, we create a test statistic $T$, where we reject $H_0$ if $|T| > c$ and retain $H_0$ if $|T| < c$.
What do we do when $T$ is discrete and $T = c$? To be consistent with the framework, we would have to reject randomly with a certain probability (not necessarily $1/2$) when $T = c$, so that the Type I error rate is exactly $\alpha$.
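Here is a minimal sketch of such a randomized test (the one-sided Binomial setup is an assumption for illustration, not from the notes): reject when $T > c$, retain when $T < c$, and reject with probability $\gamma$ when $T = c$, with $\gamma$ chosen so the Type I error rate is exactly $\alpha$.

```python
from scipy import stats

alpha, n, p0 = 0.05, 10, 0.5   # H_0: p = p0 vs. H_1: p > p0
T = stats.binom(n, p0)         # T = number of successes under H_0

# Smallest c with P(T > c) <= alpha under H_0.
c = min(k for k in range(n + 1) if T.sf(k) <= alpha)

# Choose gamma so P(T > c) + gamma * P(T = c) = alpha exactly.
gamma = (alpha - T.sf(c)) / T.pmf(c)
print(c, gamma)  # reject if T > c; if T == c, reject with prob. gamma
```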
We often redefine retention to be $|T| \leq c$ to sidestep the randomization problem above.
Solving for $c$ is a fairly standard STAT 110 problem. We only need to find $c$ such that $\Pr(|T| > c \mid \theta_0) = \alpha$.
Ex: one-sample $t$-test.
Let $Y_j$ be i.i.d. Normals with mean $\mu$ and variance $\sigma^2$. We have $H_0: \mu = \mu_0$ and $H_1: \mu \neq \mu_0$.
We can use $T = \sqrt{n}\frac{\bar Y - \mu_0}{\hat\sigma}$ as our test statistic, where $\hat\sigma$ is the sample standard deviation. (If we knew $\sigma$ and used it in place of $\hat\sigma$, $T$ would be exactly standard Normal under the null.) Intuitively, if $\bar Y$ is far from $\mu_0$ (and $T$'s magnitude is large), it would seem like the null should be rejected.
Under the null hypothesis, $T \sim t_{n - 1}$ ($t$ distribution with $n - 1$ degrees of freedom). We can write $1 - \alpha = \Pr(|T| \leq c)$. Because of the symmetry of $t_{n - 1}$ and our two-sided alternative, we can write
\[\alpha = \Pr(|T| \geq c) = 2 \Pr(T > c) \implies 1 - \frac{\alpha}{2} = \Pr(T \leq c),\]where $c$ is easy to find from the quantile function of the $t_{n - 1}$ distribution.
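For example, a minimal sketch of the full test (the data here are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(111)
alpha, mu0 = 0.05, 0.0
y = rng.normal(loc=0.3, scale=1.0, size=30)  # illustrative data
n = len(y)

t_stat = np.sqrt(n) * (y.mean() - mu0) / y.std(ddof=1)
c = stats.t.ppf(1 - alpha / 2, df=n - 1)  # quantile of t_{n-1}
print(abs(t_stat) > c)  # True means reject H_0 at level alpha
```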