Statistical Inference and Learning

Published: February 26, 2025

I. Overview of Statistical Inference

  • Definition:
    Statistical inference (often called “learning” in computer science) is the process of using data to deduce the underlying distribution \(F\) that generated the data. This may involve estimating the entire distribution or specific features (such as the mean).

  • Applications:

    • Extracting meaningful information from data
    • Making informed decisions and predictions
    • Serving as the foundation for more advanced topics in statistics and machine learning

II. Modeling Approaches

A. Parametric Models

  • Definition:
    A model defined by a finite number of parameters.
  • Example (Normal Distribution):
    \[ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
  • Characteristics:
    • Simpler to analyze and interpret
    • More efficient when the assumptions hold true

B. Nonparametric Models

  • Definition:
    Models that do not restrict the distribution to a finite-dimensional parameter space.
  • Examples:
    • Estimating the entire cumulative distribution function (cdf) \(F\)
    • Estimating a probability density function (pdf) with smoothness assumptions (e.g., assuming the pdf belongs to a Sobolev space)
  • Characteristics:
    • Greater flexibility to model complex data
    • Fewer assumptions about the form of the distribution

Example 6.1: One-Dimensional Parametric Estimation

Scenario: We observe independent Bernoulli(\(p\)) random variables \(X_1, X_2, \dots, X_n\).

Goal: Estimate the unknown parameter \(p\) (the probability of success).

Estimator: The natural estimator is the sample mean:
\[ \hat{p}_n = \frac{1}{n}\sum_{i=1}^n X_i. \]

Key Points:

  • Unbiasedness: \(E(\hat{p}_n) = p\); thus, the estimator is unbiased.

  • Variance: Since \(\operatorname{Var}(X_i) = p(1-p)\), the variance of the estimator is
    \[ \operatorname{Var}(\hat{p}_n) = \frac{p(1-p)}{n}. \]

  • Consistency: As \(n\) increases, the variance shrinks to zero, making \(\hat{p}_n\) a consistent estimator of \(p\).


# Example 6.1: One-Dimensional Parametric Estimation (Bernoulli)
import numpy as np

# True parameter for Bernoulli distribution
p_true = 0.7
n = 1000  # number of observations

# Generate n independent Bernoulli(p) observations (0 or 1)
X = np.random.binomial(1, p_true, n)

# Estimator: sample mean is the natural estimator for p
p_hat = np.mean(X)

print("Example 6.1: Bernoulli Parameter Estimation")
print("True p:", p_true)
print("Estimated p:", p_hat)

Example 6.2: Two-Dimensional Parametric Estimation

Scenario: Suppose \(X_1, X_2, \dots, X_n\) are independent observations from a distribution \(F\) whose probability density function belongs to the parametric family
\[ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \]

Goal: Estimate the two parameters: the mean \(\mu\) and the standard deviation \(\sigma\).

Nuisance Parameter: If we are primarily interested in \(\mu\), then \(\sigma\) becomes a nuisance parameter: an additional parameter that must be estimated but is not of direct interest.

Key Points:

  • Multidimensionality: The problem requires simultaneous estimation of \(\mu\) and \(\sigma\).
  • Methods: Techniques such as maximum likelihood estimation (MLE) are commonly used, sometimes combined with devices (like the profile likelihood) to eliminate the effect of nuisance parameters when focusing on \(\mu\).
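As a minimal sketch of this example, the following code draws a Normal sample and computes the maximum likelihood estimates using the closed forms \(\hat{\mu} = \bar{X}_n\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2\); the true parameter values and sample size are illustrative choices, not part of the original example.

# Example 6.2 (sketch): MLE for the Normal mean and standard deviation
import numpy as np

mu_true, sigma_true = 2.0, 1.5  # illustrative true parameters
n = 1000

X = np.random.normal(mu_true, sigma_true, n)

mu_hat = np.mean(X)    # MLE of mu: the sample mean
sigma_hat = np.std(X)  # MLE of sigma: np.std defaults to the 1/n form

print("Estimated mu:", mu_hat)
print("Estimated sigma:", sigma_hat)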

Analytical Explanation of Two Nonparametric Estimation Examples

Below, we analyze and explain two examples that illustrate nonparametric estimation techniques: one for estimating the cumulative distribution function (CDF) and another for estimating the probability density function (PDF).


Example 6.3: Nonparametric Estimation of the CDF

Problem Statement

  • Data:
We have independent observations \(X_1, X_2, \dots, X_n\) drawn from an unknown distribution with CDF \(F\).

  • Objective:
Estimate the entire cumulative distribution function \(F\) under minimal assumptions, namely that \(F\) is any valid CDF (the class denoted \(\mathcal{F}_{\text{ALL}}\)).

Approach

  • Estimator:
The natural nonparametric estimator of the CDF is the empirical distribution function (EDF), defined as \[ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le x\}, \] where \(\mathbf{1}\{X_i \le x\}\) is an indicator function that is 1 if \(X_i \le x\) and 0 otherwise.

Why This Works

  • Minimal Assumptions:
No specific parametric form for \(F\) is assumed; all that is required is that \(F\) is a valid CDF. This makes the method very general.

  • Convergence Properties:
    The Glivenko–Cantelli theorem guarantees that the empirical CDF converges uniformly to the true CDF: \[ \sup_x \left| \hat{F}_n(x) - F(x) \right| \to 0 \quad \text{as} \quad n \to \infty. \] This property ensures that the estimator is consistent.

  • Intuitive Interpretation:
    The EDF simply calculates the proportion of observations less than or equal to a given value, which is the natural way to “build” the CDF from data.
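To make the construction concrete, here is a minimal sketch that computes the EDF on a grid and evaluates the Glivenko–Cantelli sup distance against a known true CDF; the standard Normal sample and the grid are illustrative assumptions.

# Example 6.3 (sketch): empirical CDF and the Glivenko-Cantelli distance
import numpy as np
from scipy.stats import norm

n = 1000
X = np.random.normal(0, 1, n)  # sample from a known F, for illustration

def edf(x, data):
    # Proportion of observations <= x, evaluated at each point of the grid x
    return np.mean(data[:, None] <= x, axis=0)

grid = np.linspace(-4, 4, 401)
F_hat = edf(grid, X)
F_true = norm.cdf(grid)

# sup_x |F_hat(x) - F(x)|, approximated on the grid; it shrinks as n grows
print("sup distance:", np.max(np.abs(F_hat - F_true)))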


Example 6.4: Nonparametric Density Estimation

Problem Statement

  • Data:
Again, we have independent observations \(X_1, X_2, \dots, X_n\) from a distribution with CDF \(F\). Let the associated PDF be \(f = F'\).

  • Objective:
Estimate the PDF \(f\). However, unlike the CDF, the density cannot be estimated nonparametrically under the sole assumption that \(F\) is any CDF.

Need for Additional Assumptions

  • Ill-Posed Without Smoothness:
The space of all CDFs (denoted by \(\mathcal{F}_{\text{ALL}}\)) is too vast; a generic CDF need not be differentiable. Even if a density exists, it can be highly irregular, making consistent estimation difficult or impossible.

  • Introducing Smoothness via Sobolev Spaces:
    To estimate \(f\) reliably, we assume that \(f\) belongs to a more restricted function class. One common assumption is that \(f\) lies in a Sobolev space (denoted by \(\mathcal{F}_{\text{SOB}}\)).
    For instance, one might assume: \[ \mathcal{F}_{\text{SOB}} = \left\{ f \in \mathcal{F}_{\text{DENS}} : \int \left(f^{(s)}(x)\right)^2 dx < \infty \right\}, \] where:

    • \(\mathcal{F}_{\text{DENS}}\) is the set of all probability density functions.
    • \(f^{(s)}(x)\) denotes the \(s\)-th derivative of \(f\).
    • The condition \(\int \left(f^{(s)}(x)\right)^2\,dx < \infty\) ensures that \(f\) is not “too wiggly” or irregular.

Estimation Methods

  • Kernel Density Estimation (KDE):
    With the smoothness assumption in place, methods such as kernel density estimation can be employed. A kernel density estimator has the form \[ \hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right), \] where:
    • \(K(\cdot)\) is a smooth kernel function (e.g., Gaussian).
    • \(h\) is a bandwidth parameter that controls the smoothness of the estimate.
    A sketch implementing this estimator directly from the definition is given below.
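Here is a minimal sketch of the kernel density estimator implemented directly from the formula above, with a Gaussian kernel; the sample, bandwidth \(h\), and evaluation grid are illustrative choices.

# Example 6.4 (sketch): Gaussian kernel density estimator from the definition
import numpy as np

n = 500
X = np.random.normal(0, 1, n)  # illustrative sample
h = 0.3                        # illustrative bandwidth

def kde(x, data, h):
    # f_hat(x) = (1/(n*h)) * sum_i K((x - X_i) / h) with a Gaussian kernel K
    u = (x - data[:, None]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.mean(axis=0) / h

grid = np.linspace(-4, 4, 201)
f_hat = kde(grid, X, h)
print("Estimated density at 0:", f_hat[100])  # grid[100] == 0.0

Smaller \(h\) yields a wigglier estimate and larger \(h\) oversmooths; the bandwidth governs the bias-variance trade-off of the estimator.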

Why These Assumptions Are Necessary

  • Regularization:
    The smoothness condition imposed by the Sobolev space regularizes the estimation problem: it restricts the candidate densities to those whose \(s\)-th derivative has finite \(L^2\) norm, which limits how rapidly the density can oscillate.

  • Improved Convergence:
    Smoothness assumptions lead to better convergence properties of the density estimator, allowing for rates of convergence that can be rigorously analyzed.

  • Practical Feasibility:
    In many real-world scenarios, the underlying density is indeed smooth (e.g., physical phenomena, economic variables), making this assumption both realistic and useful.


Summary

  • Example 6.3:
    • Task: Estimate the CDF \(F\) from data with minimal assumptions.
    • Method: Use the empirical CDF \(\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}\).
    • Key Property: Uniform convergence guaranteed by the Glivenko–Cantelli theorem.
  • Example 6.4:
    • Task: Estimate the density \(f\) from data.
    • Challenge: Estimation is ill-posed without additional assumptions.
    • Solution: Assume that \(f\) is smooth by requiring it to belong to a Sobolev space (e.g., \(f \in \mathcal{F}_{\text{DENS}} \cap \mathcal{F}_{\text{SOB}}\)), then use methods like kernel density estimation.
    • Benefit: Smoothness constraints make the problem well-posed and lead to estimators with favorable convergence properties.

These examples highlight the progression from estimating a distribution function under minimal assumptions to needing extra regularity conditions when estimating derivatives like the density.

III. Core Concepts in Inference

1. Point Estimation

  • Concept:
    A point estimator is a function of the data, denoted as
    \[ \hat{\theta}_n = g(X_1, X_2, \dots, X_n) \]
    used to provide a single “best guess” for the unknown parameter \(\theta\).

  • Key Properties (checked numerically in the sketch after this list):

    • Bias:
      \[ \text{bias}(\hat{\theta}_n) = E(\hat{\theta}_n) - \theta \]
    • Variance and Standard Error (se):
      \[ \text{se} = \sqrt{\operatorname{Var}(\hat{\theta}_n)} \]
    • Mean Squared Error (MSE):
      \[ \text{mse} = E\left[(\hat{\theta}_n - \theta)^2\right] = \text{bias}^2(\hat{\theta}_n) + \operatorname{Var}(\hat{\theta}_n) \]
    • Consistency:
      An estimator is consistent if
      \[ \hat{\theta}_n \xrightarrow{P} \theta \quad \text{as } n \to \infty \]
    • Asymptotic Normality:
      Many estimators satisfy
      \[ \frac{\hat{\theta}_n - \theta}{\text{se}} \approx N(0, 1) \]
      for large samples, which facilitates the construction of confidence intervals.
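These definitions can be checked by simulation. The following minimal sketch estimates the bias, standard error, and MSE of the Bernoulli sample mean \(\hat{p}_n\) over repeated samples; the values of \(p\), \(n\), and the number of repetitions are illustrative.

# Sketch: Monte Carlo check of bias, se, and mse for the Bernoulli sample mean
import numpy as np

p, n, reps = 0.3, 100, 10000
estimates = np.random.binomial(n, p, reps) / n  # hat{p}_n over many repetitions

bias = estimates.mean() - p
se = estimates.std()
mse = np.mean((estimates - p)**2)

print("bias:", bias, "se:", se, "mse:", mse)
# mse should be close to bias^2 + se^2 = p(1-p)/n = 0.0021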

2. Confidence Sets

  • Concept:
    A confidence interval (or set) is a range constructed from the data that, over many repetitions of the experiment, contains the true parameter \(\theta\) with a specified probability (the coverage).

  • Example (Normal-Based Interval):
    When \(\hat{\theta}_n\) is approximately normally distributed, an approximate \(1-\alpha\) confidence interval is \[ C_n = \left( \hat{\theta}_n - z_{\alpha/2}\,\text{se}, \quad \hat{\theta}_n + z_{\alpha/2}\,\text{se} \right), \] where \(z_{\alpha/2}\) is the quantile of the standard Normal distribution such that \(P(Z > z_{\alpha/2}) = \alpha/2\). A sketch computing such an interval follows.
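As a sketch, assuming the Bernoulli setting of Example 6.1 with illustrative values, a 95% normal-based interval can be computed as follows:

# Sketch: normal-based 95% confidence interval for a Bernoulli proportion
import numpy as np
from scipy.stats import norm

X = np.random.binomial(1, 0.7, 1000)  # illustrative data
p_hat = X.mean()
se = np.sqrt(p_hat * (1 - p_hat) / len(X))
z = norm.ppf(0.975)                   # z_{alpha/2} for alpha = 0.05

ci = (p_hat - z * se, p_hat + z * se)
print("95% CI for p:", ci)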

3. Hypothesis Testing

  • Concept:
    Hypothesis testing involves formulating a null hypothesis \(H_0\) (a default statement, such as a coin being fair) and an alternative hypothesis \(H_1\), then using the data to decide whether there is sufficient evidence to reject \(H_0\).

  • Example (Testing Coin Fairness):
    \[ H_0: p = 0.5 \quad \text{versus} \quad H_1: p \neq 0.5 \]

  • Process:

    • Define an appropriate test statistic (e.g., \(T = |\hat{p}_n - 0.5|\))
    • Set a significance level \(\alpha\)
    • Determine the rejection region based on \(\alpha\), or compute a p-value
    • Reject \(H_0\) if the test statistic falls in the rejection region (a sketch of this test follows below)
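Here is a minimal sketch of this process for the coin-fairness test, using the normal approximation to \(\hat{p}_n\) under \(H_0\); the simulated flips are illustrative.

# Sketch: two-sided test of H0: p = 0.5 via the normal approximation
import numpy as np
from scipy.stats import norm

X = np.random.binomial(1, 0.55, 1000)  # illustrative coin flips
p_hat = X.mean()

# Under H0, hat{p}_n is approximately N(0.5, 0.25/n)
z = (p_hat - 0.5) / np.sqrt(0.25 / len(X))
p_value = 2 * (1 - norm.cdf(abs(z)))

print("z:", z, "p-value:", p_value)
# Reject H0 at level alpha if p_value < alpha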

IV. Frequentist vs. Bayesian Inference

  • Frequentist Inference:
    • Treats parameters as fixed but unknown
    • Focuses on the properties of estimators over repeated sampling (e.g., confidence intervals, hypothesis tests)
  • Bayesian Inference:
    • Treats parameters as random variables with prior distributions
    • Uses Bayes’ theorem to update beliefs in light of new data, allowing direct probability statements about parameters (a sketch of a conjugate update follows this list)
  • Comparison:
    • Frequentist methods emphasize long-run frequency properties.
    • Bayesian methods provide a framework for incorporating prior knowledge and making probabilistic statements about parameters.
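As a sketch of the Bayesian update, assume a conjugate Beta(a, b) prior for a Bernoulli parameter; the prior hyperparameters and the simulated data below are illustrative assumptions.

# Sketch: conjugate Beta-Bernoulli posterior update
import numpy as np

a, b = 2, 2                         # illustrative Beta prior
X = np.random.binomial(1, 0.7, 50)  # illustrative data
s = X.sum()                         # number of successes

# Conjugacy: the posterior is Beta(a + s, b + (n - s))
a_post, b_post = a + s, b + (len(X) - s)
print("Prior mean:", a / (a + b))
print("Posterior mean:", a_post / (a_post + b_post))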

V. Additional Information

Bibliographic References

  • Elementary Level:
    • DeGroot and Schervish (2002)
    • Larsen and Marx (1986)
  • Intermediate Level:
    • Casella and Berger (2002)
    • Bickel and Doksum (2000)
    • Rice (1995)
  • Advanced Level:
    • Cox and Hinkley (2000)
    • Lehmann and Casella (1998)
    • Lehmann (1986)
    • van der Vaart (1998)

Exercises

  1. Poisson Estimation:
    For \(X_1, X_2, \dots, X_n \sim \text{Poisson}(\lambda)\) with the estimator
    \[ \hat{\lambda} = \frac{1}{n}\sum_{i=1}^n X_i, \]
    determine the bias, standard error, and mean squared error.

  2. Uniform Distribution Estimation (Method 1):
    For \(X_1, X_2, \dots, X_n \sim \text{Uniform}(0, \theta)\) and the estimator
    \[ \hat{\theta} = \max\{X_1, X_2, \dots, X_n\}, \]
    calculate the bias, standard error, and mse.

  3. Uniform Distribution Estimation (Method 2):
    For the same model with the estimator
    \[ \hat{\theta} = 2\bar{X}_n, \]
    where \(\bar{X}_n\) is the sample mean, compute the bias, standard error, and mse.


VI. Key Takeaways

  • Inference Fundamentals:
    Learning how to deduce properties of a population from a sample is central to statistics and machine learning.

  • Model Choice:

    • Parametric models are simpler but rely on strong assumptions.
    • Nonparametric models offer flexibility with fewer assumptions.
  • Estimator Evaluation:
    Properties such as bias, variance (or standard error), and mean squared error are essential in assessing the quality of estimators.

  • Confidence and Testing:

    • Confidence intervals quantify the uncertainty in estimates.
    • Hypothesis testing provides a formal framework for decision-making.
  • Philosophical Approaches:
    The frequentist and Bayesian paradigms provide different perspectives on probability and inference, influencing how uncertainty is quantified and interpreted.