# R compatibility notes

This package aims to reproduce the **numerical results** of the R `moderndive`
and `infer` packages, not just their API. In several places that means making a
deliberate choice that differs from the "default" or "typical" behavior of
Python, SciPy, or statsmodels. Those choices are collected here so that each
difference is documented as intentional and is easy to find.

## Statistics

- **`pop_sd` divides by `n`** (population SD, `ddof=0`), unlike the sample
  `sd`/`calculate(stat="sd")`, which use `n − 1`. This matches R
  `moderndive::pop_sd`. (Note that `numpy.std` also defaults to `ddof=0`, so it
  agrees with `pop_sd`, not with the sample SD.)
- **`get_regression_summaries`**: `mse` is the mean squared residual using `n`
  in the denominator (so `rmse = sqrt(mse)`), while `sigma` is the residual
  standard error using `n − p`. Both match the R package; note `mse` is *not*
  statsmodels' `mse_resid` (which uses `n − p`).
- **GLM summaries** use the log-likelihood-based BIC (`bic_llf`), not
  statsmodels' deviance-based `bic`, to align with `broom::glance`.
- **`get_p_value` two-sided** = `2 × min(left, right)` capped at 1 (the `infer`
  convention). This can differ slightly from a symmetric-tail p-value when the
  null distribution is not symmetric.
- **`F` and `Chisq`** are treated as inherently one-sided (right tail) for
  p-values regardless of the `direction` argument, matching `infer`.
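
The denominators and the two-sided rule above can be checked by hand. The following is a minimal sketch using plain NumPy rather than this package's functions; the helper `two_sided_p_value` is hypothetical, and its inclusive comparisons (`<=`, `>=`) are an assumption about how the tails are counted.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population SD divides by n; sample SD divides by n - 1.
pop_sd = np.std(x, ddof=0)
sample_sd = np.std(x, ddof=1)

# Toy residuals from a model with p = 2 coefficients (intercept + slope).
residuals = np.array([1.0, -1.0, 2.0, -2.0])
n, p = residuals.size, 2
rss = float(np.sum(residuals**2))

mse = rss / n                    # mean squared residual, denominator n
rmse = mse**0.5                  # so rmse is exactly sqrt(mse)
sigma = (rss / (n - p)) ** 0.5   # residual standard error, denominator n - p


def two_sided_p_value(null_stats, obs_stat):
    """Hypothetical helper: 2 * min(left tail, right tail), capped at 1."""
    null_stats = np.asarray(null_stats, dtype=float)
    left = float(np.mean(null_stats <= obs_stat))   # assumed inclusive tail
    right = float(np.mean(null_stats >= obs_stat))  # assumed inclusive tail
    return min(1.0, 2.0 * min(left, right))
```

With an observed statistic in the middle of the null distribution, both tails exceed 0.5, which is why the cap at 1 is needed.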

## `prop_test` (mirrors R's `prop.test`, not a plain z-test)

- **Chi-square statistic by default** (with a `chisq_df` column), like R's
  `prop.test`. Pass `z=True` for the signed z-statistic that a "typical" Python
  two-proportion test would report.
- **Yates' continuity correction** is on by default (`correct=True`), as in R.
- **Confidence intervals** match R: a **Wilson score** interval for one
  proportion (not the Wald interval `statsmodels` returns by default), and a
  Wald interval **widened by the continuity correction** for a two-proportion
  difference.
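
To see why the interval type matters, here is a rough sketch of the textbook Wald and Wilson score formulas for one proportion, written with the standard library only. The function names are hypothetical, and the sketch applies no continuity correction, so it is not expected to reproduce the default `prop_test` output digit for digit.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, conf_level=0.95):
    """Textbook Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(successes, n, conf_level=0.95):
    """Textbook Wilson score interval, without continuity correction."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

For 15 successes out of 50, the Wilson interval is pulled toward 0.5 relative to the Wald interval, so the two disagree at both endpoints.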

## Correlation

- **`get_correlation` drops nulls by default** (`na_rm=True`) so beginners get a
  number rather than `nan`. R's `na.rm` defaults to `FALSE`; pass `na_rm=False`
  to match R exactly.
- `method="spearman"`/`"kendall"` use the SciPy implementations and match R's
  `cor(method=...)`.
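
The effect of dropping nulls can be illustrated with plain NumPy. This is a sketch of the general idea (pairwise-complete observations), not the package's implementation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Without removing missing values, the correlation is nan
# (the analogue of R returning NA when na.rm = FALSE).
r_with_missing = np.corrcoef(x, y)[0, 1]

# Keep only the rows where both values are present, then correlate.
keep = ~(np.isnan(x) | np.isnan(y))
r_complete = np.corrcoef(x[keep], y[keep])[0, 1]
```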

## Regression points / tables

- **In-formula transformations** are reshaped to match R: a transformed outcome
  like `np.log(mpg)` is shown on the model scale as `log_mpg`/`log_mpg_hat`, and
  transformed predictors (`poly`, `scale`, `I`) show their original columns
  rather than the patsy basis matrix.
- **`get_regression_table`** prettifies categorical term names
  (`income[T.High]` → `income: High`) by default; `default_categorical_levels=True`
  keeps the raw statsmodels names.
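
The renaming of categorical terms can be pictured as a simple pattern rewrite. The helper below is hypothetical and only handles the single-term treatment-coding pattern shown above; the package's own logic may cover more cases, such as interactions.

```python
import re

# Matches statsmodels/patsy treatment-coded names such as "income[T.High]".
_TREATMENT_TERM = re.compile(r"^(?P<var>[^\[\]]+)\[T\.(?P<level>[^\[\]]+)\]$")


def prettify_term(term):
    """Hypothetical helper: 'income[T.High]' -> 'income: High'."""
    match = _TREATMENT_TERM.match(term)
    if match is None:
        return term  # leave non-categorical terms (e.g. "Intercept") as-is
    return f"{match['var']}: {match['level']}"
```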

## Datasets

- **Datetime columns are stored in UTC.** R's nycflights `time_hour` is stored in
  `America/New_York`; the bundled Parquet stores the identical instants in UTC, so
  a displayed hour differs by the UTC offset (the integer `hour` column matches R).
- **`early_january_2023_weather` is derived from `weather`.** The R dataset ships
  `temp`/`dewp`/`humid`/`pressure` as all-`NA`; this package recomputes the table
  from `weather` (Newark, first 15 days of Jan 2023) so those columns hold real
  values.
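
The UTC note is easiest to see with one timestamp. This sketch uses only the standard library and, as a stated simplification, a fixed UTC−5 offset to stand in for `America/New_York` in January (Eastern Standard Time); a real conversion would use the full time zone so daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset, valid for New York in January (EST) only.
EST = timezone(timedelta(hours=-5), name="EST")

# The instant as stored in UTC...
time_hour_utc = datetime(2023, 1, 1, 5, 0, tzinfo=timezone.utc)

# ...and the same instant displayed on New York wall-clock time.
time_hour_local = time_hour_utc.astimezone(EST)
```

The two objects compare equal because they denote the same instant; only the displayed hour (5 versus 0) differs.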

## Plotting

- **plotly is the default engine** (the book is moving to interactive plots);
  pass `engine="plotnine"` anywhere for grammar-of-graphics output. R returns
  ggplot2 objects.
- `visualize`'s default `bins` is 20 (R's is 15).
- Two-sided p-value shading mirrors the observed statistic about 0, matching
  `infer`'s `shade_p_value`.

## Reproducibility

- Pass `seed=` to `generate()` / `rep_slice_sample()` for reproducible draws
  (R uses `set.seed()`). Identical seeds will **not** reproduce R's exact random
  draws — only the statistical behavior matches, not the specific RNG stream.
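
As an illustration of what seeding buys on the Python side, the sketch below uses NumPy's `default_rng` in a hypothetical `resample_means` helper. Whether this package uses the same generator internally is an assumption; the point is only that a fixed seed makes repeated Python runs agree with each other, not with R.

```python
import numpy as np


def resample_means(values, reps, seed=None):
    """Hypothetical helper: bootstrap means with an optional seed."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw `reps` resamples (with replacement), each the size of the data.
    idx = rng.integers(0, values.size, size=(reps, values.size))
    return values[idx].mean(axis=1)


first = resample_means([1, 2, 3, 4, 5], reps=10, seed=42)
second = resample_means([1, 2, 3, 4, 5], reps=10, seed=42)
```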