R compatibility notes

This package aims to reproduce the numerical results of the R moderndive and infer packages, not just their API. In several places that means making a deliberate choice that differs from the “default” or “typical” Python/SciPy/statsmodels behavior. These choices are collected here so the differences are intentional and discoverable.

Statistics

  • pop_sd divides by n (population SD, ddof=0), unlike the sample sd/calculate(stat="sd"), which divides by n − 1 (ddof=1). This matches R’s moderndive::pop_sd. Note that numpy.std also defaults to ddof=0, while pandas’ std defaults to ddof=1.

  • get_regression_summaries: mse is the mean squared residual with n in the denominator (so rmse = sqrt(mse)), while sigma is the residual standard error with n − p in the denominator, where p is the number of estimated coefficients. Both match the R package; note that mse is not statsmodels’ mse_resid (which uses n − p).

  • GLM summaries use the log-likelihood-based BIC (bic_llf), not statsmodels’ deviance-based bic, to align with broom::glance.

  • get_p_value computes a two-sided p-value as 2 × min(left, right), capped at 1. This is the infer convention, and it can differ slightly from a symmetric-tail p-value.

  • F and Chisq are treated as inherently one-sided (right tail) for p-values regardless of the direction argument, matching infer.
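
The denominator choices and the two-sided rule above can be illustrated with plain NumPy. This is a minimal sketch, not the package's implementation: the data are made up, two_sided_p_value is a hypothetical helper, and the inclusive tail definitions (<= and >=) are an assumption.

```python
import numpy as np

# Population vs. sample standard deviation on made-up data.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
pop_sd = np.std(x, ddof=0)      # divides by n
sample_sd = np.std(x, ddof=1)   # divides by n - 1

# mse (denominator n) vs. sigma (denominator n - p) for a straight-line fit.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(xs), xs])   # intercept + slope, so p = 2
beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
resid = ys - X @ beta
n, p = X.shape
rss = float(np.sum(resid ** 2))
mse = rss / n
rmse = np.sqrt(mse)
sigma = np.sqrt(rss / (n - p))


def two_sided_p_value(null_stats, obs_stat):
    """Hypothetical helper: 2 * min(left, right), capped at 1."""
    null_stats = np.asarray(null_stats, dtype=float)
    left = np.mean(null_stats <= obs_stat)    # assumed inclusive left tail
    right = np.mean(null_stats >= obs_stat)   # assumed inclusive right tail
    return float(min(1.0, 2.0 * min(left, right)))
```

Because n − p is smaller than n, sigma is always at least as large as rmse for the same residuals. The cap matters when the observed statistic sits near the middle of the null distribution, where both tails exceed one half.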

prop_test (mirrors R’s prop.test, not a plain z-test)

  • Chi-square statistic by default (with a chisq_df column), like R’s prop.test. Pass z=True for the signed z-statistic that a “typical” Python two-proportion test would report.

  • Yates’ continuity correction is on by default (correct=True), as in R.

  • Confidence intervals match R: a Wilson score interval for one proportion (not the Wald interval statsmodels returns by default), and a Wald interval widened by the continuity correction for a two-proportion difference.
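
To show why the interval type matters for a single proportion, here is a minimal sketch comparing the Wald and Wilson score intervals using only the standard library. The function names are hypothetical, and the continuity correction is deliberately omitted, so the Wilson numbers should correspond to R's prop.test with correct=FALSE rather than to the default.

```python
import math
from statistics import NormalDist


def wald_interval(x, n, conf_level=0.95):
    """Textbook Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(x, n, conf_level=0.95):
    """Wilson score interval without continuity correction."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p_hat = x / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


wald_lo, wald_hi = wald_interval(8, 10)
wilson_lo, wilson_hi = wilson_interval(8, 10)
```

With 8 successes out of 10, the Wald upper limit exceeds 1, while the Wilson interval stays inside [0, 1]. With 0 successes, the Wald interval collapses to a single point but the Wilson interval does not.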

Correlation

  • get_correlation drops nulls by default (na_rm=True) so beginners get a number rather than nan. R’s na.rm defaults to FALSE; pass na_rm=False to match R exactly.

  • method="spearman"/"kendall" use the SciPy implementations and match R’s cor(method=...).
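
The effect of the na_rm default can be sketched with NumPy alone. This is an illustration under the assumption that na_rm=True drops every row in which either value is missing; it does not call get_correlation itself.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Keeping the missing value propagates nan, as R does with na.rm = FALSE.
raw = np.corrcoef(x, y)[0, 1]

# Dropping incomplete pairs first yields a usable number.
keep = ~(np.isnan(x) | np.isnan(y))
clean = np.corrcoef(x[keep], y[keep])[0, 1]
```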

Regression points / tables

  • In-formula transformations are reshaped to match R: a transformed outcome like np.log(mpg) is shown on the model scale as log_mpg/log_mpg_hat, and transformed predictors (poly, scale, I) show their original columns rather than the patsy basis matrix.

  • get_regression_table prettifies categorical term names (income[T.High] → income: High) by default; default_categorical_levels=True keeps the raw statsmodels names.
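
The renaming of categorical terms can be sketched with a regular expression. prettify_term is a hypothetical helper written for this note, not the package's own code, and it only handles the treatment-coded "[T.level]" pattern.

```python
import re

# Matches patsy-style treatment-coded terms such as "income[T.High]".
_TREATMENT_LEVEL = re.compile(r"(\w+)\[T\.([^\]]+)\]")


def prettify_term(name: str) -> str:
    """Rewrite "var[T.level]" as "var: level"; leave other names untouched."""
    return _TREATMENT_LEVEL.sub(r"\1: \2", name)


pretty = prettify_term("income[T.High]")
```

Terms without a level suffix, such as Intercept or a numeric predictor, pass through unchanged.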

Datasets

  • Datetime columns are stored in UTC. R’s nycflights time_hour is stored in America/New_York; the bundled Parquet stores the identical instants in UTC, so a displayed hour differs by the UTC offset (the integer hour column matches R).

  • early_january_2023_weather is derived from weather. The R dataset ships temp/dewp/humid/pressure as all-NA; this package recomputes the table from weather (Newark, first 15 days of Jan 2023) so those columns hold real values.
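
The UTC note can be made concrete with the standard library. The sketch below uses a fixed UTC−5 offset as a stand-in for America/New_York in January, which is a simplification (it ignores daylight saving time), and the timestamp is made up.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for New York in winter (assumption: no DST handling).
EST = timezone(timedelta(hours=-5), "EST")

stored_utc = datetime(2023, 1, 1, 5, 0, tzinfo=timezone.utc)
shown_local = stored_utc.astimezone(EST)

same_instant = stored_utc == shown_local                # aware datetimes compare by instant
hour_difference = stored_utc.hour - shown_local.hour    # displayed hours differ by the offset
```

Both values denote the same instant; only the displayed hour changes, which is why the integer hour column is the one to compare against R.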

Plotting

  • plotly is the default engine (the book is moving to interactive plots); pass engine="plotnine" anywhere for grammar-of-graphics output. R returns ggplot2 objects.

  • visualize’s default bins is 20 (R’s is 15).

  • Two-sided p-value shading mirrors the observed statistic about 0, matching infer’s shade_p_value.

Reproducibility

  • Pass seed= to generate() / rep_slice_sample() for reproducible draws (R uses set.seed()). Identical seeds will not reproduce R’s exact random draws: only the statistical behavior matches, not the specific RNG stream.
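
The seed behavior can be sketched with the standard library's random module. rep_sample_sketch is a hypothetical stand-in for repeated sampling without replacement; it is not the package's generator and makes no claim about which RNG the package uses.

```python
import random


def rep_sample_sketch(values, n, reps, seed=None):
    """Draw `reps` samples of size `n` without replacement (hypothetical helper)."""
    rng = random.Random(seed)   # local generator, so global state is untouched
    return [rng.sample(values, n) for _ in range(reps)]


population = list(range(100))
first = rep_sample_sketch(population, n=5, reps=3, seed=42)
second = rep_sample_sketch(population, n=5, reps=3, seed=42)
```

Calling the function twice with the same seed returns identical samples within Python. The same seed value passed to R's set.seed() would still produce different draws, because the underlying generators differ.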