R compatibility notes

This package aims to reproduce the numerical results of the R moderndive and infer packages, not just their API. In several places that means making a deliberate choice that differs from the “default” or “typical” Python/SciPy/statsmodels behavior. Those choices are collected here so that the differences are intentional and discoverable.

Statistics

pop_sd divides by n (population SD, ddof=0), unlike the sample sd/calculate(stat="sd"), which divide by n − 1. This matches R’s moderndive::pop_sd. Note that numpy.std also defaults to ddof=0; it is pandas’ std that defaults to n − 1.
get_regression_summaries: mse is the mean squared residual using n in the denominator (so rmse = sqrt(mse)), while sigma is the residual standard error using n − p. Both match the R package; note mse is not statsmodels’ mse_resid (which uses n − p).
GLM summaries use the log-likelihood-based BIC (bic_llf), not statsmodels’ deviance-based bic, to align with broom::glance.
get_p_value computes a two-sided p-value as 2 × min(left, right), capped at 1. This is the infer convention, which can differ slightly from a symmetric-tail p-value.
F and Chisq are treated as inherently one-sided (right tail) for p-values regardless of the direction argument, matching infer.
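
The denominator and tail conventions above can be checked by hand. The following is a minimal sketch using only the Python standard library, not this package's code: the helper names `mse_and_sigma` and `two_sided_p_value` are hypothetical, and counting ties with `<=`/`>=` in the one-sided tail proportions is an assumption.

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population SD divides the sum of squared deviations by n (ddof=0).
pop_sd = statistics.pstdev(x)
# Sample SD divides by n - 1 (ddof=1), as R's sd() does.
sample_sd = statistics.stdev(x)


def mse_and_sigma(residuals, n_params):
    """Return (mse, sigma) for residuals of a model with n_params coefficients.

    Hypothetical helper for illustration only.
    """
    n = len(residuals)
    ssr = sum(r * r for r in residuals)
    mse = ssr / n                            # n in the denominator
    sigma = math.sqrt(ssr / (n - n_params))  # n - p in the denominator
    return mse, sigma


def two_sided_p_value(null_stats, obs_stat):
    """Return 2 * min(left, right), capped at 1 (hypothetical helper)."""
    n = len(null_stats)
    left = sum(s <= obs_stat for s in null_stats) / n   # assumed tie handling
    right = sum(s >= obs_stat for s in null_stats) / n  # assumed tie handling
    return min(1.0, 2 * min(left, right))


mse, sigma = mse_and_sigma([1.0, -1.0, 2.0, -2.0, 0.0, 0.0], n_params=2)
p_tail = two_sided_p_value([-2, -1, 0, 1, 2], obs_stat=2)
p_center = two_sided_p_value([-2, -1, 0, 1, 2], obs_stat=0)
```

The cap matters for `p_center`: doubling the smaller tail proportion gives 1.2, which is reported as 1.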

`prop_test` (mirrors R’s `prop.test`, not a plain z-test)

Chi-square statistic by default (with a chisq_df column), like R’s prop.test. Pass z=True for the signed z-statistic that a “typical” Python two-proportion test would report.
Yates’ continuity correction is on by default (correct=True), as in R.
Confidence intervals match R: a Wilson score interval for one proportion (not the Wald interval statsmodels returns by default), and a Wald interval widened by the continuity correction for a two-proportion difference.
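
To make the one-proportion case concrete, here is a rough sketch of the textbook formulas behind these defaults, written with the standard library only. It is not this package's implementation: the function names are hypothetical, the chi-square formula reflects my understanding of how R's prop.test applies Yates' correction for a single proportion, and the Wilson interval shown leaves out the continuity correction for brevity.

```python
import math
from statistics import NormalDist


def one_sample_chisq(x, n, p=0.5, correct=True):
    """Chi-square statistic (df = 1), p-value, and signed z for H0: prop == p."""
    diff = x - n * p
    # Yates' correction shrinks |x - n*p| by at most 0.5, never past zero.
    yates = min(0.5, abs(diff)) if correct else 0.0
    stat = (abs(diff) - yates) ** 2 / (n * p * (1 - p))
    # For df = 1, P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    # The signed z-statistic is the square root of the statistic with the
    # sign of the observed difference.
    z = math.copysign(math.sqrt(stat), diff)
    return stat, p_value, z


def wilson_interval(x, n, conf_level=0.95):
    """Wilson score interval WITHOUT continuity correction."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def wald_interval(x, n, conf_level=0.95):
    """Wald (normal-approximation) interval, shown for comparison."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


stat, p_value, z = one_sample_chisq(60, 100)                    # corrected
stat_u, p_value_u, z_u = one_sample_chisq(60, 100, correct=False)
wilson_lo, wilson_hi = wilson_interval(60, 100)
wald_lo, wald_hi = wald_interval(60, 100)
```

With 60 successes in 100 trials, the corrected statistic is (|60 − 50| − 0.5)² / 25 = 3.61, while the uncorrected one is 4 (z = 2). The Wilson interval is pulled slightly toward 0.5 relative to the Wald interval.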

Correlation

get_correlation drops nulls by default (na_rm=True) so beginners get a number rather than nan. R’s na.rm defaults to FALSE; pass na_rm=False to match R exactly.
method="spearman"/"kendall" use the SciPy implementations and match R’s cor(method=...).
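
The effect of the na_rm default can be illustrated with a small standalone sketch. The `pearson` helper below is hypothetical and written in plain Python; it only mimics the pairwise null-dropping described above and is not how get_correlation is implemented.

```python
import math


def pearson(x, y, na_rm=True):
    """Pearson correlation; optionally drop pairs where either value is NaN."""
    pairs = list(zip(x, y))
    if na_rm:
        pairs = [(a, b) for a, b in pairs
                 if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, math.nan, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

r_dropped = pearson(x, y)               # the NaN pair is removed first
r_strict = pearson(x, y, na_rm=False)   # NaN propagates, as in R's default
```

With the missing pair dropped, the remaining points lie on a line and the correlation is 1 up to floating-point error; with `na_rm=False` the result is `nan`.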

Regression points / tables

In-formula transformations are reshaped to match R: a transformed outcome like np.log(mpg) is shown on the model scale as log_mpg/log_mpg_hat, and transformed predictors (poly, scale, I) show their original columns rather than the patsy basis matrix.
get_regression_table prettifies categorical term names (income[T.High] → income: High) by default; default_categorical_levels=True keeps the raw statsmodels names.
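
The term-name rewrite can be pictured as a simple pattern substitution. The sketch below is an assumption about the shape of the transformation, using a hypothetical `prettify_term` helper; it handles only a single treatment-coded term and says nothing about how the package treats interactions or other contrasts.

```python
import re

# Matches names such as "income[T.High]": a variable name followed by a
# treatment-coded level in square brackets.
_TREATMENT_TERM = re.compile(r"^(?P<var>.+?)\[T\.(?P<level>.+)\]$")


def prettify_term(term):
    """Rewrite 'var[T.level]' as 'var: level'; leave other names untouched."""
    match = _TREATMENT_TERM.match(term)
    if match is None:
        return term
    return f"{match['var']}: {match['level']}"


raw_terms = ["Intercept", "income[T.High]", "age"]
pretty_terms = [prettify_term(t) for t in raw_terms]
```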

Datasets

Datetime columns are stored in UTC. R’s nycflights time_hour is stored in America/New_York; the bundled Parquet stores the identical instants in UTC, so a displayed hour differs by the UTC offset (the integer hour column matches R).
early_january_2023_weather is derived from weather. The R dataset ships temp/dewp/humid/pressure as all-NA; this package recomputes the table from weather (Newark, first 15 days of Jan 2023) so those columns hold real values.
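
The time-zone point is easiest to see with one instant shown in both zones. This sketch uses only the standard library and a made-up timestamp, not a row from the bundled data. It assumes a fixed UTC−5 offset, which holds for New York in January; real code would more likely use `zoneinfo.ZoneInfo("America/New_York")` so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone

# Eastern Standard Time as a fixed offset (valid for January only).
EST = timezone(timedelta(hours=-5), "EST")

# Midnight on 1 January 2023 in New York (illustrative value).
local = datetime(2023, 1, 1, 0, 0, tzinfo=EST)

# The same instant expressed in UTC.
as_utc = local.astimezone(timezone.utc)

same_instant = local == as_utc            # aware datetimes compare by instant
hour_shift = as_utc.hour - local.hour     # displayed hours differ by the offset
```

The two values compare equal because they denote the same instant; only the displayed hour changes (0 in New York, 5 in UTC).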

Plotting

plotly is the default engine (the book is moving to interactive plots); pass engine="plotnine" anywhere for grammar-of-graphics output. R returns ggplot2 objects.
visualize’s default bins is 20 (R’s is 15).
Two-sided p-value shading mirrors the observed statistic about 0, matching infer’s shade_p_value.

Reproducibility

Pass seed= to generate() / rep_slice_sample() for reproducible draws (R uses set.seed()). Identical seeds will not reproduce R’s exact random draws: the statistical behavior matches, but the specific RNG stream does not.
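
What a seed does and does not guarantee can be sketched with the standard library's `random` module. The `bootstrap_means` helper is hypothetical, and the generator this package actually uses is not stated here, so treat this purely as an illustration of the idea of seeding a local generator.

```python
import random


def bootstrap_means(values, reps, seed=None):
    """Return `reps` bootstrap-resample means drawn from a seeded generator."""
    rng = random.Random(seed)  # local generator; global state is untouched
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(reps)]


data = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4]

run_a = bootstrap_means(data, reps=5, seed=2023)
run_b = bootstrap_means(data, reps=5, seed=2023)  # same seed, same draws
```

Two runs with the same seed return identical lists within one generator implementation. A different generator, such as R's, produces a different stream from the same seed value.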