R compatibility notes
This package aims to reproduce the numerical results of the R `moderndive`
and `infer` packages, not just their API. In several places that means making a
deliberate choice that differs from the “default” or “typical” Python/SciPy/
statsmodels behavior. Those choices are collected here so that each difference
is documented as intentional and is easy to find.
Statistics
- `pop_sd` divides by `n` (population SD, `ddof=0`), unlike the sample `sd` / `calculate(stat="sd")`, which use `n − 1` (as does the pandas `.std()` default). Note that `numpy.std` itself defaults to `ddof=0`, so it agrees with `pop_sd`. This matches R `moderndive::pop_sd`.
- `get_regression_summaries`: `mse` is the mean squared residual with `n` in the denominator (so `rmse = sqrt(mse)`), while `sigma` is the residual standard error, computed with `n − p`, where `p` counts the estimated coefficients including the intercept. Both match the R package; note that `mse` is not statsmodels’ `mse_resid`, which divides by `n − p`.
- GLM summaries use the log-likelihood-based BIC (`bic_llf`), not statsmodels’ deviance-based `bic`, to align with `broom::glance`.
- `get_p_value` computes a two-sided p-value as `2 × min(left, right)`, capped at 1. This is the `infer` convention, which can differ slightly from a symmetric-tail p-value.
- `F` and `Chisq` are treated as inherently one-sided (right tail) for p-values regardless of the `direction` argument, matching `infer`.
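
The denominators and the two-sided rule described above can be checked with plain NumPy. The sketch below is illustrative only: it does not call this package, and the toy data, the helper name `two_sided_p`, and the use of inclusive tails (`<=` / `>=`) are assumptions made for the example.

```python
import numpy as np

# Toy data (hypothetical): mean 5, sum of squared deviations 32.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_sd = np.std(x)             # NumPy default ddof=0: divide by n
sample_sd = np.std(x, ddof=1)  # divide by n - 1, as R's sd() does

# Simple linear fit to show the two residual-variance denominators.
t = np.arange(len(x), dtype=float)
X = np.column_stack([np.ones_like(t), t])    # intercept + slope, so p = 2
beta, *_ = np.linalg.lstsq(X, x, rcond=None)
resid = x - X @ beta
n, p = X.shape

mse = np.sum(resid**2) / n                   # n in the denominator
rmse = np.sqrt(mse)
sigma = np.sqrt(np.sum(resid**2) / (n - p))  # residual standard error


def two_sided_p(null_stats, obs):
    """Two-sided p-value as 2 * min(left, right), capped at 1."""
    null_stats = np.asarray(null_stats, dtype=float)
    left = np.mean(null_stats <= obs)   # assumed inclusive left tail
    right = np.mean(null_stats >= obs)  # assumed inclusive right tail
    return min(1.0, 2.0 * min(left, right))


null = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(pop_sd, sample_sd)       # the population SD is the smaller of the two
print(rmse, sigma)             # rmse is below sigma because n > n - p
print(two_sided_p(null, 1.5))  # right tail is 1/5, then doubled
print(two_sided_p(null, 0.0))  # 2 * 3/5 exceeds 1, so the result is capped
```

The last call shows why the cap matters: when the observed statistic sits at the center of a discrete null distribution, both tails can exceed one half.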
`prop_test` (mirrors R’s `prop.test`, not a plain z-test)
- Chi-square statistic by default (with a `chisq_df` column), like R’s `prop.test`. Pass `z=True` for the signed z-statistic that a “typical” Python two-proportion test would report.
- Yates’ continuity correction is on by default (`correct=True`), as in R.
- Confidence intervals match R: a Wilson score interval for one proportion (not the Wald interval `statsmodels` returns by default), and a Wald interval widened by the continuity correction for a two-proportion difference.
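
To see how the chi-square statistic, the signed z-statistic, and Yates’ correction relate, the following NumPy sketch works through a two-group example by hand. It does not call `prop_test`; the counts are made up, and the correction is written in the usual textbook form (shrinking each absolute deviation by at most 0.5), which is assumed here to be what R applies.

```python
import numpy as np

# Hypothetical counts: 45/100 successes in group A, 30/100 in group B.
successes = np.array([45.0, 30.0])
trials = np.array([100.0, 100.0])

# Signed pooled two-proportion z-statistic.
p_hat = successes / trials
p_pool = successes.sum() / trials.sum()
se = np.sqrt(p_pool * (1 - p_pool) * np.sum(1 / trials))
z = (p_hat[0] - p_hat[1]) / se

# Pearson chi-square on the equivalent 2x2 table (successes, failures).
observed = np.column_stack([successes, trials - successes])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
diff = np.abs(observed - expected)
chisq_plain = np.sum(diff**2 / expected)

# Yates' correction shrinks each |observed - expected| by 0.5 (never below 0).
yates = np.minimum(0.5, diff)
chisq_yates = np.sum((diff - yates) ** 2 / expected)

print(z, z**2, chisq_plain)  # z squared equals the uncorrected chi-square
print(chisq_yates)           # smaller once the correction is applied
```

The uncorrected chi-square carries no sign, which is why a signed z-statistic is offered separately.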
Correlation
- `get_correlation` drops nulls by default (`na_rm=True`) so beginners get a number rather than `nan`. R’s `na.rm` defaults to `FALSE`; pass `na_rm=False` to match R exactly.
- `method="spearman"` / `"kendall"` use the SciPy implementations and match R’s `cor(method=...)`.
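
The effect of dropping nulls can be illustrated with NumPy alone. This is a sketch of the general idea (keep only complete pairs, then correlate), not the package’s own code, and the arrays are invented for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# With a missing value present, the correlation itself is NaN.
r_with_nan = np.corrcoef(x, y)[0, 1]

# Keep only rows where both values are present, then correlate.
complete = ~(np.isnan(x) | np.isnan(y))
r_complete = np.corrcoef(x[complete], y[complete])[0, 1]

print(r_with_nan)   # nan
print(r_complete)   # finite, computed from the four complete pairs
```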
Regression points / tables
- In-formula transformations are reshaped to match R: a transformed outcome like `np.log(mpg)` is shown on the model scale as `log_mpg` / `log_mpg_hat`, and transformed predictors (`poly`, `scale`, `I`) show their original columns rather than the patsy basis matrix.
- `get_regression_table` prettifies categorical term names (`income[T.High]` → `income: High`) by default; `default_categorical_levels=True` keeps the raw statsmodels names.
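
The renaming of categorical terms can be pictured as a small string rewrite. The helper below is hypothetical (the name `prettify_term` and the regular expression are assumptions, not this package’s implementation); it only demonstrates the `income[T.High]` → `income: High` mapping on treatment-coded names.

```python
import re

# Hypothetical helper, not this package's implementation: rewrite
# treatment-coded term names such as "income[T.High]" to "income: High".
_TREATMENT_LEVEL = re.compile(r"\[T\.([^\]]+)\]")


def prettify_term(name: str) -> str:
    """Replace each "[T.level]" suffix with ": level"; leave other names alone."""
    return _TREATMENT_LEVEL.sub(r": \1", name)


for raw in ["Intercept", "income[T.High]", "np.log(mpg)"]:
    print(raw, "->", prettify_term(raw))
```

Names without a `[T.…]` suffix pass through unchanged, so numeric predictors and the intercept are unaffected.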
Datasets
- Datetime columns are stored in UTC. R’s nycflights `time_hour` is stored in `America/New_York`; the bundled Parquet stores the identical instants in UTC, so a displayed hour differs by the UTC offset (the integer `hour` column matches R).
- `early_january_2023_weather` is derived from `weather`. The R dataset ships `temp`/`dewp`/`humid`/`pressure` as all-NA; this package recomputes the table from `weather` (Newark, first 15 days of Jan 2023) so those columns hold real values.
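
The point about identical instants with different displayed hours can be shown with the standard library. The sketch assumes a January timestamp, when New York is on Eastern Standard Time (UTC−5), and uses a fixed offset rather than a full time zone database; the specific timestamp is invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: a January timestamp, when New York is on
# Eastern Standard Time (UTC-5). A fixed offset keeps the example free of
# any time zone database dependency.
EST = timezone(timedelta(hours=-5), "EST")

stored_utc = datetime(2023, 1, 1, 5, 0, tzinfo=timezone.utc)
shown_in_ny = stored_utc.astimezone(EST)

print(stored_utc.isoformat())     # 2023-01-01T05:00:00+00:00
print(shown_in_ny.isoformat())    # 2023-01-01T00:00:00-05:00
print(stored_utc == shown_in_ny)  # True: same instant, different wall clock
```

In real use a named zone such as `America/New_York` would also handle daylight saving time, when the offset is four hours rather than five.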
Plotting
- plotly is the default engine (the book is moving to interactive plots); pass `engine="plotnine"` anywhere for grammar-of-graphics output. R returns ggplot2 objects.
- `visualize`’s default `bins` is 20 (R’s is 15).
- Two-sided p-value shading mirrors the observed statistic about 0, matching `infer`’s `shade_p_value`.
Reproducibility
- Pass `seed=` to `generate()` / `rep_slice_sample()` for reproducible draws (R uses `set.seed()`). Identical seeds will not reproduce R’s exact random draws: only the statistical behavior matches, not the specific RNG stream.
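
What seeding does and does not guarantee can be sketched with NumPy’s `default_rng`. This example does not call `generate()` or `rep_slice_sample()`; whether the package’s `seed=` is implemented with this generator is an assumption, and `bootstrap_means` is a hypothetical helper written for the illustration.

```python
import numpy as np

data = np.array([3.1, 4.7, 5.0, 6.2, 7.9])


def bootstrap_means(values, reps, seed):
    """Means of `reps` bootstrap resamples drawn with a seeded generator."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(values), size=(reps, len(values)))
    return values[idx].mean(axis=1)


a = bootstrap_means(data, reps=1000, seed=42)
b = bootstrap_means(data, reps=1000, seed=42)
c = bootstrap_means(data, reps=1000, seed=7)

print(np.array_equal(a, b))  # True: same seed, same draws
print(np.array_equal(a, c))  # different seed, different draws
print(a.mean(), c.mean())    # both close to the sample mean of `data`
```

The same reasoning applies across languages: a different generator, like a different seed, changes the individual draws while leaving the distribution of the resampled statistic essentially the same.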