API reference

Inference grammar

moderndive.specify(data, *, response=None, explanatory=None, formula=None, success=None)

Specify the response (and optional explanatory) variable(s) for inference.

Two ways to name the variables, mirroring R infer: keyword arguments or a formula.

specify(df, response="weight")
specify(df, formula="popular_or_not ~ track_genre", success="popular")

Multi-term formulas ("y ~ a + b") are supported for fit().

Return type:

Specification

Parameters:

data (DataFrame)
response (str | None)
explanatory (str | None)
formula (str | None)
success (object | None)

moderndive.observe(data, *, response=None, explanatory=None, formula=None, success=None, stat='mean', order=None, null=None, mu=None, med=None, p=None, sigma=None)

Shortcut for specify() |> [hypothesize()] |> calculate().

Mirrors R infer::observe() — compute an observed statistic in one call.

Return type:

ObservedStatistic

Parameters:

data (DataFrame)
response (str | None)
explanatory (str | None)
formula (str | None)
success (object | None)
stat (str)
order (tuple[object, object] | None)
null (str | None)
mu (float | None)
med (float | None)
p (float | dict | None)
sigma (float | None)

moderndive.assume(distribution, df=None)

Set a theoretical distribution ("t", "z", "F", "Chisq").

df is the degrees of freedom: a scalar for t/Chisq, a (df1, df2) tuple for F, and unused for z.

Return type:

TheoreticalDistribution

Parameters:

distribution (str)
df (object | None)

class moderndive.infer.core.Specification(data, response, explanatory=None, success=None, formula=None)

Result of specify(): the chosen response (+ optional explanatory).

Parameters:

data (DataFrame)
response (str)
explanatory (str | None)
success (object | None)
formula (str | None)

calculate(stat, *, order=None, mu=None, p=None, sigma=None)

Compute the observed statistic (no resampling).

Return type:

ObservedStatistic

Parameters:

order (tuple[object, object] | None)
mu (float | None)
p (float | dict | None)
sigma (float | None)

assume(distribution, df=None)

Set a theoretical sampling distribution (see assume()).

Parameters:

distribution (str)
df (object | None)

fit()

Fit the observed regression (ordinary least squares) for the formula.

Return type: FitResult

class moderndive.infer.core.Hypothesis(spec, null, mu=None, med=None, p=None, sigma=None)

Parameters:

spec (Specification)
null (str)
mu (float | None)
med (float | None)
p (float | dict | None)
sigma (float | None)

calculate(stat, *, order=None)

Observed statistic that needs the hypothesized value (e.g. stat="t" or stat="z").

Return type: ObservedStatistic
Parameters: order (tuple[object, object] | None)

class moderndive.infer.core.GeneratedReplicates(spec, type, null, plans, shifted_response=None, hyp_mu=None, hyp_p=None, hyp_sigma=None, variables=None)

Materialized resampling plan; supports both calculate() and fit().

plans holds one numpy array per replicate:

bootstrap: row indices (with replacement) into the data
permute: a permutation of the row positions (used to shuffle a column)
draw: a simulated response array under a point null proportion

Parameters:

spec (Specification)
type (str)
null (str | None)
plans (list[ndarray])
shifted_response (ndarray | None)
hyp_mu (float | None)
hyp_p (float | dict | None)
hyp_sigma (float | None)
variables (str | None)
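
As a rough illustration of the three plan types described for plans, the sketch below builds one plan of each kind with the standard library only. The helper names (bootstrap_plan, permute_plan, draw_plan), the "yes"/"no" labels, and the use of plain lists instead of numpy arrays are assumptions made for this example; this is not the library's internal code.

```python
import random


def bootstrap_plan(n_rows, rng):
    """Row indices drawn with replacement, one per original row."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def permute_plan(n_rows, rng):
    """A permutation of the row positions, used to shuffle one column."""
    positions = list(range(n_rows))
    rng.shuffle(positions)
    return positions


def draw_plan(n_rows, p, rng, success="yes", failure="no"):
    """A simulated response under a point null proportion p (labels are hypothetical)."""
    return [success if rng.random() < p else failure for _ in range(n_rows)]


rng = random.Random(42)
plans = {
    "bootstrap": bootstrap_plan(5, rng),
    "permute": permute_plan(5, rng),
    "draw": draw_plan(5, 0.5, rng),
}
```

A bootstrap plan may repeat a row index, a permute plan contains every position exactly once, and a draw plan ignores the observed response entirely.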

fit()

Fit OLS on each replicate, returning per-term estimates (long form).

Return type: FitResult

class moderndive.infer.core.Distribution(data, stat, null=None, type='bootstrap')

A simulated distribution of statistics (one row per replicate).

Parameters:

data (DataFrame)
stat (str)
null (str | None)
type (str)

class moderndive.infer.core.FitResult(data, null=None, type='bootstrap')

Regression coefficients: observed (one row per term) or a distribution of them across replicates (replicate, term, estimate).

Parameters:

data (DataFrame)
null (str | None)
type (str)

class moderndive.infer.theoretical.TheoreticalDistribution(distribution, df=None)

A named theoretical distribution (from assume()).

Parameters:

distribution (str)
df (object | None)

get_p_value(obs_stat, direction)

Theory-based p-value for a (standardized) observed statistic.

Return type: DataFrame
Parameters: direction (str)

visualize(bins=100, *, engine='plotly')

Plot the theoretical density curve (plotly by default; pass engine="plotnine" for plotnine).

Parameters:

bins (int)
engine (str)

Getters and visualization

moderndive.get_p_value(distribution, obs_stat, direction)

Return a one-row frame with the simulation-based p_value.

direction is one of right/greater, left/less, or two-sided. The two-sided p-value uses infer’s convention: twice the smaller one-sided tail proportion, capped at 1.

Return type:

DataFrame

Parameters:

distribution (Distribution)
direction (str)
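
The two-sided convention described for direction can be sketched in plain Python as follows. The function name simulated_p_value is hypothetical, and the inclusive comparisons (>= and <=) are an assumption about how ties with the observed statistic are counted; treat this as an illustration of the rule, not the library's code.

```python
def simulated_p_value(stats, obs_stat, direction):
    """Proportion of simulated statistics at least as extreme as obs_stat."""
    n = len(stats)
    right = sum(s >= obs_stat for s in stats) / n  # right-tail proportion
    left = sum(s <= obs_stat for s in stats) / n   # left-tail proportion
    if direction in ("right", "greater"):
        return right
    if direction in ("left", "less"):
        return left
    if direction == "two-sided":
        # Twice the smaller one-sided tail proportion, capped at 1.
        return min(1.0, 2 * min(left, right))
    raise ValueError(f"unknown direction: {direction!r}")


null_stats = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p_two_sided = simulated_p_value(null_stats, 9, "two-sided")
```

The cap matters when the observed statistic sits near the center of the distribution, where doubling the smaller tail would otherwise exceed 1.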

moderndive.get_confidence_interval(distribution, level=0.95, type='percentile', *, point_estimate=None)

Return a one-row frame with lower_ci and upper_ci.

type="percentile": the (1-level)/2 and 1-(1-level)/2 quantiles of the bootstrap distribution.
type="se": point_estimate ± z* · SE where SE is the SD of the bootstrap distribution (requires point_estimate).

Return type:

DataFrame

Parameters:

distribution (Distribution)
level (float)
type (str)
point_estimate (float | None)
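
A minimal sketch of the two interval types, using only the standard library. The helper names are hypothetical, and two details are assumptions rather than documented behavior: the linear-interpolation quantile rule and the use of the sample standard deviation (n - 1 denominator) for SE.

```python
import statistics


def quantile(values, q):
    """Linear-interpolation quantile (assumed convention)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def percentile_ci(stats, level=0.95):
    """The (1-level)/2 and 1-(1-level)/2 quantiles of the bootstrap statistics."""
    alpha = 1 - level
    return quantile(stats, alpha / 2), quantile(stats, 1 - alpha / 2)


def se_ci(stats, point_estimate, level=0.95):
    """point_estimate +/- z* times the SD of the bootstrap statistics."""
    z_star = statistics.NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = statistics.stdev(stats)  # assumed: sample SD
    return point_estimate - z_star * se, point_estimate + z_star * se
```

The percentile interval needs only the distribution, while the SE interval is centered on the point estimate, which is why that argument is required for type="se".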

moderndive.visualize(distribution, bins=20, *, engine='plotly', method='simulation', dens_color=None, shade_pvalue=None, shade_ci=None, **kwargs)

Histogram of the simulated statistics, as an InferPlot.

method is "simulation" (histogram, default), "theoretical" (a normal-approximation density curve), or "both" (histogram in density units overlaid with the normal curve), mirroring R infer’s visualize(method=). dens_color sets the theoretical-curve color (for "theoretical"/"both"). Pass shade_pvalue=/shade_ci= to shade in one call, or compose with +.

Return type:

InferPlot

Parameters:

bins (int)
engine (str)
method (str)
dens_color (str | None)

moderndive.shade_p_value(obs_stat, direction, *, color=None, fill=None)

A p-value shading spec; add it to a visualize() plot with +.

direction ∈ {right/greater, left/less, two-sided}. For a faceted visualize_fit() plot, pass a per-term obs_stat — an observed FitResult, a term-keyed frame, or a dict — to shade each facet.

Return type:

ShadeSpec

Parameters:

direction (str)
color (str | None)
fill (str | None)

moderndive.shade_confidence_interval(endpoints, color=None, fill=None)

A confidence-interval shading spec; add it to a visualize() plot with +.

endpoints is a CI DataFrame (lower_ci/upper_ci) or a (lower, upper) tuple. For a faceted visualize_fit() plot, pass a per-term CI table (with a term column) to shade each facet from its own interval.

Return type:

ShadeSpec

Parameters:

color (str | None)
fill (str | None)

Theory-based tests

moderndive.t_test(data, *, formula=None, response=None, explanatory=None, order=None, alternative='two-sided', mu=0.0, conf_level=0.95)

One-sample (no explanatory) or two-sample (Welch) t-test, tidy output.

Return type:

DataFrame

Parameters:

data (DataFrame)
formula (str | None)
response (str | None)
explanatory (str | None)
order (tuple[object, object] | None)
alternative (str)
mu (float)
conf_level (float)

moderndive.t_stat(data, **kwargs)

The t statistic only (see t_test()).

Return type: float
Parameters: data (DataFrame)

moderndive.prop_test(data, *, formula=None, response=None, explanatory=None, success=None, order=None, p=None, alternative='two-sided', z=False, correct=True, conf_int=True, conf_level=0.95)

Tidy one- or two-proportion test, mirroring R infer::prop_test.

By default reports the chi-square statistic (like R’s prop.test) with a chisq_df column; pass z=True for the signed z statistic instead. correct applies Yates’ continuity correction. With conf_int=True (default) the output includes a conf_level confidence interval — on the proportion (one-sample) or on the difference in proportions (two-sample).

Return type:

DataFrame

Parameters:

data (DataFrame)
formula (str | None)
response (str | None)
explanatory (str | None)
success (object | None)
order (tuple[object, object] | None)
p (float | None)
alternative (str)
z (bool)
correct (bool)
conf_int (bool)
conf_level (float)

moderndive.chisq_test(data, *, formula=None, response=None, explanatory=None, p=None)

Tidy chi-squared test.

With an explanatory variable, this is a test of independence. With only a response and a p={level: probability, ...} mapping, it is a goodness-of-fit test against those hypothesized proportions. Returns statistic, chisq_df, p_value.

Return type:

DataFrame

Parameters:

data (DataFrame)
formula (str | None)
response (str | None)
explanatory (str | None)
p (dict | None)
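
To make the goodness-of-fit case concrete, here is a small sketch of how the statistic and its degrees of freedom follow from a p={level: probability, ...} mapping. The function name gof_statistic is hypothetical, and the p-value is left out because the standard library has no chi-squared distribution; this illustrates the textbook formula, not the library's implementation.

```python
from collections import Counter


def gof_statistic(observed_values, p):
    """Chi-squared goodness-of-fit statistic and degrees of freedom."""
    counts = Counter(observed_values)
    n = len(observed_values)
    statistic = sum(
        (counts.get(level, 0) - n * prob) ** 2 / (n * prob)  # (O - E)^2 / E
        for level, prob in p.items()
    )
    return statistic, len(p) - 1  # df = number of levels - 1


values = ["a"] * 30 + ["b"] * 50 + ["c"] * 20
statistic, chisq_df = gof_statistic(values, {"a": 0.25, "b": 0.5, "c": 0.25})
```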

moderndive.chisq_stat(data, **kwargs)

The chi-squared statistic only (see chisq_test()).

Return type: float
Parameters: data (DataFrame)

Theory-based inference wrappers (scipy.stats).

The book deliberately teaches simulation-based inference first, then ties results back to the traditional theory-based methods (t-distribution CIs, the two-sample test, normal approximations). These helpers provide those theory-based companions so the chapters can draw the simulation-vs-theory comparison.

All functions return small polars frames with tidy column names.

moderndive.theory.t_test_one_sample(x, mu=0.0, alternative='two-sided')

One-sample t-test of H0: mean == mu.

Return type:

DataFrame

Parameters:

mu (float)
alternative (str)

moderndive.theory.t_test_two_sample(x, y, alternative='two-sided', equal_var=False)

Two-sample (Welch by default) t-test of equal means.

Return type:

DataFrame

Parameters:

alternative (str)
equal_var (bool)
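
For readers who want to see what the Welch default (equal_var=False) computes, the sketch below gives the statistic and the Welch-Satterthwaite degrees of freedom in plain Python. The function name welch_t is hypothetical; the documented function wraps scipy.stats, so this is only the standard textbook formula written out.

```python
import math
import statistics


def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


t_stat_value, welch_df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike the pooled test, the two variances are never combined, so the degrees of freedom are generally not an integer.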

moderndive.theory.t_confidence_interval(x, level=0.95)

Theory-based t confidence interval for a single mean.

Return type: DataFrame
Parameters: level (float)

moderndive.theory.prop_test_two_sample(successes, totals, alternative='two-sided')

Two-sample z-test for a difference in proportions (normal approximation).

Return type:

DataFrame

Parameters:

successes (tuple[int, int])
totals (tuple[int, int])
alternative (str)
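
A sketch of a two-sample z-test under the normal approximation, taking successes and totals as 2-tuples like the function above. The name two_prop_z is hypothetical, and the pooled standard error is an assumption: it is the usual choice under the null of equal proportions, but the source does not say which standard error the library uses.

```python
import math
from statistics import NormalDist


def two_prop_z(successes, totals):
    """z statistic and two-sided p-value for a difference in proportions."""
    x1, x2 = successes
    n1, n2 = totals
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # assumed: pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z_value, p_value = two_prop_z((30, 20), (100, 100))
```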

Regression & summary helpers

moderndive.get_regression_table(model, digits=3, conf_level=0.95, exponentiate=False, default_categorical_levels=False)

Tidy regression table: term, estimate, std_error, statistic, p_value, lower/upper_ci.

model is a fitted statsmodels results object — either OLS (smf.ols("y ~ x", data).fit()) or GLM (smf.glm(...).fit()).

For GLMs with a log or logit link, pass exponentiate=True to report the coefficient estimate and its confidence interval as rate / odds ratios (std_error, statistic, and p_value stay on the model’s link scale, matching broom::tidy).

By default, categorical-predictor terms are prettified (e.g. income[T.High income] → income: High income). Pass default_categorical_levels=True to keep the raw statsmodels term names.

Return type:

DataFrame

Parameters:

digits (int)
conf_level (float)
exponentiate (bool)
default_categorical_levels (bool)
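
What exponentiate=True does to a single row can be sketched as below: only the estimate and the interval endpoints move to the ratio scale. The function name exponentiate_row is hypothetical, and the sketch stands in for the idea rather than for the library's code.

```python
import math


def exponentiate_row(estimate, lower_ci, upper_ci):
    """Map a link-scale estimate and its CI to the ratio scale."""
    # std_error, statistic and p_value are deliberately left untouched.
    return math.exp(estimate), math.exp(lower_ci), math.exp(upper_ci)


# A logit-scale coefficient of log(2) corresponds to an odds ratio of 2.
odds_ratio, lower, upper = exponentiate_row(math.log(2), 0.0, math.log(4))
```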

moderndive.get_regression_points(model, digits=3, *, newdata=None, ID=None)

Fitted values + residuals per observation (~ broom::augment).

Columns: ID, the outcome, each original predictor, <outcome>_hat, residual. In-formula transformations are handled gracefully: a transformed outcome (np.log(mpg)) is shown on the model’s scale under a sanitized name (log_mpg / log_mpg_hat), and transformed predictors (poly(), scale(), I()) are shown as their original columns rather than leaking basis matrices. For GLMs, fitted values and residuals are on the response scale (e.g. probabilities for logistic regression).

Pass newdata (a polars/pandas frame) to apply the model to new observations: predictions are returned, plus a residual if the outcome is present in newdata. ID names a column to use as the identifier (placed first); without it, ID is 1..n.

Return type:

DataFrame

Parameters:

digits (int)
ID (str | None)

moderndive.get_regression_summaries(model, digits=3)

Model-fit summaries as a tidy 1-row frame (~ broom::glance).

For an OLS model: r_squared, adj_r_squared, mse, rmse, sigma, statistic (overall F), p_value, df, nobs.

For a GLM (no R² applies): mse, rmse, deviance, null_deviance, aic, bic, log_lik, df_residual, df_null, nobs. mse/rmse use response-scale residuals.

mse is the mean squared residual with n in the denominator (so rmse = sqrt(mse)). For OLS, sigma is the residual standard error with n - p in the denominator, where p is the number of estimated coefficients, matching the R package.

Return type: DataFrame
Parameters: digits (int)
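
The difference between the two denominators is easy to see on a handful of residuals. In this sketch the function name fit_summaries and the argument n_params (the number of estimated coefficients, intercept included) are hypothetical.

```python
import math


def fit_summaries(residuals, n_params):
    """mse and rmse divide by n; sigma divides by n - n_params."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)  # residual sum of squares
    mse = rss / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "sigma": math.sqrt(rss / (n - n_params)),
    }


summary = fit_summaries([1, -1, 2, -2], n_params=2)
```

With few observations sigma is noticeably larger than rmse; the two converge as n grows relative to the number of coefficients.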

moderndive.get_correlation(data, formula=None, *, x=None, y=None, method='pearson', na_rm=True, wide=False, quiet=False)

Correlation between an outcome and one or more predictors.

Mirrors moderndive::get_correlation. Give the variables either as a formula ("y ~ x" or "y ~ x1 + x2 + x3") or, for a single predictor, via x= and y=.

method is "pearson" (default), "spearman" (rank correlation), or "kendall" (rank concordance). na_rm drops rows with a null in either column before computing (per predictor pair); set na_rm=False to keep them (yielding nan if any are present).

With one predictor the result is a 1-row frame with a cor column. With multiple predictors the result is long by default — columns predictor and cor (one row each) — or pass wide=True for one column per predictor.

A short note points to a full pairwise correlation matrix when there are multiple predictors; silence it with quiet=True.

Return type:

DataFrame

Parameters:

formula (str | None)
x (str | None)
y (str | None)
method (str)
na_rm (bool)
wide (bool)
quiet (bool)
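
The effect of na_rm on a single predictor pair can be sketched with plain lists, using None for a null. The function name pearson is hypothetical and only the Pearson method is shown; this illustrates the described behavior and is not the library's implementation.

```python
import math


def pearson(xs, ys, na_rm=True):
    """Pearson correlation; None marks a missing value."""
    pairs = list(zip(xs, ys))
    if na_rm:
        # Drop any row with a null in either column.
        pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    elif any(x is None or y is None for x, y in pairs):
        return math.nan  # missing values kept, so the result is nan
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


cor = pearson([1, 2, None, 4], [2, 4, 5, 8])
```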

moderndive.pop_sd(x)

Population standard deviation (divides by n, not n - 1).

Mirrors moderndive::pop_sd. Accepts a polars Series, list, numpy array, or any sequence; nulls/NaNs are dropped before computing.

Return type: float
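
A minimal sketch of the same computation in plain Python, dividing by n and skipping missing values. The name pop_sd_sketch is hypothetical, and None stands in for a null.

```python
import math


def pop_sd_sketch(x):
    """Population SD: divide by n, after dropping None and NaN."""
    values = [v for v in x if v is not None and not math.isnan(v)]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


sd = pop_sd_sketch([2, 4, 4, 4, 5, 5, 7, 9, None, float("nan")])
```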

moderndive.tidy_summary(data, columns=None, digits=3)

Per-variable summary statistics for the selected columns.

Mirrors the R moderndive::tidy_summary column layout: column, n, group, type, min, Q1, mean, median, Q3, max, sd. Numeric columns get the five-number summary + mean/sd; non-numeric columns report n and type with the numeric fields left null.

Return type:

DataFrame

Parameters:

columns (list[str] | None)
digits (int)

moderndive.count_missing(data, columns=None)

Count missing (null) values in each column.

A beginner-friendly alternative to df.select(pl.all().is_null().sum()): it returns a tidy two-column data frame with one row per column (column, n_missing), sorted from most to fewest missing values so the columns needing attention surface first.

Parameters:

data – A polars (or pandas) data frame.
columns (list[str] | None) – Optional list of column names to check; defaults to every column.

Return type:

DataFrame
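
The output shape can be illustrated without a data frame library. In this sketch a table is a dict of column name to list, None marks a missing value, and the name count_missing_sketch is hypothetical.

```python
def count_missing_sketch(table, columns=None):
    """One row per column, sorted from most to fewest missing values."""
    names = list(table) if columns is None else columns
    rows = [
        {"column": name, "n_missing": sum(v is None for v in table[name])}
        for name in names
    ]
    return sorted(rows, key=lambda row: row["n_missing"], reverse=True)


table = {"a": [1, None, 3], "b": [None, None, 1], "c": [1, 2, 3]}
missing = count_missing_sketch(table)
```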

Sampling and plots

All plotting helpers accept engine="plotly" (default) or engine="plotnine".

moderndive.rep_slice_sample(data, n=None, *, prop=None, reps=1, replace=False, weight_by=None, seed=None)

Take reps samples from data.

Give the sample size as either n (a count) or prop (a fraction of the rows, e.g. prop=0.5). Returns a polars DataFrame with a leading replicate column identifying which sample each row belongs to. Set replace=True for sampling with replacement (bootstrap-style). weight_by gives unequal selection probabilities — a column name or a sequence of weights. Pass seed for reproducibility.

Return type:

DataFrame

Parameters:

data (DataFrame)
n (int | None)
prop (float | None)
reps (int)
replace (bool)
seed (int | None)
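
The replicate layout can be sketched with a list of row dicts. The function name rep_sample is hypothetical, replicate numbering from 1 is an assumption, and weighted sampling (weight_by) is left out to keep the example short.

```python
import random


def rep_sample(rows, n, reps=1, replace=False, seed=None):
    """Stack reps samples of size n, tagging each row with its replicate."""
    rng = random.Random(seed)
    out = []
    for rep in range(1, reps + 1):  # assumed: replicates numbered from 1
        if replace:
            picks = [rng.choice(rows) for _ in range(n)]  # with replacement
        else:
            picks = rng.sample(rows, n)  # without replacement
        out.extend({"replicate": rep, **row} for row in picks)
    return out


rows = [{"id": i} for i in range(10)]
samples = rep_sample(rows, n=3, reps=4, seed=1)
```

Because all replicates are stacked in one long table, a per-replicate statistic is a single group-by on the replicate column.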

moderndive.rep_sample_n(data, n, *, reps=1, replace=False, prob=None, seed=None)

Take reps samples of size n (older moderndive name).

Like rep_slice_sample(), but the sample size is always the count n and unequal selection weights are passed as prob (a column name or a sequence), matching the R rep_sample_n signature.

Return type:

DataFrame

Parameters:

data (DataFrame)
n (int)
reps (int)
replace (bool)
seed (int | None)

moderndive.pairplot(data, columns=None, hue=None, *, engine='plotly')

Scatterplot matrix of the numeric columns (the analog of GGally::ggpairs).

engine="plotly" (default) returns a plotly go.Figure from plotly.express.scatter_matrix. engine="plotnine" (alias "seaborn") returns the matplotlib Figure from seaborn.pairplot — the non-plotly backend here is seaborn-backed, since plotnine has no first-class SPLOM. hue colors points by a categorical column.

Parameters:

data (DataFrame)
columns (list[str] | None)
hue (str | None)
engine (str)

moderndive.gg_parallel_slopes(data, response, explanatory, by, *, alpha=1.0, engine='plotly')

Scatterplot with a parallel-slopes regression model overlaid.

Fits response ~ explanatory + C(by) (one common slope, a separate intercept per level of by) and draws one fitted line per group over the data. alpha sets the point transparency (0–1), useful when points overlap.

Parameters:

response (str)
explanatory (str)
by (str)
alpha (float)
engine (str)
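
The model behind the plot, one common slope with a separate intercept per group, has a closed-form least-squares solution that can be written without a regression library. The function name parallel_slopes is hypothetical; the sketch centers x and y within each group, which gives the same slope as fitting response ~ explanatory + C(by) by OLS.

```python
from collections import defaultdict


def parallel_slopes(x, y, group):
    """Common slope and per-group intercepts for y ~ x + C(group)."""
    by_group = defaultdict(list)
    for xi, yi, gi in zip(x, y, group):
        by_group[gi].append((xi, yi))
    sxy = sxx = 0.0
    means = {}
    for g, pts in by_group.items():
        mx = sum(px for px, _ in pts) / len(pts)
        my = sum(py for _, py in pts) / len(pts)
        means[g] = (mx, my)
        # Pool the within-group cross-products and sums of squares.
        sxy += sum((px - mx) * (py - my) for px, py in pts)
        sxx += sum((px - mx) ** 2 for px, _ in pts)
    slope = sxy / sxx
    intercepts = {g: my - slope * mx for g, (mx, my) in means.items()}
    return slope, intercepts


slope, intercepts = parallel_slopes(
    [0, 1, 2, 0, 1, 2], [1, 3, 5, 4, 6, 8], ["a", "a", "a", "b", "b", "b"]
)
```

Each group's fitted line passes through that group's own mean point, so the lines differ only by a vertical shift.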

moderndive.geom_parallel_slopes(data, response, explanatory, by, color=None)

plotnine layer(s) drawing the parallel-slopes fitted lines.

Add to a ggplot with + (plotnine-only; for a plotly version call gg_parallel_slopes() with engine="plotly").

Parameters:

response (str)
explanatory (str)
by (str)
color (str | None)

moderndive.gg_categorical_model(data, response, explanatory, *, engine='plotly')

Regression with one categorical predictor (~ geom_categorical_model).

Fits response ~ C(explanatory); each category’s fitted value is its group mean, drawn as a horizontal marker over the (jittered) data points.

Parameters:

response (str)
explanatory (str)
engine (str)

moderndive.geom_categorical_model(data, response, explanatory, *, engine='plotly')

Regression with one categorical predictor (~ geom_categorical_model).

Fits response ~ C(explanatory); each category’s fitted value is its group mean, drawn as a horizontal marker over the (jittered) data points.

Parameters:

response (str)
explanatory (str)
engine (str)

moderndive.plot_3d_regression(data, formula, n=25)

Interactive 3D scatterplot with a fitted regression plane.

Mirrors moderndive::plot_3d_regression. Pass a formula z ~ x + y — one numeric outcome and exactly two numeric predictors — and get a plotly go.Figure with the data points and the fitted lm plane.

In-formula transformations (e.g. log(z) ~ x + y) are not supported, since the plane and the raw points would be on different scales; transform the columns of data first and pass plain names. n sets the plane’s grid resolution per axis.

Parameters:

formula (str)
n (int)

Viewing data

moderndive.View(x, title=None)

Display a data frame as an interactive table (search, sort, paginate).

In a notebook / Quarto context this renders an interactive table via the optional itables package (install with pip install "moderndive[view]"). Without itables it returns the data frame so it still displays. Accepts a polars or pandas DataFrame (or anything coercible to one). title is shown as the table caption.

Parameters: title (str | None)

Datasets

moderndive.load_dataset(name)

Load a dataset by name, returning a polars DataFrame.

Return type: DataFrame
Parameters: name (str)

moderndive.data.available_datasets()

Return the sorted names of all loadable datasets.

Return type: list[str]

Each dataset also has a convenience loader moderndive.load_<name>() returning a polars DataFrame. Call moderndive.data.available_datasets() for the full list.