API reference

Inference grammar

moderndive.specify(data, *, response=None, explanatory=None, formula=None, success=None)

Specify the response (and optional explanatory) variable(s) for inference.

Variables can be named with keyword arguments or with a formula string, mirroring R infer:

specify(df, response="weight")
specify(df, formula="popular_or_not ~ track_genre", success="popular")

Multi-term formulas ("y ~ a + b") are supported for fit().

Return type:

Specification

Parameters:

data (DataFrame)
response (str | None)
explanatory (str | None)
formula (str | None)
success (object | None)

moderndive.observe(data, *, response=None, explanatory=None, formula=None, success=None, stat='mean', order=None, null=None, mu=None, p=None, sigma=None)

Shortcut for chaining specify(), an optional hypothesize(), and calculate().

Mirrors R infer::observe(): computes an observed statistic in one call.

Return type:

ObservedStatistic

Parameters:

data (DataFrame)
response (str | None)
explanatory (str | None)
formula (str | None)
success (object | None)
stat (str)
order (tuple[object, object] | None)
null (str | None)
mu (float | None)
p (float | None)
sigma (float | None)

moderndive.assume(distribution, df=None)

Set a theoretical distribution ("t", "z", "F", "Chisq").

df is the degrees of freedom: a scalar for t/Chisq, a (df1, df2) tuple for F, and unused for z.

Return type:

TheoreticalDistribution

Parameters:

distribution (str)
df (object | None)

class moderndive.infer.core.Specification(data, response, explanatory=None, success=None, formula=None)

Result of specify(): the chosen response (+ optional explanatory).

Parameters:

data (DataFrame)
response (str)
explanatory (str | None)
success (object | None)
formula (str | None)

calculate(stat, *, order=None, mu=None, p=None, sigma=None)

Compute the observed statistic (no resampling).

Return type:

ObservedStatistic

Parameters:

order (tuple[object, object] | None)
mu (float | None)
p (float | None)
sigma (float | None)

assume(distribution, df=None)

Set a theoretical sampling distribution (see moderndive.assume()).

Parameters:

distribution (str)
df (object | None)

fit()

Fit the observed regression (ordinary least squares) for the formula.

Return type:

FitResult

class moderndive.infer.core.Hypothesis(spec, null, mu=None, p=None, sigma=None)

Parameters:

spec (Specification)
null (str)
mu (float | None)
p (float | None)
sigma (float | None)

calculate(stat, *, order=None)

Observed statistic that needs the hypothesized value (e.g. stat="t" or stat="z").

Return type:

ObservedStatistic

Parameters:

order (tuple[object, object] | None)

class moderndive.infer.core.GeneratedReplicates(spec, type, null, plans, shifted_response=None, hyp_mu=None, hyp_p=None, hyp_sigma=None)

Materialized resampling plan; supports both calculate() and fit().

plans holds one numpy array per replicate:

bootstrap: row indices (with replacement) into the data
permute: a permutation of the row positions (used to shuffle a column)
draw: a simulated response array under a point null proportion

Parameters:

spec (Specification)
type (str)
null (str | None)
plans (list[ndarray])
shifted_response (ndarray | None)
hyp_mu (float | None)
hyp_p (float | None)
hyp_sigma (float | None)
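
To make the bootstrap and permute plan types concrete, the sketch below builds both with the standard library. It only illustrates the idea and is not the library's implementation: make_plans and its arguments are hypothetical names, plain lists stand in for the numpy arrays described above, and the draw type is left out.

```python
import random


def make_plans(n_rows, reps, kind, seed=None):
    """Hypothetical helper: build one index plan per replicate."""
    rng = random.Random(seed)
    plans = []
    for _ in range(reps):
        if kind == "bootstrap":
            # n_rows row indices drawn with replacement
            plans.append([rng.randrange(n_rows) for _ in range(n_rows)])
        elif kind == "permute":
            # a shuffled copy of 0..n_rows-1, used to reorder one column
            plans.append(rng.sample(range(n_rows), n_rows))
        else:
            raise ValueError(f"unsupported kind: {kind!r}")
    return plans


boot = make_plans(5, reps=3, kind="bootstrap", seed=1)
perm = make_plans(5, reps=3, kind="permute", seed=1)
```

A bootstrap plan may repeat or skip row positions, while a permute plan contains every row position exactly once.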

fit()

Fit OLS on each replicate, returning per-term estimates (long form).

Return type:

FitResult

class moderndive.infer.core.Distribution(data, stat, null=None, type='bootstrap')

A simulated distribution of statistics (one row per replicate).

Parameters:

data (DataFrame)
stat (str)
null (str | None)
type (str)

class moderndive.infer.core.FitResult(data, null=None, type='bootstrap')

Regression coefficients: observed (one row per term) or a distribution of them across replicates (replicate, term, estimate).

Parameters:

data (DataFrame)
null (str | None)
type (str)

class moderndive.infer.theoretical.TheoreticalDistribution(distribution, df=None)

A named theoretical distribution (from assume()).

Parameters:

distribution (str)
df (object | None)

get_p_value(obs_stat, direction)

Theory-based p-value for a (standardized) observed statistic.

Return type:

DataFrame

Parameters:

direction (str)

visualize(bins=100, *, engine='plotly')

Plot the theoretical density curve (plotly by default, or engine="plotnine").

Parameters:

bins (int)
engine (str)

Getters and visualization

moderndive.get_p_value(distribution, obs_stat, direction)

Return a one-row frame with the simulation-based p_value.

direction is one of right/greater, left/less, or two-sided. The two-sided p-value uses infer’s convention: twice the smaller one-sided tail proportion, capped at 1.

Return type:

DataFrame

Parameters:

distribution (Distribution)
direction (str)
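
The two-sided convention described for get_p_value() can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's code: simulated_p_value is a hypothetical name, and counting ties with the observed statistic in both tails (>= and <=) is an assumption.

```python
def simulated_p_value(stats, obs_stat, direction):
    """Hypothetical sketch of a simulation-based p-value."""
    n = len(stats)
    # Tail proportions; ties with obs_stat count in both tails (assumption)
    right = sum(s >= obs_stat for s in stats) / n
    left = sum(s <= obs_stat for s in stats) / n
    if direction in ("right", "greater"):
        return right
    if direction in ("left", "less"):
        return left
    if direction == "two-sided":
        # twice the smaller tail, capped at 1
        return min(1.0, 2 * min(left, right))
    raise ValueError(f"unknown direction: {direction!r}")


null_stats = [-2, -1, 0, 0, 1, 1, 2, 3]
p_right = simulated_p_value(null_stats, 2, "right")      # 2 of 8 values are >= 2
p_two = simulated_p_value(null_stats, 2, "two-sided")    # twice the smaller tail
p_capped = simulated_p_value([-1, 0, 0, 0, 1], 0, "two-sided")  # 2 * 0.8 is capped
```

The cap matters when the observed statistic sits near the center of the distribution, where doubling the smaller tail can exceed 1.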

moderndive.get_confidence_interval(distribution, level=0.95, type='percentile', *, point_estimate=None)

Return a one-row frame with lower_ci and upper_ci.

type="percentile": the (1-level)/2 and 1-(1-level)/2 quantiles of the bootstrap distribution.
type="se": point_estimate ± z* · SE where SE is the SD of the bootstrap distribution (requires point_estimate).

Return type:

DataFrame

Parameters:

distribution (Distribution)
level (float)
type (str)
point_estimate (float | None)
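
The two interval types described for get_confidence_interval() can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: the quantile uses linear interpolation, and SE is the sample standard deviation (n - 1 denominator) of the bootstrap statistics.

```python
import math
from statistics import NormalDist, stdev


def quantile(values, q):
    """Linear-interpolation quantile (one common convention, assumed here)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def percentile_ci(boot_stats, level=0.95):
    """Quantiles at (1 - level) / 2 and 1 - (1 - level) / 2."""
    alpha = 1 - level
    return quantile(boot_stats, alpha / 2), quantile(boot_stats, 1 - alpha / 2)


def se_ci(boot_stats, point_estimate, level=0.95):
    """point_estimate +/- z* times the SD of the bootstrap statistics."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = stdev(boot_stats)
    return point_estimate - z_star * se, point_estimate + z_star * se


# Stand-in bootstrap distribution: the values 0, 1, ..., 100
boot_stats = [float(i) for i in range(101)]
lower, upper = percentile_ci(boot_stats, level=0.90)
se_lower, se_upper = se_ci(boot_stats, point_estimate=50.0)
```

The percentile interval follows the shape of the bootstrap distribution, while the SE interval is always symmetric around point_estimate.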

moderndive.visualize(distribution, bins=20, *, engine='plotly', method='simulation', shade_pvalue=None, shade_ci=None, **kwargs)

Histogram of the simulated statistics, as an InferPlot.

method is "simulation" (histogram, default), "theoretical" (a normal-approximation density curve), or "both" (histogram in density units overlaid with the normal curve), mirroring R infer’s visualize(method=). Pass shade_pvalue=/shade_ci= to shade in one call, or compose with +.

Return type:

InferPlot

Parameters:

bins (int)
engine (str)
method (str)

moderndive.shade_p_value(obs_stat, direction, *, color=None)

A p-value shading spec; add it to a visualize() plot with +.

direction ∈ {right/greater, left/less, two-sided}. For a faceted visualize_fit() plot, pass a per-term obs_stat — an observed FitResult, a term-keyed frame, or a dict — to shade each facet.

Return type:

ShadeSpec

Parameters:

direction (str)
color (str | None)

moderndive.shade_confidence_interval(endpoints, color=None)

A confidence-interval shading spec; add it to a visualize() plot with +.

endpoints is a CI DataFrame (lower_ci/upper_ci) or a (lower, upper) tuple. For a faceted visualize_fit() plot, pass a per-term CI table (with a term column) to shade each facet from its own interval.

Return type:

ShadeSpec

Parameters:

color (str | None)

Theory-based tests

moderndive.t_test(data, *, formula=None, response=None, explanatory=None, order=None, alternative='two-sided', mu=0.0, conf_level=0.95)

One-sample (no explanatory) or two-sample (Welch) t-test, tidy output.

Return type:

DataFrame

Parameters:

data (DataFrame)
formula (str | None)
response (str | None)
explanatory (str | None)
order (tuple[object, object] | None)
alternative (str)
mu (float)
conf_level (float)

moderndive.t_stat(data, **kwargs)

The t statistic only (see t_test()).

Return type:

float

Parameters:

data (DataFrame)

moderndive.prop_test(data, *, formula=None, response=None, explanatory=None, success=None, order=None, p=None, alternative='two-sided')

One- or two-proportion z-test (normal approximation), tidy output.

Return type:

DataFrame

Parameters:

data (DataFrame)
formula (str | None)
response (str | None)
explanatory (str | None)
success (object | None)
order (tuple[object, object] | None)
p (float | None)
alternative (str)

moderndive.chisq_test(data, *, formula=None, response=None, explanatory=None)

Chi-squared test of independence (categorical response ~ categorical explanatory).

Return type:

DataFrame

Parameters:

data (DataFrame)
formula (str | None)
response (str | None)
explanatory (str | None)

moderndive.chisq_stat(data, **kwargs)

The chi-squared statistic only (see chisq_test()).

Return type:

float

Parameters:

data (DataFrame)

Theory-based inference wrappers (scipy.stats).

The book deliberately teaches simulation-based inference first, then ties results back to the traditional theory-based methods (t-distribution CIs, the two-sample test, normal approximations). These helpers provide those theory-based companions so the chapters can draw the simulation-vs-theory comparison.

All functions return small polars frames with tidy column names.

moderndive.theory.t_test_one_sample(x, mu=0.0, alternative='two-sided')

One-sample t-test of H0: mean == mu.

Return type:

DataFrame

Parameters:

mu (float)
alternative (str)

moderndive.theory.t_test_two_sample(x, y, alternative='two-sided', equal_var=False)

Two-sample (Welch by default) t-test of equal means.

Return type:

DataFrame

Parameters:

alternative (str)
equal_var (bool)

moderndive.theory.t_confidence_interval(x, level=0.95)

Theory-based t confidence interval for a single mean.

Return type:

DataFrame

Parameters:

level (float)

moderndive.theory.prop_test_two_sample(successes, totals, alternative='two-sided')

Two-sample z-test for a difference in proportions (normal approximation).

Return type:

DataFrame

Parameters:

successes (tuple[int, int])
totals (tuple[int, int])
alternative (str)
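
As a rough illustration of a two-sample z-test for proportions, the sketch below uses the textbook pooled standard error. The reference does not say whether prop_test_two_sample() pools the proportions, so the pooling, the function name two_prop_z, and the two-sided-only p-value are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def two_prop_z(successes, totals):
    """Hypothetical sketch: pooled two-proportion z statistic and two-sided p-value."""
    x1, x2 = successes
    n1, n2 = totals
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under H0: p1 == p2 (textbook choice, assumed here)
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z_equal, p_equal = two_prop_z((50, 50), (100, 100))
z_diff, p_diff = two_prop_z((60, 40), (100, 100))
```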

Regression & summary helpers

moderndive.get_regression_table(model, digits=3, conf_level=0.95)

Tidy regression table: term, estimate, std_error, statistic, p_value, lower_ci, upper_ci.

model is a fitted statsmodels results object (e.g. from statsmodels.formula.api.ols("y ~ x", data).fit()).

Return type:

DataFrame

Parameters:

digits (int)
conf_level (float)

moderndive.get_regression_points(model, digits=3)

Fitted values + residuals per observation (~ broom::augment).

Columns: ID, the response, each explanatory term, <response>_hat, residual.

Return type:

DataFrame

Parameters:

digits (int)

moderndive.get_regression_summaries(model, digits=3)

Model-fit summaries as a tidy 1-row frame (~ moderndive::get_regression_summaries).

Columns: r_squared, adj_r_squared, mse, rmse, sigma, statistic (overall F), p_value, df (model degrees of freedom), nobs. model is a fitted statsmodels results object.

mse is the mean squared residual using n in the denominator (so rmse = sqrt(mse)), while sigma is the residual standard error using n - p — matching the R package.

Return type:

DataFrame

Parameters:

digits (int)
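
The difference between mse/rmse and sigma in get_regression_summaries() comes down to the denominator. The sketch below computes all three from a list of residuals. residual_summaries is a hypothetical name, and n_params is assumed to count every estimated coefficient, including the intercept.

```python
import math


def residual_summaries(residuals, n_params):
    """Hypothetical sketch: mse and rmse divide by n, sigma divides by n - p."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)  # residual sum of squares
    mse = rss / n
    rmse = math.sqrt(mse)
    sigma = math.sqrt(rss / (n - n_params))
    return mse, rmse, sigma


# Five residuals from a model with an intercept and one slope (p = 2)
mse, rmse, sigma = residual_summaries([1.0, -1.0, 2.0, -2.0, 0.0], n_params=2)
```

Because n - p is smaller than n, sigma is always at least as large as rmse for the same residuals.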

moderndive.get_correlation(data, formula=None, *, x=None, y=None)

Pearson correlation as a tidy 1-row frame with a cor column.

Mirrors moderndive::get_correlation(data, y ~ x). Specify the variable pair either as a formula string ("y ~ x") or via the x= and y= keyword arguments. Rows with a null in either column are dropped.

Return type:

DataFrame

Parameters:

formula (str | None)
x (str | None)
y (str | None)

moderndive.pop_sd(x)

Population standard deviation (divides by n, not n - 1).

Mirrors moderndive::pop_sd. Accepts a polars Series, list, numpy array, or any sequence; nulls/NaNs are dropped before computing.

Return type:

float
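
For comparison, the standard library already separates the two denominators: statistics.pstdev divides by n and statistics.stdev divides by n - 1. The sketch below mimics the behavior described for pop_sd(), with nulls and NaNs dropped first. pop_sd_sketch is a hypothetical name, not the library function.

```python
import math
from statistics import pstdev, stdev


def pop_sd_sketch(x):
    """Hypothetical sketch: drop None/NaN, then divide by n (not n - 1)."""
    clean = [
        v for v in x
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    return pstdev(clean)


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, None, float("nan")]
population_sd = pop_sd_sketch(values)
sample_sd = stdev(v for v in values[:8])  # n - 1 denominator, for contrast
```

The population SD is never larger than the sample SD computed on the same values.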

moderndive.tidy_summary(data, columns=None, digits=3)

Per-variable summary statistics for the selected columns.

Mirrors the R moderndive::tidy_summary column layout: column, n, group, type, min, Q1, mean, median, Q3, max, sd. Numeric columns get the five-number summary + mean/sd; non-numeric columns report n and type with the numeric fields left null.

Return type:

DataFrame

Parameters:

columns (list[str] | None)
digits (int)

moderndive.count_missing(data, columns=None)

Count missing (null) values in each column.

A beginner-friendly alternative to df.select(pl.all().is_null().sum()): it returns a tidy two-column data frame with one row per column (column, n_missing), sorted from most to fewest missing values so the columns needing attention surface first.

Parameters:

data – A polars (or pandas) data frame.
columns (list[str] | None) – Optional list of column names to check; defaults to every column.

Return type:

DataFrame
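
The output shape described for count_missing() can be mocked up without a data frame library. In this sketch a dict of lists stands in for the data frame and None stands in for a null. count_missing_sketch is a hypothetical name, and the tie order among equally missing columns is an assumption.

```python
def count_missing_sketch(table, columns=None):
    """Hypothetical sketch: one row per column, most missing values first."""
    names = list(table) if columns is None else list(columns)
    rows = [
        {"column": name, "n_missing": sum(v is None for v in table[name])}
        for name in names
    ]
    # Stable sort: columns with equal counts keep their original order
    return sorted(rows, key=lambda row: row["n_missing"], reverse=True)


table = {"a": [1, None, 3], "b": [None, None, 1], "c": [1, 2, 3]}
missing = count_missing_sketch(table)
subset = count_missing_sketch(table, columns=["c", "a"])
```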

Sampling and plots

All plotting helpers accept engine="plotly" (default) or engine="plotnine".

moderndive.rep_slice_sample(data, n, reps=1, replace=False, seed=None)

Take reps samples of size n from data.

Returns a polars DataFrame with a leading replicate column identifying which sample each row belongs to. Set replace=True for sampling with replacement (e.g. bootstrap-style). Pass seed for reproducibility.

Return type:

DataFrame

Parameters:

data (DataFrame)
n (int)
reps (int)
replace (bool)
seed (int | None)
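
The repeated-sampling layout described for rep_slice_sample() can be illustrated with the standard library alone. Here a list of dicts stands in for the data frame. rep_sample_sketch is a hypothetical name, and numbering replicates from 1 is an assumption borrowed from the R convention.

```python
import random


def rep_sample_sketch(rows, n, reps=1, replace=False, seed=None):
    """Hypothetical sketch: stack `reps` samples of size `n` with a replicate id."""
    rng = random.Random(seed)
    out = []
    for rep in range(1, reps + 1):
        if replace:
            picked = rng.choices(rows, k=n)  # with replacement
        else:
            picked = rng.sample(rows, n)     # without replacement
        for row in picked:
            # replicate is the leading key of every output row
            out.append({"replicate": rep, **row})
    return out


bowl = [{"ball_id": i, "color": "red" if i % 3 == 0 else "white"} for i in range(1, 11)]
samples = rep_sample_sketch(bowl, n=4, reps=3, seed=42)
```

Grouping the result by replicate and summarizing each group gives one statistic per sample, which is how a sampling distribution is usually built from this layout.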

moderndive.rep_sample_n(data, n, reps=1, replace=False, seed=None)

Alias for rep_slice_sample() (older moderndive name).

Return type:

DataFrame

Parameters:

data (DataFrame)
n (int)
reps (int)
replace (bool)
seed (int | None)

moderndive.pairplot(data, columns=None, hue=None, *, engine='plotly')

Scatterplot matrix of the numeric columns (the analog of GGally::ggpairs).

engine="plotly" (default) returns a plotly go.Figure from plotly.express.scatter_matrix. engine="plotnine" (alias "seaborn") returns the matplotlib Figure from seaborn.pairplot — the non-plotly backend here is seaborn-backed, since plotnine has no first-class SPLOM. hue colors points by a categorical column.

Parameters:

data (DataFrame)
columns (list[str] | None)
hue (str | None)
engine (str)

moderndive.gg_parallel_slopes(data, response, explanatory, by, *, engine='plotly')

Scatterplot with a parallel-slopes regression model overlaid.

Fits response ~ explanatory + C(by) (one common slope, a separate intercept per level of by) and draws one fitted line per group over the data.

Parameters:

response (str)
explanatory (str)
by (str)
engine (str)
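
The parallel-slopes model (one common slope, a separate intercept per group) has a closed form that is easy to sketch without a regression library. The common slope is the pooled within-group covariance divided by the pooled within-group variance of the explanatory variable, and each intercept follows from the group means. This gives the same estimates as least squares with group indicators. parallel_slopes_fit is a hypothetical name, and the sketch makes no claim about how gg_parallel_slopes() fits its model internally.

```python
from collections import defaultdict


def parallel_slopes_fit(x, y, group):
    """Hypothetical sketch: common slope plus one intercept per group."""
    by_group = defaultdict(list)
    for xi, yi, gi in zip(x, y, group):
        by_group[gi].append((xi, yi))

    means = {}
    sxy = sxx = 0.0
    for g, pts in by_group.items():
        x_bar = sum(px for px, _ in pts) / len(pts)
        y_bar = sum(py for _, py in pts) / len(pts)
        means[g] = (x_bar, y_bar)
        # Pool the within-group cross-products and squared deviations
        sxy += sum((px - x_bar) * (py - y_bar) for px, py in pts)
        sxx += sum((px - x_bar) ** 2 for px, _ in pts)

    slope = sxy / sxx
    intercepts = {g: y_bar - slope * x_bar for g, (x_bar, y_bar) in means.items()}
    return slope, intercepts


# Two groups on exactly parallel lines: y = 1 + 2x and y = 5 + 2x
x = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0, 5.0, 7.0, 9.0]
group = ["a", "a", "a", "b", "b", "b"]
slope, intercepts = parallel_slopes_fit(x, y, group)
```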

moderndive.geom_parallel_slopes(data, response, explanatory, by, color=None)

plotnine layer(s) drawing the parallel-slopes fitted lines.

Add to a ggplot with + (plotnine-only; for a plotly version call gg_parallel_slopes() with engine="plotly").

Parameters:

response (str)
explanatory (str)
by (str)
color (str | None)

moderndive.gg_categorical_model(data, response, explanatory, *, engine='plotly')

Regression with one categorical predictor (~ geom_categorical_model).

Fits response ~ C(explanatory); each category’s fitted value is its group mean, drawn as a horizontal marker over the (jittered) data points.

Parameters:

response (str)
explanatory (str)
engine (str)
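
Since each category's fitted value in this model is its group mean, the fitted values can be reproduced with a simple grouped average. The sketch below is only an illustration with a hypothetical helper name, group_means.

```python
from collections import defaultdict


def group_means(values, categories):
    """Hypothetical sketch: fitted value per category = mean of its responses."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, category in zip(values, categories):
        sums[category] += value
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


fitted = group_means([1.0, 2.0, 3.0, 10.0], ["a", "a", "a", "b"])
```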

Datasets

moderndive.load_dataset(name)

Load a dataset by name, returning a polars DataFrame.

Return type:

DataFrame

Parameters:

name (str)

moderndive.data.available_datasets()

Return the sorted names of all loadable datasets.

Return type:

list[str]

Each dataset also has a convenience loader moderndive.load_<name>() returning a polars DataFrame. Call moderndive.data.available_datasets() for the full list.