API reference

Inference grammar

moderndive.specify(data, *, response=None, explanatory=None, formula=None, success=None)[source]

Specify the response (and optional explanatory) variable(s) for inference.

Two equivalent calling forms (keyword arguments or a formula string), mirroring R infer:

  • specify(df, response="weight")

  • specify(df, formula="popular_or_not ~ track_genre", success="popular")

Multi-term formulas ("y ~ a + b") are supported for fit().

Return type:

Specification

Parameters:
  • data (DataFrame)

  • response (str | None)

  • explanatory (str | None)

  • formula (str | None)

  • success (object | None)

moderndive.observe(data, *, response=None, explanatory=None, formula=None, success=None, stat='mean', order=None, null=None, mu=None, p=None, sigma=None)[source]

Shortcut for specify() |> [hypothesize()] |> calculate().

Mirrors R infer::observe() — compute an observed statistic in one call.

Return type:

ObservedStatistic

moderndive.assume(distribution, df=None)[source]

Set a theoretical distribution ("t", "z", "F", "Chisq").

df is the degrees of freedom: a scalar for t/Chisq, a (df1, df2) tuple for F, and unused for z.

Return type:

TheoreticalDistribution

class moderndive.infer.core.Specification(data, response, explanatory=None, success=None, formula=None)[source]

Result of specify(): the chosen response (+ optional explanatory).

Parameters:
  • data (DataFrame)

  • response (str)

  • explanatory (str | None)

  • success (object | None)

  • formula (str | None)

calculate(stat, *, order=None, mu=None, p=None, sigma=None)[source]

Compute the observed statistic (no resampling).

Return type:

ObservedStatistic

assume(distribution, df=None)[source]

Set a theoretical sampling distribution (see assume()).

fit()[source]

Fit the observed regression (ordinary least squares) for the formula.

Return type:

FitResult

class moderndive.infer.core.Hypothesis(spec, null, mu=None, p=None, sigma=None)[source]
Parameters:
calculate(stat, *, order=None)[source]

Observed statistic that needs the hypothesized value (e.g. stat="t"/"z").

Return type:

ObservedStatistic

Parameters:

order (tuple[object, object] | None)

class moderndive.infer.core.GeneratedReplicates(spec, type, null, plans, shifted_response=None, hyp_mu=None, hyp_p=None, hyp_sigma=None)[source]

Materialized resampling plan; supports both calculate() and fit().

plans holds one numpy array per replicate:

  • bootstrap: row indices (with replacement) into the data,

  • permute: a permutation of the row positions (used to shuffle a column),

  • draw: a simulated response array under a point null proportion.
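The three plan types can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: make_plans is a hypothetical helper, and it uses lists and the standard random module where the library stores numpy arrays.

```python
import random


def make_plans(n_rows, reps, kind, p=None, seed=None):
    """Build one plan per replicate (hypothetical helper, not moderndive code)."""
    rng = random.Random(seed)
    plans = []
    for _ in range(reps):
        if kind == "bootstrap":
            # Row indices drawn with replacement from 0..n_rows-1.
            plan = [rng.randrange(n_rows) for _ in range(n_rows)]
        elif kind == "permute":
            # A shuffled copy of the row positions, used to reorder one column.
            plan = list(range(n_rows))
            rng.shuffle(plan)
        elif kind == "draw":
            # Simulated success/failure responses under a point null proportion p.
            if p is None:
                raise ValueError("kind='draw' needs a null proportion p")
            plan = [rng.random() < p for _ in range(n_rows)]
        else:
            raise ValueError(f"unknown kind: {kind!r}")
        plans.append(plan)
    return plans


boot = make_plans(5, reps=3, kind="bootstrap", seed=1)
perm = make_plans(5, reps=3, kind="permute", seed=1)
draw = make_plans(5, reps=3, kind="draw", p=0.5, seed=1)
```

Storing only indices (or simulated responses) keeps each replicate cheap: the data itself is not copied until a statistic or fit is computed.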

fit()[source]

Fit OLS on each replicate, returning per-term estimates (long form).

Return type:

FitResult

class moderndive.infer.core.Distribution(data, stat, null=None, type='bootstrap')[source]

A simulated distribution of statistics (one row per replicate).

Parameters:
  • data (DataFrame)

  • stat (str)

  • null (str | None)

  • type (str)

class moderndive.infer.core.FitResult(data, null=None, type='bootstrap')[source]

Regression coefficients: observed (one row per term) or a distribution of them across replicates (replicate, term, estimate).

Parameters:
  • data (DataFrame)

  • null (str | None)

  • type (str)

class moderndive.infer.theoretical.TheoreticalDistribution(distribution, df=None)[source]

A named theoretical distribution (from assume()).

get_p_value(obs_stat, direction)[source]

Theory-based p-value for a (standardized) observed statistic.

Return type:

DataFrame

Parameters:

direction (str)

visualize(bins=100, *, engine='plotly')[source]

Plot the theoretical density curve (plotly by default; engine="plotnine").


Getters and visualization

moderndive.get_p_value(distribution, obs_stat, direction)[source]

Return a one-row frame with the simulation-based p_value.

direction is one of right/greater, left/less, or two-sided. The two-sided p-value uses infer’s convention: twice the smaller one-sided tail proportion, capped at 1.

Return type:

DataFrame
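The two-sided convention described above can be illustrated with a small stand-alone sketch. simulation_p_value is a hypothetical name, and counting ties with >= and <= is an assumption; the library's handling of ties is not stated here.

```python
def simulation_p_value(stats, obs_stat, direction):
    """Tail proportion of simulated statistics at least as extreme as obs_stat."""
    n = len(stats)
    right = sum(s >= obs_stat for s in stats) / n  # upper-tail proportion
    left = sum(s <= obs_stat for s in stats) / n   # lower-tail proportion
    if direction in ("right", "greater"):
        return right
    if direction in ("left", "less"):
        return left
    if direction == "two-sided":
        # Twice the smaller one-sided tail, capped at 1.
        return min(1.0, 2 * min(left, right))
    raise ValueError(f"unknown direction: {direction!r}")


null_stats = [-2.0, -1.0, 0.0, 1.0, 2.0]
p_right = simulation_p_value(null_stats, 1.0, "right")
p_two = simulation_p_value(null_stats, 1.0, "two-sided")
p_capped = simulation_p_value(null_stats, 0.0, "two-sided")
```

The cap matters when the observed statistic sits near the center of the distribution: both tails then exceed one half, and doubling would otherwise give a value above 1.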

moderndive.get_confidence_interval(distribution, level=0.95, type='percentile', *, point_estimate=None)[source]

Return a one-row frame with lower_ci and upper_ci.

  • type="percentile": the (1-level)/2 and 1-(1-level)/2 quantiles of the bootstrap distribution.

  • type="se": point_estimate ± z* · SE where SE is the SD of the bootstrap distribution (requires point_estimate).

Return type:

DataFrame
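A minimal sketch of the two interval types, using only the standard library. The helper names are hypothetical, and two details are assumptions: quantiles use linear interpolation, and SE is the sample standard deviation (n - 1 denominator). The library may differ on both.

```python
from statistics import NormalDist, stdev


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of pre-sorted values, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def percentile_ci(stats, level=0.95):
    """The (1-level)/2 and 1-(1-level)/2 quantiles of the bootstrap statistics."""
    vals = sorted(stats)
    alpha = 1 - level
    return quantile(vals, alpha / 2), quantile(vals, 1 - alpha / 2)


def se_ci(stats, point_estimate, level=0.95):
    """point_estimate +/- z* . SE, with SE the SD of the bootstrap statistics."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = stdev(stats)
    return point_estimate - z_star * se, point_estimate + z_star * se


# Stand-in "bootstrap distribution": the values 0, 1, ..., 100.
boot_stats = [float(v) for v in range(101)]
pct_lower, pct_upper = percentile_ci(boot_stats, level=0.90)
se_lower, se_upper = se_ci(boot_stats, point_estimate=50.0, level=0.95)
```

The percentile interval needs no point estimate because it reads both endpoints off the distribution; the SE interval is centered on point_estimate, which is why that argument is required.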

moderndive.visualize(distribution, bins=20, *, engine='plotly', method='simulation', shade_pvalue=None, shade_ci=None, **kwargs)[source]

Histogram of the simulated statistics, as an InferPlot.

method is "simulation" (histogram, default), "theoretical" (a normal-approximation density curve), or "both" (histogram in density units overlaid with the normal curve), mirroring R infer’s visualize(method=). Pass shade_pvalue=/shade_ci= to shade in one call, or compose with +.

Return type:

InferPlot

moderndive.shade_p_value(obs_stat, direction, *, color=None)[source]

A p-value shading spec; add it to a visualize() plot with +.

direction ∈ {right/greater, left/less, two-sided}. For a faceted visualize_fit() plot, pass a per-term obs_stat — an observed FitResult, a term-keyed frame, or a dict — to shade each facet.

Return type:

ShadeSpec

Parameters:
  • direction (str)

  • color (str | None)

moderndive.shade_confidence_interval(endpoints, color=None)[source]

A confidence-interval shading spec; add it to a visualize() plot with +.

endpoints is a CI DataFrame (lower_ci/upper_ci) or a (lower, upper) tuple. For a faceted visualize_fit() plot, pass a per-term CI table (with a term column) to shade each facet from its own interval.

Return type:

ShadeSpec

Parameters:

color (str | None)

Theory-based tests

moderndive.t_test(data, *, formula=None, response=None, explanatory=None, order=None, alternative='two-sided', mu=0.0, conf_level=0.95)[source]

One-sample (no explanatory) or two-sample (Welch) t-test, tidy output.

Return type:

DataFrame
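For reference, the textbook one-sample and Welch statistics can be computed by hand as below. This is a sketch with hypothetical helper names, not the library's code; in particular, the sign of the two-sample statistic depends on the subtraction order, which t_test() controls through order.

```python
import math
from statistics import mean, variance


def one_sample_t(x, mu=0.0):
    """t = (mean - mu) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(x)
    return (mean(x) - mu) / math.sqrt(variance(x) / n), n - 1


def welch_t(x, y):
    """Welch statistic (unequal variances) and Welch-Satterthwaite df."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny  # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


t_one, df_one = one_sample_t([1.0, 2.0, 3.0, 4.0, 5.0], mu=2.0)
t_two, df_two = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```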

moderndive.t_stat(data, **kwargs)[source]

The t statistic only (see t_test()).

Return type:

float

Parameters:

data (DataFrame)

moderndive.prop_test(data, *, formula=None, response=None, explanatory=None, success=None, order=None, p=None, alternative='two-sided')[source]

One- or two-proportion z-test (normal approximation), tidy output.

Return type:

DataFrame

moderndive.chisq_test(data, *, formula=None, response=None, explanatory=None)[source]

Chi-squared test of independence (categorical response ~ categorical explanatory).

Return type:

DataFrame

Parameters:
  • data (DataFrame)

  • formula (str | None)

  • response (str | None)

  • explanatory (str | None)
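The statistic behind a test of independence can be sketched from a table of counts. chisq_statistic is a hypothetical helper, and it applies no continuity correction; whether chisq_test() applies one for 2x2 tables is not stated here.

```python
def chisq_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Rows: levels of the explanatory variable; columns: levels of the response.
chi2, dof = chisq_statistic([[10, 20], [20, 10]])
```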

moderndive.chisq_stat(data, **kwargs)[source]

The chi-squared statistic only (see chisq_test()).

Return type:

float

Parameters:

data (DataFrame)

Theory-based inference wrappers (scipy.stats).

The book deliberately teaches simulation-based inference first, then ties results back to the traditional theory-based methods (t-distribution CIs, the two-sample test, normal approximations). These helpers provide those theory-based companions so the chapters can draw the simulation-vs-theory comparison.

All functions return small polars frames with tidy column names.

moderndive.theory.t_test_one_sample(x, mu=0.0, alternative='two-sided')[source]

One-sample t-test of H0: mean == mu.

Return type:

DataFrame

moderndive.theory.t_test_two_sample(x, y, alternative='two-sided', equal_var=False)[source]

Two-sample (Welch by default) t-test of equal means.

Return type:

DataFrame

Parameters:
  • alternative (str)

  • equal_var (bool)

moderndive.theory.t_confidence_interval(x, level=0.95)[source]

Theory-based t confidence interval for a single mean.

Return type:

DataFrame

Parameters:

level (float)

moderndive.theory.prop_test_two_sample(successes, totals, alternative='two-sided')[source]

Two-sample z-test for a difference in proportions (normal approximation).

Return type:

DataFrame
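A hand-rolled version of the usual two-proportion z-test is sketched below. It assumes the pooled standard error, which is the common textbook form under the null of equal proportions; the library may use a different variant, and two_prop_z is a hypothetical name.

```python
import math
from statistics import NormalDist


def two_prop_z(successes, totals):
    """z statistic and two-sided p-value for p1 - p2 (pooled standard error)."""
    (s1, s2), (n1, n2) = successes, totals
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)  # common proportion under H0: p1 == p2
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z_stat, p_val = two_prop_z(successes=(45, 30), totals=(100, 100))
```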


Regression & summary helpers

moderndive.get_regression_table(model, digits=3, conf_level=0.95)[source]

Tidy regression table: term, estimate, std_error, statistic, p_value, lower/upper_ci.

model is a fitted statsmodels results object (e.g. from statsmodels.formula.api.ols("y ~ x", data).fit()).

Return type:

DataFrame

moderndive.get_regression_points(model, digits=3)[source]

Fitted values + residuals per observation (~ broom::augment).

Columns: ID, the response, each explanatory term, <response>_hat, residual.

Return type:

DataFrame

Parameters:

digits (int)

moderndive.get_regression_summaries(model, digits=3)[source]

Model-fit summaries as a tidy 1-row frame (~ moderndive::get_regression_summaries).

Columns: r_squared, adj_r_squared, mse, rmse, sigma, statistic (overall F), p_value, df (model degrees of freedom), nobs. model is a fitted statsmodels results object.

mse is the mean squared residual using n in the denominator (so rmse = sqrt(mse)), while sigma is the residual standard error using n - p — matching the R package.

Return type:

DataFrame

Parameters:

digits (int)
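The difference between mse/rmse and sigma comes down to the denominator. The sketch below works from a list of residuals; fit_summaries is a hypothetical helper, and n_params is assumed to count every estimated coefficient, including the intercept.

```python
import math


def fit_summaries(residuals, n_params):
    """mse and rmse divide by n; sigma divides by n - p."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)  # residual sum of squares
    mse = rss / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "sigma": math.sqrt(rss / (n - n_params)),
    }


# Four residuals from a model with an intercept and one slope (p = 2).
summaries = fit_summaries([1.0, -1.0, 2.0, -2.0], n_params=2)
```

Because n - p is smaller than n, sigma is always at least as large as rmse, and the gap shrinks as the sample grows.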

moderndive.get_correlation(data, formula=None, *, x=None, y=None)[source]

Pearson correlation as a tidy 1-row frame with a cor column.

Mirrors moderndive::get_correlation(data, y ~ x). Specify the variable pair either as a formula string ("y ~ x") or via the x= and y= keyword arguments. Rows with a null in either column are dropped.

Return type:

DataFrame

Parameters:
  • formula (str | None)

  • x (str | None)

  • y (str | None)
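The row-dropping rule can be illustrated with a plain-Python Pearson correlation. pearson_complete_cases is a hypothetical helper that treats None as the null marker; it is a sketch of the behavior described above, not the library's code.

```python
import math


def pearson_complete_cases(xs, ys):
    """Pearson correlation over the rows where neither value is None."""
    pairs = [(a, b) for a, b in zip(xs, ys) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


# Rows 3 and 4 each contain a null, so only three rows are used.
cor = pearson_complete_cases([1, 2, None, 4, 5], [2, 4, 6, None, 10])
```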

moderndive.pop_sd(x)[source]

Population standard deviation (divides by n, not n - 1).

Mirrors moderndive::pop_sd. Accepts a polars Series, list, numpy array, or any sequence; nulls/NaNs are dropped before computing.

Return type:

float
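A minimal sketch of the same computation, assuming None and float NaN are the missing-value markers (pop_sd_sketch is a hypothetical stand-in, not the library function):

```python
import math


def pop_sd_sketch(values):
    """Population SD: drop None/NaN, then divide the squared deviations by n."""
    clean = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n = len(clean)
    center = sum(clean) / n
    return math.sqrt(sum((v - center) ** 2 for v in clean) / n)


sd = pop_sd_sketch([2, 4, 4, 4, 5, 5, 7, 9, None, float("nan")])
```

For the eight remaining values the mean is 5 and the mean squared deviation is 4, so the result is 2.0; the sample standard deviation (n - 1 denominator) of the same values would be larger.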

moderndive.tidy_summary(data, columns=None, digits=3)[source]

Per-variable summary statistics for the selected columns.

Mirrors the R moderndive::tidy_summary column layout: column, n, group, type, min, Q1, mean, median, Q3, max, sd. Numeric columns get the five-number summary + mean/sd; non-numeric columns report n and type with the numeric fields left null.

Return type:

DataFrame

moderndive.count_missing(data, columns=None)[source]

Count missing (null) values in each column.

A beginner-friendly alternative to df.select(pl.all().is_null().sum()): it returns a tidy two-column data frame with one row per column (column, n_missing), sorted from most to fewest missing values so the columns needing attention surface first.

Parameters:
  • data – A polars (or pandas) data frame.

  • columns (list[str] | None) – Optional list of column names to check; defaults to every column.

Return type:

DataFrame
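The output shape can be sketched without any data-frame library. Here a table is a dict of column name to list of values with None as the null marker; count_missing_sketch is a hypothetical stand-in for illustration only.

```python
def count_missing_sketch(table, columns=None):
    """Return (column, n_missing) pairs, most missing values first."""
    names = list(table) if columns is None else list(columns)
    counts = [(name, sum(v is None for v in table[name])) for name in names]
    # Sort descending by count; ties keep their original column order.
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts


table = {"a": [1, None, 3], "b": [None, None, 3], "c": [1, 2, 3]}
missing = count_missing_sketch(table)
missing_subset = count_missing_sketch(table, columns=["c", "a"])
```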

Sampling and plots

All plotting helpers accept engine="plotly" (default) or engine="plotnine".

moderndive.rep_slice_sample(data, n, reps=1, replace=False, seed=None)[source]

Take reps samples of size n from data.

Returns a polars DataFrame with a leading replicate column identifying which sample each row belongs to. Set replace=True for sampling with replacement (e.g. bootstrap-style). Pass seed for reproducibility.

Return type:

DataFrame

Parameters:
  • data (DataFrame)

  • n (int)

  • reps (int)

  • replace (bool)

  • seed (int | None)
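The repeated-sampling idea can be sketched with the standard random module. Rows are plain dicts here rather than a polars DataFrame, rep_slice_sample_sketch is a hypothetical helper, and numbering replicates from 1 is an assumption borrowed from the R function.

```python
import random


def rep_slice_sample_sketch(rows, n, reps=1, replace=False, seed=None):
    """Take reps samples of size n, tagging each row with its replicate."""
    rng = random.Random(seed)
    out = []
    for rep in range(1, reps + 1):
        if replace:
            # With replacement: the same row may be picked more than once.
            picked = [rng.choice(rows) for _ in range(n)]
        else:
            # Without replacement: n distinct rows.
            picked = rng.sample(rows, n)
        out.extend({"replicate": rep, **row} for row in picked)
    return out


rows = [{"id": i} for i in range(10)]
samples = rep_slice_sample_sketch(rows, n=3, reps=4, seed=42)
```

Calling the helper again with the same seed returns the same rows, which is the point of passing seed.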

moderndive.rep_sample_n(data, n, reps=1, replace=False, seed=None)[source]

Alias for rep_slice_sample() (older moderndive name).

Return type:

DataFrame

Parameters:
  • data (DataFrame)

  • n (int)

  • reps (int)

  • replace (bool)

  • seed (int | None)

moderndive.pairplot(data, columns=None, hue=None, *, engine='plotly')[source]

Scatterplot matrix of the numeric columns (the analog of GGally::ggpairs).

engine="plotly" (default) returns a plotly go.Figure from plotly.express.scatter_matrix. engine="plotnine" (alias "seaborn") returns the matplotlib Figure from seaborn.pairplot — the non-plotly backend here is seaborn-backed, since plotnine has no first-class SPLOM. hue colors points by a categorical column.

Parameters:
  • data (DataFrame)

  • columns (list[str] | None)

  • hue (str | None)

  • engine (str)

moderndive.gg_parallel_slopes(data, response, explanatory, by, *, engine='plotly')[source]

Scatterplot with a parallel-slopes regression model overlaid.

Fits response ~ explanatory + C(by) (one common slope, a separate intercept per level of by) and draws one fitted line per group over the data.
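The least-squares fit of this model has a closed form, sketched below in plain Python: center x and y within each group, estimate the common slope from the pooled centered data, then back out one intercept per group. parallel_slopes is a hypothetical helper. A formula-based fit would typically report a baseline intercept plus per-level offsets, which describes the same fitted lines.

```python
from collections import defaultdict


def parallel_slopes(x, y, group):
    """Common slope and per-group intercepts for y ~ x + C(group)."""
    by_group = defaultdict(list)
    for xi, yi, gi in zip(x, y, group):
        by_group[gi].append((xi, yi))
    # Per-group means of x and y.
    means = {
        g: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for g, pts in by_group.items()
    }
    # Pool the within-group centered cross-products and squares.
    sxy = sum((xi - means[g][0]) * (yi - means[g][1])
              for g, pts in by_group.items() for xi, yi in pts)
    sxx = sum((xi - means[g][0]) ** 2
              for g, pts in by_group.items() for xi, _ in pts)
    slope = sxy / sxx
    intercepts = {g: my - slope * mx for g, (mx, my) in means.items()}
    return slope, intercepts


x = [1, 2, 3, 1, 2, 3]
y = [3, 5, 7, 10, 12, 14]  # group a follows 1 + 2x, group b follows 8 + 2x
slope, intercepts = parallel_slopes(x, y, list("aaabbb"))
```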

moderndive.geom_parallel_slopes(data, response, explanatory, by, color=None)[source]

plotnine layer(s) drawing the parallel-slopes fitted lines.

Add to a ggplot with + (plotnine-only; for a plotly version call gg_parallel_slopes() with engine="plotly").

Parameters:
  • response (str)

  • explanatory (str)

  • by (str)

  • color (str | None)

moderndive.gg_categorical_model(data, response, explanatory, *, engine='plotly')[source]

Regression with one categorical predictor (~ geom_categorical_model).

Fits response ~ C(explanatory); each category’s fitted value is its group mean, drawn as a horizontal marker over the (jittered) data points.

Parameters:
  • response (str)

  • explanatory (str)

  • engine (str)
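Since each category's fitted value is its group mean, the fitted values can be sketched without fitting a regression at all. group_means is a hypothetical helper for illustration.

```python
from collections import defaultdict


def group_means(values, categories):
    """Mean of values within each category (the fitted values of y ~ C(cat))."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, category in zip(values, categories):
        sums[category] += value
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


fitted = group_means([1.0, 3.0, 10.0, 20.0], ["a", "a", "b", "b"])
```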

Datasets

moderndive.load_dataset(name)[source]

Load a dataset by name, returning a polars DataFrame.

Return type:

DataFrame

Parameters:

name (str)

moderndive.data.available_datasets()[source]

Return the sorted names of all loadable datasets.

Return type:

list[str]

Each dataset also has a convenience loader moderndive.load_<name>() returning a polars DataFrame. Call moderndive.data.available_datasets() for the full list.