Getting started

This page walks through a complete analysis end to end, then points you at the task guides for more depth.

Install

pip install moderndive

moderndive returns polars DataFrames, but every function also accepts pandas DataFrames as input.

Load a dataset

All datasets ship with the package and load with load_<name>():

import moderndive as md

yawn = md.load_mythbusters_yawn()
yawn.head()
shape: (5, 3)
subj  group      yawn
i64   str        str
1     "seed"     "yes"
2     "control"  "yes"
3     "seed"     "no"
4     "seed"     "yes"
5     "seed"     "no"

List everything that’s available with md.available_datasets() (58 datasets), and see Datasets for a thematic tour.

A first summary

tidy_summary returns one row per variable: the five-number summary plus mean and standard deviation for numeric columns, or counts for categorical ones:

from moderndive import tidy_summary

tidy_summary(md.load_almonds_sample_100(), columns=["weight"])
shape: (1, 11)
column    n    group  type       min  Q1   mean   median  Q3   max  sd
str       i64  str    str        f64  f64  f64    f64     f64  f64  f64
"weight"  100  null   "numeric"  2.9  3.4  3.682  3.7     3.9  4.5  0.362

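To make the numeric columns concrete, here is a minimal standard-library sketch of the same kind of summary. The data and the helper name numeric_summary are made up for illustration, and whether tidy_summary uses this quartile method or the sample standard deviation is an assumption.

```python
import statistics

# Hypothetical measurements, made up for illustration (not the almonds data).
weights = [2.9, 3.1, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.5]


def numeric_summary(values):
    """Return the per-column statistics named in the table header above."""
    # method="inclusive" interpolates between observed values, so the
    # quartiles never fall outside [min, max].
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "min": min(values),
        "Q1": q1,
        "mean": statistics.fmean(values),
        "median": median,
        "Q3": q3,
        "max": max(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


summary = numeric_summary(weights)
print(summary)
```

Different quartile conventions give slightly different Q1 and Q3 on small samples, so small discrepancies against other tools are expected.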
count_missing reports how many null values each column has, sorted worst-first — handy for a quick data-quality check:

from moderndive import count_missing

count_missing(md.load_evals())
shape: (14, 2)
column          n_missing
str             i64
"ID"            0
"prof_ID"       0
"score"         0
"age"           0
"bty_avg"       0
…               …
"pic_outfit"    0
"pic_color"     0
"cls_did_eval"  0
"cls_students"  0
"cls_level"     0

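The idea behind a worst-first missing-value count can be sketched without any DataFrame library. The data and the helper name count_nulls below are hypothetical, and this is not how count_missing is implemented.

```python
# Hypothetical column-oriented data; None marks a missing value.
data = {
    "score": [4.7, None, 3.9, 4.1],
    "age": [36, 59, None, None],
    "rank": ["tenured", "teaching", "tenured", "tenure track"],
}


def count_nulls(columns):
    """Count None entries per column, largest count first."""
    counts = [
        (name, sum(value is None for value in values))
        for name, values in columns.items()
    ]
    # sorted() is stable, so columns with equal counts keep their original order.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


missing = count_nulls(data)
print(missing)
```

Because the sort is stable, a dataset with no missing values at all comes back in its original column order.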
The inference pipeline

The core grammar mirrors the R infer package. You build a pipeline one verb at a time and read it like a sentence:

from moderndive import specify, observe, get_p_value

# 1. The observed statistic: do "seeded" people yawn more than the control group?
obs = observe(
    yawn, formula="yawn ~ group", success="yes",
    stat="diff in props", order=("seed", "control"),
)

# 2. A null distribution: specify → hypothesize → generate → calculate
null = (
    specify(yawn, formula="yawn ~ group", success="yes")
    .hypothesize(null="independence")
    .generate(reps=1000, type="permute", seed=42)
    .calculate(stat="diff in props", order=("seed", "control"))
)

# 3. Summarize
get_p_value(null, obs_stat=obs, direction="right")
shape: (1, 1)
p_value
f64
0.512
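What the permute step does can be sketched in plain Python. This is an illustration of a permutation test for a difference in proportions, not moderndive's implementation. The data are made up, the helper name diff_in_props is hypothetical, and counting ties with >= in the p-value is an assumption.

```python
import random

# Hypothetical data, made up for illustration: group labels and outcomes.
groups = ["seed"] * 10 + ["control"] * 10
yawned = [True] * 4 + [False] * 6 + [True] * 2 + [False] * 8


def diff_in_props(groups, outcomes, order=("seed", "control")):
    """Proportion of successes in order[0] minus that in order[1]."""
    def prop(label):
        hits = [o for g, o in zip(groups, outcomes) if g == label]
        return sum(hits) / len(hits)
    return prop(order[0]) - prop(order[1])


obs_stat = diff_in_props(groups, yawned)

rng = random.Random(42)  # fixed seed so the run is repeatable
null_stats = []
for _ in range(1000):
    # Under independence the outcomes are exchangeable across groups,
    # so shuffle them and recompute the statistic.
    shuffled = yawned[:]
    rng.shuffle(shuffled)
    null_stats.append(diff_in_props(groups, shuffled))

# Right-tailed p-value: share of permuted statistics at least as large
# as the observed one.
p_value = sum(s >= obs_stat for s in null_stats) / len(null_stats)
print(round(obs_stat, 3), p_value)
```

Each verb in the pipeline corresponds to one piece of this loop: specify picks the variables, hypothesize fixes the null model, generate does the shuffling, and calculate computes the statistic on each shuffle.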

Each verb has a focused guide: Sampling, Bootstrapping & confidence intervals, and Hypothesis testing.

Visualizing — choose your engine

Plots default to plotly (interactive). Pass engine="plotnine" for grammar-of-graphics output. The composition syntax is identical:

Note

The plots shown in this documentation are static images. Running the code yourself yields interactive plotly figures by default.

from moderndive import visualize, shade_p_value

# Interactive plotly figure
visualize(null) + shade_p_value(obs_stat=obs, direction="right")

# Same plot, plotnine
visualize(null, engine="plotnine") + shade_p_value(obs_stat=obs, direction="right")

See Plotting: plotly & plotnine for shading, confidence-interval overlays, theoretical overlays, and the regression-model plots.

Regression

import statsmodels.formula.api as smf
from moderndive import get_regression_table

houses = md.load_saratoga_houses()
# The statsmodels formula API works on pandas data, so convert the polars frame first.
model = smf.ols("price ~ living_area + bedrooms", data=houses.to_pandas()).fit()
get_regression_table(model)
shape: (3, 7)
term           estimate   std_error  statistic  p_value  lower_ci    upper_ci
str            f64        f64        f64        f64      f64         f64
"intercept"    20986.094  6816.251   3.079      0.002    7611.128    34361.06
"living_area"  93.842     3.109      30.183     0.0      87.741      99.943
"bedrooms"     -7483.095  2783.531   -2.688     0.007    -12944.988  -2021.203

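The columns of the table are related by simple arithmetic, which the following sketch checks for the living_area row. It assumes the usual OLS conventions (statistic = estimate / std_error, and a 95% interval of estimate ± quantile × std_error). That get_regression_table follows exactly these conventions is an assumption, and the normal quantile stands in for the t quantile.

```python
from statistics import NormalDist

# Values copied from the living_area row of the table above.
estimate, std_error = 93.842, 3.109

# Test statistic for the null hypothesis that the coefficient is zero.
statistic = estimate / std_error

# 95% interval, using the normal quantile as a large-sample stand-in
# for the t quantile that OLS software normally uses.
z = NormalDist().inv_cdf(0.975)
lower_ci = estimate - z * std_error
upper_ci = estimate + z * std_error

print(round(statistic, 3), round(lower_ci, 3), round(upper_ci, 3))
```

The results land close to the statistic, lower_ci, and upper_ci columns. Small gaps come from rounding in the printed table and from the normal approximation.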
Full details in Regression.