Getting started

This page walks through a complete analysis end to end, then points you at the task guides for more depth.

Install

pip install moderndive

moderndive returns polars DataFrames, but every function also accepts pandas DataFrames as input.

Load a dataset

All datasets ship with the package and load with load_<name>():

import moderndive as md

yawn = md.load_mythbusters_yawn()
yawn.head()

shape: (5, 3)

subj	group	yawn
i64	str	str
1	"seed"	"yes"
2	"control"	"yes"
3	"seed"	"no"
4	"seed"	"yes"
5	"seed"	"no"

List everything that’s available with md.available_datasets() (58 datasets), and see Datasets for a thematic tour.

A first summary

tidy_summary gives a per-variable summary: the five-number summary plus mean and standard deviation for numeric columns, or counts for categorical ones:

from moderndive import tidy_summary

tidy_summary(md.load_almonds_sample_100(), columns=["weight"])

shape: (1, 11)

column	n	group	type	min	Q1	mean	median	Q3	max	sd
str	i64	str	str	f64	f64	f64	f64	f64	f64	f64
"weight"	100	null	"numeric"	2.9	3.4	3.682	3.7	3.9	4.5	0.362
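To make the numeric columns of that output concrete, here is a minimal standard-library sketch that computes the same kinds of statistics for a plain list. It is not how tidy_summary is implemented; the `summarize` helper and the weights are made up for illustration, and the quartile convention (`method="inclusive"`) is an assumption, so Q1 and Q3 may differ slightly from what the package reports.

```python
import statistics


def summarize(values):
    """Return n, min, quartiles, mean, median, max and sample sd for a numeric list."""
    # Quartile conventions vary between libraries; "inclusive" is an assumption
    # made for this sketch, not a statement about tidy_summary.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "min": min(values),
        "Q1": q1,
        "mean": statistics.mean(values),
        "median": q2,
        "Q3": q3,
        "max": max(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


# Hypothetical weights, not the almonds data.
weights = [3.1, 3.4, 3.6, 3.7, 3.9, 4.2]
summary = summarize(weights)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For a categorical column, the analogous summary would be a count per distinct value.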

count_missing reports how many null values each column has, sorted worst-first — handy for a quick data-quality check:

from moderndive import count_missing

count_missing(md.load_evals())

shape: (14, 2)

column	n_missing
str	i64
"ID"	0
"prof_ID"	0
"score"	0
"age"	0
"bty_avg"	0
…	…
"pic_outfit"	0
"pic_color"	0
"cls_did_eval"	0
"cls_students"	0
"cls_level"	0
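Because evals has no missing values, the worst-first ordering is not visible in the output above. The following sketch shows the idea on a small made-up table stored as a dict of lists. The helper name `count_missing_sketch` and the data are hypothetical, and this is not the package's implementation.

```python
def count_missing_sketch(columns):
    """Count None entries per column and return (column, n_missing) pairs, largest first."""
    counts = {
        name: sum(value is None for value in values)
        for name, values in columns.items()
    }
    # sorted() is stable, so columns with equal counts keep their original order.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data with a few gaps.
table = {
    "score": [4.7, None, 3.9],
    "age": [36, 36, 59],
    "bty_avg": [None, None, 5.0],
}
result = count_missing_sketch(table)
for column, n_missing in result:
    print(column, n_missing)
```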

The inference pipeline

The core grammar mirrors the R package infer. You build a pipeline and read it like a sentence:

from moderndive import specify, observe, get_p_value

# 1. The observed statistic: do "seeded" people yawn more than the control group?
obs = observe(
    yawn, formula="yawn ~ group", success="yes",
    stat="diff in props", order=("seed", "control"),
)

# 2. A null distribution: specify → hypothesize → generate → calculate
null = (
    yawn.specify(formula="yawn ~ group", success="yes")
    .hypothesize(null="independence")
    .generate(reps=1000, type="permute", seed=42)
    .calculate(stat="diff in props", order=("seed", "control"))
)

# 3. Summarize
get_p_value(null, obs_stat=obs, direction="right")

shape: (1, 1)

p_value
f64
0.512
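For intuition about what the permutation pipeline computes, here is a dependency-free sketch of the same idea: shuffle the outcome labels to simulate independence, recompute the difference in proportions each time, and report the share of shuffled statistics at least as large as the observed one. The function names and the data are made up for illustration; this is not moderndive's implementation, and its random numbers will not match the package's even with the same seed.

```python
import random


def diff_in_props(groups, outcomes, success, order):
    """Proportion of `success` in order[0] minus the proportion in order[1]."""
    def prop(label):
        picked = [o for g, o in zip(groups, outcomes) if g == label]
        return sum(o == success for o in picked) / len(picked)

    return prop(order[0]) - prop(order[1])


def permutation_p_value(groups, outcomes, success, order, reps=1000, seed=42):
    """Right-tailed permutation p-value for a difference in proportions."""
    rng = random.Random(seed)
    observed = diff_in_props(groups, outcomes, success, order)
    shuffled = list(outcomes)
    null_stats = []
    for _ in range(reps):
        # Under independence, outcome labels are exchangeable across groups.
        rng.shuffle(shuffled)
        null_stats.append(diff_in_props(groups, shuffled, success, order))
    # Right tail: share of null statistics at least as large as the observed one.
    p_value = sum(stat >= observed for stat in null_stats) / reps
    return observed, p_value


# Hypothetical data, not the mythbusters_yawn dataset.
groups = ["seed"] * 10 + ["control"] * 10
outcomes = ["yes"] * 4 + ["no"] * 6 + ["yes"] * 3 + ["no"] * 7
observed, p_value = permutation_p_value(groups, outcomes, "yes", ("seed", "control"))
print(observed, p_value)
```

A large p-value, like the 0.512 reported above, means shuffled data produce a difference at least this big about half the time, so the observed difference is unremarkable under the null.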

Each verb has a focused guide: Sampling, Bootstrapping & confidence intervals, and Hypothesis testing.

Visualizing — choose your engine

Plots default to plotly (interactive). Pass engine="plotnine" for grammar-of-graphics output. The composition syntax is identical:

Note

The plots shown in this documentation are static images. Running the code yourself yields interactive plotly figures by default.

from moderndive import visualize, shade_p_value

# Interactive plotly figure
visualize(null) + shade_p_value(obs_stat=obs, direction="right")

# Same plot, plotnine
visualize(null, engine="plotnine") + shade_p_value(obs_stat=obs, direction="right")

[Figure: null distribution with the p-value region shaded]

See Plotting: plotly & plotnine for shading, confidence-interval overlays, theoretical overlays, and the regression-model plots.

Regression

import statsmodels.formula.api as smf
from moderndive import get_regression_table

houses = md.load_saratoga_houses()
model = smf.ols("price ~ living_area + bedrooms", data=houses.to_pandas()).fit()
get_regression_table(model)

shape: (3, 7)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
str	f64	f64	f64	f64	f64	f64
"intercept"	20986.094	6816.251	3.079	0.002	7611.128	34361.06
"living_area"	93.842	3.109	30.183	0.0	87.741	99.943
"bedrooms"	-7483.095	2783.531	-2.688	0.007	-12944.988	-2021.203
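The columns of this table are related in the usual OLS way, which the short check below illustrates using only the rounded numbers printed above. It assumes that statistic is estimate divided by std_error and that the interval is symmetric around the estimate; the implied multiplier of about 1.96 is consistent with a 95% interval, although the confidence level is an inference from the numbers rather than something stated on this page.

```python
# Rows copied from the printed table:
# (term, estimate, std_error, statistic, lower_ci, upper_ci)
rows = [
    ("intercept", 20986.094, 6816.251, 3.079, 7611.128, 34361.06),
    ("living_area", 93.842, 3.109, 30.183, 87.741, 99.943),
    ("bedrooms", -7483.095, 2783.531, -2.688, -12944.988, -2021.203),
]

checks = []
for term, estimate, std_error, statistic, lower_ci, upper_ci in rows:
    # Test statistic recomputed from the rounded estimate and standard error.
    t_ratio = estimate / std_error
    # Half-width of the interval expressed in standard errors.
    multiplier = (upper_ci - lower_ci) / (2 * std_error)
    checks.append((term, statistic, t_ratio, multiplier))
    print(f"{term}: t = {t_ratio:.3f} (table {statistic}), multiplier = {multiplier:.3f}")
```

Small discrepancies in the last digit come from working with rounded table values rather than the fitted model itself.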

Full details in Regression.