---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  name: python3
---

```{code-cell} python
:tags: [remove-input]
import matplotlib
matplotlib.use("Agg")
import plotly.io as pio
pio.renderers.default = "png"
```

# Regression

The regression helpers turn a fitted [statsmodels](https://www.statsmodels.org)
model into tidy polars tables — the analog of R `moderndive`'s
`get_regression_table()` / `get_regression_points()` / `get_regression_summaries()`.

Fit models with statsmodels' formula API, then tidy them:

```{code-cell} python
import polars as pl
import statsmodels.formula.api as smf
import moderndive as md
from moderndive import (
    get_regression_table, get_regression_points, get_regression_summaries, get_correlation,
)

houses = md.load_saratoga_houses()
model = smf.ols("price ~ living_area + bedrooms", data=houses.to_pandas()).fit()
```

## Coefficient table (with confidence intervals)

```{code-cell} python
get_regression_table(model)
```


Change the confidence level with `conf_level=` (e.g. `0.99`).

## Fitted values & residuals

```{code-cell} python
get_regression_points(model).head()
# columns: ID, price, living_area, bedrooms, price_hat, residual
```

Pass `ID=` to label rows by a column, and `newdata=` to predict on a held-out set
(a `residual` is included when the outcome is present in `newdata`):

```{code-cell} python
train = houses.head(800)
test = houses.tail(200)
model_train = smf.ols("price ~ living_area + bedrooms", data=train.to_pandas()).fit()
get_regression_points(model_train, newdata=test).head()
```

## Model-fit summaries

```{code-cell} python
get_regression_summaries(model)
# r_squared, adj_r_squared, mse, rmse, sigma, statistic (F), p_value, df, nobs
```

## Correlation

```{code-cell} python
get_correlation(houses, "price ~ living_area")   # ≈ 0.759
# or: get_correlation(houses, x="living_area", y="price")
```

Use `method=` for rank correlations (`"spearman"` or `"kendall"`):

```{code-cell} python
get_correlation(houses, "price ~ living_area", method="spearman")
```

Pass several predictors on the right-hand side to get one correlation per
predictor (long by default; `wide=True` for one column each):

```{code-cell} python
get_correlation(houses, "price ~ living_area + bedrooms + bathrooms", quiet=True)
```

## Logistic & other GLMs

The regression helpers also accept a fitted `glm()` model. For a logistic
regression, `get_regression_points()` returns fitted **probabilities** and
`get_regression_summaries()` returns a GLM-shaped summary (deviance, AIC, BIC, …).
Pass `exponentiate=True` to `get_regression_table()` for **odds ratios**:

```{code-cell} python
import statsmodels.api as sm

evals = md.load_evals().with_columns((pl.col("gender") == "male").cast(int).alias("is_male"))
logit = smf.glm("is_male ~ score + age", data=evals.to_pandas(),
                family=sm.families.Binomial()).fit()
get_regression_table(logit, exponentiate=True)
```

```{code-cell} python
get_regression_summaries(logit)
```

## Array-API models

You don't have to use the formula API. The helpers also accept models fit with
statsmodels' array API (`sm.OLS(y, X)` / `sm.GLM(...)`), which is handy when your
design matrix is already built. `get_regression_points()` reconstructs the
per-observation table from the design matrix (the constant column becomes the
`intercept` term and is dropped from the points):

```{code-cell} python
import statsmodels.api as sm

X = sm.add_constant(houses.select("living_area", "bedrooms").to_pandas())
y = houses["price"].to_pandas()
array_model = sm.OLS(y, X).fit()

get_regression_table(array_model)
```

```{code-cell} python
get_regression_points(array_model).head()
```

## 3D regression plane

`plot_3d_regression()` draws an interactive 3D scatter of an outcome against two
numeric predictors, with the fitted regression plane overlaid (a plotly figure):

```{code-cell} python
from moderndive import plot_3d_regression

plot_3d_regression(houses, "price ~ living_area + bedrooms")
```

## Visualizing models

Two ports of R `moderndive`'s ggplot helpers, both dual-engine. A
**parallel-slopes** model — one common slope with a separate intercept per group:

```{code-cell} python
from moderndive import gg_parallel_slopes, gg_categorical_model

evals = md.load_evals()
gg_parallel_slopes(evals, response="score", explanatory="age", by="gender")  # plotly
```

The same plot via the plotnine engine:

```{code-cell} python
gg_parallel_slopes(evals, response="score", explanatory="age", by="gender", engine="plotnine")
```

A regression with a single **categorical** predictor (`geom_categorical_model()`
is an alias of `gg_categorical_model()`):

```{code-cell} python
gg_categorical_model(evals, response="score", explanatory="rank")
```

For the plotnine engine you can also drop the parallel-slopes lines onto your own
`ggplot` with `geom_parallel_slopes()`.

## Inference for regression coefficients

`fit()` runs the regression on every bootstrap/permutation replicate, giving a
distribution per coefficient. Pair it with `visualize_fit` and **per-facet**
shading:

```{code-cell} python
from moderndive import specify
from moderndive.infer.viz import visualize_fit
from moderndive import shade_confidence_interval, shade_p_value

f = "price ~ living_area + bedrooms"
obs_fit = houses.specify(formula=f).fit()

# Bootstrap distribution of each coefficient → per-term CIs
boot_fit = houses.specify(formula=f).generate(reps=1000, type="bootstrap", seed=1).fit()
boot_fit.get_confidence_interval(level=0.95)          # one row per term
visualize_fit(boot_fit) + shade_confidence_interval(boot_fit.get_confidence_interval())

# Null distribution → per-term p-values, each facet shaded at its own estimate
null_fit = (
    houses.specify(formula=f).hypothesize(null="independence")
    .generate(reps=1000, type="permute", seed=1).fit()
)
null_fit.get_p_value(obs_stat=obs_fit, direction="two-sided")
visualize_fit(null_fit) + shade_p_value(obs_stat=obs_fit, direction="two-sided")
```

Per-facet shading works in both engines: `shade_p_value` / `shade_confidence_interval`
accept the observed `FitResult` or a term-keyed table, and each facet is shaded
from its own value.