--- jupytext: text_representation: extension: .md format_name: myst kernelspec: display_name: Python 3 name: python3 --- ```{code-cell} python :tags: [remove-input] import matplotlib matplotlib.use("Agg") import plotly.io as pio pio.renderers.default = "png" ``` # Regression The regression helpers turn a fitted [statsmodels](https://www.statsmodels.org) model into tidy polars tables — the analog of R `moderndive`'s `get_regression_table()` / `get_regression_points()` / `get_regression_summaries()`. Fit models with statsmodels' formula API, then tidy them: ```{code-cell} python import polars as pl import statsmodels.formula.api as smf import moderndive as md from moderndive import ( get_regression_table, get_regression_points, get_regression_summaries, get_correlation, ) houses = md.load_saratoga_houses() model = smf.ols("price ~ living_area + bedrooms", data=houses.to_pandas()).fit() ``` ## Coefficient table (with confidence intervals) ```{code-cell} python get_regression_table(model) ``` Change the confidence level with `conf_level=` (e.g. `0.99`). ## Fitted values & residuals ```{code-cell} python get_regression_points(model).head() # columns: ID, price, living_area, bedrooms, price_hat, residual ``` Pass `ID=` to label rows by a column, and `newdata=` to predict on a held-out set (a `residual` is included when the outcome is present in `newdata`): ```{code-cell} python train = houses.head(800) test = houses.tail(200) model_train = smf.ols("price ~ living_area + bedrooms", data=train.to_pandas()).fit() get_regression_points(model_train, newdata=test).head() ``` ## Model-fit summaries ```{code-cell} python get_regression_summaries(model) # r_squared, adj_r_squared, mse, rmse, sigma, statistic (F), p_value, df, nobs ``` ## Correlation ```{code-cell} python get_correlation(houses, "price ~ living_area") # ≈ 0.759 # or: get_correlation(houses, x="living_area", y="price") ``` Use `method=` for rank correlations (`"spearman"` or `"kendall"`): ```{code-cell} python get_correlation(houses, "price ~ living_area", method="spearman") ``` Pass several predictors on the right-hand side to get one correlation per predictor (long by default; `wide=True` for one column each): ```{code-cell} python get_correlation(houses, "price ~ living_area + bedrooms + bathrooms", quiet=True) ``` ## Logistic & other GLMs The regression helpers also accept a fitted `glm()` model. For a logistic regression, `get_regression_points()` returns fitted **probabilities** and `get_regression_summaries()` returns a GLM-shaped summary (deviance, AIC, BIC, …). Pass `exponentiate=True` to `get_regression_table()` for **odds ratios**: ```{code-cell} python import statsmodels.api as sm evals = md.load_evals().with_columns((pl.col("gender") == "male").cast(int).alias("is_male")) logit = smf.glm("is_male ~ score + age", data=evals.to_pandas(), family=sm.families.Binomial()).fit() get_regression_table(logit, exponentiate=True) ``` ```{code-cell} python get_regression_summaries(logit) ``` ## Array-API models You don't have to use the formula API. The helpers also accept models fit with statsmodels' array API (`sm.OLS(y, X)` / `sm.GLM(...)`), which is handy when your design matrix is already built. `get_regression_points()` reconstructs the per-observation table from the design matrix (the constant column becomes the `intercept` term and is dropped from the points): ```{code-cell} python import statsmodels.api as sm X = sm.add_constant(houses.select("living_area", "bedrooms").to_pandas()) y = houses["price"].to_pandas() array_model = sm.OLS(y, X).fit() get_regression_table(array_model) ``` ```{code-cell} python get_regression_points(array_model).head() ``` ## 3D regression plane `plot_3d_regression()` draws an interactive 3D scatter of an outcome against two numeric predictors, with the fitted regression plane overlaid (a plotly figure): ```{code-cell} python from moderndive import plot_3d_regression plot_3d_regression(houses, "price ~ living_area + bedrooms") ``` ## Visualizing models Two ports of R `moderndive`'s ggplot helpers, both dual-engine. A **parallel-slopes** model — one common slope with a separate intercept per group: ```{code-cell} python from moderndive import gg_parallel_slopes, gg_categorical_model evals = md.load_evals() gg_parallel_slopes(evals, response="score", explanatory="age", by="gender") # plotly ``` The same plot via the plotnine engine: ```{code-cell} python gg_parallel_slopes(evals, response="score", explanatory="age", by="gender", engine="plotnine") ``` A regression with a single **categorical** predictor (`geom_categorical_model()` is an alias of `gg_categorical_model()`): ```{code-cell} python gg_categorical_model(evals, response="score", explanatory="rank") ``` For the plotnine engine you can also drop the parallel-slopes lines onto your own `ggplot` with `geom_parallel_slopes()`. ## Inference for regression coefficients `fit()` runs the regression on every bootstrap/permutation replicate, giving a distribution per coefficient. Pair it with `visualize_fit` and **per-facet** shading: ```{code-cell} python from moderndive import specify from moderndive.infer.viz import visualize_fit from moderndive import shade_confidence_interval, shade_p_value f = "price ~ living_area + bedrooms" obs_fit = houses.specify(formula=f).fit() # Bootstrap distribution of each coefficient → per-term CIs boot_fit = houses.specify(formula=f).generate(reps=1000, type="bootstrap", seed=1).fit() boot_fit.get_confidence_interval(level=0.95) # one row per term visualize_fit(boot_fit) + shade_confidence_interval(boot_fit.get_confidence_interval()) # Null distribution → per-term p-values, each facet shaded at its own estimate null_fit = ( houses.specify(formula=f).hypothesize(null="independence") .generate(reps=1000, type="permute", seed=1).fit() ) null_fit.get_p_value(obs_stat=obs_fit, direction="two-sided") visualize_fit(null_fit) + shade_p_value(obs_stat=obs_fit, direction="two-sided") ``` Per-facet shading works in both engines: `shade_p_value` / `shade_confidence_interval` accept the observed `FitResult` or a term-keyed table, and each facet is shaded from its own value.