API Reference#

The stable API of scikit-stan. It will remain stable between versions, with the promise of backwards compatibility between minor versions and ample warning prior to large changes to the API.

This package also has an internal API, which is not stable and not intended for end users, but contains functions useful to developers seeking to add new models or improve existing ones.

Generalized Linear Model#

class scikit_stan.GLM(algorithm='sample', algorithm_params=None, fit_intercept=True, family='gaussian', link=None, seed=None, priors=None, prior_intercept=None, prior_aux=None, autoscale=False)[source]#

A generalized linear model estimator with several options for families, links, and priors on the regression coefficients, the intercept, and the error scale, in a scikit-learn style. The class also provides optional autoscaling of the priors. For deterministic behavior, the model's seed can be set and is then passed to all Stan computations.

Parameters:
  • algorithm (str, optional) –

    Algorithm to be used by the Stan model. The following are supported:

    • sample - runs the HMC-NUTS sampler,

    • optimize - produces a maximum likelihood estimate of the model parameters via optimization,

    • variational - runs Stan’s variational inference algorithm to approximate the posterior.

  • algorithm_params (Dict[str, Any], optional) –

    Parameters for the selected algorithm. The keywords and values are derived from CmdStanPy’s API, so refer to the CmdStanPy documentation for more information: https://mc-stan.org/cmdstanpy/api.html#cmdstanmodel

    These fields are customized by passing a dictionary, which is validated at the level of CmdStan. For example, to specify the number of chains for the HMC-NUTS sampler, it is sufficient to pass:

    {
        "chains": 2
    }
    

    or to specify the number of warmup and sampling iterations, pass:

    {
        "iter_warmup": 100,
        "iter_sampling": 100,
    }
    

    Default Stan parameters are used if nothing is passed.
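
    For instance, a sketch of passing such a dictionary when constructing the estimator (the values here are illustrative, not recommendations):

    from scikit_stan import GLM

    # HMC-NUTS with two chains and shortened warmup/sampling
    model = GLM(
        algorithm="sample",
        algorithm_params={"chains": 2, "iter_warmup": 500, "iter_sampling": 500},
    )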

  • fit_intercept (bool, optional) –

    Whether the GLM should fit a global intercept. Some hierarchical models operate without an intercept, so setting this to False will fill the first column of the predictor matrix with ones.

    By default, the GLM has an intercept term – the first column of the predictor matrix X is used for inference of the intercept.

  • family (str, optional) –

    Distribution family used for linear regression. All families from the R Families package are supported:

    • "gaussian" - Gaussian distribution,

    • "gamma" - Gamma distribution,

    • "inverse-gaussian" - Inverse Gaussian distribution,

    • "poisson" - Poisson distribution,

    • "binomial" - Binomial distribution

  • link (Optional[str], optional) –

    Distribution link function used for linear regression, following R’s family-link combinations. These are family-specific:

    • "gaussian":
      • "identity" - Identity link function,

      • "log" - Log link function,

      • "inverse" - Inverse link function

    • "gamma":
      • "identity" - Identity link function,

      • "log" - Log link function,

      • "inverse" - Inverse link function

    • "inverse-gaussian":
      • "identity" - Identity link function,

      • "log" - Log link function,

      • "inverse" - Inverse link function,

      • "inverse-square" - Inverse square link function

    • "poisson":
      • "identity" - Identity link function,

      • "log" - Log link function,

      • "sqrt" - Square root link function

    • "binomial":
      • "log" - Log link function,

      • "logit" - Logit link function,

      • "probit" - Probit link function,

      • "cloglog" - Complementary log-log link function,

      • "cauchit" - Cauchit link function

    If an invalid combination of family and link is passed, a ValueError is raised.

    When no link is specified, these are family-specific defaults for the link function:

    • "gaussian": "identity",

    • "gamma": "inverse",

    • "inverse-gaussian": "inverse",

    • "poisson": "identity",

    • "binomial": "logit"
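
    For example, a sketch of requesting a Poisson regression with an explicit log link:

    from scikit_stan import GLM

    # Poisson family with an explicit log link; omitting link selects the
    # family-specific default listed above
    model = GLM(family="poisson", link="log")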

  • seed (Optional[int], optional) – Seed for the random number generator. Must be an integer between 0 and 2^32-1. If this is left unspecified, then a random seed will be used for all chains via numpy.random.RandomState. Specifying this field will yield the same result for repeated runs if all other parameters are held the same.

  • priors (Optional[Dict[str, Union[int, float, List]]], optional) –

    Dictionary for configuring the prior distribution on the regression coefficients. Currently supported priors are: "normal" and "laplace".

    By default, all regression coefficient priors are normal and depend on the family and on autoscale. If autoscale is True, the prior is

    \[\beta \sim \text{normal}(0, 2.5 \cdot \text{sd}(y) / \text{sd}(X))\]

    for the Gaussian family and

    \[\beta \sim \text{normal}(0, 2.5 / \text{sd}(X))\]

    otherwise. If autoscale is False, the prior is

    \[\beta \sim \text{normal}(0, 2.5 \cdot \text{sd}(y))\]

    for the Gaussian family and

    \[\beta \sim \text{normal}(0, 2.5)\]

    otherwise.

    If an empty dictionary {} is passed, no informative priors are used; Stan's default uniform(-inf, inf) prior is then used for all coefficients.

    The number of specified prior parameters cannot exceed the number of predictors in the data, as each parameter is associated with a single coefficient. If fewer priors are specified than there are predictors, the remaining coefficients are given the default prior. The prior on all regression coefficients is set with the following keys:

    • "prior_slope_dist": distribution of the prior for each coefficient

    • "prior_slope_mu": list of location parameters of the prior for each coefficient

    • "prior_slope_sigma": list of scale parameters of the prior for each coefficient

    Thus, to specify a standard normal prior on the first feature, the dictionary should be passed as:

    {
        "prior_slope_dist": "normal",
        "prior_slope_mu": [0.],
        "prior_slope_sigma": [1.]
    }
    

    Any unspecified priors will be set to the default.

    Also note that choosing a Laplace prior is equivalent to L1 regularization: https://stats.stackexchange.com/questions/177210/why-is-laplace-prior-producing-sparse-solutions/177217#177217
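
    As a sketch, Laplace priors on the first two coefficients could be passed at construction time (the values are illustrative; any remaining coefficients keep the default prior):

    from scikit_stan import GLM

    model = GLM(
        priors={
            "prior_slope_dist": "laplace",
            "prior_slope_mu": [0.0, 0.0],
            "prior_slope_sigma": [1.0, 1.0],
        }
    )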

  • prior_intercept (Optional[Dict[str, Any]], optional) –

    Prior for the intercept alpha parameter of the GLM. If this is not specified, the default is

    \[\alpha \sim \text{normal}(\text{mu}(y), 2.5 \cdot \text{sd}(y))\]

    for the Gaussian family and

    \[\alpha \sim \text{normal}(0, 2.5)\]

    otherwise.

    If an empty dictionary {} is passed, the default Stan uniform(-inf, inf) prior is used.

    To set this prior explicitly, pass a dictionary with the following keys:

    • "prior_intercept_dist": str, distribution of the prior from the list of supported prior distributions: "normal", "laplace"

    • "prior_intercept_mu": float, location parameter of the prior distribution

    • "prior_intercept_sigma": float, error scale parameter of the prior distribution

    Thus, for example, passing:

    {
        "prior_intercept_dist": "normal",
        "prior_intercept_mu": 0,
        "prior_intercept_sigma": 1
    }
    

    results in

    \[\alpha \sim \text{normal}(0, 1)\]

    by default (without autoscaling, see below).

    Also note that choosing a Laplace prior is equivalent to L1 regularization: https://stats.stackexchange.com/questions/177210/why-is-laplace-prior-producing-sparse-solutions/177217#177217

  • prior_aux (Optional[Dict[str, Any]], optional) –

    Prior on the auxiliary parameter for the family used in the regression: for example, the standard deviation for the Gaussian family or the shape for the gamma family. The currently supported priors and their distribution parameters are:

    • "exponential", parameterized by "beta"

    • "chi2", parameterized by "nu"

    • "gamma", parameterized by "alpha" and "beta"

    • "inv_gamma", parameterized by "alpha" and "beta"

    Support for priors with more parameters is planned as a future feature.

    If an empty dictionary {} is passed, the default Stan uniform(-inf, inf) prior is used.

    For single-parameter priors, this field is a dictionary with the following keys:

    • "prior_aux_dist": distribution of the prior on this parameter

    • "prior_aux_param": parameter of the prior on this parameter (see above for the admissible parameter names)

    For example, to specify a chi2 prior with nu=2.5, pass:

    {
        "prior_aux_dist": "chi2",
        "prior_aux_param": 2.5
    }
    

    The default unscaled prior is exponential(1); the default scaled prior is exponential(1/sy), where sy = sd(y) if the specified family is Gaussian and sy = 1 in all other cases. Setting autoscale=True results in the division by sy.
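
    As a sketch, an exponential prior on the auxiliary parameter (e.g. the error scale for the Gaussian family) could be set explicitly; the value here is illustrative:

    from scikit_stan import GLM

    model = GLM(prior_aux={"prior_aux_dist": "exponential", "prior_aux_param": 1.0})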

  • autoscale (bool, optional) –

    Enable automatic scaling of priors. Autoscaling is performed in the same way as in rstanarm.

    This procedure is disabled by default.
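
    As a sketch, autoscaling can be combined with any of the options above, together with a fixed seed for reproducible Stan computations (the values are illustrative):

    from scikit_stan import GLM

    # rstanarm-style autoscaling of the default priors, with a fixed seed
    model = GLM(family="gamma", link="log", autoscale=True, seed=1234)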

fit(X, y, show_console=False)[source]#

Fits the GLM to the given data. The model is considered fit once its alpha, beta, and sigma parameters have been determined via regression.

Parameters:
  • X (array-like) – NxK matrix of predictors, where K >= 0. If K = 1, X is automatically reshaped to be two-dimensional and a warning is raised. X can be sparse. It will be converted to CSR format if it is not already.

  • y (array-like) – Nx1 outcome vector where each row is a response corresponding to the same row of predictors in X.

  • show_console (bool, optional) – Whether to print the CmdStanPy console output during Stan operations.

Returns:

CoreEstimator – The fitted estimator itself, following the scikit-learn convention that fit() returns self. CoreEstimator is the abstract base class for all estimator-type models in this package; GLM is a subclass of CoreEstimator.

Raises:
  • ValueError – X is required to have at least 2 columns and have the same number of rows as y.

  • ValueError – y is required to have exactly 1 column and the same number of rows as X.

  • ValueError – Algorithm choice in model set-up is not supported.

Notes

Other ValueErrors may be raised by additional validation checks. These include invalid prior set-up or invalid data.
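
A minimal sketch of a full fit on synthetic Gaussian data (assuming numpy is installed; the data and parameter values are illustrative):

    import numpy as np
    from scikit_stan import GLM

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))          # 100 observations, 2 predictors
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

    model = GLM(family="gaussian", seed=1234)
    model.fit(X, y)                        # returns the fitted estimator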

predict(X, return_std=False, show_console=False)[source]#
Compute predictions from supplied data using a fitted model.

This computes the mean of the predicted distribution, given by y_sim in predict_distribution().

A key issue is that predict() will not utilize Stan’s generate quantities method if the model was not fit with HMC-NUTS. Instead, a Python-based RNG is used, with coefficients and intercept derived from the fitted model. This will change in a future Stan release.

Parameters:
  • X (array-like) – Predictor matrix or array of data to use for prediction. X can be sparse. It will be converted to CSR format if it is not already.

  • return_std (bool, optional, default False) – If True, return standard deviation of predictions.

  • show_console (bool, optional, default False) – Whether to print the CmdStanPy console output during Stan operations.

Returns:

  • ndarray – Array of predictions of shape (n_samples, 1) made by the fitted model.

  • ndarray – Standard deviation of the predictive distribution at each point, returned only if return_std is True.
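
Continuing the fit() sketch above, predictions can be obtained as follows (a sketch, assuming the standard deviations are returned alongside the means as a second array when return_std=True):

    X_new = rng.normal(size=(10, 2))

    y_pred = model.predict(X_new)                           # mean of the predictive distribution
    y_pred, y_std = model.predict(X_new, return_std=True)   # means and standard deviations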

predict_distribution(X, show_console=False)[source]#

Predicts the distribution of the response variable after the model has been fit. This is distinct from predict(), which returns the mean of the distribution predictions generated by this method.

Parameters:
  • X (NDArray[Union[np.float64, np.int64]]) – Predictor matrix or array of data to use for prediction. X can be sparse. It will be converted to CSR format if it is not already.

  • show_console (bool, optional) – Whether to print the CmdStanPy console output during Stan operations.

Returns:

NDArray[Union[np.float64, np.int64]] – Set of draws generated by Stan’s generate quantities method.

Raises:

ValueError – Method requires data X to be supplied.
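
Continuing the same sketch, the underlying draws can be inspected directly; predict() reports the mean of this predictive distribution:

    # Draws of the response for each row of X_new
    draws = model.predict_distribution(X_new)
    print(draws.shape)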

score(X, y, show_console=False)[source]#

Computes the coefficient of determination R2 of the prediction, like other sklearn estimators.

Parameters:
  • X (array-like) – Matrix or array of predictors having shape (n_samples, n_predictors) that consists of test data. X can be sparse. It will be converted to CSR format if it is not already.

  • y (array-like) – Array of shape (n_samples,) containing the target values corresponding to the given test data X.

  • show_console (bool, optional) – Whether to print the CmdStanPy console output during Stan operations.

Returns:

float – R2 of the prediction versus the given target values; that is, the score of self.predict(X) with respect to y.
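
Continuing the sketch above, the fit can be evaluated on held-out data:

    X_test = rng.normal(size=(50, 2))
    y_test = 1.0 + X_test @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)

    r2 = model.score(X_test, y_test)       # coefficient of determination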