How to Perform Logistic Regression Using Statsmodels in Python

Logistic regression is a popular machine learning algorithm used for binary classification problems. It is based on the statistical concept of maximum likelihood estimation and the logistic function. In this article, we will discuss how to perform logistic regression using the statsmodels library in Python.

Understanding Logistic Regression

Logistic regression is a statistical method for modeling the relationship between a binary dependent variable and one or more independent variables. The dependent variable can take on two possible outcomes: 0 or 1. The independent variables can be continuous or categorical. The logistic regression model estimates the probability of the dependent variable being 1, given the values of the independent variables.

The logistic function, also known as the sigmoid function, maps the model's linear predictor, which can be any real number, to a probability between 0 and 1. A binary prediction is then obtained by thresholding that probability, commonly at 0.5. The logistic function is defined as:

 \sigma(z) = \frac{1}{1 + e^{-z}}

where z is the linear predictor, an intercept plus a weighted sum of the independent variables:

 z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k
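
The two formulas above can be sketched in a few lines of NumPy. This is only an illustration: the coefficient values in beta and the observation x are hypothetical numbers picked so the arithmetic is easy to follow, not estimates from any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta = np.array([-1.0, 2.0, 0.5])   # beta_0 (intercept), beta_1, beta_2
x = np.array([0.3, 0.8])            # one observation: x_1, x_2

z = beta[0] + beta[1:] @ x          # linear predictor
p = sigmoid(z)                      # estimated probability that y = 1

print(z, p)
```

With these numbers z works out to 0, so the estimated probability is 0.5 (up to floating-point rounding). Larger positive values of z push the probability toward 1, and larger negative values push it toward 0.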

The logistic regression model estimates the coefficients \beta_0, \beta_1, \beta_2, …, \beta_k by maximizing the likelihood function. The likelihood function measures the probability of observing the data given the model parameters.
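
To make the likelihood concrete, here is a minimal sketch of the log-likelihood for a binary outcome, evaluated at two candidate coefficient vectors. The helper log_likelihood and the toy arrays X_toy and y_toy are assumptions made up for this example; statsmodels performs the actual maximization internally when a model is fitted.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of a logistic regression model.

    X is expected to include a leading column of ones for the intercept.
    """
    z = X @ beta
    # Sum of y*z - log(1 + exp(z)); logaddexp keeps this numerically stable
    return np.sum(y * z - np.logaddexp(0.0, z))

# Toy data: an intercept column plus one feature
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])

# All-zero coefficients give p = 0.5 for every observation
ll_flat = log_likelihood(np.array([0.0, 0.0]), X_toy, y_toy)
# Coefficients that assign higher probability to the observed labels
ll_better = log_likelihood(np.array([-3.0, 2.0]), X_toy, y_toy)

print(ll_flat, ll_better)
```

The second coefficient vector yields a higher (less negative) log-likelihood because it explains the observed labels better. Maximum likelihood estimation searches for the coefficients that make this value as large as possible.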

Performing Logistic Regression with Statsmodels

To perform logistic regression with statsmodels in Python, we first need to install the library (for example, with pip install statsmodels) and then import it:

import statsmodels.api as sm
import statsmodels.formula.api as smf

Next, let’s create a dataset for illustration purposes:

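import numpy as np
import pandas as pd

np.random.seed(42)

# Two random features and random 0/1 labels; because y is unrelated to X,
# the fitted coefficients are not expected to be significant
X = np.random.rand(100, 2)
y = np.random.choice([0, 1], 100)

Now, let’s fit the logistic regression model using the formula interface:

# The formula interface takes a formula string and a DataFrame,
# and adds the intercept automatically
df = pd.DataFrame({"y": y, "x1": X[:, 0], "x2": X[:, 1]})

model = smf.logit("y ~ x1 + x2", data=df)

result = model.fit()

print(result.summary())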

The summary includes the coefficients, standard errors, z-values, p-values, confidence intervals, and fit statistics such as the log-likelihood and pseudo R-squared.

As an alternative to the formula interface, we can also use the matrix interface:

# The matrix interface does not add an intercept automatically,
# so prepend a column of ones to the features
X_design = sm.add_constant(X)

# Note the argument order: response first, then the design matrix
model = sm.Logit(y, X_design)

# fit() takes no data arguments; the data was passed to the constructor
result = model.fit()

print(result.summary())

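Fitted to the same data with the same predictors, both interfaces give the same coefficient estimates; they differ only in how the model is specified.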

Conclusion

Logistic regression is a powerful machine learning algorithm for binary classification problems. The statsmodels library in Python provides an easy-to-use interface for performing logistic regression using both the formula and matrix interfaces. Understanding the underlying statistical concept of logistic regression and the logistic function is crucial for interpreting the results and applying the model to real-world problems.
