logistic regression | Definition

Logistic regression is a statistical method for analyzing the relationship between a binary dependent variable and one or more independent variables.

What is Logistic Regression?

Logistic regression is a widely used statistical technique in social science research, primarily to model binary outcomes, meaning it deals with dependent variables that can take on only two possible values, such as yes/no, success/failure, or 0/1. It is especially valuable when the goal is to predict the probability of a certain event happening, based on a set of independent variables. These independent variables can be continuous, categorical, or a mix of both. Logistic regression is most useful when the dependent variable is not continuous, making it a critical tool in situations where ordinary linear regression may not be appropriate.

How Logistic Regression Differs from Linear Regression

At first glance, logistic regression and linear regression may seem similar because they both aim to describe the relationship between variables. However, their differences are fundamental:

Dependent Variable: In linear regression, the dependent variable is continuous, meaning it can take on a range of values (like height, income, or age). In logistic regression, the dependent variable is categorical, most commonly binary, representing two outcomes (like success/failure).
Function Type: Logistic regression uses a logistic function (also known as a sigmoid function) to estimate probabilities, whereas linear regression uses a linear equation to predict the outcome. The logistic function ensures that predicted values stay within the range of 0 and 1, which is ideal for probability estimation.

The Logistic Function

The key to logistic regression is the logistic function, which is written as:

P(y) = 1 / (1 + e^-(b0 + b1 * x1 + b2 * x2 + … + bn * xn))

Where:

P(y) is the predicted probability of the event happening (for example, success, or y = 1),
b0 is the intercept (constant),
b1, b2, …, bn are the coefficients of the independent variables,
x1, x2, …, xn are the independent variables,
e is the base of the natural logarithm (approximately 2.718).

The logistic function maps any input to a value between 0 and 1, making it ideal for modeling probabilities.

Odds and Log Odds

A core concept in logistic regression is the idea of odds and log odds. Odds express the likelihood of an event happening relative to it not happening. They are calculated as:

Odds = P(event happening) / P(event not happening)

For example, if the probability of success is 0.8, the odds of success would be 0.8 divided by 0.2, which equals 4. This means that success is four times more likely than failure.

However, logistic regression does not directly model probabilities. Instead, it models the log of the odds, also known as “log-odds,” as a linear function of the independent variables. This allows the logistic regression model to fit a linear equation that translates into a probability curve between 0 and 1. The formula for the log-odds is:

log(Odds) = b0 + b1 * x1 + b2 * x2 + … + bn * xn

This equation looks very similar to a linear regression equation, but instead of predicting the dependent variable directly, it predicts the log-odds of the outcome. The use of log-odds makes it easier to express probabilities and work with binary outcomes.

Interpreting Coefficients in Logistic Regression

In logistic regression, the coefficients (b1, b2, …, bn) represent the change in the log-odds of the dependent variable for a one-unit change in the corresponding independent variable. Unlike linear regression, where coefficients are interpreted directly as the change in the dependent variable, in logistic regression, the coefficients tell you how the odds of the event change.

To make the coefficients more intuitive, researchers often exponentiate the coefficients to transform the log-odds into odds ratios. Odds ratios provide a more straightforward interpretation. For example, if the odds ratio for a variable is 2, this means that for every one-unit increase in the independent variable, the odds of the outcome occurring double.

Types of Logistic Regression

Logistic regression comes in several forms, depending on the nature of the dependent variable and the research question:

Binary Logistic Regression: This is the most common type of logistic regression, used when the dependent variable has two categories (e.g., success/failure or yes/no).
Multinomial Logistic Regression: This is used when the dependent variable has more than two categories that are not ordered (e.g., types of transportation: car, bus, train).
Ordinal Logistic Regression: This is applied when the dependent variable has ordered categories (e.g., educational attainment: high school, college, graduate school).

Each of these types of logistic regression models follows the same underlying logic but adjusts the mathematical approach to account for the specific nature of the outcome variable.

Assumptions in Logistic Regression

Logistic regression, like all statistical models, is built on a set of assumptions. These assumptions must be met for the results of the analysis to be reliable. Key assumptions include:

Binary Outcome: The dependent variable must be binary, especially in binary logistic regression. For example, if you are modeling success/failure, the dependent variable should only have those two outcomes.
Independence of Observations: Each observation in the dataset must be independent of the others. If the data points are not independent, more advanced techniques like mixed models or generalized estimating equations may be required.
Linearity of Log-Odds: While logistic regression does not require the independent variables to be linearly related to the dependent variable, it assumes that the log-odds (not the probabilities) of the dependent variable are a linear combination of the independent variables.
No Perfect Multicollinearity: Multicollinearity occurs when two or more independent variables are highly correlated with each other. In logistic regression, perfect multicollinearity (where one independent variable is a perfect linear combination of others) must be avoided. Techniques such as variance inflation factor (VIF) tests can help detect multicollinearity.
Large Sample Size: Logistic regression performs best with large sample sizes. Small sample sizes can lead to unreliable estimates and high standard errors. A rule of thumb is to have at least 10 events per predictor variable in the model.

Model Fit and Evaluation

Once a logistic regression model is built, it is crucial to assess its fit and predictive power. There are several ways to evaluate logistic regression models:

Likelihood Ratio Test: This compares the fit of a logistic regression model to a simpler model (typically the null model, which includes only the intercept). A significant likelihood ratio test indicates that the model provides a better fit than the null model.
Pseudo R-Squared: While logistic regression does not have an R-squared value like linear regression, there are alternative measures called “pseudo R-squared” statistics (such as McFadden’s R-squared). These give an idea of how well the model explains the variability in the data, but they should be interpreted carefully, as they do not have the same meaning as R-squared in linear models.
Confusion Matrix: A confusion matrix is a table that shows the actual vs. predicted classifications in a logistic regression model. It helps in evaluating the model’s classification accuracy.
ROC Curve: The receiver operating characteristic (ROC) curve is a plot that shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) for different threshold values. The area under the ROC curve (AUC) provides a single metric that summarizes the overall performance of the model.

Applications in Social Science Research

Logistic regression is widely used in social science research to study a variety of topics, such as:

Health Outcomes: Researchers often use logistic regression to predict the likelihood of disease outcomes based on factors like age, gender, or lifestyle habits.
Educational Attainment: Logistic regression is commonly applied to understand the factors influencing the likelihood of students achieving certain educational outcomes, like graduating from college or dropping out of school.
Voting Behavior: In political science, logistic regression is used to analyze how demographic factors, political views, and other variables influence the likelihood of voting for a particular candidate or party.

Conclusion

Logistic regression is a fundamental tool in social science research for modeling binary outcomes. By estimating probabilities and working with odds, it offers a flexible and interpretable way to analyze relationships between a set of independent variables and a binary dependent variable. With its widespread use across disciplines, understanding the core principles of logistic regression helps researchers make more informed decisions about their data and provides a reliable foundation for drawing meaningful conclusions.

Glossary Return to Doc's Research Glossary

Last Modified: 09/27/2024