When analysing the relationship between two variables, one dependent and one independent, a linear regression model is a natural fit. The model describes how the two variables are related and predicts values of the dependent variable from the independent variable. A fitting algorithm finds the best line by measuring the deviation between observed and predicted data points; the difference between the predictions and the actual values is the error. Minimising this error means adjusting the coefficients of the equation and refining the model. In this post, we will walk through the linear regression method and answer common reader questions. Ready?
What is the meaning of linear regression?
Linear regression is a statistical technique used when the relationship between variables is linear and can be portrayed as a straight line. It is a powerful tool for analysing the relationship between two variables: it provides insight into how much one variable affects the other and whether a linear relationship exists between the two.
Predicting the relationship between two quantitative variables, an independent predictor variable and a dependent variable to be predicted, is a common use of this approach. To illustrate, if one wants to estimate a home's value based on its square footage, garage space, lot size, and distance to public transportation, the independent variables in this scenario would be "Square Footage," "Garage Space," "Lot Size," and "Distance to Public Transportation," while the dependent variable would be the "Price."
Instances where regression may be observed
Example A: Advertising and Sales
Businesses frequently utilise linear regression as a statistical tool to study the relationship between advertising expenditure and sales results. For instance, they may create a linear regression model with advertising costs as the independent variable and sales figures as the response variable. The model equation is expressed as follows:
Sales = β0 + β1 (advertising spending)
Example B: Medication and Blood Pressure
Linear regression is utilised in the medical field to analyse the relationship between medication dosage and a patient's blood pressure. Researchers may examine the effect of a specific medication on a patient's blood pressure or cardiac stress by administering varying doses of the drug. To do so, they use the dose of medication as the independent variable and blood pressure as the dependent variable. In brief, the formula for this linear regression model is:
Blood pressure = β0 + β1 (dosage)
Example C: Weather and Crop Yield
Agricultural specialists frequently use linear regression to study the relationship between weather conditions and crop yield. To evaluate the impact of rainfall on crop yield, for example, researchers may use multiple linear regression to assess the combined influence of precipitation and fertiliser application. In this scenario, crop yield is the dependent variable, and precipitation and fertiliser application are the independent variables. Mathematically, the regression model can be written as follows:
Crop yield = β0 + β1 (rainfall) + β2 (fertiliser)
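The crop-yield model above can be fitted in a few lines. This is a minimal sketch using ordinary least squares via NumPy; the data values below are invented for illustration, and only the model form (yield = β0 + β1·rainfall + β2·fertiliser) comes from the text:

```python
import numpy as np

# Hypothetical data: rainfall (mm), fertiliser (kg/ha), crop yield (t/ha).
rainfall = np.array([100.0, 120.0, 90.0, 150.0, 110.0, 130.0])
fertiliser = np.array([20.0, 25.0, 15.0, 30.0, 22.0, 27.0])
yield_ = np.array([3.1, 3.6, 2.7, 4.2, 3.3, 3.9])

# Design matrix with an intercept column: yield = b0 + b1*rainfall + b2*fertiliser
X = np.column_stack([np.ones_like(rainfall), rainfall, fertiliser])

# Ordinary least squares solves for the coefficients that minimise the
# sum of squared errors.
coef, *_ = np.linalg.lstsq(X, yield_, rcond=None)
b0, b1, b2 = coef
```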
Various Names for the Statistical Method of Linear Regression
The origins of linear regression can be traced back to 1805, making it one of the oldest documented statistical techniques available. Since then, various versions of linear regression, such as multiple linear regression, polynomial regression, and other similar methods, have arisen. In essence, all of these methods aim to facilitate the prediction of a target value based on the features of the input data available.
The terminology distinguishes the two situations: simple linear regression is used when there is a single predictor variable, whereas multiple linear regression is used when there are several predictor variables.
Linear regression can only be applied in situations where there is a linear relationship between the variables under examination. To examine this relationship quickly and easily, a pair plot or scatter plot is a useful tool.
What Type of Relationship is Indicated by the Results of Linear Regression?
- Positive Correlation
When the regression line between two variables slopes upwards, a rise in the value of the independent variable (x) is associated with a rise in the value of the dependent variable (y). This is a positive correlation, in contrast to a negative correlation, in which an increase in x is associated with a decrease in the dependent variable.
- Negative Correlation
A negative correlation is a relationship between two variables in which a rise in the value of one of the variables is associated with a decrease in the value of the other variable. This can be displayed graphically with a regression line that has a negative slope, indicating that as the value of the independent variable (x) rises, the value of the dependent variable (y) declines.
- No Correlation
If the best fit line of the regression analysis is a horizontal line, it is usually concluded that there is no correlation between the two variables. This suggests that variations in the value of the independent variable (x) would have no influence on the value of the dependent variable (y).
Through measures of correlation and covariance, we can obtain insight into the nature of the relationship between these variables.
Covariance is a measurement of the linear relationship between two variables, X and Y. It indicates whether an increase in the independent variable (X) is related to a rise or fall in the dependent variable (Y). A positive covariance value implies that an increase in X is linked to an increase in Y, while a negative covariance value indicates that an increase in X is associated with a decline in Y and a decrease in X results in an increase in Y. However, it is important to note that covariance does not indicate the strength of the relationship between the two variables, but rather whether the relationship is positive or negative.
The concept of correlation can be used to determine the direction and strength of the relationship between two variables. Correlation is a measure that spans from -1 to +1, with a perfect correlation being exceedingly rare and characterised by every data point being precisely on the line of best fit.
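Both measures are easy to inspect directly. A minimal sketch with made-up data (the values are illustrative, not from the article), showing a positive covariance and a correlation close to +1:

```python
import numpy as np

# Illustrative data: y rises with x, so we expect a positive covariance
# and a Pearson correlation near +1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance: sign gives direction
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation: bounded in [-1, +1]
```

The covariance's sign tells us the direction of the relationship but not its strength; the correlation coefficient normalises it into the interval [-1, +1].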
The Least Squares Method
The primary objective of the linear regression model is to determine the most ideal line that can fit the data points. To accomplish this goal, the least squares approach is used, which minimises the sum of squared residuals. This summation serves as an indication of the difference between the actual values of the data points and the anticipated values from the fitted line, which is commonly known as the “error”.
The equation y = a + bx is widely known as the equation of a straight line.
What exactly is this error? Can we visualise it, and can we measure it? Rest assured, the linear model performs all of the necessary calculations, so there is no need for concern.
If Yi is the observed value of the dependent variable, the Sum of Squared Errors (SSE) measures the difference between the actual value Yi and the predicted value. Minimising this unexplained variability yields the most accurate fit.
The Sum of Squared Residuals (SSR) is a metric that quantifies the divergence between the predicted value (ŷ) and the mean of the dependent variable. Minimising the SSR can increase the explained variability of the model.
Although this error is differentiable at all points, making it a convenient loss function, it is sensitive to outliers, which can cause difficulties.
The Total Sum of Squares (SST) represents the total variance in the model (SSR + SSE = SST).
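The decomposition can be verified numerically. A short sketch with invented data, assuming an ordinary least-squares fit with an intercept (the identity SST = SSR + SSE holds in that case):

```python
import numpy as np

# Illustrative data only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)  # slope and intercept of the fitted line
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)         # unexplained variation (errors)
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation (residual to mean)
sst = np.sum((y - y.mean()) ** 2)      # total variation
```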
Elucidate the Process of Linear Regression.
Consider predicting a restaurant tip from the bill. To evaluate how accurately the bill forecasts the tip, one must examine the relationship between the independent variable (the bill) and the dependent variable (the tip). The bill is the predictor variable; the tip is the response variable.
The ideal fitting line can be discovered by minimising the sum of squared errors, which is the disparity between the predicted value and the actual value.
The initial step is to determine if the variables are interconnected in a linear manner.
It is crucial to understand that a straight line can be expressed as y = mx + c, or equivalently y = β1x + β0. To evaluate the relationship between the two variables, produce a scatter plot. Note that the centroid, the point (x̄, ȳ) formed by the mean x and y values, always lies on the fitted line.
The positive correlation between a bigger tip and an augmented bill total is apparent. Linear regression could be applied to anticipate the response variable based on the said relationship.
The subsequent stage involves quantifying the strength of the relationship.
After constructing a scatter plot to show the association between the two variables, the strength and direction of the relationship can be measured with the correlation coefficient. In this example, the correlation coefficient is 0.866, which indicates a strong positive correlation between the two variables.
Stage 3: Perform the Calculations.
With an understanding that the correlation is meaningful and advantageous, the calculations can commence.
Best-fit line equation: ŷ = β1x + β0
The slope, or regression coefficient, β1 tells us precisely how a one-unit increase in the independent variable affects the dependent variable. This insight is useful for making decisions and devising strategies.
The constant term is then β0 = ȳ − β1x̄.
To accomplish this, substitute x̄ = 74, ȳ = 10, Σ(x − x̄)(y − ȳ) = 615, and Σ(x − x̄)² = 4206, then solve β1 = 615 / 4206.
This gives β1 ≈ 0.1462: increasing the total bill by a single unit raises the predicted tip by 0.1462 units.
By the same token, β0 = 10 − (0.1462 × 74) ≈ −0.8203. No practical significance can be attached to the intercept here.
As a result, the best-fit line equation is ŷ = 0.1462x − 0.8203.
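The hand calculation above can be checked in a few lines, using only the summary statistics quoted in the text:

```python
# Summary statistics from the worked example: mean bill x̄ = 74,
# mean tip ȳ = 10, Σ(x−x̄)(y−ȳ) = 615, Σ(x−x̄)² = 4206.
sxy = 615.0
sxx = 4206.0
x_bar, y_bar = 74.0, 10.0

b1 = sxy / sxx           # slope ≈ 0.1462
b0 = y_bar - b1 * x_bar  # intercept ≈ -0.8203
```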
Due to the comprehensive computations involved, utilising Python packages will make the process more convenient.
The next section covers how to interpret healthcare expenditure data derived from a linear regression study.
Interpreting Using Linear Regression
The subsequent code will generate a summary of your linear regression model:
import statsmodels.api as sm
x = sm.add_constant(x)  # add the intercept column
model = sm.OLS(y, x).fit()
model.summary()
Coefficient of Determination (R-squared)
R-squared (R2) is a metric that evaluates how well a statistical model fits the observed data. It measures the proportion of the variability in the data that is accounted for by the model; in essence, it indicates how accurately the model represents the observed data. R2 can range from 0 (poorly fitting) to 1 (perfectly fitting).
Consider the following example:
In this scenario, "CGPA" is the independent variable and "Package" (the salary offered) is the dependent variable. Suppose you ask XYZ University what package you can expect after graduation. If you supply no additional information, the university can only quote the mean package granted to previous candidates. If you disclose your CGPA, however, they can estimate your package by fitting a linear regression to past data.
Utilising linear regression instead of the average of other students' outcomes produces much more precise results, and the R2 score quantifies the improvement.
R-squared is said to range between 0 and 1, so can it ever be negative? And when does it reach exactly 1, and what does that signify?
- Can R2 be equal to 0?
If the errors of the regression line and the mean line are equal, disclosing your CGPA to the university gains nothing: the regression line coincides with the mean line, both produce the same error, and the prediction is the same whether you disclose your CGPA or not. In that case R2 = 0.
- Is it possible to have an R-squared of one?
For R-squared to equal 1, the regression line must fit every data point exactly, so its error (SSE) is zero. Since R2 = 1 − SSE/SST, a zero SSE gives R2 = 1, indicating that the regression line is a perfect fit for all of the data points.
- Is it possible for the R-squared value to be negative or -1?
Yes, if the regression line's error is greater than the mean line's error. This happens when the Sum of Squared Errors (SSE) exceeds the Total Sum of Squares (SST): the model fits the data worse than simply predicting the mean, and R2 drops below zero.
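A negative R2 is easy to produce. The sketch below uses invented data and the identity R2 = 1 − SSE/SST; the "model" is a constant prediction far from the data, which fits worse than the mean line:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])

# A deliberately bad model: predicts a constant far from the data,
# so SSE > SST and R² comes out negative. Illustrative values only.
y_pred_bad = np.array([10.0, 10.0, 10.0, 10.0])

sst = np.sum((y - y.mean()) ** 2)   # error of the mean line
sse = np.sum((y - y_pred_bad) ** 2) # error of the "regression"
r2 = 1 - sse / sst
```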
If the model accounts for 75.1% of the variability in the dataset, the coefficient of determination is 0.751. The target column in this dataset contains continuous, varied values. This result shows that 75.1% of the variation was explained by the features provided to the model; the remaining 24.9% is left unexplained.
Please note that a high R-squared value suggests a strong model. However, what exactly constitutes a “high” value?
- When R2 is greater than 0.8, the model fits the data well.
- When R2 falls between 0.6 and 0.8, the fit is decent.
- When R2 is below 0.6, the model may require some adjustments.
If your R2 is low, ensure to investigate your independent variables for any inconsistencies.
Adjusted R-squared
It is crucial to understand that as more input variables are included, the R-squared statistic never decreases, which makes it misleading for judging whether a new independent variable is worth adding. To tackle this, a modified version known as the adjusted R-squared can be employed. It increases only when a new variable improves the model more than would be expected by chance, and decreases when variables with no impact on the dependent variable are added. It is vital to note that the adjusted R-squared is always less than or equal to the R-squared value.
In the majority of cases, the adjusted R-squared value will be considerably comparable to the initial R-squared value. However, if there is a significant difference between the two, it is suggested to review the independent variables to assess whether there is any correlation between the independent and dependent variables.
F-statistic: The F-statistic tests whether the model's coefficients are jointly equal to zero. If the overall F-test yields a significant value, the relationship between the model and the dependent variable is statistically significant, and the R-squared value is meaningfully different from zero.
In this scenario, the null hypothesis states that the model, which lacks any independent variables, has the same ability to fit the data as the model with independent variables.
As opposed to the model with only an intercept, the model in question provides a better fit to the data, as hypothesised by the alternative hypothesis (H1).
A p-value is used to assess whether the evidence is strong enough to support the alternative hypothesis. At the 95% confidence level, the decision rule is as follows:
- If the p-value is less than the significance threshold (5%), the null hypothesis (H0) is rejected.
- If the p-value is greater than the significance threshold, the null hypothesis (H0) cannot be rejected.
To gain a better understanding of this notion, let us try to interpret it. It should be emphasised that a fundamental understanding of statistics is necessary to comprehend machine learning.
Coefficient: Linear regression assumes the independent variables in the model are not collinear, i.e. not highly correlated with one another. A variable's coefficient represents the change in the dependent variable when that independent variable increases by one unit, holding all other variables in the model constant.
For example, if the age coefficient is 257.305, then every additional year of age would result in a corresponding increase of 257.305 points in the dependent variable. Conversely, if the coefficient was negative, such as with the “region” coefficient, a rise of one unit in the region’s value would cause a decrease of 353.4491 points in the dependent variable.
The magnitude of a coefficient signals how important the variable is to the model. A coefficient near zero suggests the variable has little effect on the dependent variable.
Standard Error: Understanding standard deviation can aid in comprehending standard error. Standard deviation measures the spread of your data, i.e. how far values deviate from the mean; roughly 95% of the data lies within two standard deviations of the mean.
To illustrate the concept of standard error, let us take a sample of 10 from a larger population. Standard error refers to the extent to which sample means differ from the population mean when plotted on a normal distribution graph, indicating how accurately the sample mean represents the true population mean. This will help us grasp the distinction between the sample mean and the actual population mean.
Therefore, the standard error in the regression model gives an estimate of the distribution of coefficients.
Puzzling? Let us break it down, shall we?
For every additional year of age, the dependent variable is predicted to rise by 257.305 points. Re-running the model on a different sample could shift this coefficient; the standard error of 11.878 indicates how much it is likely to vary.
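The sampling idea behind the standard error can be illustrated with the standard error of the mean: the sample standard deviation divided by the square root of the sample size. The sample values below are invented:

```python
import numpy as np

# Hypothetical sample of 10 observations.
sample = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0, 13.0, 14.0])

sd = sample.std(ddof=1)          # sample standard deviation
sem = sd / np.sqrt(len(sample))  # standard error of the mean
```

A smaller standard error means the sample mean is a more reliable estimate of the population mean; the same logic applies to a regression coefficient and its standard error.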
t-stat – The t-statistic, also known as a t-value, is obtained by dividing the coefficient by the standard error.
The formula for the t-statistic is: Coefficient / Standard Error.
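Applying the formula to the age coefficient quoted earlier (257.305, with a standard error of 11.878); the degrees of freedom here are assumed for illustration, not taken from the article:

```python
from scipy import stats

coef = 257.305  # age coefficient from the example above
se = 11.878     # its standard error

t_stat = coef / se  # t-statistic = coefficient / standard error

# Two-sided p-value under the null hypothesis that the coefficient is 0.
df = 100  # assumed residual degrees of freedom (illustrative)
p_value = 2 * stats.t.sf(abs(t_stat), df)
```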
The t-statistic puts the coefficient's size in context: a large absolute t-value means the coefficient is large relative to its standard error, which is evidence against the null hypothesis.
As this value is within the rejection range, you should opt to refuse the null hypothesis.
H0 asserts that all coefficients at the population level are zero.
Coefficients do not equal zero at the population level (H1), the alternative hypothesis.
A bigger t-statistic value implies that the variable is more important.
P > |t| – The p-value is the probability of observing a t-statistic at least this extreme if the null hypothesis were true. For age, the p-value is exceedingly low, so the null hypothesis is extremely improbable: the coefficient is non-zero at the population level, indicating a relationship between the two variables.
When assessing the relationship between two variables, a p-value of 5% or lower is frequently regarded as statistically significant. If the computed p-value is above 0.05, the null hypothesis cannot be rejected, and we cannot claim that the coefficient differs from zero. If the p-value is below 0.05 (p-value < 0.05), the null hypothesis is rejected, and it can be claimed that the coefficients are not equal to 0.
If the t-statistic is significant, the p-value is low, indicating that the observed coefficient values are less likely to be random.
Well done on working through linear regression models on your own: understanding how they function is always a valuable skill. To deepen that understanding and produce accurate forecasts, start by mastering the definitions of the related terms.