Short Description of Linear Regression Mathematics
Linear regression models the relationship between one or more independent variables and a dependent variable as a linear function. The goal of linear regression is to minimize the sum of the squared differences (errors) between the observed values (actual values) and the values predicted by the model.
Simple Linear Regression Model (Single Variable)
Simple linear regression is a method used to model the relationship between two continuous variables: one dependent variable (y) and one independent variable (x). The model assumes a linear relationship between these variables, which can be represented by the equation y = mx + b, where:
- y is the predicted value of the dependent variable.
- m is the slope of the regression line, indicating the average change in y for a one-unit change in x.
- b is the y-intercept, representing the value of y when x is zero.
To find the best values for m and b that fit the data, we use a method called Ordinary Least Squares (OLS). This method calculates the values of m and b that minimize the sum of the squared differences between the observed values and the predicted values on the line. These differences are called residuals.
The optimization process involves:
- Calculating the sum of the product of corresponding values of x and y (Σxy).
- Calculating the sum of the x values (Σx) and the sum of the y values (Σy).
- Calculating the sum of the squares of the x values (Σx^2).
- Using these sums to compute the slope m using the formula: m = (n(Σxy) - (Σx)(Σy)) / (n(Σx^2) - (Σx)^2)
- Computing the y-intercept b using the formula: b = (Σy - m(Σx)) / n
Here, n is the number of data points. After finding m and b, we can use them to predict y for any given x and to understand the relationship between x and y.
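As a worked example, take the mock data used in the code below (x = commits: 5, 10, 15, 20, 25; y = grades: 70, 75, 80, 85, 90). Then n = 5, Σx = 75, Σy = 400, Σxy = 6250, and Σx^2 = 1375, so m = (5(6250) - (75)(400)) / (5(1375) - 75^2) = 1250 / 1250 = 1.0 and b = (400 - 1.0(75)) / 5 = 65.0, matching the program's output.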
import java.util.ArrayList;
import java.util.Arrays;

public class SingleVarCommitsGradeRegression {
    public static void main(String[] args) {
        // Mock data
        ArrayList<Double> commitsData = new ArrayList<>(Arrays.asList(5.0, 10.0, 15.0, 20.0, 25.0)); // Number of commits
        ArrayList<Double> gradeData = new ArrayList<>(Arrays.asList(70.0, 75.0, 80.0, 85.0, 90.0)); // Corresponding grades

        double m = calculateSlope(commitsData, gradeData);
        double c = calculateIntercept(commitsData, gradeData, m);

        System.out.println("Linear Regression Model for Grade based on Commits: Grade = " + m + " * Commits + " + c);
    }

    // Slope via OLS: m = (n(Σxy) - (Σx)(Σy)) / (n(Σx^2) - (Σx)^2)
    public static double calculateSlope(ArrayList<Double> commits, ArrayList<Double> grades) {
        int n = commits.size();
        double sumCommits = 0, sumGrades = 0, sumCommitsGrades = 0, sumCommits2 = 0;
        for (int i = 0; i < n; i++) {
            sumCommits += commits.get(i);
            sumGrades += grades.get(i);
            sumCommitsGrades += commits.get(i) * grades.get(i);
            sumCommits2 += commits.get(i) * commits.get(i);
        }
        return (n * sumCommitsGrades - sumCommits * sumGrades) / (n * sumCommits2 - sumCommits * sumCommits);
    }

    // Intercept via OLS: b = (Σy - m(Σx)) / n
    public static double calculateIntercept(ArrayList<Double> commits, ArrayList<Double> grades, double m) {
        int n = commits.size();
        double sumGrades = 0, sumCommits = 0;
        for (int i = 0; i < n; i++) {
            sumGrades += grades.get(i);
            sumCommits += commits.get(i);
        }
        return (sumGrades - m * sumCommits) / n;
    }
}
SingleVarCommitsGradeRegression.main(null);
Linear Regression Model for Grade based on Commits: Grade = 1.0 * Commits + 65.0
In our context, we’re trying to predict a student’s grade (dependent variable) based on the number of commits they’ve made (independent variable). Mathematically, this relationship is represented as Grade = m × Commits + c, where m is the slope and c is the y-intercept. The slope m indicates how much the grade changes for each additional commit, while c represents the grade when there are no commits. The code calculates m and c using the method of least squares, which minimizes the sum of the squared differences between the observed grades and the grades predicted by the model. Once we have the values of m and c, we can predict the grade for any given number of commits.
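As a minimal sketch of that last step, the snippet below plugs the coefficients from the run above (m = 1.0, c = 65.0) into the model equation; the input of 12 commits is just an illustrative value, not part of the original data.
double m = 1.0, c = 65.0;  // coefficients produced by the regression run above
double commits = 12.0;     // illustrative input value
double predictedGrade = m * commits + c; // Grade = m * Commits + c
System.out.println("Predicted grade for " + commits + " commits: " + predictedGrade); // prints 77.0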
Mean Squared Error (MSE)
Mean Squared Error (MSE) is a common loss function used for regression models. It measures the average squared difference between the estimated values and the actual values.
Formula
The MSE is calculated as follows:
MSE = (1/n) * Σ(actual - predicted)^2
Where:
- n is the number of observations
- Σ is the summation symbol, meaning we sum over all observations
- actual is the actual value of the data point
- predicted is the predicted value from the model
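For example, with actual values 1.0, 2.0, 3.0 and predicted values 1.1, 2.1, 3.1 (the same data used in the code below), MSE = ((1.0 - 1.1)^2 + (2.0 - 2.1)^2 + (3.0 - 3.1)^2) / 3 = 0.03 / 3 = 0.01.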
Interpretation
MSE is always non-negative, and a value of 0 indicates perfect predictions. The larger the MSE, the larger the error, and thus the worse the model’s predictive accuracy.
Advantages
- Sensitivity to Large Errors: Because errors are squared before they are averaged, MSE penalizes large errors disproportionately; for example, a single error of 10 contributes 100 to the sum, while ten errors of 1 contribute only 10. This makes MSE particularly useful when large errors are especially undesirable.
Disadvantages
- Scale Dependency: The MSE has the same units as the square of the output variable. This can sometimes make it difficult to interpret the MSE’s magnitude.
- Impact of Outliers: Due to squaring, outliers have a significant impact on MSE, which can skew the model evaluation if outliers are not handled properly.
Remember, while MSE is a powerful tool for model evaluation, it’s important to consider the context of the problem and the data when interpreting it.
public class MSECalculator {
    // MSE = (1/n) * Σ(actual - predicted)^2
    public static double calculateMSE(double[] actual, double[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("Arrays must be of the same length");
        }
        double mse = 0;
        for (int i = 0; i < actual.length; i++) {
            mse += Math.pow(actual[i] - predicted[i], 2); // squared residual
        }
        return mse / actual.length; // average over n observations
    }

    public static void main(String[] args) {
        // Example usage:
        double[] actualValues = {1.0, 2.0, 3.0};
        double[] predictedValues = {1.1, 2.1, 3.1};
        double mse = calculateMSE(actualValues, predictedValues);
        System.out.println("Mean Squared Error: " + mse);
    }
}
MSECalculator.main(null);
Mean Squared Error: 0.010000000000000018
The MSECalculator class in Java includes a method calculateMSE to compute the Mean Squared Error (MSE) between two arrays of actual and predicted values. It ensures both arrays are of equal length, then iterates through them to sum the squares of their differences. This sum is then averaged to yield the MSE, a key metric indicating the accuracy of predictions, with lower values indicating better performance. The main method demonstrates this calculation with example data, outputting the MSE to the console.
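To make the outlier caveat from the Disadvantages list concrete, the snippet below reuses calculateMSE on the same actual values, once with uniformly small errors and once with a single large error; the outlier value of 13.0 is hypothetical, chosen only for illustration.
double[] actual = {1.0, 2.0, 3.0};
double[] closePredictions = {1.1, 2.1, 3.1};    // small, uniform errors
double[] outlierPredictions = {1.1, 2.1, 13.0}; // one large error (an outlier)
System.out.println(MSECalculator.calculateMSE(actual, closePredictions));   // 0.01
System.out.println(MSECalculator.calculateMSE(actual, outlierPredictions)); // approximately 33.34
A single point with a residual of 10 raises the MSE from 0.01 to roughly 33.34, dwarfing the contribution of the two well-predicted points.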