Regression

Danger

Do not use typical regression techniques for time-series data

Article on how to evaluate whether a regression’s fit is good or not

Linear Regression

One of the most important algorithms in ML

Finds relationship between one or more independent variables and one dependent variable

Equation for line of best fit:

$$\hat{y} = \beta_0 + \beta_1 x$$

Find values for $\beta_0$ and $\beta_1$ that minimize MSE (Mean Squared Error), which is the cost function:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

We use the square of the error because it’s easier to differentiate than the absolute value of the error

Residual: Difference between actual value and predicted value ($e_i = y_i - \hat{y}_i$)
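
A minimal sketch of the above using scikit-learn; the data is synthetic and the coefficients (intercept 2, slope 3) are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: one independent variable, one dependent variable
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 100)  # true line plus noise

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

print("intercept (beta_0):", model.intercept_)
print("slope (beta_1):", model.coef_[0])
print("MSE:", mean_squared_error(y, y_hat))

residuals = y - y_hat  # residual = actual minus predicted
```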

Assumptions:

  • Relationship must be linear
  • Residuals must be independent (no autocorrelation)
  • Residuals must be normally distributed
    • If not, you’re likely to have outliers
  • Data must be homoscedastic, i.e., the variance of the residuals must be constant (quick checks sketched after this list)
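
A hedged sketch of how these assumptions might be checked; the specific tests (Durbin-Watson, Shapiro-Wilk, Breusch-Pagan) are common choices, not ones named in these notes, and the data is synthetic:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Synthetic data that satisfies the assumptions by construction
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 200)

exog = sm.add_constant(x)  # adds the intercept column
residuals = sm.OLS(y, exog).fit().resid

# Independence: Durbin-Watson near 2 suggests no autocorrelation
print("Durbin-Watson:", durbin_watson(residuals))

# Normality of residuals: small p-value suggests non-normality
print("Shapiro-Wilk p:", stats.shapiro(residuals).pvalue)

# Homoscedasticity: small p-value suggests heteroscedasticity
print("Breusch-Pagan p:", het_breuschpagan(residuals, exog)[1])
```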

R²

Used to show how well the line fits the data (in simple linear regression it equals the square of the correlation coefficient)

Defined as 1 minus the sum of the squared residuals divided by the total sum of squares:

$$R^2 = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}$$

Has a maximum of 1 and can go negative (it is not bounded below by -1):

  • $R^2$ of 1 means a perfect fit
  • $R^2$ of 0 is what you get if you just predict $\bar{y}$ (the mean) for every input
  • Negative $R^2$ means the fit is worse than just predicting the mean (not going to happen unless you screw up horribly)
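
A small sketch computing $R^2$ both from the definition above and with scikit-learn; the toy arrays are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print("R^2 (manual): ", 1 - ss_res / ss_tot)
print("R^2 (sklearn):", r2_score(y_true, y_pred))
```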

RMSE

Our actual loss function

Good for comparing two models on the same data

BUT it is in the same units as whatever data we’re measuring, so it’s useless on its own
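
A minimal sketch; RMSE is just the square root of MSE, and the toy arrays are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

# RMSE = sqrt(MSE); reported in the same units as y
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
```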

Hypothesis Testing

We can perform a t-test to see how good our fit is

  • Null hypothesis: the slope is 0 (no linear relationship)
  • Alternative hypothesis: the slope is not 0

The test compares the estimated slope to its standard error: $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$
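
One convenient way to run this test is scipy’s linregress, which reports the two-sided p-value for the slope; a sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)

result = stats.linregress(x, y)
print("slope:", result.slope)
print("p-value (slope == 0):", result.pvalue)  # small => reject the null
```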

Polynomial Regression

Prone to overfitting - generally only use it with low-degree polynomials
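
A short sketch with scikit-learn, assuming synthetic quadratic data; raising the degree is an easy way to watch it overfit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 1.0 - 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.5, 100)

# Degree 2 matches the underlying curve; try degree 15 to see overfitting
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("train R^2:", model.score(X, y))
```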

Other types of regression

  • Ridge regression: Does well with multicollinearity, where two independent variables are correlated
  • Lasso regression: Useful when you have lots of potentially irrelevant features
  • ElasticNet regression: Combines loss functions of ridge and lasso regression to get best of both worlds
  • Random forest regression: In general, just do this first rather than messing around with the other types (comparison sketch after this list)
    • Easy to use with minimal need for parameter tuning.
    • Generally reduces overfitting compared to single decision trees, in contrast to Ridge and Lasso, which focus on penalizing large coefficients.
    • Improved accuracy by aggregating predictions from multiple trees.
    • Can handle non-linearity, unlike Linear, Ridge, or Lasso regressions.
    • Robust against outliers by combining predictions from multiple trees, unlike Linear, Ridge, or Lasso regression.
    • Effective handling of large feature sets without extensive preprocessing.
    • Less sensitive to multicollinearity.
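
A comparison sketch of these regressors on one synthetic dataset; the hyperparameters (alpha=1.0, n_estimators=100) are illustrative defaults, not tuned values:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(model.score(X_test, y_test), 3))
```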