Multivariate Linear Regression using Polars

This article will continue our data science journey by exploring multivariate linear regression.
Last time we looked at simple linear regression with just one independent variable, but real-world problems usually have multiple factors affecting the outcome we want to predict.

Going Beyond One Variable

In our previous article, we created a simple linear regression to predict the width of an iris petal based only on its length. 
That's a good start, but what if we considered other measurements too? 
The iris dataset also includes sepal length and width measurements that could help us make even better predictions.

When we use multiple independent variables instead of just one, we're doing multivariate linear regression. The equation gets a bit longer:

y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + ...

For our iris flowers:

  • y is the petal width (what we're trying to predict)
  • x₁ is petal length
  • x₂ is sepal length
  • x₃ is sepal width
  • b₀, b₁, b₂, b₃ are the coefficients we need to find

Each coefficient tells us how much the petal width changes when we increase that particular measurement by one unit, assuming everything else stays the same.
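
For example, if b₁ turned out to be 0.5 (a made-up number purely for illustration), then a 1 cm increase in petal length would add about 0.5 cm to the predicted petal width, with sepal length and sepal width held fixed.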

Let's Code This Up

Just like before, we'll use the polars-df gem to work with our iris dataset:

require 'polars'

# Load up our iris friends
iris_df = Polars.read_csv("iris.csv")
puts iris_df.describe
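
The describe call prints summary statistics for each column (count, mean, standard deviation, min, max, and friends), which makes a quick sanity check that the CSV loaded the way we expected.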


The math for multivariate regression gets a bit more complex than the simple version we did last time. While we could calculate it step by step like before, let's use a more direct matrix approach.
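
Here's the idea: stack the measurements into a matrix X (one row per flower, plus a column of 1s for the intercept) and the petal widths into a vector y. Least squares looks for the coefficient vector β that minimizes the squared error, and setting the derivative of that error to zero gives the normal equation X'Xβ = X'y, whose solution is β = (X'X)⁻¹X'y. Ruby's built-in Matrix class (from the matrix standard library) can handle the linear algebra for us: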

require 'matrix'

# Extract the columns we need as plain Ruby arrays
petal_length = iris_df['petal_length'].to_a
sepal_length = iris_df['sepal_length'].to_a
sepal_width = iris_df['sepal_width'].to_a
y = Vector.elements(iris_df['petal_width'].to_a)

# Build the design matrix X, with a leading column of 1s for the intercept term
x = Matrix.rows(
  petal_length.zip(sepal_length, sepal_width).map { |row| [1.0, *row] }
)

# The magic formula (the normal equation): β = (X'X)⁻¹X'y
xt = x.transpose
coefficients = (xt * x).inverse * (xt * y)

# Pull out the individual coefficients
intercept = coefficients[0]
petal_length_coef = coefficients[1]
sepal_length_coef = coefficients[2]
sepal_width_coef = coefficients[3]

puts "Our equation is:"
puts "petal_width = #{intercept.round(4)} + #{petal_length_coef.round(4)} * petal_length + #{sepal_length_coef.round(4)} * sepal_length + #{sepal_width_coef.round(4)} * sepal_width"


Making Predictions

Now we can predict the petal width of any iris as long as we know its other measurements:

# Let's predict a new iris!
new_iris = {
  'petal_length' => 4.5,
  'sepal_length' => 6.0,
  'sepal_width' => 3.0
}

predicted_width = intercept + 
                 petal_length_coef * new_iris['petal_length'] + 
                 sepal_length_coef * new_iris['sepal_length'] + 
                 sepal_width_coef * new_iris['sepal_width']

puts "For an iris with petal length of 4.5cm, sepal length of 6.0cm, and sepal width of 3.0cm:"
puts "Predicted petal width: #{predicted_width.round(2)} cm"


Is This Better Than Simple Regression?

Great question! Let's compare:

# Add prediction columns with Polars expressions
# (simple_intercept and simple_slope are the fit from our previous article)
with_predictions = iris_df.with_columns([
  (Polars.col('petal_length') * petal_length_coef +
   Polars.col('sepal_length') * sepal_length_coef +
   Polars.col('sepal_width') * sepal_width_coef +
   intercept).alias('multivariate_prediction'),
  (Polars.col('petal_length') * simple_slope +
   simple_intercept).alias('simple_prediction')
])

# Calculate the mean squared error for each model
multivariate_errors = with_predictions['multivariate_prediction'] - with_predictions['petal_width']
multivariate_mse = (multivariate_errors * multivariate_errors).mean

simple_errors = with_predictions['simple_prediction'] - with_predictions['petal_width']
simple_mse = (simple_errors * simple_errors).mean

puts "Mean Squared Error (Multivariate): #{multivariate_mse.round(4)}"
puts "Mean Squared Error (Simple): #{simple_mse.round(4)}"


If the multivariate model's MSE is lower, then adding those extra variables helped us make better predictions, at least on the data we fit the model to!

Which Measurement Matters Most?

One cool thing about multivariate regression is finding out which variables have the biggest impact on our predictions:

puts "Coefficients:"
puts "Petal Length: #{petal_length_coef.round(4)}"
puts "Sepal Length: #{sepal_length_coef.round(4)}"
puts "Sepal Width: #{sepal_width_coef.round(4)}"


But wait - we can't directly compare these because they're measured in different units. Let's standardize them first:

# Standardize our features (subtract the mean, divide by the standard deviation)
feature_names = ['petal_length', 'sepal_length', 'sepal_width']
standardized = iris_df.with_columns(
  feature_names.map do |name|
    ((Polars.col(name) - Polars.col(name).mean) / Polars.col(name).std).alias(name)
  end
)

# Redo the regression with standardized variables
# ... (the same matrix operations as before, run against `standardized`)

puts "Standardized coefficients (these we can compare!):"
puts "Petal Length: #{std_petal_length_coef.round(4)}"
puts "Sepal Length: #{std_sepal_length_coef.round(4)}"
puts "Sepal Width: #{std_sepal_width_coef.round(4)}"


Now we can see which measurements have the most predictive power!
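
There's also a shortcut worth knowing: when only the features are standardized, each standardized coefficient is just the raw coefficient multiplied by that feature's standard deviation, so we can get them without redoing the matrix math. A minimal sketch:

# Shortcut: standardized coefficient = raw coefficient * feature's std
std_petal_length_coef = petal_length_coef * iris_df['petal_length'].std
std_sepal_length_coef = sepal_length_coef * iris_df['sepal_length'].std
std_sepal_width_coef = sepal_width_coef * iris_df['sepal_width'].std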

Potential Gotchas

A few things to watch out for when doing multivariate regression:

  1. Correlation between variables: If two independent variables are highly correlated (like petal length and sepal length are in this dataset), it gets tricky to separate their individual effects. This is called multicollinearity; a quick correlation check can flag it (see the sketch after this list).

  2. Overfitting: Adding more variables doesn't always make your predictions better. Sometimes it just makes your model fit the training data too closely without generalizing well.

  3. Feature selection: Sometimes less is more! You might not need all those variables to get good predictions.
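
Here's that correlation check from gotcha number 1, done in plain Ruby (pearson is our own helper, not a Polars method):

# A quick multicollinearity check: Pearson correlation between two features
def pearson(a, b)
  n = a.size.to_f
  mean_a = a.sum / n
  mean_b = b.sum / n
  covariance = a.zip(b).sum { |x, y| (x - mean_a) * (y - mean_b) }
  spread_a = Math.sqrt(a.sum { |x| (x - mean_a)**2 })
  spread_b = Math.sqrt(b.sum { |y| (y - mean_b)**2 })
  covariance / (spread_a * spread_b)
end

petal_length = iris_df['petal_length'].to_a
sepal_length = iris_df['sepal_length'].to_a
puts "corr(petal_length, sepal_length): #{pearson(petal_length, sepal_length).round(3)}"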

Wrapping Up

Multivariate linear regression is a powerful extension of simple linear regression that lets us consider multiple factors when making predictions. For our iris flowers, we've seen how we can use all available measurements to better predict petal width.

In our next article, we'll take things a step further and look at logistic regression - a technique that helps us classify things rather than predict continuous values. Instead of asking "how wide will this petal be?", we'll ask "what species of iris is this?" Stay tuned!