Pregnancy Weight Linear Regression with Octave (2)

Last time, we did an exploratory plot of the data and determined that it was a decent candidate for linear regression. Now let’s do that regression!

Linear Regression

This is the excellently functional tutorial I found, and while I will be streamlining a couple of commands, the nuts and bolts are exactly the same.

Since I went and complained about other tutorials not being reasonably standalone, I can’t very well do the same thing, now can I? :) So I’ll go ahead and pretend we’re starting all over again in Octave. Here’s our data file, and here’s us loading it again:

> data = load('data.csv')

Now, let’s go ahead and split out our x and y vectors so that we don’t have to keep saying data(:,1) and data(:,2).

> x = data(:,1)
> y = data(:,2)

From the perspective of the linear regression, our independent variable x is time (or the date of the weighing). Our dependent variable y is the weight.

At this point, stuff gets pretty mathematical. Even having taken an entire course on linear regression (albeit nearly a decade ago), I had to go on a week-long linear algebra bender before it all made sense. So while I hope to dig into it for you someday, for now, I’m just going to say that why this all works is beyond the scope of this post.

So here we go. Because we we want our line to be able to cross the y axis somewhere other than the origin (that is, we want it to have a nonzero intercept), we have to tack on a column vector of 1’s to our x vector to make a matrix , like so:

> X = [ones(length(x), 1), x]

Now we use the normal equation to find theta, which will be a magical vector containing our intercept and slope. Yeah, just copy and paste this equation in:

> theta = (inv(X'*X))*X'*y

You should have gotten theta = [-2.1892e+03, 1.6410e-06].

pinv vs inv

Those of you that went to the original tutorial may have noticed that she used the pinv function, which gives you the pseudoinverse of the matrix. The pseudoinverse gives inverses for non-square matrices as well as square. Every article I can find says that when in doubt, use pinv, and seeing as our 49x2 matrix is most definitely not square, I would have guessed that pinv would be head and shoulders the right choice.

But alas, pinv is giving me results that are wayyy off. Not a little, a lot! In contrast, inv is giving me a line that looks plausibly correct (see graph below). So I’m going with inv right now.

I realize that proof by “well it looks alright” isn’t terribly rigorous :), so I’m planning to get to the bottom of this, and I will update this post when I do. Meanwhile, if any of you know what’s going on, please email me!

Plotting the Line

Now that we have our line, let’s see how it looks next to our data.

> plot(x, y, 'linestyle', 'none', 'marker', 'o')

We are now going to do the fancy pants thing of putting TWO plots on the same graph. And we do it with this nifty little command:

> hold on

In case it’s not clear, there is a hold and you are turning it on. (The corresponding command is hold off.) But seriously, am I the only person who saw that and thought we were hollering “hold on!” the way we would at a friend who was walking too fast? Did the medieval person who invented that phrase think of it like that? And who else thinks it’s awesome that breakfast is literally “break fast”, as in you’re breaking your overnight fast? Anyway …

> plot(x, X*theta)

And we have:

Linear Regression Line


We could leave it there, but being a perfectionist, it occurs to me that we could do just a little bit better. See that dip in the graph in the beginning? That was my morning sickness. For 8 weeks, I ate almost nothing, threw up multiple times a day, and lost a total of 7 pounds. (Spencer, relegated to eating his own cooking, lost 10. I kid you not.) So my pregnancy weights form more of a checkmark, and if we do the two parts separately, maybe we can get a better approximation. We’ll try that next time.