Pregnancy Weight Linear Regression with Octave (2)
Last time, we did an exploratory plot of the data and determined that it was a decent candidate for linear regression. Now let’s do that regression!
This is the excellently functional tutorial I found, and while I will be streamlining a couple of commands, the nuts and bolts are exactly the same.
Since I went and complained about other tutorials not being reasonably standalone, I can’t very well do the same thing, now can I? :) So I’ll go ahead and pretend we’re starting all over again in Octave. Here’s our data file, and here’s us loading it again:
> data = load('data.csv')
Now, let’s go ahead and split out our x and y vectors so that we don’t have to
> x = data(:,1) > y = data(:,2)
From the perspective of the linear regression, our independent variable
time (or the date of the weighing). Our dependent variable
y is the weight.
At this point, stuff gets pretty mathematical. Even having taken an entire course on linear regression (albeit nearly a decade ago), I had to go on a week-long linear algebra bender before it all made sense. So while I hope to dig into it for you someday, for now, I’m just going to say that why this all works is beyond the scope of this post.
So here we go. Because we we want our line to be able to cross the y axis somewhere other than the origin (that is, we want it to have a nonzero intercept), we have to tack on a column vector of 1’s to our x vector to make a matrix , like so:
> X = [ones(length(x), 1), x]
Now we use the normal equation to find
theta, which will be a magical vector
containing our intercept and slope. Yeah, just copy and paste this equation in:
> theta = (inv(X'*X))*X'*y
You should have gotten
theta = [-2.1892e+03, 1.6410e-06].
Those of you that went to the original tutorial may have noticed that she
pinv function, which gives you the pseudoinverse of the matrix.
The pseudoinverse gives inverses for non-square matrices as well as square.
Every article I can find says that when in doubt, use
pinv, and seeing as
our 49x2 matrix is most definitely not square, I would have guessed that
would be head and shoulders the right choice.
pinv is giving me results that are wayyy off. Not a little, a
lot! In contrast,
inv is giving me a line that looks plausibly correct (see
graph below). So I’m going with
inv right now.
I realize that proof by “well it looks alright” isn’t terribly rigorous :), so I’m planning to get to the bottom of this, and I will update this post when I do. Meanwhile, if any of you know what’s going on, please email me!
Plotting the Line
Now that we have our line, let’s see how it looks next to our data.
> plot(x, y, 'linestyle', 'none', 'marker', 'o')
We are now going to do the fancy pants thing of putting TWO plots on the same graph. And we do it with this nifty little command:
> hold on
In case it’s not clear, there is a hold and you are turning it on. (The
corresponding command is
hold off.) But seriously, am I the only person who
saw that and thought we were hollering “hold on!” the way we would at a friend
who was walking too fast? Did the medieval person who invented that phrase
think of it like that? And who else thinks it’s awesome that breakfast is
literally “break fast”, as in you’re breaking your overnight fast? Anyway …
> plot(x, X*theta)
And we have:
We could leave it there, but being a perfectionist, it occurs to me that we could do just a little bit better. See that dip in the graph in the beginning? That was my morning sickness. For 8 weeks, I ate almost nothing, threw up multiple times a day, and lost a total of 7 pounds. (Spencer, relegated to eating his own cooking, lost 10. I kid you not.) So my pregnancy weights form more of a checkmark, and if we do the two parts separately, maybe we can get a better approximation. We’ll try that next time.