## Thursday, April 5, 2012

### Fun with statistics: A tutorial on using linear regression to develop race time conversions

As you can see, I've been playing around with numbers recently.  "Conversion charts" intended to predict future race performances (or equivalent race times) based on a previous race abound on the internet and in running books.  But they often don't have the race distance you are looking for.  And the various conversion tools runners use have very different meanings behind them—two charts may give wildly different "equivalents" for a given time (say, a 4:30 mile).  Why is that?

Well, any conversion chart has a set of assumptions behind it.  When converting between two distances (say, a 5k runner who wants to know what he can run for the 10k), there is usually an implicit assumption that you will have trained for the 10k as well as the 5k.  Charts and conversion factors are often based on runners who ran both events, and thus probably trained for both events too.  Greg McMillan's calculator and Jack Daniels' tables are examples of conversions based on runners who actually prepared for and raced in a wide variety of events.  While McMillan's chart predicts a 4:20 miler can run a 2:26 marathon, that math is based on marathon runners who actually raced a mile during marathon training.  There are a lot of 2:26 marathoners who can't run 4:20 in the mile and a lot of 4:20 milers who can't run a 2:26 marathon, but the chart isn't valid for them because they didn't prepare for both events!

Another set of "conversions" exists to compare equivalent performances, which is a very different thing than predicting an equivalent race.  Equivalent performances are based on a ranked performance list, so they allow you to compare two completely unrelated events, like the shotput and 1500m.  The IAAF publishes a book of "points tables" that allow for comparing how "good" a performance is in a particular event.  By necessity, however, these kinds of tables will be biased towards distances and events that are not contested as often—as an example, the 100th best steeplechaser in the world will probably be a lot further back in a 1500m race from the the 100th best 1500m runner, simply because there aren't as many competitors in the steeplechase, even though their performances are equivalent on the world stage.  The same holds true for events like the 15k or 10 miles, since they just aren't as many people who compete in them.  This is the weakness of equivalent-performance charts.

Anyways, getting back to the numbers: it will be useful for any coach to be able to develop a formula to give a rough estimate of what to expect when heading into a race.  It's easy to find 3k-to-5k conversions, but what if your conference runs the two-mile? Or the 600 yards? Or if you want to compare two cross country courses to find out how much faster one is than the other?  What follows is a tutorial on how to go about comparing times among two similar distances or courses using marks from runners who competed in both events.  I'll be using Microsoft Excel to do this analysis, but it is also easy to do in OpenOffice or LibreOffice, two free and open-source spreadsheet programs. You could probably even do this in Google Docs if you'd like.

Getting started: find and import your data

Finding and importing the data you want to analyze is the first step.  The initial problem that spurred me to investigate this whole conversion issue was the lack of an acceptable conversion factor to predict 600m times from 800m times.  So, to develop my conversion and to illustrate the process, I'll be using times from this year's indoor honor rolls from the Minnesota Intercollegiate Athletics Conference, or MIAC.  Most conferences have performance lists you can save as text files and import into Excel.  The MIAC is ideal in many ways since, unlike a Division I conference, the MIAC displays a much broader range of abilities.  While a men's 3k performance list in a DI conference might range from 7:50 to 8:40, the MIAC list ranges from 8:06 to 11:00, allowing us to evaluate the validity of our predictions over a larger range of race times.

The MIAC honor rolls are already in text format, so all I have to do is copy the 600m and 800m honor roll in their entirety and paste them into a Notepad document.  Then, save it as a .txt file (not .rtf, .doc, or anything else)

That was easy enough.  Now open up MS Excel or your spreadsheet program of choice and select File>Open and navigate to where you saved your .txt file.  Initially, it won't show up, but all you have to do is select "All files (*.*)" in the "Files of type..." option.

When you select your file and open it, a dialogue box should pop up giving you a variety of import options.  Excel can automatically sort the names, times, and dates into separate columns.  You may have to play around with the options to get it to work with your data set, but the MIAC honor rolls import very cleanly with the default settings.

Once you click "Finish," you should have your data on a spreadsheet.  First you'll need to save the spreadsheet as an Excel workbook (.xls or .xlsx) so you can sort, make charts, and so on.  Once you've done that, you'll need to do some organization: on my workbook, I'm going to post the 800m and 600m honor rolls side-by side, then delete all of the extraneous information like school, academic year, and date of the meet.  Also make sure you delete any non-performance standards that are mixed in with the honor rolls, like an "automatic qualifying" mark, since that will screw up the data analysis later.

Cleaning up and formatting

That's a lot cleaner.  But we still have have some leftover "gunk" from the web format.  The easiest way to get rid of this is to simply sort the honor rolls by time from smallest to largest, which will put all the extra stuff at the bottom.  First, though, you should make sure Excel is recognizing the times right! Excel can be very finicky with times (as we'll see in a little bit), but what you need to do for now is select the column containing the race times, right-click, and select "Format Cells..."

Then, in the Format Cells menu, you'll need to scroll down to "Custom."  "Time" is actually reserved for time of DAY, not discrete time units.  Type in "mm:ss.00" in the custom formula box and click okay.  Notice how this has brought back the hundredths digit on the honor roll times.  Don't forget to do this for the other column too!

Now, we can select each set of data—both the names and the times at once—and head over to the "Data" tab and click "Sort."  You'll want to sort by Column B (the times) from smallest to largest.  This will bump all of the extra stuff to the bottom (like the "10" in cell A 12).  Do this for both sets of data you are comparing.

Now that I've got the organization out of the way, I've got my two sets of data side-by-side.  The next step is to isolate the runners who are in both sets of data. In my case, I want to find all of the MIAC runners who are on the honor roll for the 600m and the 800m. I'll do this using the MATCH() function in Excel.  MATCH is a function which takes three arguments: the input, the search range, and the type of search.  Its intended use is to return the position of a certain value within a set of data.  So if I was looking for what position Rob LeMay is on the 600m honor roll, the MATCH function would be great for that.  If the person I'm looking for is not in the set of data I'm looking at, the function will return "#N/A," which will easily allow me to eliminate the runners who ran only one event and not the other.   So, what we are going to do is make a "helper column" in column C titled "In 800?" which will use the MATCH function to look up whether each runner on the 600m honor roll is also on the 800m honor roll.  It's a bit tricky, so look at the screenshots to figure it out.

For the first entry in column C, I'll enter =MATCH(A2, \$F\$2:\$F\$107, 0).  This tells Excel to look up whether Mike Hutton (entry A2), entry #1 in the 600m honor roll, exists in the 800m honor roll, which ranges from cells F2 to F107.  The \$ signs between each cell row and column are important because they denote that this range is static—when we start copying this formula down into adjacent cells, the range we are searching won't change.  The third argument in the function, 0, just tells Excel to look for an exact match.  Once you've entered the formula, you should see that it return "1", meaning that Mike Hutton exists in spot #1 on the 800m honor roll as well.  Now, the easy part: just select all of column C and hit Ctrl-D on the keyboard or select C2 and "drag" it down by the bottom right corner (+ symbol cursor) for the rest of the honor roll.  Lather, rinse, and repeat for the other data set: I'll do the same thing for the 800m honor roll, making a helper column titled "in 600m?" and use the MATCH function to look up whether the runners in the 800m honor roll exist on the 600m honor roll.  When I'm done, this is what I have:

Each data set has a column which tells me whether a given runner is in the other data set.  We could match these up by hand, but that's far too tedious.  Instead, we'll sort each data set again, but this time, we'll sort each data set by its "helper column" (C and H, respectively) from smallest to largest value.  This will bump all of the runners with "#N/A" in their helper column—the ones who did not compete in the other event—to the end, where we will delete them.  Now what you should be left with is two sets of data equal in length and organized by each runner:  So for example, below, you can see that we have Joe Arnold's 600m time in row 17 and his 800m time also in the same row.  There are a few ways this could get mixed up: if a runner is entered multiple times on the honor roll for an event, or if a runner's name is misspelled in one instance, for example.  So if things don't match up, you'll have to hunt down the cause.  Fortunately, RaceBerry (the site which hosts the MIAC honor rolls) is very good about keeping its results tidy.

Now, developing a conversion should be as simple as dividing the times, right? Well unfortunately, no.  Because Excel was programmed mostly to do math on times of day, not discrete units of time, math with minutes and seconds is ugly.  The only workaround I've found is to convert the time into seconds, but even THIS isn't easy.  Here's how to do it: create two helper columns next to the "in 800m?" column, titled "days" and "seconds."  In the "days" column, simply set each cell equal to the race time.  So D2 will just be "=B2" (see below).

Then drag this down for the rest of column D.  Select all of the cells in column D, right-click, and select "Format Cells..." again.  Then just click "general."

This appears to spit out a meaningless decimal.  But—and this took me over an hour to figure out—this decimal is not meaningless.  In their infinite wisdom, the programmers of Excel decided to store all times as fractions of a day!  So you'll see that 0.000929282 just so happens to be the number of days in one minute, twenty point two-nine seconds.  This is why we have the "seconds" helper column.  All we have to do is multiply the number of days by 24, 60, and 60 (twenty four hours in a day, sixty minutes in an hour, sixty seconds in a minute).

Now duplicate this feat for your other set of data (the 800m list in my example) and you'll have a matched list of race times, in seconds.

All that's left now is to make a graph.  Make a basic scatter plot of the 600m times, in seconds, against the 800m time, in seconds.  I'll plot the shorter race (600m) on the x-axis, but it doesn't really matter.  In the graph below, I've taken the liberty of adding in the data from the women's honor roll too, and as you can see, they form a pretty nice line.  Of course it won't be perfect, since there's going to be some variability.  But there's a clear relationship between 600m fitness and 800m fitness.

To establish a good conversion, all we have to do is add a trendline.  We can do this by right-clicking on the data points and selecting "Add Trendline..."  It should be linear, and we'll want the equation and R-squared value on the chart.

There are two interesting questions we should answer now:

First, is a true linear regression going to be better than a straight multiplier? When we fit a line to these data, we can either force that line to intersect the origin of our graph (by clicking "Set Intercept = 0.0"), or we can allow it to intersect wherever.  Mathematically speaking, we have no good reason to force our line to intersect the origin, so a true linear regression (y = a*x + b) is the proper fit.  But from a practical perspective, just multiplying by a constant (y = a*x) is a lot easier, so if we can do that without losing much accuracy, that would be nice.

The second question is whether men and women can be grouped together.  It certainly looks that way from the graph, but there might be some good physiological reasons to think otherwise.

As for whether we can get away with a simple conversion, here's the comparison:
 Click to Enlarge
As you can see above, we do get a slightly better fit using a true linear regression, and there does seem to be a small difference between men and women.  The R2 value tells us the goodness of fit of our line.  If all of the data points lied perfectly on the line, the R2 value would be 1.0.  If they were completely random and unrelated, the R2 value would be zero.  R2 is, approximately speaking, the fraction of the variance in the data we can explain.  So here, we can explain about 75% of the variability in 800m time by looking at a runner's 600m time.  This might seen dreadfully deterministic, and on some data sets (3k vs. 5k time, for example), I've observed R2 values near 0.90, but keep in mind that even a 10% variance in time is HUGE in a running event.  That's about the difference between a 16:30 and a 15:00.

If we force the lines to intersect the origin, we lose about 1% of our explaining power, but in exchange, we get a much easier calculation and one that is virtually identical for men and women.  Seems like a good trade to me, but if you're a math major, you'd understandably scoff, since it still isn't "proper."  Additionally, while carrying out this conversion analysis on all events from 400m to 5000m, I found a few instances where forgoing the linear regression resulted in a 5-10% drop in explaining power, most notably when converting 1000m times to mile times. Usually, though, the loss was less than 1% and men were very similar to women in their conversion factors.

So, if we're using the simple multiplier (graph on the right, above) to predict an 800m time from a 600m race, all you have to do is multiply the time, in seconds of course, by 1.4035 for men or 1.4077 (or we could split the difference and use about 1.406 for both).  To predict a 600m from an 800m time, divide the 800 by the same factor.  Solving for times if you're using the true linear regression model is a tad more complicated, but only requires some basic algebra.  Using the simple multiplier is also easier because it allows you to easily "stack" them to jump multiple distances (for example, converting a mile to a 5k).  Keep in mind that, in general, the greater the (physiological) difference between two events, the less reliable conversions will be.  Some of my most reliable conversions are actually from 3k to 5k, whereas the least accurate ones are from 400m to 600m.  This reflects the fundamental physiology at work in these events—in many ways, the 400m differs more from the 600m than the 3000m does from the 5000m!

Simple conversions for events from 400m to 5000m (indoors)

Here are the aggregated results from this year's MIAC indoor track honor roll.  I'm using a simple multiplier, not a linear regression, and I've aggregated men and women together.  I've also truncated the multiplier to four significant digits, since it doesn't make much sense to go overboard with our "sig figs."

Comprehensive linear regression conversions for men and women from 400m to 5000m (indoors)

Here are the mathematically correct and moderately more accurate conversions from the same data, segregated by gender and using a proper linear regression.  I've also included information on the explained variance (the R-squared value) and the gain in explained variance vs. the simple multiplier fit for the math majors.  I haven't dropped many significant figures yet since you'd do that after your final calculation was done.

 Click to enlarge
Conclusion

I hope you've found this tutorial helpful.  Although I used this simple linear regression model to compare performances in different races in indoor track, you could easily adapt it for other uses—how much slower is your state cross country meet course than your conference meet course? How do cross country times compare with track times? Something I hope to look at in the near future is how indoor track conversions compare to outdoor track conversions.  Now, these conversion are never going to be perfect.  If you are comparing cross country courses, for example, you have to take into account the weather, the time of year the race happened, and so on.  And even in my indoor track conversion here, they probably underpredict 5k performance since it's not contested nearly as often as a mile or 3k (and thus runners don't have as many chances to run well in it).  Runners are also improving throughout the season, so that muddies the data too.

In the end, this is okay to some extent, because our "conversion" is never going to be super accurate anyways.  Just because you ran 4:31 in the mile doesn't mean you can't break 9:00 in the 3k because the conversion factor is 2.001.  Remember, even the most accurate conversions here only explain ~90% of the variance, and 10% is a LOT! I try my best to avoid falling into the trap of conversions, paces, and numbers, but it's also not always a good idea to go into a race with no idea of what you're capable of.  View a conversion more as a guideline, or a reasonable goal to shoot for, rather than some scientific, predetermined fate.  Here's to proving these formulas wrong!