Thursday, May 22, 2014

Can we determine the gender of a given runner from his or her split times?

In my previous post, we found a clear difference between male and female runners in the fractional split paces.

The question I want to address now is whether the split times are a good way to predict the gender of a runner.

To approach the question, a preliminary data analysis like the one presented before suggests that the splits at the 5, 10, 15, 20, 25, 30, 35 and 40K checkpoints, together with the total time, provide enough information to attempt a classification problem, that is, a regression whose target variable is binary (1 for a male runner and 0 for a female runner).

I started with the entire set of runners in the 2011 NYC marathon, drew a random sample for the training set and another for the test set, and tried a few algorithms. Here I will present the results with k-NN.
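As a rough illustration of the random sampling step, here is a minimal sketch of a train/test split. The function name, the 30% test fraction and the fixed seed are my own assumptions for the example, not the values used in the analysis.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (train, test) lists.

    test_fraction and seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Each row would hold one runner: the split times followed by the 1/0 label.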

k-NN:

Two heads are better than one, and k heads are even better than one.
k-NN relies on the fact that, locally, the best decision can be made when the majority of your friends agree on something. Many assumptions are needed to get to this claim, but it seems very reasonable. One important thing to consider here is the meaning of close friends: how do you determine who your closest friends are? It is clear that this question goes beyond the geographical sense of the word, and it requires a different way to measure distances, that is, a metric.
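The majority-vote idea can be sketched in a few lines. This is only a minimal illustration, not the code used for the analysis: the function name `knn_predict`, the default `k=3` and the use of `math.dist` (plain Euclidean distance, Python 3.8+) as the default metric are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, distance=math.dist):
    """Predict the label of `query` by majority vote among its k nearest neighbours.

    train_X  : list of feature vectors (e.g. split times)
    train_y  : list of labels (1 for male, 0 for female)
    distance : any function of two vectors that returns a number
    """
    # Sort the training points by their distance to the query point.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: distance(pair[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

Using an odd k avoids tied votes when there are only two labels. Passing a different `distance` function changes who counts as a close friend.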

In the case at hand, the input data will be the split times and the labels (1 for male and 0 for female).
Many metrics were implemented, among them Euclidean, Manhattan and dot product.
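For reference, the three measures named above could be written as follows. This is a sketch with function names of my own choosing. Note that the dot product is a similarity rather than a distance (larger means closer), so how it is turned into a neighbour ranking is an assumption left open here.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dot_product(a, b):
    """Dot product of two vectors; a similarity, so larger values mean closer."""
    return sum(x * y for x, y in zip(a, b))
```

With split times measured in the same units, Euclidean and Manhattan distances can be used directly. The dot product would need to be negated or normalized before it can rank neighbours from closest to farthest.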




