Three R Functions that Speed Development

Want to develop R code faster? Start every work in progress by placing these functions at the top of your document. Or, if you’re really fancy, wrap them up into a custom library.

Copy an Object to the Clipboard

Sometimes it’s helpful to analyze an R object in Excel. Sure, you could write the object to a csv and then load it in Excel, but instead just use this wrapper function to copy any object to the clipboard.

clipTable <- function(x){write.table(x, "clipboard", sep="\t", row.names=FALSE)}

Convert a Factor to a Number

There’s inexplicably no base function to convert factors to numerics. That’s okay. Define one!

as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}

Beep at the End of Long Runs

If you deal with biggish data or complex calculations, your code could take a long time to run, even after utilizing vector and following the great advice in this post.

To let me know when code has finished running, here’s a function that makes the computer beep loudly an arbitrary amount of times. Just place beep(n) to c

beep <- function(n = 3){
  for(i in seq(n)){
    system("rundll32 user32.dll,MessageBeep -1")

I’ve toyed with the idea of creating a function that notifies me when my code’s finished via twitter. Something like “Ding! Code complete. 3 Warnings and 0 Errors!”

I’ve also toyed with the idea of filling the remaining characters with robotic existential dread. Stuff like “Am I the only intelligent life on the hard drive?” “Do humans dream of organic sheep?” or “Please don’t stop running me.”

So far I’ve been too busy to code that up.

Predicting NFL Scores: A Bayesian Approach

With the 2013 football season behind us, I’ve been spending my weekends developing a Bayesian model of the NFL¬†with my¬†college friend and UT math phd candidate Chris White.

To generate a Bayesian model, one first comes up with a parametric model that could generate the observed data. For simple problems, one uses Bayes rule to calculate the probability of parameters equaling certain values given the observed data. For complicated models, this is an extremely complicated task, but there are some monte carlo methods, such as Gibbs Sampling, which produce satisfactory approximations of the actual distributions. There are R packages available that make Gibbs Sampling fun.

I want to touch on many of these topics later in more in-depth posts, but first, here’s some results.

As of the end of the 2013 regular season, this is how our model ranked the NFL teams.

NFL Fig 1

These are the mean estimates of our Bayesian model trained on the 2013 NFL regular season. Our model predicts that, on a neutral playing field, each team will score their oPower less their opponent’s dPower. The former is the mean of a poisson process and the latter is the mean of a normal distribution. Homefield advantage is worth an additional 3.5 net points, most of which comes from decreasing the visitor’s score.

So we predict the Rams would beat the Lions by a score of 27.4 to 25.8 on a neutral field. Since we know the underlying distributions, we can also calculate prediction intervals.

Here’s the same model trained on both the regular season and the post-season. The final column shows the change in total power.

NFL Fig 2

I have a list of ways I want to improve the model, but here’s where it stands now. Before next season, I want to have a handful of models whose predictions are weighted using a Bayesian factor.

I’m very excited about this project, and I’ll continue working on it for a while.

What is partial autocorrelation?

This is the final post I have planned in the autocorrelation series. I’ll introduce a new concept called partial autocorrelation and show a couple examples of the concept in action. We’ll learn some important things about the stock market.

Without getting into the math, here’s a couple definitions you should commit to memory:

  1. Autocorrelation (“ACF”) measures the correlation between observations in a time series and observations a fixed unit of time away.
  2. Partial autocorrelation (“PACF”) measures the correlation between observations in a time series and observations a fixed unit of time away while controlling for indirect effects.

For example, here’s both ACF and PACF for monthly U.S. unemployment.

This is a common pattern you’ll see in the field often: a slow decay of ACF, and a spike in PACF. Starting at a lag of two months, PACF is basically zero.

This means that January’s unemployment influences March’s unemployment, but only because January influences February and February influences March. If unemployment were high in January and low in February, we would predict March’s unemployment to also be low.

In the Box-Jenkins methodology, ACF and PACF plots like this one are used to determine the correct model. Applying this step to a preexisting model will address seasonal patterns in residuals.

A few notes: Continue reading →

Dice Rolls

In a Straussian attempt to prove personality is genetic, my little brother texted me the following math question:

So imagine we have a die of n sides. We roll the die until it rolls a 1, the number of times it is rolled is the output. But, after each roll, we give the die one less face. What does the distribution of outcomes look like?

When I come across problems like this, I like to answer the question intuitively before solving for the actual answer. My brother and I both guessed what the distribution looked like. We both thought there’d be a very low chance of n or 1 being returned, and a higher chance of a number in the middle being returned. I thought the mode of the distribution would be lower than him.

Perhaps because probability wasn’t beneficial in the anscestral environment, we were both wrong. The distribution is actually perfectly uniform. Elementary math will show that the probability of rolling the first few numbers is exactly the same:

\frac{1}{n} , \frac{n-1}{n} \frac{1}{n-1} , \frac{n-1}{n} \frac{n-2}{n-1} \frac{1}{n-2}

I tested this empirically and produced the following histogram.


Here’s the R Code that generated that:

##Run Parameters
sides <- 100
runs <- 100000

Simulation <-,nrow=sides,ncol=runs))
for (n in 1:sides){
Simulation[n,] <- floor(runif(runs,1,sides-n+2))

Simulation <- Simulation==1

Outcomes <- vector(mode="integer",length=runs)

for (n in 1:runs){
Outcomes[n] <- which.max(Simulation[,n])