Regression Modeling Strategies 2020 Review

A couple of weeks ago I attended Dr. Frank Harrell’s Regression Modeling Strategies course. I wanted to use the chance to write a short post on some of the key takeaways that I got from the course for my own sake and just keep on my practice of writing every day.

Expectations and Reality

The course met my expectations in that it was four days of A LOT of information from Dr. Harrell’s book on regression modeling pitched for people who know this stuff way better than I do.

I don’t have a background in (bio)statistics, epidemiology, or live and breathe stats every week to the extent that I imagine most other people on the Zoom call did. At many times it felt like tons of the material was slipping right through my fingers and was struggling to keep up with it (first semester grad school throwback!). Maybe things would have been different if I was not on UK time and starting the course at 3PM and ending at 10PM (after doing some teaching in the morning), but there’s no alternative universe in which I would have been sitting on the other end and feeling comfortable with the content given what I’ve learned up until this point.

That said, I learned SO MUCH from this course and spending the six hours a day soaking in this information made me realise how much I miss having research be a part of my daily life. In the spirit of making writing a practice and not a performance, I wanted to summarize much of what I did take away from the course for my own (and other’s) benefit.


My first big takeaway was that I now feel that statistical simulations were and continue to be the missing piece of my previous statistics education and teaching.

If you’re not exactly sure what I mean by this, a statistical simulation (or at least what I mean when I am referring to it in this context) is when you as the analyst sort of “play God” in order to create data that behaves exactly like you want it to then see how well your methods do in finding the “Truth”. Of course we don’t get this luxury when doing actual research (since if we knew the Truth, we wouldn’t have to do any empirical research!) so running a simulation before we actually get any messy data will give us a much better view of how our methods will perform under different conditions.

From a research perspective, doing this seems almost totally obvious to have to do before investing so much time and energy into data collection, but speaking first hand as an applied researcher who is learning much of this as they go, this was just never a part of any class I ever took.

One of the reasons that this idea of running simulations resonated with me so much mostly comes from one of the hours on the second day of the class where Dr. Harrell used this technique to show how bad running automated stepwise regression can be. You can page through the notes here in the notes for Chapter 4 but the TL;DR is that if have some sort of regression model with 8 variables– four of which actually have a true relationship with what you’re trying to predict– and you were to just rely on Akaike’s Information Criterion to do the work for you, you need at least ~2,000 samples in order to have your little robot select the correct set of variables only half of the time…

Like much of my statistical knowledge, I feel like so much of it started as a collection of rules of thumbs such as “never, ever, ever do stepwise/automated regression models”, but it was not until I saw something like this where it began to to make sense on how to get these many thumbs that I have been collecting.

I think a lot of that has to do with the fact that a simulation gives the learner some sort of causal dial to play around with and not just be told that “well if you did it a different way then this would happen, just take my word on it”.

Though getting back to the stepwise regression example, this little detour was depressing as hell to know how common this is in research and how bad of a tool it actually is and how long people have known about it. I want to play around with this a bit in the future to understand it better and will hopefully be blogging about it more soon.

But as depressing as this was to know how misguided so many regression models have been, one thing I really did like about the course was how Dr. Harrell reiterated many times that the reason that he does this course is so other people don’t have to make the mistakes that he did.

I guess the bottom line of all of this is that I really think that becoming a bit more knowledgeable about simulation data is going to be helpful both in my own research as well as as a pedagogical tool going forward. I think coupling it with ideas about pre-registration, it will help a lot more with thinking beforehand about what the true state of the world rather than just taking a shotgun approach to knocking down null hypotheses and calling it science.

Case Studies

While many of the more abstract concepts went a bit over my head as someone without formal statistics training, the next thing that I found a lot of value in were the few case studies that were used to exemplify the topics being discussed. Nearer the end of the week, there were a few examples taken from Regression Modeling Strategies that Dr. Harrell worked though to demonstrate many of the concepts we had been talking about in action.

For example, there was an extended walk through of modeling the titanic3 dataset. Now I know that many data science people reading this might groan at the thought of slugging through another tutorial on that uses the Titanic dataset, but from a pedagogical standpoint using these canonical datasets really does reduce the germane cognitive demands of a learner to take in both new techniques and all things relevant to a new dataset.

Because there was so much new information being conveyed, if Dr. Harrell would have used anything else, I think my stats brain would have basically imploded. What I found particularly helpful about this walk through (and many others, which is why I think these screencasts by Julia Silge and David Robinson have become so popular) is the chance to see what an analysis looks like through the eyes of an expert.

Personally, I feel I learn best when trying to emulate in the presence of an expert with them gently guiding me when I go off course. This might come from my background as a musician, but I personally can’t think of anything more powerful from a pedagogical standpoint than one-on-one pedagogy (For more reading I remember seeing some refereneces to this in Teaching Tech Together both here and here)), but of course this doesn’t scale well.

What’s the answer then?

Well, I think a bit part of it is trying to give learners access to that stream-of-consciousness leading to final product thought-process so that the learner can build up whatever temporary scaffolding they need to figure out how and why an expert made the choice they did.

Reflecting on the course three weeks out, returning to these case studies is the first thing that I will want to do when I do have more time to look at this because everything was so tangible and thus much more memorable. I will be keeping that in mind for my future teaching.


There were other many other gems that I also took away from the course that didn’t make it under the main umbrellas here like justification for using multiple imputation methods for missing data (answer: don’t throw away valuable data you already collected!!), not assuming that your model is always going to be a straight line and thinking about using splines (when does that ever happen in nature), and a whole lot of other stuff.

So hopefully going forward I will be able to start hacking away at all this, this coming summer with my own projects.

The last thought I wanted to end on regarding things I took away from the course (so I can also come back to and know that I actually did feel this) was more of a lateral learning moment in hearing many times that the goals with a lot of the statistical modeling that I do, there is way less on the line than what others have to deal with.

There arn’t millions of dollars of government money on experiments I design and I’m very glad that degree of ischemia is not the dependent variable I need to model in a clinical trial. Relative to this kind of work, there’s no reason for me to get as stressed about many of my modeling choices in my goofy reaction time studies where I am looking at recreating a WEIRD subject’s implicit understanding of Western tonal music via some sort of behavioral paradigm. Yes, I might get it wrong, but if you can’t be wrong, you’re really not doing science.

So wrapping up, I really took a lot from this course despite the fact that I felt very out of my depth most of the time. The course made me really realise how much I missed learning at this rate and made me long for being able to do more research rather than teach more. I have a feeling that will be more reality than I think in the future, but I guess only time will tell on that front!