Hire Me (as a Data Scientist!), Part II

Continuing on from my earlier post, I’m now looking to tackle the question:

If you had to pick 3 beers to recommend using only this data, which would you pick?

This is a pretty open ended question, which is kind of fun. I also don’t really have a ton of experience (yet!) in recommendation systems, though I have done a little reading here or there on it.

My goals in coming up with three beers to recommend were to:

  1. Try to find the most popular beer among super users of the website
  2. Find a bizzaro beer that matched the profile of my first beer, but lives in the long tail of the ratings distribution
  3. Find the best Beer sans Booze (Highest Rating with lowest ABV)

So let’s begin! Here’s how I went about tackling this question.

Bizzaro Beer

Now Pliny The Elder seemed to be a pretty popular beer. But what if I was trying to sketch out some ideas about what other beers I could recommend to beer lovers who like Pliny The Elder? It needed to somewhat “look like” the target beer, but have way less reviews.

Playing with some of the fringe data here, I wanted to be careful not to again pick a beer with only one or two ratings on it. My rationale was coming from assuming there is some sort of true “population mean” for this beer and having a beer with too little reviews will not approximate the mean correctly.

## Make Unique Beer Label for Larger Dataset
beer[, beer_name_unique := paste(brewery_name,beer_name, beer_style)]

## Count Number of Reviews Each Beer Has  
number.of.reviews <- beer[, .(NumberOfReviews = .N), by = beer_name_unique][order(-NumberOfReviews)]
 
## Only get beers with over 30 reviews
reliable.beers.list <- number.of.reviews[ NumberOfReviews >= 30 ]

## Join that to our big 'beer' dataset only matching beers with over 30 reviews
beer.reliable <- reliable.beers.list[beer, on = "beer_name_unique", nomatch=0]

With our dataset chiseled down to only ‘reliable’ beers, I needed to find a way to get some sort of profile of each of the beers. While my first instinct was to do some sort of data reductive type thing like a PCA on our continuous variables and use distances from certain scores as metrics of similarity (which I have done before and it ended up actually being the inspiration for a tool currently used by Soundout!), doing that on so few predictors seemed extra.

So instead, I figured why not just assume that there is some sort of wiggle room in my hastily made recommendation system and just match first on the overall review, then if there are some close contenders, look for matches on other metrics?

The next bit of code creates a table of the metrics I am interested in, gets beers that have over 30 reviews, but less than 100, and also creates a vector so I can pull out all of the IPAs on my less reviewed beers table. I then joined the tables for my candidates.

# # Get metrics used for distance calculations
beer.metrics <- beer.reliable[, .(mean_review_overall = mean(review_overall),
                                  mean_review_aroma = mean(review_overall),
                                   mean_review_appearance = mean(review_appearance),
                                   mean_review_palate = mean(review_palate),
                                   mean_reviw_taste = mean(review_taste),
                                   sd_review_overall = sd(review_overall),
                                   sd_review_aroma = sd(review_overall),
                                   sd_review_appearance = sd(review_appearance),
                                   sd_review_palate = sd(review_palate),
                                   sd_review_taste = sd(review_taste)),
                               by = beer_name_unique]

## Get only IPAs with less than 100 reviews
less.reviewed.beers <- number.of.reviews[NumberOfReviews <= 100 & NumberOfReviews >= 30]
## Make vector to help find IPAs
find.IPA <- str_detect(string = less.reviewed.beers$beer_name_unique, pattern = "Imperial IPA")
bizzaro.candidates <- less.reviewed.beers[find.IPA]

#Create Table
bizzaro.candidates.metrics <- bizzaro.candidates[beer.metrics, on = "beer_name_unique", nomatch=0]

Of these less reviewed beers, I now needed to find the one that was “closest” on the few dimensions I had to work with. The simplest way to do this would be to just subtract our target beer (Pliny The Elder), from every other beer in our interested list, then check out the results.

## Get metrics for our target beer
rrbcpteadii.metrics <- beer[beer_name_unique == "Russian River Brewing Company Pliny The Elder American Double / Imperial IPA",
                           .(mean_review_overall = mean(review_overall),
                             mean_review_aroma = mean(review_overall),
                             mean_review_appearance = mean(review_appearance),
                             mean_review_palate = mean(review_palate),
                             mean_reviw_taste = mean(review_taste),
                             sd_review_overall = sd(review_overall),
                             sd_review_aroma = sd(review_overall),
                             sd_review_appearance = sd(review_appearance),
                             sd_review_palate = sd(review_palate),
                             sd_review_taste = sd(review_taste))]
## Create vector for looping over
key.vector <- as.vector(rrbcpteadii.metrics)
## Pull off the tags of our search
search.vector <- bizzaro.candidates.metrics[, -c(1,2)]

## Sanity check that what we are going to subtract has same names
names(key.vector)
##  [1] "mean_review_overall"    "mean_review_aroma"      "mean_review_appearance"
##  [4] "mean_review_palate"     "mean_reviw_taste"       "sd_review_overall"     
##  [7] "sd_review_aroma"        "sd_review_appearance"   "sd_review_palate"      
## [10] "sd_review_taste"
names(search.vector)
##  [1] "mean_review_overall"    "mean_review_aroma"      "mean_review_appearance"
##  [4] "mean_review_palate"     "mean_reviw_taste"       "sd_review_overall"     
##  [7] "sd_review_aroma"        "sd_review_appearance"   "sd_review_palate"      
## [10] "sd_review_taste"
## And that the apply function I am going to run is doing what I think it will
search.vector[1]- key.vector
##    mean_review_overall mean_review_aroma mean_review_appearance
## 1:          -0.7900277        -0.7900277             -0.6386031
##    mean_review_palate mean_reviw_taste sd_review_overall sd_review_aroma
## 1:         -0.6263257       -0.8393187        0.09103239      0.09103239
##    sd_review_appearance sd_review_palate sd_review_taste
## 1:            0.1273011          0.11882       0.1073594
## Run apply function
ipa.distances <- apply(search.vector, 1, function(x) x - key.vector)
ipa.distances.dt <- data.table(do.call(rbind.data.frame,ipa.distances))
## Combine this back with vector with names
bizzaro.candidates.distances <- cbind(bizzaro.candidates.metrics, ipa.distances.dt)
## Sort our data by overall and see if we have a good match!
bizzaro.candidates.distances[order(-mean_review_overall)]
##                                                                          beer_name_unique
##   1: Hill Farmstead Brewery Galaxy Imperial Single Hop IPA American Double / Imperial IPA
##   2:           Lawson's Finest Liquids Double Sunshine IPA American Double / Imperial IPA
##   3:        Kern River Brewing Company 5th Anniversary Ale American Double / Imperial IPA
##   4:                           Alpine Beer Company Bad Boy American Double / Imperial IPA
##   5:             Iron Hill Brewery & Restaurant Kryptonite American Double / Imperial IPA
##  ---                                                                                     
## 132:                            BrewDog Sink The Bismarck! American Double / Imperial IPA
## 133:                   Blue Frog Grog & Grill The Big DIPA American Double / Imperial IPA
## 134:                            Hermitage Brewing Hoptopia American Double / Imperial IPA
## 135:                    Florida Beer Company Swamp Ape IPA American Double / Imperial IPA
## 136:            BrewDog Storm (Islay Whisky Cask Aged IPA) American Double / Imperial IPA
##      NumberOfReviews mean_review_overall mean_review_aroma
##   1:              76            4.592105          4.592105
##   2:              85            4.588235          4.588235
##   3:              41            4.475610          4.475610
##   4:              79            4.468354          4.468354
##   5:              44            4.443182          4.443182
##  ---                                                      
## 132:              76            3.197368          3.197368
## 133:              53            2.867925          2.867925
## 134:              32            2.687500          2.687500
## 135:              52            2.615385          2.615385
## 136:              92            2.440217          2.440217
##      mean_review_appearance mean_review_palate mean_reviw_taste
##   1:               4.368421           4.440789         4.559211
##   2:               4.317647           4.311765         4.552941
##   3:               4.329268           4.219512         4.500000
##   4:               4.234177           4.335443         4.474684
##   5:               4.284091           4.318182         4.420455
##  ---                                                           
## 132:               3.835526           3.401316         3.539474
## 133:               3.433962           3.047170         2.801887
## 134:               3.500000           2.890625         2.500000
## 135:               3.442308           3.009615         2.586538
## 136:               2.902174           2.836957         2.706522
##      sd_review_overall sd_review_aroma sd_review_appearance sd_review_palate
##   1:         0.3337716       0.3337716            0.3861642        0.3997258
##   2:         0.3289275       0.3289275            0.3845766        0.4293993
##   3:         0.3865103       0.3865103            0.3641730        0.4749840
##   4:         0.3698733       0.3698733            0.3741267        0.4060561
##   5:         0.3768892       0.3768892            0.3796836        0.4586495
##  ---                                                                        
## 132:         1.2438621       1.2438621            0.7136206        1.1431528
## 133:         0.8995241       0.8995241            0.6866707        0.7355280
## 134:         1.2427207       1.2427207            0.4918694        0.8005479
## 135:         0.9108033       0.9108033            0.5994593        0.7637009
## 136:         0.9829379       0.9829379            0.6843587        0.7883377
##      sd_review_taste mean_review_overall mean_review_aroma
##   1:       0.3826844         0.002077562       0.002077562
##   2:       0.3620669        -0.001792407      -0.001792407
##   3:       0.4873397        -0.114417945      -0.114417945
##   4:       0.3660140        -0.121673270      -0.121673270
##   5:       0.4026537        -0.146845883      -0.146845883
##  ---                                                      
## 132:       1.1277209        -1.392659280      -1.392659280
## 133:       0.8165336        -1.722103173      -1.722103173
## 134:       0.9418581        -1.902527701      -1.902527701
## 135:       1.0035288        -1.974643085      -1.974643085
## 136:       1.0086226        -2.149810310      -2.149810310
##      mean_review_appearance mean_review_palate mean_reviw_taste
##   1:            -0.02018203        -0.01053621      -0.07177483
##   2:            -0.07095603        -0.13956098      -0.07804418
##   3:            -0.05933479        -0.23181349      -0.13098536
##   4:            -0.15442587        -0.11588264      -0.15630181
##   5:            -0.10451218        -0.13314386      -0.21053081
##  ---                                                           
## 132:            -0.55307677        -1.05000989      -1.09151167
## 133:            -0.95464082        -1.40415587      -1.82909857
## 134:            -0.88860309        -1.56070068      -2.13098536
## 135:            -0.94629539        -1.44171030      -2.04444690
## 136:            -1.48642917        -1.61436916      -1.92446362
##      sd_review_overall sd_review_aroma sd_review_appearance sd_review_palate
##   1:       -0.12136909     -0.12136909          -0.01935577      -0.04762635
##   2:       -0.12621327     -0.12621327          -0.02094339      -0.01795284
##   3:       -0.06863039     -0.06863039          -0.04134702       0.02763182
##   4:       -0.08526747     -0.08526747          -0.03139329      -0.04129607
##   5:       -0.07825155     -0.07825155          -0.02583641       0.01129741
##  ---                                                                        
## 132:        0.78872139      0.78872139           0.30810063       0.69580063
## 133:        0.44438341      0.44438341           0.28115074       0.28817587
## 134:        0.78758001      0.78758001           0.08634938       0.35319581
## 135:        0.45566254      0.45566254           0.19393929       0.31634876
## 136:        0.52779720      0.52779720           0.27883874       0.34098558
##      sd_review_taste
##   1:     -0.03307969
##   2:     -0.05369722
##   3:      0.07157561
##   4:     -0.04975012
##   5:     -0.01311039
##  ---                
## 132:      0.71195677
## 133:      0.40076950
## 134:      0.52609404
## 135:      0.58776473
## 136:      0.59285853

Of course if I were building a real recommendation machine I could start talking about what factors are more important for what users and what factors are more predictive than others, but this seems like an OK enough solution to at least have completed my a priori goal.

Based on this solution, it looks like I will have to find myself a bottle of Pliny The Elder and the Hill Farmstead Brewery Galaxy Imperial Single Hop IPA American Double / Imperial IPA and do some of my own empirical work to see if this was a good idea.

Beer sans Booze

The last beer that I think I wanted to recommend would be one that tastes great, but does not have a lot of alcohol in it. The reason this question kind of interests me is because if we are really going to talk about how tasty a beer is, it would be nice to be able to factor out of the equation how drunk we are actually getting from it.

I can see first of all IF this relationship exists if we look at the mean overall rating of a beer as a function of its ABV content.

beer.complete[, beer_name_unique := paste(brewery_name,beer_name, beer_style) ]

# ABVs and Mean Scores
abv.vs.mean <- beer.complete[, .(Abv = mean(beer_abv), MeanOverall = mean(review_overall)), by = beer_name_unique]

ggplot(abv.vs.mean[Abv < 20], aes(x = Abv, y = MeanOverall)) + 
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  geom_smooth(method = "lm", formula = y ~ poly(x,2), color = "orange") +
  labs(title = "Rating as Function of ABV (Beers with than 20% ABV)",
       x = "ABV Content",
       y = "Mean Overall Rating")
## `geom_smooth()` using formula 'y ~ x'

Surprisingly, when I ran some quick and dirty regression models (that yes, I know violate tons of assumptions) I saw that only a very small amount of variance was being explained by its ABV. Note that although the models were significant, the R squared values hovered around 3-5%!

# "The More Booze The Better" Model
abv.linear <- lm(MeanOverall ~ Abv, data = abv.vs.mean[Abv < 20]) 
# The "Diminishing Returns Model "
abv.poly <- lm(MeanOverall ~ poly(Abv,2), data = abv.vs.mean[Abv < 20])
# The "Dissapointing Amount of Variance Explained Summaries"
summary(abv.linear)
## 
## Call:
## lm(formula = MeanOverall ~ Abv, data = abv.vs.mean[Abv < 20])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.06793 -0.27278  0.09684  0.38850  1.72081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.276138   0.008995  364.20   <2e-16 ***
## Abv         0.060975   0.001369   44.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6013 on 48822 degrees of freedom
## Multiple R-squared:  0.03906,    Adjusted R-squared:  0.03904 
## F-statistic:  1984 on 1 and 48822 DF,  p-value: < 2.2e-16
summary(abv.poly)
## 
## Call:
## lm(formula = MeanOverall ~ poly(Abv, 2), data = abv.vs.mean[Abv < 
##     20])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.88738 -0.28351  0.09695  0.37907  2.17585 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.658087   0.002706 1351.73   <2e-16 ***
## poly(Abv, 2)1  26.785156   0.597970   44.79   <2e-16 ***
## poly(Abv, 2)2 -13.922969   0.597970  -23.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.598 on 48821 degrees of freedom
## Multiple R-squared:  0.04961,    Adjusted R-squared:  0.04957 
## F-statistic:  1274 on 2 and 48821 DF,  p-value: < 2.2e-16

This actually surprised me and might be worth looking into at a deeper level another time, but for now I want to keep going on and find a beer knowing that how much booze is in it doesn’t really affect how good it is.

So let’s take one final dive into the dataset, grab only our quality reviews then plot a subset of our data so I can see beers that have a very high overall rating with a very small amount of booze in them.

# Quality Assurance Step
reliable.and.abv.beers <- reliable.beers.list[beer.complete, on = "beer_name_unique", nomatch=0]

## Get mean ratings and keep ABV (which won't change if I average it)
dd.beers <- reliable.and.abv.beers[, .(mean_overall = mean(review_overall), abv = mean(beer_abv)), by = "beer_name_unique"]

# Only Beers that Fit Our Criterion
dd.beers.2 <- dd.beers[mean_overall > 4.6 & abv < 10]

# Plot It!
ggplot(dd.beers.2, aes(x = abv, y = mean_overall, label = beer_name_unique, color = beer_name_unique)) +
  geom_point() +
  geom_text(aes(label=beer_name_unique),hjust=-.01, vjust=0) +
  labs(title = "High Quality Beers with Low ABV",
       x = "ABV",
       y = "Mean Overall Rating") + theme(legend.position = "none") +
  xlim(0, 20) +
  scale_y_continuous(breaks = c(seq(4.6,5,.1)), limits = c(4.6,4.85))

These are all OK choices (most of the beers are still above 5% ABV), but we do have one beer clocking in at 2.0% ABV giving us our final beer recommendation – the Southampton Publick House Southampton Berliner Weisse Berliner Weissbier!

Summary

After all of this, I know have three beers to check out. Pliny The Elder is our winner for the top rated beer among our Super Users of the site, the Hill Farmstead Brewery Galaxy Imperial Single Hop IPA American Double / Imperial IPA is a beer to maybe follow up on from our first choice, and then lastly we have the the Southampton Publick House Southampton Berliner Weisse Berliner Weissbier which supposedly tastes great, despite its lack of alcohol content.

David John Baker
David John Baker
Senior Research Associate

Music, Theory, Sciences.

Related