Hire Me (as a Data Scientist!), Part I

Background

I read Medium blog posts on “How to Become a Data Scientist” more often than I care to admit. Much of this comes from a fear that after doing all this work on the PhD and then hitting the Music Theory job market, I won’t fit the mold of the kind of theorist most schools want to hire. Not coming from one of the five schools that seem to have a monopoly on tenure-track jobs can be a bit discouraging, but I also won’t deny that a non-academic job with a regular 9-5 schedule and a decent salary is pretty tempting after spending the vast majority of my twenties in school. And even if I don’t go over to industry after the PhD, I’ll probably always be looking for a bit more work in the summer.

On top of all of that, I believe the skills acquired in a PhD (especially if you do computational musicology and music cognition!) transfer well to most jobs, and being a bit more proactive in promoting myself might be what helps me one day land a stable, non-academic job.

That said, a tweet I saw last week from Jesse Meagan linked to a really interesting LinkedIn post by Tanya Cashorali that purported to offer a one-size-fits-all data-science interview process: candidates take home a big dataset of beer reviews and answer four very broad questions. Considering myself an aficionado of How-to-Become-a-Data-Scientist articles, this of course caught my eye.

After reading the article, I figured why not give it a go? It’s the start of the semester, I’m basically ABD, I need more of a portfolio beyond my GitHub, and I have nothing to do with my Saturday morning. Why not see what I can produce in 4 or 5 hours? At the very least I’ll have something to point to if a future employer wants to see how I think through data-science problems.

And if anyone reading this has comments on my code or thought process… please let me know what you think on Twitter! I’d love some feedback!

Exploring the Dataset

The first thing I did was grab the dataset (which you can get here), then set up my R script with a few of my favorite packages (again, big love to Ben at GormAnalysis for helping me learn data.table).

#=====================================================================================#
# Beer Script
#=====================================================================================#
# Library
library(ggplot2)
library(data.table)
library(stringr)
#=====================================================================================#
beer <- fread("data/beer_reviews.csv")
#=====================================================================================#

The dataset has about 1.5 million observations across 13 different variables, so don’t try to open it in LibreOffice. The reviews come from a variety of users who have rated beers on five attributes (Appearance, Palate, Aroma, Taste, Overall), and each beer has a few other variables listed such as its ABV, the brewery it comes from, the beer’s name (duh), and what kind of beer it is.
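If you want to verify those dimensions yourself before poking at individual columns, a quick dim() call does it (output omitted here):

# Sanity check on the size of the table: rows and columns
dim(beer)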

names(beer)
##  [1] "brewery_id"         "brewery_name"       "review_time"       
##  [4] "review_overall"     "review_aroma"       "review_appearance" 
##  [7] "review_profilename" "beer_style"         "review_palate"     
## [10] "review_taste"       "beer_name"          "beer_abv"          
## [13] "beer_beerid"
beer[, .(Number = unique(beer$brewery_name))]
##                             Number
##    1:              Vecchio Birraio
##    2:      Caldera Brewing Company
##    3:       Amstel Brouwerij B. V.
##    4:        Broad Ripple Brew Pub
##    5:   Moon River Brewing Company
##   ---                             
## 5739:        Gattopardo Cervejaria
## 5740:         Brauerei Lasser GmbH
## 5741:        Wissey Valley Brewery
## 5742:      Outback Brewery Pty Ltd
## 5743: Georg Meinel Bierbrauerei KG
beer[, .(Number = unique(beer$review_profilename))]
##                Number
##     1:        stcules
##     2: johnmichaelsen
##     3:        oline73
##     4:      Reidrover
##     5:   alpinebryant
##    ---               
## 33384:     jennaizzel
## 33385:    mine2design
## 33386:       hogshead
## 33387:     NyackNicky
## 33388:        joeebbs
beer[, .(Number = unique(beer$beer_name))]
##                        Number
##     1:           Sausa Weizen
##     2:               Red Moon
##     3: Black Horse Black Beer
##     4:             Sausa Pils
##     5:          Cauldron DIPA
##    ---                       
## 56853:      Bear Mountain Ale
## 56854:        Highland Porter
## 56855:       Baron Von Weizen
## 56856:          Resolution #2
## 56857:     The Horseman's Ale
beer[, .(Number = unique(beer$beer_style))]
##                               Number
##   1:                      Hefeweizen
##   2:              English Strong Ale
##   3:          Foreign / Export Stout
##   4:                 German Pilsener
##   5:  American Double / Imperial IPA
##  ---                                
## 100:                          Gueuze
## 101:                            Gose
## 102:                        Happoshu
## 103:                           Sahti
## 104: Bière de Champagne / Bière Brut

From a bird’s eye view we have 56,857 unique beers in 104 different categories from 5,743 different breweries and 33,388 unique beer aficionados who have gone out of their way to tell this website what they think about the beers they drink.
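In hindsight, data.table’s uniqueN() gets all four of those counts in a single call instead of four long printouts; here’s a minimal sketch (the numbers should match the ones above):

# One-line summary of unique breweries, reviewers, beers, and styles
beer[, .(Breweries = uniqueN(brewery_name),
         Reviewers = uniqueN(review_profilename),
         Beers     = uniqueN(beer_name),
         Styles    = uniqueN(beer_style))]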

Before diving in further, it’s worth doing a preliminary check of the quality of the data (i.e., we should know whether this is BAD (Best Available Data) or has undergone a fair deal of cleaning). As someone who comes from more of a psychology background, I’ve noticed that what people consider “clean” data varies a lot.

The first thing I check is whether any data are missing and, if so, whether the missingness is due to chance or to some sort of systematic variation.

table(complete.cases(beer))
## 
##   FALSE    TRUE 
##   67785 1518829
67785/1518829
## [1] 0.04462978

So about 4% of our rows are missing at least one entry, which probably isn’t much cause for concern unless we start asking questions about specific beers. Looking into this a bit further, it seems the incomplete rows are just beers missing their ABV; anyone who has rated a beer has rated it on all five attributes. And although only 4% of the ratings lack an ABV, comparing against the list of unique beers shows we’re actually missing the ABV for ~25% of all the beers. That could be a problem later, but it’s better to know about it sooner rather than later.
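To double-check that claim (that the gaps are essentially confined to the ABV column), a quick per-column count of missing values is enough; a minimal sketch:

# Count NAs in every column; beer_abv should account for (nearly) all of them
beer[, lapply(.SD, function(x) sum(is.na(x)))]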

beer.complete <- beer[complete.cases(beer)]

beer[!complete.cases(beer)][, .(.N = unique(beer_name))]
##                             .N
##     1: Cauldron Espresso Stout
##     2:    The Highland Stagger
##     3:              Alpha Beta
##     4:          Imperial Stout
##     5:               Megalodon
##    ---                        
## 14106:       English Nut Brown
## 14107:              Hop Common
## 14108:     Very Hoppy Pale Ale
## 14109:       Prohibition Lager
## 14110:           Resolution #2
14110/56857
## [1] 0.2481665
beer[969]
##    brewery_id             brewery_name review_time review_overall review_aroma
## 1:      12770 City Grille and Brewhaus  1145738954              4            3
##    review_appearance review_profilename              beer_style review_palate
## 1:                 4         UncleJimbo American Pale Ale (APA)           3.5
##    review_taste     beer_name beer_abv beer_beerid
## 1:            4 City Pale Ale       NA       30088

More problems might come up here or there, but let’s move on to the first question.

Which brewery produces the strongest beers by ABV%?

Answering the first question on the list is pretty straightforward: grab all of the observations that have an ABV associated with their rating, then take the average ABV of all the beers each brewery produces.

# Use the object from before that only contains ratings with ABVs
beer.complete[, .(AvgABV = mean(beer_abv)), by = brewery_name]
##                       brewery_name   AvgABV
##    1:              Vecchio Birraio 5.675000
##    2:      Caldera Brewing Company 6.168849
##    3:       Amstel Brouwerij B. V. 3.816373
##    4:        Broad Ripple Brew Pub 6.006202
##    5:   Moon River Brewing Company 5.724103
##   ---                                      
## 5152:        Gattopardo Cervejaria 6.033333
## 5153:         Brauerei Lasser GmbH 5.200000
## 5154:        Wissey Valley Brewery 5.133333
## 5155:      Outback Brewery Pty Ltd 4.787879
## 5156: Georg Meinel Bierbrauerei KG 5.850000
# Create table that has means and standard deviations of beers by brewery 
# Order them from most to least
abv.counter <- beer.complete[, .(AvgABV = mean(beer_abv), 
                                 SdABV = sd(beer_abv)) , 
                             by = brewery_name][order(-AvgABV)]
abv.counter
##                              brewery_name     AvgABV      SdABV
##    1:                        Schorschbräu 19.2288235 12.3273042
##    2:                       Shoes Brewery 15.2000000  0.0000000
##    3:                Rome Brewing Company 13.8400000  1.9718012
##    4:                   Hurlimann Brewery 13.7500000  0.5752237
##    5:            Alt-Oberurseler Brauhaus 13.2000000         NA
##   ---                                                          
## 5152:             Cerveceria Vegana, S.A.  2.2608696  2.2455490
## 5153: Moskovskaya Pivovarennaya Kompaniya  2.1500000  1.6881943
## 5154:                      Fentimans Ltd.  1.3750000  1.6201852
## 5155:                        Borodino ZAO  0.9666667  0.4041452
## 5156:                    All Stars Bakery  0.5000000  0.0000000

Having this table might be good enough for government work, but looking at the output there are clear problems. One thing to keep asking of this table (and pretty much this whole dataset) is “Is this data point a good representation of what I am trying to measure?”. Note, for example, the huge variability (as measured by the standard deviation) in our top answer, and the fact that some of the SDs are NA or exactly 0. Given that, I think it’d be good to impose some sort of threshold to up the quality of our answers. One way to do this is to count exactly how many beers each brewery makes and use that as a proxy for how big the brewery is.

The code below does just that and reveals the variability in terms of size of breweries within this dataset.

# Create table that counts the number of beers each brewery makes
NoOfBeers <- beer.complete[, .(NameOfBeer = unique(beer_name)), 
                           by = brewery_name][, .(.N), by = brewery_name]
NoOfBeers
##                       brewery_name  N
##    1:              Vecchio Birraio  4
##    2:      Caldera Brewing Company 25
##    3:       Amstel Brouwerij B. V.  9
##    4:        Broad Ripple Brew Pub 40
##    5:   Moon River Brewing Company 34
##   ---                                
## 5152:        Gattopardo Cervejaria  3
## 5153:         Brauerei Lasser GmbH  1
## 5154:        Wissey Valley Brewery  3
## 5155:      Outback Brewery Pty Ltd  6
## 5156: Georg Meinel Bierbrauerei KG  2
# Join the beer counts with the ABV summary table, by brewery
abv.table <- NoOfBeers[abv.counter, on = "brewery_name"]
abv.table
##                              brewery_name  N     AvgABV      SdABV
##    1:                        Schorschbräu 10 19.2288235 12.3273042
##    2:                       Shoes Brewery  1 15.2000000  0.0000000
##    3:                Rome Brewing Company  2 13.8400000  1.9718012
##    4:                   Hurlimann Brewery  3 13.7500000  0.5752237
##    5:            Alt-Oberurseler Brauhaus  1 13.2000000         NA
##   ---                                                             
## 5152:             Cerveceria Vegana, S.A.  2  2.2608696  2.2455490
## 5153: Moskovskaya Pivovarennaya Kompaniya  2  2.1500000  1.6881943
## 5154:                      Fentimans Ltd.  3  1.3750000  1.6201852
## 5155:                        Borodino ZAO  2  0.9666667  0.4041452
## 5156:                    All Stars Bakery  1  0.5000000  0.0000000
# Create z scores 
abv.table[, zAvgABV := scale(AvgABV)]

After visually inspecting the graph of brewery sizes (below), I figured I could restrict the analysis to breweries that make at least five beers (which hopefully wipes out your hipster friend’s “micro brewery” in his basement where he is just trying to make the most potent IPA ever), and then only keep breweries whose average ABV sits more than 4 standard deviations above the mean of all breweries’ averages, to narrow down possible candidates.

# How many beers to count for a big brewery? 
hist(NoOfBeers$N, breaks= 200, 
     main = "Distribution of Size of Breweries", 
     xlab = "Number of Beers Produced by a Brewery")

NoOfBeers[N > 200] # Clearly some big breweries here! 
##                        brewery_name   N
## 1:    Minneapolis Town Hall Brewery 243
## 2:            Goose Island Beer Co. 304
## 3:   Iron Hill Brewery & Restaurant 269
## 4: Rock Bottom Restaurant & Brewery 522
abv.table[N >= 5, ][order(-AvgABV)]
##                                              brewery_name   N    AvgABV
##    1:                                        Schorschbräu  10 19.228824
##    2: Brasserie Grain d' Orge (Brasserie Jeanne d'Arc SA)  10 12.445860
##    3:                          Brauerei Schloss Eggenberg  14 11.779681
##    4:                     Brasserie Dubuisson Frères sprl  14 11.432746
##    5:                            Kuhnhenn Brewing Company 142 11.345839
##   ---                                                                  
## 2539:                             Berliner Kindl Brauerei  12  3.532627
## 2540:  Yanjing Pijiu (Guilin Liquan) Gufen Youxian Gongsi   5  3.440000
## 2541:                                            Ochakovo  16  3.203150
## 2542:                        Grogg's Pinnacle Brewing Co.   6  3.200000
## 2543:                                        Deka Brewery   7  2.620000
##            SdABV   zAvgABV
##    1: 12.3273042 10.235332
##    2:  1.7054879  5.048898
##    3:  3.0759353  4.539520
##    4:  1.6583471  4.274244
##    5:  3.5788003  4.207793
##   ---                     
## 2539:  1.0372607 -1.766396
## 2540:  0.6426508 -1.837221
## 2541:  1.2735986 -2.018323
## 2542:  0.0000000 -2.020731
## 2543:  1.6356553 -2.464215
hist(abv.table[N >= 5, ][order(-AvgABV)]$zAvgABV,
     xlab = "z Score of ABV", 
     main = "Distribution of ABV in Breweries that make more than 5 Beers")

abv.table[N >= 5 & zAvgABV > 4, ][order(-AvgABV)]
##                                           brewery_name   N   AvgABV     SdABV
## 1:                                        Schorschbräu  10 19.22882 12.327304
## 2: Brasserie Grain d' Orge (Brasserie Jeanne d'Arc SA)  10 12.44586  1.705488
## 3:                          Brauerei Schloss Eggenberg  14 11.77968  3.075935
## 4:                     Brasserie Dubuisson Frères sprl  14 11.43275  1.658347
## 5:                            Kuhnhenn Brewing Company 142 11.34584  3.578800
##      zAvgABV
## 1: 10.235332
## 2:  5.048898
## 3:  4.539520
## 4:  4.274244
## 5:  4.207793

Doing it this way puts Schorschbräu as the highest-ABV brewery, which makes sense because they claim to make the world’s strongest beer. Making a quick plot of the data for our winner and the second-place finisher, we see just how strong Schorschbräu really is.

schor.abv <- beer.complete[brewery_name == "Schorschbräu", 
                           .(beer_name = unique(beer_name)), by = beer_abv]

ggplot(schor.abv, aes(x = beer_name, y = beer_abv)) + 
  geom_bar(stat = "identity")  + 
  labs( title = "Schorschbräu Beer ABV", x = "Beer Name", y = "ABV") +
   theme(axis.text.x=element_text(angle = -90, hjust = 0)) 

brassOrg.abv <- beer.complete[brewery_name == "Brasserie Grain d' Orge (Brasserie Jeanne d'Arc SA)", 
                              .(beer_name = unique(beer_name)), by = beer_abv]

ggplot(brassOrg.abv, aes(x = beer_name, y = beer_abv)) + 
  geom_bar(stat = "identity")  + 
  labs( title = "Brasserie Grain d' Orge Beer ABV", 
        x = "Beer Name", 
        y = "ABV") +
   theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ylim(0, 60) 

I think calling Schorschbräu the winner is technically correct here, but after sharing my findings with my one beer-drinking friend, he pointed out one thing this analysis didn’t take into account: styles that are traditionally brewed to a higher ABV (like IPAs and Belgian ales) might skew the results. If you are a big IPA brewery, you are going to have a higher average ABV simply because of the beers you decide to brew!

So in the true spirit of the data-science Venn diagram, which notes that data scientists need to be flexible in incorporating others’ domain knowledge, I ran another analysis just to show how much an answer can change depending on how you operationalize the question!

Let’s do another one!

First up for this one is a plot of the data to see how much ABV actually varies from style to style.

# Get mean and SD of each beer type 
abv.by.type <- beer.complete[ , .(MeanAbvType = mean(beer_abv), 
                                  SdAbvType = sd(beer_abv)), 
                              by = beer_style]

# For Graphing, order, set style as factor 
ordered.abv.by.type <- abv.by.type[order(-MeanAbvType)]
ordered.abv.by.type$beer_style <- factor(ordered.abv.by.type$beer_style, levels = ordered.abv.by.type$beer_style)

# Code for the plot; commented out because blogdown crunches the images
# Average ABV by beer type
# ggplot(ordered.abv.by.type, aes(x = beer_style, y = MeanAbvType)) + 
#  geom_bar(stat="identity") +   coord_flip() +
#  labs(title = "Average ABV by Type of Beer",
#       x = "Beer Style",
#       y = "Mean ABV, bars represent SD") +
#  geom_errorbar(aes(ymin=MeanAbvType-SdAbvType, ymax=MeanAbvType+SdAbvType)) +
#  theme_bw()

So you can see that if you wanted a higher average ABV for your brewery, you’d benefit from brewing more IPAs, Barley Wines, and Belgian Stouts (the table below makes the same point without the plot).
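Since the plot code is commented out, printing the top of the ordered table tells the same story numerically; a quick sketch (output omitted):

# Ten strongest beer styles by average ABV (the same data the plot would show)
ordered.abv.by.type[1:10]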

Now, with an average ABV for each style, let’s match that back to our big list, compute how each beer fares against its own category, average those z scores by brewery, and then combine that with our earlier proxy for brewery size. For the purposes of this example, let’s only look at breweries that have over 100 beers in the database and look at the top 25.

# Combine ABV per type data with complete data
beer.complete.avg.abv <- abv.by.type[beer.complete, on = "beer_style"]
# Make new z score variable based on other beers in group
beer.complete.avg.abv[, zABV := (beer_abv-MeanAbvType)/SdAbvType]  
# Get averages per brewery on z variable
zAvgBeers <- beer.complete.avg.abv[, .(AvgAbvZ = mean(zABV)), by = brewery_name][order(-AvgAbvZ)]
# Combine back with our data on proxy of size of brewery
BreweryAndAvgAbv <- zAvgBeers[NoOfBeers, on = "brewery_name"]
# And the winner is...
BreweryAndAvgAbv[N > 100][order(-AvgAbvZ)][1:25]
##                               brewery_name     AvgAbvZ   N
##  1:               Kuhnhenn Brewing Company  1.26003627 142
##  2:                     Cigar City Brewing  0.87721989 171
##  3:                             The Bruery  0.74965237 143
##  4:     Three Floyds Brewing Co. & Brewpub  0.71875251 128
##  5: Flossmoor Station Restaurant & Brewery  0.53182569 114
##  6:                     Brouwerij De Molen  0.51695738 119
##  7:               Jackie O's Pub & Brewery  0.40179915 105
##  8:                          Mikkeller ApS  0.37012567 184
##  9:               Founders Brewing Company  0.24770929 130
## 10:                      Deschutes Brewery  0.19390121 127
## 11:      Port Brewing Company / Pizza Port  0.19357914 194
## 12:                     Fitger's Brewhouse  0.18931699 111
## 13:                       Bullfrog Brewery  0.10590796 121
## 14:                Victory Brewing Company  0.09970522 107
## 15:         Iron Hill Brewery & Restaurant  0.03563186 269
## 16:                  Goose Island Beer Co.  0.01475446 304
## 17:                Sly Fox Brewing Company -0.01733499 140
## 18:              Sierra Nevada Brewing Co. -0.07417651 121
## 19:                      Stone Brewing Co. -0.08646265 119
## 20:              Cambridge Brewing Company -0.12326285 124
## 21:       Rock Bottom Restaurant & Brewery -0.26029542 522
## 22:                Willimantic Brewing Co. -0.28003004 134
## 23:     John Harvard's Brewery & Ale House -0.29598861 151
## 24:          Minneapolis Town Hall Brewery -0.41963873 243
## 25:                                   <NA>          NA  NA
##                               brewery_name     AvgAbvZ   N

Looking at this list, we get a totally different answer: on average, Kuhnhenn Brewing Company brews its beers 1.26 standard deviations stronger than the typical beer of the same style!
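As a quick sanity check on how far apart the two operationalizations are, it’s worth looking up each winner in the other table; a minimal sketch using the objects built above (output omitted):

# Kuhnhenn under the raw-ABV operationalization
abv.table[brewery_name == "Kuhnhenn Brewing Company"]
# Schorschbräu under the style-adjusted one
BreweryAndAvgAbv[brewery_name == "Schorschbräu"]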

Both answers could be technically correct, but more importantly they demonstrate how important it is to decide how you’re framing the question first and only then try to answer it, so you don’t end up going on a fishing expedition!

Moving on to question #2!
