Whiprsnapr 2019, part II
If you are a normal person, all you need to know is that I was feeling cute and wanted to practice my nascent R skills. So I reprised and extended the data analysis I did last time on the Whiprsnapr beers. The resulting report is here.
If you are a data nerd, read on.
Quick thoughts on R
While I don’t have anything earth-shattering to reveal, I thought that since we’re already here, I’d take the occasion to briefly yammer about my experience with R.
R is one of those languages optimized for data analysis. Sounds nice! But how does the language feel? Easy to learn? Fun to use? And how does it compare with more general-purpose languages like Perl and JavaScript?
The language itself
R’s syntax is… definitely peculiar. And coming from someone calling himself a necrohacker, that’s saying something.
Not that it’s unexpected. R’s roots are old and come from the Stygian depths of academia. And its community went through several iterations of best practices and came up with libraries that twist and overload the original language in new ways to accommodate them (hmmm… sounds oddly familiar, somehow…). It means that, for example, depending on what you’re reading, one of the basic variable types will be a data.frame or a tibble.
It doesn’t help that most of the books and blog entries out there are more likely to give recipes on how to achieve things, rather than explain how the language is parsed. By now I know how to write statements like
    # operators sit at the end of each line so R knows the expression continues
    beers %>%
      filter(beer_ibu != 0) %>%
      ggplot() +
      geom_point(
        aes(y = beer_abv, x = beer_ibu, color = style),
        show.legend = FALSE
      ) +
      facet_wrap(~style) +
      labs(
        x = "ibu",
        y = "abv",
        title = "abv vs ibu, by style"
      )
Even better, I kind of know what it means! %>% is a pipe operator that feeds its left-hand side in as the first argument of the call on its right (a bit like currying), the + is an overloaded operator to add bits to the original ggplot() graph, and something prefixed with ~ is a formula. What exactly is a formula in the context of R? The language specs are a little vague about it, but some tutorials provide helpful hints as to the nature of that strange animal.
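For anyone who would rather poke at those three pieces than take them on faith, here is a minimal base R sketch. Everything in it is a stand-in invented for illustration: %then% is a toy pipe (magrittr's real %>% handles far more cases), toy_plot is a made-up class rather than anything from ggplot2, and the formula is simply inspected, not used to draw anything. The + part relies on S3 dispatch, which, as far as I can tell, is the same general trick ggplot2 uses.

```r
# 1. Any name wrapped in %...% is a user-definable infix operator.
#    This toy version only accepts a bare function on its right-hand side.
`%then%` <- function(lhs, fn) fn(lhs)

total <- c(1, 4, 9) %then% sqrt %then% sum   # same as sum(sqrt(c(1, 4, 9)))

# 2. "+" can be overloaded per class through S3 dispatch.
#    layer() and the toy_plot class are hypothetical helpers.
layer <- function(name) structure(list(parts = name), class = "toy_plot")

`+.toy_plot` <- function(e1, e2) {
  # "adding" two toy plots just concatenates their parts
  structure(list(parts = c(e1$parts, e2$parts)), class = "toy_plot")
}

plot_spec <- layer("points") + layer("facets") + layer("labels")

# 3. A formula is an unevaluated expression: `style` does not need to
#    exist as a variable for this line to work.
f <- ~style

print(total)
print(plot_spec$parts)
print(class(f))
print(all.vars(f))   # the variable names mentioned in the formula
```

The upshot is that none of the three is special syntax reserved for the tidyverse. They are ordinary language features that libraries lean on heavily.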
Oh yeah, and there is also a lot of closure and localized-variable business going on. And the naming of things can be less than explicit. I’m looking at you, package named DT, for data table.
What I’m trying to say is that, eventually, the language will make sense. But until it does, a lot of faith-based cut’n’pasting might need to be done.
Data munging
Raw data is invariably a mess that needs to be tidied. R has all the regular filtering/munging tools that one might need for those clean-ups. And to be fair, they are not wildly superior to the ones that a “regular” language would provide. Indeed, a few of the books I’ve read openly advocate doing the first wave of cleaning with Python/Perl/whatev scripts so that once the data enters R space, only minor tweaks are required.
In any case, for giggles, here’s part of the munging I did on the raw Untappd data, first in R:
    library(jsonlite)   # fromJSON()
    library(tidyverse)  # %>%, as_tibble(), mutate(), select(), str_replace()

    beers <- fromJSON('beers.json', flatten = TRUE) %>%
      as_tibble() %>%
      mutate(
        # drop everything from the first hyphen onward
        style  = factor(str_replace(beer_style, "\\s*-.*", "")),
        rating = round(rating_score, 1),
        # week number of created_at
        week   = created_at %>%
          as.Date(format = "%a, %d %b %Y %H:%M:%S %z") %>%
          format("%U"),
        url    = paste0("<a href='https://untappd.com/b/", beer_slug, "/", bid, "'>", beer_name, "</a>")
      ) %>%
      select(-starts_with("vintages"), -starts_with("brew"), -beer_active, -is_homebrew, -wish_list)
And here in roughly equivalent Perl:
    my $beers = file_deserialize 'beers.json';
    for ( @$beers ) {
        # drop everything from the first hyphen onward
        $_->{style}  = $_->{beer_style} =~ s/\s*-.*//r;
        # one decimal, like round(rating_score, 1) in the R version
        $_->{rating} = sprintf '%.1f', $_->{rating_score};
        $_->{week}   = parse_date($_->{created_at})->year_week;
        $_->{url}    = sprintf "<a href='https://untappd.com/b/%s/%s'>%s</a>",
            $_->{beer_slug}, $_->{bid}, $_->{beer_name};
        # hash slice: remove several keys in one go
        delete $_->@{qw/ vintage brewery beer_active is_homebrew wish_list /};
    }
Not so different. From what I’ve experienced so far, for simple mappings and filterings the R version looks nicer, but as soon as string manipulation enters the picture, well, it’s time to say hello to my old friend.
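The style and rating steps can also be tried in isolation. The sketch below is a hedged illustration, not the code behind the report: the two rows are invented, and base R's sub() stands in for stringr's str_replace() so that nothing needs to be installed.

```r
# Made-up rows standing in for the Untappd data (names and numbers are invented).
beers <- data.frame(
  beer_style   = c("IPA - American", "Stout - Imperial / Double"),
  rating_score = c(3.84, 4.27),
  stringsAsFactors = FALSE
)

# Drop everything from the first hyphen onward. Note the doubled backslash:
# an R string literal needs "\\s" to hand the regex engine a "\s".
beers$style <- factor(sub("\\s*-.*", "", beers$beer_style))

# Keep one decimal of the rating.
beers$rating <- round(beers$rating_score, 1)

print(beers[, c("style", "rating")])
```

The doubled backslash is the sort of small tax that makes regex-heavy work feel lighter in Perl, where the pattern is written as-is.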
Statistical analysis
This is R’s forte. The language is loaded with ways to create ANOVA models, regressions, and all that good statistical stuff out of data. Alas, I didn’t play with any of it for this mini-project and thus can’t speak to it. But yeah, I totally assume that this is one of the areas where R shines brightest.
Graphs
As with data munging, R has some nice shorthands for graphs. For only one graph, the difference might not be that impressive. For example, this Chart.js graph
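    // assumes lodash/fp is available as `fp` and the page has a <canvas id="chart">
    const data = {
        labels: fp.map('beer_name')(beers),
        datasets: [{
            label: 'rating',
            data: fp.map('rating_score')(beers),
        }],
    };
    // Chart.js needs a target canvas and a chart type along with the data
    const chart = new Chart(document.getElementById('chart'), { type: 'line', data });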
is roughly the equivalent of
beers %>% ggplot() + geom_point(aes(x=beer_name,y=rating))
But it becomes ludicrously awesome when we go down into hog-wild territory. Like, how about splintering that rating graph per style, with each point on the graph sized by the count of checkins, and its color varying with the beer’s ibu? And with a regression line thrown in. Because why not?
    beers %>%
      ggplot() +
      geom_point(
        aes(x = beer_name, y = rating, color = beer_ibu, size = stats.total_count)
      ) +
      facet_wrap(~style) +
      geom_smooth(aes(x = beer_name, y = rating))
Now, try to do the same thing with Chart.js. I’ll be waiting. It’s not that it’d be hard, mind you, but it’d be brain-numbingly tedious and verbose.
REPL’ing and report generation
Now, this is where R shreds.
The main thing with data exploration is that you want the shortest feedback loop possible between “I wonder what a graph of this against that, minus those observations, looks like” and “Hey peeps! Take a gander at that funky relationship!”. And the R environment (whether you use the CLI REPL, RStudio, or even the neovim interface) provides real-time munge’n’graphing in spades.
And to make the reporting pipeline even slicker, R code can trivially be wrapped in Markdown to generate (and trivially regenerate when the data change) eye-pleasing documents.
In summary
R is quaint and niche. It might not do anything beyond what can be done with a regular JavaScript/Perl/Python/etc program. But it’ll do it in a much more succinct, interactive way. If you want only to plot a quick graph, it’s not worth the learning overhead. But if you entertain taking a serious stroll in data jungles, then it may be a tool worth adding to your collection.