One Hour EDA - fitzRoy R Package

Today I take a closer look at the info available in the fitzRoy package.

This is a great example of what is happening in the R community these days: the package really takes the difficulty out of scraping the web for data.

I will limit myself to one hour. This will cover all the poking and prodding of the data and the writing and editing of this post.

My goal is to look a bit more deeply at the data in the package to spur my imagination for future, more in-depth reports.

My time starts now…

Packages

Here is my setup (install before calling library).

# Packages
library(tidyverse) # For everything
## Warning: package 'dplyr' was built under R version 3.5.1
library(here) # For project-oriented workflow
library(ggthemes) # For nice plot themes
library(devtools) # For non-CRAN packages
library(fitzRoy) # Data source
library(GGally) # Correlation plot
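
The setup above only works if every package is already installed. Here is a minimal sketch of checking for that first. The helper name `missing_packages` is my own invention, and the GitHub location in the commented-out line is an assumption about where fitzRoy is hosted, so treat it as a starting point only.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("tidyverse", "here", "ggthemes", "devtools", "fitzRoy", "GGally")
to_install <- missing_packages(needed)
print(to_install)

# CRAN packages could then be installed with install.packages(), and fitzRoy
# (assumed here to live at jimmyday12/fitzRoy on GitHub) with devtools:
# install.packages(setdiff(to_install, "fitzRoy"))
# devtools::install_github("jimmyday12/fitzRoy")
```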

Explore

I’m following along from the documentation.

Match Results

I’ve used this before; it’s the first suggested data set. The stats are basic, but it covers every game since 1897!

d <- get_match_results()

Use glimpse to get a feel for the data:

glimpse(d)
## Observations: 15,380
## Variables: 16
## $ Game         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ Date         <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08, ...
## $ Round        <chr> "R1", "R1", "R1", "R1", "R2", "R2", "R2", "R2", "...
## $ Home.Team    <chr> "Fitzroy", "Collingwood", "Geelong", "Sydney", "S...
## $ Home.Goals   <int> 6, 5, 3, 3, 6, 4, 3, 9, 6, 5, 12, 8, 5, 5, 2, 11,...
## $ Home.Behinds <int> 13, 11, 6, 9, 4, 6, 8, 10, 5, 9, 6, 11, 14, 11, 8...
## $ Home.Points  <int> 49, 41, 24, 27, 40, 30, 26, 64, 41, 39, 78, 59, 4...
## $ Away.Team    <chr> "Carlton", "St Kilda", "Essendon", "Melbourne", "...
## $ Away.Goals   <int> 2, 2, 7, 6, 5, 8, 10, 3, 5, 7, 6, 0, 3, 5, 6, 7, ...
## $ Away.Behinds <int> 4, 4, 5, 8, 6, 2, 6, 1, 7, 8, 5, 2, 4, 3, 6, 4, 8...
## $ Away.Points  <int> 16, 16, 47, 44, 36, 50, 66, 19, 37, 50, 41, 2, 22...
## $ Venue        <chr> "Brunswick St", "Victoria Park", "Corio Oval", "L...
## $ Margin       <int> 33, 25, -23, -17, 4, -20, -40, 45, 4, -11, 37, 57...
## $ Season       <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1...
## $ Round.Type   <chr> "Regular", "Regular", "Regular", "Regular", "Regu...
## $ Round.Number <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5...
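
A quick sanity check on how the score columns relate to each other. This is a small sketch that uses only the first four games, with values copied by hand from the glimpse output; the data frame name `first_games` is mine. It assumes the usual scoring of 6 points per goal and 1 per behind, and that Margin is home points minus away points.

```r
# First four games of 1897, values copied from the glimpse() output.
first_games <- data.frame(
  Home.Goals   = c(6, 5, 3, 3),
  Home.Behinds = c(13, 11, 6, 9),
  Home.Points  = c(49, 41, 24, 27),
  Away.Goals   = c(2, 2, 7, 6),
  Away.Behinds = c(4, 4, 5, 8),
  Away.Points  = c(16, 16, 47, 44),
  Margin       = c(33, 25, -23, -17)
)

# A goal is worth 6 points and a behind 1 point.
home_points <- 6 * first_games$Home.Goals + first_games$Home.Behinds
away_points <- 6 * first_games$Away.Goals + first_games$Away.Behinds

# Margin is from the home team's point of view.
all(home_points == first_games$Home.Points)
all(away_points == first_games$Away.Points)
all(home_points - away_points == first_games$Margin)
```

All three checks hold for these rows, so a negative Margin means the away team won.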

Example visual:

ggplot(data = d, aes(x = Date, y = Margin)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  labs(title = "Margin trend in AFL",
       x = "Date",
       y = "Margin (pts)",
       caption = expression(paste(italic("Source: AFL games 1897 to 2018 c/o fitzRoy package"))))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Tidy Match Results

There is a helpful function that reshapes the results data into a longer format. I could have used this previously when calculating the Home/Away difference in goal-kicking accuracy.

d_long <- convert_results(d)
glimpse(d_long)
## Observations: 30,760
## Variables: 13
## $ Game         <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9...
## $ Date         <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08, ...
## $ Round        <chr> "R1", "R1", "R1", "R1", "R1", "R1", "R1", "R1", "...
## $ Venue        <chr> "Brunswick St", "Brunswick St", "Victoria Park", ...
## $ Margin       <dbl> 33, -33, 25, -25, -23, 23, -17, 17, 4, -4, -20, 2...
## $ Season       <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1...
## $ Round.Type   <chr> "Regular", "Regular", "Regular", "Regular", "Regu...
## $ Round.Number <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3...
## $ Status       <chr> "Home", "Away", "Home", "Away", "Home", "Away", "...
## $ Behinds      <chr> "13", "4", "11", "4", "6", "5", "9", "8", "4", "6...
## $ Goals        <chr> "6", "2", "5", "2", "3", "7", "3", "6", "6", "5",...
## $ Points       <chr> "49", "16", "41", "16", "24", "47", "27", "44", "...
## $ Team         <chr> "Fitzroy", "Carlton", "Collingwood", "St Kilda", ...
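
Note that Goals, Behinds and Points come back as character columns in the long format, so they need converting before any arithmetic. A minimal sketch of that conversion plus a Home/Away accuracy calculation, using only the first four rows copied from the glimpse output. The name `long_sample` is mine, and I am assuming accuracy means goals as a share of all scoring shots.

```r
# First four rows of the long data, values copied from the glimpse() output.
long_sample <- data.frame(
  Team    = c("Fitzroy", "Carlton", "Collingwood", "St Kilda"),
  Status  = c("Home", "Away", "Home", "Away"),
  Goals   = c("6", "2", "5", "2"),
  Behinds = c("13", "4", "11", "4"),
  Points  = c("49", "16", "41", "16"),
  stringsAsFactors = FALSE
)

# Convert the character score columns to integers.
score_cols <- c("Goals", "Behinds", "Points")
long_sample[score_cols] <- lapply(long_sample[score_cols], as.integer)

# Accuracy = goals as a share of all scoring shots, summed by Home/Away.
totals <- aggregate(cbind(Goals, Behinds) ~ Status, data = long_sample, FUN = sum)
totals$Accuracy <- totals$Goals / (totals$Goals + totals$Behinds)
totals
```

Four rows say nothing about the real Home/Away gap, of course; the same steps would need to run on the full d_long.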

Example visual:

ggplot(data = d_long, aes(x = Date, y = Margin, col = Round.Type)) +
  geom_point(alpha = 0.2) +
  geom_boxplot() +
  facet_grid(. ~ Round.Type) +
  labs(title = "Margin trend in AFL",
       subtitle = "Tighter in finals (slightly)",
       x = "Date",
       y = "Margin (pts)",
       caption = expression(paste(italic("Source: AFL games 1897 to 2018 c/o fitzRoy package"))))
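
One thing to watch: in the long format every game appears twice with margins of opposite sign (33 and -33, 25 and -25), so the distribution of Margin is symmetric around zero for every round type. Comparing the absolute margin may give a clearer picture of how tight games are. A minimal sketch on a toy data frame: the Regular rows come from the glimpse output, but the Finals rows are invented purely to make the example run.

```r
toy_long <- data.frame(
  Round.Type = c("Regular", "Regular", "Regular", "Regular",
                 "Finals", "Finals"),
  Margin     = c(33, -33, 25, -25, 10, -10)  # Finals rows are invented
)

# The sign only says which side won; the size says how close the game was.
toy_long$Abs.Margin <- abs(toy_long$Margin)

# Median absolute margin per round type.
abs_by_type <- aggregate(Abs.Margin ~ Round.Type, data = toy_long, FUN = median)
abs_by_type
```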

Fixture

There is a function to ‘get fixture’. I admit I don’t know what this means, but on inspection it looks like a list of all the games to be played in the current year (excluding finals).

d_fixture <- get_fixture()
glimpse(d_fixture)
## Observations: 198
## Variables: 7
## $ Date        <dttm> 2018-03-22 19:25:00, 2018-03-23 19:50:00, 2018-03...
## $ Season      <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 20...
## $ Season.Game <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Round       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Home.Team   <chr> "Richmond", "Essendon", "St Kilda", "Port Adelaide...
## $ Away.Team   <chr> "Carlton", "Adelaide", "Brisbane Lions", "Fremantl...
## $ Venue       <chr> "MCG", "Etihad Stadium", "Etihad Stadium", "Adelai...

Example visual:

# Table showing number of games by venue
venue_games <- d_fixture %>%
  group_by(Venue) %>%
  summarise(games = n())

# Plot
ggplot(data = venue_games, aes(x = reorder(Venue, games), y = games)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Where AFL will be played in 2018",
       x = "Venue",
       y = "Number of games",
       caption = expression(paste(italic("Source: AFL 2018 fixture c/o fitzRoy package"))))

Footywire

Footywire stats are more detailed and go back to 2010. There are lots of stats here, and I don’t understand all of them! I will start with a correlation plot to see what might be related.

d_fwire <- update_footywire_stats()
## Getting match ID's...
## Downloading new data for 18 matches...
## 
## Checking Github
## Getting data from footywire.com
## Finished getting data
glimpse(d_fwire)
## Observations: 79,332
## Variables: 43
## $ Date           <date> 2010-03-25, 2010-03-25, 2010-03-25, 2010-03-25...
## $ Season         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,...
## $ Round          <chr> "Round 1", "Round 1", "Round 1", "Round 1", "Ro...
## $ Venue          <chr> "MCG", "MCG", "MCG", "MCG", "MCG", "MCG", "MCG"...
## $ Player         <chr> "Daniel Connors", "Daniel Jackson", "Brett Dele...
## $ Team           <chr> "Richmond", "Richmond", "Richmond", "Richmond",...
## $ Opposition     <chr> "Carlton", "Carlton", "Carlton", "Carlton", "Ca...
## $ Status         <chr> "Home", "Home", "Home", "Home", "Home", "Home",...
## $ Match_id       <dbl> 5089, 5089, 5089, 5089, 5089, 5089, 5089, 5089,...
## $ CP             <int> 8, 11, 7, 9, 8, 6, 7, 7, 6, 7, 8, 6, 1, 6, 2, 2...
## $ UP             <int> 15, 10, 14, 10, 10, 12, 10, 6, 7, 5, 4, 6, 7, 6...
## $ ED             <int> 16, 14, 16, 11, 13, 16, 13, 7, 10, 7, 9, 6, 6, ...
## $ DE             <dbl> 66.7, 60.9, 76.2, 57.9, 68.4, 88.9, 76.5, 50.0,...
## $ CM             <int> 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,...
## $ GA             <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MI5            <int> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 2, 0, 0, 0, 0,...
## $ One.Percenters <int> 1, 0, 0, 0, 0, 1, 5, 2, 5, 1, 0, 1, 2, 1, 0, 6,...
## $ BO             <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOG            <int> 69, 80, 89, 69, 77, 81, 84, 80, 100, 88, 81, 92...
## $ K              <int> 14, 11, 12, 13, 11, 5, 7, 9, 6, 7, 7, 11, 7, 4,...
## $ HB             <int> 10, 12, 9, 6, 8, 13, 10, 5, 7, 6, 5, 1, 4, 7, 3...
## $ D              <int> 24, 23, 21, 19, 19, 18, 17, 14, 13, 13, 12, 12,...
## $ M              <int> 3, 2, 5, 1, 6, 4, 2, 3, 4, 2, 3, 5, 2, 3, 1, 3,...
## $ G              <int> 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0,...
## $ B              <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0, 0,...
## $ T              <int> 1, 5, 6, 1, 1, 3, 2, 5, 4, 4, 9, 3, 1, 4, 0, 2,...
## $ HO             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ GA1            <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ I50            <int> 2, 8, 4, 1, 2, 2, 1, 5, 0, 3, 0, 6, 1, 0, 0, 1,...
## $ CL             <int> 2, 5, 3, 2, 3, 3, 4, 4, 1, 1, 2, 0, 0, 1, 0, 0,...
## $ CG             <int> 4, 4, 4, 3, 3, 1, 2, 0, 2, 0, 1, 2, 4, 4, 0, 2,...
## $ R50            <int> 6, 1, 3, 4, 2, 0, 2, 0, 3, 1, 2, 0, 2, 0, 4, 3,...
## $ FF             <int> 2, 2, 1, 1, 0, 0, 1, 4, 1, 1, 2, 1, 0, 1, 0, 1,...
## $ FA             <int> 0, 0, 2, 0, 2, 1, 0, 0, 0, 0, 1, 2, 1, 3, 0, 1,...
## $ AF             <int> 77, 85, 94, 65, 65, 62, 56, 77, 61, 56, 76, 71,...
## $ SC             <int> 85, 89, 93, 70, 63, 72, 79, 73, 68, 59, 94, 68,...
## $ CCL            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ SCL            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ SI             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ MG             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ TO             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ITC            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ T5             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

Example visual:

ggcorr(d_fwire,
       method = c("pairwise", "pearson"),
       nbreaks = 9,
       hjust = .75,
       vjust = 0.5,
       layout.exp = 2) +
  labs(title = "Correlation of Footywire dataset numeric variables")
## Warning in ggcorr(d_fwire, method = c("pairwise", "pearson"), nbreaks
## = 9, : data in column(s) 'Date', 'Round', 'Venue', 'Player', 'Team',
## 'Opposition', 'Status' are not numeric and were ignored
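
To see what the method = c("pairwise", "pearson") argument is doing, here is a rough base R equivalent on a toy data frame. The K, HB and D values are the first five from the glimpse output; the CCL values and player labels are invented so that the NA handling is visible. Pairwise handling matters because columns such as CCL are NA for the early seasons.

```r
toy_stats <- data.frame(
  K      = c(14, 11, 12, 13, 11),
  HB     = c(10, 12, 9, 6, 8),
  D      = c(24, 23, 21, 19, 19),
  CCL    = c(NA, NA, 1, 2, 3),          # invented values to show NA handling
  Player = c("A", "B", "C", "D", "E"),  # placeholder labels
  stringsAsFactors = FALSE
)

# Keep numeric columns only, as ggcorr does when it ignores non-numeric ones.
numeric_only <- toy_stats[vapply(toy_stats, is.numeric, logical(1))]

# Pearson correlation; each pair of columns uses the rows where both are present.
corr <- cor(numeric_only, use = "pairwise.complete.obs", method = "pearson")
round(corr, 2)
```

On the real data it may also be worth dropping identifier-like numeric columns such as Season and Match_id before plotting, since their correlations with match stats are unlikely to mean much.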

Weather

This is awesome. There is weather data for 2017!

d_weather <- fitzRoy::results_weather %>%
  filter(Season == 2017)
glimpse(d_weather)
## Observations: 207
## Variables: 19
## $ StationNo    <int> 86038, 86038, 40764, 66062, 86038, 86038, 9225, 2...
## $ date         <date> 2017-03-23, 2017-03-24, 2017-03-25, 2017-03-25, ...
## $ Venue        <chr> "M.C.G.", "M.C.G.", "Carrara", "S.C.G.", "M.C.G."...
## $ Game         <dbl> 14994, 14995, 14999, 14997, 14998, 14996, 15002, ...
## $ Round        <chr> "R1", "R1", "R1", "R1", "R1", "R1", "R1", "R1", "...
## $ Home.Team    <chr> "Carlton", "Collingwood", "Gold Coast", "Sydney",...
## $ Home.Goals   <int> 14, 12, 14, 12, 17, 13, 10, 22, 13, 14, 16, 18, 1...
## $ Home.Behinds <int> 5, 14, 12, 10, 14, 12, 13, 15, 15, 15, 14, 8, 12,...
## $ Home.Points  <int> 89, 86, 96, 82, 116, 90, 73, 147, 93, 99, 110, 11...
## $ Away.Team    <chr> "Richmond", "Footscray", "Brisbane Lions", "Port ...
## $ Away.Goals   <int> 20, 15, 15, 17, 12, 18, 18, 14, 21, 11, 13, 13, 1...
## $ Away.Behinds <int> 12, 10, 8, 8, 19, 12, 7, 7, 10, 14, 9, 19, 9, 10,...
## $ Away.Points  <int> 132, 100, 98, 110, 91, 120, 115, 91, 136, 80, 87,...
## $ Margin       <int> -43, -14, -2, -28, 25, -30, -42, 56, -43, 19, 23,...
## $ Season       <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2...
## $ Round.Type   <chr> "Regular", "Regular", "Regular", "Regular", "Regu...
## $ Round.Number <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Description  <chr> "86038 ESSENDON AIRPORT                         -...
## $ Rainfall     <dbl> 0.0, 0.0, 6.6, 4.8, 0.0, 0.0, 2.2, 0.0, 3.8, 4.2,...

Example visual:

ggplot(data = d_weather, aes(x = Margin, y = Rainfall, col = Rainfall)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  labs(title = "Does higher rainfall lead to tighter games?",
       subtitle = "Maybe??",
       x = "Margin",
       y = "Rainfall",
       caption = expression(paste(italic("Source: AFL 2017 + weather c/o fitzRoy package"))))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
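
Since the sign of Margin only says whether the home or away team won, a tighter-games question is probably better asked of the absolute margin. A minimal sketch using the first ten 2017 rows copied from the glimpse output. The name `first_ten` and the choice of "any rainfall at all" as the definition of a wet game are my own assumptions.

```r
# First ten 2017 games, values copied from the glimpse() output.
first_ten <- data.frame(
  Margin   = c(-43, -14, -2, -28, 25, -30, -42, 56, -43, 19),
  Rainfall = c(0.0, 0.0, 6.6, 4.8, 0.0, 0.0, 2.2, 0.0, 3.8, 4.2)
)

# Size of the margin, regardless of which side won.
first_ten$Abs.Margin <- abs(first_ten$Margin)

# Arbitrary threshold: any recorded rainfall counts as a wet game.
first_ten$Wet <- first_ten$Rainfall > 0

# Median absolute margin for dry versus wet games.
by_wet <- aggregate(Abs.Margin ~ Wet, data = first_ten, FUN = median)
by_wet
```

Ten games are far too few to conclude anything; the same steps on the full d_weather would be the real test.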

Tips

Finally, there is tipping info here. I’m really looking forward to exploring the different methods used to arrive at these.

d_tips <- get_squiggle_data("tips")
## Getting data from https://api.squiggle.com.au/?q=tips

I didn’t get time to explore this. Next time!

Conclusion

I realise I’ve just pretty much gone through the documentation, but since I learn from doing, this is a good thing. I have a much better understanding of the info available in this package and am excited about having a closer look soon.