# Exploratory Data Analysis of Gas Prices in Brazil

### Intention

I have been a kaggle member since 2017 and the analysis that this community share every day have been a good source of learning material in my data science journey. This time I decide to stop been a reader and share with you my first notebook.

The main intention of the notebook is to make an Exploratory Data Analysis of the Gas Prices in Brazil data set, hoping that new r users and kagglers in general could use this as a useful material to learn and apply in their own analysis.

English is not my native language, so, take this in consideration while reading my ideas.

### Introduction

Brazil is the bigger oil producer in South America, and keeps a place between the top 10 producers in the world in 2018 (Source: Energy Information Administration). However, even the production level it has, the economy is not a net oil exporter, they have to import product to supply the national demand.

These facts makes oil prices in the country partially exposed to foreign shocks, that may be reflected in distortions in gas prices, with direct effects over the macroeconomic activity. Gas prices in any economy is an important variable correlated with the macroeconomic growth and general inflation levels.

This situation makes interesting an analysis of the evolution of gas prices in Brazil, using a data set with market prices and distribution prices of different product. In general, the analysis will cover a section oriented to import, transform and cleaning the data. Then, a data visualization exploration, finding trends, seasonality, distributions, etc.

The final topic will cover some modeling techniques, like exponential smoothing and least square to forecast the gas price evolution.

### Packages

library(janitor) # data cleaning and tables
library(tidyverse) # metapackage with lots of helpful functions
library(lubridate) # to handle date objects in R
library(ggthemes) # themes for ggplot graphs
library(broom) # to handle models output in a tidy way
library(ggridges) # another tool to add elements to ggplot graphs 

### Importing and cleanign the data for analysis

This time the data comes in a .tsv format, which stands for Tab Separated Values. To import this type of text files the function read_delim() comes into play. This function, as the others in the readr package, has a better performance than their parallel in base R (read.table in this case) sice this provide a better automatic variable type identification, saving mutating code after importing the data.

run gas2 <- read.table("2004-2019.tsv", sep = "\t", header = TRUE) an compare the results.

gas <- read_delim("data/2004-2019.tsv", delim = "\t")

Before showing the results of the data importing process, it is important to tune a bit the names of the variables in the gas data frame. This job was not automatic and consisted in writing by hand a short English translation of the names in Portuguese.

# This asignation structure helps when trying to sabe space in the cell

new_variables_names <-  c("x", "initial_date", "end_date", "region", "state",
"product", "stations_consulted","unit", "mean_market_price",
"sd_market_price", "min_market_price", "max_market_price",
"mean_price_margin", "cv_market_price", "mean_distribution_price",
"sd_distribution_price", "min_distribution_price",
"max_distribution_price", "cv_distribution_price", "month", "year")

names(gas) <- new_variables_names

# I want a moth variable with labels, not only numbers
gas <- gas %>%
mutate(month_label = month(initial_date, label = TRUE))

Done that, taking a glimpse of the data shows that all variables were import with a correct format. For example, dates were automatically parse as date.

glimpse(gas)
## Observations: 106,823
## Variables: 22
## $x <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... ##$ initial_date            <date> 2004-05-09, 2004-05-09, 2004-05-09, 2...
## $end_date <date> 2004-05-15, 2004-05-15, 2004-05-15, 2... ##$ region                  <chr> "CENTRO OESTE", "CENTRO OESTE", "CENTR...
## $state <chr> "DISTRITO FEDERAL", "GOIAS", "MATO GRO... ##$ product                 <chr> "ETANOL HIDRATADO", "ETANOL HIDRATADO"...
## $stations_consulted <dbl> 127, 387, 192, 162, 103, 408, 278, 105... ##$ unit                    <chr> "R$/l", "R$/l", "R$/l", "R$/l", "R$/l"... ##$ mean_market_price       <dbl> 1.288, 1.162, 1.389, 1.262, 1.181, 1.3...
## $sd_market_price <dbl> 0.016, 0.114, 0.097, 0.070, 0.078, 0.1... ##$ min_market_price        <dbl> 1.190, 0.890, 1.180, 1.090, 1.050, 0.9...
## $max_market_price <dbl> 1.350, 1.449, 1.760, 1.509, 1.400, 2.0... ##$ mean_price_margin       <dbl> 0.463, 0.399, 0.419, 0.432, 0.240, 0.4...
## $cv_market_price <dbl> 0.012, 0.098, 0.070, 0.055, 0.066, 0.0... ##$ mean_distribution_price <dbl> 0.825, 0.763, 0.970, 0.830, 0.941, 0.9...
## $sd_distribution_price <dbl> 0.110, 0.088, 0.095, 0.119, 0.077, 0.1... ##$ min_distribution_price  <dbl> 0.42010, 0.50130, 0.56140, 0.59910, 0....
## $max_distribution_price <dbl> 0.96660, 1.05000, 1.16100, 1.22242, 1.... ##$ cv_distribution_price   <dbl> 0.133, 0.115, 0.098, 0.143, 0.082, 0.1...
## $month <chr> "05", "05", "05", "05", "05", "05", "0... ##$ year                    <dbl> 2004, 2004, 2004, 2004, 2004, 2004, 20...

### Margins distributions by region and product

• Camparign margins in all regions shows that CNG has the higher one.
gas %>%
#group_by(year, month, floor_date, region, product) %>%
#summarise(mean_market_price = mean(mean_market_price, na.rm = TRUE)) %>%
filter(!product == "LPG") %>%
ggplot(aes(x = mean_price_margin, fill = product, color = product)) +
geom_density(alpha = 0.15) +
scale_colour_hue(l = 40, c = 30) +
facet_wrap(~region, scales = "free_y") +
theme(legend.position = "bottom") +
theme_fivethirtyeight()

#coord_flip()

### Seassonality

This type of data with monthly frequency usually brings seasonal patterns that are important to be detect. In this case, we are going to explore seasonality with a graph. This graph in particular shows a line for each year, where months are map to one of the axis, allowing us to see is a cycle repeat over the different lines.

In general, there is not a clear cycle in the different variables, but anyway two things can be Highlighted.

1. Diesel prices tend to increase more rapidly in the last quarter of the year.
1. Hydrated Ethanol prices tend to increase more rapidly in the first quarter of the year.

#### Plot 3. Seassonal plot

gas %>%
group_by(year, month_label,  product) %>%
summarise(mean_market_price = mean(mean_market_price, na.rm = TRUE)) %>%
ggplot(aes(x = month_label, y = mean_market_price, color = year, group = year)) +
geom_line() +
facet_wrap(~product, scales = "free") +
theme_fivethirtyeight() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 90, size = 9),
panel.grid = element_blank(),
legend.text = element_text(size = 7))

### Trend and slope of the evolution

Since the beginning of the notebook the existence of a positive trend in the evolution of the prices was mention in different parts, but now it is time to describe that trend and with a linear regression explore the slope by fuel in each region.

Previously, a line plot of mean market price was shown, where the mean appear without any element related to the dispersion of the different prices. This time a hexagonal plot shows the trend that we mention before, isolated by different facets. A hex plot was use because it is easier to show the concentration and distribution of the prices than using points with low opacity.

With this analysis emerged a few insights:

• The Ethanol was the fuel with the lower slope and higher variance
• Diesel S10 was the one with the higher variation.
• Compressed Natural Gas has a low share in the north region. In addition, there are some states in this region that bought CNG different moments but it have not been adopt as a frequent source of energy.
• In the West Center region, the price of Liquefied petroleum gas (LPG) has a clear variation by state, evidence in the different branches of the hexagons. I did not include color by state in this visualization, but I did the homework and a different trend within the region explains those branches.
gas %>%
ggplot(aes(x = initial_date, y = mean_market_price)) +
geom_hex(show.legend = F) +
#geom_point(alpha = 0.09, size = 0.1) +
facet_grid(product~region, scales = "free") +
theme(legend.position = "none") +
theme_fivethirtyeight() +
theme( axis.text.x = element_text(angle = 90))

Now I will apply an approach from book “R for Data Science”, in order to run many models at once for each strata. The strategy to accomplish this is to nest the data with tidyr, ending with a data frame of data frames called by_region_product, then mutating each DF to end with a model.

Having the models we can extract information about the coefficients (standar error, p.values) and about the individual observations (residuals).

# Using nest function from tidyr to create a data set of data set with the diferent strata
by_region_product <- gas %>%
mutate(period_change = as.numeric(round((min(initial_date) - initial_date)/30, 0)) * -1) %>%
group_by(region, product) %>%
nest()

by_region_product <- by_region_product %>%
mutate(price_lm = map(data, ~lm(mean_market_price ~ period_change, data = .x)),
glance = map(price_lm, glance),
tidy = map(price_lm, tidy),
augment = map(price_lm, augment),
rsq = map_dbl(glance, "r.squared"))

# Plot of the slopes colored by the fit indicador Rsquared
by_region_product %>%
unnest(tidy) %>%
select(region, product, rsq, term, estimate) %>%
rename(intercept = (Intercept)) %>%
legend.text = element_text(size = 8))