library(DBI)
library(dbplyr)
library(tidyverse)
library(knitr)
library(DT)
library(patchwork)
library(glue)
<- 3
sleep_default theme_set(theme_minimal(base_size = 12))
For this analysis, I will use sales data from Northwind database1 to answer the following questions:
What is the annual sales revenue and number of orders?
What is the monthly variation in sales revenue?
How do monthly revenue and orders compare to the yearly average?
Steps to answer the questions
Identify the tables that contain the sales revenue and order data
Understand the schema, columns that store date, sales, and orders
Use SQL and R to retrieve, prepare and summarize data
Data Visualization
Data Interpretation
Prerequisites
Connect to the database
<- DBI::dbConnect(RPostgres::Postgres(),
con dbname = 'northwind',
host = 'localhost',
port = 5432,
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"))
dbListTables(conn = con) ## list tables in the database
[1] "us_states" "customers" "orders"
[4] "employees" "shippers" "products"
[7] "order_details" "categories" "suppliers"
[10] "region" "territories" "employee_territories"
[13] "customer_demographics" "customer_customer_demo" "sales_revenue"
The data used to answer the questions can be found in two tables: orders
and order_details
.
Entity Relationship Diagram
Source DBeaver
What is the annual sales revenue and number of orders?
<- "
sales_revenue
select
min(o.order_date) as start_date,
max(o.order_date) as end_date,
extract(year from o.order_date) as year,
count(*) as order_number,
round(cast(sum(od.unit_price* od.quantity) as numeric),
2) as total_revenue,
round(cast(avg(od.unit_price* od.quantity)as numeric),
2) as avg_revenue
from public.order_details od
inner join public.orders as o on od.order_id = o.order_id
group by year
order by year;
"
<- (dbGetQuery(con, sales_revenue)) sales_revenue_tbl
kable(sales_revenue_tbl)
start_date | end_date | year | order_number | total_revenue | avg_revenue |
---|---|---|---|---|---|
1996-07-04 | 1996-12-31 | 1996 | 405 | 226298.5 | 558.76 |
1997-01-01 | 1997-12-31 | 1997 | 1059 | 658388.8 | 621.71 |
1998-01-01 | 1998-05-06 | 1998 | 691 | 469771.3 | 679.84 |
Annual summary of sales, number of orders and average sale
Code
<- ggplot(sales_revenue_tbl, aes(x = year, y =(total_revenue/1000)))+
total_revenue geom_col()+
geom_text(aes(label = round(total_revenue/1000)), vjust = 1.5, color = "white")+
scale_y_continuous(labels = scales::dollar_format())+
labs(
title = "Total Sales Revenue per Year - Thousands",
x = NULL,
y = "Sales $M"
+
)theme_minimal()
<- ggplot(sales_revenue_tbl, aes(x = year, y = as.numeric(order_number)))+
order_number geom_col()+
geom_text(aes(label = as.numeric(order_number)), vjust = 1.5, color = "white")+
labs(
title = "Number of Orders per Year",
x = NULL,
y = "Total Number of Orders"
+
)theme_minimal()
<- ggplot(sales_revenue_tbl, aes(x = year, y = avg_revenue))+
avg_revenue geom_col()+
geom_text(aes(label = avg_revenue), vjust = 1.5, color = "white")+
scale_y_continuous(labels = scales::dollar_format())+
labs(
title = "Average Dollars per Sale",
x = "Years between July 04, 1996 to May 06, 1998",
y = "Average Sale Amount"
+
)theme_minimal()
+ plot_spacer() + order_number + plot_layout(widths = c(4, -1, 4))) / avg_revenue (total_revenue
Insights
- The available periods for 1996 and 1998 are shorter than those for 1997, making it difficult to interpret comparisons.
- The graph shows that the sales revenue and the number of orders for the first five months of 1998 are greater than those for the last six months of the year 1996. However, since the data is not for the same period or consecutive years, it is difficult to determine the sales trend. Even though the context is different, we can say that sales revenue doubled in the first four months of 1998 compared with the last six months of 1996.
- In the absence of data for an entire year for more than one year, we can look at the average dollar amount of sales. Based on the average dollar amount of sales, the graph shows an increase from year to year.
How about the monthly sales?
What is the monthly variation in sales revenue?
To have the monthly sales revenue and number of orders, the previous query is modified as follows:
<- "
monthly_revenue select
cast(date_trunc('month', o.order_date) as date) as sales_month,
count(*) as total_number,
round(cast(sum(od.unit_price* od.quantity)as numeric)) as monthly_revenue
from public.orders o
inner join public.order_details od on o.order_id = od.order_id
group by sales_month
order by sales_month;
"
<- (dbGetQuery(con, monthly_revenue)) monthly_revenue_tbl
Monthly sales revenue and number of orders
Next, let’s create a graph to visualize the monthly sales data.
Code
<- min(sales_revenue_tbl$start_date)
min_order_date <- max(sales_revenue_tbl$end_date)
max_order_date
<- ggplot(monthly_revenue_tbl, aes(x = sales_month, y = monthly_revenue))+
(monthly_revenue_plot geom_col()+
scale_y_continuous(labels = scales::dollar_format())+
theme(
plot.title = element_text(size= 12),
plot.title.position = "plot"
+
)labs(
title = glue("Sales by Month \n",
"Between ", {format(min_order_date, "%B %d, %Y")}, " and ",
format(max_order_date, "%B %d, %Y")}
{
),x = "Month",
y = "Sales Dollars"
+
)theme_minimal())
The graph shows that there is much variation from one month to another. As there is month-over-month sales variation we can use the lag()
function (SQL) to help to see the differences.
<- "
monthly_var_revenue with monthly_revenue as
(select
cast(date_trunc('month', o.order_date) as date) as sales_month,
count(*) as total_number,
round(cast(sum(od.unit_price* od.quantity)as numeric)) as monthly_revenue
from public.orders o
inner join public.order_details od on o.order_id = od.order_id
group by sales_month
order by sales_month),
prev_month_revenue as
(select
mr.sales_month,
mr.total_number,
mr.monthly_revenue,
lag(mr.monthly_revenue, 1, mr.monthly_revenue) over (order by mr.sales_month) as prev_month_revenue
from monthly_revenue mr)
select
pmr.sales_month,
pmr.total_number,
pmr.monthly_revenue,
prev_month_revenue,
pmr.monthly_revenue - pmr.prev_month_revenue as delta_revenue
from prev_month_revenue pmr;
"
<- (dbGetQuery(con, monthly_var_revenue)) monthly_var_revenue_tbl
Monthly sales revenue and number of orders
Also, based on the graph, it is evident that the data follows a left-tailed distribution. I will utilize the summary()
function in R to examine the descriptive statistics.
median(monthly_var_revenue_tbl$delta_revenue)
[1] 3707
<- summary(monthly_var_revenue_tbl$delta_revenue)) (summary_delta_revenue
Min. 1st Qu. Median Mean 3rd Qu. Max.
-114732.0 -2405.0 3707.0 -447.5 14643.5 31563.0
The median is positive ($3.707) while the mean is negative: (447.5). Month-over-month sales data shows a wide spread between the lower and upper quartiles.
The plot below shows how the sales vary month-over-month.
Code
<- ggplot(monthly_var_revenue_tbl, aes(x = sales_month, y = delta_revenue))+
(monthly_var_revenue_plot scale_x_date(date_breaks = "year", date_labels = "%Y", date_minor_breaks = "3 months") +
geom_line()+
scale_y_continuous(limits = c(-200000, 150000), labels = scales::dollar_format())+
theme(
plot.title = element_text(size= 12),
plot.title.position = "plot"
+
)labs(
title = glue("Monthly Sales Variation \n",
"Between ", {format(min_order_date, "%B %d, %Y")}, " and ",
format(max_order_date, "%B %d, %Y")}
{
),x = "Month",
y = "Dollars Change"
+
)theme_minimal()
)
Insights
- With real data, we would want to investigate more to understand the business context that occurred in May of 1998.
- From the data available in the Northwind database, we know that the data collection ended at the beginning of May 1998 which can explain the high difference compared with the previous month.
We have the data for the entire year of 1997 and we can look at revenue and number of orders together and compare the monthly data to the yearly average. By comparing monthly sales to yearly average, businesses can identify which months consistently perform above or below average, helping in informing various business decisions.
How do monthly revenue and orders compare to the yearly average?
<- "
month_data_year_avg with montly_data as
(select
cast(date_trunc('month', o.order_date) as date) year_month,
count(*) as monthly_orders,
round(cast(sum(od.unit_price* od.quantity)as numeric)) as monthly_revenue
from public.orders o
inner join public.order_details od on o.order_id = od.order_id
where extract(year from o.order_date) = 1997
group by year_month
order by year_month)
select
md.year_month,
md.monthly_orders,
md.monthly_revenue
from montly_data md
order by md.year_month;
"
<- (dbGetQuery(con,month_data_year_avg)) month_data_year_avg_tbl
Monthly sales data and Yearly averages
<- mean(month_data_year_avg_tbl$monthly_revenue)) (yearly_revenue_average
[1] 54865.67
<- mean(month_data_year_avg_tbl$monthly_orders)) (yearly_order_average
integer64
[1] 88
Code
<- ggplot(month_data_year_avg_tbl, aes(x = year_month, y = monthly_revenue))+
month_data_year_avg_rev_plot geom_line()+
scale_x_date(date_labels = "%m", date_minor_breaks = "1 month")+
scale_y_continuous(labels = scales::dollar_format())+
geom_hline(yintercept = yearly_revenue_average ,
color = "darkred",
linetype = "dashed")+
annotate("text",
x = min(month_data_year_avg_tbl$year_month),
y = yearly_revenue_average +1000,
label = "AVG",
color = "darkred",
hjust = 0
+
)theme(
plot.title = element_text(size= 12),
plot.title.position = "plot"
+
)labs(
title = glue( "Yearly Revenue Average \n",
"Monthly Revenue \n",
"Year 1997"),
x = "Date",
y = "Revenue Dollars"
+
)theme_minimal()
<- ggplot(month_data_year_avg_tbl, aes(x = year_month, y = as.numeric(monthly_orders)))+
month_data_year_avg_order_plot geom_line()+
scale_x_date(date_labels = "%m", date_minor_breaks = "1 month")+
geom_hline(yintercept = yearly_order_average,
color = "darkred",
linetype = "dashed")+
annotate("text",
x = min(month_data_year_avg_tbl$year_month),
y = yearly_order_average +1,
label = "AVG",
color = "darkred",
hjust = 0
+
)theme(
plot.title = element_text(size= 12),
plot.title.position = "plot"
+
)labs(
title = glue( "Yearly Order Average \n",
"Monthly Orders \n",
"Year 1997"),
x = "Date",
y = "Number of Orders"
+
)theme_minimal()
+ plot_spacer() + month_data_year_avg_order_plot+ plot_layout(widths = c(4, -1, 4)) month_data_year_avg_rev_plot
Insights
- Both the monthly sales revenue and number of orders vary widely compared with the yearly average.
- The highest values for sales revenue are at the beginning and the end of the year while the lowest values are in the first half of the year. This indicates that the business can have seasonal fluctuations in sales.
- This is also supported by the number of orders placed each month. The highest number of orders is in the second half of the year which can explain the increased revenue at the final of the year.
Knowing in which months the business performs below or above the average can help in the strategic planning of the company and the allocation of resources.
For example :
when the sales volumes are low the business can design loyalty programs that encourage repeated business during slower months;
during off-peak times the company can plan training programs to ensure that the staff are fully equipped for busy periods;
targeted campaigns by allocating budgets to peak seasons to maximize returns or promotions during low-performing months to increase sales.
Comparing monthly sales data to yearly averages is a powerful analytical tool that provides deep insights into business performance, aids in strategic planning and drives informed decision-making. It allows businesses to adapt proactively to changes, optimize operations, and achieve better financial outcomes.
# Close the database connection
dbDisconnect(con)
Footnotes
The Northwind database is a sample database that was originally created by Microsoft for demonstrating the features of their database management systems. It contains a variety of tables and data related to a fictional company called Northwind Traders, which imports and exports specialty foods from around the world.↩︎