Simon S. OBAME
Project Manager
Hi! My name is Simon, and I'm a Project Manager with over 12 years of experience in the energy and water sector.
DOB: October 28, 1985
Email: contact @ stevenobame.com
Website: www.stevenobame.com
I carried out this analysis with Excel, SQL, R and Tableau. In this article you will explore the part of the analysis done with R and Tableau. For this project I work for a fictional company, Cyclistic.
This case study focuses on conducting an exploratory analysis of a bike-share company's customer trip data spanning a 12-month period from January 2023 to December 2023.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers,
Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members.
In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.
Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
Moreno has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
The business task is to understand the difference in usage patterns between members and casual riders. To achieve it, it is essential to answer the following questions:
What is the average ride duration of both users?
What is the average ride duration for both users by day of week and by month?
What is the total number of rides for both users by day of week and by month?
What is the number of rides per hour of the day for both users?
What is the peak hour for both users?
When do both users use the Cyclistic bikes more: on weekdays or on weekends?
Which one of the three types of bikes is preferred by the users?
Once we have the answers to the above questions, it will be easier to understand usage patterns among members and casual riders.
For the purpose of this project, I have downloaded the 12 months of data, from January to December 2023. The data has been made available by Motivate International Inc. under this license. Each of the twelve files is organized into thirteen columns with data for all rides that occurred between January 2023 and December 2023.
These datasets, once combined, will help answer the important questions, as they contain all the necessary details to reach conclusions.
Credibility of data: the data has been retrieved from the authorized site, hence it is credible.
Tools used for cleaning, analysis, and visualization: I have used Excel, SQL and R for cleaning and analysis, and Excel and Tableau for data visualization. For this article I will walk through the process using exclusively R and Tableau. You can find the code and the dataset to reproduce it. Click here to download the dataset.
Business task
The purpose of this project is to analyze the Cyclistic data set for the year of 2023 to understand how annual members and casual riders use Cyclistic bikes differently.
To perform this analysis we will answer these questions:
What is the average ride duration of both users?
What is the average ride duration for both users by day of week and by month?
What is the total number of rides for both users by day of week and by month?
When do both users use the Cyclistic bikes more: on weekdays or on weekends?
Which one of the three types of bikes is preferred by the users?
Before we delve into the data, we'll install the packages and load the libraries.
# Install packages
install.packages("tidyverse", repos="https://cran.r-project.org")
install.packages("janitor", repos="https://cran.r-project.org")
install.packages("plyr", repos="https://cran.r-project.org")
install.packages("mice", repos="https://cran.r-project.org")

# helps wrangle data
library(tidyverse)
# use the conflicted package to manage conflicts
library(conflicted)
library(janitor)
library(scales)
library(dplyr)
library(plyr)
library(mice)
library(ggplot2)

# Set dplyr::filter and dplyr::lag as the default choices
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")
# Prefer dplyr::mutate (and friends) over the plyr versions
conflict_prefer("mutate", "dplyr")
conflict_prefer("arrange", "dplyr")
conflict_prefer("summarise", "dplyr")
conflict_prefer("rename", "dplyr")

# Setting working directory
setwd("~/Formations_Certifications/Data Analysis/Case Study/DataSet/2024_02_Case_Study1/R-Study")
We load the previous 12 months of data (January to December 2023) from the .csv files using the read_csv() function and store each file in its own data frame, 12 in total.
# Upload the bike-share dataset (one csv file per month)
q1_2023 <- read_csv("202301-divvy-tripdata.csv")
q2_2023 <- read_csv("202302-divvy-tripdata.csv")
q3_2023 <- read_csv("202303-divvy-tripdata.csv")
q4_2023 <- read_csv("202304-divvy-tripdata.csv")
q5_2023 <- read_csv("202305-divvy-tripdata.csv")
q6_2023 <- read_csv("202306-divvy-tripdata.csv")
q7_2023 <- read_csv("202307-divvy-tripdata.csv")
q8_2023 <- read_csv("202308-divvy-tripdata.csv")
q9_2023 <- read_csv("202309-divvy-tripdata.csv")
q10_2023 <- read_csv("202310-divvy-tripdata.csv")
q11_2023 <- read_csv("202311-divvy-tripdata.csv")
q12_2023 <- read_csv("202312-divvy-tripdata.csv")
Before doing any manipulation, we have to be sure we will be able to merge all the files into a single data frame. We have to compare the column names of each file. The names don’t have to be in the same order, but they HAVE to match perfectly before we can join them into one file.
# Checking the column names of each file
colnames(q1_2023)
colnames(q2_2023)
colnames(q3_2023)
colnames(q4_2023)
colnames(q5_2023)
colnames(q6_2023)
colnames(q7_2023)
colnames(q8_2023)
colnames(q9_2023)
colnames(q10_2023)
colnames(q11_2023)
colnames(q12_2023)

## Output (identical for each of the twelve files) :
##  [1] "ride_id"            "rideable_type"      "started_at"
##  [4] "ended_at"           "start_station_name" "start_station_id"
##  [7] "end_station_name"   "end_station_id"     "start_lat"
## [10] "start_lng"          "end_lat"            "end_lng"
## [13] "member_casual"
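Rather than eyeballing twelve outputs, the same check can be done in one call. A minimal sketch, assuming the twelve data frames are loaded as above: janitor's compare_df_cols_same() returns TRUE when the column names and types are consistent enough for a clean bind_rows().

# Programmatic check that the twelve files share the same columns and types,
# so they can be safely combined with bind_rows()
compare_df_cols_same(q1_2023, q2_2023, q3_2023, q4_2023,
                     q5_2023, q6_2023, q7_2023, q8_2023,
                     q9_2023, q10_2023, q11_2023, q12_2023)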
We can now merge the files into one big data frame
# Stack the individual monthly data frames into one big data frame
all_trips2023 <- bind_rows(q1_2023, q2_2023, q3_2023, q4_2023,
                           q5_2023, q6_2023, q7_2023, q8_2023,
                           q9_2023, q10_2023, q11_2023, q12_2023)
# Getting to know the new data frame
dim(all_trips2023)

## Output :
## [1] 5719877      13
# List of the column names in the dataset
colnames(all_trips2023)

## Output :
##  [1] "ride_id"            "rideable_type"      "started_at"
##  [4] "ended_at"           "start_station_name" "start_station_id"
##  [7] "end_station_name"   "end_station_id"     "start_lat"
## [10] "start_lng"          "end_lat"            "end_lng"
## [13] "member_casual"
Some columns are not needed for this analysis, so we are going to remove them; this does not affect the study.
# Remove the latitude and longitude columns
all_trips2023 <- all_trips2023 %>%
  select(-c(start_lat, start_lng, end_lat, end_lng))
Let's inspect the new data frame that has been created.
# List of column names
colnames(all_trips2023)

## Output :
## [1] "ride_id"            "rideable_type"      "started_at"
## [4] "ended_at"           "start_station_name" "start_station_id"
## [7] "end_station_name"   "end_station_id"     "member_casual"
# How many rows are in the data frame?
nrow(all_trips2023)

## Output :
## [1] 5719877

# Dimensions of the data frame?
dim(all_trips2023)

## Output :
## [1] 5719877       9
# See the first 6 rows of the data frame
head(all_trips2023)

## Output :
## # A tibble: 6 × 9
##   ride_id          rideable_type started_at          ended_at
##   <chr>            <chr>         <dttm>              <dttm>
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # ℹ 5 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, member_casual <chr>
# See the list of columns and data types (numeric, character, etc.)
str(all_trips2023)

## Output :
## tibble [5,719,877 × 9] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5719877] "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
##  $ rideable_type     : chr [1:5719877] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5719877], format: "2023-01-21 20:05:42" "2023-01-10 15:37:36" ...
##  $ ended_at          : POSIXct[1:5719877], format: "2023-01-21 20:16:33" "2023-01-10 15:46:05" ...
##  $ start_station_name: chr [1:5719877] "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
##  $ start_station_id  : chr [1:5719877] "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
##  $ end_station_name  : chr [1:5719877] "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
##  $ end_station_id    : chr [1:5719877] "202480.0" "TA1308000002" "599" "TA1308000002" ...
##  $ member_casual     : chr [1:5719877] "member" "member" "casual" "member" ...
# Statistical summary of the data
summary(all_trips2023)

## Output :
##    ride_id          rideable_type        started_at
##  Length:5719877     Length:5719877     Min.   :2023-01-01 00:01:58.00
##  Class :character   Class :character   1st Qu.:2023-05-21 12:50:44.00
##  Mode  :character   Mode  :character   Median :2023-07-20 18:02:50.00
##                                        Mean   :2023-07-16 10:27:50.01
##                                        3rd Qu.:2023-09-16 20:08:49.00
##                                        Max.   :2023-12-31 23:59:38.00
##     ended_at                     start_station_name start_station_id
##  Min.   :2023-01-01 00:02:41.00  Length:5719877     Length:5719877
##  1st Qu.:2023-05-21 13:14:09.00  Class :character   Class :character
##  Median :2023-07-20 18:19:47.00  Mode  :character   Mode  :character
##  Mean   :2023-07-16 10:46:00.18
##  3rd Qu.:2023-09-16 20:28:10.00
##  Max.   :2024-01-01 23:50:51.00
##  end_station_name   end_station_id     member_casual
##  Length:5719877     Length:5719877     Length:5719877
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
# How many observations fall under each user type
table(all_trips2023$member_casual)

## Output :
##  casual  member
## 2059179 3660698
Next, we check for duplicate ride_id values and remove any empty rows and columns.
# Check for duplicates on ride_id
get_dupes(all_trips2023, ride_id)

## Output :
## No duplicate combinations found of: ride_id
## # A tibble: 0 × 10
## # ℹ 10 variables: ride_id <chr>, dupe_count <int>, rideable_type <chr>,
## #   started_at <dttm>, ended_at <dttm>, start_station_name <chr>,
## #   start_station_id <chr>, end_station_name <chr>, end_station_id <chr>,
## #   member_casual <chr>
# Removing empty rows and columns
remove_empty(all_trips2023)

## value for "which" not specified, defaulting to c("rows", "cols")

## Output :
## # A tibble: 5,719,877 × 9
##    ride_id          rideable_type started_at          ended_at
##    <chr>            <chr>         <dttm>              <dttm>
##  1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
##  2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
##  3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
##  4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
##  5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
##  6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
##  7 2F7194B6012A98D4 electric_bike 2023-01-15 21:18:36 2023-01-15 21:32:36
##  8 DB1CF84154D6A049 classic_bike  2023-01-25 10:49:01 2023-01-25 10:58:22
##  9 34EAB943F88C4C5D electric_bike 2023-01-25 20:49:47 2023-01-25 21:02:14
## 10 BC8AB1AA51DA9115 classic_bike  2023-01-06 16:37:19 2023-01-06 16:49:52
## # ℹ 5,719,867 more rows
## # ℹ 5 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, member_casual <chr>
The ride_id is the unique identifier in this dataset, so I checked for duplicates using ride_id. No duplicates were found, and no empty rows or columns were found.
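Note that remove_empty() only drops rows and columns that are entirely empty; individual missing values (for instance, rides without a recorded station name) would remain as NA. If you want to quantify those, a quick sketch:

# Count the missing (NA) values in each column of the combined data frame
colSums(is.na(all_trips2023))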
Let's check the unique values to ensure that every row in the rideable_type column contains one of the three bike types ("electric_bike", "classic_bike", "docked_bike") and that every row in the member_casual column contains either "member" or "casual".
unique(all_trips2023$rideable_type)

## Output :
## [1] "electric_bike" "classic_bike"  "docked_bike"

unique(all_trips2023$member_casual)

## Output :
## [1] "member" "casual"
To answer our questions and proceed with the analysis, we will add the following columns: ride_length, start_time, start_hour, date, month, and day_of_week.
# Add ride_length (ride duration in minutes)
all_trips2023$ride_length <- difftime(all_trips2023$ended_at, all_trips2023$started_at, units = "mins")

# Add start_time (extract the time from the date & time column)
all_trips2023$start_time <- format(as.POSIXlt(all_trips2023$started_at), format = "%H:%M:%S")

# Add start_hour
all_trips2023 <- all_trips2023 %>%
  mutate(start_hour = hour(hms(start_time)))

# Add date
all_trips2023$date <- as.Date(all_trips2023$started_at)

# Add month
all_trips2023$month <- format(all_trips2023$date, "%B")

# Add day_of_week
all_trips2023$day_of_week <- format(as.Date(all_trips2023$date), "%A")
The data frame includes a few hundred entries where bikes were taken out of docks and checked for quality, or where ride_length is negative or equal to zero. We will remove these rides and, since data is being removed, create a new version of the data frame (v2).
# Create the new data frame, keeping only rides with a positive ride_length
all_trips2023_V2 <- all_trips2023[!all_trips2023$ride_length <= 0,]

# Dimensions of the new data frame?
dim(all_trips2023_V2)

## Output :
## [1] 5718608      15
1269 rows have been removed.
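As a quick sanity check, that figure can be computed directly by comparing the two data frames:

# Number of rows dropped when removing rides with a non-positive ride_length
# (should confirm the 1,269 rows mentioned above)
nrow(all_trips2023) - nrow(all_trips2023_V2)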
# Convert ride_length to numeric
all_trips2023_V2$ride_length <- as.numeric(all_trips2023_V2$ride_length)
We will conduct a descriptive analysis on ride_length (all figures in minutes): the straight average (total ride length divided by the number of rides), the median (midpoint of the ascending array of ride lengths), the longest ride, and the shortest ride.
# Quick summary to get all the metrics needed
summary(all_trips2023_V2$ride_length)

## Output :
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##     0.02     5.42     9.53    18.19    16.92 98489.07
# Average ride length by user type
aggregate(all_trips2023_V2$ride_length ~ all_trips2023_V2$member_casual, FUN = mean)

## Output :
##   all_trips2023_V2$member_casual all_trips2023_V2$ride_length
## 1                         casual                     28.25396
## 2                         member                     12.52779

# Median ride length by user type
aggregate(all_trips2023_V2$ride_length ~ all_trips2023_V2$member_casual, FUN = median)

## Output :
##   all_trips2023_V2$member_casual all_trips2023_V2$ride_length
## 1                         casual                    11.850000
## 2                         member                     8.516667

# Maximum ride length by user type
aggregate(all_trips2023_V2$ride_length ~ all_trips2023_V2$member_casual, FUN = max)

## Output :
##   all_trips2023_V2$member_casual all_trips2023_V2$ride_length
## 1                         casual                    98489.067
## 2                         member                     1559.667

# Minimum ride length by user type
aggregate(all_trips2023_V2$ride_length ~ all_trips2023_V2$member_casual, FUN = min)

## Output :
##   all_trips2023_V2$member_casual all_trips2023_V2$ride_length
## 1                         casual                   0.01666667
## 2                         member                   0.01666667
These metrics tell us that casual users' rides are, on average, more than twice as long as members' rides.
Let’s put the days of the week in order and see the average ride time by each day for members vs casual users
# Order the days of the week from Sunday to Saturday
all_trips2023_V2$day_of_week <- ordered(all_trips2023_V2$day_of_week,
                                        levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                                   "Thursday", "Friday", "Saturday"))

# Average ride length by user type and day of week
aggregate(all_trips2023_V2$ride_length ~ all_trips2023_V2$member_casual + all_trips2023_V2$day_of_week, FUN = mean)

## Output :
##    all_trips2023_V2$member_casual all_trips2023_V2$day_of_week all_trips2023_V2$ride_length
## 1                          casual                       Sunday                     32.86849
## 2                          member                       Sunday                     13.99256
## 3                          casual                       Monday                     27.72091
## 4                          member                       Monday                     11.90424
## 5                          casual                      Tuesday                     25.08965
## 6                          member                      Tuesday                     12.01519
## 7                          casual                    Wednesday                     24.31035
## 8                          member                    Wednesday                     11.94911
## 9                          casual                     Thursday                     24.73157
## 10                         member                     Thursday                     12.02156
## 11                         casual                       Friday                     27.26909
## 12                         member                       Friday                     12.48058
## 13                         casual                     Saturday                     32.14513
## 14                         member                     Saturday                     13.94076
These results vary across the week; we will examine them in detail with some visualisations in the next phase (the Share phase).
Next, we analyze ridership data by user type and weekday. We create a weekday field using wday(), group by user type and weekday, and then calculate the number of rides and the average ride duration with summarise().
all_trips2023_V2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday)

## Output :
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        Sun              335617             32.9
##  2 casual        Mon              234766             27.7
##  3 casual        Tue              246162             25.1
##  4 casual        Wed              249093             24.3
##  5 casual        Thu              270540             24.7
##  6 casual        Fri              311835             27.3
##  7 casual        Sat              410608             32.1
##  8 member        Sun              408765             14.0
##  9 member        Mon              494481             11.9
## 10 member        Tue              576629             12.0
## 11 member        Wed              586365             11.9
## 12 member        Thu              589486             12.0
## 13 member        Fri              531503             12.5
## 14 member        Sat              472758             13.9
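The same grouping pattern can answer the per-hour questions from the business task (number of rides per hour and peak hour for each user type), since we already created the start_hour column. A minimal sketch:

# Number of rides per hour of the day, for each user type
rides_per_hour <- all_trips2023_V2 %>%
  group_by(member_casual, start_hour) %>%
  summarise(number_of_rides = n(), .groups = "drop")

# Peak hour for each user type
rides_per_hour %>%
  group_by(member_casual) %>%
  slice_max(number_of_rides, n = 1)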
Let’s visualize the rides by bike type
all_trips2023_V2 %>%
  group_by(rideable_type) %>%
  ggplot(aes(x = rideable_type, fill = rideable_type)) +
  geom_bar() +
  labs(title = "Case study : How does a bike-share navigate speedy success",
       subtitle = "Bike types used",
       caption = "Data collected by company - Steven OBAME")
From the visualization, it is clear that electric bikes are the most used, followed by classic bikes.
Let’s visualize the number of rides by rider type
all_trips2023_V2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday) %>%
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Case study : How does a bike-share navigate speedy success",
       subtitle = "Number of rides by usertype",
       caption = "Data collected by company - Steven OBAME")
Examining the weekly ride counts for users, this visualization clearly illustrates that the number of trips is higher for members compared to casual users. Members predominantly utilize bikes during the weekdays, with the peak number of trips occurring on Tuesdays, whereas casual users primarily ride on weekends, with the highest number of trips recorded on Saturdays.
Let’s visualize the average duration of rides by rider type
all_trips2023_V2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday) %>%
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Case study : How does a bike-share navigate speedy success",
       subtitle = "Average ride by usertype",
       caption = "Data collected by company - Steven OBAME")
## Output :
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Remarkably, this visualization shows that the average ride duration for casual users significantly exceeds that of members. Casual users tend to ride for longer, particularly on weekends. In contrast, member ride durations remain consistent, with only a slight increase during weekends.
For further analysis, we are going to create .csv files that we will visualize in Tableau.
# csv file : mean ride_length by day_of_week
counts_mean <- aggregate(all_trips2023_V2$ride_length ~ all_trips2023_V2$member_casual + all_trips2023_V2$day_of_week, FUN = mean)
write.csv(counts_mean, file = 'avg_ride_length.csv')

# csv file : number of rides by day_of_week
sum_ride_id_day <- aggregate(all_trips2023_V2$ride_id ~ all_trips2023_V2$member_casual + all_trips2023_V2$day_of_week, FUN = length)
write.csv(sum_ride_id_day, file = 'sum_ride_id_day.csv')

# csv file : mean ride_length by month
counts_mean_month <- aggregate(all_trips2023_V2$ride_length ~ all_trips2023_V2$member_casual + all_trips2023_V2$month, FUN = mean)
write.csv(counts_mean_month, file = 'avg_ride_length_month.csv')

# csv file : number of rides by month
sum_ride_id_month <- aggregate(all_trips2023_V2$ride_id ~ all_trips2023_V2$member_casual + all_trips2023_V2$month, FUN = length)
write.csv(sum_ride_id_month, file = 'sum_ride_id_month.csv')
Based on our previous analysis, we aim to explore year-long trends by examining the four seasons. We intend to augment the previous file with a column for seasons. Subsequently, we will import the previous file into a new data frame, add the new column, and then export it.
# Import the monthly counts into a new data frame
dfile <- read.csv("sum_ride_id_month.csv")

# Rename the columns
dfile <- dfile %>%
  rename(month = all_trips2023_V2.month,
         ride_id = all_trips2023_V2.ride_id,
         member_casual = all_trips2023_V2.member_casual)

# Add a seasons column based on the month
dfile <- dfile %>%
  mutate(seasons = case_when(
    month %in% c("March", "April", "May") ~ "Spring",
    month %in% c("June", "July", "August") ~ "Summer",
    month %in% c("September", "October", "November") ~ "Fall",
    TRUE ~ "Winter"   # all other cases (December, January, February)
  ))

# csv file : number of rides by month and season
sum_ride_id_month_season <- dfile
write.csv(sum_ride_id_month_season, file = 'sum_ride_id_month_season.csv')
You can see the dashboard on my Tableau profile here
First, we are going to confirm the number of rides per user type and per day with the next visualization:
In this visualization, with 64% of rides taken by members (3,659,987) and 36% by casual users (2,058,621), we confirm that members are the most frequent bike users. The breakdown of bike usage throughout the week clearly indicates high usage by members on weekdays and by casual users on weekends.
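Those shares can also be reproduced in R from the cleaned data frame; a short sketch:

# Share of rides by user type, as percentages of all 2023 rides
round(prop.table(table(all_trips2023_V2$member_casual)) * 100, 1)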
Let’s examine the distribution of the number of trips per month and per season.
When we examine the distribution of the number of trips per season and per member throughout the year, we observe a significant concentration from May to November, particularly during summer (from June to August).
The summer and spring seasons account for over 61% of all rides throughout the year. Casual users, however, take their shortest trips during winter.
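To double-check that seasonal split in R, here is a sketch that derives a season column directly on the cleaned data frame, reusing the month-to-season mapping defined above (it assumes the month names are in English, as in that mapping):

# Share of rides per season, using the same month-to-season mapping as above
all_trips2023_V2 %>%
  mutate(season = case_when(
    month %in% c("March", "April", "May") ~ "Spring",
    month %in% c("June", "July", "August") ~ "Summer",
    month %in% c("September", "October", "November") ~ "Fall",
    TRUE ~ "Winter"
  )) %>%
  dplyr::count(season) %>%
  mutate(share = percent(n / sum(n)))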
Let’s examine the distribution of the average ride duration per month and per day.
These visualizations clearly indicate that casual users ride bikes more frequently and for longer duration than members, peaking during the summer months (with July being the highest). Meanwhile, bike usage among members remains almost constant throughout the year.
This distribution may suggest that members use bikes for regular trips, such as commuting to work at specific hours, while casual users ride mainly for leisure, resulting in higher usage on weekends.
With everything we’ve seen so far, we can give a quick summary:
Members
* They take a significantly higher number of rides, albeit of shorter duration.
* They primarily ride during the week, and their usage remains consistent throughout the year, with the highest number of rides during the summer.
* They use bikes for regular trips, such as commuting to work.
Casual riders
* They take fewer rides than members, but they ride for longer periods and cover longer distances.
* Their rides predominantly occur on weekends, and their usage varies throughout the year, with the highest number of rides during the summer.
* They use bikes for daily commuting and leisure activities.
This analysis can be expanded by mapping the rides per member and station to observe peak hours and focus on the top stations by bike type, for example.
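As a starting point for that extension, here is a hedged sketch listing the five most used start stations for each user type (it simply drops rides without a recorded start station):

# Top 5 start stations per user type (ignoring rides without a station name)
all_trips2023_V2 %>%
  filter(!is.na(start_station_name)) %>%
  group_by(member_casual, start_station_name) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  group_by(member_casual) %>%
  slice_max(number_of_rides, n = 5)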
The purpose of this analysis was to answer the question: how do annual members and casual riders use Cyclistic bikes differently? Based on what we have seen, here are some recommendations that could help convert casual riders into members.
The goal is to attract casual users and convert them into members: