Introduction

This is the capstone project as part of my Google Data Analytics Professional Certificate course. For the analysis, I will be using R programming language and RStudio IDE.

Modus Operandi

  • Ask
  • Prepare
  • Process
  • Analyse
  • Share
  • Act

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Problem Statement

The director of marketing and your manager Lily Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

Ask

Key tasks

  • Identify the business task
    • The main objective is to build the best marketing strategies to turn casual bike riders into annual members by analyzing how the ‘Casual’ and ‘Annual’ customers use Cyclistic bike share differently.
  • Consider key stakeholders
    • Cyclistic executive team
    • Director of Marketing
    • Marketing Analytics team

Deliverable

  • A clear statement of the business task
    • Find the difference between casual and member riders.

Prepare

I will be using Cyclistic’s historical trip data to analyse and identify trends. The data has been made available by Motivate International Inc.

Key Tasks

  • Download data and store it appropriately.
    • Data has been downloaded from divvy-tripdata and copies have been stored securely on my computer.
  • Identify how it’s organised.
    • All trip data is in comma-delimited (.CSV) format. Column names - “ride_id”, “rideable_type”, “started_at”, “ended_at”, “start_station_name”, “start_station_id”, “end_station_name”,“end_station_id”, “start_lat”, “start_lng”, “end_lat”, “end_lng”, “member_casual” (Total 13 columns).
  • Sort and filter the data.
    • For this analysis I’m going to use past 12 months data starting form May 2022 to April 2023.
  • Determine the credibility of the data.
    • For the purpose of this case study, the datasets are appropriate and it will enable me to answer the business questions. But due to privacy policy, I cannot use rider’s personally identification information, and this will prevent me from determining if a single rider taken several rides. All ride_id are unique in this datasets.

Deliverable

  • A description of all data sources used.
    • Main source data provided by Cyclistic Company.
Importing the packages to be used
library(readr)
library(dplyr)
library(lubridate)
library(geosphere)
library(ggplot2)
library(ggmap)
Import data in R Studio
df1 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202205-divvy-tripdata.csv')
df2 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202206-divvy-tripdata.csv')
df3 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202207-divvy-tripdata.csv')
df4 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202208-divvy-tripdata.csv')
df5 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202209-divvy-tripdata.csv')
df6 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202210-divvy-tripdata.csv')
df7 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202211-divvy-tripdata.csv')
df8 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202212-divvy-tripdata.csv')
df9 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202301-divvy-tripdata.csv')
df10 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202302-divvy-tripdata.csv')
df11 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202303-divvy-tripdata.csv')
df12 <- read.csv('/home/arjit/Projects/Case Study Cyclistic/Data/202304-divvy-tripdata.csv')
Checking the consistency of data
colnames(df1)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df2)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df3)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df4)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df5)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df6)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df7)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df8)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df9)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df10)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df11)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(df12)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
Checking the structure of data
str(df1)
## 'data.frame':    634858 obs. of  13 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
str(df2)
## 'data.frame':    769204 obs. of  13 variables:
##  $ ride_id           : chr  "600CFD130D0FD2A4" "F5E6B5C1682C6464" "B6EB6D27BAD771D2" "C9C320375DE1D5C6" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-06-30 17:27:53" "2022-06-30 18:39:52" "2022-06-30 11:49:25" "2022-06-30 11:15:25" ...
##  $ ended_at          : chr  "2022-06-30 17:35:15" "2022-06-30 18:47:28" "2022-06-30 12:02:54" "2022-06-30 11:19:43" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
str(df3)
## 'data.frame':    823488 obs. of  13 variables:
##  $ ride_id           : chr  "954144C2F67B1932" "292E027607D218B6" "57765852588AD6E0" "B5B6BE44314590E6" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-07-05 08:12:47" "2022-07-26 12:53:38" "2022-07-03 13:58:49" "2022-07-31 17:44:21" ...
##  $ ended_at          : chr  "2022-07-05 08:24:32" "2022-07-26 12:55:31" "2022-07-03 14:06:32" "2022-07-31 18:42:50" ...
##  $ start_station_name: chr  "Ashland Ave & Blackhawk St" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" ...
##  $ start_station_id  : chr  "13224" "15541" "15541" "15541" ...
##  $ end_station_name  : chr  "Kingsbury St & Kinzie St" "Michigan Ave & 8th St" "Michigan Ave & 8th St" "Woodlawn Ave & 55th St" ...
##  $ end_station_id    : chr  "KA1503000043" "623" "623" "TA1307000164" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "casual" "casual" "casual" ...
str(df4)
## 'data.frame':    785932 obs. of  13 variables:
##  $ ride_id           : chr  "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-08-07 21:34:15" "2022-08-08 14:39:21" "2022-08-08 15:29:50" "2022-08-08 02:43:50" ...
##  $ ended_at          : chr  "2022-08-07 21:41:46" "2022-08-08 14:53:23" "2022-08-08 15:40:34" "2022-08-08 02:58:53" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 42 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 42 42 41.8 ...
##  $ end_lng           : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
str(df5)
## 'data.frame':    701339 obs. of  13 variables:
##  $ ride_id           : chr  "5156990AC19CA285" "E12D4A16BF51C274" "A02B53CD7DB72DD7" "C82E05FEE872DF11" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-09-01 08:36:22" "2022-09-01 17:11:29" "2022-09-01 17:15:50" "2022-09-01 09:00:28" ...
##  $ ended_at          : chr  "2022-09-01 08:39:05" "2022-09-01 17:14:45" "2022-09-01 17:16:12" "2022-09-01 09:10:32" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "California Ave & Milwaukee Ave" "" "" "" ...
##  $ end_station_id    : chr  "13084" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.6 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
str(df6)
## 'data.frame':    558685 obs. of  13 variables:
##  $ ride_id           : chr  "A50255C1E17942AB" "DB692A70BD2DD4E3" "3C02727AAF60F873" "47E653FDC2D99236" ...
##  $ rideable_type     : chr  "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-10-14 17:13:30" "2022-10-01 16:29:26" "2022-10-19 18:55:40" "2022-10-31 07:52:36" ...
##  $ ended_at          : chr  "2022-10-14 17:19:39" "2022-10-01 16:49:06" "2022-10-19 19:03:30" "2022-10-31 07:58:49" ...
##  $ start_station_name: chr  "Noble St & Milwaukee Ave" "Damen Ave & Charleston St" "Hoyne Ave & Balmoral Ave" "Rush St & Cedar St" ...
##  $ start_station_id  : chr  "13290" "13288" "655" "KA1504000133" ...
##  $ end_station_name  : chr  "Larrabee St & Division St" "Damen Ave & Cullerton St" "Western Ave & Leland Ave" "Orleans St & Chestnut St (NEXT Apts)" ...
##  $ end_station_id    : chr  "KA1504000079" "13089" "TA1307000140" "620" ...
##  $ start_lat         : num  41.9 41.9 42 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 42 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "casual" "member" "member" ...
str(df7)
## 'data.frame':    337735 obs. of  13 variables:
##  $ ride_id           : chr  "BCC66FC6FAB27CC7" "772AB67E902C180F" "585EAD07FDEC0152" "91C4E7ED3C262FF9" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-11-10 06:21:55" "2022-11-04 07:31:55" "2022-11-21 17:20:29" "2022-11-25 17:29:34" ...
##  $ ended_at          : chr  "2022-11-10 06:31:27" "2022-11-04 07:46:25" "2022-11-21 17:34:36" "2022-11-25 17:45:15" ...
##  $ start_station_name: chr  "Canal St & Adams St" "Canal St & Adams St" "Indiana Ave & Roosevelt Rd" "Indiana Ave & Roosevelt Rd" ...
##  $ start_station_id  : chr  "13011" "13011" "SL-005" "SL-005" ...
##  $ end_station_name  : chr  "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" ...
##  $ end_station_id    : chr  "13016" "13016" "13016" "13016" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
str(df8)
## 'data.frame':    181806 obs. of  13 variables:
##  $ ride_id           : chr  "65DBD2F447EC51C2" "0C201AA7EA0EA1AD" "E0B148CCB358A49D" "54C5775D2B7C9188" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-12-05 10:47:18" "2022-12-18 06:42:33" "2022-12-13 08:47:45" "2022-12-13 18:50:47" ...
##  $ ended_at          : chr  "2022-12-05 10:56:34" "2022-12-18 07:08:44" "2022-12-13 08:59:51" "2022-12-13 19:19:48" ...
##  $ start_station_name: chr  "Clifton Ave & Armitage Ave" "Broadway & Belmont Ave" "Sangamon St & Lake St" "Shields Ave & 31st St" ...
##  $ start_station_id  : chr  "TA1307000163" "13277" "TA1306000015" "KA1503000038" ...
##  $ end_station_name  : chr  "Sedgwick St & Webster Ave" "Sedgwick St & Webster Ave" "St. Clair St & Erie St" "Damen Ave & Madison St" ...
##  $ end_station_id    : chr  "13191" "13191" "13016" "13134" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.6 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "casual" "member" "member" ...
str(df9)
## 'data.frame':    190301 obs. of  13 variables:
##  $ ride_id           : chr  "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : chr  "2023-01-21 20:05:42" "2023-01-10 15:37:36" "2023-01-02 07:51:57" "2023-01-22 10:52:58" ...
##  $ ended_at          : chr  "2023-01-21 20:16:33" "2023-01-10 15:46:05" "2023-01-02 08:05:11" "2023-01-22 11:01:44" ...
##  $ start_station_name: chr  "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
##  $ start_station_id  : chr  "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
##  $ end_station_name  : chr  "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
##  $ end_station_id    : chr  "202480.0" "TA1308000002" "599" "TA1308000002" ...
##  $ start_lat         : num  41.9 41.8 42 41.8 41.8 ...
##  $ start_lng         : num  -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.8 42 41.8 41.8 ...
##  $ end_lng           : num  -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "casual" "member" ...
str(df10)
## 'data.frame':    190445 obs. of  13 variables:
##  $ ride_id           : chr  "CBCD0D7777F0E45F" "F3EC5FCE5FF39DE9" "E54C1F27FA9354FF" "3D561E04F739CC45" ...
##  $ rideable_type     : chr  "classic_bike" "electric_bike" "classic_bike" "electric_bike" ...
##  $ started_at        : chr  "2023-02-14 11:59:42" "2023-02-15 13:53:48" "2023-02-19 11:10:57" "2023-02-26 16:12:05" ...
##  $ ended_at          : chr  "2023-02-14 12:13:38" "2023-02-15 13:59:08" "2023-02-19 11:35:01" "2023-02-26 16:39:55" ...
##  $ start_station_name: chr  "Southport Ave & Clybourn Ave" "Clarendon Ave & Gordon Ter" "Southport Ave & Clybourn Ave" "Southport Ave & Clybourn Ave" ...
##  $ start_station_id  : chr  "TA1309000030" "13379" "TA1309000030" "TA1309000030" ...
##  $ end_station_name  : chr  "Clark St & Schiller St" "Sheridan Rd & Lawrence Ave" "Aberdeen St & Monroe St" "Franklin St & Adams St (Temp)" ...
##  $ end_station_id    : chr  "TA1309000024" "TA1309000041" "13156" "TA1309000008" ...
##  $ start_lat         : num  41.9 42 41.9 41.9 41.8 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  41.9 42 41.9 41.9 41.8 ...
##  $ end_lng           : num  -87.6 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "member" "member" ...
str(df11)
## 'data.frame':    258678 obs. of  13 variables:
##  $ ride_id           : chr  "6842AA605EE9FBB3" "F984267A75B99A8C" "FF7CF57CFE026D02" "6B61B916032CB6D6" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2023-03-16 08:20:34" "2023-03-04 14:07:06" "2023-03-31 12:28:09" "2023-03-22 14:09:08" ...
##  $ ended_at          : chr  "2023-03-16 08:22:52" "2023-03-04 14:15:31" "2023-03-31 12:38:47" "2023-03-22 14:24:51" ...
##  $ start_station_name: chr  "Clark St & Armitage Ave" "Public Rack - Kedzie Ave & Argyle St" "Orleans St & Chestnut St (NEXT Apts)" "Desplaines St & Kinzie St" ...
##  $ start_station_id  : chr  "13146" "491" "620" "TA1306000003" ...
##  $ end_station_name  : chr  "Larrabee St & Webster Ave" "" "Clark St & Randolph St" "Sheffield Ave & Kingsbury St" ...
##  $ end_station_id    : chr  "13193" "" "TA1305000030" "13154" ...
##  $ start_lat         : num  41.9 42 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.7 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num  41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
str(df12)
## 'data.frame':    426590 obs. of  13 variables:
##  $ ride_id           : chr  "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2023-04-02 08:37:28" "2023-04-19 11:29:02" "2023-04-19 08:41:22" "2023-04-19 13:31:30" ...
##  $ ended_at          : chr  "2023-04-02 08:41:37" "2023-04-19 11:52:12" "2023-04-19 08:43:22" "2023-04-19 13:35:09" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.8 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.8 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
Merging 12 datasets into 1 dataset
all_rides <- rbind(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
Creating a file of the merged dataset
write_csv(all_rides, '/home/arjit/Projects/Case Study Cyclistic/Data/all_rides.csv')

Process

Cleaning and processing data for analysis

Key tasks

  • Check the data for errors
  • Choose your tools
  • Transform the data so you can work with it effectively
  • Document the cleaning process

Deliverable

  • Documentation of any cleaning or manipulating the data
Analyzing the the structure of attributes
sapply(all_rides, class)
##            ride_id      rideable_type         started_at           ended_at 
##        "character"        "character"        "character"        "character" 
## start_station_name   start_station_id   end_station_name     end_station_id 
##        "character"        "character"        "character"        "character" 
##          start_lat          start_lng            end_lat            end_lng 
##          "numeric"          "numeric"          "numeric"          "numeric" 
##      member_casual 
##        "character"
Number of columns, rows and total NA values
ncol(all_rides)
## [1] 13
nrow(all_rides)
## [1] 5859061
sum(is.na(all_rides))
## [1] 11946
Removing all the empty cells
all_rides <- janitor::remove_empty(all_rides, which=c("cols"))
all_rides <- janitor::remove_empty(all_rides, which=c("rows"))
Removing all the NA values
all_rides <- na.omit(all_rides)
sum(is.na(all_rides))
## [1] 0
Checking the number of rows left after removing NA
nrow(all_rides)
## [1] 5853088
Checking the dimension of the dataset
dim(all_rides)
## [1] 5853088      13
Calculating the ride time
all_rides$minutes <- difftime(all_rides$ended_at, all_rides$started_at, units = c("min"))
all_rides$minutes <- as.numeric(as.character(all_rides$minutes))
all_rides$minutes <- round(all_rides$minutes, 1)
Checking the datatype of “minutes” column
is.numeric(all_rides$minutes)
## [1] TRUE
Adding date, month, day, year, day of week, hour columns
all_rides$date <- as.Date(all_rides$started_at)
all_rides$month <- format(as.Date(all_rides$date), "%m")
all_rides$day <- format(as.Date(all_rides$date), "%d")
all_rides$year <- format(as.Date(all_rides$date), "%Y")
all_rides$day_of_week <- format(as.Date(all_rides$date), "%A")
all_rides$hour <- hour(all_rides$started_at)
Checking the modified dataset
head(all_rides)
##            ride_id rideable_type          started_at            ended_at
## 1 EC2DE40644C6B0F4  classic_bike 2022-05-23 23:06:58 2022-05-23 23:40:19
## 2 1C31AD03897EE385  classic_bike 2022-05-11 08:53:28 2022-05-11 09:31:22
## 3 1542FBEC830415CF  classic_bike 2022-05-26 18:36:28 2022-05-26 18:58:18
## 4 6FF59852924528F8  classic_bike 2022-05-10 07:30:07 2022-05-10 07:38:49
## 5 483C52CAAE12E3AC  classic_bike 2022-05-10 17:31:56 2022-05-10 17:36:57
## 6 C0A3AA5A614DCE01  classic_bike 2022-05-04 14:48:55 2022-05-04 14:56:04
##                  start_station_name start_station_id
## 1            Wabash Ave & Grand Ave     TA1307000117
## 2 DuSable Lake Shore Dr & Monroe St            13300
## 3           Clinton St & Madison St     TA1305000032
## 4           Clinton St & Madison St     TA1305000032
## 5           Clinton St & Madison St     TA1305000032
## 6           Carpenter St & Huron St            13196
##                end_station_name end_station_id start_lat start_lng  end_lat
## 1        Halsted St & Roscoe St   TA1309000025  41.89147 -87.62676 41.94367
## 2   Field Blvd & South Water St          15534  41.88096 -87.61674 41.88635
## 3       Wood St & Milwaukee Ave          13221  41.88224 -87.64107 41.90765
## 4        Clark St & Randolph St   TA1305000030  41.88224 -87.64107 41.88458
## 5           Morgan St & Lake St   TA1306000015  41.88224 -87.64107 41.88578
## 6 Sangamon St & Washington Blvd          13409  41.89456 -87.65345 41.88316
##     end_lng member_casual minutes       date month day year day_of_week hour
## 1 -87.64895        member    33.4 2022-05-23    05  23 2022      Monday   23
## 2 -87.61752        member    37.9 2022-05-11    05  11 2022   Wednesday    8
## 3 -87.67255        member    21.8 2022-05-26    05  26 2022    Thursday   18
## 4 -87.63189        member     8.7 2022-05-10    05  10 2022     Tuesday    7
## 5 -87.65102        member     5.0 2022-05-10    05  10 2022     Tuesday   17
## 6 -87.65110        member     7.2 2022-05-04    05  04 2022   Wednesday   14
Adding the season column
all_rides <- all_rides %>% mutate(season = case_when(month == "01" ~ "Winter",
                                                     month == "02" ~ "Winter", 
                                                     month == "03" ~ "Spring", 
                                                     month == "04" ~ "Spring", 
                                                     month == "05" ~ "Summer", 
                                                     month == "06" ~ "Summer", 
                                                     month == "07" ~ "Summer", 
                                                     month == "08" ~ "Summer", 
                                                     month == "09" ~ "Fall", 
                                                     month == "10" ~ "Fall", 
                                                     month == "11" ~ "Winter", 
                                                     month == "12" ~ "Winter"))
Mapping the time column to phases of a day
all_rides <- all_rides %>% mutate(time_of_day = case_when(hour == "0" ~ "Night",
                                                          hour == "1" ~ "Night",
                                                          hour == "2" ~ "Night", 
                                                          hour == "3" ~ "Night", 
                                                          hour == "4" ~ "Night", 
                                                          hour == "5" ~ "Morning", 
                                                          hour == "6" ~ "Morning", 
                                                          hour == "7" ~ "Morning", 
                                                          hour == "8" ~ "Morning", 
                                                          hour == "9" ~ "Morning", 
                                                          hour == "10" ~ "Morning", 
                                                          hour == "11" ~ "Morning", 
                                                          hour == "12" ~ "Afternoon", 
                                                          hour == "13" ~ "Afternoon", 
                                                          hour == "14" ~ "Afternoon", 
                                                          hour == "15" ~ "Afternoon", 
                                                          hour == "16" ~ "Afternoon", 
                                                          hour == "17" ~ "Afternoon", 
                                                          hour == "18" ~ "Evening", 
                                                          hour == "19" ~ "Evening", 
                                                          hour == "20" ~ "Evening", 
                                                          hour == "21" ~ "Evening", 
                                                          hour == "22" ~ "Evening", 
                                                          hour == "23" ~ "Evening"))
Mapping the months to their names
all_rides <- all_rides %>% mutate(month = case_when(month == "01" ~ "January", 
                                                    month == "02" ~ "February", 
                                                    month == "03" ~ "March", 
                                                    month == "04" ~ "April", 
                                                    month == "05" ~ "May", 
                                                    month == "06" ~ "June", 
                                                    month == "07" ~ "July", 
                                                    month == "08" ~ "August", 
                                                    month == "09" ~ "September", 
                                                    month == "10" ~ "October", 
                                                    month == "11" ~ "November", 
                                                    month == "12" ~ "December"))
Verifying the results
tail(all_rides)
##                  ride_id rideable_type          started_at            ended_at
## 5859056 A17D800CE963661A  classic_bike 2023-04-11 15:46:42 2023-04-11 15:50:03
## 5859057 8B441A6C436E9900  classic_bike 2023-04-29 21:20:21 2023-04-29 21:30:19
## 5859058 3980D64BE11540F1  classic_bike 2023-04-24 09:16:05 2023-04-24 09:22:27
## 5859059 3EF4B49FF7DAA02C  classic_bike 2023-04-18 07:53:51 2023-04-18 07:59:16
## 5859060 210B2ED6583DC231  classic_bike 2023-04-29 07:33:55 2023-04-29 07:38:57
## 5859061 D29CB39B9E3FC46A electric_bike 2023-04-18 08:00:32 2023-04-18 08:02:35
##                 start_station_name start_station_id          end_station_name
## 5859056     Michigan Ave & Lake St     TA1305000011    Clark St & Randolph St
## 5859057       Halsted St & 18th St            13099 Blue Island Ave & 18th St
## 5859058       Halsted St & 18th St            13099 Blue Island Ave & 18th St
## 5859059 Franklin St & Jackson Blvd     TA1305000025    Clark St & Randolph St
## 5859060     Michigan Ave & Lake St     TA1305000011    Clark St & Randolph St
## 5859061 Franklin St & Jackson Blvd     TA1305000025 Clark St & Ida B Wells Dr
##         end_station_id start_lat start_lng  end_lat   end_lng member_casual
## 5859056   TA1305000030  41.88602 -87.62440 41.88458 -87.63189        member
## 5859057          13135  41.85751 -87.64599 41.85756 -87.66154        casual
## 5859058          13135  41.85751 -87.64599 41.85756 -87.66154        casual
## 5859059   TA1305000030  41.87771 -87.63532 41.88458 -87.63189        casual
## 5859060   TA1305000030  41.88602 -87.62440 41.88458 -87.63189        casual
## 5859061   TA1305000009  41.87813 -87.63525 41.87593 -87.63058        casual
##         minutes       date month day year day_of_week hour season time_of_day
## 5859056     3.4 2023-04-11 April  11 2023     Tuesday   15 Spring   Afternoon
## 5859057    10.0 2023-04-29 April  29 2023    Saturday   21 Spring     Evening
## 5859058     6.4 2023-04-24 April  24 2023      Monday    9 Spring     Morning
## 5859059     5.4 2023-04-18 April  18 2023     Tuesday    7 Spring     Morning
## 5859060     5.0 2023-04-29 April  29 2023    Saturday    7 Spring     Morning
## 5859061     2.0 2023-04-18 April  18 2023     Tuesday    8 Spring     Morning

Analyze

Now all the required information are in one place and ready for exploration.

Key tasks

  • Aggregate your data so it’s useful and accessible.
  • Organise and format your data.
  • Perform calculations.
  • Identify trends and relationships.

Deliverable

  • A summary of analysis.
Checking the unique types of ride available
unique(all_rides$rideable_type)
## [1] "classic_bike"  "docked_bike"   "electric_bike"
Checking the unique values in member_casual
unique(all_rides$member_casual)
## [1] "member" "casual"
Verifying if there are any NA in dataset
sum(is.na(all_rides))
## [1] 0
Checking if there are any duplicates
#distinct(all_rides)
dim(all_rides)
## [1] 5853088      22
write_csv(all_rides, '/home/arjit/Projects/Case Study Cyclistic/Data/all_rides_modified.csv')
df <- read_csv('/home/arjit/Projects/Case Study Cyclistic/Data/all_rides_modified.csv')
## Rows: 5853088 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): ride_id, rideable_type, start_station_name, start_station_id, end...
## dbl   (7): start_lat, start_lng, end_lat, end_lng, minutes, year, hour
## dttm  (2): started_at, ended_at
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 22
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 EC2DE40644C6B0F4 classic_bike  2022-05-23 23:06:58 2022-05-23 23:40:19
## 2 1C31AD03897EE385 classic_bike  2022-05-11 08:53:28 2022-05-11 09:31:22
## 3 1542FBEC830415CF classic_bike  2022-05-26 18:36:28 2022-05-26 18:58:18
## 4 6FF59852924528F8 classic_bike  2022-05-10 07:30:07 2022-05-10 07:38:49
## 5 483C52CAAE12E3AC classic_bike  2022-05-10 17:31:56 2022-05-10 17:36:57
## 6 C0A3AA5A614DCE01 classic_bike  2022-05-04 14:48:55 2022-05-04 14:56:04
## # ℹ 18 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   minutes <dbl>, date <date>, month <chr>, day <chr>, year <dbl>,
## #   day_of_week <chr>, hour <dbl>, season <chr>, time_of_day <chr>
dim(df)
## [1] 5853088      22
Checking the structure of dataset
str(df)
## spc_tbl_ [5,853,088 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5853088] "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr [1:5853088] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5853088], format: "2022-05-23 23:06:58" "2022-05-11 08:53:28" ...
##  $ ended_at          : POSIXct[1:5853088], format: "2022-05-23 23:40:19" "2022-05-11 09:31:22" ...
##  $ start_station_name: chr [1:5853088] "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr [1:5853088] "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr [1:5853088] "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr [1:5853088] "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ start_lat         : num [1:5853088] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5853088] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:5853088] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5853088] -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual     : chr [1:5853088] "member" "member" "member" "member" ...
##  $ minutes           : num [1:5853088] 33.4 37.9 21.8 8.7 5 7.2 8.9 12.2 16.7 1.5 ...
##  $ date              : Date[1:5853088], format: "2022-05-23" "2022-05-11" ...
##  $ month             : chr [1:5853088] "May" "May" "May" "May" ...
##  $ day               : chr [1:5853088] "23" "11" "26" "10" ...
##  $ year              : num [1:5853088] 2022 2022 2022 2022 2022 ...
##  $ day_of_week       : chr [1:5853088] "Monday" "Wednesday" "Thursday" "Tuesday" ...
##  $ hour              : num [1:5853088] 23 8 18 7 17 14 12 19 17 7 ...
##  $ season            : chr [1:5853088] "Summer" "Summer" "Summer" "Summer" ...
##  $ time_of_day       : chr [1:5853088] "Evening" "Morning" "Evening" "Morning" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   minutes = col_double(),
##   ..   date = col_date(format = ""),
##   ..   month = col_character(),
##   ..   day = col_character(),
##   ..   year = col_double(),
##   ..   day_of_week = col_character(),
##   ..   hour = col_double(),
##   ..   season = col_character(),
##   ..   time_of_day = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
Analyzing the summary of the dataset
summary(df)
##    ride_id          rideable_type        started_at                 
##  Length:5853088     Length:5853088     Min.   :2022-05-01 00:00:06  
##  Class :character   Class :character   1st Qu.:2022-07-03 11:25:11  
##  Mode  :character   Mode  :character   Median :2022-08-28 13:00:14  
##                                        Mean   :2022-09-19 14:02:11  
##                                        3rd Qu.:2022-11-08 07:01:29  
##                                        Max.   :2023-04-30 23:59:05  
##     ended_at                   start_station_name start_station_id  
##  Min.   :2022-05-01 00:05:17   Length:5853088     Length:5853088    
##  1st Qu.:2022-07-03 11:48:44   Class :character   Class :character  
##  Median :2022-08-28 13:20:19   Mode  :character   Mode  :character  
##  Mean   :2022-09-19 14:18:00                                        
##  3rd Qu.:2022-11-08 07:12:01                                        
##  Max.   :2023-05-01 08:06:56                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5853088     Length:5853088     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##     end_lat         end_lng       member_casual         minutes         
##  Min.   : 0.00   Min.   :-88.14   Length:5853088     Min.   :-10353.40  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:     5.70  
##  Median :41.90   Median :-87.64   Mode  :character   Median :    10.00  
##  Mean   :41.90   Mean   :-87.65                      Mean   :    15.83  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:    17.90  
##  Max.   :42.37   Max.   :  0.00                      Max.   : 32035.40  
##       date               month               day                 year     
##  Min.   :2022-05-01   Length:5853088     Length:5853088     Min.   :2022  
##  1st Qu.:2022-07-03   Class :character   Class :character   1st Qu.:2022  
##  Median :2022-08-28   Mode  :character   Mode  :character   Median :2022  
##  Mean   :2022-09-18                                         Mean   :2022  
##  3rd Qu.:2022-11-08                                         3rd Qu.:2022  
##  Max.   :2023-04-30                                         Max.   :2023  
##  day_of_week             hour          season          time_of_day       
##  Length:5853088     Min.   : 0.00   Length:5853088     Length:5853088    
##  Class :character   1st Qu.:11.00   Class :character   Class :character  
##  Mode  :character   Median :15.00   Mode  :character   Mode  :character  
##                     Mean   :14.21                                        
##                     3rd Qu.:18.00                                        
##                     Max.   :23.00
Calculating the ride length in kilometers
df$ride_length <- distGeo(matrix(c(df$start_lng, df$start_lat), ncol = 2), matrix(c(df$end_lng, df$end_lat), ncol = 2))

df$ride_length <- df$ride_length/1000
Removing the rows with ride length as negative or zero
df <- df[!(df$ride_length <= 0),]
Removing the NA in the dataset
sum(is.na(df))
## [1] 3020593
df <- na.omit(df)

Doing descriptive analysis, studying:

  • avg_ride_length -> straight average(total ride length / total rides)

  • median_length -> midpoint number of ride length

  • max_ride_length -> longest ride

  • min_ride_length -> shortest ride

df %>% summarise(avg_ride_length = mean(ride_length), median_length = median(ride_length), 
                 max_ride_length = max(ride_length), min_ride_length = min(ride_length))
## # A tibble: 1 × 4
##   avg_ride_length median_length max_ride_length min_ride_length
##             <dbl>         <dbl>           <dbl>           <dbl>
## 1            2.18          1.61           9817.       0.0000185

Compare members and casual riders

  • Member vs casual riders difference depending on the total rides taken
df %>% 
    group_by(member_casual) %>% 
    summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) / nrow(df)) * 100)
## # A tibble: 2 × 3
##   member_casual ride_count ride_percentage
##   <chr>              <int>           <dbl>
## 1 casual           1668876            38.5
## 2 member           2663765            61.5
ggplot(df, aes(x = member_casual, fill=member_casual)) +
    geom_bar() +
    labs(x="Casuals vs Members", y="Number Of Rides", title= "Casuals vs Members distribution")

We can see on the Casuals vs Members distribution chart, members possessing ~ 61%, and casual riders have ~ 39% of the dataset. So it is clearly visible that in the whole year 2022 members used ride share 22% more than casual riders.

Comparison between Member Casual riders depending on ride length

df %>%
  group_by(member_casual) %>% 
  summarise(avg_ride_length = mean(ride_length), median_length = median(ride_length), 
            max_ride_length = max(ride_length), min_ride_length = min(ride_length))
## # A tibble: 2 × 5
##   member_casual avg_ride_length median_length max_ride_length min_ride_length
##   <chr>                   <dbl>         <dbl>           <dbl>           <dbl>
## 1 casual                   2.29          1.74           9817.       0.0000185
## 2 member                   2.12          1.53           9817.       0.0000202

From the above table we can conclude that casual riders took bike for longer rides than members, as the average trip duration / average ride length of member riders is lower than the average trip duration / average ride length of casual riders.

Fixing the names of days in a week

df$day_of_week <- ordered(df$day_of_week, 
                                    levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
df %>% 
  group_by(member_casual, day_of_week) %>%  
  summarise(number_of_rides = n()
  ,avg_ride_length = mean(ride_length),.groups="drop") %>% 
  arrange(member_casual, day_of_week)
## # A tibble: 14 × 4
##    member_casual day_of_week number_of_rides avg_ride_length
##    <chr>         <ord>                 <int>           <dbl>
##  1 casual        Sunday               277155            2.39
##  2 casual        Monday               194373            2.20
##  3 casual        Tuesday              191531            2.18
##  4 casual        Wednesday            200016            2.28
##  5 casual        Thursday             223623            2.22
##  6 casual        Friday               245092            2.24
##  7 casual        Saturday             337086            2.42
##  8 member        Sunday               299215            2.17
##  9 member        Monday               374705            2.03
## 10 member        Tuesday              422636            2.07
## 11 member        Wednesday            429342            2.22
## 12 member        Thursday             428473            2.09
## 13 member        Friday               374112            2.05
## 14 member        Saturday             335282            2.21

Analyzing the total rides by members and casual riders in a particular day in a week

df %>%  
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n(), .groups="drop") %>% 
  arrange(member_casual, day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Members and Casual riders Vs. Day of the week") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From the above graph, we can conclude that members took consistent rides throughout the week with least on Sunday. For the casual riders, the most rides were taken on weekends.

Analyzing the average ride time of members casual riders on a particular day of a week

df %>%  
  group_by(member_casual, day_of_week) %>% 
  summarise(average_ride_length = mean(ride_length), .groups="drop") %>%
  ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) + 
  labs(title ="Average ride time by Members and Casual riders Vs. Day of the week")

The average ride length for members are comparatively less than that of casual riders. Also it can be seen that weekend average ride length is much higher for casual riders along with total rides. So both of this facts can be correlated for casual riders. For members average ride length is about the same throughout the week

Grouping the rides according to member casual and months of a year

df %>% 
  group_by(member_casual, month) %>%  
  summarise(number_of_rides = n(), average_ride_length = mean(ride_length), .groups="drop") %>% 
  arrange(member_casual, month)
## # A tibble: 24 × 4
##    member_casual month    number_of_rides average_ride_length
##    <chr>         <chr>              <int>               <dbl>
##  1 casual        April             102551                2.23
##  2 casual        August            252507                2.33
##  3 casual        December           30197                1.80
##  4 casual        February           30825                1.87
##  5 casual        January            28189                1.73
##  6 casual        July              289054                2.37
##  7 casual        June              269373                2.39
##  8 casual        March              44359                1.89
##  9 casual        May               200562                2.43
## 10 casual        November           69553                2.27
## # ℹ 14 more rows

Analyzing the total rides by members and casual riders in a month

df$month <- ordered(df$month, 
                            levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
df %>%  
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n(),.groups="drop") %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Members and Casual riders Vs. Month", x = "Month", y= "Number Of Rides") +
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

The months June, July, August and September are the most busy time of the year among both members and casual riders. It is possible due to winter there is a significant drop in total rides in the months of November, December, January and February for both type of customers. But we can see that member’s total rides are higher than casual riders throughout the year except from June, July and August.

Analyzing the ride length by members casual riders in a month

df %>%  
  group_by(member_casual, month) %>% 
  summarise(average_ride_length = mean(ride_length),.groups="drop") %>%
  ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) + 
  labs(title ="Average ride length by Members and Casual riders Vs. Month") +
  theme(axis.text.x = element_text(angle = 30))

Average ride length of members is about the same throughout the year. While casual riders average ride length is greater than the members throughout the year.

Analyzing the average distance traveled by member casual rider

df %>% 
  group_by(member_casual) %>%
  summarise(avg_ride_length = mean(ride_length)) %>%
  ggplot() + 
  geom_col(mapping= aes(x= member_casual,y= avg_ride_length,fill=member_casual), show.legend = FALSE)+
  labs(title = "Mean travel distance by Members and Casual riders", x="Member and Casual riders", y="Average distance In Km")

From the above chart we can see that both riders travel about the same average distance. This similarity could be possible due to that member take (same ride time) rides throughout the week, but casual riders took rides mostly in weekends with higher ride time.

Analysis and visualize the bike demand by hour in a day

df %>%
    ggplot(aes(hour, fill= member_casual)) +
    labs(x="Hour of the day", title="Cyclistic's Bike demand by hour in a day") +
    geom_bar()

From the above chart we can see more members between 7am and 11am and more casual riders between 3pm and 12am. Also there is bigger volume rise in the afternoon for both type of riders. This information needs to be checked on day basis.

Analysis and visualize the bikes demand by hour in a day of a week

df %>%
    ggplot(aes(hour, fill=member_casual)) +
    geom_bar() +
    labs(x="Hour of the day", title="Cyclistic's bike demand per hour by day of the week") +
    facet_wrap(~ day_of_week)

There is a lot of difference between the weekdays and weekends. There is a big increase of volume in the weekdays between 7am to 10am and another volume increase from 5pm to 7pm. We can hypothesize that members use the bikes as daily routine like going to work (same behavior throughout the weekdays) and go back from work (5pm - 7pm). Weekends are completely different for members and casual riders, Friday, Saturday and Sunday there is huge peak in volume for casual riders, from this we can hypothesize that casual riders mostly use bike share for leisure activity in the weekends.

Analyzing the preferred bike by member casual riders

df %>%
    group_by(rideable_type) %>% 
    summarise(count = length(ride_id))
## # A tibble: 3 × 2
##   rideable_type   count
##   <chr>           <int>
## 1 classic_bike  2481004
## 2 docked_bike    137651
## 3 electric_bike 1713986
ggplot(df, aes(x=rideable_type, fill=member_casual)) +
    labs(x="Rideable type", title="Rideable type Vs. total rides by Members and casual riders") +
    geom_bar()

From the above viz we can see that members mostly use classic bikes, followed by electric bikes. Docked bikes mostly used by casual riders. Electric bikes are more favored by members.

Analyzing and visualizing the top 5 starting stations for casual riders

df %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "casual") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   start_station_name                 count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            46527
## 2 DuSable Lake Shore Dr & Monroe St  25609
## 3 Millennium Park                    20893
## 4 Michigan Ave & Oak St              20459
## 5 DuSable Lake Shore Dr & North Blvd 20302
df %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "casual") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5) %>% 
  mutate(start_station_name= fct_reorder(start_station_name, count)) %>% 
  ggplot(aes(x=start_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="Start Station Name", title="Top 5 starting stations for casual riders")
## Selecting by count

Analyzing and visualizing the top 5 ending stations for casual riders

df %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(member_casual == "casual") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   end_station_name                   count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            49616
## 2 DuSable Lake Shore Dr & North Blvd 23507
## 3 DuSable Lake Shore Dr & Monroe St  23431
## 4 Millennium Park                    22762
## 5 Michigan Ave & Oak St              22148
df %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(member_casual == "casual") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5) %>% 
  mutate(end_station_name= fct_reorder(end_station_name, count)) %>% 
  ggplot(aes(x=end_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="End Station Name", title="Top 5 ending stations for casual riders")
## Selecting by count

Analyzing and visualizing the top 5 starting stations for member riders

df %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "member") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   start_station_name           count
##   <chr>                        <int>
## 1 Kingsbury St & Kinzie St     23442
## 2 Clark St & Elm St            20969
## 3 Clinton St & Washington Blvd 20260
## 4 Wells St & Concord Ln        19897
## 5 Loomis St & Lexington St     18979
df %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "member") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5) %>% 
  mutate(start_station_name= fct_reorder(start_station_name, count)) %>% 
  ggplot(aes(x=start_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="Start Station Name", title="Top 5 starting stations for member riders")
## Selecting by count

Analyzing and visualizing the top 5 ending stations for casual riders

df %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(member_casual == "member") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5)
## Selecting by count
## # A tibble: 5 × 2
##   end_station_name             count
##   <chr>                        <int>
## 1 Kingsbury St & Kinzie St     23231
## 2 Clark St & Elm St            21304
## 3 Clinton St & Washington Blvd 21181
## 4 Wells St & Concord Ln        20662
## 5 University Ave & 57th St     19424
df %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(member_casual == "member") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(5) %>% 
  mutate(end_station_name= fct_reorder(end_station_name, count)) %>% 
  ggplot(aes(x=end_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="End Station Name", title="Top 5 ending stations for member riders")
## Selecting by count

Analyze and visualize the dataset on coordinate basis

coordinates_df <- df %>% 
filter(start_lng != end_lng & start_lat != end_lat) %>%
group_by(start_lng, start_lat, end_lng, end_lat, member_casual, rideable_type) %>%
summarise(total_rides = n(),.groups="drop") %>%
filter(total_rides > 200)
casual_riders <- coordinates_df %>% filter(member_casual == "casual")
member_riders <- coordinates_df %>% filter(member_casual == "member")

Setting up the ggmap for Chicago

chicago <- c(left = -87.700424, bottom = 41.790769, right = -87.554855, top = 41.990119)

chicago_map <- get_stamenmap(bbox = chicago, zoom = 12, maptype = "terrain")
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.

Visualization on the map

ggmap(chicago_map,darken = c(0.1, "white")) +
   geom_point(casual_riders, mapping = aes(x = start_lng, y = start_lat, color=rideable_type), size = 2) +
   coord_fixed(0.8) +
   labs(title = "Most used routes by Casual riders",x=NULL,y=NULL) +
   theme(legend.position="none")
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning: Removed 8 rows containing missing values (`geom_point()`).

ggmap(chicago_map,darken = c(0.1, "white")) +
    geom_point(member_riders, mapping = aes(x = start_lng, y = start_lat, color=rideable_type), size = 2) +  
    coord_fixed(0.8) +
    labs(title = "Most used routes by Member riders",x=NULL,y=NULL) +
    theme(legend.position="none")
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning: Removed 37 rows containing missing values (`geom_point()`).

We can clearly see the casual rides are mostly located around the center of the town (or the bay area), with all their trips located around that area points towards their bike usage pattern, which is for leisure, probably tourist or sightseeing related rides.

Members are mostly use bike all over the city including main city area and outside main center. This can be hypothesize as they travel for work purpose.

Share

This phase will be done by presentation, but here we can use RMarkdown to share our analysis and visualizations.

Key tasks

  • Determine the best way to share your findings.
  • Create effective data visualizations.
  • Present your findings.
  • Ensure your work is accessible.

Deliverable

  • Support visualizations and key findings.

Main insights and conclusions

  • Members holds the biggest proportion of the total rides, ~22% bigger than casual riders.
  • In all months we have more members than casual riders.
  • For casual riders the biggest volume of data is on the weekend.
  • There is a bigger volume of bikers in the afternoon. This could be possible that member use bikes for work purpose, this information can be backed by their bike usage in colder months, where there is significant drop in casual members in those months.

How members differs from casuals:

  • Members have the bigger volume of data, except on Saturday and Sunday. On the weekend, casuals riders have the most rides.
  • Casuals riders have more ride length (ride duration) than members. Average ride time of member are mostly same slight increase in end of week.
  • We have more members during the morning, mainly between 7am and 10am. And more casuals between 3pm and 12am.
  • Members have a bigger preference for classic bikes, followed by electric bike.
  • Members have a more fixed use for bikes for routine activities. Where as casual rider’s usage is different, mostly all activity in the weekend.
  • Casual member spend time near the center of the city or the bay area, where as member are scattered throughout the city.

Act

Act phase will be done by the Cyclistic’s executive team, Director of Marketing (Lily Moreno), Marketing Analytics team on the basis of my analysis. (Data-driven decision making)

Deliverable

  • Top three recommendations:
    • Offer a weekend-only membership at a different price point than the full annual membership.

    • Coupons and discounts could be handed out along with the annual subscription / weekend-only membership for the usage of electric bikes targeting casual riders. Also increasing the number of electric bike while reducing classic bikes if electric bike costs more for the pass, this can be beneficial for the company. (As electric bike are already in trend and usage is good as per member and ride type data.

    • Create marketing campaigns which can be sent via email, or advertisement in the docking stations explaining why annual member is beneficial. Campaigns should be placed at the peak months of the year.

  • Additional Recommendations:
    • Raise awareness of the ease of use, affordability, accessibility and green options.
    • Focus on dense and walk able areas, partnering with school and companies.

Conclusion

Thank you for your time and interest to review my capstone project! This project helped me to walk through the data analysis process from start to finish using real-world data and business questions. To learn from the others code too, I have referred the analysis done by Elaine Sanchez Wilber and Sayantan Bagchi. I’m truly excited and look forward to growing in the field of data analysis.