Data Analysis of Bike Sharing Rental using Python

Sriyanda Afrida W
8 min read · Oct 22, 2020

Photo by Markus Winkler on Unsplash

Dataset

I used the Bike Sharing Rental dataset from Hacktiv8, available at:

http://bit.ly/dwp-data-bike

Introduction

Three months ago, I decided to start Hacktiv8’s Introduction To Python For Data Science class program, and I’ve learned a lot since then. The course covers the materials you need to get started as a Data Scientist. At the end of the program, each student is required to complete a Data Analysis or Machine Learning project. My project is “Data Analysis of Bike Sharing Rental”, in which I answer the questions below to get insights from the data.

Outline

The bike sharing rental dataset has 46,230 rows and 16 columns. It describes a bike sharing system, a way of renting bicycles that covers getting a membership, borrowing a bicycle, and returning it. These processes are carried out automatically through a network of kiosk locations throughout the city. Using this system, people can rent a bicycle from one location and return it to another as needed.

I asked myself several questions about the data, and these are the five that I am going to answer in this article:

1. Which age group and which gender use the Citybike rental the most?

2. How many trips (borrowed bicycles) are there on each day from Monday to Sunday, and at which hours?

3. On which days do the most bicycle borrowings occur?

4. What are the 5 favorite departure stations, and at which times (hours) are bikes rented there most often?

5. Does a person’s age affect the trip duration in minutes?

First, we import into Jupyter Notebook the Python libraries that we will use to analyze the Bike Sharing Rental dataset. The libraries used include numpy, pandas, seaborn, matplotlib.pyplot, tensorflow, etc.

Figure 1: Import Python Libraries to Jupyter Notebook

Data Wrangling

Data wrangling is the process of transforming raw data into a ready-to-use format for analysis. For the Bike Sharing Rental dataset, the first step is to convert the date columns to the datetime type, which avoids ambiguity in how the dates are interpreted. We then extract the values needed for the analysis with “.dt.strftime” and change the data type of the dayname column to categorical.

Figure 2: Data Wrangling Process 2.1
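As a rough sketch of this step (not the exact notebook code), the conversion could look like the following. The column name "starttime" and the three toy rows are assumptions made for illustration only.

```python
import pandas as pd

# Toy rows standing in for the real dataset; "starttime" is an assumed column name.
df = pd.DataFrame(
    {"starttime": ["2018-01-01 08:15:00", "2018-01-04 17:40:00", "2018-01-06 09:05:00"]}
)

# Parse the strings into real datetimes so the dates are unambiguous.
df["starttime"] = pd.to_datetime(df["starttime"])

# Extract the weekday name; "%A" yields e.g. "Monday" in the default locale.
df["dayname"] = df["starttime"].dt.strftime("%A")

# Store dayname as an ordered categorical so sorting follows the calendar week.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df["dayname"] = pd.Categorical(df["dayname"], categories=day_order, ordered=True)

print(df.dtypes)
```

Making the column an ordered categorical means that later group by results and plots come out in Monday to Sunday order instead of alphabetical order.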

Next, we add an hour column to help analyze the favorite borrowing hours, an age column to help analyze specific age groups, and a route column (start station name to end station name) that shows which route each borrowed bicycle took.

Figure 3: Data Wrangling Process 2.2
Figure 4: Output Data Wrangling Process 2.2
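A minimal sketch of how these three columns might be derived is shown below. The column names ("starttime", "birth year", "start station name", "end station name"), the age formula (trip year minus birth year), and the toy rows are all assumptions, not taken from the original notebook.

```python
import pandas as pd

# Toy rows; column names are assumed.
df = pd.DataFrame(
    {
        "starttime": pd.to_datetime(["2018-01-01 08:15:00", "2018-01-04 17:40:00"]),
        "birth year": [1975, 1990],
        "start station name": ["Grove St PATH", "Hamilton Park"],
        "end station name": ["Hamilton Park", "Newport PATH"],
    }
)

# Hour of day (0-23) at which the trip started.
df["hour"] = df["starttime"].dt.hour

# Approximate age at the time of the trip (assumed formula).
df["age"] = df["starttime"].dt.year - df["birth year"]

# Route label built from the start and end station names.
df["route"] = df["start station name"] + " - " + df["end station name"]

print(df[["hour", "age", "route"]])
```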

We continue by grouping users into generations by age, creating a function that maps each age to a generation: Gen Z, Millennial, Gen X, Boomers, or Traditional.

Figure 5: Data Wrangling Process 2.3
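The mapping function could be sketched roughly as follows. Only the Millennial (26–39) and Gen X (40–55) ranges are stated in this article; the other cut-offs, the function name age_to_generation, and the column name age_category are assumptions for illustration.

```python
import pandas as pd


def age_to_generation(age):
    """Map an age in years to a generation label.

    The 26-39 and 40-55 ranges come from the article; the remaining
    boundaries are assumed for this sketch.
    """
    if age <= 25:
        return "Gen Z"
    if age <= 39:
        return "Millennial"
    if age <= 55:
        return "Gen X"
    if age <= 74:
        return "Boomers"
    return "Traditional"


# Toy ages, one per generation.
df = pd.DataFrame({"age": [22, 30, 47, 60, 80]})
df["age_category"] = df["age"].apply(age_to_generation)

print(df)
```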

Finally, we can add a column that holds the trip duration in minutes.

Figure 6: Data Wrangling Process 2.4
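Assuming the raw duration is stored in seconds in a column named "tripduration" (both the unit and the name are assumptions here), the conversion is a single division:

```python
import pandas as pd

# Toy durations in seconds; column name and unit are assumed.
df = pd.DataFrame({"tripduration": [300, 930, 3600]})

# Convert seconds to minutes.
df["duration_min"] = df["tripduration"] / 60

print(df)
```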

Data Exploration

  1. Which age group and which gender use the Citybike rental the most?
Figure 7: Data Exploration Process 3.1

To answer this question, we first group the data by the age, gender, and age category columns and count the bikeID values. Gender was previously encoded in numeric form: 0 for unknown, 1 for male, and 2 for female. The data exploration process above gives the following information.
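A simplified sketch of this group by, using made-up rows and grouping only by age category and gender, might look like this. The column names age_category and gender_label are assumptions; the 0/1/2 gender encoding is the one described above.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame(
    {
        "bikeID": [101, 102, 103, 104, 105],
        "gender": [1, 1, 2, 0, 1],
        "age_category": ["Gen X", "Gen X", "Millennial", "Gen X", "Millennial"],
    }
)

# Decode the numeric gender column: 0 unknown, 1 male, 2 female.
gender_labels = {0: "unknown", 1: "male", 2: "female"}
df["gender_label"] = df["gender"].map(gender_labels)

# Count trips (bikeID entries) per age category and gender, largest first.
usage = (
    df.groupby(["age_category", "gender_label"])["bikeID"]
    .count()
    .reset_index(name="total")
    .sort_values("total", ascending=False)
    .reset_index(drop=True)
)

print(usage)
```

The first row of usage is then the age category and gender combination with the most trips.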

Figure 8: Output of Data Exploration Process 3.1

The age group that uses the Citybike rental the most is Gen X, with an age range of 40–55 years, followed by the Millennial age category, with a range of 26–39. The gender that uses the Citybike rental the most is male. We can then visualize the data using Seaborn.

Figure 9: Data Visualization from Data Exploration Process 3.1

2. How many trips (borrowed bicycles) are there on each day from Monday to Sunday, and at which hours?

Figure 10: Data Exploration Process 3.2

For question 2, we use group by, which aggregates a data set over a chosen set of groups. Here the data is grouped by the dayname and hour columns and counted over the gender column. After that, we can visualize the result.
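As an illustration with made-up rows, the grouping and a lookup of the busiest hour per day could be sketched like this. The names trips and busiest are hypothetical, and the busiest-hour step is an extra convenience that the article does not describe.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame(
    {
        "dayname": ["Thursday", "Thursday", "Thursday", "Friday", "Friday", "Tuesday"],
        "hour": [8, 8, 16, 8, 8, 8],
        "gender": [1, 2, 1, 1, 0, 2],
    }
)

# Count trips per (dayname, hour) by counting entries of the gender column.
trips = df.groupby(["dayname", "hour"])["gender"].count().reset_index(name="total")

# For each day, keep the row with the largest trip count.
busiest = trips.loc[trips.groupby("dayname")["total"].idxmax()]

print(trips)
print(busiest)
```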

Figure 11: Output of Data Exploration Process 3.2

The results show that Thursday has the highest total daily borrowing of bicycles, with 3,076 transactions, and the busiest borrowing times on that day are 08 AM, 04 PM, and 06 PM. Friday is in second place with 950 transactions, and its busiest borrowing time is 08 AM. Tuesday comes next with 924 transactions, and its busiest borrowing time is also 08 AM.

Figure 12: Data Visualization from Data Exploration Process 3.2

3. On which days do the most bicycle borrowings occur?

Figure 12: Data Exploration Process 3.3

To find out which days have the most bicycle borrowings, we first convert the date string into a datetime object and extract the day name from it (with strftime). Then we group the data by the dayname column and count over the gender column.

Figure 13: Output of Data Exploration Process 3.3

We can visualize the data with a barplot by passing in the total_by_day data and setting x to dayname and y to total.
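A small sketch of the aggregation and the Seaborn barplot is given below, with made-up rows. The axis labels are my own additions, and the Agg backend is selected only so that the snippet also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy rows for illustration only.
df = pd.DataFrame(
    {
        "dayname": ["Monday", "Thursday", "Thursday", "Thursday", "Friday", "Friday"],
        "gender": [1, 1, 2, 0, 1, 2],
    }
)

# Total trips per day, counted over the gender column.
total_by_day = df.groupby("dayname")["gender"].count().reset_index(name="total")

# Bar chart with the day on the x axis and the trip total on the y axis.
ax = sns.barplot(data=total_by_day, x="dayname", y="total")
ax.set_xlabel("Day of week")
ax.set_ylabel("Number of trips")
plt.tight_layout()
```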

Figure 14: Data Visualization from Data Exploration Process 3.3

4. What are the 5 favorite departure stations, and at which times (hours) are bikes rented there most often?

Figure 15: Data Exploration Process 3.4

To find the 5 favorite departure stations and their favorite borrowing hours, we first group the data by the start station name and hour columns and count the bikeID column.

Figure 16: Output of Data Exploration Process 3.4

We then visualize the data obtained using the barplot by filling in x as the start station name, y as the total, and hue as the hour.
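The grouping step behind this chart could be sketched as follows, using made-up rows. The intermediate names by_station_hour, top_stations, and favorite_hours are hypothetical, and picking the top 5 with nlargest is one possible approach rather than the notebook's confirmed method.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame(
    {
        "bikeID": range(1, 9),
        "start station name": ["Grove St PATH"] * 4 + ["Hamilton Park"] * 3 + ["Sip Ave"],
        "hour": [18, 18, 8, 18, 8, 8, 17, 18],
    }
)

# Trips per (station, hour), counted over bikeID.
by_station_hour = (
    df.groupby(["start station name", "hour"])["bikeID"].count().reset_index(name="total")
)

# The (up to) five stations with the most departures overall.
top_stations = by_station_hour.groupby("start station name")["total"].sum().nlargest(5).index

# Within those stations, the hour with the most departures.
top = by_station_hour[by_station_hour["start station name"].isin(top_stations)]
favorite_hours = top.loc[top.groupby("start station name")["total"].idxmax()]

print(favorite_hours)
```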

Figure 17: Data Visualization Process from Data Exploration Process 3.4

So, the 5 favorite departure stations and their favorite borrowing times are: Grove St PATH, at 05 AM, 06 AM, 07 PM, and 08 PM; Hamilton Park, at 08 AM; Columbus Dr at Exchange Pl, at 05 PM; Sip Ave, at 06 PM; and Newport PATH, at 05 PM.

Figure 18: Data Visualization from Data Exploration Process 3.4

5. Does a person’s age affect the trip duration in minutes?

To see whether a person’s age affects the trip duration in minutes, we can use a linear regression model. First we split the data into training and testing sets, then we initialize the linear regression model and fit it on the training data. Finally, we test and evaluate the model.

Figure 19: Data Exploration Process 3.5
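To make the steps concrete, here is a minimal sketch on synthetic data. It uses NumPy's least-squares line fit (np.polyfit) as a stand-in, since the article does not say which library the notebook used; the 80/20 split, the random seed, and the generated ages and durations are all assumptions and do not reproduce the real results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: ages and durations drawn independently of each other.
n = 200
age = rng.integers(18, 75, size=n).astype(float)
duration_min = rng.normal(loc=10.0, scale=4.0, size=n).clip(min=1.0)

# Shuffle the row indices and split them 80/20 into training and testing sets.
idx = rng.permutation(n)
split = int(0.8 * n)
train, test = idx[:split], idx[split:]

# Fit duration = slope * age + intercept on the training set.
slope, intercept = np.polyfit(age[train], duration_min[train], deg=1)

# Predict on the held-out set and compute the Mean Absolute Error.
pred = slope * age[test] + intercept
mae = np.mean(np.abs(duration_min[test] - pred))

print(f"slope={slope:.4f}, intercept={intercept:.2f}, MAE={mae:.2f} minutes")
```

A slope close to zero means the fitted line is nearly flat, which is what one would expect when age carries little information about trip duration.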

We can then visualize the relationship.

Figure 20: Data Visualization Process from Data Exploration Process 3.5
Figure 21: Data Visualization from Data Exploration Process 3.5

The visualization above shows no pattern relating user age to borrowing time, so the user’s age does not appear to affect the duration of the bicycle borrowing (trip duration in minutes). The Mean Absolute Error (MAE) of the model is 4 minutes relative to the durations in the original data.

Conclusion

In this article I analyzed the bike sharing rental data. Here is a summary of what I found.

  1. The age group that uses the Citybike rental the most is Gen X, with an age range of 40–55 years, and the gender that uses it the most is male.
  2. Thursday has the highest total daily borrowing of bicycles, with 3,076 transactions, and the busiest borrowing times on that day are 08 AM, 04 PM, and 06 PM. Friday is in second place with 950 transactions, and its busiest borrowing time is 08 AM. Tuesday comes next with 924 transactions, and its busiest borrowing time is also 08 AM.
  3. The day that has the most bicycles borrowed is Thursday with a total of 8,470 transactions.
  4. The 5 favorite departure stations for borrowing bicycles and favorite times are:

i.) Grove St PATH, where the favorite borrowing times are 05 AM, 06 AM, 07 PM, and 08 PM.
ii.) Hamilton Park, where the favorite borrowing time is 08 AM.
iii.) Columbus Dr at Exchange Pl, where the favorite borrowing time is 05 PM.
iv.) Sip Ave, where the favorite borrowing time is 06 PM.
v.) Newport PATH, where the favorite borrowing time is 05 PM.

  5. The user’s age does not appear to affect the duration of the bicycle borrowing (trip duration in minutes), because the visualization shows no pattern relating user age to borrowing time. The Mean Absolute Error (MAE) of the model is 4 minutes relative to the durations in the original data.

These are some of the key takeaways I got from analyzing the data. I hope they also help you gain some insights into Bike Sharing Rental.

Thank you for reading my article! Feel free to leave a comment below, and please do not hesitate to connect and leave a message on my LinkedIn profile if you want to ask about anything.

The full data analysis process using the Bike Sharing Rental dataset is available in the Google Colab notebook that I made for this article.
