Online Sales Data Project Overview

Project Objectives

Cleaning the data.
Analysing sales performance and customer behaviour.
Visualizing trends and invistigating returned items.

About the dataset

The dataset comprises anonymized data on online sales transactions, capturing various aspects of product purchases, customer details, and order characteristics. This dataset can be utilized for analyzing sales trends, customer purchase behavior, and order management in e-commerce or retail.
Data Link HERE, Data usability is 8.82 and License is CCO:Public Domain.

columns discription

InvoiceNo: Number of the purchase in sequence.
StockCode: The code where the item is stored in the warehouse.
Description: Name of the item.
Quantity: Item amount per purchase.
InvoiceDate: Date of the purchase.
UnitPrice: price per unit within the purchase.
CustomerID: Register id for the customer.
Country: Country of the purchase.
Discount: Amount reduced per unit of the purchase.
PaymentMethod: How the payment was made.
ShippingCost: Fees to ship each order.
Category: Classification of product purchased.
SalesChannel: How the purchase was made, online or offline.
ReturnStatus: Whether the order was delivered successfully or returned.
ShipmentProvider: name of the company to ship the orders.
WarehouseLocation: Where the product was stored before shipping.
OrderPriority: Importance of the order, high, medium or low.
							

1- Libraries & tools Used

								
		Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

		MySQL Workbench

		Excel
								
							

2- data cleaning and preprocessing

Data usability was 8.2, no duplicates were found and no typing mistakes.
Cleaning steps involved filling nulls in 3 columns, reducing decimals, changing data types, trimming in one column, correcting negative values in 2 columns & reshaping a column entirely.
All cleaning process was performed repeatedly using Python, SQl and Excel.

2-1 Reducing Decimals


Discount column had more than 6 decimals within some values that i will reduce to 2.
									
		Python
df['Discount'] = df['Discount'].round(2)

		SQL
UPDATE online_sales_2
SET Discount = round(Discount, 2);
									
								

2-2 Changing Data Types

									
		Python
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['CustomerID'] = df['CustomerID'].astype('str')
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

		SQL
ALTER TABLE online_sales_2
MODIFY COLUMN InvoiceNo text,
MODIFY COLUMN CustomerID text,
MODIFY COLUMN InvoiceDate datetime;
									
								

2-3 Trimming Customer ID column


CustomerID column values had extra (.0) after the Id number, which needed to be removed
									
		Python
df["CustomerID"] = df["CustomerID"].str.rstrip('0').str.rstrip('.')

		SQL
UPDATE online_sales_2 
SET CustomerID = REPLACE (CustomerID, '.0', '');
									
								

2-4 Turning negative Values into positive


Quantity & unitPrice columns had some negative values.
									
		Python
df['Quantity'] = df['Quantity'].abs()
df['UnitPrice'] = df["UnitPrice"].abs()

		SQL
UPDATE online_sales_2
	SET 
	Quantity = abs(Quantity),
	UnitPrice = abs(UnitPrice);

		Excel
=IF(D2 < 0,(D2*-1),D2)
									
								

2-5 Reshaping Category Column


Category column values were totally messed, all items were calssified as all categories.
									
		Python
df['Category'] = df['Description'].apply(lambda x: 'Apparel' if x == 'T-shirt' else 
		'Furniture' if x == 'Office Chair' else
		'Stationary' if x in ['Notebook', 'Blue Pen'] else
		'Accessories' if x in ['Backpack', 'Desk Lamp', 'Wall Clock', 'White Mug'] else
		'Electronics' if x in ['Wireless Mouse', 'USB Cable', 'Headphones'] else
		'unknown')

		SQL
UPDATE online_sales_2
	SET Category = 
	CASE
		WHEN `Description` = 'T-shirt' THEN 'Apparel'
        WHEN `Description` = 'Office Chair' THEN 'Furniture'
        WHEN `Description` IN ('Notebook', 'Blue Pen') THEN 'Stationary'
        WHEN `Description` IN ('Wireless Mouse', 'USB Cable', 'Headphones') THEN 'Electronics'
        WHEN `Description` IN ('Backpack', 'Desk Lamp', 'Wall Clock', 'White Mug') THEN 'Accessories'
	END;

		Excel
(=IFS(C2="T-shirt","Apparel",C2="Office Chair","Furniture",C2="Notebook","Stationary",C2="Blue Pen","Stationary"
,C2="Backpack","Accessories",C2="Desk Lamp","Accessories",C2="Wall Clock","Accessories",C2="White Mug","Accessories"
,C2="Wireless Mouse","Electronics",C2="USB Cable","Electronics",C2="Headphones","Electronics"))
									
								

2-6 Filling Nulls


CustomerID, ShippingCost & WarehouseLocation columns had nulls.
For CustomerID column I will replace nulls with (unknown).
									
		Python
df['CustomerID'].replace('nan', 'unknown', inplace = True)

		SQL
UPDATE online_sales_2
	SET CustomerID = 'Unknown'
	WHERE CustomerID = '';
										
		Excel
=IF(G2="","Unknown", G2)

									
								

For ShippingCost & WarehouseLocation nulls,i will Use forward fill Method.
									
		Python
df['ShippingCost'].fillna(method = 'ffill', inplace = True)
df['WarehouseLocation'].fillna(method="ffill" , inplace = True)

		SQL
	-- 1- creating a new column row_num to use as an index
ALTER TABLE online_sales_2 ADD COLUMN row_num INT;
SET @row_number = 0; 
UPDATE online_sales_2 SET row_num = (@row_number := @row_number + 1) ORDER BY invoicedate;
		
	-- 2- Creating a common table expression to group null cell with the nearest non null cell above
WITH cte1 AS 
(
SELECT ShippingCost, row_num, COUNT(ShippingCost) OVER (ORDER BY row_num) AS nulls_grouped
FROM online_sales_2
),  -- 3- Creating another CTE to make a column with ShippingCost nulls filled
cte2 AS 
(
SELECT row_num, ShippingCost, nulls_grouped, 
FIRST_VALUE(ShippingCost) OVER (PARTITION BY nulls_grouped ORDER BY row_num) AS new_shippingcost
FROM cte1
)  -- 4- Using a join to update online_sales_2 from the new filled column created as a cte (cte2)
UPDATE online_sales_2
JOIN cte2 
ON online_sales_2.row_num = cte2.row_num
SET online_sales_2.ShippingCost = cte2.new_shippingcost
WHERE online_sales_2.ShippingCost IS NULL;  -- Notice that the three steps are in one query without ;
-- 5- Same procedure for WarehouseLocation nulls

		Excel
Select the column with nulls - Click home from ribbon - Click Find & Select from Editing section - 
go to special - tik on blanks - all null cells will be selected - type = and arrow up - then Ctrl+Enter.
									
								

before moving to the next step I needed to figure out whether the Discount valus is per unit or per order, and whether ShippingCost is paid by Client or the company.
I followed four steps
-1 How proportionate is the discount to unit price and order price?
-2 If there is an invoice repeated, does it has the same discount applied?
-3 Will there be negative values if the product povider pays for shipping?
-4 Business Logic.

Discount value was more proportionate with UnitPrice, same invoice had different discount values, over 500 negative values if company pays the shipping and business logic is discount is per unit while client pays for shipping.
Decision is Discount is per unit and client pays ShippingCost.

3- Feature Engineering

Feature Engineering is the process of creating new features or transforming existing features to improve the performance the data.

3-1 Python

									
df['QuantityPrice'] = (df.Quantity * df.UnitPrice).round(2)
df['OrderDiscount'] = (df.Quantity * df.Discount).round(2)
df['OrderNetAmount'] = (df.QuantityPrice - df.OrderDiscount).round(2)
df['InvoiceAmount'] = (df.OrderNetAmount + df.ShippingCost).round(2)
df['OrderYear'] = df.InvoiceDate.dt.year.astype(str)
df['OrderMonthName'] = df.InvoiceDate.dt.month_name()
df['OrderMonth'] = df.InvoiceDate.dt.month
df['OrderDay'] = df.InvoiceDate.dt.day.astype(str)
df['OrderTime'] = df.InvoiceDate.dt.time
df['OrderDayOfWeek'] = df.InvoiceDate.dt.day_name()
									
								

3-2 SQL


Here I created new columns contaning numbers parallel to a distinct value of another column to minimize memory usage.
Then i will create new tables of distinct values like countries that i can use later using joins.
Lastly deleting those columns while keeping the indexes columns will minimize data usage.
Example
									
	-- Creating the columns
ALTER TABLE online_sales_2 
ADD COLUMN (
DescriptionIndex text,
CountryIndex text,
PaymentMethodIndex text,
CategoryIndex text,
SalesChannelIndex text,
ReturnStatusIndex text,
ShipmentProviderIndex text,
WarehouseLocationIndex text,
OrderPriorityIndex text
);
	-- Filling a column
UPDATE online_sales_2
SET SalesChannelIndex = 
CASE
WHEN SalesChannel = 'In-store' THEN '1'
WHEN SalesChannel = 'Online' THEN '2'
END;
	-- Creating new table and accommodating data
CREATE TABLE sales_channel (
SalesChannelIndex TEXT,
SalesChannel TEXT)
	AS
SELECT DISTINCT SalesChannelIndex, SalesChannel FROM online_sales_2;
	-- Deleting columns
ALTER TABLE online_sales_2 
DROP COLUMN SalesChannel;

									
								

Same procedure was performed to all columns created.

4- Data Visualization & EDA

4-1 Product Performance

4-1-1 Revenue generated by item

Visualizing the revenue generated for each item helps with understanding which product performed well and what product needs solutions for inhanced sales

4-1-2 Quantity Sold By Product

Visualizing revenue generated for items is not enough to draw a conclusion about products performance, quantity sold helps with making the pic more clear.
Product quantity sold included within the same chart above.

4-1-3 Products Price Range

All of the items sold nearly the same quantity and generated very close revenue, despite the differences in usage, sizes,materials and most importantly the production cost 'not available in the data'. that requires a further look in products price range and averages.


Equal price distribution for all items which is the reason behind equal revenue generated alongside the same performance of quantities sold.

4-2 Sales trends

Visualizing complex data trends makes it easier to stakeholders to understand and make strategic decisions as it
1- Identifies upwards and downwards over time.
2- Reveals seasons patterns and cycle behaviour.
3- Allows comparison against past performance.
4- Helps in spotting anomalies or outliers.
5- Helps with predicting future trends and planning for it.

4-2-1 Monthly Trends and yearly Growth


No monthly trends and no growth over years.

4-2-2 Day Of The Week Trends


No trends can be seen.

4-2-3 Purchase Time Trends


No trends to be seen

4-3 Customer Analysis

Customer analysis has a crucial role in business success due to many reasons like,
1- Indentify customers segements and their needs toward more precise marketing efforts.
2- Understanding trends and customer behaviour toward better future planning.
3- Helps in strategic pricing decisions based on customer behaviour.
4- It guides toward product improvement or new product development.

4-3-1 Customer Distribution & Sales Channel preference

4-4 Returned Items Analysis

Despite being within the normal range of returned ites, analyzing it provides insights that can help to improved product quality, customer satisfaction, cost savings, and strategic business adjustments

4-4-1 Distribution Of Returned items


Here the aim is to identify if an item was returned with a higher rate, which may lead to a potential production issues.


No product was returned at a noticeable higher rate than other products, that means no production flaws that made a certain product defective.

4-4-2 Warehouse Returns


A warehouse may ship a damaged product


No certain warehouse damaged items with an increased rate before shipping as returned items are evenly distributed between warehouses.

4-4-3 Shipment Provider Returns


Damages, delays or even underperforming


Returned items are distributed even between shipping providers, hence no signs of a certain shipping provider delivering products with damages nor the behaviour of shipping employees is inappropriate.

5 Findings

5-1 Product Performance


All products performed the same when it comes to quantity sold and revenue generated despite being different in sizes, usage and production cost, reason is the same average price for all products.

5-2 Sales trends


Visualizing the count of purchases against day hours, days, weeks, month and quarter did not show any signs of purchasing trends, all purchasing patterns are stable.

5-3 Customer Analysis


Customers are evenly distributed between the 12 countries within the data.
No customer preference for payment method nor sales channel.

5-4 Returns Analysis


No certain warehouse damaged items with an increased rate before shipping as returned items are evenly distributed between warehouses.
No product was returned at a noticeable higher rate than other products, that means no production flaws that made a certain product defective.
Returned items are distributed even between shipping providers, hence no signs of a certain shipping provider delivering products with damages nor the behaviour of shipping employees is inappropriate.

6 Implications

6-1 Product Performance


The uniform performance despite the differences between products implies that pricing strategy is the key and it suggests that the average price being the same might be a key driving equal performance across diverse products.
Pricing strategy should be investigated to decide whether same pricing is more effective toward the best performance or the differentiated pricing based on product characteristics.

6-2 Sales trends


Stable purchasing patterns across time frames suggests a stable customers behaviour which indicates a mature market where consumer habits are not affected with external factors.

6-3 Customer Analysis


Even distribution across the 12 countries suggests the wide geographical reach, it may implies a strong global distributing brand.
The equal preference between sales channels and payment method implies the success of making different options of payments and sales channels of an equal value, liability and accessibility.

6-4 Returns Analysis


The even distribution of the returned items implies a high level in production quality, robust quality control and handling at warehouses and a reliable shipping providers with consistency in shipping services with no underperforming.


!!! The overall result that distribution is stable in all aspects might suggest that data is generated in a controlled environment or represents a distributer with high standards in operations and a robust efficiency beside a stable long term contracts.