Simple Sales Case Study: Designing a Solution from Ingestion to Data Analytics Visualization with a Metabase Dashboard

ISHMEET KAUR
5 min read · Oct 4, 2020

Aim: Given the five provided CSV sales files, ingest the data, transform it, load it into master tables (star/snowflake schema), and display KPIs with a BI tool.

Tools used: MySQL running in Docker, Metabase running in Docker, a cloud virtual machine, and Python's pandas library.

Analyzing the provided dataset:

1. Salesman code can repeat, depending on whether the salesman is active or not.

2. Customer code can repeat, depending on the customer's addr1, addr2, email ID, phone, and active status.

3. Product code can remain the same while the product price differs.
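These repeats are easy to surface with pandas before designing the keys. The sketch below reuses the customer file and column names from the load script later in this post; the Product.csv name and its Product Code / Product Price columns are assumptions.

import pandas as pd

# Customer codes that appear on more than one row (address, phone or active flag changed).
cust = pd.read_csv('/home/cloud_user/ey_files/Customer.csv', encoding='utf-8')
repeated_customers = cust[cust.duplicated(subset=['Cust Code'], keep=False)]
print(repeated_customers.sort_values('Cust Code').head())

# Product codes whose price differs across rows (file and column names are assumptions).
prod = pd.read_csv('/home/cloud_user/ey_files/Product.csv', encoding='utf-8')
price_variants = prod.groupby('Product Code')['Product Price'].nunique()
print(price_variants[price_variants > 1])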

Solution design used for the implementation:

Solution design with the provided files loaded into the master database.

Why is Docker used for this design?

Containers don't waste physical resources: they don't have a separate kernel, so no resources need to be allocated for one.

You can't run Linux containers natively on Windows or Mac, because they need a Linux kernel to operate. The solution for Mac and Windows users is to install a type 2 hypervisor such as VirtualBox or VMware Workstation, boot up a Linux machine, and then run the Linux containers inside it (in fact, that's what Docker for Mac and Docker for Windows do, except they use the native hypervisors that ship with the respective OS).

Setting up and running Linux containers by hand is not straightforward: it is troublesome and requires a decent knowledge of Linux, and managing them is even more tedious.

As mentioned above, what Docker, Inc. does is make Linux containers easy to use and available to everybody; thanks to Docker, you no longer have to be a Linux geek to use them.

Why is a Metabase dashboard used to visualize the data?

It starts in a couple of clicks with the hosted version, or you can use Docker to get up and running on your own for free. It connects to the existing data easily and gives you a BI solution for the sales data with far less setup than traditional tools.

Installing Metabase in a new container:

https://www.metabase.com/docs/latest/operations-guide/running-metabase-on-docker.html

Mounting a mapped file storage volume:

https://www.metabase.com/docs/latest/operations-guide/running-metabase-on-docker.html

How to configure Metabase for the data source:

https://www.metabase.com/docs/v0.36.5.1/setting-up-metabase.html

Solution design implementation steps:

1. The CSV files need to be placed on the cloud virtual machine at the path home/cloud_user/ishmeet/ey_files.
2. A cron job checks for the files every minute and, if they are present, executes the Python script.
3. The Python script reads each file into a pandas DataFrame, cleans the data, and keeps it in the DataFrame.

import shutil
import uuid
import pandas as pd
from sqlalchemy import create_engine

# Connection string is illustrative; point it at the MySQL container's stage database.
engine = create_engine('mysql+pymysql://user:password@localhost:3306/stage')

# Read the raw file and keep only the expected columns.
cust_data = pd.read_csv(r'/home/cloud_user/ey_files/Customer.csv', encoding='utf-8')
cust_df = pd.DataFrame(cust_data, columns=['Distrib ID', 'Cust Code', 'Cust Name', 'Cust Addr1', 'Cust Addr2', 'Cust Addr3', 'Email', 'Phone', 'Active', 'Date Created', 'Date Updated'])

# Drop completely empty rows, stamp the load time, and give each row a surrogate key.
cust_df.dropna(how="all", inplace=True)
cust_df['CreateDate'] = pd.to_datetime('today')
cust_df['UniqueKey'] = [str(uuid.uuid4()) for _ in range(len(cust_df))]

# Append into the staging table, then move the processed file out of the inbox.
cust_df.to_sql('customer', con=engine, if_exists='append', chunksize=1000, index=False)
shutil.move('/home/cloud_user/ey_files/Customer.csv', '/home/cloud_user/ey_files_processed/Customer.csv')

For the master customer, salesman, and product tables, make sure that customer codes and salesman codes already present in the master are updated rather than re-inserted, and maintain triggers on these tables so that record updates are captured. If a record is not present, insert it (a sketch of this upsert follows).
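The sketch below assumes a stage database named stage, a master database named master, and a unique key on (DistribID, CustCode) in master.customer; these names and the column list are illustrative rather than taken from the original schema. It reuses the SQLAlchemy engine created above.

from sqlalchemy import text

# Illustrative upsert: insert new customer codes, update existing ones in place.
upsert_customer = text("""
    INSERT INTO master.customer (DistribID, CustCode, CustName, Email, Phone, Active, UniqueKey)
    SELECT `Distrib ID`, `Cust Code`, `Cust Name`, `Email`, `Phone`, `Active`, `UniqueKey`
    FROM stage.customer
    ON DUPLICATE KEY UPDATE
        CustName = VALUES(CustName),
        Email    = VALUES(Email),
        Phone    = VALUES(Phone),
        Active   = VALUES(Active)
""")

with engine.begin() as conn:
    conn.execute(upsert_customer)

With the unique key in place, an UPDATE trigger on the master table can then log the old values to a history table whenever the upsert modifies an existing record.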

4. The script first loads each file into the staging table of the stage database in the MySQL Docker container.
5. After loading the staging database, it executes a stored procedure that loads the data into the master database's dimension and fact tables, ensuring no duplicate data is loaded (see the sketch after this list).
6. Once the data is loaded into the master tables, the KPIs can be analyzed with the Metabase Docker service at the cloud server's public IP address on port 3000.
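Steps 4 and 5 can be chained at the end of the same Python script. The stored procedure name load_master_from_stage below is hypothetical; it stands in for whatever procedure performs the dimension and fact loads without duplicating data.

from sqlalchemy import text

with engine.begin() as conn:
    # Hypothetical stored procedure that moves the freshly staged rows into the
    # master dimension and fact tables, skipping rows that already exist.
    conn.execute(text("CALL load_master_from_stage()"))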

Sales figures per salesman

SELECT `Salesman Ky`.`SalesmanName` AS `SalesmanName`, sum(`sales`.`ProductQty`) AS `sum`, sum(`product__via__Product_KY`.`ProductPrice`) AS `sum_2`
FROM `sales`
LEFT JOIN `salesman` `Salesman Ky` ON `sales`.`Salesman_KY` = `Salesman Ky`.`UniqueKey`
LEFT JOIN `product` `product__via__Product_KY` ON `sales`.`Product_KY` = `product__via__Product_KY`.`Uniquekey`
WHERE `Salesman Ky`.`Active` = 'Yes'
GROUP BY `Salesman Ky`.`SalesmanName`
ORDER BY `Salesman Ky`.`SalesmanName` ASC

Sales figures per customer

SELECT `Customer Ky`.`CustName` AS `CustName`, sum(`sales`.`ProductQty`) AS `sum`, sum(`product__via__Product_KY`.`ProductPrice`) AS `sum_2`
FROM `sales`
LEFT JOIN `customer` `Customer Ky` ON `sales`.`Customer_KY` = `Customer Ky`.`UniqueKey`
LEFT JOIN `product` `product__via__Product_KY` ON `sales`.`Product_KY` = `product__via__Product_KY`.`Uniquekey`
GROUP BY `Customer Ky`.`CustName`
ORDER BY `Customer Ky`.`CustName` ASC

Sales for each product

SELECT `Product Ky`.`ProductName` AS `ProductName`, sum(`sales`.`ProductQty`) AS `sum`
FROM `sales`
RIGHT JOIN `product` `Product Ky` ON `sales`.`Product_KY` = `Product Ky`.`Uniquekey`
GROUP BY `Product Ky`.`ProductName`
ORDER BY `sum` DESC, `Product Ky`.`ProductName` ASC

Stock level for each product

SELECT `Product`.`ProductName` AS `ProductName`, sum(`stock`.`ProductQty`) AS `sum`
FROM `stock`
RIGHT JOIN `product` `Product` ON `stock`.`Product_Ky` = `Product`.`Uniquekey`
GROUP BY `Product`.`ProductName`
ORDER BY `sum` DESC, `Product`.`ProductName` ASC

Sales (year vs. total revenue)

SELECT makedate(year(`sales`.`BillDate`), 1) AS `BillDate`, sum(`product__via__Product_KY`.`ProductPrice`) AS `sum`
FROM `sales`
LEFT JOIN `product` `product__via__Product_KY` ON `sales`.`Product_KY` = `product__via__Product_KY`.`Uniquekey`
GROUP BY makedate(year(`sales`.`BillDate`), 1)
ORDER BY makedate(year(`sales`.`BillDate`), 1) ASC

Total products sold each year

SELECT makedate(year(`sales`.`BillDate`), 1) AS `BillDate`, sum(`sales`.`ProductQty`) AS `sum`
FROM `sales`
GROUP BY makedate(year(`sales`.`BillDate`), 1)
ORDER BY makedate(year(`sales`.`BillDate`), 1) ASC

Total sales done by active salesmen

SELECT `Salesman Ky`.`SalesmanName` AS `SalesmanName`, sum(`sales`.`ProductQty`) AS `sum`
FROM `sales`
LEFT JOIN `salesman` `Salesman Ky` ON `sales`.`Salesman_KY` = `Salesman Ky`.`UniqueKey`
WHERE `Salesman Ky`.`Active` = 'Yes'
GROUP BY `Salesman Ky`.`SalesmanName`
ORDER BY `Salesman Ky`.`SalesmanName` ASC

Top 10 selling products

SELECT `product__via__Product_KY`.`ProductName` AS `ProductName`, sum(`sales`.`ProductQty`) AS `sum`
FROM `sales`
LEFT JOIN `product` `product__via__Product_KY` ON `sales`.`Product_KY` = `product__via__Product_KY`.`Uniquekey`
GROUP BY `product__via__Product_KY`.`ProductName`
ORDER BY `sum` DESC, `product__via__Product_KY`.`ProductName` ASC
LIMIT 10

Future architecture design on the cloud for big data

1. Securely store the data in Azure Data Lake Storage with the help of Azure Data Factory or Azure Database Migration Service.

2. With the prepared data model, extract, load, and transform the data with the help of Databricks or the mapping data flow feature of Azure Data Factory, keeping it securely in Azure Data Lake Storage.

3. Finally, model and serve the data in Azure Synapse Analytics / Azure SQL Data Warehouse for analysis, and in parallel load the data into Azure Cosmos DB so that real-time web and mobile applications can access it in a scalable and flexible way.

4. Maintain a dashboard with Power BI after performing analytics on the warehouse data.

5. The dashboard can be maintained over the real-time transaction storage and transformation reports, with proper auditing and quality checks.

How would you extend this architecture to support real-time data ingestion?
Complex event processing?
