Olist E-commerce: An Explanatory Data Analysis on the Brazil Marketplace


A few weeks ago, I joined a community named Datachallenge Space, where they usually host a monthly data analytics challenge. I participated in the April challenge on a public dataset called “Olist E-commerce”. The aim of this analysis is to test one’s ability to explore, clean, analyze, and visualize data. The dataset helps analysts like me who haven’t dealt with an eCommerce dataset before.

What is Olist?

Olist operates an online e-commerce site for sellers, that connects merchants and their products to the main marketplaces of Brazil. Olist’s technology enables small merchants to gain market share across the country through a SaaS licensing model for small brick-and-mortar businesses.

The company needs insight and recommendations on its sales and how the available data could help it improve its sales and visibility.

About the Olist Dataset

This Brazilian e-commerce order dataset was obtained from the Olist Store. The dataset contains information on 100,000 orders placed between 2016 and 2018 on several Brazilian marketplaces. It enables viewing orders from a variety of angles, including customer location, product attributes, order status, pricing, payment, and freight performance. A geolocation dataset that links Brazilian zip codes to lat/lng coordinates was also made available.

References to corporations and partners in the review language have been swapped out for the names of the great houses from Game of Thrones. This is actual commercial data that has been anonymized.

Use Case

Olist can use the insight and recommendations to improve their business sales and seller performances. It could also help the business understand how customers behave and how their feedback could affect the growth of the marketplace.

Data Analysis Process

  • Data Exploration
  • Data Cleaning
  • E-commerce Metrics
  • Perform Exploratory Data analysis (EDA).
  • Data Visualization
  • Communicating Insight and Recommendation


I explored the dataset with Microsoft Excel and found the following

  • Empty and Null
  • Unwanted special characters
  • The product category name is written in the Brazilian language (there is a separate table that contains the English version).
  • The Table and Columns title contains an underscore.
  • Error spelling


Data cleaning is a process by which inaccurate, poorly formatted, duplicate, invalid, or otherwise messy data is organized and corrected. This is the most important part of data analysis, as well as the most time-consuming. Well, I have done one that is even dirtier than this. You can check my sentiment analysis on Davido Timeless album, which took me 10 days to complete, and 6 days on data cleaning alone.

I imported the dataset into PowerQuery using UNICODE UTF 8 for Transformation/Cleaning, and cleaning. This Unicode helped me remove unwanted characters in the city column in the seller and geography tables.

Power Query is a data transformation and data preparation engine. Power Query comes with a graphical interface for getting data from sources and a Power Query Editor for applying transformations.

Renamed Tables and Columns

I started by renaming all eight tables that contain underscores that I am not satisfied with. I removed the underscores and renamed each table with a short, unique, and meaningful identity.

The column names are not written properly, and also have an underscore. I used the Power Query function to rename the columns once instead of renaming them one after the other….. A simple code did it

= Table.RenameColumns(#"Changed Type",{{"customer_id", "Customer Id"}, {"customer_unique_id", "Unique Id"}, {"customer_zip_code_prefix", "Zip Code"}, {"customer_city", "City"}, {"customer_state", "State"}})

I did this for all the tables and moved on to removing and replacing null and empty records.

Replacing and Removing Nulls and Empty Records

I started by filling Up/Down the empty date records in the order table, such as carrier date and delivered date. There are 610 null and empty records in the products table. The columns include category name, name length, description length, and photo Qty. I replaced the category with “Not Available” (N/A) and ‘0” for the remaining columns that are numerics.

I also replaced the review title and description with comments in the Review table.

Importing and Replacing Product Category Column Records

In the power query, I click on “New Source” at the HOME bar to import the new CSV file containing the English version of the product category names. After importing it, I replaced the special character (underscore) with “Space” and capitalized it, then merged it with the product table containing the product category column. I deleted the Brazilian language afterward.



I started by analyzing some important metrics for retail industry KPIs for the organization, which include:

  • Total Revenue
  • Total orders
  • Average Order value
  • Number of sellers
  • Average order per customer
  • Percentage of canceled orders
  • Average order seller

Before I forget, I had an activation code issue with Microsoft Excel before I started the analysis, so I used Power BI Dax and Power Query instead… E pain me sha but it also allows me to test Dax for the first time….🤣🤣🤣🤣

Kindly note that for every metric here, I created a new measure on Power Bi to perform the calculation.

Total Revenue

To calculate the total revenue of the business, I sum the payment value in the order payment together to get their total revenue.

Total Revenue = SUM('Order Payments'[Payment Value])

Total revenue is $16.01 million.

Note: Because I don’t know how the company makes its profit, I didn’t calculate the business profit. It is unclear, so I left that part out, but it might be from the seller’s price or the shipping fee.

Total Orders

To get the number of orders on Olist between 2016 and 2020 I used the count function to get the total orders on Olist.

Totalorders = COUNT('Order Items'[Item Id])

The total number of orders is 113,000.

Average Order Value (AOV)

The average order value (AOV) is the average dollar amount customers spend each time an order is placed on your e-commerce website. —– Shopify

Traditionally, online businesses that know their AOV focus on getting customers to spend past that threshold—for instance, if their AOV is $30, by offering free shipping on orders over $35. But while increasing order value this way can help your revenue stream, it may not bring you the highest profit margins due to the cut in shipping.

To calculate the average order value, I divide total revenue by the total number of orders.

AOV = [Total Revenue] /[No of orders]

The average order value is $161.

Number of sellers on Olist

I created a new measure to get the total number of sellers who are putting their products on Olist. To get this done, I count distinct sellers’ IDs.

No of sellers = DISTINCTCOUNT('Order Items'[Seller Id])

The total number of sellers is 3095.

Average order per customer

The average order per customer metric helps business owners and organizations know the average order that was placed by a single customer. This is the opposite of order value, which shows dollars spent on a single order. to get this calculation, also called a basket. I divided the total order by the total order payment.

Basket = [Totalorders] / COUNT('Order Payments'[Order Id])

The average order per customer is 1.08.

Percentage of canceled orders

This particular metric is very important to retail businesses as it shows how much people like their service, their products, their price, and their shipping fee. Although this is not the only metric to consider, it is powerful. To get the percentage of canceled orders, I first used the Count iF function to get the canceled orders from the order status in the orders table.

Unfortunately for me, Power Bi Dax does not support COUNT IF.. pressure tiwa.😁 Well, I did research, and I later used the below function. The below-nested function allows me to enter my table, and column, and filter what I need.

Canceled = CALCULATE(COUNTX(Orders, COUNT(Orders[Order Status])), Orders[Order Status] = "canceled")

It took me some time, but we did😁😁😁To get the percentage of orders that were canceled, I divide the canceled orders by the number of orders.

Percentage of Cancellation = Orders[Canceled] / Orders[No of orders]

Average orders per seller

This metric also helps business owners make informed decisions. It lets them know if they need more sellers or need to improve their sales and marketing. I divide the total orders by the number of sellers in the order item table (I counted it earlier, using the count function).

AVG order per sellers = [Totalorders] / [No of sellers]

Explanatory Data Analysis

Now let’s do the main the main lol. To help Olist gain reasonable insights into its marketplace platform, I provided answers to some business questions.

Let’s go

I used the PowerBI Metrics table to analyze my data. Actually, I should have visualized it directly, but my mumu self wants to use the metric table since it is my first time doing such on Power Bi and I don’t want to drag visualization into my analysis yet so I won’t get confused and carried away by “how should I visualize this or which chart will be good”. This is the pivot table of power Bi lol

What is the total revenue generated by Olist, and how has it changed over time?

We already have the total revenue, we only need to plot it to get the sales by months over the years. I added my total revenue into the value and purchase time (year and month) into the rows. It looks like this.

Metrics Table in Power Bi (Pivot table in Excel)

How many orders were placed on Olist, and how does this vary by month or season?

I added my already counted number of orders into the value and added the year and month into the row.

What are the most popular product categories on Olist, and how do their sales volumes?

To get the third question, I added total orders to the value and product category names to the row.

It’s worth saying that I didn’t merge some tables because Power Bi has automatically created relationships between them, which saved me time modeling or merging. Regardless, I still created relationships and merged.

What is the average order value (AOV) on Olist, and how does this vary by payment method?

I created my measure for AOV inside the order payment table, so there is no need for merging or creating relationships. Simply added my average order value (we created earlier) to the value and payment methods in the rows.

What is the distribution of seller ratings on Olist, and how does this impact sales performance?

Well, here I returned to my power query to group the review score in the review table into excellent, very good, good, poor, and weak. I divided the 1 – 5 rating into qualitative. I used the IF condition to get this done. IF the review score is equal to 5 then excellent. I used the conditional column function to make it faster.

IF Condition Column

After getting the rating distribution, I went back to Power Bi to perfume my analysis. We want to see how sellers’ ratings affect sales. I added the newly created column “Rating Distribution” into Row and my total revenue into Value.

Total Revenue by Rating Distribution

What is the average customer rating for products sold on Olist, and how does this impact sales performance?
I merged three tables together using a power query, order, product, and review. I then group the review score by product category. The power query is a group-by function that allows records to be grouped just like SQL.

Unable to Screenshot Now….

I created a relationship between the new table and the order payment so I could use the total revenue. I added my total revenue into Value and put it as average, including my average review, which I grouped, then added product category name into the rows.

Quick One: How to create a relationship between two tables in case Power Bi did not or you need one for a new table. On your Ribbon, go to Modeling, and click Manage Relationship, There you will see all your tables’ unique Id and their relationship tables. You can click on “New” to create a new relationship, and you can also click on the mark button to turn off or activate relationships.

New Relationship in Power BI

What are the top-selling products on Olist, and how have their sales trends changed over time?

To find the Top selling products, I inserted my total orders into Value and product category names into Rows. This is easy because we already created all these measures from the beginning.

Which payment methods are most commonly used by Olist customers, and how does this vary by product category?.

Well, credit cards are the most commonly used payment method, followed by Boleto, the country’s most popular payment method. I merged some tables together (I didn’t mention it), like order payment, order items, and product category. With this, I don’t need to create a relationship between order payment and product value. The funniest thing is, it is not possible because they don’t have the same unique ID, I will have to merge the order table so I can merge the product category table.

I inserted my sum payment value into Values, payment methods in columns, and product category names in rows.

AOV By Payment Method

How do customer reviews and ratings affect sales and product performance on Olist?

I added the average payment value and average rating into the value and product category names in the row to get the performance of sales in products.

Which geolocation has a high customer density?

I did count customer IDs I counted the customer ID as a numeric number (99441) and used qualitative data, the city as rows.

Olist Visualization


The exciting part of the analysis is when everything you do will look amazing… wawuu

Data visualization is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand and pull insights from. Everything we’ve been analyzing might look somehow to newbies or someone who doesn’t understand the role of an analyst, but with visuals, everyone’s na boss lol

Since I am ready in Power Bi, there is no need to import and export my analysis to any other tool. I visualized with PowerBI Desktop.

I used CARD, which displayed all my important single metrics. The card is useful to show a single numerical value or a metric. I also used a pie chart to display the most commonly used payment methods.

Payment Distribution

Line charts, stacked bar charts, clustered bar charts, clustered column bar charts, and line charts were all used for my other analyses.

Insight & Recommendation


  1. The review with the highest rating generated the most revenue.
  2. The average order cancellation rate is 0.63%, which shows that most buyers are satisfied with the products. Additionally, it demonstrates that customer orders were fulfilled on time, preventing cancellation.
  3. Credit cards are the most popular method of payment, followed by the country’s most popular payment option, boleto.
  4. It is important to keep in mind that Sao Paulo has the most clients because it is the most populous city in Brazil, which may also explain why it has the most customers.
  5. Bed Bath Table is the most popular product category on Olist-based order volume. While this is good, it is also good to note that Bed Bath is one of the lowest products ($89) on Olist.
  6. Health Beauty is the product category that brought in the highest revenue in 2017 ($1,45,959) and the Bed Bath Table Category is the product with the highest revenue in 2018 (784,289).
  7. The number of orders increased between May-August.ust Every year


  1. Olist should offer free shipping for customers with more than two orders, which could increase the average order value.
  2. Olist needs to target the right audience with their most expensive products and consider delivery costs. For instance, paying the delivery cost when purchasing a computer could be difficult.
  3. Olist should provide discounts between May and June to boost order volume, as sales during those times went up.
  4. The average order per customer and average order value are bad. The business should investigate the shipping fee and the product pricing if it is excessively high or pricey.


  • The dataset does not contain the date when a new customer registered which limited the customer analysis
  • Overall the dataset, To make a more sensitive analysis, I would need a longer period dataset with every information included such as customer and seller registration date, customer and seller’s name






I'm Damilare Damidez, A Data Analyst and Webmaster. I have been on the internet for over a decade. I help business owners to make data driven decision.

More Reading

Post navigation

Leave a Comment

Leave a Reply

Your email address will not be published. Required fields are marked *