Sally is the owner of a grocery store in a market place in a large Mediterranean city, where she sells fruit and vegetables. In order to register sales and print purchase tickets to give away to her customers, Sally uses a Point-of-Sale (POS) device.

Suppose that we take a look at Sally’s POS device, which contains tickets corresponding to all sales performed at the store in the last couple of years. Sally is interested in extracting intelligence out of this data. For instance, since each ticket has a billing amount associated with it, she could add up the billing of all her tickets and divide by the total number of tickets. The number that comes out is the *average ticket*, which is the first insight that may be extracted from the data. Often, it is the only one that shop owners look at.

Consider the following question. Suppose that Sally’s average ticket is 9€, and let us group the tickets according to whether they are above or below the average ticket. Which of the two sets would you guess is bigger? More precisely, which of the following is true?

1. Sally sells more tickets with billing above 9€.
2. Sally sells about the same number of tickets above and below 9€.
3. Sally sells more tickets with billing below 9€.

The correct answer is 3. To explain and visualize why this is the case, we can draw a *histogram*: we count the number of tickets with billing between 0 and 1 €, then between 1 and 2 €, then between 2 and 3 €, and so on. In the end, over each interval (or *bin*), we draw a box as high as the number of tickets falling in that bin.
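The average ticket and the histogram described above can be sketched in a few lines of Python. The ticket amounts below are made up for illustration (they are not Sally's data), and the helper name `ticket_histogram` is hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical ticket totals in euros (illustrative values, not real data).
tickets = [1.20, 2.50, 2.80, 3.10, 3.40, 3.90, 4.20, 4.75, 5.10, 5.60,
           6.30, 7.80, 9.50, 14.00, 22.40, 48.00]


def ticket_histogram(amounts, bin_width=1.0):
    """Count tickets per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    return Counter(int(amount // bin_width) for amount in amounts)


average_ticket = mean(tickets)
histogram = ticket_histogram(tickets)

print(f"average ticket: {average_ticket:.2f} EUR")
for k in sorted(histogram):
    # One '#' per ticket falling in the bin [k, k + 1) euros.
    print(f"{k:>3}-{k + 1:<3} EUR | {'#' * histogram[k]}")
```

Even in this tiny sample, a handful of large tickets pull the average up to roughly 9€ while most tickets sit well below it.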

If you chose answer 2 above, perhaps you expected to observe a histogram like the one below.

Answering 2 would be natural if we expected the ticket billing to follow a so-called *normal distribution*, in which sample values are distributed symmetrically around a central point. A ticket billing histogram that resembles what one observes in reality looks more like this:

It turns out that in reality the average ticket splits the tickets in a very uneven way: the vast majority of tickets are below the average. Indeed, the ticket distribution of a shop is very far from fitting a normal distribution: most tickets are of small value, and then there is a *long tail* of fewer and fewer tickets with increasingly high billing.
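To see this effect without real data, one can simulate a right-skewed set of tickets and count how many fall below the average. The sketch below assumes a lognormal distribution purely as a convenient skewed stand-in (it is not a fitted model of Sally's shop), and its parameters are invented.

```python
import random
from statistics import mean, median

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumption: ticket totals are drawn from a lognormal distribution, chosen
# only because it is right-skewed; mu=1.8 and sigma=0.9 are made-up values.
simulated_tickets = [random.lognormvariate(1.8, 0.9) for _ in range(10_000)]

average_ticket = mean(simulated_tickets)
median_ticket = median(simulated_tickets)
share_below_average = (
    sum(t < average_ticket for t in simulated_tickets) / len(simulated_tickets)
)

print(f"average ticket: {average_ticket:.2f} EUR")
print(f"median ticket:  {median_ticket:.2f} EUR")
print(f"share of tickets below the average: {share_below_average:.1%}")
```

For a symmetric distribution the last figure would be close to 50%; for a skewed one it is noticeably higher, and the median sits below the mean.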

This is an example of the so-called 80/20 rule, or Pareto principle, named after the Italian economist Vilfredo Pareto. In this context, the principle tells us that roughly 80% of Sally’s shop’s total billing comes from the 20% of tickets with the largest billing.

This phenomenon was first reported by Pareto in 1896 regarding the study of economic wealth distribution, but it features in many areas of science, such as medicine, biology or engineering. There is a family of probability distributions which model these phenomena, called power-law distributions.

Another place where it is possible to observe an 80/20 rule is in the context of product rankings. Sally’s POS device contains information not only about billing, but also about products. Namely, it is possible to know for each of the products in Sally’s shop how much has been sold over a period of time and at which price, making it possible to know how much each individual product has grossed.

Imagine we take all products available at Sally’s shop over a 90-day period, compute how much each product grossed and sort all products from top-selling to bottom-selling. Such a ranking would look like this:

Here, each dot represents a single product: the higher the dot is placed, the more it grosses. Typically there are a select few products which clearly outperform the rest, and a long tail where the vast majority of products, which sell little, are to be found.

In this context, the 80/20 principle states that 80% of the shop’s income is obtained by the top-selling 20% of products.
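A quick way to check how close a shop is to the 80/20 rule is to rank products by gross and measure the share earned by the top 20%. The following is a minimal sketch; the product names and figures are invented, and `top_share` is a hypothetical helper.

```python
def top_share(revenues, top_fraction=0.2):
    """Return the fraction of total revenue earned by the best-selling products."""
    ranked = sorted(revenues, reverse=True)
    # Always include at least one product, even for very short rankings.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Hypothetical 90-day gross per product, in euros (illustrative values only).
product_revenue = {
    "tomatoes": 4200, "oranges": 3100, "potatoes": 900, "lemons": 450,
    "onions": 380, "garlic": 210, "parsley": 160, "figs": 120,
    "radishes": 90, "fennel": 60,
}

share = top_share(product_revenue.values())
print(f"top 20% of products earn {share:.0%} of revenue")
```

The exact split will rarely be 80/20; the point is how far it departs from the 20/20 split one would see if all products sold equally.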

One of the first things that we can conclude after learning that Sally’s tickets follow a power-law distribution is that the average ticket is not such a good descriptor of the billing for a typical purchase at her shop. The point at which the graph in figure 2 peaks is known as the *mode* of the distribution. The mode is a more interesting statistical descriptor in this case, as a large share of the shop’s tickets concentrate around that value. After looking at the ticket distribution, there is a qualitative difference in the information that Sally can obtain:

- Before: «Sally, your average ticket is 9€».
- After: «Sally, 35% of your tickets have billing between 3€ and 7€».
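Both statements can be computed directly from the tickets. The sketch below locates the modal bin of a 1 € histogram and measures the share of tickets in a given range; the data is invented for illustration and the helper names `modal_bin` and `share_between` are hypothetical.

```python
from collections import Counter

# Hypothetical ticket totals in euros (illustrative values, not real data).
tickets = [1.20, 2.50, 2.80, 3.10, 3.40, 3.90, 4.20, 4.75, 5.10, 5.60,
           6.30, 7.80, 9.50, 14.00, 22.40, 48.00]


def modal_bin(amounts, bin_width=1.0):
    """Return the (lower, upper) edges of the bin that contains the most tickets."""
    counts = Counter(int(amount // bin_width) for amount in amounts)
    k, _ = counts.most_common(1)[0]
    return k * bin_width, (k + 1) * bin_width


def share_between(amounts, low, high):
    """Fraction of tickets whose total lies in the half-open range [low, high)."""
    return sum(low <= amount < high for amount in amounts) / len(amounts)


low, high = modal_bin(tickets)
print(f"most common ticket range: {low:.0f}-{high:.0f} EUR")
print(f"share of tickets between 3 and 7 EUR: {share_between(tickets, 3, 7):.0%}")
```

Note that the modal bin depends on the chosen bin width, so it is worth trying a couple of widths before quoting a number.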

Knowing how much this bulk of tickets around the distribution mode deviates from the distribution mean is also interesting for the following reason. A ticket distribution where the mode and mean are far apart means that Sally makes her revenue by selling more tickets of smaller value. If the mode and mean are close together, the contrary happens: Sally sells fewer tickets of greater value. Deciding which is preferable could be a matter of business strategy, but in the former case Sally is better protected against a possible loss of customers.

In order to run the shop, it is vital for Sally to know which of her products perform well and which do not. Knowing that the product ranking follows a long-tail pattern is the baseline on which we can build methods that categorize products according to their relevance in terms of grossing. This kind of categorization is known as ABC analysis. Typically:

- **A** products are about the top-selling 20%.
- **B** products are those with medium impact on global sales.
- **C** products are those that have marginal impact on total sales.

Knowing the category for a given product determines the amount of effort and attention that should be put into it.
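One common way to produce such a classification is to rank products by gross and cut the ranking by cumulative share of revenue. The sketch below is only an illustration of that idea, not the method used by any particular product: the 80% and 95% cutoffs are assumed conventional values, the data is invented, and `abc_classify` is a hypothetical helper.

```python
def abc_classify(revenues, a_cutoff=0.80, b_cutoff=0.95):
    """Map each product to 'A', 'B' or 'C' by cumulative share of revenue.

    The cutoffs are assumptions for illustration; tune them to the business.
    """
    total = sum(revenues.values())
    ranked = sorted(revenues.items(), key=lambda item: item[1], reverse=True)
    classes = {}
    cumulative = 0.0
    for product, revenue in ranked:
        # Classify on the share accumulated *before* this product, so the
        # top seller is always an A product.
        share_before = cumulative / total
        if share_before < a_cutoff:
            classes[product] = "A"
        elif share_before < b_cutoff:
            classes[product] = "B"
        else:
            classes[product] = "C"
        cumulative += revenue
    return classes


# Hypothetical 90-day gross per product, in euros (illustrative values only).
product_revenue = {
    "tomatoes": 4200, "oranges": 3100, "potatoes": 900, "lemons": 450,
    "onions": 380, "garlic": 210, "parsley": 160, "figs": 120,
    "radishes": 90, "fennel": 60,
}

classes = abc_classify(product_revenue)
for product, label in classes.items():
    print(f"{label}  {product}")
```

With a long-tailed ranking, the A class ends up small and the C class large, which is exactly what makes the categorization useful for deciding where to focus attention.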

Since Sally is a user of our BitPhy Dashboard, she has instant, real-time access to the ABC classification of her shop’s products.

The phenomenon which we have described has long been known to take place in the retail sector, and it is in some sense conditioned by the physical nature of a retail shop. Namely, there is a finite amount of space available on which products can be displayed. This limits the total number of products available to be bought at any time.

The rise of e-commerce has caused a shift in the shape of these product ranking distributions. Since an online shop has no physical limitation on the number of products that can be stored, the overall number of products available is much higher. This leads to rather heavy long tails in product rankings, which decrease at a much slower pace and therefore have a big overall impact. In other words, the trend consists in selling fewer items of more products.

At BitPhy we have discovered that this is one fundamental difference that sets physical retail shops apart from their online counterparts, and it must be taken into account when designing any kind of algorithm that outputs intelligent shop analytics.

The importance of long tails for the economics of e-commerce has received lots of attention, especially after Chris Anderson wrote about it in Wired. Certainly, this piece and the books that have spawned from it are a good read if you are interested in understanding this key aspect of the economic impact of the rise of e-commerce.

If you are interested in the probability distributions governing the long-tail and 80/20 phenomena described here, the 2009 paper by Clauset, Shalizi & Newman is a great place to start (the relationship of the probability density function in the discrete case to the Hurwitz zeta function is particularly appealing to the mathematician in me).
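For reference, the power-law densities in question can be written as follows, where the symbols are the usual ones: an exponent $\alpha > 1$ and a lower cutoff $x_{\min} > 0$ above which the power law is assumed to hold. In the continuous case,

$$
p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha}, \qquad x \ge x_{\min},
$$

while in the discrete case the normalizing constant is a zeta function:

$$
p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}, \qquad \zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}.
$$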

More recently, Alstott, Bullmore & Plenz published a paper that describes the `powerlaw` Python package, available at the PyPI package repository.

Both papers include software implementations of the methods they describe, and in particular they suggest methods and tests to decide how well your data fits a theoretical power-law distribution.
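To give a flavour of what such methods do, here is a minimal sketch of the standard maximum-likelihood estimate of the exponent of a continuous power law, $\hat{\alpha} = 1 + n \left[ \sum_i \ln (x_i / x_{\min}) \right]^{-1}$. It uses only the Python standard library rather than either paper's software, assumes the cutoff `x_min` is already known (the full methods also estimate it and test goodness of fit), and runs on synthetic data. The helper name `estimate_alpha` is hypothetical.

```python
import math
import random


def estimate_alpha(samples, x_min):
    """Maximum-likelihood exponent of a continuous power law above x_min."""
    tail = [x for x in samples if x >= x_min]
    if not tail:
        raise ValueError("no samples at or above x_min")
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


random.seed(0)  # fixed seed so the sketch is reproducible
true_alpha = 2.5

# random.paretovariate(a) has density proportional to x ** -(a + 1) for x >= 1,
# so shape a = true_alpha - 1 yields a power law with exponent true_alpha.
samples = [random.paretovariate(true_alpha - 1) for _ in range(50_000)]

alpha_hat = estimate_alpha(samples, x_min=1.0)
print(f"true exponent: {true_alpha}, estimated exponent: {alpha_hat:.3f}")
```

On real ticket or product data the power law usually holds only in the tail, which is why choosing `x_min` carefully matters as much as estimating the exponent.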

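Finally, it should also be noted that a continuous power-law distribution is available to Python users as `scipy.stats.powerlaw`. Be aware that this distribution has density a·x^(a−1) on the interval [0, 1] with a > 0, so it does not have a long tail; the heavy-tailed counterpart in SciPy is `scipy.stats.pareto`.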