In the previous article, we focused on two types of distributions: the Gaussian distribution and the Power Law distribution. We saw that these distributions had diametrically opposite statistical properties. Namely, Power Laws are driven by rare events, while Gaussians are not.
This rare-event-driven property raised 3 problems for many of our favorite statistical tools (e.g. mean, standard deviation, regression) when analyzing Power Laws. The takeaway was that if data are Gaussian-like, one can use common approaches like regression and computing expectation values without worry. However, if data are more Power Law-like, these techniques can give incorrect and misleading results.
We also saw a third (more mischievous) distribution that could resemble both a Gaussian and a Power Law (despite their opposite properties) called a Log Normal distribution.
This ambiguity presents challenges for practitioners in deciding the best way to analyze a given dataset. To help overcome these challenges, it can be advantageous to determine whether data fit a Power Law, Log Normal, or some other type of distribution.
A popular way of fitting a Power Law to real-world data is what I’ll call the “Log-Log approach” [1]. The idea comes from taking the logarithm of the Power Law’s probability density function (PDF), as derived below.
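As a sketch of the idea, suppose the PDF has the generic Power Law form p(x) = C·x^(−α), where C is a normalizing constant (the exact constant and the convention for α depend on the parameterization). Taking the logarithm of both sides gives

\[
\log p(x) = -\alpha \log x + \log C
\]

which is a linear equation in log x and log p(x).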
The above derivation translates the Power Law’s PDF definition into a linear equation, as shown in the figure below.
This implies that the histogram of data following a Power Law will follow a straight line on log-log axes. In practice, this means generating a histogram for some data and plotting it on a log-log plot [1]. One might go even further and perform a linear regression on the log-log histogram to estimate the distribution’s α value from the fitted slope m (here, α = -m+1).
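As a rough illustration of the Log-Log approach (a minimal sketch, not code from reference [1]), one can histogram a sample, keep the non-empty bins, and fit a line in log-log space:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# synthetic heavy-tailed sample for illustration
sample = np.random.pareto(2, 10_000) + 1

# histogram the sample as an empirical PDF using log-spaced bins
counts, edges = np.histogram(sample, bins=np.logspace(0, 3, 50), density=True)
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers
mask = counts > 0                          # ignore empty bins

# linear regression in log-log space; m is the slope of the fitted line
m, b = np.polyfit(np.log10(centers[mask]), np.log10(counts[mask]), 1)

# plot the empirical PDF and the fitted line on log-log axes
plt.loglog(centers[mask], counts[mask], 'o')
plt.loglog(centers[mask], 10**b * centers[mask]**m)
plt.xlabel('x')
plt.ylabel('p(x)')
plt.show()

print("slope m = " + str(m))  # the exponent estimate follows from the slope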
However, there are significant limitations to this approach. These are described in reference [1] and summarized below.
- Slope (hence α) estimations are subject to systematic errors
- Regression errors can be hard to estimate
- Fit can look good even if the distribution does not follow a Power Law
- Fits may not obey basic conditions for probability distributions, e.g. normalization
While the Log-Log approach is simple to implement, its limitations make it less than optimal. Instead, we can turn to a more mathematically sound approach via Maximum Likelihood, a widely used statistical method for inferring the best parameters for a model given some data.
Maximum Likelihood consists of 2 key steps. Step 1: obtain a likelihood function. Step 2: maximize the likelihood with respect to your model parameters.
Step 1: Write Likelihood Function
Likelihood is closely related to probability. Put simply, it quantifies how probable our observed data are under a particular model, viewed as a function of that model’s parameters. We can express it as the joint probability over all our observed data [3]. In the case of a Pareto distribution, we can write this as follows.
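Assuming the classical Pareto (Type I) parameterization, p(x) = α·x_min^α / x^(α+1) for x ≥ x_min (conventions for α vary, so treat this as one common choice), the likelihood of n independent observations x_1, …, x_n is

\[
L(\alpha) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \frac{\alpha \, x_{\min}^{\alpha}}{x_i^{\alpha+1}}
\]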
To make maximizing the likelihood a little easier, it is customary to work with the log-likelihood (they are maximized by the same value of α).
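Taking the logarithm of the likelihood above gives

\[
l(\alpha) = \log L(\alpha) = n \log \alpha + n \alpha \log x_{\min} - (\alpha + 1) \sum_{i=1}^{n} \log x_i
\]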
Step 2: Maximize Likelihood
With a (log) likelihood function in hand, we can now frame the task of determining the best choice of parameters as an optimization problem. To find the optimal α value based on our data, this boils down to setting the derivative of l(α) with respect to α equal to zero and then solving for α. A derivation of this is given below.
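Under the same Pareto parameterization as above, the derivation takes just a few lines:

\[
\frac{dl}{d\alpha} = \frac{n}{\alpha} + n \log x_{\min} - \sum_{i=1}^{n} \log x_i = 0
\quad \Rightarrow \quad
\hat{\alpha} = \frac{n}{\sum_{i=1}^{n} \log \left( x_i / x_{\min} \right)}
\]

(If α is instead defined as the exponent of the density itself, which is the convention used by the powerlaw library later on, the estimator is this quantity plus one.)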
This provides us with the so-called Maximum Likelihood estimator for α. With this, we can plug in observed values of x to estimate a Pareto distribution’s α value.
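As a quick sanity check (a small sketch, not part of the original analysis), we can evaluate this estimator directly with NumPy on a synthetic Pareto sample:

import numpy as np

np.random.seed(0)

# synthetic Pareto sample with tail exponent alpha = 2 and x_min = 1
x_min = 1
x = (np.random.pareto(2, 100_000) + 1) * x_min

# maximum likelihood estimator derived above
alpha_hat = len(x) / np.sum(np.log(x / x_min))
print(alpha_hat)  # ~2 (add 1 for the powerlaw library's convention)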
With the theoretical foundation set, let’s see what this looks like when applied to real-world data (from my social media accounts).
One domain in which fat-tailed data are prevalent is social media. For instance, a small percentage of creators get the bulk of the attention, a minority of Medium blogs get the majority of reads, and so on.
Here we will use the powerlaw Python library to determine whether data from my various social media channels (i.e. Medium, YouTube, LinkedIn) truly follow a Power Law distribution. The data and code for these examples are available on the GitHub repository.
Artificial Data
Before applying the Maximum Likelihood-based approach to messy data from the real world, let’s see what happens when we apply this technique to artificial data (truly) generated from Pareto and Log Normal distributions, respectively. This will help ground our expectations before using the approach on data in which we do not know the “true” underlying distribution class.
First, we import some helpful libraries.
import numpy as np
import matplotlib.pyplot as plt
import powerlaw
import pandas as pd

np.random.seed(0)
Next, let’s generate data from Pareto and Log Normal distributions.
# power law data
a = 2
x_min = 1
n = 1_000
x = np.linspace(0, n, n+1)
s_pareto = (np.random.pareto(a, len(x)) + 1) * x_min

# log normal data
m = 10
s = 1
s_lognormal = np.random.lognormal(m, s, len(x)) * s * np.sqrt(2*np.pi)
To get a sense of what these data look like, it’s helpful to plot histograms. Here, I plot a histogram of each sample’s raw values and the log of the raw values. This latter distribution makes it easier to distinguish between Power Law and Log Normal data visually.
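The plotting code itself isn’t shown here; a minimal sketch of how these histograms might be produced, reusing the imports and samples from the snippets above:

fig, axes = plt.subplots(2, 2, figsize=(10, 6))

# histograms of raw values
axes[0, 0].hist(s_pareto, bins=100)
axes[0, 0].set_title("Pareto (Power Law) sample: raw values")
axes[0, 1].hist(s_lognormal, bins=100)
axes[0, 1].set_title("Log Normal sample: raw values")

# histograms of log values
axes[1, 0].hist(np.log10(s_pareto), bins=100)
axes[1, 0].set_title("Pareto (Power Law) sample: log values")
axes[1, 1].hist(np.log10(s_lognormal), bins=100)
axes[1, 1].set_title("Log Normal sample: log values")

plt.tight_layout()
plt.show()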
As we can see from the above histograms, the distributions of raw values look qualitatively similar for both distributions. However, we can see a stark difference in the log distributions. Namely, the log Power Law distribution is highly skewed and not mean-centered, while the log of the Log Normal distribution is reminiscent of a Gaussian distribution.
Now, we can use the powerlaw library to fit a Power Law to each sample and estimate values for α and x_min. Here’s what that looks like for our Power Law sample.
# fit power to power law data
results = powerlaw.Fit(s_pareto)

# printing results
print("alpha = " + str(results.power_law.alpha)) # note: powerlaw lib's alpha definition is different than standard i.e. a_powerlawlib = a_standard + 1
print("x_min = " + str(results.power_law.xmin))
print('p = ' + str(compute_power_law_p_val(results))) # helper function not defined in this snippet (see the GitHub repo)
# Calculating best minimal value for power law fit
# alpha = 2.9331912195958676
# x_min = 1.2703447024073973
# p = 0.999
The fit does a decent job of estimating the true parameter values (x_min=1 and α=3 in the powerlaw library’s convention, since the sample was generated with a=2), as seen by the alpha and x_min values printed above. The value p above quantifies the quality of the fit. A higher p means a better fit (more on this value in section 4.1 of ref [1]).
We can do a similar thing for the Log Normal distribution.
# fit power to log normal data
results = powerlaw.Fit(s_lognormal)
print("alpha = " + str(results.power_law.alpha)) # note: powerlaw lib's alpha definition is different than standard i.e. a_powerlawlib = a_standard + 1
print("x_min = " + str(results.power_law.xmin))
print('p = ' + str(compute_power_law_p_val(results)))

# Calculating best minimal value for power law fit
# alpha = 2.5508694755027337
# x_min = 76574.4701482522
# p = 0.999
We can see that the Log Normal sample also fits a Power Law distribution well (p=0.999). Notice, however, that the x_min value is far in the tail. While this may be helpful for some use cases, it doesn’t tell us much about the distribution that best fits all the data in the sample.
To overcome this, we can manually set the x_min value to the sample minimum and redo the fit.
# fixing xmin so that fit must include all data
results = powerlaw.Fit(s_lognormal, xmin=np.min(s_lognormal))
print("alpha = " + str(results.power_law.alpha))
print("x_min = " + str(results.power_law.xmin))# alpha = 1.3087955873576855
# x_min = 2201.318351239509
The .Fit() method also automatically generates estimates for a Log Normal distribution.
print("mu = " + str(results.lognormal.mu))
print("sigma = " + str(results.lognormal.sigma))# mu = 10.933481999687547
# sigma = 0.9834599169175509
The estimated Log Normal parameter values are close to the actual values (mu=10, sigma=1), so the fit did a good job once again! (The estimated mu sits a bit above 10 because the sample was scaled by s·√(2π) when it was generated, which shifts mu by ln(s·√(2π)) ≈ 0.92.)
However, by fixing x_min, we lose our quality metric p (the method does not compute it when x_min is supplied manually). This raises the question: which distribution’s parameters should we go with, the Power Law’s or the Log Normal’s?
To answer this question, we can compare the Power Law fit to other candidate distributions via Log-likelihood ratios (R). A positive R implies the Power Law is a better fit, while a negative R implies the alternative distribution is better. Additionally, each comparison gives us a significance value (p). This is demonstrated in the code block below.
distribution_list = ['lognormal', 'exponential', 'truncated_power_law',
                     'stretched_exponential', 'lognormal_positive']

for distribution in distribution_list:
    R, p = results.distribution_compare('power_law', distribution)
    print("power law vs " + distribution +
          ": R = " + str(np.round(R,3)) +
          ", p = " + str(np.round(p,3)))
# power law vs lognormal: R = -776.987, p = 0.0
# power law vs exponential: R = -737.24, p = 0.0
# power law vs truncated_power_law: R = -419.958, p = 0.0
# power law vs stretched_exponential: R = -737.289, p = 0.0
# power law vs lognormal_positive: R = -776.987, p = 0.0
As shown above, every alternative distribution is preferred over the Power Law when including all the data in the Log Normal sample. Additionally, based on the likelihood ratios, the lognormal and lognormal_positive fits work best.
Real-world Data
Now that we’ve applied the powerlaw library to data where we know the ground truth, let’s try it on data for which the underlying distribution is unknown.
We will follow a similar procedure as above, but with data from the real world. Here, we will analyze three datasets: monthly followers gained on my Medium profile, earnings across all my YouTube videos, and daily impressions on my LinkedIn posts over the past year.
We’ll start by plotting histograms.
Two things jump out to me from these plots. One, all three look more like the Log Normal histograms than the Power Law histograms we saw before. Two, the Medium and YouTube distributions are sparse, meaning they may have insufficient data for drawing strong conclusions.
Next, we’ll apply the Power Law fit to all three distributions while setting x_min as the smallest value in each sample. The results of this are printed below.
To determine which distribution is best, we can again do head-to-head comparisons of the Power Law fit to some alternatives. These results are given below.
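The exact loading code depends on how the data files in the repo are stored, so what follows is only a sketch; the array names (medium_followers, youtube_earnings, linkedin_impressions) are placeholders for the three datasets:

datasets = {
    "Medium followers": medium_followers,
    "YouTube earnings": youtube_earnings,
    "LinkedIn impressions": linkedin_impressions,
}

for name, data in datasets.items():
    # force the fit to cover the entire sample
    fit = powerlaw.Fit(data, xmin=np.min(data))
    print(name + ": alpha = " + str(fit.power_law.alpha) +
          ", x_min = " + str(fit.power_law.xmin))

    # head-to-head comparisons against alternative distributions
    for alt in ['lognormal', 'exponential', 'truncated_power_law',
                'stretched_exponential', 'lognormal_positive']:
        R, p = fit.distribution_compare('power_law', alt)
        print("  power law vs " + alt +
              ": R = " + str(np.round(R, 3)) +
              ", p = " + str(np.round(p, 3)))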
Using the rule-of-thumb significance cutoff of p &lt; 0.1, we can draw the following conclusions: Medium followers and LinkedIn impressions are best fit by a Log Normal distribution, while YouTube earnings are best represented by a Power Law.
Of course, since the Medium followers and YouTube earnings data here are limited (N&lt;100), we should take any conclusions from those data with a grain of salt.
Many standard statistical tools break down when applied to data following a Power Law distribution. Accordingly, detecting Power Laws in empirical data can help practitioners avoid incorrect analyses and misleading conclusions.
However, Power Laws are an extreme case of the more general phenomenon of fat tails. In the next article of this series, we will take this work one step further and quantify fat-tailedness for any given dataset via 4 handy heuristics.