Any good statistics student will need to do some integrals in her / his life. While I generally feel comfortable with simple integrals, I thought it might be worth setting up a workflow to help automate this process!
Previously, especially coming from a physics background, I’ve worked a lot with Mathematica, an advanced version of the software available online as WolframAlpha. Mathematica is extremely powerful, but it’s not open-source and comes with a hefty license, so I decided to research alternatives.
The main options I looked into were Sage, Maxima, and SymPy, and I eventually decided to take SymPy for a spin.1 This will also be my first post in Python, which is well-supported by the knitr / R Markdown framework.
Given a PDF \(f(x)\) of a continuous random variable \(X\), expectations of functions of \(X\) take the form of integrals. Concretely, let \(g(X)\) be a function of the random variable. Then \[ \mathbb{E}_X[g(X)] = \int_{-\infty}^{\infty}g(x)f(x)dx \] (If \(X\) is a multivariate random variable, the integral should be appropriately converted to multiple dimensions.)
The normalization condition of a PDF can be written as: \[ \mathbb{E}_X[1] = \int_{-\infty}^{\infty}f(x)dx = 1 \] Moments of \(X\) take the form: \[ \mathbb{E}_X[X^n] = \int_{-\infty}^{\infty}x^nf(x)dx \] From which one can get the mean, \(\mathbb{E}_X[X]\), and variance, \[ Var(X) = \mathbb{E}_X[X^2] - (\mathbb{E}_X[X])^2 \] One final useful expectation is the moment generating function, or MGF. For a real variable \(t\), the MGF is a function of \(t\) in a neighborhood around 0 such that the expectation \[ M_X(t) = \mathbb{E}_X[e^{tX} ]= \int_{-\infty}^{\infty}e^{tx}f(x)dx \] exists.
In this post, we’ll use SymPy to try to compute these quantities analytically for a few distributions. The type of software exemplified by SymPy and Mathematica is called a computer algebra system, and uses coded rules to manipulate expressions.
First we import SymPy:
import sympy as sym
print sym.__version__
## 1.1.1
To write SymPy expressions, one first defines the symbols that are manipulated. We start out with \(x\), the variable with respect to which PDFs are defined, and \(t\), the variable for MGFs. We then define some simple helper functions for expressing our expectations of interest.
sym.init_printing()
x, t = sym.symbols('x t', real=True)
def area(dist):
return sym.simplify(sym.integrate(dist, (x, -sym.oo, sym.oo)))
def mean(dist):
return area(dist*x)
def EX2(dist):
return area(dist*x**2)
def variance(dist):
return sym.simplify(EX2(dist) - mean(dist)**2)
def mgf(dist):
return sym.simplify(area(dist*sym.exp(x*t)))
def latex(result):
return "$" + sym.latex(result) + "$\n"
def summarize(dist):
print "Distribution: " + latex(dist)
print "Area: " + latex(area(dist))
print "Mean: " + latex(mean(dist))
print "Variance: " + latex(variance(dist))
print "MGF: " + latex(mgf(dist))
summarise = summarize # alias
Our summarize
(or summarise
) function allows us to print the relevant summary information given a “distribution”, which is just a SymPy function of x
.
Next, we define the other symbols that will be used throughout this post.2
# Define other symbols that show up
mu = sym.symbols('mu', real=True)
sigma, a, b, lamb, nu = sym.symbols('sigma a b lambda nu', positive=True)
We start with the normal distribution:3
normal = (2*sym.pi*sigma**2) ** sym.Rational(-1, 2) * sym.exp(-(x-mu)**2/(2*sigma**2))
summarize(normal)
Distribution: \(\frac{\sqrt{2}}{2 \sqrt{\pi} \sigma} e^{- \frac{\left(- \mu + x\right)^{2}}{2 \sigma^{2}}}\)
Area: \(1\)
Mean: \(\mu\)
Variance: \(\sigma^{2}\)
MGF: \(e^{\frac{t}{2} \left(2 \mu + \sigma^{2} t\right)}\)
laplace = (2*b) ** (-1) * sym.exp(-sym.Abs(x-mu)/b)
summarize(laplace)
Distribution: \(\frac{1}{2 b} e^{- \frac{1}{b} \left|{\mu - x}\right|}\)
Area: \(1\)
Mean: \(\mu\)
Variance: \(2 b^{2}\)
MGF: \(\begin{cases} - \frac{e^{\mu t}}{b^{2} t^{2} - 1} & \text{for}\: \operatorname{periodic_{argument}}{\left (e^{\frac{i \pi}{2}} \operatorname{polar\_lift}{\left (t \right )},\infty \right )} = 0 \\\int_{-\infty}^{\infty} \frac{1}{2 b} e^{\frac{1}{b} \left(b t x - \left|{\mu - x}\right|\right)}\, dx & \text{otherwise} \end{cases}\)
I have no idea what the intimidating condition is, but the MGF is correct.
This function is defined piecewise:
expo = sym.Piecewise(
(0, x < 0),
(lamb * sym.exp(-lamb*x), True)
)
summarize(expo)
Distribution: \(\begin{cases} 0 & \text{for}\: x < 0 \\\lambda e^{- \lambda x} & \text{otherwise} \end{cases}\)
Area: \(1\)
Mean: \(\frac{1}{\lambda}\)
Variance: \(\frac{1}{\lambda^{2}}\)
MGF: \(\begin{cases} \frac{\lambda}{\lambda - t} & \text{for}\: \left|{\operatorname{periodic_{argument}}{\left (e^{i \pi} \operatorname{polar\_lift}{\left (t \right )},\infty \right )}}\right| \leq \frac{\pi}{2} \\\int_{-\infty}^{\infty} \begin{cases} 0 & \text{for}\: x < 0 \\\lambda e^{- x \left(\lambda - t\right)} & \text{otherwise} \end{cases}\, dx & \text{otherwise} \end{cases}\)
gamma = sym.Piecewise(
(0, x < 0),
(b**a / sym.gamma(a) * x**(a-1) * sym.exp(-x*b), True)
)
summarize(gamma)
Distribution: \(\begin{cases} 0 & \text{for}\: x < 0 \\\frac{b^{a} x^{a - 1}}{\Gamma{\left(a \right)}} e^{- b x} & \text{otherwise} \end{cases}\)
Area: \(1\)
Mean: \(\frac{a}{b}\)
Variance: \(\frac{a}{b^{2}}\)
MGF: \(\begin{cases} \begin{cases} b^{a} t^{- a} \left(\frac{1}{t} \left(b - t\right)\right)^{- a} & \text{for}\: \frac{b}{\left|{t}\right|} > 1 \\b^{a} t^{- a} \left(\frac{1}{t} \left(- b + t\right)\right)^{- a} e^{- i \pi a} & \text{otherwise} \end{cases} & \text{for}\: \left|{\operatorname{periodic_{argument}}{\left (e^{i \pi} \operatorname{polar\_lift}{\left (t \right )},\infty \right )}}\right| \leq \frac{\pi}{2} \\\int_{-\infty}^{\infty} \begin{cases} 0 & \text{for}\: x < 0 \\\frac{b^{a} x^{a - 1}}{\Gamma{\left(a \right)}} e^{x \left(- b + t\right)} & \text{otherwise} \end{cases}\, dx & \text{otherwise} \end{cases}\)
Fun fact: Wikipedia tells us that the Exponential, Chi-squared, and Erlang distributions are all special cases of the Gamma.
The Beta distribution is the first one that SymPy was unable to evaluate. When I tried the area, mean, variance, and MGF, all the integrals hanged, and I had to abort the operation.5
beta = sym.Piecewise(
(0, x < 0),
(0, x > 1),
(x**(a-1)*(1-x)**(b-1)/(sym.gamma(a)*sym.gamma(b)/sym.gamma(a+b)), True)
)
print "Distribution: " + latex(beta)
# area(beta) # had to abort
Distribution: \(\begin{cases} 0 & \text{for}\: x > 1 \vee x < 0 \\\frac{x^{a - 1} \Gamma{\left(a + b \right)}}{\Gamma{\left(a \right)} \Gamma{\left(b \right)}} \left(- x + 1\right)^{b - 1} & \text{otherwise} \end{cases}\)
However, the Uniform distribution, a special case of the Beta with \(a = b = 1\), works just fine:
uniform = sym.Piecewise(
(0, x < 0),
(0, x > 1),
(1, True)
)
summarize(uniform)
Distribution: \(\begin{cases} 0 & \text{for}\: x > 1 \vee x < 0 \\1 & \text{otherwise} \end{cases}\)
Area: \(1\)
Mean: \(\frac{1}{2}\)
Variance: \(\frac{1}{12}\)
MGF: \(\begin{cases} 1 & \text{for}\: t = 0 \\\frac{1}{t} \left(e^{t} - 1\right) & \text{otherwise} \end{cases}\)
Last we come to the Student t distribution. This one doesn’t have an MGF (see Wikipedia), so we display each quantity of interest separately rather than use our summarize
function.
student = (1 + ((x-mu) / sigma)**2 / nu)**(-(1+nu)/2) * sym.gamma((nu+1)/2)/(sym.gamma(nu/2)*sym.sqrt(nu*sym.pi)*sigma)
print "Distribution: " + latex(student)
Distribution: \(\frac{\left(1 + \frac{\left(- \mu + x\right)^{2}}{\nu \sigma^{2}}\right)^{- \frac{\nu}{2} - \frac{1}{2}}}{\sqrt{\pi} \sqrt{\nu} \sigma \Gamma{\left(\frac{\nu}{2} \right)}} \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}\)
print "Area: " + latex(area(student))
Area: \(1\)
print "Mean: " + latex(mean(student))
Mean: \(\begin{cases} \mu & \text{for}\: \frac{\nu}{2} + \frac{1}{2} > 1 \\\int_{-\infty}^{\infty} \frac{\nu^{\frac{\nu}{2} + 1} \sigma^{\nu} x \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}}{2 \sqrt{\pi} \Gamma{\left(\frac{\nu}{2} + 1 \right)}} \left(\mu^{2} - 2 \mu x + \nu \sigma^{2} + x^{2}\right)^{- \frac{\nu}{2} - \frac{1}{2}}\, dx & \text{otherwise} \end{cases}\)
print "Variance: " + latex(sym.trigsimp(variance(student)))
Variance: \(- \begin{cases} \mu^{2} & \text{for}\: \frac{\nu}{2} + \frac{1}{2} > 1 \\\left(\int_{-\infty}^{\infty} \frac{\nu^{\frac{\nu}{2} + 1} \sigma^{\nu} x \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}}{2 \sqrt{\pi} \Gamma{\left(\frac{\nu}{2} + 1 \right)}} \left(\mu^{2} - 2 \mu x + \nu \sigma^{2} + x^{2}\right)^{- \frac{\nu}{2} - \frac{1}{2}}\, dx\right)^{2} & \text{otherwise} \end{cases} + \begin{cases} \frac{2 \left(\mu^{2} \nu - 2 \mu^{2} + \nu \sigma^{2}\right) \cos^{2}{\left (\frac{\pi \nu}{2} \right )}}{\left(\nu - 2\right) \left(\cos{\left (\pi \nu \right )} + 1\right)} & \text{for}\: - \frac{\nu}{2} + \frac{5}{2} < 2 \wedge - \frac{\nu}{2} + 3 < 2 \\\int_{-\infty}^{\infty} \frac{\nu^{\frac{\nu}{2} + 1} \sigma^{\nu} x^{2} \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}}{2 \sqrt{\pi} \Gamma{\left(\frac{\nu}{2} + 1 \right)}} \left(\mu^{2} - 2 \mu x + \nu \sigma^{2} + x^{2}\right)^{- \frac{\nu}{2} - \frac{1}{2}}\, dx & \text{otherwise} \end{cases}\)
Here, I used sym.trigsimp
, which added a few simplifications compared to sym.simplify
(you can check this yourself). Yet still, SymPy doesn’t quite get the simplified expression for the variance. Notice that \[
2\cos^2(y/2) = \cos(y) + 1
\] so the expressions with \(y = \pi \nu\) should cancel. If we notice this by eye, we can then use SymPy to finish the job, which is valid for \(\nu > 2\).6
expression = -mu**2 + (mu**2*nu-2*mu**2+nu*sigma**2)/(nu-2)
print(latex(expression))
\(- \mu^{2} + \frac{1}{\nu - 2} \left(\mu^{2} \nu - 2 \mu^{2} + \nu \sigma^{2}\right)\)
print(latex(sym.simplify(expression)))
\(\frac{\nu \sigma^{2}}{\nu - 2}\)
Out of seven common continuous distributions, SymPy pretty much solved five of them. For the Beta distribution, it had trouble with all the integrals, and for the Student t distribution, it failed to notice some simplifications.
I imagine this is not competitive with Mathematica, but for its free price and Python integration, I do think SymPy could make a valuable addition to a statistician’s toolbox.
This blog post was generated from an R Markdown file using the knitr
and blogdown
packages. The original source can be downloaded from GitHub.
UPDATE 2018-04-06: printing the area calculation (PDF normalization) as well.
lamb
instead of lambda
, because lambda
is a predefined Python construct.↩
I chose to display code output using the knitr
option results="asis"
, so that the LaTeX formatting would show up.↩
This is also documented by another user as a GitHub issue.↩
I find it strange that SymPy wasn’t able to simplify the conditions on \(\nu\) to give \(\nu > 1\) for the first part and \(\nu > 2\) for the second.↩