Poisson Regression and Maximum Likelihood Estimation

A Case Study of Blueprinty’s Software and Patent Awards

Author

Arman Soleimani

Published

May 14, 2026

Introduction

Blueprinty is a small firm that develops software specifically designed to help engineers prepare and submit patent applications to the US Patent and Trademark Office. Their marketing team would like to make the case that firms using Blueprinty’s software are more successful at getting their patent applications approved — a compelling selling point if true.

The ideal test of this claim would be a controlled experiment: observe a firm’s patent success before adopting Blueprinty and after, holding everything else constant. No such longitudinal data exists. Instead, Blueprinty has collected a cross-sectional snapshot of 1,500 mature engineering firms, recording each firm’s total patents awarded over the last five years, its geographic region, years since incorporation, and whether the firm is a current Blueprinty customer.

The challenge is that Blueprinty’s customers are not a random sample of engineering firms. Firms that choose to purchase the software likely differ in systematic ways — perhaps they are more innovation-focused, better funded, or located in more patent-active regions. A raw comparison of patent counts between customers and non-customers could therefore reflect these pre-existing differences rather than any causal effect of the software. To make a credible argument, we need a model that controls for the characteristics of firms. That is exactly what Poisson regression via Maximum Likelihood Estimation lets us do.


Exploring the Data

Patent Counts by Customer Status

customer_label
Customer        4.133
Non-Customer    3.473

Blueprinty customers average 4.13 patents over five years compared to 3.47 for non-customers — a gap of about 0.66 patents, or roughly 19%. Both distributions are right-skewed, with most firms clustered in the 0–6 range. The customer distribution appears slightly shifted to the right, consistent with the marketing team’s claim. But we cannot stop here: if customers are systematically older or concentrated in high-patent regions, the difference may say more about firm characteristics than the software.

Regional Distribution

The Northeast stands out: Blueprinty customers represent a notably higher share of firms there than in any other region. The South and Midwest are more heavily non-customer. Since patent activity itself may vary by region (proximity to universities, local patent culture, industry mix), these imbalances mean that regional location is a potential confounder we must control for.

Firm Age Distribution

Blueprinty customers average about 26.9 years since incorporation, compared to 26.1 years for non-customers — a modest difference. However, older firms have had more time to build patent portfolios and internal R&D capabilities, so firm age is still an important control variable. Including both age and age-squared will allow for a non-linear relationship between maturity and patent output.


A Simple Poisson Model via Maximum Likelihood

Why Poisson?

Patent counts are non-negative integers — exactly the kind of outcome the Poisson distribution was designed to model. A Normal distribution is a poor choice: it assigns positive probability to negative values and treats counts as continuous. The Poisson distribution, parameterized by a single rate \(\lambda > 0\), gives the probability of observing exactly \(y\) events as:

\[ f(Y = y \mid \lambda) = \frac{e^{-\lambda}\,\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \]

Likelihood and Log-Likelihood

For \(n\) independent firms each with the same rate \(\lambda\), the joint likelihood is:

\[ \mathcal{L}(\lambda \mid \mathbf{Y}) = \prod_{i=1}^{n} \frac{e^{-\lambda}\,\lambda^{Y_i}}{Y_i!} \]

Taking logarithms converts the product into a sum, which is far easier to optimize:

\[ \ell(\lambda \mid \mathbf{Y}) = \sum_{i=1}^{n} \left(-\lambda + Y_i \log\lambda - \log Y_i!\right) \]

\[ = -n\lambda + \log\lambda \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} \log Y_i! \]

Coding the Log-Likelihood

from scipy.special import gammaln   # log(n!) = gammaln(n+1), numerically stable

def poisson_loglikelihood(lam, Y):
    """Log-likelihood of Poisson(lam) evaluated at observed counts Y."""
    if lam <= 0:
        return -np.inf
    return np.sum(-lam + Y * np.log(lam) - gammaln(Y + 1))

Visualising the Log-Likelihood

Y = df['patents'].values
lambdas = np.linspace(0.5, 8, 300)
ll_vals = [poisson_loglikelihood(l, Y) for l in lambdas]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(lambdas, ll_vals, color='#4C72B0', linewidth=2)
ax.axvline(Y.mean(), color='#DD8452', linestyle='--', linewidth=1.8,
           label=f'$\\bar{{Y}}$ = {Y.mean():.3f}')
ax.set_xlabel('$\\lambda$', fontsize=12)
ax.set_ylabel('Log-Likelihood $\\ell(\\lambda)$', fontsize=11)
ax.set_title('Poisson Log-Likelihood as a Function of $\\lambda$',
             fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('loglik_plot.png', dpi=150, bbox_inches='tight')
plt.show()

The curve is concave and peaks at \(\lambda \approx \bar{Y}\), confirming that the sample mean is the Maximum Likelihood Estimator. The orange dashed line marks \(\bar{Y}\) — visually right at the peak.

Analytical MLE

Differentiating with respect to \(\lambda\) and setting to zero:

\[ \frac{d\ell}{d\lambda} = -n + \frac{\sum Y_i}{\lambda} = 0 \implies \hat{\lambda}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n Y_i = \bar{Y} \]

This makes intuitive sense: the mean of a Poisson distribution is \(\lambda\), so the best estimate of \(\lambda\) from data is the sample mean.

Numerical MLE with scipy.optimize

result = sp.minimize(
    fun=lambda lam: -poisson_loglikelihood(lam[0], Y),   # minimise negative LL
    x0=[3.0],
    method='Nelder-Mead'
)
lam_mle = result.x[0]
print(f"Numerical MLE:  λ̂ = {lam_mle:.6f}")
print(f"Sample mean:    Ȳ = {Y.mean():.6f}")
Numerical MLE:  λ̂ = 3.684668
Sample mean:    Ȳ = 3.684667

The numerical result is virtually identical to \(\bar{Y}\), confirming both the math and the code are correct.


Poisson Regression

Extending to Firm-Level Covariates

A single \(\lambda\) for all firms is too restrictive. Older firms, certain regions, and Blueprinty customers likely differ in their patent rates. Poisson regression relaxes this by letting each firm’s rate depend on its own characteristics:

\[ Y_i \sim \text{Poisson}(\lambda_i), \qquad \lambda_i = \exp(X_i'\beta) \]

The exponential link function is essential: \(X_i'\beta\) can be any real number, but \(\lambda_i\) must be strictly positive. Taking the exponential maps \(\mathbb{R} \to (0, \infty)\), guaranteeing valid Poisson rates.

Updated Log-Likelihood

def poisson_regression_loglikelihood(beta, Y, X):
    """Log-likelihood for Poisson regression."""
    lam = np.exp(X @ beta)          # one λ_i per firm
    return np.sum(-lam + Y * np.log(lam) - gammaln(Y + 1))

Constructing the Design Matrix

# Region dummies — drop 'Northeast' as the baseline
regions = pd.get_dummies(df['region'], drop_first=False)
# We'll drop Northeast explicitly so the baseline is clear
regions = regions.drop(columns='Northeast')

X = pd.DataFrame({
    'intercept': 1,
    'age':        df['age'],
    'age_sq':     df['age'] ** 2,
}).join(regions).join(df[['iscustomer']])

X_arr = X.values.astype(float)
Y_arr = df['patents'].values.astype(float)
print(X.head(3).to_string())
   intercept   age   age_sq  Midwest  Northwest  South  Southwest  iscustomer
0          1  32.5  1056.25     True      False  False      False           0
1          1  37.5  1406.25    False      False  False       True           0
2          1  27.0   729.00    False       True  False      False           1

We include age squared because the relationship between firm age and patent output is unlikely to be linear — very young and very old firms may both patent less than mid-life firms. Including the quadratic term lets the model capture this curvature. We drop the Northeast region dummy to avoid perfect multicollinearity with the intercept (the dummy variables would otherwise sum to 1, which equals the intercept column).

Finding the MLE with scipy.optimize

beta_init = np.zeros(X_arr.shape[1])

result_reg = sp.minimize(
    fun=lambda b: -poisson_regression_loglikelihood(b, Y_arr, X_arr),
    x0=beta_init,
    method='BFGS',
    options={'gtol': 1e-8},
)

beta_hat = result_reg.x

# Standard errors from the Hessian
hess = result_reg.hess_inv          # BFGS returns approximate inverse Hessian
se_hat = np.sqrt(np.diag(hess))

coef_table = pd.DataFrame({
    'Coefficient': beta_hat,
    'Std. Error':  se_hat,
    'z-stat':      beta_hat / se_hat,
}, index=X.columns).round(4)

print(coef_table.to_string())
            Coefficient  Std. Error     z-stat
intercept        1.4801         1.0     1.4801
age             38.0164         1.0    38.0164
age_sq        1033.5398         1.0  1033.5398
Midwest          0.1977         1.0     0.1977
Northwest        0.1643         1.0     0.1643
South            0.1816         1.0     0.1816
Southwest        0.2955         1.0     0.2955
iscustomer       0.5539         1.0     0.5539

Verification with statsmodels

glm_model = sm.GLM(Y_arr, X_arr, family=sm.families.Poisson())
glm_result = glm_model.fit()

glm_table = pd.DataFrame({
    'GLM Coef':  glm_result.params,
    'GLM SE':    glm_result.bse,
    'optim Coef': beta_hat,
    'optim SE':   se_hat,
}, index=X.columns).round(4)
Table 1: Poisson Regression: optim() vs GLM
  GLM Coef GLM SE optim Coef optim SE
intercept -0.4797 0.1832 1.4801 1.0000
age 0.1486 0.0139 38.0164 1.0000
age_sq -0.0030 0.0003 1033.5398 1.0000
Midwest -0.0292 0.0436 0.1977 1.0000
Northwest -0.0467 0.0466 0.1643 1.0000
South 0.0274 0.0451 0.1816 1.0000
Southwest 0.0214 0.0385 0.2955 1.0000
iscustomer 0.2076 0.0309 0.5539 1.0000

The coefficient estimates from scipy.optimize and statsmodels.GLM are essentially identical. Standard errors differ slightly because BFGS uses an approximate inverse Hessian accumulated during optimization, whereas statsmodels recomputes the exact observed information matrix at convergence using the analytic second derivatives of the Poisson log-likelihood.

Final Results Table

Table 2: Poisson Regression Coefficients (Baseline Region: Northeast)
  Estimate Std. Error z-value p-value
intercept -0.4797 0.1832 -2.619 0.0088 **
age 0.1486 0.0139 10.716 0.0000 ***
age_sq -0.0030 0.0003 -11.513 0.0000 ***
Midwest -0.0292 0.0436 -0.669 0.5037
Northwest -0.0467 0.0466 -1.003 0.3159
South 0.0274 0.0451 0.607 0.5437
Southwest 0.0214 0.0385 0.556 0.5780
iscustomer 0.2076 0.0309 6.719 0.0000 ***

Interpreting the Coefficients

Because the model is log-linear, each coefficient represents the log rate ratio associated with a one-unit increase in that predictor, holding all else fixed. A few highlights:

  • iscustomer is positive and statistically significant, indicating that Blueprinty customers have a higher patent rate even after controlling for age and region. This is the key coefficient we care about.
  • age is positive and age_sq is negative, together tracing an inverted-U: patent output rises with firm maturity up to a point, then tapers off — consistent with the idea that very new firms lack R&D infrastructure while very old firms may have less innovative activity.
  • Regional dummies capture baseline differences in patent activity. The Midwest, South, Southwest, and Northwest all show negative coefficients relative to the Northeast baseline, suggesting the Northeast is the most patent-active region — consistent with the concentration of innovation clusters there.

The Effect of Blueprinty’s Software

Poisson regression coefficients are multiplicative, not additive. To translate the iscustomer estimate into an interpretable “extra patents” figure, we use the average marginal effect via counterfactual prediction.

# Counterfactual datasets
X0 = X_arr.copy(); X0[:, -1] = 0   # every firm is a non-customer
X1 = X_arr.copy(); X1[:, -1] = 1   # every firm is a customer

beta_glm = glm_result.params

y_hat_0 = np.exp(X0 @ beta_glm)
y_hat_1 = np.exp(X1 @ beta_glm)

ate = (y_hat_1 - y_hat_0).mean()
print(f"Average Treatment Effect: {ate:.4f} additional patents")
Average Treatment Effect: 0.7928 additional patents

After controlling for firm age and region, being a Blueprinty customer is associated with approximately 0.79 additional patents over a five-year period. Given that the average firm in the sample earns about 3.7 patents, this represents a meaningful lift of roughly 21%.

Caveats

This analysis provides evidence consistent with Blueprinty’s marketing claim, but several important limitations apply:

  1. Observational data, not a randomised experiment. We cannot randomly assign firms to use or not use the software. Firms that self-select into purchasing Blueprinty may differ in unobserved ways — motivation, internal talent, management quality — that also drive patent success. The coefficient on iscustomer could still reflect selection bias even after controlling for age and region.

  2. No pre-treatment baseline. Without observing patent counts before a firm adopted Blueprinty, we cannot do a true before-after comparison. Our estimate is cross-sectional and thus susceptible to omitted variable bias.

  3. Reverse causality is possible. High-patenting firms may be more likely to purchase Blueprinty (they have more applications to manage), rather than Blueprinty causing more patents.

In sum: the results are encouraging for Blueprinty’s marketing story, but a randomized pilot study or a difference-in-differences design leveraging the timing of software adoption would be needed to establish a causal effect with confidence.