Why is academic writing so dense?

Vivian Kwok
Towards Data Science
7 min read · May 24, 2018


I analyzed 100K papers to find out.

Motivation

Recently, I took a long break from posting on Medium: the second half of the Public Finance class at Stanford is about structural estimation. These papers easily run up to 100 pages each, and are filled with Greek letters and lengthy, complex sentences.

This painful experience left me wondering: why aren’t researchers motivated to write more simply?

Theory & Testable Predictions

Should academic writing be simpler?

  • Theory 1: You might think it is a no-brainer: simple writing is easier to understand, which in turn leads to more citations. [Yes]
  • Theory 2: Unfortunately, the peer-review publication game can be pretty messed up at times. Authors may write lengthy, complex papers to deflect every imaginable criticism. They might also use complexity or length to signal the quality of their paper; after all, no one wants to read 100 pages of Greek letters, not even the referees. [No]
  • Theory 3: A more innocuous view is that complex writing is necessary for expressing complex ideas. [No]

Testable Predictions

  • Prediction 1: simpler papers get more citations.
  • Prediction 2: complex papers get more citations.
  • Prediction 3: citations are unaffected by writing style.

Preview of Results

I found evidence for all 3 theories:

  • Academic writing is so dense because citations don’t seem to be affected by density. In some cases, researchers might even get rewarded for it, possibly because it makes flaws harder to detect.
  • There’s a lot of heterogeneity across subjects and cultural contexts.
  • My results are by no means final, as there might be other confounding factors that I haven’t yet considered. Leave me a message if you have any suggestions on how I can make my analysis more rigorous.

Let’s get started :)

Data

To answer this question, I use citation data from the 2008 SIGKDD paper ArnetMiner: Extraction and Mining of Academic Social Networks, which gathered metadata for 3 million papers, including each paper’s abstract, venue, year, and reference list.

(Figure: results based on the first 100K papers)

100K Sample

Since I have limited processing power and memory, I took the first 100K papers as my sample. If you’re interested in replicating my analysis, I’ve provided my Python code at the end of this blog. We can compare our results and see whether the first 100K papers were representative.

Readability Measure

I compute the Linsear Write score for each paper’s abstract. I picked Linsear Write because, purportedly, it was developed by the US Air Force to calculate the readability of their technical manuals, as opposed to being some random academic construct. (A rough sketch of the formula follows the two assumptions below.)

Implicitly, I am making 2 assumptions:

  • The complexity of the abstract is a good proxy for the complexity of the paper.
  • Linsear Write is a good measure of text complexity.
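
To make that second assumption concrete, here is a rough sketch of how the Linsear Write formula works. This is not textstat’s exact implementation (the syllable counter below is a crude vowel-group heuristic), but it captures the idea: easy words score 1 point, hard words score 3, and the points-per-sentence figure is converted into an approximate US grade level.

import re

def crude_syllables(word):
    # very rough heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def linsear_write(text, sample_size=100):
    # score the first ~100 words: <3 syllables = 1 point, >=3 syllables = 3 points
    words = text.split()[:sample_size]
    points = sum(1 if crude_syllables(w) < 3 else 3 for w in words)
    sentences = max(1, len(re.findall(r'[.!?]+', ' '.join(words))))
    score = points / sentences
    # convert the provisional score into an approximate US grade level
    if score <= 20:
        score -= 2
    return score / 2

print(linsear_write('The quick brown fox jumps over the lazy dog. It was not amused.'))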

Correlation Coefficients:

  • Corr[# of citations, readability]: -0.026
  • Corr[year, readability]: 0.013
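
For reference, these raw (pre-demeaning) correlations can be reproduced with pandas once df carries the num_citation and linsear columns built in the appendix; a minimal sketch:

# assumes df already has 'num_citation', 'linsear' and 'year' columns (see the appendix)
print(df['num_citation'].corr(df['linsear']))  # citations vs. readability score
print(df['year'].corr(df['linsear']))          # readability drift over time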

Observations

The relationship between the number of citations and the Linsear readability score is non-linear and highly variable: below a 10th-grade reading level, the number of citations increases with writing complexity; above a 10th-grade level, it decreases with writing complexity. But this relationship seems rather weak, since there is a lot of variability around the 10th-grade mark.

The relationship between the Linsear score and year is less obvious. More papers are published every year, so the number of easy papers and the number of hard papers both increased. The number of hard papers increased more, relative to the number of easy papers, and the hard papers got harder over time.

Causal Interpretation

Correlation != Causality

What is stopping me from concluding that complex writing leads to fewer citations? Let me check whether I’m falling into any of the common causal-inference pitfalls:

  • Reverse causality: could low citations lead to a complex writing style?
  • Coincidental correlation: the correlation is due to random chance. Unlikely, given the large sample size, but to be rigorous I can run a significance test (see the sketch after this list).
  • Omitted variable bias: a third factor causes both a complex writing style and low citations.
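
As a quick check on the coincidental-correlation concern, here is a minimal sketch of the significance test, assuming the df built in the appendix. scipy’s pearsonr reports the correlation together with a two-sided p-value, which is equivalent to the t-test on the correlation coefficient:

from scipy import stats

# assumes df carries 'linsear' and 'num_citation' columns as in the appendix
r, p_value = stats.pearsonr(df['linsear'], df['num_citation'])
print('r = {:.3f}, p = {:.3g}'.format(r, p_value))

# the same test by hand: t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom
n = len(df)
t = r * ((n - 2) / (1 - r ** 2)) ** 0.5
print('t = {:.2f} with {} degrees of freedom'.format(t, n - 2))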

Omitted variable bias is a real concern. Possible third factors causing both a complex writing style and low citations are:

  • Example 1: theoretical fields have harder terminology and fewer researchers. But since these hard terms are commonly used by all researchers in the field, they don’t tax the readers much; the low citation counts are simply an artifact of having fewer researchers in the field to cite them.
  • Example 2: over time, fields become more advanced and specialized. More advanced concepts might necessitate more complex writing, and more specialized questions might generate fewer citations.
  • Solution: (1) for each paper, demean the citations and complexity using the journal average in that year; (2) compute the correlation between demeaned citations and demeaned complexity for each journal. (A minimal sketch follows this list.)
  • Note: my solution here deals with the two omitted-variable biases that I think are the most important. There could be other confounding factors that I have missed.
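
Concretely, the demeaning step subtracts the venue-year average from each paper’s citation count and complexity score, so that every paper is compared only with its direct peers. A minimal sketch on a toy DataFrame (the full version on the real data is in the appendix):

import pandas as pd

toy = pd.DataFrame({
    'venue':        ['A', 'A', 'A', 'B', 'B'],
    'year':         [2000, 2000, 2001, 2000, 2000],
    'linsear':      [8.0, 12.0, 10.0, 14.0, 16.0],
    'num_citation': [5, 3, 7, 2, 6],
})

# subtract each venue-year group's mean from the two columns of interest
demeaned = toy.groupby(['venue', 'year'])[['linsear', 'num_citation']].transform(lambda g: g - g.mean())
toy['demeaned_linsear'] = demeaned['linsear']
toy['demeaned_citation'] = demeaned['num_citation']

# per-venue correlation between demeaned complexity and demeaned citations
print(toy.groupby('venue')[['demeaned_linsear', 'demeaned_citation']].corr())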

Demeaned Correlation Coefficients

For each journal, I compute the correlation between demeaned citations and demeaned readability. Here’s the distribution of correlation coefficients across 1,584 journals.
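
If you are replicating this, a minimal sketch of plotting that distribution, assuming the per-journal correlations Series built in the appendix:

import matplotlib.pyplot as plt

# 'correlations' is the sorted per-venue Series from the appendix code
correlations.hist(bins=50)
plt.xlabel('Corr[demeaned readability, demeaned citations]')
plt.ylabel('# of journals')
plt.title('Distribution of per-journal correlations')
plt.show()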

Observations:

On average, a paper that’s more complex than its peers has slightly fewer citations. But the effect is really small, suggesting that academics don’t get penalized, in terms of citations, for writing abstruse papers.

There is a great deal of heterogeneity: there are plenty of journals with negative correlations, zero correlations, and positive correlations, providing evidence for all three theories.

Follow-up: what are the journals with positive, zero or negative correlations?

Click here for the full list.
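
If you would rather regenerate that list yourself, a minimal sketch using the sorted correlations Series from the appendix:

# 'correlations' is the sorted per-venue Series from the appendix code
print(correlations.head(10))                    # most negative correlations
print(correlations[correlations.abs() < 0.01])  # roughly zero correlations
print(correlations.tail(10))                    # most positive correlations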

Economics

Unfortunately for me, the only two top economics journals in this sample have strongly positive correlations:

  • Journal of Economic Theory: 0.18
  • Games and Economic Behavior: 0.35

This is consistent with the view that complex papers are harder to find flaws in, so more readers buy the conclusions and cite them. This corroborates many anecdotes from senior researchers in the field.

Computer Science

Most of the Association for Computing Machinery (ACM) journals have strongly negative correlations. This is expected, since most follow-up papers would need to replicate a paper’s results; if the paper is hard to understand, later papers may not want to replicate it.

Word Cloud

(Word cloud for the bottom quartile: Corr[Readability, Citations] < -0.22)

(Word cloud for the top quartile: Corr[Readability, Citations] > 0.18)

Observations:

  • Whether writers get penalized for writing complex sentences depends on the subject matter: the first word cloud features more practical subjects, while the second features more abstract subjects.
  • There are important cultural differences. Coming from Asia, I can attest that teachers put a great deal of emphasis on writing complex sentences and using fancy vocabulary. When I first came to the U.S., I had to un-learn many of my old habits.

Conclusions

I found evidence for all 3 theories:

  • Academic writing is so dense because citations don’t seem to be affected by density. In some cases, researchers might even get rewarded for it, possibly because it makes flaws harder to detect.
  • There’s a lot of heterogeneity across subjects and cultural contexts.
  • My results are by no means final, as there might be other confounding factors that I haven’t yet considered. Leave me a message if you have any suggestions on how I can make my analysis more rigorous.

Appendix: Python Code

Let’s import some Python modules:

import re
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from textstat.textstat import textstat

Load the JSON objects into a pandas DataFrame:

# 'inPath' is the path to the data file in JSON-lines format
df = pd.read_json('inPath', lines=True)
df = df.head(100000)  # keep the first 100K papers

Drop rows with missing data:

df = df.dropna()  # drop rows with missing fields so the later steps use the cleaned frame

Count the number of citations:

df['num_citation']=df['references'].apply(lambda x: len(re.split(',',str(x))))

Compute the readability of each abstract, using the Linsear metric:

df['linsear'] = df['abstract'].apply(lambda x:textstat.linsear_write_formula(x))

Plot # of citations against readability:

plt.plot(df['linsear'],df['num_citation'],'ro')
plt.ylabel('# of citations')
plt.xlabel('linsear')
plt.title('# of citations vs. linsear')
plt.show()

Demean citations and readability:

demean = lambda df: df - df.mean()
# demean within each venue-year cell, using only the two numeric columns we need
df1 = df.groupby(['venue', 'year'])[['linsear', 'num_citation']].transform(demean)
df['demeaned grade level'] = df1['linsear']
df['demeaned citation'] = df1['num_citation']
# per-venue correlation between demeaned readability and demeaned citations
correlations = df.groupby('venue')[['demeaned grade level', 'demeaned citation']].corr().unstack()
correlations = correlations[('demeaned grade level', 'demeaned citation')].sort_values()

Plot the word cloud for the negative-correlation journals:

# journals in the bottom and top quartiles of the correlation distribution
neg_corr = correlations[correlations < -0.22]
pos_corr = correlations[correlations > 0.18]
pos_text = pos_corr.index.str.cat(sep=',')
neg_text = neg_corr.index.str.cat(sep=',')

# words that appear in both groups carry no signal, so treat them as stopwords
common_set = set(pos_text.lower().split()) & set(neg_text.lower().split())
common_list = []
for e in common_set:
    for w in re.split(r'\W+', e):
        if w != '':
            common_list.append(w)
common_list = common_list + ['Review', 'Letters', 'Application', 'Personal']

wc = WordCloud(stopwords=common_list).generate(neg_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.savefig('your path to word cloud with negative correlations')
