Scientific citations in Wikipedia

Is Wikipedia a reliable source for sources? Seems so!
Finn Årup Nielsen


The Internet-based encyclopædia Wikipedia has grown to become one of the most visited Web sites on the Internet, but critics have questioned the quality of entries. An empirical study of Wikipedia found errors in a 2005 sample of science entries. Biased coverage and lack of sources are among the “Wikipedia risks.”

The study here describes a simple assessment of these aspects by examining the outbound links from Wikipedia articles to articles in scientific journals, with a comparison against journal statistics, such as impact factors, from Journal Citation Reports.

The results show an increasing use of structured citation markup and good agreement with citation patterns seen in the scientific literature though with a slight tendency to cite articles in high-impact journals such as Nature and Science. These results increase confidence in Wikipedia as a reliable information resource for science in general.

The study

This study went through the entire English Wikipedia corpus (2.5 gigabytes, compressed!) to identify structured scientific citations and count them to see which journals were cited the most.
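The counting step described above can be sketched roughly as follows. This is only an illustration, not the code used in the study: it assumes that a structured citation is a {{cite journal ...}} template with a "journal" field, and the names CITE_RE, JOURNAL_RE, normalize and count_journals are made up for this example. Nested templates inside a citation are not handled.

```python
import re
from collections import Counter

# Assumption: structured citations are {{cite journal ...}} templates with a
# "journal = ..." field. The study's actual matching rules may differ.
CITE_RE = re.compile(r"\{\{\s*cite journal\b(.*?)\}\}", re.IGNORECASE | re.DOTALL)
# The field value may contain a [[wiki link|with a pipe]], so links are
# consumed whole before falling back to "anything except | or }".
JOURNAL_RE = re.compile(
    r"\|\s*journal\s*=\s*((?:\[\[[^\]]*\]\]|[^|}])*)", re.IGNORECASE
)


def normalize(name):
    """Strip wiki link brackets, italic quote marks and extra whitespace."""
    name = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", name)
    name = name.replace("''", "")
    return " ".join(name.split())


def count_journals(wikitext):
    """Count journal names found in cite journal templates in the text."""
    counts = Counter()
    for match in CITE_RE.finditer(wikitext):
        field = JOURNAL_RE.search(match.group(1))
        if field:
            journal = normalize(field.group(1))
            if journal:
                counts[journal] += 1
    return counts
```

Calling count_journals on the text of each page in the dump and summing the resulting counters would give a per-journal total; count_journals(text).most_common(10) then lists the most cited journals.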

Some interesting points:

...And a Danish note: To compare against the Wikipedia citation counts, the present de facto standard for counting journal citations was used: the Journal Citation Reports (JCR) from Thomson Scientific. The JCR are available on the web, but the company requires a paid subscription to view the numbers.

Most recent analysis

The most recent analysis I have done is for the July 2007 Wikipedia database dump. The dump has now grown to a 2.9-gigabyte compressed file. The image below shows the result as a scatter plot in which the Wikipedia citations to each journal are compared to a number combining total citations and impact factor from Journal Citation Reports. The upper right corner has Nature and Science, while the journal shown as the leftmost dot is Australian Systematic Botany.

Scatter plot of Wikipedia citations and Journal Citation Reports. Wikipedia data from July 2007.
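One way to put a number on the agreement seen in such a scatter plot is a rank correlation between the two counts per journal. The sketch below is an assumption on two points: it takes the "combined number" to be the product of total citations and impact factor, and it uses the Spearman rank correlation, which this page does not name as the statistic used. The function names combined_score, ranks and spearman are hypothetical.

```python
import math


def combined_score(total_cites, impact_factor):
    # Assumption: "combined" is taken to be the product of the two JCR numbers.
    return total_cites * impact_factor


def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

With one list of Wikipedia citation counts and one list of combined_score values, in the same journal order, spearman returns a value between -1 and 1. A rank-based measure is a natural fit here because the counts span several orders of magnitude.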

A word of caution

Wikipedia is evolving and requires no strict formatting of references. A citation may be formatted in a variety of ways; it may be removed or reformatted.

Not all citations from Wikipedia were counted. Many citations use a free-hand format for the reference, and in the study I did not attempt to count these. I estimate that at least half (and probably more) of the references used the free-hand format as of July 2007. The numbers in the article are only for “one-line” citations. A structured citation may in fact span multiple lines, and these were not counted. The most recent analysis does count them (shown in the scatter plot above). There are a number of citations that are not matched by my algorithm, since the reference may be poorly formatted or lack the essential journal information. These amount to only a minor part, less than 5%. I believe that these issues affect the numbers equally, so that the relative numbers can be trusted.
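The difference between “one-line” and multi-line structured citations can be illustrated with two patterns that differ only in whether a line break is allowed inside the template. This is a sketch under the assumption that structured citations are {{cite journal ...}} templates; it is not the matching code from the study, and ONE_LINE_RE, ANY_LINE_RE and count_citations are names invented here. Free-hand references match neither pattern.

```python
import re

# Assumption: structured citations are {{cite journal ...}} templates.
# One-line pattern: no line break anywhere between "{{" and "}}".
ONE_LINE_RE = re.compile(r"\{\{[ \t]*cite journal\b[^{}\n]*\}\}", re.IGNORECASE)
# General pattern: line breaks allowed, nested braces still excluded.
ANY_LINE_RE = re.compile(r"\{\{\s*cite journal\b[^{}]*\}\}", re.IGNORECASE)


def count_citations(wikitext):
    """Return (one_line, multi_line) counts of cite journal templates."""
    one_line = len(ONE_LINE_RE.findall(wikitext))
    total = len(ANY_LINE_RE.findall(wikitext))
    # Every one-line match is also a general match, so the rest are multi-line.
    return one_line, total - one_line
```

A scan that only looks at single lines would report the first number and miss the second.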

References and Downloads

The study was published in the electronic open-access journal “First Monday”, in the August 2007 issue. The PDF files from arXiv and my university department publication database have higher-resolution images, but do not incorporate the final edits that First Monday made.


Dansk omtale (Danish comments)


Finn Årup Nielsen is a postdoc at the Department of Informatics and Mathematical Modelling at the Technical University of Denmark, on a grant from the Lundbeckfonden to CIMBI. He is also attached to the Neurobiology Research Unit at the Copenhagen University Hospital Rigshospitalet. He contributes from time to time to the Danish and English language Wikipedias as the user “fnielsen”.

Previous study by the author: Mining Posterior Cingulate
Newer study by the author: Clustering scientific citations in Wikipedia

$Id: Nielsen2007Scientific.html,v 1.18 2008/07/14 14:55:10 fn Exp $