Studying the Wikipedia Over Time

One of the fastest-growing and most controversial sites on the Web is the Wikipedia. As of a few moments ago, the front page of the site indicated that it had 1,416,076 articles in English alone. There are many more articles in other languages. The Wikipedia:Multilingual coordination page indicates that there are presently “somewhat active” Wiki encyclopedias in 114 languages.

Can studying the Wikipedia tell us things about the growth and structure of the World Wide Web?

A paper prepared for the Web Intelligence Conference in Hong Kong this coming December takes a look at the structure of links between pages of the Wikipedia, which it calls a wikigraph.

The paper is Temporal Analysis of the Wikigraph (pdf), from the Dipartimento di Informatica e Sistemistica at Università di Roma “La Sapienza”.

What makes this interesting is that there are time stamps associated with changes to the Wikipedia. As the authors note:

The Wikigraph differs from other Web graphs studied in the literature by the fact that there are timestamps associated with each node. The timestamps indicate the creation and update dates of each page, and this allows us to do a detailed analysis of the Wikipedia evolution over time.

Can a study of updates and changes to the Wikipedia as reflected in these timestamps provide some insights into other “webgraphs” where such time information isn’t usually available? That’s one of the inquiries that is explored in the paper.

In addition to providing some insight into the structure of the web, the paper provides some tidbits about the Wikipedia itself. For instance:

Only about 7.5% of the articles on the Wikipedia have a single editor.

About 50% have more than 7 people involved.

Around 5% have had more than 50 editors.

The average number of updates per user has dropped by about 30% in the last two years.

The average number of out-links per article has grown from 7 to 16 over the past two and a half years.
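To make the out-links figure a little more concrete, here is a minimal sketch of how an average like that could be computed from a timestamped link graph. It is not the authors' method; the data layout, the page names, the dates, and the function name average_out_links are all made up for illustration.

```python
from datetime import date

# Hypothetical toy data, not taken from the paper.
# Each link is (source_page, target_page, date_the_link_was_added).
links = [
    ("A", "B", date(2004, 1, 10)),
    ("B", "C", date(2004, 3, 5)),
    ("A", "C", date(2005, 6, 1)),
    ("C", "A", date(2006, 2, 20)),
]

# Hypothetical creation date for each page (the node timestamps).
created = {
    "A": date(2004, 1, 1),
    "B": date(2004, 2, 1),
    "C": date(2004, 3, 1),
}


def average_out_links(links, created, as_of):
    """Average number of out-links per page in the graph as it stood on `as_of`."""
    # Only pages that already existed on the snapshot date count.
    pages = {page for page, born in created.items() if born <= as_of}
    if not pages:
        return 0.0
    # Only links added by the snapshot date, between pages that existed then.
    link_count = sum(
        1
        for source, target, added in links
        if added <= as_of and source in pages and target in pages
    )
    return link_count / len(pages)


early = average_out_links(links, created, date(2004, 12, 31))
late = average_out_links(links, created, date(2006, 12, 31))
print(f"end of 2004: {early:.2f} out-links per page")
print(f"end of 2006: {late:.2f} out-links per page")
```

Taking a snapshot at two different dates and comparing the averages is the kind of comparison that the timestamps make possible, and that an ordinary crawl of the web without them does not.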

There are also some interesting statistics about vandalism and the amount of time it takes to address acts of vandalism. For instance, when someone vandalizes a page by a mass deletion of content, there’s often a correction made within three minutes of that act – really an incredible figure.

Great paper.

4 thoughts on “Studying the Wikipedia Over Time”

  1. Hi Barry,

    I haven’t looked at the other papers that will be presented at this conference yet, but this one is one of the best papers I’ve read in a while.

    The brain analogy is a good one, but I really like the comparison made early on in the paper about measuring the web being like looking at a star-filled sky at night – there’s no way to tell how far away those stars are, and the light from each has traveled different distances to get to its point in the night sky.

  2. A great paper indeed, Bill. It makes you think of the analogy to all this in a single human brain. Synaptic circuits keep flashing away, strengthening some associations and deleting or modifying others. The astonishing thing in Wikipedia is the number of active neurons (if that’s the right word) who are monitoring all this activity to end up with an average 3 minute response to any stimulus. Given the human/computer interaction, that’s staggering.

  3. I was thinking lately about how Wikipedia may be polluted by all those marketing types, trying to get a link from a relevant page on a trusted website. I for one was considering it sometime earlier, but dropped the idea, unless my content does provide value (speaking of which, someone did link to me from Wikipedia already).

    And yes, it is really astonishing how Wikipedia operates – it can rightfully be compared to such complex structures as the human brain or the universe – or the Web.

    The study might make a good night read if I make it, I guess. Thanks for bringing this up, Bill.

  4. Thanks, Yuri.

    That’s a topic that isn’t discussed much, but worth considering – how much of an impact commercial activities and marketing might have upon the Wikipedia.

    The study is an interesting one, and had me thinking about some of the other things on the web that had explicit timestamps upon them – blog posts, comments in places like Flickr, and so on.

    I think that it shows some of the potential benefits of having an archive of the web over time from an indexing stance.
