Monday, August 10, 2009

Stackoverflow Data

Stackoverflow.com, in case you haven't heard of it, is a very popular question and answer site for programmers.

It was developed by Jeff Atwood and Joel Spolsky, both well-known technology bloggers, who also produce an entertaining podcast about the development of stackoverflow.com and (more or less) related computer topics.

Anyway, to get to the point: the data at stackoverflow is all user-generated and licensed under Creative Commons. About once a month, they release an XML dump of all the data as one big archive file. The release includes all posts and comments, plus some user profile data.
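If you would rather work with the dump directly, here is a minimal sketch of reading one of its XML files as a stream, so the whole file never has to fit in memory. It assumes each record is a `row` element with its fields stored as attributes; the `iter_rows` helper and the sample data are made up for illustration, so check the element names against the actual files.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, tag="row"):
    """Yield the attributes of each matching element as a plain dict.

    `source` is a filename or a binary file-like object. The tag name
    "row" is an assumption about the dump layout, not a confirmed fact.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield dict(elem.attrib)  # copy before clearing the element
            elem.clear()  # drop the element's attributes and children


# Hypothetical sample standing in for one file of the dump.
sample = io.BytesIO(
    b'<posts><row Id="1" Title="Hello"/><row Id="2" Title="World"/></posts>'
)
rows = list(iter_rows(sample))
```

Each dict could then be handed to whatever loader you use to fill a database table.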

With some proactive assistance from Stéphane Bortzmeyer, I have imported the data from the August dump into an Rdbhost database and made it available to anonymous users, with SELECT privilege only (no changes to the data are permitted). The database engine behind the Rdbhost webservice is PostgreSQL.

The database is at: www.rdbhost.com/rdbadmin/main.html?r0000000767
Just click the login button; no authentication is required. The SQL admin software is a work-alike to Adminer (formerly phpMinAdmin), implemented in JavaScript.

Hopefully, the table and field names make sense and their meanings can be inferred. If you find the indexes insufficient for your purposes, email me or leave a comment on this blog entry. There is a 3-second limit on query duration, so any query that needs a full-table scan on one of the larger tables will likely fail.
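To stay under the time limit, it usually helps to filter on an indexed column and cap the number of rows returned. Here is a minimal sketch of building such a query. The table name `posts`, the column name `owner_user_id`, and the `%s` placeholder style are all assumptions for illustration; look up the real names in the admin page, and use whatever placeholder syntax your client expects.

```python
def bounded_query(table, column, limit=50):
    """Build a SELECT that filters on one column and caps the row count.

    The value to match is left as a "%s" placeholder (an assumed style)
    so that it can be passed separately as a query parameter.
    """
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("table and column must be plain identifiers")
    limit = int(limit)
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT * FROM {table} WHERE {column} = %s LIMIT {limit}"


# Hypothetical table and column names.
sql = bounded_query("posts", "owner_user_id", limit=25)
```

Whether this finishes in time still depends on an index existing for the filtered column.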

Have fun.

Edited: Added mention of PostgreSQL and made a few grammar/punctuation corrections.

1 comment:

  1. Three seconds is not always sufficient time for a query, even for a web-page backend query.

    The time limit has been increased to 8 seconds.
