Monday, September 7, 2009

Stackoverflow Data


The stackoverflow database account here has been updated to include the September data dump. That includes cumulative data up to 31 August.

A fairly complete set of indexes was added to the basic tables. Long text fields, like message bodies, do not benefit much from ordinary indexes: you rarely search on the whole content, and the key content is not necessarily at the beginning of the field, which is where an index would accelerate access. So the short fields are indexed and the large text fields are not.

Now, full text indexing might be useful, but the typical use case is to find posts on a given topic, and for that, searching the stackoverflow site with its on-site search, or Googling, would generally be more useful. Full text search in PostgreSQL works best with non-standard functions that normalize the search terms; this gets better results than a straight keyword search, but involves a learning curve.

Maybe queries like 'how many posts about python have scores over 100?' would be useful, but that can be approximated by querying on tags joined to posts via tagging.
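As a rough sketch of that kind of query, here is a small self-contained example. It uses Python's built-in sqlite3 module with an in-memory database rather than the PostgreSQL account itself, and the table and column names (posts, tags, tagging, post_id, tag_id, score, name) are assumptions for illustration, not necessarily the names used in the actual database.

```python
import sqlite3

# In-memory stand-in database. The schema below is hypothetical:
# the real stackoverflow tables may use different names and columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts   (id INTEGER PRIMARY KEY, title TEXT, score INTEGER);
CREATE TABLE tags    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tagging (post_id INTEGER REFERENCES posts(id),
                      tag_id  INTEGER REFERENCES tags(id));
-- Short fields get ordinary indexes; there is no long body column here.
CREATE INDEX tags_name_idx    ON tags (name);
CREATE INDEX tagging_tag_idx  ON tagging (tag_id);
""")

# Made-up sample rows, only to exercise the query.
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    (1, "first post", 150),
    (2, "second post", 40),
    (3, "third post", 300),
    (4, "fourth post", 120),
])
conn.executemany("INSERT INTO tags VALUES (?, ?)", [
    (1, "python"),
    (2, "java"),
])
conn.executemany("INSERT INTO tagging VALUES (?, ?)", [
    (1, 1), (2, 1), (3, 2), (4, 1), (4, 2),
])

# Count posts carrying a given tag whose score exceeds a threshold.
COUNT_BY_TAG = """
SELECT COUNT(DISTINCT p.id)
FROM posts p
JOIN tagging tg ON tg.post_id = p.id
JOIN tags t     ON t.id = tg.tag_id
WHERE t.name = ? AND p.score > ?
"""

def count_posts(conn, tag, min_score):
    """Return how many posts tagged `tag` have score > `min_score`."""
    return conn.execute(COUNT_BY_TAG, (tag, min_score)).fetchone()[0]

print(count_posts(conn, "python", 100))
```

The same SELECT should carry over to PostgreSQL with little change, apart from the placeholder style of whichever driver is used and the real table names.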

I hope the database is useful. Let me know if you have any comments or complaints.

The stackoverflow database is at:
Just log in with the default login and no password.

David
