Hi, Jeremy,
I'm supervising (just from an academic point of view) Felipe's work. We
have (I guess) enough disk space, and probably could get more if needed.
We also have some experience in analyzing libre (free, open source)
software projects (have a look at http://libresoft.urjc.es ), and
probably several of the techniques we're using for that could be applied
to analyzing Wikipedia.
So, if you're interested, we could explore how to collaborate. For now,
we're using the database dumps for the analysis. From your message it
seems that you're spidering the content from the website, am I right?
Regards,
Jesus.
On Mon, 2006-06-26 at 11:36 -0500, Jeremy Dunck wrote:
On 6/26/06, Felipe Ortega
<glimmer_phoenix(a)yahoo.es> wrote:
In the past few weeks I've read a bunch of
mail messages talking about what is precisely my first goal: extracting behavioral
conclusions from a quantitative analysis of Wikipedia database dumps in all languages.
But, despite all my efforts, and some mails offering myself as a contributor, I have
received no answer from the Wikipedia community. I only wanted to contribute in a very
interesting area (I think), and I hope it could lead me to build an interesting thesis
on this topic. I'm currently developing some scripts in Python that analyze the
database dumps.
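
A minimal sketch of what such a dump-crunching script might look like, assuming the
standard pages-meta-history XML export format; the per-editor counting rule and the
command-line handling below are illustrative, not Felipe's actual code:

import sys
from xml.etree.ElementTree import iterparse

def localname(tag):
    # Dump elements carry a MediaWiki export namespace ("{...}page"); strip it.
    return tag.rsplit('}', 1)[-1]

def revisions_per_editor(dump_path):
    # Stream an uncompressed pages-meta-history dump and count revisions per editor.
    counts = {}
    editor = None
    for event, elem in iterparse(dump_path, events=('end',)):
        tag = localname(elem.tag)
        if tag in ('username', 'ip'):
            editor = elem.text
        elif tag == 'revision':
            if editor is not None:
                counts[editor] = counts.get(editor, 0) + 1
            editor = None
            elem.clear()            # keep memory bounded on big dumps
        elif tag == 'page':
            elem.clear()
    return counts

if __name__ == '__main__':
    counts = revisions_per_editor(sys.argv[1])
    for editor, n in sorted(counts.items(), key=lambda item: item[1],
                            reverse=True)[:20]:
        print("%s\t%d" % (editor, n))
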
I wrote a paper with some preliminary results, in case you would like to take a look at it. I only ask
for some collaboration from anyone involved with the project, because otherwise maybe I
should simply conclude that my efforts don't matter to anyone.
Felipe,
I'm a bachelor's student at the University of Texas at Dallas, also
working on what I'm calling fine-grained statistics for Wikipedia,
using Python to interpret text from database dumps. :)
I've been working on it off and on for almost a year. The big
problems for me have been disk space and wikitext parsing. After
fiddling for quite a while trying to write my own parser, I have
finally broken down and am using the HTML as rendered by MediaWiki,
then using that as the basis for the rest.
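
One plausible version of that step, assuming the HTML is fetched from a MediaWiki
install via index.php?action=render and reduced to plain text; the URL, User-Agent
and tag handling here are guesses, not Jeremy's actual code:

from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import Request, urlopen

class TextExtractor(HTMLParser):
    """Collect only the text content of rendered HTML, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)
    def text(self):
        return ''.join(self.chunks)

def rendered_text(title, base='http://en.wikipedia.org/w/index.php'):
    # action=render returns just the parsed article body, without the skin.
    url = '%s?title=%s&action=render' % (base, quote(title, safe=''))
    req = Request(url, headers={'User-Agent': 'dump-stats-sketch/0.1'})
    html = urlopen(req).read().decode('utf-8')
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text()

if __name__ == '__main__':
    print(rendered_text('Stochastic process')[:400])
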
My basic goal is to provide statistics for things such as "how many
revisions has this piece of text survived" and similar, then render
that information onto a reader's Wikipedia browser page. As a second
goal, it'd be nice to find some combination of stats that suggests
which bits of a page are more likely to be trustworthy.
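
A toy version of the "revisions survived" statistic: diff consecutive revisions at
the word level and carry an age counter forward for text that stays put. The
tokenisation and counting rule are illustrative choices, not necessarily the
definition Jeremy has in mind:

from difflib import SequenceMatcher

def survival_counts(revisions):
    """revisions: list of plain-text revision bodies, oldest first.
    Returns (words, ages) for the newest revision, where ages[i] is the
    number of consecutive revisions in which word i has been present."""
    words, ages = [], []
    for text in revisions:
        new_words = text.split()
        new_ages = [1] * len(new_words)
        matcher = SequenceMatcher(None, words, new_words)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                # Text that survived this revision gets one revision older.
                for offset in range(i2 - i1):
                    new_ages[j1 + offset] = ages[i1 + offset] + 1
        words, ages = new_words, new_ages
    return words, ages

if __name__ == '__main__':
    revs = ["the cat sat on the mat",
            "the cat sat on the red mat",
            "a cat sat on the red mat quietly"]
    for word, age in zip(*survival_counts(revs)):
        print("%-8s survived %d revision(s)" % (word, age))
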
How far have you gotten, and does it sound like we're on the same track?
Cheers,
Jeremy
--
-----
Jesus M. Gonzalez Barahona | GSyC (DITTE)
Edificio Departamental II (ESCET) | jgb(a)gsyc.escet.urjc.es / jgb(a)computer.org
Universidad Rey Juan Carlos | tel: +34 91 664 74 67
c/ Tulipan s/n | fax: +34 91 664 74 94
28933 Mostoles, Spain