Hi, I'm a student planning on doing GSoC this year on MediaWiki. Specifically, I'd like to work on data dumps.
I'm writing this to gauge what would be useful to the research community. Several ideas thrown about include:
1. JSON dumps
2. SQLite dumps
3. Daily dumps of revisions from the last 24 hours
4. Dumps optimized for very fast import into various external storage, and for smaller size (diffs)
5. JSON/CSV for Special:Import and Special:Export
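To make idea 1 concrete, here is a rough sketch of what a single revision record in a JSON dump could look like. The field names are hypothetical, loosely mirroring the `<page>`/`<revision>` elements of the existing XML dumps, not any agreed-upon format:

```python
import json

# Hypothetical shape for one revision in a JSON dump; field names
# loosely mirror the <page>/<revision> elements of the XML dumps.
revision = {
    "page_id": 12,
    "page_title": "Anarchism",
    "namespace": 0,
    "revision": {
        "id": 420000000,
        "parent_id": 419999999,
        "timestamp": "2011-03-31T14:27:00Z",
        "contributor": {"username": "Example", "id": 1234},
        "comment": "copyedit",
        "text": "...wikitext...",
    },
}

# One JSON object per revision would also make line-oriented
# streaming (one record per line) straightforward.
print(json.dumps(revision, indent=2))
```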
Would any of these be useful? Or is there anything I'm missing that you would consider much more useful?
Feedback would be invaluable :)
Thanks :)
I think I would be very interested in 3, or even in having, every month, a dump of that month's revisions. Since I have built tools for the XML dumps, keeping the format unchanged would be good for me (and for WikiTrust).
I would find incremental dumps (with occasional, yearly, full dumps) much easier to manage than full dumps.
Luca
On Thu, Mar 31, 2011 at 2:27 PM, Yuvi Panda yuvipanda@gmail.com wrote:
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
If periodic update dumps are being considered, information that describes changes to old data (page deletes, user renames, etc.) would be very useful to have alongside new revisions.
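For instance, an update dump could carry typed change records alongside new revisions, so consumers can keep previously imported data consistent. This is purely a sketch; the record types and field names here are made up, not an existing MediaWiki format:

```python
import json

# Hypothetical change records for a periodic update dump: besides
# new revisions, it notes deletions, renames, and similar events
# that invalidate or alter previously dumped data.
changes = [
    {"type": "revision", "page_id": 12, "rev_id": 420000001},
    {"type": "page_delete", "page_id": 99, "timestamp": "2011-03-31T10:00:00Z"},
    {"type": "user_rename", "old_name": "Foo", "new_name": "Bar"},
]

# A consumer can dispatch on the record type to update its store.
for change in changes:
    print(json.dumps(change))
```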
-Aaron

On Mar 31, 2011 6:27 PM, "Luca de Alfaro" luca@dealfaro.org wrote:
Would incremental dumps, as described by Brion a long time ago (http://leuksman.com/log/2007/10/14/incremental-dumps/), be what you're looking for?
On Fri, Apr 1, 2011 at 5:01 AM, Aaron Halfaker aaron.halfaker@gmail.com wrote:
Yes. That's a lot like what I had in mind.
On Thu, Mar 31, 2011 at 7:33 PM, Yuvi Panda yuvipanda@gmail.com wrote:
Not quite... if I am reading Brion's proposal correctly, it would list all the pages that changed in a specific interval. If the interval is large, say a month, and the full history of each changed page is provided, the result could be very large. What I was suggesting is to include only the changes (the revisions) that occur in a specific time span.
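In other words, a monthly dump would keep only revisions whose timestamp falls inside that month, regardless of how much history each page has. A rough sketch, with a made-up in-memory record format standing in for the real dump stream:

```python
from datetime import datetime

# Hypothetical revision stream; a real implementation would read
# the XML dump rather than an in-memory list.
revisions = [
    {"rev_id": 1, "timestamp": "2011-02-15T12:00:00Z"},
    {"rev_id": 2, "timestamp": "2011-03-05T09:30:00Z"},
    {"rev_id": 3, "timestamp": "2011-03-28T23:59:00Z"},
]

def in_month(rev, year, month):
    # Dump timestamps are UTC with a trailing Z.
    ts = datetime.strptime(rev["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return ts.year == year and ts.month == month

# Only the revisions made in March 2011; rev 1 falls outside it.
march = [r for r in revisions if in_month(r, 2011, 3)]
print([r["rev_id"] for r in march])  # → [2, 3]
```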
Luca
On Thu, Mar 31, 2011 at 5:33 PM, Yuvi Panda yuvipanda@gmail.com wrote:
That's not the way I read Brion's proposal. It looks to me like there would only be records for each new revision and for those revisions and pages that were updated, and that no old data that had not been updated or created would be included. Either way, this is essential. I'm sure no one would disagree.
-Aaron
On Fri, Apr 1, 2011 at 12:06 PM, Luca de Alfaro luca@dealfaro.com wrote: