Hi,
I would like to share my idea (written up in the article) for reducing the size of the pageviews dump: https://en.wikipedia.org/wiki/User:Du%C5%A1an_Krehe%C4%BE/Signpost_draft:New...)
The primary technical content is done.
Dušan Kreheľ
Hello Dušan,
I find this really fascinating. Unfortunately, it looks like the article doesn't explain the proposed format. Where is the domain in the new format? What does "DAY_HOUR" mean? What's the difference between "DAY_HOUR2", "DAY2_HOUR", and "DAY2_HOUR2"? What is the file naming scheme for the new format?
Being fascinated by file formats myself I also wonder. Why not make it binary?
Kind regards Thiemo
Hello Thiemo.
I have updated the document. Please look at the document or at its change history.
I think that small values are better stored as text. One reason is that the raw data then takes less space. For example, the input "1 15 85" takes 7 bytes as text, whereas a binary in-memory layout would need at least 3 values * 4 bytes per value = 12 bytes. A binary dump would compress well (because of the zero bytes in the raw data), but so would the text data. The text format is also friendlier to humans, for example to a programmer or to someone working in a spreadsheet application.
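The size comparison above can be checked with a minimal Python sketch. The fixed 4-byte unsigned little-endian layout ("<3I") is an assumption chosen to match the "4 bytes per value" figure, not a format taken from the article.

```python
import struct

values = [1, 15, 85]

# Text form: decimal numbers separated by single spaces.
text_bytes = " ".join(str(v) for v in values).encode("ascii")

# Binary form (assumed layout): three unsigned 32-bit little-endian integers.
binary_bytes = struct.pack("<3I", *values)

print(len(text_bytes), "bytes as text")
print(len(binary_bytes), "bytes as fixed-width binary")
```

The result depends on the assumed width: a one-byte or variable-length integer encoding would store these three values in fewer bytes than the text form.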
Dušan Kreheľ
2022-09-03 11:16 GMT+02:00, Thiemo Kreuz thiemo.kreuz@wikimedia.de:
Hello Dušan,
I find this really fascinating. Unfortunately, it looks like the article doesn't explain the proposed format. Where is the domain in the new format? What does "DAY_HOUR" mean? What's the difference between "DAY_HOUR2", "DAY2_HOUR", and "DAY2_HOUR2"? What is the file naming scheme for the new format?
Being fascinated by file formats myself I also wonder. Why not make it binary?
Kind regards Thiemo
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
I'd imagine the current format is optimized for being able to output hourly dumps (and thus reducing data latency and data processing costs), not so much for storage space
Our pageview dumps were in the middle of a refactor when our team changed a lot. We haven't been able to finish it, but we do actually have a well-compressed version that we just haven't properly launched as a new dataset. I'm working on prioritizing that.
On Sun, Sep 4, 2022 at 02:58 Gergő Tisza gtisza@gmail.com wrote:
I'd imagine the current format is optimized for being able to output hourly dumps (and thus reducing data latency and data processing costs), not so much for storage space
Thiemo and all: I also added tests for the binary version.
Dušan.
2022-09-05 2:45 GMT+02:00, Dan Andreescu dandreescu@wikimedia.org:
Our pageview dumps were in the middle of a refactor when our team changed a lot. We haven't been able to finish it, but we do actually have a well-compressed version that we just haven't properly launched as a new dataset. I'm working on prioritizing that.
Hi Dušan,
I added the details on pageviews_complete to the talk page on your proposal https://en.wikipedia.org/w/index.php?title=User_talk:Du%C5%A1an_Krehe%C4%BE/Signpost_draft:New_pageview_dump_export_format_(concept)&oldid=1108690384. Please let me know if it's still confusing.
I have updated the document. I added an export of human pageviews for the year 2021. The statistics are in the article, and a download link has been added.
Dan Andreescu: I had no problem understanding you.
2022-09-05 21:48 GMT+02:00, Dan Andreescu dandreescu@wikimedia.org:
Hi Dušan,
I added the details on pageviews_complete to the talk page on your proposal https://en.wikipedia.org/w/index.php?title=User_talk:Du%C5%A1an_Krehe%C4%BE/Signpost_draft:New_pageview_dump_export_format_(concept)&oldid=1108690384. Please let me know if it's still confusing.
The big update of the article is done. Please take a look.
Gergő Tisza: The current hourly format for fresh data can remain. It can later be converted to another format that is more suitable for other uses.
2022-09-18 22:35 GMT+02:00, Dušan Kreheľ dusankrehel@gmail.com:
I have updated the document. I added an export of human pageviews for the year 2021. The statistics are in the article, and a download link has been added.
Dan Andreescu: I had no problem understanding you.
@Dušan Kreheľ: I think there's a misunderstanding. I read your re-written article. In it, you say that the current format is:
domain_code page_title count_views total_response_size
For an example, you give this:
sk Kreheľ 2 0
But, actually, that format is deprecated and the new format is pageviews complete, which looks like this:
sk.wikipedia Kreheľ null desktop 13 B2D2G2J2O2T1V1X1
The B2D2G2J2O2T1V1X1 is exactly the kind of encoding you're talking about, and no 0-values are present.
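To make the encoding concrete, here is a minimal Python sketch that decodes the hourly field of the example line. It assumes that the letters A to X stand for hours 0 to 23 and that the number after each letter is the view count for that hour; the helper name decode_hourly is hypothetical and not part of any Wikimedia tooling. Under that assumption the hourly counts of the sample add up to the daily total of 13.

```python
import re

def decode_hourly(encoded: str) -> dict[int, int]:
    """Decode e.g. 'B2D2' into {hour: views}, assuming A = hour 0 ... X = hour 23."""
    counts = {}
    for letter, number in re.findall(r"([A-X])(\d+)", encoded):
        counts[ord(letter) - ord("A")] = int(number)
    return counts

line = "sk.wikipedia Kreheľ null desktop 13 B2D2G2J2O2T1V1X1"
# Field names are descriptive guesses for this sketch.
wiki, title, page_id, device, daily_total, hourly = line.split(" ")

hours = decode_hourly(hourly)
print(hours)
print(sum(hours.values()) == int(daily_total))
```

Hours without views simply do not appear in the decoded mapping, which is why no 0-values need to be stored.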
You made the point that we are missing a yearly rollup in this new format. This would be quite a large file, but if there's a good use case for such a dump, a request in phabricator is a good way to proceed.
On Sat, Oct 1, 2022 at 9:58 AM Dušan Kreheľ dusankrehel@gmail.com wrote:
The big update of the article is done. Please take a look.
Gergő Tisza: The current hourly format for fresh data can remain. It can later be converted to another format that is more suitable for other uses.
A link to the source code has been added.
@Dan Andreescu: The format is correct now. The annual summary is a typical basic statistical interval, and we save time by merging. The file size problem disappears if the file is split by wiki. And the skwiki has only 49MB for the year 2021, which does not require the level of the end user who processes them for their purpose.
2022-10-06 19:31 GMT+02:00, Dan Andreescu dandreescu@wikimedia.org:
@Dušan Kreheľ: I think there's a misunderstanding. I read your re-written article. In it, you say that the current format is:
domain_code page_title count_views total_response_size
For an example, you give this:
sk Kreheľ 2 0
But, actually, that format is deprecated and the new format is pageviews complete, which looks like this:
sk.wikipedia Kreheľ null desktop 13 B2D2G2J2O2T1V1X1
The B2D2G2J2O2T1V1X1 is exactly the kind of encoding you're talking about, and no 0-values are present.
You made the point that we are missing a yearly rollup in this new format. This would be quite a large file, but if there's a good use case for such a dump, a request in phabricator is a good way to proceed.
[Fix]:
A link to the source code has been added.
@Dan Andreescu: The format is correct. A year is a typical basic statistical interval, and merging into an annual summary saves time. The file size problem disappears if the file is split by local wiki. And skwiki is only 49 MB for the year 2021, which does not place high demands on the end user who processes the data for their own purposes.
2022-11-08 21:30 GMT+01:00, Dušan Kreheľ dusankrehel@gmail.com:
A link to the source code has been added.
@Dan Andreescu: The format is correct now. The annual summary is a typical basic statistical interval, and we save time by merging. The file size problem disappears if the file is split by wiki. And the skwiki has only 49MB for the year 2021, which does not require the level of the end user who processes them for their purpose.
As far as I am concerned, the article is now done.
D. K.
2022-11-08 21:32 GMT+01:00, Dušan Kreheľ dusankrehel@gmail.com:
[Fix]:
A link to the source code has been added.
@Dan Andreescu: The format is correct. A year is a typical basic statistical interval, and merging into an annual summary saves time. The file size problem disappears if the file is split by local wiki. And skwiki is only 49 MB for the year 2021, which does not place high demands on the end user who processes the data for their own purposes.