Hi,
I just joined this list, thanks to Dan Andreescu, who let me know about it, and I have a question on processing clickstream data.
I downloaded last month's clickstream data file (https://dumps.wikimedia.org/other/clickstream/2022-12/clickstream-eswiki-202...) and am having trouble opening and processing it.
The only program I could open it with was OpenRefine. Other programs (Numbers and LibreOffice) just couldn't cope with it.
I can use OpenRefine to do some transformations and delete the rows I don't need, but even then, with some 1.5 million rows, I cannot open it in Numbers or LibreOffice to sum column 4.
Which tools do you use to work with such big files?
Thanks.
Hi Robert, welcome to the list. Opening big files is definitely tricky. I believe (not sure) folks on this list usually write scripts to filter and aggregate these files, or load them into big data tools. But I found someone else asking a similar question with a very useful answer here: https://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-hug...
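As a rough illustration of the "write a script" approach, here is a minimal Python sketch that streams a file row by row and sums one column. It assumes the file is tab-separated (optionally gzip-compressed) and that the fourth column holds integer counts. The function name `sum_column` and its parameters are made up for this example and are not part of any existing tool.

```python
import csv
import gzip


def sum_column(path, column_index=3, delimiter="\t"):
    """Sum one integer column of a delimited text file without loading it all.

    column_index is zero-based, so 3 means "column 4".
    Rows that are too short or hold a non-integer value are skipped.
    """
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    total = 0
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row in reader:
            try:
                total += int(row[column_index])
            except (IndexError, ValueError):
                continue  # header line, short row, or non-numeric cell
    return total
```

Because rows are read one at a time, memory use stays roughly constant regardless of file size, which is why this tends to work where spreadsheet programs give up.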
On Thu, Jan 26, 2023 at 11:04 AM Robert Garrigos robert@garrigos.cat wrote:
======================== Robert Garrigós i Castro https://garrigos.cat +34 620 91 87 01 _______________________________________________ Analytics mailing list -- analytics@lists.wikimedia.org To unsubscribe send an email to analytics-leave@lists.wikimedia.org
Thanks, Dan. Klogg is definitely very fast with very large text files. It opened my largest files (1.5 GB) in just 1 second. Wow.
On 26/1/23 at 18:32, Dan Andreescu wrote:
Hi Robert, welcome to the list. Opening big files is definitely tricky. I believe (not sure) folks on this list usually write scripts to filter and aggregate these files, or load them into big data tools. But I found someone else asking a similar question with a very useful answer here: https://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files
On Thu, Jan 26, 2023 at 11:04 AM Robert Garrigos <robert@garrigos.cat> wrote:
Hi,
I just joined this list, thanks to Dan Andreescu, who let me know about it, and I have a question on processing clickstream data.
I downloaded last month's clickstream data file (https://dumps.wikimedia.org/other/clickstream/2022-12/clickstream-eswiki-2022-12.tsv.gz) and am having trouble opening and processing it.
The only program I could open it with was OpenRefine. Other programs (Numbers and LibreOffice) just couldn't cope with it.
I can use OpenRefine to do some transformations and delete the rows I don't need, but even then, with some 1.5 million rows, I cannot open it in Numbers or LibreOffice to sum column 4.
Which tools do you use to work with such big files?
Thanks.
On 26/01/23 18:04, Robert Garrigos wrote:
with some 1.5 million rows, I cannot open it in Numbers or LibreOffice to sum column 4.
Which tools do you use to work with such big files?
To sum a column in a CSV I would use VisiData: https://www.visidata.org/docs/join/
I think I've used it with CSVs on the order of 10^7 rows, possibly 10^8. (Usually to make pivot tables.)
Best, Federico
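For readers who prefer a script to an interactive tool, a pivot-style total can also be sketched in plain Python. This is only an illustration, not how VisiData works internally. It assumes a gzip-compressed, tab-separated file in which the third column is a category and the fourth an integer count; the name `totals_by_key` and its parameters are hypothetical.

```python
import csv
import gzip
from collections import Counter


def totals_by_key(path, key_index=2, value_index=3):
    """Return a Counter mapping each key-column value to the sum of its counts.

    Indexes are zero-based: the defaults group by column 3 and sum column 4.
    Short rows and rows with a non-integer count are skipped.
    """
    totals = Counter()
    needed = max(key_index, value_index)
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= needed:
                continue  # row lacks one of the columns we need
            try:
                totals[row[key_index]] += int(row[value_index])
            except ValueError:
                continue  # count cell is not an integer
    return totals
```

Calling `totals_by_key(path).most_common(10)` would list the ten largest groups. Only one running total per distinct key is kept in memory, so the approach scales with the number of groups rather than the number of rows.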
Thanks, Federico, this is very interesting. I'll take a look.
VisiData is *amazing* for any Vim users. I want to take this opportunity to ask whether other folks use it...
On Thu, Jan 26, 2023 at 2:15 PM Robert Garrigos robert@garrigos.cat wrote:
Thanks, Federico, this is very interesting. I'll take a look.
On 26/1/23 at 19:50, Federico Leva (Nemo) wrote:
On 26/01/23 18:04, Robert Garrigos wrote:
with some 1.5 million rows, I cannot open it in Numbers or LibreOffice to sum column 4.
Which tools do you use to work with such big files?
To sum a column in a CSV I would use VisiData: https://www.visidata.org/docs/join/
I think I've used it with CSVs on the order of 10^7 rows, possibly 10^8. (Usually to make pivot tables.)
Best, Federico