I'm not Adrian but we work together on this project, and that's indeed
what we're doing, and the guess was correct as well.
Thanks so far!
On Mon 15.05 08:45, Addshore wrote:
I believe in this case the data is being crunched in Hadoop, which is where
the WDQS access logs are.
And I think the page in question that Adrian wanted to load was
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples;
at a guess, he is looking at how often these example queries are requested
via the service.
On Mon, 15 May 2017 at 00:22 Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> >(i.e. implying that we need to collect the data somewhere else, and move
> to production for number crunching only)?
> I think we should probably set up a sync-up so you get an overview of how
> this works, because this is a brief response. Data is harvested on some
> production machines, it is processed (on different production machines) and
> moved to the stats machines (also production, but a sheltered environment). We
> do not use the stats machines to harvest data. They just provide access to it
> and are sized so you can process and crunch data. This talk explains a bit
> how this all works:
> https://www.youtube.com/watch?v=tx1pagZOsiM
>
> We might be talking past each other here; if so, a meeting might help.
>
>
> >Nuria, what exactly do you have in mind when you say "a development
> instance of Wikidata"?
> If you need to look at a wikidata query and see what it shows in the logs
> when you query x or y, that step should be done on a (wikidata) *test
> environment* that logs the HTTP requests for your queries as received by
> the server. That way you can "test" your queries against a server and see how
> those are received.
>
>
> Thanks,
>
> Nuria
>
>
>
>
>
> On Sun, May 14, 2017 at 1:10 PM, Adrian Bielefeldt
> <Adrian.Bielefeldt(a)mailbox.tu-dresden.de> wrote:
>
>> Hi Addshore,
>> thanks for the advice, I can now connect.
>>
>> Greetings,
>>
>> Adrian
>>
>>
>> On 05/13/2017 05:47 PM, Addshore wrote:
>>
>> You should be able to connect to query.wikidata.org via the webproxy.
>>
>>
>> https://wikitech.wikimedia.org/wiki/HTTP_proxy
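For a Java program like the one described in this thread, one common way to route requests through such a webproxy is via the standard JVM proxy system properties. This is only a sketch: the host and port below are placeholders, not confirmed values; the real ones should come from the wikitech page above.

```java
// Sketch: pointing a Java program's HTTP(S) traffic at a webproxy.
// The host and port passed to configureProxy are placeholders; take the
// actual values from https://wikitech.wikimedia.org/wiki/HTTP_proxy
class ProxySetup {
    static void configureProxy(String host, String port) {
        // Standard JVM proxy properties, honored by HttpURLConnection
        // and most HTTP client libraries that respect system settings.
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", port);
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", port);
    }
}
```

Equivalently, the same properties can be set on the command line with `-Dhttps.proxyHost=... -Dhttps.proxyPort=...`, which avoids touching the code at all.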
>>
>> On Sat, 13 May 2017 at 15:23 Adrian Bielefeldt
>> <Adrian.Bielefeldt(a)mailbox.tu-dresden.de> wrote:
>>
>>> Hello Nuria,
>>>
>>> I'm working on a project
>>> <https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries>
>>> analyzing the Wikidata SPARQL queries. We extract specific fields (e.g.
>>> uri_query, hour) from wmf.wdqs_extract, parse the queries with a Java
>>> program using open_rdf as the parser, and then analyze them for different
>>> metrics like variable count, which entities are being used, and so on.
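As a toy illustration of one of those metrics: the project's real pipeline parses each query with open_rdf, but a much cruder variable count can be sketched by scanning for `?var`/`$var` tokens with a regex. This hypothetical helper ignores string literals, comments, and other SPARQL subtleties, so it is not equivalent to a real parse:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough illustration of the "variable count" metric. The actual project
// parses queries with open_rdf; this regex shortcut merely collects the
// distinct names following '?' or '$' and will miscount queries that
// contain such tokens inside literals or comments.
class VariableCount {
    private static final Pattern VAR = Pattern.compile("[?$](\\w+)");

    static int countDistinctVariables(String sparql) {
        Set<String> names = new HashSet<>();
        Matcher m = VAR.matcher(sparql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names.size();
    }
}
```

For example, `countDistinctVariables("SELECT ?item ?label WHERE { ?item rdfs:label ?label }")` returns 2, since `?item` and `?label` each appear twice but count once.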
>>>
>>> At the moment I'm working on checking which entries equal one of the
>>> example queries at
>>> https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
>>> using this
>>> <https://github.com/Wikidata/QueryAnalysis/blob/master/src/main/java/general/Main.java#L339-L376>
>>> code. Unfortunately the program cannot connect to the website, so I'm
>>> assuming I have to create an exception for this request or ask for one to
>>> be created.
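The matching step described here lives in the linked Main.java. Purely as a hypothetical sketch of the idea (the real code may differ), one could normalize whitespace and case before comparing a logged query against each example, so trivially reformatted copies still match:

```java
import java.util.Locale;

// Hypothetical sketch of the comparison step: decide whether a logged
// query equals an example query, ignoring whitespace and case
// differences. Lowercasing also folds variable names, which is a
// deliberate simplification of this sketch.
class ExampleMatcher {
    // Collapse runs of whitespace and lowercase the result.
    static String normalize(String query) {
        return query.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    static boolean matchesExample(String logged, String example) {
        return normalize(logged).equals(normalize(example));
    }
}
```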
>>>
>>> Greetings,
>>>
>>> Adrian
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics(a)lists.wikimedia.org
>>>
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>>
>>
>>
>>
>>
>