Hi!
I wonder if anybody has run or is running Wikibase without CirrusSearch
installed, and whether fulltext search is supposed to work in that
configuration. The suggester/prefix search, aka wbsearchentities, works
OK, but I can't make fulltext search, aka Special:Search, find anything on my VM
(which very well may be a consequence of me messing up, or some bug, or
both :)
So, I wonder - is it *supposed* to be working? Is anybody using it this
way, and does anybody care about such a use case?
Thanks,
--
Stas Malyshev
smalyshev(a)wikimedia.org
Dear Users,
I have the great honour to inform you that the Call for Proposals for WikiIndaba 2018 is now open. WikiIndaba 2018 is the 3rd conference of the African Wikimedia movement and will give participants the opportunity to share their Wikimedia-related experience and skills with a wide and active African Wikimedia audience. The conference will be held in Tunisia from 16 to 18 March 2018. If you want to participate in WikiIndaba and share your work and thoughts with African Wikimedians, feel free to submit your proposal at https://meta.m.wikimedia.org/wiki/WikiIndaba_conference_2018/Submissions. The deadline for submitting proposals is January 15th, 2018.
If you need a scholarship to attend WikiIndaba 2018, you can apply for it at https://docs.google.com/forms/d/e/1FAIpQLSdJJ2I0FBqp4SuiW5ypj-9lnLaAidUmhMs….
Looking forward to seeing you in Tunis next March.
Yours Sincerely,
Houcemeddine Turki
Felix Nartey
Isla Haddow-Flood
Dear all,
I just released a new version (0.8) of the Wikidata Toolkit [1]. Wikidata
Toolkit is a Java library that makes it easy to reuse Wikidata content from
dumps or the API, and that also provides helpers for editing.
This release mainly adds several fixes needed to keep Wikidata Toolkit
working with the changes done on Wikidata, and adds support for JDK 9.
It also provides two features related to the Wikibase API: it is now
possible to edit labels, descriptions and aliases using the
WikibaseDataEditor (this is a work in progress that is likely to change),
and there is now a wrapper for the wbsearchentities API action.
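For anyone curious, here is a rough sketch of what calling the new search
wrapper could look like; the method and class names below (searchEntities,
WbSearchEntitiesResult, getEntityId, getLabel) are assumptions based on the
library's usual naming and may differ in 0.8:

import org.wikidata.wdtk.wikibaseapi.WbSearchEntitiesResult;
import org.wikidata.wdtk.wikibaseapi.WikibaseDataFetcher;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        // Fetcher pointed at wikidata.org
        WikibaseDataFetcher fetcher = WikibaseDataFetcher.getWikidataDataFetcher();
        // Search for entities whose label or alias matches the text, in English.
        // (searchEntities is the assumed name of the wbsearchentities wrapper.)
        for (WbSearchEntitiesResult r : fetcher.searchEntities("Douglas Adams", "en")) {
            System.out.println(r.getEntityId() + "  " + r.getLabel());
        }
    }
}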
We have created a short survey to help us make technical choices for
future versions of the Wikidata Toolkit, especially related to Java 7
support and the RDF converter. Please fill it in if you are using Wikidata
Toolkit (it should take less than a minute):
https://docs.google.com/forms/d/e/1FAIpQLSdN25X2sTv2wQe-y56d0hC4QmU06s6crr1…
Best,
Thomas (Tpt)
[1] https://www.mediawiki.org/wiki/Wikidata_Toolkit
Dear Sir or Madam,
I thank you for your efforts. When I was at the AICCSA 2017 conference last month, I discussed several ideas with Arab computational linguists. I found that you may not be aware of these rules:
* The label of a proper entity (Person, Place, Trademark...) in Modern Standard Arabic is the same as the one of such an entity in the following Arabic dialects: South Levantine Arabic (ajp), Gulf Arabic (afb), Hejazi Arabic (acw), Najdi Arabic (ars), Hadhrami Arabic (ayh), Sanaani Arabic (ayn), Ta'izzi-Adeni Arabic (acq), and Mesopotamian Arabic (acm).
* Labels of places and people from Palestine, Jordan, Syria, Iraq, Kuwait, Yemen, Oman, Bahrain, Qatar, UAE, Saudi Arabia, Sudan, Djibouti, Comoros, Somalia, and Mauritania are the same in all Arabic dialects as in Modern Standard Arabic.
I ask if you can make a bot that automatically applies these two rules.
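As a starting point, here is a minimal dry-run sketch (using Wikidata Toolkit in Java) of how rule 1 could be applied; the item ID is only an example, and a real bot would still have to issue the actual label edits, e.g. via WikibaseDataEditor or the wbsetlabel API action:

import java.util.Arrays;
import java.util.List;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.wikibaseapi.WikibaseDataFetcher;

public class ArabicLabelBotSketch {
    // Dialect codes from rule 1.
    static final List<String> DIALECTS =
            Arrays.asList("ajp", "afb", "acw", "ars", "ayh", "ayn", "acq", "acm");

    public static void main(String[] args) throws Exception {
        WikibaseDataFetcher fetcher = WikibaseDataFetcher.getWikidataDataFetcher();
        // Example item only; a real bot would iterate over candidate items.
        ItemDocument item = (ItemDocument) fetcher.getEntityDocument("Q7251");
        String arLabel = item.findLabel("ar");
        if (arLabel == null) {
            return; // no Modern Standard Arabic label to copy
        }
        for (String code : DIALECTS) {
            if (item.findLabel(code) == null) {
                // Dry run: only report the edit a real bot would make.
                System.out.println(item.getEntityId().getId() + ": " + code + " <- " + arLabel);
            }
        }
    }
}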
Yours Sincerely,
Houcemeddine Turki
Dear Sir or Madam,
I thank you for your interest. The proceedings paper, the presentation slides, and an overview of the discussions I have had about using Wikidata for the natural language processing of Arabic dialects are now available on ResearchGate. Please see https://www.researchgate.net/publication/321039195_Using_WikiData_as_a_mult… for the proceedings paper and https://www.researchgate.net/publication/321039289_AICCSA_2017_-_Wikidata_P… for the presentation slides and the overview of the discussions.
Yours Sincerely,
Houcemeddine Turki
Did you try to point the WDQS copy to your TDB/Fuseki endpoint?
On Thu, 7 Dec 2017 at 18:58, Andy Seaborne <andy(a)apache.org> wrote:
> Dell XPS 13 (model 9350) - the 2015 model.
> Ubuntu 17.10, not a VM.
> 1T SSD.
> 16G RAM.
> Two volumes = root and user.
> Swappiness = 10
>
> java version "1.8.0_151" (OpenJDK)
>
> Data: latest-truthy.nt.gz (version of 2017-11-24)
>
> == TDB1, tdbloader2
> 8 hours // 76,164 TPS
>
> Using SORT_ARGS: --temporary-directory=/home/afs/Datasets/tmp
> to make sure the temporary files are on the large volume.
>
> The run took 28877 seconds and resulted in a 173G database.
>
> All the index files are the same size.
>
> node2id : 12G
> OSP : 53G
> SPO : 53G
> POS : 53G
>
> Algorithm:
>
> Data phase:
>
> parse the file, create the node table and a temporary file of all triples
> (3x 64-bit numbers, written as text).
>
> Index phase:
>
> for each index, sort the temp file (using sort(1), an external sort
> utility), and make the index file by writing the sorted results, filling
> the data blocks and creating any tree blocks needed. This is a
> stream-write process - calculate the data block, write it out when full
> and never touch it again.
>
> This results in data blocks being completely full, unlike the standard
> B+Tree insertion algorithm. It is why indexes are exactly the same size.
>
> Building SPO is faster because the data is nearly sorted to start with.
> Data often tends to arrive grouped by subject.
>
> tdbloader2 is doing stream (append) I/O on index files, not a random
> access pattern.
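>
> In (very simplified, in-memory) Java terms the index phase is roughly the
> following - an illustrative sketch only, not the actual TDB code, which
> sorts temporary files with sort(1) rather than sorting in memory:
>
> import java.util.*;
>
> // Sketch: build one index (POS) from triples of node-table IDs.
> public class IndexBuildSketch {
>     static final int BLOCK_SIZE = 4; // entries per data block (tiny, for illustration)
>
>     public static void main(String[] args) {
>         List<long[]> triples = Arrays.asList(
>                 new long[]{1, 10, 100}, new long[]{1, 11, 101},
>                 new long[]{2, 10, 100}, new long[]{2, 12, 103});
>
>         // Permute S,P,O -> P,O,S and sort: the "sort" step of the index phase.
>         List<long[]> pos = new ArrayList<>();
>         for (long[] t : triples) pos.add(new long[]{t[1], t[2], t[0]});
>         pos.sort(Comparator.<long[]>comparingLong(k -> k[0])
>                 .thenComparingLong(k -> k[1]).thenComparingLong(k -> k[2]));
>
>         // Stream-write: fill each data block completely, emit it, never touch it again.
>         List<long[]> block = new ArrayList<>();
>         for (long[] key : pos) {
>             block.add(key);
>             if (block.size() == BLOCK_SIZE) { writeBlock(block); block = new ArrayList<>(); }
>         }
>         if (!block.isEmpty()) writeBlock(block);
>     }
>
>     static void writeBlock(List<long[]> block) {
>         // The real loader appends the block to the index file and builds tree blocks above it.
>         System.out.println("wrote block with " + block.size() + " entries");
>     }
> }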
>
> == TDB1 tdbloader1
> 29 hours 43 minutes // 20,560 TPS
>
> 106,975 seconds
> 297G DB-truthy
>
> node2id: 12G
> OSP: 97G
> SPO: 96G
> POS: 98G
>
> Same size node2id table, larger indexes.
>
> Algorithm:
>
> Data phase:
>
> parse the file and create the node table and the SPO index.
> The creation of SPO is by B+Tree insert, so blocks are partially full
> (average is empirically about 2/3 full). When a block fills up, it is
> split into 2. The node table is exactly the same as tdbloader2 because
> nodes are stored in the same order.
>
> Index phase:
>
> for each index, copy SPO to the index. This is a tree sort and the
> access pattern on blocks is fairly random, which is a bad thing. Doing
> one index at a time is faster than two together because more RAM in the
> OS-managed file system cache is devoted to caching one index. A cache
> miss is a possible write to disk, and always a read from disk, which is
> a lot of work even with an SSD.
>
> Stream reading SPO is efficient - it is not random I/O, it is stream I/O.
>
> Once the cache-efficiency of the OS disk cache drops, tdbloader slows
> down markedly.
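>
> Again in (very simplified, in-memory) Java terms - an illustrative sketch
> only, not the real code, which inserts into B+Tree blocks on disk:
>
> import java.util.*;
>
> // Sketch: copy an already S-sorted SPO index into POS by one-at-a-time inserts
> // into a sorted structure (a TreeSet here, a B+Tree on disk in TDB). Reading SPO
> // is sequential, but each insert can land anywhere in the target tree, which is
> // the random block access pattern described above.
> public class TreeSortSketch {
>     public static void main(String[] args) {
>         List<long[]> spo = Arrays.asList(
>                 new long[]{1, 10, 100}, new long[]{1, 11, 101},
>                 new long[]{2, 10, 100}, new long[]{2, 12, 103});
>
>         Comparator<long[]> order = Comparator.<long[]>comparingLong(k -> k[0])
>                 .thenComparingLong(k -> k[1]).thenComparingLong(k -> k[2]);
>         TreeSet<long[]> posIndex = new TreeSet<>(order);
>
>         for (long[] t : spo) {
>             posIndex.add(new long[]{t[1], t[2], t[0]}); // re-key S,P,O as P,O,S
>         }
>         System.out.println("POS entries: " + posIndex.size());
>     }
> }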
>
> == Comparison of TDB1 loaders.
>
> Building an index is a sort because the B+Trees hold data sorted.
>
> The approach of tdbloader2 is to use an external sort algorithm (i.e.
> sort larger than RAM using temporary files) done by a highly tuned
> utility, unix sort(1).
>
> The approach of tdbloader1 is to copy into a sorted data structure. For
> example, when copying index SPO to POS, it is creating a file with keys
> sorted by P then O then S, which is not the arrival order (which is
> S-sorted). tdbloader1 maximises OS caching of memory-mapped files by
> doing indexes one at a time. Experimentation shows that doing two at
> once is slower, and doing two in parallel is no better, and sometimes
> worse, than doing them sequentially.
>
> == TDB2
>
> TDB2 is experimental. The current TDB2 loader is a functional placeholder.
>
> It is writing all three indexes at the same time. While for SPO this is
> not a bad access pattern (subjects are naturally grouped), for POS and
> OSP the I/O pattern is random, not streaming. There is more than double
> the contention for the OS disk cache, hence it is slow and slows down at
> an increasing rate as the data grows.
>
> == More details.
>
> For more information, consult the Jena dev@ and user@ archives and the
> code.
>
--
---
Marco Neumann
KONA
Hello all,
Some information that may interest people who contribute to Wikidata and
Wikibase, or follow the development.
We have made some changes to our deployment setup; the most relevant things
for you are:
- we are now able to deploy new code for Wikidata every week, in the
regular MediaWiki train, instead of every two weeks. test.wikidata.org
will be updated on Tuesdays and wikidata.org on Wednesdays (details
<https://wikitech.wikimedia.org/wiki/Deployments>)
- beta.wikidata.org is now updated every 10min with new code (details
<https://integration.wikimedia.org/ci/view/Beta/>)
This basically means that new features and bug fixes will reach you faster :)
Thanks to Addshore who made this happen!
If you have questions or want further details, feel free to ask.
--
Léa Lacroix
Project Manager Community Communication for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognised as a non-profit
organisation by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
Hi!
We are seeing more use of the Wikidata Query Service by Wikimedia
projects. This is excellent news, but the somewhat worse news is that the
maintainers of WDQS do not have a good idea of what these services are,
what their needs are, and so on. So, we have decided to start
tracking internal uses of the Wikidata Query Service.
To that end, if you run any functionality on Wikimedia sites
(Wikipedias, Wikidata, etc. - anything on a wikimedia domain) that sends
queries to the Wikidata Query Service, please go to:
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
and add your project there. This applies both if your project runs queries
by itself in the background and if it uses queries as part of a user
interaction scenario.
We do not currently include Labs tools unless they are absolutely vital
infrastructure (i.e. if the tool went down, would it substantially degrade
main site functionality or make some features unusable?). If you still
feel we should know about a particular Labs tool, please leave a note on the
talk page.
What's in it for you?
We want to know this in order to better understand the scope of
internal usage, and as preparation for T178492 (creating an internal WDQS
setup), with the goal of providing internal users with a more robust and
more flexible service. We also want to make sure we do not break anything
important when we do maintenance, and to know who to talk to if some
queries do not work as expected and we want to fix them.
What do we want to know?
- We'd like a general description of the functionality (i.e. what
the service is for).
- How can we recognize queries run by it - user agent? Source host? A
specific query pattern? Some other mark? It is recommended to make the
queries recognizable in some way; see the sketch after this list.
- What kind of queries does it run? (No need to list every possible one, of
course, but if there are typical cases it would help to see them.)
- How often do the queries run - are they periodic, and if the tool is
user-driven, what is its expected or typical usage?
- Where can we see the code behind it, and who maintains it?
- Feel free to add any other information about anything you think would
be useful for us to know.
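For example, a distinctive User-Agent header is an easy way to make your
queries recognizable. A minimal Java sketch (the tool name and contact URL
in the header are just placeholders - use your own):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WdqsQuerySketch {
    public static void main(String[] args) throws Exception {
        String sparql = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 5";
        String url = "https://query.wikidata.org/sparql?format=json&query="
                + URLEncoder.encode(sparql, "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // A descriptive User-Agent lets the WDQS maintainers see who is querying.
        conn.setRequestProperty("User-Agent",
                "ExampleWikiGadget/1.0 (https://example.org/contact)");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}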
What was that page again?
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
Thanks in advance,
--
Stas Malyshev
smalyshev(a)wikimedia.org