Dear all,
Finally, I was able to execute the 2 million queries in 24 minutes. The solution was to load Virtuoso fully in memory: I created a ramdisk filesystem and copied the full Virtuoso installation there. The copy was done in a few minutes. I know that on the ramdisk Wikidata is stored in volatile memory, but I already synchronise this folder with a folder on the SSD disk.
I think this solution could also be used to load Wikidata even faster, by running Virtuoso from a ramdisk-based directory during the bulk load. When the loading is done, the folder can be moved back from the ramdisk directory to a hard-disk directory for data persistence.
Thanks for all your suggestions and ideas. They saved me time because I was able to narrow the set of possible solutions between software and hardware.
Best,
Adam
Link: https://www.linuxbabe.com/command-line/create-ramdisk-linux
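For reference, the ramdisk setup described above can be sketched roughly as follows. All paths, the tmpfs size, and the sync destination are assumptions to be adapted to the actual installation; mounting a tmpfs requires root.

```shell
# Sketch only: paths and sizes are assumptions, adjust to your setup.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=400G tmpfs /mnt/ramdisk

# Copy the Virtuoso installation (including database files) into RAM.
sudo cp -a /opt/virtuoso /mnt/ramdisk/virtuoso

# Periodically sync the database back to the SSD for persistence.
sudo rsync -a /mnt/ramdisk/virtuoso/database/ /ssd/virtuoso/database/
```

Anything written to the tmpfs is lost on reboot, hence the rsync back to durable storage.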
On Thu, 23 Jul 2020 at 09:01, Aidan Hogan aidhog@gmail.com wrote:
Hi Adam,
On 2020-07-13 13:41, Adam Sanchez wrote:
Hi,
I have to launch 2 million queries against a Wikidata instance. I have loaded Wikidata into Virtuoso 7 (512 GB RAM, 32 cores, SSD disks in RAID 0). The queries are simple, just 2 types.
select ?s ?p ?o { ?s ?p ?o. filter (?s = ?param) }
select ?s ?p ?o { ?s ?p ?o. filter (?o = ?param) }
If I use a Java ThreadPoolExecutor, it takes 6 hours. How can I speed up the query processing even more?
Perhaps I am a bit late to respond.
It's not really clear to me what you are aiming for, but if this is a once-off task, I would recommend downloading the dump in Turtle or N-Triples, loading your two million parameters into a sorted or hashed in-memory data structure in the programming language of your choice (this should take considerably less than 1 GB of memory assuming typical constants), then using a streaming RDF parser for that language and, for each subject/object, checking whether it's in your in-memory set. This solution is about as good as you can get in terms of once-off batch processing.
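A rough sketch of this batch approach, assuming an N-Triples dump. The whitespace-based split below is an assumption that works for IRI subjects/objects (e.g. Q-ids); literals containing spaces would need a real N-Triples parser.

```python
# Stream an N-Triples dump and keep only triples whose subject or
# object is in a pre-loaded set of target terms.

def load_params(path):
    """Load the two million parameter terms into a hash set."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def scan(dump_path, targets):
    """Single streaming pass over the dump; O(1) membership checks."""
    matches = []
    with open(dump_path) as f:
        for line in f:
            parts = line.split(None, 2)  # subject, predicate, rest
            if len(parts) < 3:
                continue  # skip blank/comment lines
            s, p, rest = parts
            # Strip the trailing " ." terminator to recover the object.
            o = rest.rstrip().rstrip(".").rstrip()
            if s in targets or o in targets:
                matches.append((s, p, o))
    return matches
```

One pass over the dump replaces two million individual lookups, so the cost is dominated by sequential I/O rather than random seeks.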
If your idea is to index the data so you can do 2 million lookups in "interactive time", your problem is not what software to use, it's what hardware to use.
Traditional hard disks have a physical arm that takes maybe 5-10 ms to move. Solid-state disks are quite a bit better, but still have seek times in the range of 0.1 ms. Multiply those seek times by 2 million and you have a long wait (caching will help, as will multiple disks, but not by nearly enough).
You would need to get the data into main memory (RAM) to have any chance of approximating interactive times, and even then you will probably not get interactive runtimes without leveraging some further assumptions about what you want to do (e.g., if you're only interested in Q ids, you can use integers or bit vectors, etc.). In the most general case, you would probably need to pre-filter the data as much as you can, and also use as much compression as you can (ideally with compact data structures) to fit the data into memory on one machine; or you might think about something like Redis (an in-memory key-value store) on lots of machines.
Essentially, if your goal is interactive times on millions of lookups, you very likely need to look at options purely in RAM (unless you have thousands of disks available, at least). The good news is that 512 GB(?) sounds like a lot of space to store stuff in.
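To put the seek-time argument in numbers, using the ballpark per-seek latencies above:

```python
# Back-of-envelope: total seek time for 2 million random lookups,
# assuming every lookup pays one disk seek (no caching).
lookups = 2_000_000
hdd_seek_s = 0.005    # ~5 ms per seek on a spinning disk
ssd_seek_s = 0.0001   # ~0.1 ms per seek on an SSD

hdd_total = lookups * hdd_seek_s   # about 2.8 hours of pure seek time
ssd_total = lookups * ssd_seek_s   # about 3.3 minutes of pure seek time
print(hdd_total / 3600, "hours on HDD")
print(ssd_total / 60, "minutes on SSD")
```

Even the SSD figure is far from interactive, which is why the data has to live in RAM for this workload.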
Best, Aidan
I was thinking :
a) to implement a Virtuoso cluster to distribute the queries, or
b) to load Wikidata into a Spark dataframe (since the Sansa framework is very slow, I would use my own implementation), or
c) to load Wikidata into a PostgreSQL table and use Presto to distribute the queries, or
d) to load Wikidata into a PG-Strom table to use GPU parallelism.
What do you think? I am looking for ideas. Any suggestion will be appreciated.
Best,
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata