Did you try to point the WDQS copy to your TDB/Fuseki endpoint?

On Thu, 7 Dec 2017 at 18:58, Andy Seaborne <andy@apache.org> wrote:
Dell XPS 13 (model 9350) - the 2015 model.
Ubuntu 17.10, not a VM.
1T SSD.
16G RAM.
Two volumes = root and user.
Swappiness = 10

java version "1.8.0_151" (OpenJDK)

Data: latest-truthy.nt.gz (version of 2017-11-24)

== TDB1, tdbloader2
   8 hours // 76,164 TPS

Using SORT_ARGS: --temporary-directory=/home/afs/Datasets/tmp
to make sure the temporary files are on the large volume.

The run took 28877 seconds and resulted in a 173G database.

All the index files are the same size.

node2id : 12G
OSP     : 53G
SPO     : 53G
POS     : 53G

Algorithm:

Data phase:

parse the file, create the node table and a temporary file of all
triples (3x 64-bit numbers, written as text).
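
For illustration, a rough sketch of the data phase (this is not the
Jena code: the real node table is a disk-backed index, not an
in-memory map, the parser is elided, and the class and method names
are made up):

    import java.io.*;
    import java.util.*;

    // Assign each RDF term a 64-bit id the first time it is seen, and
    // write every triple to a temporary text file as three ids.
    public class DataPhaseSketch {
        // Stand-in for the node2id table; the real one lives on disk.
        private final Map<String, Long> nodeToId = new HashMap<>();
        private long nextId = 1;

        private long idFor(String term) {
            return nodeToId.computeIfAbsent(term, t -> nextId++);
        }

        // 'triples' is any source of parsed terms: {subject, predicate, object}.
        public void run(Iterable<String[]> triples, File tmp) throws IOException {
            try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(tmp)))) {
                for (String[] t : triples)
                    out.printf("%d %d %d%n", idFor(t[0]), idFor(t[1]), idFor(t[2]));
            }
        }
    }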

Index phase:

for each index, sort the temp file (using sort(1), an external sort
utility), and make the index file by writing the sorted results, filling
the data blocks and creating any tree blocks needed. This is a
stream-write process - calculate the data block, write it out when full
and never touch it again.
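
A sketch of that stream-write step (the block size and layout here are
assumptions, not TDB's on-disk format):

    import java.io.*;
    import java.nio.ByteBuffer;

    // Pack already-sorted tuples into fixed-size data blocks; each block
    // is written once, when full, and never touched again.
    public class StreamIndexWriterSketch implements Closeable {
        private static final int BLOCK_SIZE = 8 * 1024;          // bytes per block (assumed)
        private static final int TUPLE_BYTES = 3 * Long.BYTES;   // three 64-bit ids

        private final DataOutputStream out;
        private final ByteBuffer block = ByteBuffer.allocate(BLOCK_SIZE);

        public StreamIndexWriterSketch(File indexFile) throws IOException {
            out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexFile)));
        }

        // Tuples must arrive already sorted, e.g. from sort(1).
        public void write(long a, long b, long c) throws IOException {
            if (block.remaining() < TUPLE_BYTES)
                flushBlock();            // block is completely full: write and forget
            block.putLong(a).putLong(b).putLong(c);
        }

        private void flushBlock() throws IOException {
            out.write(block.array(), 0, block.position());
            block.clear();
            // Tree blocks above the leaf level would be built in the same
            // streaming fashion (see the description above); omitted here.
        }

        @Override public void close() throws IOException {
            flushBlock();
            out.close();
        }
    }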

This results in data blocks being completely full, unlike the standard
B+Tree insertion algorithm. It is why indexes are exactly the same size.

Building SPO is faster because the data is nearly sorted to start with;
data often tends to arrive grouped by subject.

tdbloader2 is doing stream (append) I/O on index files, not a random
access pattern.

== TDB1 tdbloader1
   29 hours 43 minutes // 20,560 TPS

106,975 seconds
297G    DB-truthy

node2id: 12G
OSP:     97G
SPO:     96G
POS:     98G

Same size node2id table, larger indexes.

Algorithm:

Data phase:

parse the file and create the node table and the SPO index.
The creation of SPO is by B+Tree insert, so blocks are partially full
(empirically, about 2/3 full on average). When a block fills up, it is
split into 2.  The node table is exactly the same as tdbloader2's
because nodes are stored in the same order.
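
A toy illustration of why B+Tree insertion leaves blocks partially full
(a tiny in-memory leaf with a made-up capacity; the demo uses random
keys, so the exact figure will differ from the SPO case):

    import java.util.*;

    // A leaf splits in half when it overflows, so leaves run between
    // half full and full rather than completely packed.
    class LeafSketch {
        static final int CAPACITY = 4;              // tiny, to make splits visible
        final TreeSet<Long> keys = new TreeSet<>();

        // Insert a key; on overflow, split the upper half into a new leaf.
        Optional<LeafSketch> insert(long key) {
            keys.add(key);
            if (keys.size() <= CAPACITY)
                return Optional.empty();
            LeafSketch right = new LeafSketch();
            while (right.keys.size() < keys.size())
                right.keys.add(keys.pollLast());    // move the upper half across
            return Optional.of(right);              // caller links 'right' in after this leaf
        }

        public static void main(String[] args) {
            List<LeafSketch> leaves = new ArrayList<>(Collections.singletonList(new LeafSketch()));
            Random rnd = new Random(42);
            for (int i = 0; i < 10_000; i++) {
                long key = rnd.nextLong();
                int pos = 0;                        // find the leaf covering 'key'
                while (pos < leaves.size() - 1 && leaves.get(pos).keys.last() < key)
                    pos++;
                final int at = pos;
                leaves.get(pos).insert(key).ifPresent(r -> leaves.add(at + 1, r));
            }
            double avg = leaves.stream().mapToInt(l -> l.keys.size()).average().orElse(0) / CAPACITY;
            System.out.printf("average leaf occupancy: %.0f%%%n", avg * 100);
        }
    }

With random keys the reported occupancy typically comes out near 2/3,
which lines up with the figure above.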

Index phase:

for each index, copy SPO to the index.  This is a tree sort and the
access pattern on blocks is fairly random, which is a bad thing. Doing
one at a time is faster than doing two together because more of the RAM
in the OS-managed file system cache is devoted to caching one index.  A
cache miss is a possible write to disk, and always a read from disk,
which is a lot of work even with an SSD.

Stream reading SPO is efficient - it is not random I/O, it is stream I/O.

Once the efficiency of the OS disk cache drops, tdbloader1 slows down
markedly.
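
A sketch of the copy for one index (a hypothetical interface, not the
TDB classes): the SPO scan is sequential, but the permuted keys hit the
target index in effectively random order.

    import java.util.Iterator;

    public class IndexCopySketch {
        // Stand-in for a disk-backed B+Tree of 3-tuples.
        interface TupleIndex {
            Iterator<long[]> all();      // scan in key order: stream I/O
            void add(long[] key);        // insert one key: may touch any block
        }

        // Copy SPO into POS by permuting each tuple from (S,P,O) to (P,O,S).
        static void copySpoToPos(TupleIndex spo, TupleIndex pos) {
            Iterator<long[]> it = spo.all();
            while (it.hasNext()) {
                long[] t = it.next();    // t[0]=S, t[1]=P, t[2]=O
                pos.add(new long[] { t[1], t[2], t[0] });
            }
        }
    }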

== Comparison of TDB1 loaders.

Building an index is a sort because the B+Trees hold data sorted.

The approach of tdbloader2 is to use an external sort algorithm (i.e.
sort larger than RAM using temporary files) done by a highly tuned
utility, unix sort(1).
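
For illustration, an external merge sort in miniature (sort(1) does
this far better; the chunk size and temp-file handling are
placeholders):

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class ExternalSortSketch {
        private static final int CHUNK_LINES = 1_000_000;   // lines per in-memory run (assumed)

        public static void sort(Path input, Path output) throws IOException {
            List<Path> runs = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<String> chunk = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == CHUNK_LINES)
                        runs.add(spill(chunk));
                }
                if (!chunk.isEmpty())
                    runs.add(spill(chunk));
            }
            merge(runs, output);
        }

        // Sort one in-memory chunk and write it to a temporary run file.
        private static Path spill(List<String> chunk) throws IOException {
            Collections.sort(chunk);
            Path run = Files.createTempFile("run-", ".txt");
            Files.write(run, chunk);
            chunk.clear();
            return run;
        }

        private static final class Head {
            final String line; final BufferedReader reader;
            Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        }

        // K-way merge of the sorted runs, smallest current line first.
        private static void merge(List<Path> runs, Path output) throws IOException {
            PriorityQueue<Head> pq = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run);
                String first = r.readLine();
                if (first != null) pq.add(new Head(first, r));
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                while (!pq.isEmpty()) {
                    Head h = pq.poll();
                    out.write(h.line); out.newLine();
                    String next = h.reader.readLine();
                    if (next != null) pq.add(new Head(next, h.reader));
                    else h.reader.close();
                }
            }
        }
    }

GNU sort layers careful buffering, batched merging of many runs and
parallelism on top of the same basic idea.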

The approach of tdbloader1 is to copy into a sorted data structure. For
example, when copying index SPO to POS, it creates a file with keys
sorted by P then O then S, which is not the arrival order (which is
S-sorted).  tdbloader1 maximises OS caching of memory-mapped files by
doing indexes one at a time.  Experimentation shows that doing two at
once is slower, and doing two in parallel is no better, and sometimes
worse, than doing them sequentially.

== TDB2

TDB2 is experimental.  The current TDB2 loader is a functional placeholder.

It writes all three indexes at the same time.  While for SPO this is
not a bad access pattern (subjects are naturally grouped), for POS and
OSP the I/O is a random pattern, not a stream pattern.  There is more
than double the contention for the OS disk cache, hence it is slow, and
it gets slower faster as the load progresses.
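
A sketch of that single pass (a hypothetical interface, not the TDB2
API):

    public class TripleIndexerSketch {
        // Stand-in for a disk-backed B+Tree index; insertion order drives its I/O pattern.
        interface TupleIndex { void add(long x, long y, long z); }

        private final TupleIndex spo, pos, osp;

        TripleIndexerSketch(TupleIndex spo, TupleIndex pos, TupleIndex osp) {
            this.spo = spo; this.pos = pos; this.osp = osp;
        }

        // One parsed triple as node ids, in arrival order (grouped by subject).
        void add(long s, long p, long o) {
            spo.add(s, p, o);   // keys mostly increasing: stream-like block access
            pos.add(p, o, s);   // key order unrelated to arrival order: random block access
            osp.add(o, s, p);   // likewise random
        }
    }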

== More details.

For more information, consult the Jena dev@ and user@ archives and the code.
--


---
Marco Neumann
KONA