Markdown Linked Data
Performance improvements
tl;dr:
switching from Comunica to Oxigraph for a 300X SPARQL query speedup
Part 1 was originally written a few weeks ago, but I finally got this site updated with a new static site generator to make it easier to publish updates.
I’ll be working on a proper Part 2 to dive deeper into SPARQL queries and the patterns SPARQL enables, but in the meantime I wanted to note a big performance improvement.
I started this project using Comunica as the query engine. Since Comunica is written in JavaScript, it was easy to integrate with the SvelteKit app, and it offers a lot of extensibility for hooking into the query pipeline. The RDF processing was implemented as custom Source and Store interface implementations. In an effort to only add complexity as needed, the initial version re-parsed files on the fly for each query, and as that stopped scaling well I added more optimizations.
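To make that concrete, here is a minimal sketch (not the actual implementation) of what the naive version looked like: an RDF/JS Source that re-parses every file on each match() call. The listMarkdownFiles() and extractQuads() helpers are hypothetical stand-ins for the app’s real extraction logic.

```typescript
import { Store } from "n3";
import type { Quad, Term } from "@rdfjs/types";

// Hypothetical helpers standing in for the app's Markdown extraction logic
declare function listMarkdownFiles(): string[];
declare function extractQuads(path: string): Quad[];

// Naive RDF/JS Source: rebuilds the data from scratch on every pattern match
class MarkdownSource {
  match(
    subject?: Term | null,
    predicate?: Term | null,
    object?: Term | null,
    graph?: Term | null,
  ) {
    const store = new Store();
    for (const file of listMarkdownFiles()) {
      store.addQuads(extractQuads(file));
    }
    // N3's Store implements the RDF/JS Source interface, so delegate to it
    return store.match(subject, predicate, object, graph);
  }
}
```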
One key optimization was to load the RDF data into an N3 in-memory store, which would be invalidated when files were updated. This worked well for a while, but after using it heavily on a daily basis my setup has grown to 672 documents containing 8,672 RDF quads, and performance was starting to suffer. Simple queries were still relatively fast, but the main query driving my “Today” dashboard, which lists the tasks I have scheduled for the day, was creeping along at a painful 6s.
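The invalidation itself is simple, something along these lines (again with hypothetical extractQuads() and fileGraph() helpers, where fileGraph() derives a graph IRI from the file path):

```typescript
import { DataFactory, Store } from "n3";
import type { Quad } from "@rdfjs/types";
const { namedNode } = DataFactory;

// Hypothetical helpers: extraction logic and a file-path-to-graph-IRI mapping
declare function extractQuads(path: string): Quad[];
declare function fileGraph(path: string): string;

const cache = new Store();

function onFileChanged(path: string): void {
  const graph = namedNode(fileGraph(path));
  // Drop every quad previously loaded from this file...
  cache.deleteGraph(graph);
  // ...then re-add the file's current contents under the same graph
  for (const q of extractQuads(path)) {
    cache.addQuad(q.subject, q.predicate, q.object, graph);
  }
}
```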
I had discovered Oxigraph fairly early on and knew I’d probably want to switch to it eventually; it seemed like now was the time. Oxigraph is implemented in Rust, and there is a WASM wrapper which may come in handy later, but I decided to go ahead and add support for the binary SPARQL server CLI. The server has endpoints for running SPARQL queries, but also for uploading and replacing an entire dataset, or replacing individual named graphs.
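For reference, running a query against the server looks something like this (7878 is the default port, and results come back as standard SPARQL JSON):

```bash
curl -X POST -H 'Content-Type: application/sparql-query' \
  -H 'Accept: application/sparql-results+json' \
  --data 'SELECT * WHERE { ?s ?p ?o } LIMIT 10' \
  "http://localhost:7878/query"
```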
In the initial Comunica implementation I had included all data in the default graph. However, when adding the N3 cache I started using named graphs internally to track which file each piece of data came from. When a file was modified, I called Store.deleteGraph to clear the old data associated with the file and re-add the updated contents. This ended up being fortuitous for Oxigraph, which supports HTTP PUT to replace the contents of a graph like:
curl -f -X PUT -H 'Content-Type: application/n-triples' \
  -T MY_FILE.nt "http://localhost:7878/store?graph=http://example.com/g"
So now, on startup, the app reloads the whole dataset from all available Markdown files, and then when individual files are modified it uses this graph replacement pattern to keep the Oxigraph server in sync.
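Sketched out, the per-file sync is just a fetch mirroring the curl command above (serializeToNTriples() is a hypothetical helper standing in for the app’s serialization step, and fileGraph() is the same path-to-IRI mapping as before):

```typescript
declare function serializeToNTriples(path: string): Promise<string>;
declare function fileGraph(path: string): string;

const OXIGRAPH_URL = "http://localhost:7878"; // assumed server location

async function syncFile(path: string): Promise<void> {
  const graph = encodeURIComponent(fileGraph(path));
  // PUT replaces the named graph's entire contents (SPARQL Graph Store Protocol)
  const res = await fetch(`${OXIGRAPH_URL}/store?graph=${graph}`, {
    method: "PUT",
    headers: { "Content-Type": "application/n-triples" },
    body: await serializeToNTriples(path),
  });
  if (!res.ok) throw new Error(`Graph replacement failed: ${res.status}`);
}
```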
However, before starting on the sync implementation I did a quick test: I exported the existing data and loaded it into Oxigraph. The Today dashboard query that had been taking 6 seconds now ran in just 0.018 seconds. I expected it to be faster, but 300X was certainly a welcome surprise.
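The bulk load itself can be done with a single POST of the dump to the same /store endpoint; something like this, assuming an N-Quads export named dump.nq:

```bash
curl -f -X POST -H 'Content-Type: application/n-quads' \
  -T dump.nq "http://localhost:7878/store"
```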
At this point Oxigraph is still an optional optimization. If an Oxigraph server is not configured, the app falls back to the Comunica implementation. Comunica is also still used for writing updates, though I may explore other ways to tie into Oxigraph later as needed.
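As a rough sketch of that fallback, with a hypothetical OXIGRAPH_URL config value, queries go to the Oxigraph endpoint when it’s set and through Comunica otherwise:

```typescript
import { QueryEngine } from "@comunica/query-sparql";
import { Store } from "n3";

const cache = new Store();            // the in-memory N3 store from earlier
const engine = new QueryEngine();
const OXIGRAPH_URL = process.env.OXIGRAPH_URL; // hypothetical config value

async function select(query: string) {
  if (OXIGRAPH_URL) {
    // Standard SPARQL Protocol query against the Oxigraph server
    const res = await fetch(`${OXIGRAPH_URL}/query`, {
      method: "POST",
      headers: {
        "Content-Type": "application/sparql-query",
        Accept: "application/sparql-results+json",
      },
      body: query,
    });
    return (await res.json()).results.bindings;
  }
  // Fall back to Comunica querying the in-memory store directly.
  // NOTE: the two branches return different binding representations;
  // a real implementation would normalize them.
  const bindings = await engine.queryBindings(query, { sources: [cache] });
  return bindings.toArray();
}
```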