Abstract

Given historical versions of an RDF graph, we propose and compare several methods to predict whether or not the results of a SPARQL query will change for the next version. Unsurprisingly, we find that the best results for this task are achieved by considering the full history of results for the query over previous versions of the graph. However, given a previously unseen query, producing historical results requires costly offline maintenance of previous versions of the data, and costly online computation of the query results over these previous versions. This prompts us to explore more lightweight alternatives that rely on features computed from the query and statistical summaries of historical versions of the graph. We evaluate the quality of the predictions produced over weekly snapshots of Wikidata and daily snapshots of DBpedia. Our results provide insights into the trade-offs for predicting SPARQL query dynamics: a detailed history of changes for a query's results enables much more accurate predictions than more lightweight alternatives, but incurs higher overhead at runtime.

Predicting OSC Query Dynamics

Figure 1: Proposed architecture for predicting change OSC given a query 𝑸 and dynamic RDF graph 𝓖
We present the architecture of our proposed system for predicting OSC in Figure 1. The inputs are a query 𝑸 and a dynamic RDF graph 𝓖. The system extracts a feature vector (𝑓₁,...,𝑓ₖ) from these inputs and feeds it into a pre-trained binary classifier to make the OSC prediction. The query features (𝑓₁,...,𝑓ᵢ) are extracted online from the query itself. The predicate and degree features (𝑓ᵢ₊₁,...,𝑓ⱼ) are extracted from a statistical description d(𝓖) of 𝓖, whose details will be described later; in practice, d(𝓖) can be computed and maintained offline (independently of the query) in an incremental manner, with each update requiring only the two most recent versions of 𝓖. Finally, the results features (𝑓ⱼ₊₁,...,𝑓ₖ) require as input the full historical results of 𝑸 for each version of 𝓖 (which we denote by 𝑸(𝓖)); these must be computed online. The binary classifier is pre-trained over a given set of queries for which ground truths are computed over withheld versions.
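The pipeline described above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the concrete features, the shape of d(𝓖) (here a per-predicate count of version transitions with changes), the representation of a query as a list of triple patterns, and all function names are assumptions made purely for illustration. The sketch only shows how the three feature groups are computed at different times and concatenated before being passed to a classifier.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Graph = FrozenSet[Triple]              # one version of the RDF graph
Results = FrozenSet[Tuple[str, ...]]   # results of the query over one version


def update_description(d: Dict[str, int], older: Graph, newer: Graph) -> Dict[str, int]:
    """Offline step: fold a new version into d(G) using only the two latest versions.

    Assumption: d(G) maps each predicate to the number of version
    transitions in which one of its triples was added or removed.
    """
    changed_predicates = {p for (_, p, _) in older ^ newer}  # symmetric difference
    updated = dict(d)
    for p in changed_predicates:
        updated[p] = updated.get(p, 0) + 1
    return updated


def query_features(patterns: Sequence[Triple]) -> List[float]:
    """Online step: hypothetical query features (number of patterns and variables)."""
    variables = {t for pattern in patterns for t in pattern if t.startswith("?")}
    return [float(len(patterns)), float(len(variables))]


def description_features(patterns: Sequence[Triple], d: Dict[str, int],
                         transitions: int) -> List[float]:
    """Hypothetical feature from d(G): highest change rate among the query's predicates."""
    if transitions == 0:
        return [0.0]
    rates = [d.get(p, 0) / transitions
             for (_, p, _) in patterns if not p.startswith("?")]
    return [max(rates, default=0.0)]


def results_features(history: Sequence[Results]) -> List[float]:
    """Hypothetical results features: share of transitions with changed results,
    and whether the most recent transition changed the results."""
    changes = [a != b for a, b in zip(history, history[1:])]
    if not changes:
        return [0.0, 0.0]
    return [sum(changes) / len(changes), float(changes[-1])]


def build_feature_vector(patterns: Sequence[Triple], d: Dict[str, int],
                         transitions: int, history: Sequence[Results]) -> List[float]:
    """Concatenate the three feature groups into one vector."""
    return (query_features(patterns)
            + description_features(patterns, d, transitions)
            + results_features(history))


def predict_change(vector: Sequence[float],
                   classifier: Callable[[Sequence[float]], bool]) -> bool:
    """Delegate to any pre-trained binary classifier exposed as a callable."""
    return classifier(vector)


# Usage with three toy versions of a graph and a stand-in classifier.
v1: Graph = frozenset({("a", "p", "b")})
v2: Graph = frozenset({("a", "p", "b"), ("c", "q", "d")})
v3: Graph = frozenset({("a", "p", "e"), ("c", "q", "d")})

description = update_description(update_description({}, v1, v2), v2, v3)
query = [("?s", "p", "?o")]
history = [frozenset({("a", "b")}), frozenset({("a", "b")}), frozenset({("a", "e")})]

features = build_feature_vector(query, description, transitions=2, history=history)
# Stand-in for the pre-trained model: a threshold on the results change ratio.
will_change = predict_change(features, lambda v: v[3] >= 0.5)
```

In this sketch, `update_description` never touches more than two versions of the graph, whereas `results_features` needs the query results for every version, which reflects the cost difference between the lightweight features and the results features.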

Experimental Data