An unstudied problem: query optimization for streaming data

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

Interesting:

I haven’t done any kind of systematic literature review recently, so the directions below are kind of a random sampling of things I happen to be have stumbled over.

Exploring the plan space seems like a first step. It seems likely to behave differently than in a batch setting. For example, in a batch query when you’re joining a big relation against a small relation, you want to build an index on the small relation and scan the big relation against it, rather than the other way around. But in a streaming setting you have to build indexes on both relations anyway so this entire dimension of decision making is taken off the table. For another example, for join ordering in a batch setting we want to minimize the size of the intermediate results, but in a streaming setting we also want to minimize the change rate of the intermediate results which might be achieved by an ordering with larger intermediate results. So we need to figure out which are the most consequential decisions for streaming plans.

Replanning queries has a rich academic literature in the context of long-running batch systems. Likely this could be adapted to incremental/streaming settings. I’m somewhat bearish on this approach because of the sheer operational terror of having your cluster suddenly decide to shift itself into a different and not-yet-tested configuration. But maybe we can find some control theory -like framework that bounds the costs of transitions.

There is also a lot of academic work on multi-query optimization. I don’t really know anything about it yet other than that it exists, but it seems like an obvious seam to mine.

I’ve seen some work into robust cardinality estimation (to oversimplify – propagate error bars in estimates) but less into robust operators. If we build plans from operators that work well across large ranges of input distributions and that don’t have catastrophic failure modes, then we won’t be as reliant on accurate statistics. The example that came up in Query Optimization Through the Looking Glass is the nested loop operator in postgres which is faster in a small part of the input space but disastrously slower in the rest of it, making it a risky choice if your estimates might be inaccurate. There’s also some folk wisdom that worst-case optimal joins are more robust to poor join orders than traditional binary joins, but I haven’t seen this systematically tested. I think this is on the whole an under-explored direction.

I’ve also seen some work on worst-case cardinality estimation (eg). Rather than take a point estimate of the input data and emit a point estimate of the output cardinality, the approach is to take bounds on the possible input distributions (in the form of statements about the relative entropy of different subsets) and calculate a tight upper bound on the largest possible output cardinality across the whole space of possible inputs. I like this approach because it works better the more information it has but it doesn’t make disastrous decisions when it has little information. It’s effectiveness though is going to depend on whether we care more about finding the best plans or avoiding the worst plans, which in turn will depend on the distribution of costs across the space of plans, which brings us back to exploring the plan space.

Finally, I’ve done some tentative exploration into avoiding query planning entirely. If the space of query plans is simple enough then it’s feasible to allow the programmer to choose the plan, either by providing hints or in my case by having a direct mapping from the query language to the plan. For the Join Order Benchmark it seemed that having a human spend 30s per query making the obvious calls is sufficient to get pretty good plans compared to postgres. But there were too many confounding factors in that test to be sure and I haven’t revisited the work since to find out how the chosen plans compare to the best possible plans in the same execution engine.

Post external references

  1. 1
    https://scattered-thoughts.net/writing/why-query-planning-for-streaming-systems-is-hard
  2. 2
    https://db.in.tum.de/~leis/papers/lookingglass.pdf
  3. 3
    https://arxiv.org/pdf/1803.09930.pdf
  4. 4
    https://homes.cs.washington.edu/~suciu/sigmod-2019-pessimistic.pdf
  5. 5
    https://scattered-thoughts.net/writing/a-practical-relational-query-compiler-in-500-lines/#planning
Source