Implicit array cross joins

First of all, I’d like to say that this is a very cool specification. I’m looking forward to seeing some (efficient) real world implementations on top of nested schemaless data.

I’ll probably end up posting this as an issue once the language specification repo is online, but for now…
I have a suggestion regarding the language syntax for joins into nested array structures. To lower the barrier to entry, I believe it can be beneficial to allow implicit cross joins into arrays.

This can be achieved without breaking compatibility with SQL or the current syntax by adding a [] array accessor. So for example, for the following data:

		<< { 'household_id': 1,
		     'devices': [ { 'device_id': 1,
		                    'measurements': [ { 'frequency': 59.4, 'voltage': 219 },
		                                      { 'frequency': 60.1, 'voltage': 222 } ] },
		                  { 'device_id': 2,
		                    'measurements': [ { 'frequency': 58.4, 'voltage': 221 } ] } ] } >>

This query:

SELECT household_id,
       devices[].measurements[].voltage AS voltage
FROM   iot_events

would be equivalent to:

SELECT household_id,
       m.voltage AS voltage
FROM   iot_events AS i, i.devices AS d, d.measurements AS m
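To make the cardinality of that expansion concrete, here is a plain-Python sketch (illustrative only, not PartiQL) of what the explicit cross join iterates over, using the sample data from above:

```python
# Sample data mirroring the iot_events bag above.
iot_events = [
    {"household_id": 1,
     "devices": [
         {"device_id": 1,
          "measurements": [{"frequency": 59.4, "voltage": 219},
                           {"frequency": 60.1, "voltage": 222}]},
         {"device_id": 2,
          "measurements": [{"frequency": 58.4, "voltage": 221}]}]},
]

# FROM iot_events AS i, i.devices AS d, d.measurements AS m
# One output row per (event, device, measurement) combination.
rows = [(i["household_id"], m["voltage"])
        for i in iot_events
        for d in i["devices"]
        for m in d["measurements"]]

print(rows)  # three rows: the cardinality follows the innermost array
```

Note how a single input row fans out into three output rows, which is exactly the cardinality concern discussed below.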

The general case is pretty straightforward. In the case where an implicit array cross join happens inside an aggregate function, it would instead be transformed as follows:

SELECT household_id,
       COLL_MAX(devices[].measurements[].voltage) AS max_voltage
FROM   iot_events

into:

SELECT household_id,
       COLL_MAX(SELECT VALUES m.voltage FROM d.measurements AS m) AS max_voltage
FROM   iot_events AS i, i.devices AS d

Implicit array cross joins within aggregate functions (COLL or not) are expanded as cross joins within the aggregation (unless they’re also used outside the aggregation).
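As a sanity check on that rule, here is a plain-Python sketch (illustrative only, not PartiQL) of the transformed query: the measurements join moves inside the aggregate, so the max is computed per (household, device) row rather than globally:

```python
# Sample data mirroring the iot_events bag above.
iot_events = [
    {"household_id": 1,
     "devices": [
         {"device_id": 1,
          "measurements": [{"frequency": 59.4, "voltage": 219},
                           {"frequency": 60.1, "voltage": 222}]},
         {"device_id": 2,
          "measurements": [{"frequency": 58.4, "voltage": 221}]}]},
]

# SELECT household_id, COLL_MAX(...) FROM iot_events AS i, i.devices AS d
# The devices join stays outside; the measurements join is aggregated inside.
result = [{"household_id": i["household_id"],
           "max_voltage": max(m["voltage"] for m in d["measurements"])}
          for i in iot_events
          for d in i["devices"]]
```

This yields one row per device, each carrying the max over that device's own measurements.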


Thanks for the suggestion @yoni. This reminds me somewhat of the FLATTEN function in Drill SQL. My primary concern with this proposal is that the syntax changes the cardinality of the result set in a way that is hard to see syntactically. The first time I encountered Drill’s FLATTEN, it surprised me that the cardinality changed. One could argue that standard SQL’s aggregates do something very similar, but we’ve been trying to minimize such things for non-backwards-compatible use cases (because IMHO, SQL aggregates are really quite weird when you think about their semantics).

What do you think about the cardinality concern? Also, would this syntax be supported in other clauses or limited to SELECT?

For the COLL_MAX example, we do have a syntax that would work here:

SELECT ..., COLL_MAX(devices[*].measurements[*].voltage)

The wild-card paths are similar to your proposal, but are independent from the FROM clause.

Good point @almann, the flattening is actually not my intention but rather a side effect, because I couldn’t figure out how to describe the desired logic in your environment/binding semantics.
In fact, when using such semantics in our system, we do not implicitly flatten but rather, in a general sense, scope the operation to the common array element.

Here’s how the same semantic would be defined without causing that side effect:

Consider for example the expression

devices[].measurements[].frequency * devices[].measurements[].voltage

In a broader context, and before talking about implicit operations: if we were to define a target location, then a user would expect the operation to be scoped to that target location:

SELECT devices[].measurements[].frequency * devices[].measurements[].voltage AS devices[].measurements[].mul
FROM   iot_events

The expected result would be:

<< { 'devices': [ { 'measurements': [ { 'mul': 13008.6 }, { 'mul': 13342.2 } ] },
                  { 'measurements': [ { 'mul': 12906.4 } ] } ] } >>
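In plain-Python terms (a sketch for illustration, not PartiQL), the multiplication is applied per measurement and the array nesting of the target location is preserved:

```python
# Sample data, with only the fields relevant to the example.
iot_events = [
    {"devices": [
        {"measurements": [{"frequency": 59.4, "voltage": 219},
                          {"frequency": 60.1, "voltage": 222}]},
        {"measurements": [{"frequency": 58.4, "voltage": 221}]}]},
]

# frequency * voltage computed within each measurement; the devices/
# measurements nesting is rebuilt around the result rather than flattened.
result = [
    {"devices": [
        {"measurements": [{"mul": round(m["frequency"] * m["voltage"], 1)}
                          for m in d["measurements"]]}
        for d in i["devices"]]}
    for i in iot_events
]
```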

Two examples to illustrate the difference the target location makes:

SELECT a[].b * a[].c[].d AS a[].c[].e
FROM   << { 'a': [ { 'b': 2, 'c': [ { 'd': 2 }, { 'd': 3 } ] } ] } >>

returns:

<< { 'a': [ { 'c': [ { 'e': 4 }, { 'e': 6 } ] } ] } >>


SELECT a[].b * a[].c[].d AS a[].e
FROM   << { 'a': [ { 'b': 2, 'c': [ { 'd': 2 }, { 'd': 3 } ] } ] } >>

returns:

<< { 'a': [ { 'e': [ 4, 6 ] } ] } >>

Note that the output type is well-defined: it is wrapped in an array (or bag) if at least one of the inputs is in an array (or bag) relative to the target location.
Additional array levels are flattened, though this could be changed to preserve them if desired (preserving them is semantically more complicated, since the resulting nested implicit hierarchical structure might be difficult for a user to understand).

If there isn’t a target location (in an inline function call) then the target location is implicitly defined as:

  1. If the operation is wrapped in an aggregate function, then the deepest nested location of the aggregate’s containing function’s other parameters, or the root if there aren’t any.
  2. In any other case, the deepest nested input array, or the first in the case of a tie in depth.

For example:

SELECT VALUES a[].b, a[].b + COLL_SUM(a[].c[].d) AS a[].e
FROM   << { 'a': [ { 'b': 2, 'c': [ { 'd': 2 }, { 'd': 3 } ] } ] } >>

would return

<< { 'a': [ { 'b': 2, 'e': 7 } ] } >>
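In plain-Python terms (a sketch of rule 1 above, not PartiQL), the aggregate scopes to each element of a, so e becomes b plus the sum of the d values under that same element:

```python
data = [{"a": [{"b": 2, "c": [{"d": 2}, {"d": 3}]}]}]

# For each element x of a: keep b, and compute e = b + sum of x.c[].d.
# The COLL_SUM is scoped to x (the deepest common location of its siblings).
result = [{"a": [{"b": x["b"],
                  "e": x["b"] + sum(c["d"] for c in x["c"])}
                 for x in item["a"]]}
          for item in data]
```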

I’ll try to put together a PartiQL query that achieves the same result; I believe you’d need a double nested PIVOT + UNPIVOT query.

This syntax is easily showcased in the trivial case of selecting a sub-part of a nested part of the current binding (not useful in and of itself, but shows the expansion of expressiveness). For example, if my binding is

<< { 'a': [ { 'b': 1, 'c': 2 } ] } >>

And the resulting binding I want is

<< { 'a': [ { 'b': 1 } ] } >>

I’m not sure what the best practice would be to achieve that currently.
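For reference, here is what that trivial operation amounts to in plain Python (illustrative only; this is not an existing PartiQL construct):

```python
data = [{"a": [{"b": 1, "c": 2}]}]

# Keep only b inside each element of a, preserving the array nesting.
result = [{"a": [{"b": x["b"]} for x in r["a"]]} for r in data]
```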

In the case of a cartesian product (unrelated arrays), this syntax could simply fail to compile, since that’s almost never what people actually want to do.

Does this make more sense to you?

Sorry for the delay in responding; I’ve been trying to digest your very detailed response, and I am not quite sure I have it properly internalized.

For example:

devices[].measurements[].frequency * devices[].measurements[].voltage

What is going on here exactly? There is a multiplication operator, and two operands that don’t quite look like numbers. Something is going on here that is akin to standard SQL aggregation (where there is sort of an implicit lambda [really sorry if this is too jargon-y]), and I can’t quite wrap my head around the rules that you have (I think there is some kind of context sensitivity, but maybe I am wrong in my interpretation).

For your other example, there is a lot to unpack:

SELECT a[].b * a[].c[].d AS a[].c[].e
FROM   << { 'a': [ { 'b': 2, 'c': [ { 'd': 2 }, { 'd': 3 } ] } ] } >>

I sort of read this as a kind of functional edit of the input, is that right?

Not sure if I am missing the point; maybe starting from the core rule/semantics would be helpful.

You can achieve the same in PartiQL with nested queries.
Let’s pick your example

SELECT a[].b * a[].c[].d AS a[].c[].e
FROM   << { 'a': [ { 'b': 2, 'c': [ { 'd': 2 }, { 'd': 3 } ] } ] } >>

The PartiQL way to do it is the following. Notice that PartiQL avoids overloading and makes the nesting explicit, via the nesting of the respective SELECT / SELECT VALUES clauses.

SELECT ( SELECT VALUES { 'c': ( SELECT VALUES { 'e': va.b * vc.d }
                                FROM va.c AS vc AT ic
                                ORDER BY ic ) }
         FROM vtop.a AS va AT ia
         ORDER BY ia
       ) AS a
FROM << { 'a': [ { 'b': 2, 'c': [ { 'd': 2 }, { 'd': 3 } ] } ] } >> AS vtop
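For comparison, the nested SELECT VALUES structure corresponds one-to-one with nested comprehensions; a plain-Python sketch (illustrative only) of the same query:

```python
data = [{"a": [{"b": 2, "c": [{"d": 2}, {"d": 3}]}]}]

result = [
    # outer: ( ... ) AS a
    {"a": [
        # middle: SELECT VALUES { 'c': ... } FROM vtop.a AS va
        {"c": [
            # inner: SELECT VALUES { 'e': va.b * vc.d } FROM va.c AS vc
            {"e": va["b"] * vc["d"]} for vc in va["c"]]}
        for va in vtop["a"]]}
    for vtop in data
]
```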

Let me know what you think.

Thanks for the responses!

@yannip, that’s exactly what I meant by the example. Note the complexity of that query in PartiQL: it’s hard for me to imagine a run-of-the-mill business analyst, for example, being able to read that, let alone write it. My goal here is to support common work with nested data without requiring very advanced knowledge of PartiQL, or a very data-oriented thought process.

@almann, the purpose of this syntax is to allow users to do the things they would most likely want to do in nested data, without necessarily understanding the underlying semantics. I understand that the statement devices[].measurements[].frequency * devices[].measurements[].voltage is nonsensical in the sense that we’re multiplying two completely different objects, that don’t even have a multiplication operator defined on them.

But to continue with this example, a user has a very clear and concise idea, “I want to multiply the frequency by the voltage”, and wants to write a query to that effect. The syntax I’m offering is a (possibly too implicit) way of letting them do that, without needing to know too much about nested data and its gotchas. Keep in mind that while the sentence is slightly ambiguous, I’m pretty sure that the meaning is to multiply the two values within each array item together.

Maybe as an intermediate step, we could simply define an explicit scoping construct, which we can then choose to omit later on? For example []{ ... } which would allow operations within array objects in a binding (without deconstructing and then reconstructing them).

For example:

SELECT a[]{ b * c[]{ d AS e } }
FROM   << { 'a': [ { 'b': 2, 'c': [ { 'd': 2 }, { 'd': 3 } ] } ] } >>

I think that’s a bit too clunky (especially the b * ..., and d AS e which ends up multiplying b by d) but I’m hoping it would be a starting point to get to something a bit more natural. Perhaps this:

SELECT a[]{ c[]{ b * d AS e } }

Where then b would be shadowed if there was a b defined in the c array.

What do you think?

@yoni, we can achieve all goals: Goal 1 being “a general and efficient functional-based query language”, which is the underlying goal behind @almann’s points, and Goal 2 being “a quick way to write implicit array joins that preserve nesting”.

The solution is to introduce special syntax - the easiest way being a new function.

nested_multiply('a[].b', 'a[].c[].d', 'a[].c[].e')

Notice that, deliberately, the paths here are string arguments to be interpreted by nested_multiply. Thus a user who picks up cues from conventional path languages will not assume that each path has a conventional meaning on its own. Rather, nested_multiply gives the meaning.
The problems with bringing the proposed path * path into the semantics as an implicit join that preserves nesting are that (a) the paths’ meaning is suddenly different from the meaning that other languages (and PartiQL) assign to paths, and (b) the semantics explanation becomes long.
Your response to that is that it’s not necessary (or desirable) that everyone who uses a query language really knows its formal semantics. They can just as well be copying templates with only a high-level intuitive sense of what those templates do. This is a fine usage mode, but it can be achieved without overloading a syntactic construct. It is actually better achieved by a new syntactic construct because, this way, you remove the risk of someone having other preconceptions about it in her head.
More broadly, the function approach is not the only approach; other syntax can also be invented. The bigger point is that instead of overloading syntactic constructs with different semantics, we may do better by adding new syntax.

I like your direction, and I agree that this should use a new syntax rather than create ambiguity related to what exactly is going on.

My issue is actually a problem I also have with the nested aggregate functions, COLL_SUM, COLL_AVG, etc. From my experience, when you introduce a “new” syntax, it’s an uphill battle to reach adoption. While there isn’t actually any semantic difference between COLL_SUM and SUM (and I understand why it’s 100% necessary to have different names in that case, even though it’s undesirable) I think that this in and of itself is a high barrier to PartiQL adoption.

Similarly, it would be good to allow people to use functions and language constructs they’re familiar with (i.e. infix * instead of nested_multiply in this example).

What about this:

SELECT a[]{ c[]{ a[].b * a[].c[].d AS e } }

I think this is a lot more in line with how PartiQL works at the moment. Logically, this would mean “within each element of a, calculate …”. I added the full path to the value to remove ambiguity, but since it’s explicitly scoped, I don’t think that makes it unclear what’s going on. In the previous message, I wrote b * d, which I probably prefer, and maybe should be supported with standard SQL ambiguity resolution.

The nice thing here is that users can keep using the functions they’re used to. In fact, in this context, we could even revert to normal aggregation functions (similar to how they can be used in a sub-select without making the main query aggregated).

It’s unfortunately still a new syntax that users would need to familiarize themselves with, but maybe that’s the best we can do.

As an aside, the real value of this kind of syntax is within a LET clause, since we’re creating a new value which we’ll then want to use later on. I think it is also applicable to SELECT statements, but I wouldn’t necessarily see myself using it in every query I write, as opposed to LET clauses, where I think pretty much any operation on nested data would want to be this kind of “create in place” construct.