September 26th, 2023 Streaming Joins and Nested Tables (Reactive Database Series) By Jeffrey M. Barber

Real usage produces real problems! So, today, I’m going to talk about a problem that is emerging with nested tables. It’s a looming problem with a current customer, and it illustrates a gap in my offering: I don’t have any joins at the moment. Problematically, I also allow tables to be nested within tables, which has unfortunate consequences.

Nested tables

First, within Adama, tables can be nested within tables. This is because tables are the primary way to represent collections within Adama, and the document-store philosophy allows recursive structures. However, much like document stores, this can make queries complicated.

Here is the scenario (made anonymous): a user is part of a group, and groups can participate in an activity. Currently, this is modeled within Adama as:


record User {
  public int id;
  private principal who;
  public int group_id;
  public string name;
}
table<User> _users;

record Group {
  public int id;
  public string name;
}
table<Group> _groups;

record SignedUpGroup {
  public int id;
  public int group_id;
}

record Activity {
  public int id;
  public date when;
  public string name;
  table<SignedUpGroup> _signed_up;
}
table<Activity> _activities;

The queries that wreck the above model are: “Which activities am I signed up for?” and “Who else is signed up from my group?” The core reason is that your association to an activity is held inside _activities, so you have to iterate over every Activity. That sucks! Now, this is going to be fast enough within Adama since it is all held in memory, but it eats capacity and is wasteful. Still, it’s a good way to have this conversation, because it may be a show-stopper at a certain scale.
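To make the waste concrete, here is a sketch in pseudo-Adama of what answering those questions forces today. The variable my_group_id is a hypothetical stand-in for however the viewer’s group gets resolved from @who; only the foreach/iterate shapes below are taken from the real schema.

```
// pseudo-Adama sketch (syntax not verified): the association lives inside
// each Activity, so "which activities am I signed up for" must visit
// every Activity and its nested _signed_up table.
foreach (activity in (iterate _activities)) {
  foreach (signup in (iterate activity._signed_up where group_id == my_group_id)) {
    // activity.id is one of my activities; total work is proportional to
    // the number of activities plus all signups, regardless of how few match
  }
}
```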

Good news and self-service

The good news is that there are many ways to fix the issue. Ultimately, the core question is whether the schema provides all the query potential that is needed, so the fastest fix is for my client to optimize based on their needs. The path forward is to migrate to a different way of associating activities with groups. For example, the immediate fix is to introduce an association table.

record ActivityGroupAssoc {
  public int id;
  public int activity_id;
  public int group_id;

  index activity_id;
  index group_id;
}

table<ActivityGroupAssoc> _activity_group_assoc;
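With the indexes in place, the problem queries collapse into indexed lookups. A sketch, where my_group_id is again a hypothetical value derived from @who:

```
// sketch: "which activities is my group signed up for?" becomes a
// lookup against the group_id index instead of a scan of _activities
bubble my_signups =
  iterate _activity_group_assoc where group_id == my_group_id;
```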

Once you fix how data comes into the system (and it’s perfectly fine to replicate data), we can ensure _activity_group_assoc is populated using the @load event.

private bool has_converted_activity_group_assoc = false;
@load {
  if (!has_converted_activity_group_assoc) {
    has_converted_activity_group_assoc = true;
    foreach (activity in (iterate _activities)) {
      foreach (group in (iterate activity._signed_up)) {
        _activity_group_assoc <- {activity_id: activity.id, group_id: group.id};
      }
    }
  }
}

And, once the metric work is done, it will be possible to track the data upgrade via a handy metric:

metric has_converted_to_activity_group_assoc = 1;

Developers will be able to leverage this to know when it is safe to remove the @load event and clean up old data structures.

Ok, bad news

Beyond the complexity smell, there is currently no good way beyond manual testing to know if there is a real problem. The excess waste of scanning the table may be OK for now at small scale, but developers must be able to prioritize how they think about performance, and this requires visibility. Fortunately, a theme is building in my engineering backlog around providing a degree of logging and visibility. Currently, I’m looking at: (1) how do I give developers access to the metrics their documents emit, (2) how do I give developers the ability to inspect how they are using third-party services, and (3) how do I give developers insight into their traffic and availability? Ultimately, I need a logging/metrics service that I can then provide APIs/UIs for.

This visibility is becoming crucial as a client is looking towards a launch soon.

Joins and reactivity

At its core, Adama doesn’t support any kind of join, and this is intentional. Fundamentally, this is because Adama is a streaming environment and query planning is hard. This difficulty, in my opinion, stems from an over-commitment to the relational model: JOINs naturally require thinking about indices and optimization, or bad times are ahead.

My hope in the next year is to dive more into the reactivity of tables and optimize it hard! Adama leverages “naive reactivity,” where a formula X that uses fields A, B, and C will invalidate if any of A, B, or C change. Invalidation drops the formula into “compute now” mode until the end of the transactional boundary, where the value is cached on the next computation. Since tables are much more heavyweight, there is a lot of potential excess to cut out. For example,

table<User> _users;
bubble you = iterate _users where who == @who;

If there are 1,000 people connected, then inserting, updating, or removing a user will cause all 1,000 viewers to re-compute the you field, which will not change for most of them. This is wasted compute. Instead, I need to detect when a formula or bubble depends on a table and then change how table updates re-compute the formula. In the above example, the bubble could listen for specific types of updates and filter out the noise. That is, instead of simply invalidating the bubble, the table would publish an event like “add_item(id, who)” and the bubble would hold a list of checks like “index_change(who, $value)”.

Here, we can create a list of check types:

  • index_change(field, value)
  • primary_key(value)

Inserting and removing records will fire a primary_key event and an index_change for each indexed field. Updates will fire index_change for both the previous and the new value. If a formula or bubble requires a table scan, then any change will simply invalidate it, as happens today.
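To see how this would play out, here is a hypothetical trace for the you bubble from earlier, assuming who is an indexed field and using the event names sketched above; none of this syntax exists yet.

```
// hypothetical trace: user #7 updates their name
//
// _users fires (name is not indexed; who did not change):
//   index_change(who, <principal of user 7>)   // previous value
//   index_change(who, <principal of user 7>)   // new value (identical)
//
// the `you` bubble for viewer V registered the check:
//   index_change(who, V)
//
// for every viewer where V != <principal of user 7>, no check matches,
// so their bubbles skip recomputation; under naive reactivity, every
// connected viewer would have recomputed `you`.
```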

This detour into reactivity is fundamentally a requirement for even thinking about joins.

Manifesting joins via a waterfall

The way I want to deal with joins is by providing a way to automate the conversion of nested tables into a flat, virtualized table. For lack of a better word, let’s call it a waterfall table, because what is life without whimsy?

waterfall _activity_group_assoc {
    use_table _activities as a -> { int activity_id = a.id; index activity_id; }
    use_table a._signed_up as b -> { int group_id = b.group_id; index group_id; }
}

This waterfall syntax reads top-down and mirrors the prior @load event in terms of its nested foreach loops. The novel aspect is the translation from the table (or list) to a message fragment type, so this waterfall will behave exactly like the table<ActivityGroupAssoc> _activity_group_assoc defined earlier.
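As a sanity check on those semantics, a small worked example with made-up data:

```
// hypothetical data:
//   _activities row { id: 1 } with _signed_up rows { id: 10, group_id: 4 }
//                                              and { id: 11, group_id: 9 }
//   _activities row { id: 2 } with _signed_up row  { id: 12, group_id: 4 }
//
// the waterfall materializes the flattened rows:
//   { activity_id: 1, group_id: 4 }
//   { activity_id: 1, group_id: 9 }
//   { activity_id: 2, group_id: 4 }
//
// the same rows the @load migration produced by hand, but maintained
// incrementally as the nested tables change.
```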

However, this provides an opportunity for a richer change log from the tables, as the construction of this table is going to be expensive without some kind of reduction. The reason for inventing a new thing is to play around with differentiability. The above is really a “dynamic cross join,” where we cross each item with its child items. This raises the question of what use_table is and what its syntax should be, and all of this requires sorting out.

However, it also creates a fertile place for extension. For example, I could add a use_list:

waterfall research {
  use_table _some_table as a -> { int some_id = a.id; }
  use_list
    (iterate another_table where this_field == a.that_field)
    as b -> { int thing_id = b.id; }
}

This starts the path toward building something that could provide joins, and it builds out the underlying core while exploring. Isn’t research fun?