
Optimizing batch operations with Ecto

We had a feature in a product that required us to insert and update lots of data in response to a single user action. Here is how we transformed 1000 SQL queries into 1.

Intro

The way the code is structured is simple: an operations table and a translations table. When the user performs an action, we create an operation record and then process it to project changes onto the translations table.

operation = Repo.insert!(%Operation{action: "new", key: "A", text: "Foo"})
[new_translation] = ProcessOperation.process([operation])
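
For context, here is a minimal sketch of what the two schemas could look like, inferred from the examples in this article (the real ones have more fields):

defmodule Operation do
  use Ecto.Schema

  schema "operations" do
    field(:action, :string)
    field(:key, :string)
    field(:text, :string)
    belongs_to(:translation, Translation)
  end
end

defmodule Translation do
  use Ecto.Schema

  schema "translations" do
    field(:key, :string)
    field(:text, :string)
    field(:reviewed, :boolean, default: false)
  end
end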

Use case

The user action is typically a file upload:

{
  "greeting": "Hello, {{username}}",
  "success": "This is a success!",
  "error": "Oops, something went wrong"
}

This generates a list of operations. The first optimization was to generate a single INSERT INTO statement for all the operations. So for the file above, we create 3 operation records in a single SQL query.
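
A minimal sketch of that first step, assuming the uploaded file has already been decoded into a map (decoded_file is a placeholder name):

decoded_file = %{
  "greeting" => "Hello, {{username}}",
  "success" => "This is a success!",
  "error" => "Oops, something went wrong"
}

# Build one entry per key, then insert them all with a single
# INSERT INTO operations ... VALUES (...), (...), (...)
entries = for {key, text} <- decoded_file, do: %{action: "new", key: key, text: text}

{3, operations} = Repo.insert_all(Operation, entries, returning: true)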

But to process the operations, we simply issued a series of SQL queries, one per operation, because each one could be an insert, an update, a delete, etc.

operation_1 = Repo.insert!(%Operation{action: "new", key: "A", text: "Foo"})
operation_2 = Repo.insert!(%Operation{action: "new", key: "B", text: "Bar"})
ProcessOperation.process([operation_1, operation_2])

#> INSERT INTO translations VALUES ('A', 'Foo')
#> INSERT INTO translations VALUES ('B', 'Bar')

Optimizing the INSERT

The first step was to optimize the inserts. INSERT INTO is well supported by Ecto, so it was trivial to change the way ProcessOperation handles operations. Instead of issuing SQL queries in the same loop that processes the operations, we return instructions.

operation_1 = Repo.insert!(%Operation{action: "new", key: "A", text: "Foo"})
operation_2 = Repo.insert!(%Operation{action: "new", key: "B", text: "Bar"})
ProcessOperation.process([operation_1, operation_2])

#> [
#    {:insert, Translation, %{key: "A", text: "Foo"}},
#    {:insert, Translation, %{key: "B", text: "Bar"}}
#  ]
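
Under the hood, this mapping can be as simple as pattern matching on each operation. A minimal sketch that only handles the "new" action (the real module covers the other actions too):

defmodule ProcessOperation do
  # Map each operation to an instruction tuple instead of
  # executing a SQL query right away
  def process(operations) do
    Enum.map(operations, &to_instruction/1)
  end

  defp to_instruction(%Operation{action: "new", key: key, text: text}) do
    {:insert, Translation, %{key: key, text: text}}
  end
end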

With the list of instructions in hand, we reduce them by schema and batch them into a single final SQL statement.

Repo.insert_all(Translation, [%{key: "A", text: "Foo"}, %{key: "B", text: "Bar"}])
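
A minimal sketch of that reduce step, assuming instructions is the list returned by ProcessOperation.process/1 above:

instructions
|> Enum.group_by(
  fn {:insert, schema, _attrs} -> schema end,
  fn {:insert, _schema, attrs} -> attrs end
)
|> Enum.each(fn {schema, attrs_list} ->
  # One INSERT INTO per schema, no matter how many instructions
  Repo.insert_all(schema, attrs_list)
end)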

Optimizing the UPDATE

The next step was to optimize the UPDATE statements, which is also trivial with Ecto. The "simple" case is updating a list of records with the same attributes.

operation_1 = Repo.insert!(%Operation{action: "mark_as_reviewed", translation_id: 1})
operation_2 = Repo.insert!(%Operation{action: "mark_as_reviewed", translation_id: 2})
ProcessOperation.process([operation_1, operation_2])

#> [
#    {:update, Translation, 1, %{reviewed: true}},
#    {:update, Translation, 2, %{reviewed: true}}
#  ]

As with the INSERT, we can group the instructions by schema (and attributes) and aggregate the ids, so we can build our final SQL statement.

Repo.update_all(
  from(t in Translation, where: t.id in ^ids),
  [set: [reviewed: true]]
)
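
A minimal sketch of that grouping, where Map.to_list turns the attributes map into the keyword list update_all expects:

instructions
|> Enum.group_by(
  fn {:update, schema, _id, attrs} -> {schema, attrs} end,
  fn {:update, _schema, id, _attrs} -> id end
)
|> Enum.each(fn {{schema, attrs}, ids} ->
  # One UPDATE per unique {schema, attributes} pair
  Repo.update_all(from(row in schema, where: row.id in ^ids), set: Map.to_list(attrs))
end)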

Optimizing the dynamic UPDATE

This is the tricky part. We want to batch update operations where each operation carries its own values.

operation_1 = Repo.insert!(%Operation{action: "update", text: "Baz", translation_id: 1})
operation_2 = Repo.insert!(%Operation{action: "update", text: "Aux", translation_id: 2})
ProcessOperation.process([operation_1, operation_2])

#> [
#    {:update_dynamic, Translation, 1, %{text: "Baz"}},
#    {:update_dynamic, Translation, 2, %{text: "Aux"}}
#  ]

The generated SQL statement is not supported natively by Ecto. Using 2 tricks found on the Elixir forum (UPDATE with a join, and unnest), we can do this:

ids = [1, 2]
texts = ["Baz", "Aux"]

Repo.update_all(
  from(
    t in Translation,
    join: values in fragment(
      "SELECT * FROM unnest(?::integer[], ?::text[]) as t(id, text)",
      ^ids,
      ^texts
    ),
    on: values.id == t.id,
    update: [set: [text: values.text]]
  ),
  []
)
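
Since the update is declared inside the query, the second argument to update_all is simply an empty list. Conceptually, unnest zips the two arrays into rows of (id, text) and the join matches each row with its translation, so Postgres applies every per-row value in one statement, roughly:

#> UPDATE translations
#> SET text = values.text
#> FROM unnest($1::integer[], $2::text[]) AS values(id, text)
#> WHERE values.id = translations.id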

Conclusion

By separating the evaluation of an operation from its execution in SQL, we were able to batch a series of statements into a single SQL query. The concept was heavily influenced by the way Absinthe resolves fields, with tags like :ok, :async, :middleware, etc. By delaying the "final" execution until the end of the chain, we can do pretty interesting stuff :)

Check out the complete (and a bit more complicated) code in this pull request.