There are several ways for handling large result sets from the database in your application. Each way has its own limitations and pros and cons. All of them have the advantage of reducing memory footprint. In this blogpost we will discuss three ways and see how they compare.
For this blog we will generate a set of test data containing a couple of Order
s with some LineItem
s. For demonstration purposes we generate a small test data set, but it basically behaves the same for thousands of rows.
(1..5).each do |i|
order = Order.create!(description: "Order #{i}")
(1..3).each do |j|
LineItem.create!(order: order, description: "Line item #{j} for order #{i}")
end
end
Amongst similar methods active record by default provides find_each. This means you won't need any other dependencies when handling large result sets! Downside is you can't define a custom order since it relies on the id
column for querying the batches.
Order.find_each(batch_size: 2) do |order|
puts order.description
order.line_items.each {|line_item| nil; }
end
Results in the following queries (I cleaned the log a bit for readability):
SQL: SELECT "orders".* FROM "orders" ORDER BY "orders"."id" ASC LIMIT 2
>>> order 2210 Order 1
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2210]]
>>> order 2211 Order 2
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2211]]
SQL: SELECT "orders".* FROM "orders" WHERE ("orders"."id" > 2211) ORDER BY "orders"."id" ASC LIMIT 2
>>> order 2212 Order 3
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2212]]
>>> order 2213 Order 4
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2213]]
SQL: SELECT "orders".* FROM "orders" WHERE ("orders"."id" > 2213) ORDER BY "orders"."id" ASC LIMIT 2
>>> order 2214 Order 5
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2214]]
The first query active record does is SELECT "orders".* FROM "orders" ORDER BY "orders"."id" ASC LIMIT 2
. As stated above active record uses the id
to create batches of 2 using >
in the next batch of 2: SELECT "orders".* FROM "orders" WHERE ("orders"."id" > 2152) ORDER BY "orders"."id" ASC LIMIT 2
.
It uses LIMIT
to only retrieve the batch_size
number of orders.
As stated, the downside of using id is you can't define the order of the query. If you add an order
clause it will log a message saying Scoped order and limit are ignored, it's forced to be batch order and batch size
and the order is ignored.
Another problem using this approach is that it creates a n+1 problem since for each order it will do a query to the database to fetch the child relations. With real large result sets this will flood your database. Luckily this is easily fixed by fetching the child relations eagerly using includes
:
Order.includes(:line_items).find_each(batch_size: 2) do |order|
puts order.description
order.line_items.each {|line_item| nil; }
end
Results in the following queries (without n+1 problem):
SQL: SELECT "orders".* FROM "orders" ORDER BY "orders"."id" ASC LIMIT 2
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" IN (2210, 2211)
>>> order 2210 Order 1
>>> order 2211 Order 2
SQL: SELECT "orders".* FROM "orders" WHERE ("orders"."id" > 2211) ORDER BY "orders"."id" ASC LIMIT 2
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" IN (2212, 2213)
>>> order 2212 Order 3
>>> order 2213 Order 4
SQL: SELECT "orders".* FROM "orders" WHERE ("orders"."id" > 2213) ORDER BY "orders"."id" ASC LIMIT 2
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" IN (2214)
>>> order 2214 Order 5
Conclusion
includes
The postgresql_cursor is a nice gem that brings real postgresql cursors to active record. In order to use just require the gem and it will add several methods to you models for iterating over result sets. In this example we use each_instance
:
Order.each_instance(block_size: 2) do |order|
puts ">>> order #{order.id} #{order.description}"
order.line_items.each {|line_item| nil; }
end
Results in the following statements:
SQL: BEGIN
SQL: declare cursor_b1d5a08826e84f2693ce31dfd5b746e8 cursor for SELECT "orders".* FROM "orders"
SQL: fetch 2 from cursor_b1d5a08826e84f2693ce31dfd5b746e8
>>> order 2225 Order 1
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2225]]
>>> order 2226 Order 2
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2226]]
SQL: fetch 2 from cursor_b1d5a08826e84f2693ce31dfd5b746e8
>>> order 2227 Order 3
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2227]]
>>> order 2228 Order 4
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2228]]
SQL: fetch 2 from cursor_b1d5a08826e84f2693ce31dfd5b746e8
>>> order 2229 Order 5
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2229]]
SQL: fetch 2 from cursor_b1d5a08826e84f2693ce31dfd5b746e8
SQL: close cursor_b1d5a08826e84f2693ce31dfd5b746e8
SQL: COMMIT
Here we see that the postgresql_cursor
creates a postgresql cursor on the fly. Also a transaction is started since the cursor is created without the WITH HOLD
directive. This means that all resources are cleaned after the transaction completes automatically. See the postgresql documentation on declare for more information. Using the postgresql_cursor
we can now add custom ordering to our query like:
Order.order('description desc').each_instance(block_size: 2) do |order|
puts ">>> order #{order.id} #{order.description}"
order.line_items.each {|line_item| nil; }
end
Results in the following statements:
SQL: BEGIN
SQL: declare cursor_0bca06d6e41a436e8b96f8dc765cfa38 cursor for SELECT "orders".* FROM "orders" ORDER BY description desc
SQL: fetch 2 from cursor_0bca06d6e41a436e8b96f8dc765cfa38
>>> order 2259 Order 5
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2259]]
>>> order 2258 Order 4
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2258]]
SQL: fetch 2 from cursor_0bca06d6e41a436e8b96f8dc765cfa38
>>> order 2257 Order 3
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2257]]
>>> order 2256 Order 2
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2256]]
SQL: fetch 2 from cursor_0bca06d6e41a436e8b96f8dc765cfa38
>>> order 2255 Order 1
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" = $1 [["order_id", 2255]]
SQL: fetch 2 from cursor_0bca06d6e41a436e8b96f8dc765cfa38
SQL: close cursor_0bca06d6e41a436e8b96f8dc765cfa38
SQL: COMMIT
Conclusion
includes
which results in n+1 problem when traversing child relationsSince I recently needed both requirements (support for n+1 and ordering) I combined the two approaches and created a cursor to fit my needs.
Cursor.new(Order.order('description desc').includes(:line_items), fetch_size: BATCH_SIZE).find_each do |order|
puts ">>> order #{order.id} #{order.description}"
order.line_items.each {|line_item| nil; }
end
Without going into the internals this cursor basically wraps the query that is passed in and creates a postgresql cursor based on the query requirements. To be able to do this it inspects the active record internals and constructs the queries. It preloads the relations by using the ActiveRecord::Associations::Preloader
the same way active record uses it.
This results in the following statements:
SQL: BEGIN
SQL: declare mycursor3cf9b668771e4ef988a7c989680b104c cursor for SELECT "orders".* FROM "orders" ORDER BY description desc
SQL: fetch 2 in mycursor3cf9b668771e4ef988a7c989680b104c
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" IN (2259, 2258)
>>> order 2259 Order 5
>>> order 2258 Order 4
SQL: fetch 2 in mycursor3cf9b668771e4ef988a7c989680b104c
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" IN (2257, 2256)
>>> order 2257 Order 3
>>> order 2256 Order 2
SQL: fetch 2 in mycursor3cf9b668771e4ef988a7c989680b104c
SQL: SELECT "line_items".* FROM "line_items" WHERE "line_items"."order_id" IN (2255)
>>> order 2255 Order 1
SQL: fetch 2 in mycursor3cf9b668771e4ef988a7c989680b104c
SQL: close mycursor3cf9b668771e4ef988a7c989680b104c
SQL: COMMIT
Conclusion
includes