Say you have a very large table with 500 million rows, you need to select 1000 random rows from it, and you want it to be fast.
Given these specifications:

- You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
- Ideally no or few write operations.
- Your ID column must be indexed! A primary key serves nicely.
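For experimenting, a scaled-down stand-in matching these assumptions might look like the following sketch. It is hypothetical: the `payload` column is made up, and the 5 million row count is chosen to match the `id_span` of 5100000 used in the queries below.

```sql
-- Hypothetical test table: dense integer IDs plus some payload.
CREATE TABLE big AS
SELECT g AS id, 'row ' || g AS payload
FROM   generate_series(1, 5000000) g;

-- The indexed ID column (a primary key serves nicely):
ALTER TABLE big ADD PRIMARY KEY (id);
```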
The query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
```sql
SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;
```
The only possibly expensive part is the `count(*)` (for huge tables). Given the specifications above, you don't need an exact count. An estimate will do just fine and is available at almost no cost:
```sql
SELECT reltuples AS ct
FROM   pg_class
WHERE  oid = 'schema_name.big'::regclass;
```
As long as `ct` isn't much smaller than `id_span`, the query will outperform most other approaches.
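To put a number on that, you can compare the estimate to the ID span directly. A quick sanity check (a sketch, assuming the `big` table and `id` column from above):

```sql
-- Ratio of estimated row count to ID span; close to 1.0 means few gaps.
SELECT c.reltuples / (b.max_id - b.min_id + 1) AS fill_ratio
FROM   pg_class c
     , (SELECT min(id) AS min_id, max(id) AS max_id FROM big) b
WHERE  c.oid = 'big'::regclass;
```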
```sql
WITH params AS (
   SELECT 1       AS min_id            -- minimum id <= current min id
        , 5100000 AS id_span           -- rounded up. (max_id - min_id + buffer)
   )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p
        , generate_series(1, 1100) g   -- 1000 + buffer
   GROUP  BY 1                         -- trim duplicates
   ) r
JOIN   big USING (id)
LIMIT  1000;                           -- trim surplus
```
- Generate random numbers in the `id` space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
- Each `id` can be picked multiple times by chance (though very unlikely with a big ID space), so group the generated numbers (or use `DISTINCT`).
- Join the `id`s to the big table. This should be very fast with the index in place.
- Finally trim surplus `id`s that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.
You can simplify this query. The CTE in the query above is just for educational purposes:
```sql
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;
```
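To verify that the join uses the index as claimed, inspect the query plan. Exact output varies with Postgres version and data, but you should see an index scan on `big` rather than a sequential scan:

```sql
EXPLAIN (ANALYZE)
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;
```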
Refine with rCTE
Especially if you are not so sure about gaps and estimates.
```sql
WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
SELECT *
FROM   random_pick
LIMIT  1000;                           -- actual limit
```
We can work with a smaller surplus in the base query. If there are too many gaps and we don't find enough rows in the first iteration, the rCTE keeps iterating via the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached; alternatively, we would have to start with a buffer so large that it defeats the purpose of optimizing performance.
Duplicates are eliminated by the `UNION` in the rCTE. The outer `LIMIT` makes the CTE stop as soon as we have enough rows.
This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
Wrap into function
For repeated use with varying parameters:
```sql
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (        -- get current estimate from system
      SELECT c.reltuples * _gaps
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus              -- hint for query planner
         ) r (id)
      JOIN   big USING (id)           -- eliminate misses

      UNION                           -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick           -- just to make it recursive
         LIMIT  _limit                -- hint for query planner
         ) r (id)
      JOIN   big USING (id)           -- eliminate misses
      )
   SELECT *
   FROM   random_pick
   LIMIT  _limit;
END
$func$  LANGUAGE plpgsql VOLATILE ROWS 1000;
```
Call:

```sql
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
```
You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic type and use `EXECUTE`. But that's beyond the scope of this post.
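For illustration only, here is a minimal sketch of what such a generic variant could look like. It is not from the original post: the function name f_random_sample_any, its parameters, and the use of the simpler DISTINCT query (instead of the rCTE) are all assumptions.

```sql
-- Hypothetical generic variant: pass NULL::your_table to supply the row type.
CREATE OR REPLACE FUNCTION f_random_sample_any(_tbl_type anyelement
                                             , _pk_col   text = 'id'
                                             , _limit    int  = 1000
                                             , _gaps     real = 1.03)
  RETURNS SETOF anyelement AS
$func$
DECLARE
   _tbl      regclass := pg_typeof(_tbl_type)::text::regclass;
   _surplus  int := ceil(_limit * _gaps);
   _estimate int := (SELECT (reltuples * _gaps)::int
                     FROM   pg_class WHERE oid = _tbl);
BEGIN
   RETURN QUERY EXECUTE format(
      'SELECT t.*
       FROM  (
          SELECT DISTINCT 1 + trunc(random() * %s)::int AS id
          FROM   generate_series(1, %s)
          ) r
       JOIN   %s t ON t.%I = r.id
       LIMIT  %s'
    , _estimate, _surplus, _tbl, _pk_col, _limit);
END
$func$  LANGUAGE plpgsql VOLATILE;

-- Example call (the cast supplies the row type):
-- SELECT * FROM f_random_sample_any(NULL::big, 'id', 1000);
```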
If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
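A minimal sketch of that approach, reusing `f_random_sample()` from above (the view name `mv_random_pick` is made up):

```sql
CREATE MATERIALIZED VIEW mv_random_pick AS
SELECT * FROM f_random_sample();

-- Refresh at intervals or events of your choosing:
REFRESH MATERIALIZED VIEW mv_random_pick;
```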
Postgres 9.5 introduces `TABLESAMPLE SYSTEM (n)`.
It’s very fast, but the result is not exactly random. The manual:
> The `SYSTEM` method is significantly faster than the `BERNOULLI` method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
And the number of rows returned can vary wildly. For our example, to get roughly 1000 rows, try:
```sql
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
```
Where `n` is a percentage. The manual:

> The `BERNOULLI` and `SYSTEM` sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a **percentage between 0 and 100**. This argument can be any real-valued expression.
Bold emphasis mine.
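If repeated calls must return the same sample, you can pin it with a seed via `REPEATABLE` (the seed value 42 here is arbitrary; 0.02 % of ~5 million rows is roughly 1000):

```sql
SELECT * FROM big TABLESAMPLE SYSTEM (0.02) REPEATABLE (42);
```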