Best way to select random rows PostgreSQL

Given, you have a very large table with 500 Million rows, and you have to select some random 1000 rows out of the table and you want it to be fast.

Given the specifications:

  • You assumed to have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
  • Ideally no or few write operations.
  • Your ID column should have been indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
            , max(id)  AS max_id
            , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). You will get an estimate, available at almost no cost (detailed explanation here):

SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;

As long as ct isn’t much smaller than id_span, the query will outperform most other approaches.

WITH params AS (
    SELECT 1       AS min_id           -- minimum id <= current min id
         , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
    SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
    FROM   params p
          ,generate_series(1, 1100) g  -- 1000 + buffer
    GROUP  BY 1                        -- trim duplicates
    ) r
JOIN   big USING (id)
LIMIT  1000;                           -- trim surplus
  • Generate random numbers in the id space. You have “few gaps”, so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
  • Join the ids to the big table. This should be very fast with the index in place.
  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
    SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
    FROM   generate_series(1, 1100) g
    ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
SELECT *
FROM   random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps so we don’t find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached – or we have to start with a large enough buffer which defies the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT c.reltuples * _gaps
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN

   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   SELECT *
   FROM   random_pick
   LIMIT  _limit;
END
$func$  LANGUAGE plpgsql VOLATILE ROWS 1000;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

You could even make this generic to work for any table: Take the name of the PK column and the table as polymorphic type and use EXECUTE … But that’s beyond the scope of this post. See:

Possible alternative

IF your requirements allow identical sets for repeated calls (and we are talking about repeated calls) I would consider a materialized view. Execute above query once and write the result to a table. Users get a quasi random selection at lightening speed. Refresh your random pick at intervals or events of your choosing.

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

It’s very fast, but the result is not exactly random. The manual:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

And the number of rows returned can vary wildly. For our example, to get roughly 1000 rows, try:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

Bold emphasis mine.

Related:

Source

Convert Comma separated String to Rows in Oracle SQL

Many times we need to convert a comma separated list of terms in a single string and convert it rows in SQL query.

for example

 India, USA, Russia, Malaysia, Mexico

Needs to be converted to:

 Country
 India
 USA
 Russia
 Malaysia
 Mexico

The following SQL script can help in this:

just replace the required values with your string and your delimiter.

Apache Commons DbUtils Mini Wrapper

This is a very small DB Connector code in Java as a wrapper class to Apache DBUtils.

The Commons DbUtils library is a small set of classes designed to make working with JDBC easier. JDBC resource cleanup code is mundane, error prone work so these classes abstract out all of the cleanup tasks from your code leaving you with what you really wanted to do with JDBC in the first place: query and update data.

Some of the advantages of using DbUtils are:

  • No possibility for resource leaks. Correct JDBC coding isn’t difficult but it is time-consuming and tedious. This often leads to connection leaks that may be difficult to track down.
  • Cleaner, clearer persistence code. The amount of code needed to persist data in a database is drastically reduced. The remaining code clearly expresses your intention without being cluttered with resource cleanup.
  • Automatically populate Java Bean properties from Result Sets. You don’t need to manually copy column values into bean instances by calling setter methods. Each row of the Result Set can be represented by one fully populated bean instance.

DbUtils is designed to be:

  • Small – you should be able to understand the whole package in a short amount of time.
  • Transparent – DbUtils doesn’t do any magic behind the scenes. You give it a query, it executes it and cleans up for you.
  • Fast – You don’t need to create a million temporary objects to work with DbUtils.

DbUtils is not:

  • An Object/Relational bridge – there are plenty of good O/R tools already. DbUtils is for developers looking to use JDBC without all the mundane pieces.
  • A Data Access Object (DAO) framework – DbUtils can be used to build a DAO framework though.
  • An object oriented abstraction of general database objects like a Table, Column, or Primary Key.
  • A heavyweight framework of any kind – the goal here is to be a straightforward and easy to use JDBC helper library.

Wrapper:

How to find nth highest integer value from a table?

For a Sample table

name value
a 1
b 3
c 5
d 2
e 0
f 7

If we want to find the nth highest integer value, then the SQL would be:

  1. SELECT MIN(value) FROM (SELECT * FROM table ORDER BY value DESC)
  2. WHERE ROWNUM <=2

Reset Sequence in Oracle SQL back to 0

Some times we need to reset the sequence values in the database back to 0.
Here is a small procedure to reset any sequence.

create or replace
procedure reset_sequence(p_seq in varchar2)
is
l_value number;
begin
-- Select the next value of the sequence

execute immediate
'select ' || p_seq ||
'.nextval from dual' INTO l_value;

-- Set a negative increment for the sequence,
-- with value = the current value of the sequence

execute immediate
'alter sequence ' || p_seq ||
' increment by -' || l_value || ' minvalue 0';

-- Select once from the sequence, to
-- take its current value back to 0

execute immediate
'select ' || p_seq ||
'.nextval from dual' INTO l_value;

-- Set the increment back to 1

execute immediate
'alter sequence ' || p_seq ||
' increment by 1 minvalue 0';
end;
/

Whatever sequence you wish to reset, call the procedure as:

begin
    reset_sequence('SEQ_1');
    reset_sequence('SEQ_2');
    reset_sequence('SEQ_3');
    reset_sequence('SEQ_4');
end;
/

How to pass Service Name or SID in JDBC URL

Below is the way to send either Service Name or SID in JDBC URL to connect to Oracle SQL

For Service Name

@//host_name:port_number/service_name
For example:
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename

For SID

@//host_name:port_number:service_name
For example:
jdbc:oracle:thin:scott/tiger@//myhost:1521:myservicename

Oracle DOC

SQL to convert Comma Separated String to Rows

Convert Comma separated List to rows

SELECT TRIM( SUBSTR ( txt , INSTR (txt, ',', 1, level ) + 1 , INSTR (txt, ',', 1, level+1 ) - INSTR (txt, ',', 1, level) -1 ) ) AS u
FROM
  ( SELECT ','||'comma,separated,text'||',' AS txt FROM dual
  )
  CONNECT BY level <= LENGTH(txt)-LENGTH(REPLACE(txt,',',''))-1;

SQLResult

Source

SQL to Mongo Mapping Chart

This is a PHP-specific version of the » SQL to Mongo mapping chart in the main docs.

SQL Statement Mongo Query Language Statement
CREATE TABLE USERS (a Number, b Number) Implicit or use MongoDB::createCollection().
INSERT INTO USERS VALUES(1,1) $db->users->insert(array(“a” => 1, “b” => 1));
SELECT a,b FROM users $db->users->find(array(), array(“a” => 1, “b” => 1));
SELECT * FROM users WHERE age=33 $db->users->find(array(“age” => 33));
SELECT a,b FROM users WHERE age=33 $db->users->find(array(“age” => 33), array(“a” => 1, “b” => 1));
SELECT a,b FROM users WHERE age=33 ORDER BY name $db->users->find(array(“age” => 33), array(“a” => 1, “b” => 1))->sort(array(“name” => 1));
SELECT * FROM users WHERE age>33 $db->users->find(array(“age” => array(‘$gt’ => 33)));
SELECT * FROM users WHERE age<33 $db->users->find(array(“age” => array(‘$lt’ => 33)));
SELECT * FROM users WHERE name LIKE “%Joe%” $db->users->find(array(“name” => new MongoRegex(“/Joe/”)));
SELECT * FROM users WHERE name LIKE “Joe%” $db->users->find(array(“name” => new MongoRegex(“/^Joe/”)));
SELECT * FROM users WHERE age>33 AND age<=40 $db->users->find(array(“age” => array(‘$gt’ => 33, ‘$lte’ => 40)));
SELECT * FROM users ORDER BY name DESC $db->users->find()->sort(array(“name” => -1));
CREATE INDEX myindexname ON users(name) $db->users->ensureIndex(array(“name” => 1));
CREATE INDEX myindexname ON users(name,ts DESC) $db->users->ensureIndex(array(“name” => 1, “ts” => -1));
SELECT * FROM users WHERE a=1 and b=’q’ $db->users->find(array(“a” => 1, “b” => “q”));
SELECT * FROM users LIMIT 20, 10 $db->users->find()->limit(10)->skip(20);
SELECT * FROM users WHERE a=1 or b=2 $db->users->find(array(‘$or’ => array(array(“a” => 1), array(“b” => 2))));
SELECT * FROM users LIMIT 1 $db->users->find()->limit(1);
EXPLAIN SELECT * FROM users WHERE z=3 $db->users->find(array(“z” => 3))->explain()
SELECT DISTINCT last_name FROM users $db->command(array(“distinct” => “users”, “key” => “last_name”));
SELECT COUNT(*y) FROM users $db->users->count();
SELECT COUNT(*y) FROM users where AGE > 30 $db->users->find(array(“age” => array(‘$gt’ => 30)))->count();
SELECT COUNT(AGE) from users $db->users->find(array(“age” => array(‘$exists’ => true)))->count();
UPDATE users SET a=1 WHERE b=’q’ $db->users->update(array(“b” => “q”), array(‘$set’ => array(“a” => 1)));
UPDATE users SET a=a+2 WHERE b=’q’ $db->users->update(array(“b” => “q”), array(‘$inc’ => array(“a” => 2)));
DELETE FROM users WHERE z=”abc” $db->users->remove(array(“z” => “abc”));

Source

How to fill a increasing values in a column of a table?

How to fill a increasing values in a column of a table?

Lets assume we have a table
Table name : test

Name Value
Yogesh 0
Yogesh 0
Yogesh 0
Yogesh 0
Yogesh 0
Suresh 0
Suresh 0

Requirement : to insert 1, 2, 3 …. corresponding to values Yogesh

Query to Update :

create sequence seq start with 1 increment by 1;
update test set Value=seq.nextval where Name='Yogesh';
commit;

Output :

Name Value
Yogesh 1
Yogesh 2
Yogesh 3
Yogesh 4
Yogesh 5
Suresh 0
Suresh 0