Best way to select random rows PostgreSQL

Given, you have a very large table with 500 Million rows, and you have to select some random 1000 rows out of the table and you want it to be fast.

Given the specifications:

  • You assumed to have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
  • Ideally no or few write operations.
  • Your ID column should have been indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
            , max(id)  AS max_id
            , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). You will get an estimate, available at almost no cost (detailed explanation here):

SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;

As long as ct isn’t much smaller than id_span, the query will outperform most other approaches.

WITH params AS (
    SELECT 1       AS min_id           -- minimum id <= current min id
         , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
    SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
    FROM   params p
          ,generate_series(1, 1100) g  -- 1000 + buffer
    GROUP  BY 1                        -- trim duplicates
    ) r
JOIN   big USING (id)
LIMIT  1000;                           -- trim surplus
  • Generate random numbers in the id space. You have “few gaps”, so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
  • Join the ids to the big table. This should be very fast with the index in place.
  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
    SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
    FROM   generate_series(1, 1100) g
    ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
SELECT *
FROM   random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps so we don’t find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached – or we have to start with a large enough buffer which defies the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT c.reltuples * _gaps
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN

   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   SELECT *
   FROM   random_pick
   LIMIT  _limit;
END
$func$  LANGUAGE plpgsql VOLATILE ROWS 1000;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

You could even make this generic to work for any table: Take the name of the PK column and the table as polymorphic type and use EXECUTE … But that’s beyond the scope of this post. See:

Possible alternative

IF your requirements allow identical sets for repeated calls (and we are talking about repeated calls) I would consider a materialized view. Execute above query once and write the result to a table. Users get a quasi random selection at lightening speed. Refresh your random pick at intervals or events of your choosing.

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

It’s very fast, but the result is not exactly random. The manual:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

And the number of rows returned can vary wildly. For our example, to get roughly 1000 rows, try:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

Bold emphasis mine.

Related:

Source

Convert Comma separated String to Rows in Oracle SQL

Many times we need to convert a comma separated list of terms in a single string and convert it rows in SQL query.

for example

 India, USA, Russia, Malaysia, Mexico

Needs to be converted to:

 Country
 India
 USA
 Russia
 Malaysia
 Mexico

The following SQL script can help in this:

just replace the required values with your string and your delimiter.

Apache Ignite: What is Ignite?

Apache Ignite(TM) In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.

apache-ignite

FEATURES

You can view Ignite as a collection of independent, well-integrated, in-memory components geared to improve performance and scalability of your application. Some of these components include:


Apache Ignite APIs

Apache Ignite has a reach set of APIs that are covered throughout the documentation. The APIs are implemented in a form of native libraries for such major languages and technologies as Java, .NET and C++ and by supporting a variety of protocols like REST, Memcached or Redis.

The documentation that is located under this domain is mostly related to Java. Refer to the following documentation sections and domains to learn more about alternative technologies and protocols you can use to connect to and work with an Apache Ignite cluster:

Fork It on GIT

Apache Commons DbUtils Mini Wrapper

This is a very small DB Connector code in Java as a wrapper class to Apache DBUtils.

The Commons DbUtils library is a small set of classes designed to make working with JDBC easier. JDBC resource cleanup code is mundane, error prone work so these classes abstract out all of the cleanup tasks from your code leaving you with what you really wanted to do with JDBC in the first place: query and update data.

Some of the advantages of using DbUtils are:

  • No possibility for resource leaks. Correct JDBC coding isn’t difficult but it is time-consuming and tedious. This often leads to connection leaks that may be difficult to track down.
  • Cleaner, clearer persistence code. The amount of code needed to persist data in a database is drastically reduced. The remaining code clearly expresses your intention without being cluttered with resource cleanup.
  • Automatically populate Java Bean properties from Result Sets. You don’t need to manually copy column values into bean instances by calling setter methods. Each row of the Result Set can be represented by one fully populated bean instance.

DbUtils is designed to be:

  • Small – you should be able to understand the whole package in a short amount of time.
  • Transparent – DbUtils doesn’t do any magic behind the scenes. You give it a query, it executes it and cleans up for you.
  • Fast – You don’t need to create a million temporary objects to work with DbUtils.

DbUtils is not:

  • An Object/Relational bridge – there are plenty of good O/R tools already. DbUtils is for developers looking to use JDBC without all the mundane pieces.
  • A Data Access Object (DAO) framework – DbUtils can be used to build a DAO framework though.
  • An object oriented abstraction of general database objects like a Table, Column, or Primary Key.
  • A heavyweight framework of any kind – the goal here is to be a straightforward and easy to use JDBC helper library.

Wrapper:

System Design Interview Prep Material

System design is a very broad topic. Even a software engineer with many years of working experience at top IT company may not be an expert on system design. If you want to become an expert, you need to read many books, articles, and solve real large scale system design problems. This repository only teaches you to handle the system design interview with a systematic approach in a short time. You can dive into each topic if you have time. Of course, welcome to add your thoughts!

Table of Contents

System Design Interview Tips:

  • Clarify the constraints and identify the user cases Spend a few minutes questioning the interviewer and agreeing on the scope of the system. Remember to make sure you know all the requirements the interviewer didn’t tell your about in the beginning. User cases indicate the main functions of the system, and constraints list the scale of the system such as requests per second, requests types, data written per second, data read per second.
  • High-level architecture design Sketch the important components and the connections between them, but don’t go into some details. Usually, a scalable system includes web server (load balancer), service (service partition), database (master/slave database cluster plug cache).
  • Component design For each component, you need to write the specific APIs for each component. You may need to finish the detailed OOD design for a particular function. You may also need to design the database schema for the database.

Basic Knowledge about System Design:

Here are some articles about system design related topics.

Of course, if you want to dive into system related topics, here is a good collection of reading list about services-engineering, and a good collection of material about distributed systems.

Company Engineering Blogs:

If you are going to have an onsite with a company, you should read their engineering blog.

Products and Systems:

The following papers/articles/slides can help you to understand the general design idea of different real products and systems.

Hot Questions and Reference:

There are some good references for each question. The references here are slides and articles.
Design a CDN network Reference:

Design a Google document system Reference:

Design a random ID generation system Reference:

Design a key-value database Reference:

Design the Facebook news feed function Reference:

Design the Facebook timeline function Reference:

Design a function to return the top k requests during past time interval Reference:

Design an online multiplayer card game Reference:

Design a graph search function Reference:

Design a picture sharing system Reference:

Design a search engine Reference:

Design a recommendition system Reference:

Design a tinyurl system Reference:

Design a garbage collection system Reference:

Design a scalable web crawling system Reference:

Design the Facebook chat function Reference:

Design a trending topic system Reference:

Design a cache system Reference:

Good Books:

Object Oriented Design:

Tips for OOD Interview

Clarify the scenario, write out user cases Use case is a description of sequences of events that, taken together, lead to a system doing something useful. Who is going to use it and how they are going to use it. The system may be very simple or very complicated. Special system requirements such as multi-threading, read or write oriented.
Define objects Map identity to class: one scenario for one class, each core object in this scenario for one class. Consider the relationships among classes: certain class must have unique instance, one object has many other objects (composition), one object is another object (inheritance). Identify attributes for each class: change noun to variable and action to methods. Use design patterns such that it can be reused in multiple applications.

Useful Websites

Original Source

How to find nth highest integer value from a table?

For a Sample table

name value
a 1
b 3
c 5
d 2
e 0
f 7

If we want to find the nth highest integer value, then the SQL would be:

  1. SELECT MIN(value) FROM (SELECT * FROM table ORDER BY value DESC)
  2. WHERE ROWNUM <=2

Reset Sequence in Oracle SQL back to 0

Some times we need to reset the sequence values in the database back to 0.
Here is a small procedure to reset any sequence.

create or replace
procedure reset_sequence(p_seq in varchar2)
is
l_value number;
begin
-- Select the next value of the sequence

execute immediate
'select ' || p_seq ||
'.nextval from dual' INTO l_value;

-- Set a negative increment for the sequence,
-- with value = the current value of the sequence

execute immediate
'alter sequence ' || p_seq ||
' increment by -' || l_value || ' minvalue 0';

-- Select once from the sequence, to
-- take its current value back to 0

execute immediate
'select ' || p_seq ||
'.nextval from dual' INTO l_value;

-- Set the increment back to 1

execute immediate
'alter sequence ' || p_seq ||
' increment by 1 minvalue 0';
end;
/

Whatever sequence you wish to reset, call the procedure as:

begin
    reset_sequence('SEQ_1');
    reset_sequence('SEQ_2');
    reset_sequence('SEQ_3');
    reset_sequence('SEQ_4');
end;
/

How to pass Service Name or SID in JDBC URL

Below is the way to send either Service Name or SID in JDBC URL to connect to Oracle SQL

For Service Name

@//host_name:port_number/service_name
For example:
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename

For SID

@//host_name:port_number:service_name
For example:
jdbc:oracle:thin:scott/tiger@//myhost:1521:myservicename

Oracle DOC

When and how much to mix technologies for a project?

The main idea behind using a technology is to harness the power of code re-usability and libraries that have already been worked on and trusted to be working and functional with minimal or no issues.

The term “Technology” does not just refer to Java or C++ or JavaScript. It also refers to the frameworks that can be used to suffice some or the other requirements with minimum cost and minimal effort investment. One example can be any Application Development Framework Like Oracle ADF, Pega or Drupal, etc. The framework itself provides functionalities that are most common and also takes cares of the issues that people face while developing an application.

Any application which has to be developed from scratch needs a lot of work just to be useful enough to be used in Production environment. Basic Non-functional requirements such as MVC Architecture, Security safeguards, Application Performance, etc. are the basic and most common features of the application. In addition, Secure Login, User Registration and authentication, role mapping and similar features are also mandatory features of the application.

Most frameworks provide features such as configurable work flows, built in UI Elements, security aspects and many other features. Features such as SQL Escaping, HTML Tag Escaping and MVC Pattern architecture are generally already inbuilt to the framework. The developers can simply configure the workflows, create screens and corner out application logic and then the application is good to go.

But is that enough that we may need??

This the main question that we need to ask when trying to select technologies and try to mix them to achieve the application requirement. The Case study can be a sample B-2-B Application say Business-2-Go (B2G).

Advantage of Techs to be selected:

The B2G can be based on 3 major technology products namely an Identity Management System, an Application Development Framework, and a Content Management Service. The technologies in combination provide the architecture to harness the features out of the box.

The IDM would provide Session Maintenance, Role Mappings, Access Authentications and invalid access handling.

CMS on the other hand takes care of the static content of the application. This technology handles the application content which needs to be configurable but would be changed in very rare scenarios. The main usefulness lies in the fact that the CMS portal can be exposed to the customer admin team as well and be comfortable as the code base will not be touched by the Non Development User.

The ADF constitutes to the application flow of the B2G. All user interactions and other business logic is handled by this technology. The flexibility of the framework helps create applications quickly and more efficiently than the older technologies/frameworks such as J2EE Servlets/JSP or struts and so on.

Further technologies like JavaScript, jQuery, CSS have been used to achieve the look and feel that was decided by the client.

Disadvantages of Techs to be selected:

The disadvantages or rather limitations of the tech or the team using the tech is the one that decides whether the tech can be used or not.

The major factors can be:

  • Cost: The technology that has been perfected and supported may be in most cases be licensed. Thus incurring costs to the budget of the project. The alternative may be using Open Source technologies/frameworks. But there the problem exists of the credibility of the source, issue support and whether documentation is enough or not.
  • Resource Availability: Assuming that the cost barrier of the license has been overcome. Now the major concern is whether the resources for the technology are available or not. If not, then can existing resources be trained or can new resources can be acquired. Again the cost factor is affected in this concern.
  • Technology Limitations: Technologies have limitations in themselves also. The limitation may not be a feature that cannot be achieved, but the effort that may be involved in achieving the feature. A simple example may be a particular look and feel of a B2G. Many of the UI Elements may or may not be achievable with the selected core ADF. Or even if they are achievable, it is after a lot of R&D or with lot of hit and trials. Though this is not something that may rule out the technology itself, but may be enough to include other technologies like jQuery into the picture.
  • Interfacing efforts: While mixing technologies, spots/hooks need to be found where 1 tech may latch on to or may be placed in with the other. For example, jQuery is an Independent Tech and in the selected ADF, generally has an internal client side scripts are functioning on its own end. There may be provisions which execute scripts that achieve the UI functionalities. Similarly the CMS may not have stable out of the box connectors to code layer of the application. Thus interfaces are written to implement make this possible. This effort may also turn out to be a concern for using a tech.

An example of tech selection may be PHP. The technology PHP can also be used for creating a complete application and 99% of the same application can be achieved using PHP frameworks. In fact the cost of the tech is 0 (Open Source) and resource costing would be way lower than those of licensed application framework. But the efforts and timescale needed to achieve all the functionalities required will be humongous. Thus ruling out the tech.

Another concern that can be raised is how much the technologies can be mixed. Surely each of the frameworks will be providing some or the other comfort or a feature. Even if they are published in open source or you may have license available. Does that mean that all technologies should be mixed..??

Interfacing technologies uses effort. It also invokes limitations. An example may be the various attempts to integrate popular front-end framework like AngularJS with Oracle ADF. Oracle ADF is a mainly a Server Side Technology, maintaining all functionalities server side and providing a wide palette of features for an application. AngularJS on the other hand is a completely UI framework. It is completely Client Side Intensive Tech. Both frameworks are completely in the opposite directions. Both are unaware of the other. There are blogs showing way how to integrate both the technologies. But all can point out the issues in the integration. This is a small example, but scaling this, similar issues may be faced and thus may be counted as factors in Tech Selection.

In Conclusion, the trade-offs govern the selections of the technologies to be used in a project. Proper selections must be made in order to plan out the architecture. Improper selection may lead to issues, crashes, late deliveries or redundant costs.

References:

Image: https://www.systrends.com/sites/default/files/banner/appdev_banner2.jpg