#StackBounty: #linux #logs #postgresql #journal Tool which stores the journal of systemd in a PostgreSQL table

Bounty: 50

I am looking for a tool that stores the systemd journal (logs) in a PostgreSQL table.

Required features:

  • Open source.
  • No single log entry should get lost.
  • No duplicates should get created.
  • Efficient: N log entries should get inserted into the database table in one SQL statement.
  • The DB schema should be able to handle the logs of several hosts in one database table.
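A minimal schema sketch that could satisfy the multi-host and no-duplicates requirements (table and column names here are hypothetical, not from any existing tool; systemd exposes a unique __CURSOR per journal entry, which can back the uniqueness constraint):

```sql
-- Hypothetical sketch: one table holding the journals of many hosts.
-- The journal cursor is unique per entry on a given host, so a unique
-- constraint on (host, journal_cursor) prevents duplicates, and
-- ON CONFLICT DO NOTHING (Postgres 9.5+) makes batched inserts idempotent.
CREATE TABLE journal_entries (
    id             bigserial PRIMARY KEY,
    host           text        NOT NULL,
    journal_cursor text        NOT NULL,  -- systemd __CURSOR field
    ts             timestamptz NOT NULL,  -- __REALTIME_TIMESTAMP
    priority       int,
    unit           text,
    message        text,
    UNIQUE (host, journal_cursor)
);

-- N entries in one statement, silently skipping entries already stored:
INSERT INTO journal_entries (host, journal_cursor, ts, priority, unit, message)
VALUES ('host1', 's=abc;i=1', now(), 6, 'sshd.service', 'msg1')
     , ('host1', 's=abc;i=2', now(), 6, 'sshd.service', 'msg2')
ON CONFLICT (host, journal_cursor) DO NOTHING;
```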

Get this bounty!!!

#StackBounty: #postgresql #query-performance #postgresql-9.5 #gist-index Postgres LIKE query using a GiST index is just as slow as a fu…

Bounty: 100

What I have is a very simple database that stores paths, extensions and names of files from UNC shares. For testing, I inserted about 1.5 million rows. The query below uses a GiST index, as you can see, but it still takes 5 seconds to return. I would expect a few (like 100) milliseconds.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE 'somevalue' 

(screenshot: EXPLAIN output)

When using wildcards (%) in the pattern, it does not take that long, even though it uses a sequential scan (?!):

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE '%a%' 

(screenshot: EXPLAIN output)

I also have the same setup for the name (filename) column. A similar query on that one takes only half the time, even with wildcards:


(screenshot: EXPLAIN output)

What I have already tried cannot be summarized in a few words. Whatever I do, it gets slow starting at about 1 million rows. As basically nothing is ever deleted, vacuuming and reindexing of course do not help at all.
I cannot really use any type of search other than LIKE '%...%' with a GIN or GiST index, because I need to be able to find any character sequence in the columns of interest, not only “words of a specific human language”.

Is my expectation wrong that this should work in around 100 milliseconds, even with many million more rows?
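For what it's worth, the usual suggestion for arbitrary-substring LIKE '%...%' searches is a trigram index from the pg_trgm extension rather than a plain GiST index. A sketch only, reusing the table and column names from the question:

```sql
-- pg_trgm trigram index: lets LIKE '%...%' use an index
-- for arbitrary substrings, not just prefix matches.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX residentfiles_parentpath_trgm_idx
    ON residentfiles USING gin (parentpath gin_trgm_ops);

-- This can now be served by a bitmap index scan instead of a seq scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM residentfiles WHERE parentpath LIKE '%somevalue%';
```

Note that a pattern as unselective as '%a%' matches most rows, so no index can make it fast; trigram indexes pay off for reasonably selective patterns.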

Get this bounty!!!

#StackBounty: #postgresql #postgresql-replication Sync Local Postgres Database with Remote Database

Bounty: 50

My company is looking at creating a consolidated server that will serve as a backup for many remote servers, each holding data separate from the others, and provide an application where the aggregate data can be used for research purposes. I was going to follow the general advice that a front-facing application server should be placed in the DMZ, with the database server behind the DMZ on the local network.

Are there any database synchronization tools for Postgres that can do regularly scheduled syncs of the remote databases and the consolidated database server by going through the front facing application server?

I have looked at Bucardo, and it seems that both databases have to be able to “see” each other for it to work.

My current solution in mind is to have the remotes do a database dump, compress the file, send it (securely) to the front-facing server, and have it verify the data and source before restoring the dump on the database server. This sounds inefficient, so I am looking for some sort of program that could assist.

Any suggestions are a big help!

Get this bounty!!!

#StackBounty: #postgresql #backup #centos #pgpool Postgres 9.6 – Taking differential backups

Bounty: 100

I have a few CentOS boxes running Postgres 9.6. I am planning a central server which can make local copies of all DBs on the other servers. These servers are accessible via SSH over the Internet.

What is the best approach for taking differential backups? I am considering the following points:

  1. Central Backup Server may not be up 24/7.
  2. Internet Link may fail.
  3. Bandwidth Consumption on the DB server.

And I am considering these approaches:

  1. Rsync the Postgres data directory directly.
  2. Some kind of log archiving, to be read back by Postgres.
  3. An open-source solution such as pitrtool or pgpool2.

Which of the above is the most efficient, or even possible? Any other recommendations?
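Approach 2 is essentially Postgres's built-in continuous WAL archiving (the basis for point-in-time recovery). As a rough sketch under stated assumptions (the archive directory path is made up; adjust to your layout), the settings on a 9.6 DB server could be changed via ALTER SYSTEM:

```sql
-- Sketch: enable continuous WAL archiving on the DB server (Postgres 9.6).
-- Changing wal_level and archive_mode requires a server restart.
-- The /var/lib/pgsql/wal_archive path is an assumption.
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command =
  'test ! -f /var/lib/pgsql/wal_archive/%f && cp %p /var/lib/pgsql/wal_archive/%f';
```

This fits points 1 and 2 well: archived WAL segments simply accumulate on the DB server until the central server is up and the link works, and can then be fetched over SSH in batches.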

Get this bounty!!!

Best way to select random rows PostgreSQL

Given: you have a very large table with 500 million rows, you have to select 1000 random rows from it, and you want it to be fast.

Given the specifications:

  • You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
  • Ideally no or few write operations.
  • Your ID column should be indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Instead, you can retrieve an estimate at almost no cost:

SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;

As long as ct isn’t much smaller than id_span, the query will outperform most other approaches.

WITH params AS (
    SELECT 1       AS min_id           -- minimum id <= current min id
         , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
    SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
    FROM   params p
          ,generate_series(1, 1100) g  -- 1000 + buffer
    GROUP  BY 1                        -- trim duplicates
    ) r
JOIN   big USING (id)
LIMIT  1000;                           -- trim surplus
  • Generate random numbers in the id space. You have “few gaps”, so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
  • Join the ids to the big table. This should be very fast with the index in place.
  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
    SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
    FROM   generate_series(1, 1100) g
    ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
SELECT *
FROM   random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps so we don’t find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached – or we have to start with a large enough buffer, which defeats the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
   RETURNS SETOF big
   LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT (c.reltuples * _gaps)::int
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
      )
   SELECT *
   FROM   random_pick
   LIMIT  _limit;
END
$func$;


SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic types and use EXECUTE. But that’s beyond the scope of this post.

Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
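A sketch of that idea, reusing the f_random_sample() function from above (the view name is arbitrary; materialized views require Postgres 9.3+):

```sql
-- Materialize one random pick once; repeated reads are then plain,
-- fast table scans of the stored result.
CREATE MATERIALIZED VIEW mv_random_sample AS
SELECT * FROM f_random_sample();

-- Re-run the sampling at intervals or events of your choosing:
REFRESH MATERIALIZED VIEW mv_random_sample;
```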

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

It’s very fast, but the result is not exactly random. The manual:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

And the number of rows returned can vary wildly. For our example, to get roughly 1000 rows, try:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

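If the varying row count is a problem, the tsm_system_rows contrib module (shipped with Postgres 9.5+) samples an exact number of rows, with the same clustering caveat as SYSTEM since it also picks whole blocks:

```sql
-- tsm_system_rows: block sampling with an exact row count.
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

-- Returns exactly 1000 rows (unless the table has fewer):
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
```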