#StackBounty: #sql #sql-server #namespaces Simplify SQL for SQL Server (how to reuse intermediate results?)

Bounty: 50

I have a big SQL query in which I compute dates and counts (from other tables), and then I have to compute new dates based on conditions on those pre-computed dates and counts.

In the following example, I compute comp_nactive and comp_date_last_completed, and I use those to compute comp_date_next_todo.

SELECT
    pms_id,
    (
        SELECT
            COUNT(DISTINCT date_assigned)
        FROM wrhwr
        WHERE pms_id = outer_pms.pms_id
              AND date_completed IS NULL
    ) AS comp_nactive,
    (
        SELECT
            CONVERT( DATE, MAX(date_completed))
        FROM wrhwr
        WHERE pms_id = outer_pms.pms_id
    ) AS comp_date_last_completed,
    CONVERT( DATE, date_first_todo) AS date_first_todo,
    CASE
        -- dateLastCompleted == null
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'd'    AND DATEADD(d, interval, date_first_todo)    > GETDATE() THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'd'    AND DATEADD(d, interval, date_first_todo)    <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) = 0 THEN CONVERT(DATE, GETDATE())
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'd'    AND DATEADD(d, interval, date_first_todo)    <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) > 0 THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'ww'   AND DATEADD(ww, interval, date_first_todo)   > GETDATE() THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'ww'   AND DATEADD(ww, interval, date_first_todo)   <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) = 0 THEN CONVERT(DATE, GETDATE())
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'ww'   AND DATEADD(ww, interval, date_first_todo)   <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) > 0 THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'm'    AND DATEADD(m, interval, date_first_todo)    > GETDATE() THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'm'    AND DATEADD(m, interval, date_first_todo)    <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) = 0 THEN CONVERT(DATE, GETDATE())
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'm'    AND DATEADD(m, interval, date_first_todo)    <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) > 0 THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'q'    AND DATEADD(q, interval, date_first_todo)    > GETDATE() THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'q'    AND DATEADD(q, interval, date_first_todo)    <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) = 0 THEN CONVERT(DATE, GETDATE())
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'q'    AND DATEADD(q, interval, date_first_todo)    <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) > 0 THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'yyyy' AND DATEADD(yyyy, interval, date_first_todo) > GETDATE() THEN CONVERT(DATE, date_first_todo)
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'yyyy' AND DATEADD(yyyy, interval, date_first_todo) <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) = 0 THEN CONVERT(DATE, GETDATE())
        WHEN (SELECT CONVERT( DATE, MAX(date_completed)) FROM wrhwr WHERE pms_id = outer_pms.pms_id) IS NULL AND interval_type = 'yyyy' AND DATEADD(yyyy, interval, date_first_todo) <= GETDATE() AND (SELECT COUNT(DISTINCT date_assigned) FROM wrhwr WHERE pms_id = outer_pms.pms_id AND date_completed IS NULL) > 0 THEN CONVERT(DATE, date_first_todo)
    END AS comp_date_next_todo
FROM pms outer_pms

The only solution I found so far was copy/pasting the code, as I can’t use comp_nactive (for example) in the rest of the query. Though it works, it’s quite ugly and very difficult to maintain.

I guess it’s possible to be cleaner and smarter. Any hint?

I want to avoid functions as much as possible, as I don’t always have the authorization to create them. The code should, if possible, work on both SQL Server and Oracle, as I need it for both DB flavors.
Small dataset:

CREATE TABLE pms
([pms_id] varchar(9), [date_first_todo] datetime, [interval] int, [interval_type] varchar(4));

INSERT INTO pms ([pms_id], [date_first_todo], [interval], [interval_type])
VALUES
('CHECK-1M', '2017-01-05 01:00:00', 1, 'm'),
('CHANGE-1Y', '2017-02-06 01:00:00', 1, 'yyyy');

CREATE TABLE wrhwr
([pms_id] varchar(8), [date_assigned] datetime, [date_completed] datetime);

INSERT INTO wrhwr ([pms_id], [date_assigned], [date_completed])
VALUES
('CHECK-1M', '2017-01-05 01:00:00', '2017-01-07 01:00:00'),
('CHECK-1M', '2017-02-05 01:00:00', '2017-02-13 01:00:00'),
('CHECK-1M', '2017-03-05 01:00:00', NULL);

Expected output:

pms_id      comp_nactive  comp_date_last_completed  date_first_todo  comp_date_next_todo
CHECK-1M    1             2017-02-13                2017-01-05       NULL
CHANGE-1Y   0             NULL                      2017-02-06       2017-02-06
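
One cleaner shape, as a sketch (SQL Server syntax, only hand-checked against the sample data above): pre-aggregate wrhwr once per pms_id in a derived table (a plain inline view, so no function privileges are needed), and compute the interval-shifted date once in a CROSS APPLY; only a short CASE remains. CROSS APPLY also exists in Oracle 12c; older Oracle would need the date CASE inlined and ADD_MONTHS/SYSDATE/TRUNC in place of DATEADD/GETDATE/CONVERT.

SELECT p.pms_id,
       COALESCE(a.comp_nactive, 0)      AS comp_nactive,
       a.comp_date_last_completed,
       CONVERT(DATE, p.date_first_todo) AS date_first_todo,
       CASE
           WHEN a.comp_date_last_completed IS NOT NULL THEN NULL  -- as in the original
           WHEN d.due_date IS NULL THEN NULL                      -- unknown interval_type
           WHEN d.due_date > GETDATE() THEN CONVERT(DATE, p.date_first_todo)
           WHEN COALESCE(a.comp_nactive, 0) = 0 THEN CONVERT(DATE, GETDATE())
           ELSE CONVERT(DATE, p.date_first_todo)
       END AS comp_date_next_todo
FROM pms p
LEFT JOIN (  -- every per-pms aggregate, computed once
    SELECT pms_id,
           COUNT(DISTINCT CASE WHEN date_completed IS NULL
                               THEN date_assigned END) AS comp_nactive,
           CONVERT(DATE, MAX(date_completed))          AS comp_date_last_completed
    FROM wrhwr
    GROUP BY pms_id
) a ON a.pms_id = p.pms_id
CROSS APPLY (  -- the interval-shifted date, computed once per row
    SELECT CASE p.interval_type
               WHEN 'd'    THEN DATEADD(d,    p.interval, p.date_first_todo)
               WHEN 'ww'   THEN DATEADD(ww,   p.interval, p.date_first_todo)
               WHEN 'm'    THEN DATEADD(m,    p.interval, p.date_first_todo)
               WHEN 'q'    THEN DATEADD(q,    p.interval, p.date_first_todo)
               WHEN 'yyyy' THEN DATEADD(yyyy, p.interval, p.date_first_todo)
           END AS due_date
) d;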


Get this bounty!!!

#StackBounty: #csv #postgresql #r #sql How to select on CSV files like SQL in R?

Bounty: 50

I know the thread How can I inner join two csv files in R, which suggests merge, but that is not what I want.
I have two CSV data files and am wondering how to query them like SQL with R.
I really like PostgreSQL, so R tools with similar syntax would work great here.
Two CSV files, linked by the data_id column (the primary key of log.csv).

data.csv, where it is OK to have IDs not found in log.csv (e.g. 4)

data_id, event_value
1, 777
1, 666
2, 111
4, 123 
3, 324
1, 245

log.csv, where the data_id column has no duplicates, but duplicates can occur in the name column

data_id, name
1, leo
2, leopold
3, lorem

Pseudocode of what I want, in partial PostgreSQL terms:

  1. Let data_id=1
  2. Show name and event_value from data.csv and log.csv, respectively

As a partial PostgreSQL-style SELECT:

SELECT name, event_value 
    FROM data, log
    WHERE data_id=1;

Expected output

leo, 777
leo, 666 
leo, 245

R approach

file1 <- read.csv("data.csv", header = TRUE, strip.white = TRUE,
                  col.names = c("data_id", "event_value"))
file2 <- read.csv("log.csv",  header = TRUE, strip.white = TRUE,
                  col.names = c("data_id", "name"))

# TODO here something like the SQL query 
# http://stackoverflow.com/a/1307824/54964

Possible approaches, of which I think sqldf can be sufficient here (a minimal sqldf sketch follows the list)

  1. sqldf
  2. data.table
  3. dplyr
  4. PostgreSQL database
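
A minimal sqldf sketch, with file1 and file2 loaded as above (sqldf runs the query through SQLite by default, and can also be pointed at a PostgreSQL backend):

library(sqldf)

# The pseudocode SELECT, with the join condition made explicit
sqldf("SELECT l.name, d.event_value
       FROM   file1 d
       JOIN   file2 l ON l.data_id = d.data_id
       WHERE  d.data_id = 1")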

PostgreSQL thoughts

Schema

DROP TABLE IF EXISTS data, log;
CREATE TABLE data (
        data_id INTEGER NOT NULL,      -- repeats allowed, may be absent from log
        event_value INTEGER NOT NULL
);
CREATE TABLE log (
        data_id SERIAL PRIMARY KEY,    -- no duplicates
        name TEXT NOT NULL             -- names are text and may repeat
);

R: 3.3.3
OS: Debian 8.7


Get this bounty!!!

#StackBounty: #mysql #sql Breaking ties in tournament (order by subquery)

Bounty: 100

In a hockey results database I’ve got a table that stores statistics for each team in each game. For each match there are (usually) two entries – one for each team. The entries store team_id, match_id and the performance of the team in that match – points earned, goals scored and many other stats that are not relevant here.

Here is the approximate query to build the tournament table:

SELECT  team_id, 
        COUNT(match_id) AS total_matches,
        SUM(goals_for) AS total_goals_for, 
        SUM(points) AS total_points
FROM    match_teams
WHERE   match_id IN ([match IDs here])
GROUP BY team_id
ORDER BY total_points DESC;

I select all entries for matches that belong to a tournament, sum the points for each team, and sort by the totals.

The problem is that the tie breaking rules are quite complex. Let’s consider the rule “if multiple teams are tied, order by points earned in games between those teams”.

So, if total_points is the same for teams with team_id IN (16,25,36), I should order them by something like this:

SELECT  teamA.team_id,
        SUM(teamA.points) AS total_points_inbetween
FROM    match_teams as teamA 
    JOIN match_teams as teamB
    ON teamA.match_id = teamB.match_id 
        AND teamA.team_id <> teamB.team_id
WHERE   teamA.match_id IN ([match IDs here])
    AND teamA.team_id IN (16,25,36)
    AND teamB.team_id IN (16,25,36)
GROUP BY teamA.team_id
ORDER BY total_points_inbetween DESC;

How do I include such a tie-breaker in the ORDER BY clause of the first query? I might also want another rule after that, like ORDER BY total_points DESC, [complicated_rule_1], total_goals_for DESC, [complicated_rule_2]

Example

The following matches are played:

Match 10: team100 vs team200  2-1
Match 12: team100 vs team300  3-0
Match 15: team100 vs team400  1-2
Match 61: team100 vs team500  2-0
Match 62: team200 vs team300  5-1
Match 63: team200 vs team400  2-1
Match 66: team200 vs team500  0-3
Match 70: team300 vs team400  4-0
Match 73: team300 vs team500  5-1
Match 77: team400 vs team500  2-1

The following entries in match_teams represent the results:

match_id    team_id     goals_for   points
10          100         2           3
10          200         1           0
12          100         3           3
12          300         0           0
15          100         1           0
15          400         2           3
61          100         2           3
61          500         0           0
62          200         5           3
62          300         1           0
63          200         2           3
63          400         1           0
66          200         0           0
66          500         3           3
70          300         4           3
70          400         0           0
73          300         5           3
73          500         1           0
77          400         2           3
77          500         1           0

If we now count the total points (first query) for each of the teams, here are the results:

team_id     total_points
100         9
200         6
300         6
400         6
500         3

The middle 3 teams all have the same amount of points therefore the tie has to be broken by the games between them. Here are those:

Match 62: team200 vs team300  5-1
Match 63: team200 vs team400  2-1
Match 70: team300 vs team400  4-0

And these are the corresponding entries in the database that should be taken into account for breaking tie:

match_id    team_id     goals_for   points
62          200         5           3
62          300         1           0
63          200         2           3
63          400         1           0
70          300         4           3
70          400         0           0

In games between these teams the team200 got two wins (6 points), team300 won one match (3 points) and team400 won nothing. So they should be ordered using these points and team200 > team300 > team400.
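
One way to fold this head-to-head rule into a single query, as a sketch (only hand-checked against the example above; sub-ties inside a head-to-head group, i.e. applying the rule recursively, are not handled): compute the totals in a derived table, join each team to the other teams with the same total, and sum the points from matches against exactly those teams.

SELECT t.team_id,
       t.total_matches,
       t.total_goals_for,
       t.total_points,
       COALESCE(SUM(CASE WHEN b.team_id IS NOT NULL THEN a.points END), 0)
           AS total_points_inbetween
FROM (
    SELECT team_id,
           COUNT(match_id) AS total_matches,
           SUM(goals_for)  AS total_goals_for,
           SUM(points)     AS total_points
    FROM   match_teams
    WHERE  match_id IN ([match IDs here])
    GROUP BY team_id
) t
LEFT JOIN match_teams a                 -- this team's entries in the tournament
       ON  a.team_id = t.team_id
       AND a.match_id IN ([match IDs here])
LEFT JOIN (
    SELECT team_id, SUM(points) AS total_points
    FROM   match_teams
    WHERE  match_id IN ([match IDs here])
    GROUP BY team_id
) tied ON  tied.total_points = t.total_points  -- rivals on the same total
       AND tied.team_id <> t.team_id
LEFT JOIN match_teams b                 -- the rival's entry in the same match
       ON  b.match_id = a.match_id
       AND b.team_id  = tied.team_id
GROUP BY t.team_id, t.total_matches, t.total_goals_for, t.total_points
ORDER BY t.total_points DESC, total_points_inbetween DESC, t.total_goals_for DESC;

Further rules then slot in as additional ORDER BY terms.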


Get this bounty!!!

#StackBounty: #mysql #sql Add new columns in one table based on new entries in another table in mysql

Bounty: 50

I have 2 tables: sessions and assignments. The assignments table has a column called scriptname whose values are strings. The sessions table has one column per distinct scriptname value, plus the columns id, uid, timein and timeout. As I add new rows to assignments, I get new values in the scriptname column, which I want to add as new columns to sessions with a default value of 0. How do I do this?

What I currently do is drop the table and create a new one based on the scriptname column. The problem, of course, is that I lose all my data.

DROP TABLE sessions;
SET SESSION group_concat_max_len = 1000000;
SELECT
  CONCAT(
    'CREATE TABLE sessions (',
    GROUP_CONCAT(DISTINCT
      CONCAT(scriptname, ' BOOL DEFAULT 0')
      SEPARATOR ','),
    ');')
FROM
  assignments
INTO @sql;

PREPARE stmt FROM @sql;
EXECUTE stmt;

ALTER TABLE sessions
ADD COLUMN `timeout` timestamp not null FIRST,
ADD COLUMN `timein` timestamp not null DEFAULT CURRENT_TIMESTAMP FIRST,
ADD COLUMN `uid` VARCHAR(128) not null FIRST,
ADD COLUMN `id` INT(11) AUTO_INCREMENT PRIMARY KEY not null FIRST;

I hope somebody can help me out, as I’m really not an expert on SQL! Thanks in advance.
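
One possible direction, as an untested sketch (it assumes the table and column names above, and that the scriptname values are safe to use as identifiers): instead of recreating sessions, generate a single ALTER TABLE for just the scriptname values that sessions does not yet have as columns, by checking INFORMATION_SCHEMA. Existing data then survives.

SET SESSION group_concat_max_len = 1000000;

SELECT COALESCE(
    CONCAT('ALTER TABLE sessions ',
           GROUP_CONCAT(DISTINCT CONCAT('ADD COLUMN ', a.scriptname, ' BOOL DEFAULT 0')
                        SEPARATOR ', ')),
    'SELECT 1')                 -- harmless no-op when no columns are missing
FROM assignments a
WHERE a.scriptname NOT IN (
    SELECT column_name
    FROM information_schema.columns
    WHERE table_schema = DATABASE()
      AND table_name   = 'sessions')
INTO @sql;

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;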


Get this bounty!!!

#StackBounty: #sql #apache-spark #hive #hiveql SQL query Frequency Distribution matrix for product

Bounty: 50

I want to create a frequency distribution matrix.

1. Create a matrix. Is it possible to get this in separate columns?

  customer1       p1         p2      p3
  customer2       p2         p3
  customer3       p2         p3      p1
  customer4       p2         p1

2. Then I have to count which product pairs come together most often.

   For example:
    p2 and p3 come together 3 times
    p1 and p3 come together 2 times
    p1 and p2 come together 2 times

I want to recommend products to customers based on the frequency with which products come together.

 select customerId,product,count(*) from sales group by customerId,product

Can anyone please help me with a solution to this?
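
For the pair counts (part 2), a sketch that should run on Hive or Spark SQL, assuming the sales(customerId, product) table from the GROUP BY attempt above: self-join the distinct customer/product pairs and count customers per product pair.

SELECT a.product AS product_1,
       b.product AS product_2,
       COUNT(*)  AS pair_count         -- customers having both products
FROM  (SELECT DISTINCT customerId, product FROM sales) a
JOIN  (SELECT DISTINCT customerId, product FROM sales) b
      ON a.customerId = b.customerId
WHERE a.product < b.product            -- keep each unordered pair once
GROUP BY a.product, b.product
ORDER BY pair_count DESC;

For part 1, something like SELECT customerId, collect_set(product) FROM sales GROUP BY customerId gives the per-customer product list in Hive without hard-coding one column per product.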


Get this bounty!!!

#StackBounty: #sql #mysql List users, ordered by accuracy of soccer match predictions

Bounty: 100

I have a database filled with predictions of soccer matches. I need a solution to calculate the rankings from the database. There are 2 rankings: one for the entire season (playday=0) and one for each matchday (called playday in the code).

I have 3 tables:

  1. matches
  2. predictions
  3. predictions_points

To give you a better insight in the database, here’s some example data:

matches: contains soccer matches information.

+----------+--------------+---------------------+------------+-----------+----------------+--------------+-----------------+--------------+-----------------+
| match_id | match_status |   match_datetime    | match_info | league_id | league_playday | home_team_id | home_team_score | away_team_id | away_team_score |
+----------+--------------+---------------------+------------+-----------+----------------+--------------+-----------------+--------------+-----------------+
|        1 |            3 | 2016-07-29 20:30:00 |            |         1 |              1 |            1 |               0 |            2 |               2 |
|        2 |            3 | 2016-07-30 18:00:00 |            |         1 |              1 |            5 |               1 |            4 |               2 |
|        3 |            3 | 2016-07-30 20:00:00 |            |         1 |              1 |            3 |               1 |            6 |               0 |
|        4 |            3 | 2016-07-30 20:00:00 |            |         1 |              1 |            7 |               3 |            8 |               0 |
+----------+--------------+---------------------+------------+-----------+----------------+--------------+-----------------+--------------+-----------------+

predictions: contains users predictions and the amount of points received per prediction.

+---------------+----------+---------+-----------------+-----------------+--------------------+
| prediction_id | match_id | user_id | home_team_score | away_team_score | predictions_points |
+---------------+----------+---------+-----------------+-----------------+--------------------+
|             1 |        1 |       1 |               0 |               1 |                  1 |
|             2 |        2 |       1 |               1 |               2 |                  3 |
|             3 |        3 |       1 |               2 |               0 |                  1 |
|             4 |        4 |       1 |               2 |               0 |                  1 |
|             5 |        1 |       2 |               0 |               2 |                  3 |
|             6 |        2 |       2 |               1 |               2 |                  3 |
|             7 |        3 |       2 |               1 |               0 |                  3 |
|             8 |        4 |       2 |               0 |               0 |                  0 |
+---------------+----------+---------+-----------------+-----------------+--------------------+

predictions_points contains the points per playday (or entire season when playday = 0) and the ranking (which we can not use for the query).

+-----------+---------+-----------+----------------+---------------------+---------------+
| points_id | user_id | league_id | league_playday | league_user_ranking | points_amount |
+-----------+---------+-----------+----------------+---------------------+---------------+
|         1 |       1 |         1 |              0 |                   2 |            51 |
|         2 |       2 |         1 |              0 |                   1 |            59 |
|         3 |       1 |         1 |              1 |                   2 |             6 |
|         4 |       2 |         1 |              1 |                   1 |             9 |
+-----------+---------+-----------+----------------+---------------------+---------------+

If there is a draw (users with an equal amount of points), I want to order them by the number of predictions they had 100% correct (a prediction with the wrong score but the correct win/draw/loss outcome earns 1 point; a correct score earns at least 3 points).

(Please note that the league_user_ranking field in the predictions_points table gets updated based on the result set of this query, so we cannot use it in the query.)

The following query works, but I feel like there’s room for improvement:

     SELECT *, (
        SELECT COUNT(*) FROM predictions p
            INNER JOIN matches m
            ON m.match_id = p.match_id
        WHERE p.user_id=p_p.user_id 
        AND (m.league_playday=p_p.league_playday OR p_p.league_playday=0)
        AND p.prediction_points>=3
     ) AS correctpredictions_count
     FROM
     predictions_points p_p
     WHERE
     p_p.league_id=:league_id
     ORDER BY
     p_p.league_playday ASC, p_p.points_amount DESC, correctpredictions_count DESC

UPDATE/EDIT: I see that my question got bumped to the homepage. I am live-testing the code with 15 other soccer enthusiasts based on the results of the current Belgian soccer season. At the moment, this query takes about 10 seconds on a database with 3000 predictions (15 users, 8 matches per playday, 30 playdays) on a Raspberry Pi 3 running Raspbian Lite.

Expected result set:

    +-----------+---------+-----------+----------------+---------------------+---------------+--------------------------+
    | points_id | user_id | league_id | league_playday | league_user_ranking | points_amount | correctpredictions_count |
    +-----------+---------+-----------+----------------+---------------------+---------------+--------------------------+
    |         2 |       2 |         1 |              0 |                   1 |            59 |                        7 |
    |         1 |       1 |         1 |              0 |                   2 |            51 |                        6 |
    |         4 |       2 |         1 |              1 |                   1 |             9 |                        2 |
    |         3 |       1 |         1 |              1 |                   2 |             6 |                        1 |
    |         5 |       1 |         1 |              2 |                   1 |             7 |                        2 |
    |         6 |       2 |         1 |              2 |                   2 |             7 |                        1 |
    +-----------+---------+-----------+----------------+---------------------+---------------+--------------------------+
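
One direction for speeding this up, as an untested sketch (keeping the prediction_points spelling from the query above): replace the per-row correlated subquery with one aggregated join, so the correct-prediction counts are computed once per user and playday, plus once per user for the whole season, instead of once for every predictions_points row.

SELECT p_p.*,
       COALESCE(c.correct_count, 0) AS correctpredictions_count
FROM predictions_points p_p
LEFT JOIN (
    -- correct predictions per user and playday ...
    SELECT p.user_id, m.league_id, m.league_playday, COUNT(*) AS correct_count
    FROM   predictions p
    INNER JOIN matches m ON m.match_id = p.match_id
    WHERE  p.prediction_points >= 3
    GROUP BY p.user_id, m.league_id, m.league_playday
    UNION ALL
    -- ... and per user for the whole season (playday 0)
    SELECT p.user_id, m.league_id, 0, COUNT(*)
    FROM   predictions p
    INNER JOIN matches m ON m.match_id = p.match_id
    WHERE  p.prediction_points >= 3
    GROUP BY p.user_id, m.league_id
) c ON  c.user_id        = p_p.user_id
    AND c.league_id      = p_p.league_id
    AND c.league_playday = p_p.league_playday
WHERE p_p.league_id = :league_id
ORDER BY p_p.league_playday ASC, p_p.points_amount DESC, correctpredictions_count DESC;

On hardware like a Raspberry Pi, indexes on the join columns (predictions.match_id, matches.match_id) matter at least as much as the query shape.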


Get this bounty!!!

Best way to select random rows PostgreSQL

Say you have a very large table with 500 million rows, and you have to select 1000 random rows out of it, fast.

Given the specifications:

  • You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
  • Ideally no or few write operations.
  • Your ID column should have been indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). You can instead get an estimate at almost no cost from the system catalog:

SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;

As long as ct isn’t much smaller than id_span, the query will outperform most other approaches.

WITH params AS (
    SELECT 1       AS min_id           -- minimum id <= current min id
         , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
    SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
    FROM   params p
          ,generate_series(1, 1100) g  -- 1000 + buffer
    GROUP  BY 1                        -- trim duplicates
    ) r
JOIN   big USING (id)
LIMIT  1000;                           -- trim surplus
  • Generate random numbers in the id space. You have “few gaps”, so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
  • Join the ids to the big table. This should be very fast with the index in place.
  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
    SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
    FROM   generate_series(1, 1100) g
    ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
SELECT *
FROM   random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps so we don’t find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached – or we have to start with a large enough buffer which defies the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT c.reltuples * _gaps
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN

   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   SELECT *
   FROM   random_pick
   LIMIT  _limit;
END
$func$  LANGUAGE plpgsql VOLATILE ROWS 1000;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

You could even make this generic to work for any table: take the name of the PK column and the table as a polymorphic type and use EXECUTE. But that’s beyond the scope of this post.

Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view: execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
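
A sketch of that idea (PostgreSQL 9.3+ for materialized views; table and numbers as in the short version above):

CREATE MATERIALIZED VIEW random_pick_mv AS
SELECT *
FROM  (
    SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
    FROM   generate_series(1, 1100) g
    ) r
JOIN   big USING (id)
LIMIT  1000;

-- re-randomize at intervals or events of your choosing
REFRESH MATERIALIZED VIEW random_pick_mv;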

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

It’s very fast, but the result is not exactly random. The manual:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

And the number of rows returned can vary wildly. For our example, to get roughly 1000 rows, try:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.



Convert Comma separated String to Rows in Oracle SQL

Many times we need to take a comma-separated list of terms in a single string and convert it into rows in an SQL query.

For example, the string

 India, USA, Russia, Malaysia, Mexico

Needs to be converted to:

 Country
 India
 USA
 Russia
 Malaysia
 Mexico

The following SQL script can help with this; just replace the example string and the comma delimiter with your own values.
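
Since the original embedded script was not preserved in this copy, here is a typical form of the technique (Oracle 11g or later, for REGEXP_COUNT):

SELECT TRIM(REGEXP_SUBSTR('India, USA, Russia, Malaysia, Mexico',
                          '[^,]+', 1, LEVEL)) AS Country
FROM   dual
CONNECT BY LEVEL <=
       REGEXP_COUNT('India, USA, Russia, Malaysia, Mexico', ',') + 1;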

Apache Commons DbUtils Mini Wrapper

This is a very small DB connector in Java, written as a wrapper class around Apache DbUtils.

The Commons DbUtils library is a small set of classes designed to make working with JDBC easier. JDBC resource cleanup code is mundane, error prone work so these classes abstract out all of the cleanup tasks from your code leaving you with what you really wanted to do with JDBC in the first place: query and update data.

Some of the advantages of using DbUtils are:

  • No possibility for resource leaks. Correct JDBC coding isn’t difficult but it is time-consuming and tedious. This often leads to connection leaks that may be difficult to track down.
  • Cleaner, clearer persistence code. The amount of code needed to persist data in a database is drastically reduced. The remaining code clearly expresses your intention without being cluttered with resource cleanup.
  • Automatically populate Java Bean properties from Result Sets. You don’t need to manually copy column values into bean instances by calling setter methods. Each row of the Result Set can be represented by one fully populated bean instance.

DbUtils is designed to be:

  • Small – you should be able to understand the whole package in a short amount of time.
  • Transparent – DbUtils doesn’t do any magic behind the scenes. You give it a query, it executes it and cleans up for you.
  • Fast – You don’t need to create a million temporary objects to work with DbUtils.

DbUtils is not:

  • An Object/Relational bridge – there are plenty of good O/R tools already. DbUtils is for developers looking to use JDBC without all the mundane pieces.
  • A Data Access Object (DAO) framework – DbUtils can be used to build a DAO framework though.
  • An object oriented abstraction of general database objects like a Table, Column, or Primary Key.
  • A heavyweight framework of any kind – the goal here is to be a straightforward and easy to use JDBC helper library.

Wrapper:
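
The original wrapper code was not preserved in this copy; below is a hedged sketch of what such a mini wrapper typically looks like, built on DbUtils’ QueryRunner (the wrapper’s own class and method names are illustrative):

import java.sql.SQLException;
import java.util.List;
import java.util.Map;
import javax.sql.DataSource;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.handlers.MapListHandler;

public class Db {
    private final QueryRunner runner;

    public Db(DataSource dataSource) {
        // QueryRunner opens and closes connections from the DataSource for us
        this.runner = new QueryRunner(dataSource);
    }

    // Run a SELECT; each row comes back as a column-name -> value map.
    public List<Map<String, Object>> query(String sql, Object... params) throws SQLException {
        return runner.query(sql, new MapListHandler(), params);
    }

    // Run an INSERT/UPDATE/DELETE; returns the number of affected rows.
    public int update(String sql, Object... params) throws SQLException {
        return runner.update(sql, params);
    }
}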