#StackBounty: #oracle #join #oracle-12c #view CREATE VIEW WITH LEFT JOIN – FILTER RIGHT DEPENDING ON LEFT

Bounty: 50

I am using Oracle 12.1.
I have the following tables, simplified for the question:
(Note that XXX stands for the same entity ID being repeated.)

Events

date_event | entity_id | used
-------------------------------
2020-09-01 | XXX       | ...
2020-08-15 | XXX       | ...
2020-07-01 | XXX       | ...
...        | ...       | ...

Contract

date_contract | entity_id | capacity
-------------------------------------
2020-08-25    | XXX       | 1000
2020-07-20    | XXX       | 1000
2020-06-22    | XXX       | 1000
...

Modifications

date_modification | entity_id | capacity | month_capacity
-------------------------------------------------------------
2020-08-10        | XXX       | 2000     | 500

I would like to create a view like this:

date_event | entity_id | date_situation | capacity | month_capacity | used
--------------------------------------------------------------------------
2020-09-01 | XXX       | 2020-08-25     | 1000     | NULL           | 200
2020-08-15 | XXX       | 2020-08-10     | 2000     | 500            | 200
2020-07-01 | XXX       | 2020-06-22     | 1000     | NULL           | 200

For one entity_id:

  • Events has one record per day
  • Contracts has roughly one record per month, but technically it could be at any time; there is no business rule requiring monthly records
  • Modifications has one record whenever an edit is made; that can be once a month, once a year, once a week, or never

The fields date_situation, capacity, and month_capacity come from either Contracts or Modifications, depending on the following conditions:

  • Take the most recent record in Contracts whose date_contract is on or before date_event (in the example, the event on 2020-09-01 picks the contract from 2020-08-25), provided there is no record in Modifications with a more recent date_modification in the same month as date_event. If there is such a record, the fields come from Modifications instead.
  • If there is neither a qualifying Contracts nor Modifications record, then we want the most recent between the most recent Contracts record and the most recent Modifications record. But it can be the most recent Modifications record only if it falls in the same month as date_event.

So, to sum it up: the data comes from the most recent Contracts record for the same entity_id, unless there is a more recent Modifications record in the same month as the Events record. And if so, we want the most recent Modifications record of that month.

I hope the explanation and the example view are clear enough. So far we have something of the form (SELECT) LEFT JOIN (SELECT) LEFT JOIN (SELECT), but it takes the most recent Contracts and Modifications records of the whole table. I need them relative to each date_event, and I can’t seem to "inject" that into the subqueries.

How would one proceed to represent such a view?
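For reference, a minimal sketch of one possible shape for such a view, using OUTER APPLY and the row-limiting clause (both available in 12.1). The view name is invented, and the same-month rule is inferred from the example rows:

CREATE OR REPLACE VIEW v_event_situation AS
SELECT e.date_event,
       e.entity_id,
       COALESCE(m.date_modification, c.date_contract) AS date_situation,
       COALESCE(m.capacity, c.capacity)               AS capacity,
       m.month_capacity,
       e.used
FROM events e
OUTER APPLY (
    -- most recent contract on or before the event
    SELECT co.date_contract, co.capacity
    FROM contract co
    WHERE co.entity_id = e.entity_id
      AND co.date_contract <= e.date_event
    ORDER BY co.date_contract DESC
    FETCH FIRST 1 ROW ONLY
) c
OUTER APPLY (
    -- most recent modification that is newer than that contract
    -- and falls in the same month as the event
    SELECT mo.date_modification, mo.capacity, mo.month_capacity
    FROM modifications mo
    WHERE mo.entity_id = e.entity_id
      AND mo.date_modification <= e.date_event
      AND (c.date_contract IS NULL OR mo.date_modification > c.date_contract)
      AND TRUNC(mo.date_modification, 'MM') = TRUNC(e.date_event, 'MM')
    ORDER BY mo.date_modification DESC
    FETCH FIRST 1 ROW ONLY
) m;

OUTER APPLY behaves like a lateral left join, so each subquery can reference e.date_event directly, which is exactly the "injection" the plain LEFT JOIN (SELECT) form does not allow.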



#StackBounty: #postgresql #performance #join #database-design #database-performance What is the best way to join the same table twice i…

Bounty: 50

With a second join on the same table, performance degrades by nearly half (the execution time almost doubles):

SELECT * FROM party_party_relationship AS ppr 
    LEFT JOIN party_role AS r1 ON r1.party_role_uid = ppr.party_role_uid
    LEFT JOIN party_role AS r2 ON r2.party_role_uid = ppr.party_role_uid_related

Performance with the first join:

"Hash Left Join  (cost=288.18..547.72 rows=10972 width=144) (actual time=5.281..17.781 rows=11192 loops=1)"
"  Hash Cond: (ppr.party_role_uid = r1.party_role_uid)"
"  ->  Seq Scan on party_party_relationship ppr  (cost=0.00..230.72 rows=10972 width=98) (actual time=0.020..2.438 rows=11192 loops=1)"
"  ->  Hash  (cost=181.97..181.97 rows=8497 width=46) (actual time=5.186..5.187 rows=9946 loops=1)"
"        Buckets: 16384  Batches: 1  Memory Usage: 823kB"
"        ->  Seq Scan on party_role r1  (cost=0.00..181.97 rows=8497 width=46) (actual time=0.010..2.073 rows=9946 loops=1)"
"Planning Time: 0.472 ms"
"Execution Time: 18.765 ms"

With two joins

With the second join on the same table, the execution time almost doubles:

"Hash Left Join  (cost=576.37..864.71 rows=10972 width=190) (actual time=9.871..31.986 rows=11192 loops=1)"
"  Hash Cond: (ppr.party_role_uid_related = r2.party_role_uid)"
"  ->  Hash Left Join  (cost=288.18..547.72 rows=10972 width=144) (actual time=5.163..18.437 rows=11192 loops=1)"
"        Hash Cond: (ppr.party_role_uid = r1.party_role_uid)"
"        ->  Seq Scan on party_party_relationship ppr  (cost=0.00..230.72 rows=10972 width=98) (actual time=0.015..2.735 rows=11192 loops=1)"
"        ->  Hash  (cost=181.97..181.97 rows=8497 width=46) (actual time=5.091..5.092 rows=9946 loops=1)"
"              Buckets: 16384  Batches: 1  Memory Usage: 823kB"
"              ->  Seq Scan on party_role r1  (cost=0.00..181.97 rows=8497 width=46) (actual time=0.008..2.030 rows=9946 loops=1)"
"  ->  Hash  (cost=181.97..181.97 rows=8497 width=46) (actual time=4.644..4.644 rows=9946 loops=1)"
"        Buckets: 16384  Batches: 1  Memory Usage: 823kB"
"        ->  Seq Scan on party_role r2  (cost=0.00..181.97 rows=8497 width=46) (actual time=0.014..1.810 rows=9946 loops=1)"
"Planning Time: 0.925 ms"
"Execution Time: 32.920 ms"

The full query

The above two-join query is just a part of the whole query:

SELECT * FROM party_party_relationship AS ppr 
    INNER JOIN party_role AS r1 ON r1.party_role_uid = ppr.party_role_uid
        INNER JOIN party AS p1 ON p1.party_uid = r1.party_uid
                LEFT JOIN party_name AS n1 ON n1.party_uid = p1.party_uid AND n1.end_date IS NULL
                LEFT JOIN business_number AS b1 ON b1.party_uid = p1.party_uid AND b1.business_number_cd = p1.business_number_cd AND b1.end_date IS NULL

    INNER JOIN party_role AS r2 ON r2.party_role_uid = ppr.party_role_uid_related
        INNER JOIN party AS p2 ON p2.party_uid = r2.party_uid
                LEFT JOIN party_name AS n2 ON n2.party_uid = p2.party_uid AND n2.end_date IS NULL
                LEFT JOIN business_number AS b2 ON b2.party_uid = p2.party_uid AND b2.business_number_cd = p2.business_number_cd AND b2.end_date IS NULL
                
                WHERE ppr.case_uid = 9

Execution Plan

"Nested Loop Left Join  (cost=1113.46..3576.37 rows=915 width=772) (actual time=19.687..76.911 rows=919 loops=1)"
"  ->  Nested Loop Left Join  (cost=1113.31..3270.33 rows=915 width=694) (actual time=19.616..56.253 rows=919 loops=1)"
"        Join Filter: (n1.end_date IS NULL)"
"        ->  Hash Left Join  (cost=1113.03..2415.51 rows=915 width=547) (actual time=19.588..51.236 rows=915 loops=1)"
"              Hash Cond: (r1.party_uid = p2.party_uid)"
"              ->  Hash Left Join  (cost=856.60..2156.68 rows=915 width=481) (actual time=15.192..45.391 rows=915 loops=1)"
"                    Hash Cond: (ppr.party_role_uid_related = r2.party_role_uid)"
"                    ->  Nested Loop Left Join  (cost=568.42..1866.09 rows=915 width=435) (actual time=9.743..38.415 rows=915 loops=1)"
"                          ->  Nested Loop Left Join  (cost=568.27..1560.05 rows=915 width=357) (actual time=9.665..17.956 rows=915 loops=1)"
"                                ->  Hash Left Join  (cost=567.99..705.23 rows=915 width=210) (actual time=9.639..12.460 rows=915 loops=1)"
"                                      Hash Cond: (r1.party_uid = p1.party_uid)"
"                                      ->  Hash Left Join  (cost=311.56..446.40 rows=915 width=144) (actual time=5.314..7.056 rows=915 loops=1)"
"                                            Hash Cond: (ppr.party_role_uid = r1.party_role_uid)"
"                                            ->  Bitmap Heap Scan on party_party_relationship ppr  (cost=23.38..155.81 rows=915 width=98) (actual time=0.111..0.536 rows=915 loops=1)"
"                                                  Recheck Cond: (insolvency_case_uid = 9)"
"                                                  Heap Blocks: exact=18"
"                                                  ->  Bitmap Index Scan on ixfk_party_party_relationship_insolvency_case  (cost=0.00..23.15 rows=915 width=0) (actual time=0.097..0.097 rows=926 loops=1)"
"                                                        Index Cond: (insolvency_case_uid = 9)"
"                                            ->  Hash  (cost=181.97..181.97 rows=8497 width=46) (actual time=5.149..5.149 rows=9960 loops=1)"
"                                                  Buckets: 16384  Batches: 1  Memory Usage: 824kB"
"                                                  ->  Seq Scan on party_role r1  (cost=0.00..181.97 rows=8497 width=46) (actual time=0.009..1.979 rows=9960 loops=1)"
"                                      ->  Hash  (cost=161.19..161.19 rows=7619 width=66) (actual time=4.290..4.290 rows=7449 loops=1)"
"                                            Buckets: 8192  Batches: 1  Memory Usage: 701kB"
"                                            ->  Seq Scan on party p1  (cost=0.00..161.19 rows=7619 width=66) (actual time=0.013..1.680 rows=7449 loops=1)"
"                                ->  Index Scan using ixfk_party_name_party on party_name n1  (cost=0.28..0.92 rows=1 width=147) (actual time=0.004..0.005 rows=1 loops=915)"
"                                      Index Cond: (party_uid = p1.party_uid)"
"                                      Filter: (end_date IS NULL)"
"                                      Rows Removed by Filter: 0"
"                          ->  Index Scan using ex_business_number_end_date on business_number b1  (cost=0.15..0.32 rows=1 width=78) (actual time=0.020..0.021 rows=1 loops=915)"
"                                Index Cond: ((party_uid = p1.party_uid) AND (business_number_cd = p1.business_number_cd))"
"                    ->  Hash  (cost=181.97..181.97 rows=8497 width=46) (actual time=5.293..5.293 rows=9960 loops=1)"
"                          Buckets: 16384  Batches: 1  Memory Usage: 824kB"
"                          ->  Seq Scan on party_role r2  (cost=0.00..181.97 rows=8497 width=46) (actual time=0.010..1.799 rows=9960 loops=1)"
"              ->  Hash  (cost=161.19..161.19 rows=7619 width=66) (actual time=4.313..4.314 rows=7449 loops=1)"
"                    Buckets: 8192  Batches: 1  Memory Usage: 701kB"
"                    ->  Seq Scan on party p2  (cost=0.00..161.19 rows=7619 width=66) (actual time=0.011..1.587 rows=7449 loops=1)"
"        ->  Index Scan using ixfk_party_name_party on party_name n2  (cost=0.28..0.92 rows=1 width=147) (actual time=0.003..0.003 rows=1 loops=915)"
"              Index Cond: (party_uid = p2.party_uid)"
"  ->  Index Scan using ex_business_number_end_date on business_number b2  (cost=0.15..0.32 rows=1 width=78) (actual time=0.020..0.020 rows=1 loops=919)"
"        Index Cond: ((party_uid = p2.party_uid) AND (business_number_cd = p2.business_number_cd))"
"Planning Time: 4.499 ms"
"Execution Time: 77.433 ms"

Plan in Graph

(Screenshot of part of the execution plan as a graph, omitted here.)

Is there any better way to do it? The table is expected to grow very fast.
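Not from the post, but one grounded observation: with ~10K rows on each side, the second hash join simply adds a second hash build and probe over party_role, so the near-doubling is expected rather than pathological. As the tables grow, a standard precaution is to index the join keys on the big table so the planner can switch to index-based plans for filtered queries like the case_uid = 9 one (party_role_uid is presumably already the primary key of party_role; the index names below are assumptions):

CREATE INDEX IF NOT EXISTS ix_ppr_party_role_uid
    ON party_party_relationship (party_role_uid);

CREATE INDEX IF NOT EXISTS ix_ppr_party_role_uid_related
    ON party_party_relationship (party_role_uid_related);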



#StackBounty: #java #performance #hibernate #join #hibernate-criteria How to get batching using the old hibernate criteria?

Bounty: 200

I’m still using the old org.hibernate.Criteria and am getting more and more confused about fetch modes. In various queries I need all of the following variants, so I can’t control this via annotations. I’m just switching everything to @ManyToOne(fetch=FetchType.LAZY), as otherwise there’s no chance to change anything in the query.

What I could find so far either concerns HQL or JPA2 or offers just two choices, but I need it for the old criteria and for (at least) the following three cases:

  • Do a JOIN, and fetch from both tables. This is OK unless the data is too redundant (e.g., the master data is big or repeated many times in the result). In SQL, I’d write
    SELECT * FROM item JOIN order on item.order_id = order.id
    WHERE ...;
  • Do a JOIN, fetch from the first table, and fetch separately from the other. This is usually the more efficient variant of the previous query. In SQL, I’d write
    SELECT item.* FROM item JOIN order on item.order_id = order.id
    WHERE ...;

    SELECT order.* FROM order WHERE ...;
  • Do a JOIN, but do not fetch from the joined table. This is useful, e.g., for sorting based on data in the other table. In SQL, I’d write
    SELECT item.* FROM item JOIN order on item.order_id = order.id
    WHERE ...
    ORDER BY order.name, item.name;

It looks like without explicitly specifying fetch=FetchType.LAZY, everything gets fetched eagerly as in the first case, which is sometimes too costly. I guess that using Criteria#setFetchMode I can get the third case. I haven’t tried it out yet, as I’m still missing the second case. I know that it’s somehow possible, as there’s the @BatchSize annotation.

  • Am I right with the above?
  • Is there a way how to get the second case with the old criteria?

Update

It looks like using createAlias() leads to fetching everything eagerly. There are some overloads that allow specifying the JoinType, but I’d need to specify the fetch type instead. Now I’m confused even more.
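For what it’s worth, a hedged sketch of the direction the question points at (the entity names Item and OrderEntity are invented, and the mappings in the comments are assumptions, not code from the post). FetchMode.SELECT asks the old Criteria API to load the association with a separate select instead of a join fetch, and a class-level @BatchSize on the target entity turns those separate selects into IN-list batches, which is essentially the second case:

import java.util.List;

import org.hibernate.Criteria;
import org.hibernate.FetchMode;
import org.hibernate.Session;
import org.hibernate.criterion.Order;
import org.hibernate.sql.JoinType;

public class CriteriaBatchFetchSketch {

    // Assumed mappings (not shown in the post):
    //   on Item:         @ManyToOne(fetch = FetchType.LAZY) private OrderEntity order;
    //   on OrderEntity:  @BatchSize(size = 50) at the class level, so the
    //                    lazy proxies are initialized in IN-list batches.

    @SuppressWarnings("unchecked")
    public static List<Item> itemsSortedByOrder(Session session) {
        Criteria criteria = session.createCriteria(Item.class, "item")
                // join only for filtering/sorting (the third case) ...
                .createAlias("item.order", "o", JoinType.INNER_JOIN)
                // ... and request a separate select instead of a join fetch
                .setFetchMode("o", FetchMode.SELECT)
                .addOrder(Order.asc("o.name"))
                .addOrder(Order.asc("item.name"));
        return criteria.list();
    }
}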



#StackBounty: #mysql #query-performance #join #mariadb Slow performance when joining a small table and filtering out on a non-key colum…

Bounty: 100

I am fairly new to MariaDB and I am struggling with one issue that I cannot get to the bottom of. This is the query:

SELECT SQL_NO_CACHE STRAIGHT_JOIN
    `c`.`Name` AS `CategoryName`,
    `c`.`UrlSlug` AS `CategorySlug`,
    `n`.`Description`,
    IF(`n`.`OriginalImageUrl` IS NOT NULL, `n`.`OriginalImageUrl`, `s`.`LogoUrl`) AS `ImageUrl`,
    `n`.`Link`,
    `n`.`PublishedOn`,
    `s`.`Name` AS `SourceName`,
    `s`.`Url` AS `SourceWebsite`,
    `s`.`UrlSlug` AS `SourceUrlSlug`,
    `n`.`Title`
FROM `NewsItems` AS `n`
INNER JOIN `NewsSources` AS `s` ON `n`.`NewsSourceId` = `s`.`Id`
LEFT JOIN `Categories` AS `c` ON `n`.`CategoryId` = `c`.`CategoryId`
WHERE `s`.`UrlSlug` = 'slug'
#WHERE s.Id = 52
ORDER BY `n`.`PublishedOn` DESC
LIMIT 50

NewsSources is a table with about 40 rows and NewsItems has about 1 million. Each news item belongs to one source, and one source can have many items. I’m trying to get all items for a source identified by the source’s URL slug.

  1. When I use STRAIGHT_JOIN and query for a source that has lots of news items, the query returns immediately.
    However, if I query for a source that has a low number of items (~100), or for a URL slug that doesn’t belong to any source (the result set is 0 rows), the query runs for 12 seconds.

  2. When I remove STRAIGHT_JOIN, I see the opposite of the first case – it runs really slowly when I query for a news source with many items, and returns immediately for sources with a low number of items or when the result set is 0 because the URL slug doesn’t belong to any news source.

  3. When I query by news source ID (the commented-out WHERE s.Id = 52), the result comes back immediately, regardless of whether there are lots of items for that source or none at all.

I want to point out again that the NewsSources table contains only about 40 rows.

Here are the analyzer results for the query above: Explain Analyzer

What can I do to make this query run fast in all cases?
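Not part of the question, but the usual first candidate for this access pattern (all rows of one source, newest first) is a composite index covering both the join key and the sort key, so the optimizer can read one source’s items in PublishedOn order straight from the index instead of scanning by PublishedOn and filtering:

ALTER TABLE `NewsItems`
  ADD INDEX `IX_NewsItems_NewsSourceId_PublishedOn` (`NewsSourceId`, `PublishedOn`);

With it in place, the single-column IX_NewsItems_NewsSourceId becomes a redundant left prefix and can likely be dropped.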

Here are tables and indexes definitions:

-- --------------------------------------------------------
-- Server version:               10.4.13-MariaDB-1:10.4.13+maria~bionic - mariadb.org binary distribution
-- Server OS:                    debian-linux-gnu
-- --------------------------------------------------------

-- Dumping structure for table Categories
CREATE TABLE IF NOT EXISTS `Categories` (
  `CategoryId` int(11) NOT NULL AUTO_INCREMENT,
  `Name` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
  `Description` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
  `UrlSlug` varchar(30) COLLATE utf8mb4_unicode_ci NOT NULL,
  `CreatedOn` datetime(6) NOT NULL,
  `ModifiedOn` datetime(6) NOT NULL,
  PRIMARY KEY (`CategoryId`)
) ENGINE=InnoDB AUTO_INCREMENT=16 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;


-- Dumping structure for table NewsItems
CREATE TABLE IF NOT EXISTS `NewsItems` (
  `Id` bigint(20) NOT NULL AUTO_INCREMENT,
  `NewsSourceId` int(11) NOT NULL,
  `Title` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `Link` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `Description` longtext COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `PublishedOn` datetime(6) NOT NULL,
  `GlobalId` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `CategoryId` int(11) DEFAULT NULL,
  PRIMARY KEY (`Id`),
  KEY `IX_NewsItems_CategoryId` (`CategoryId`),
  KEY `IX_NewsItems_NewsSourceId_GlobalId` (`NewsSourceId`,`GlobalId`),
  KEY `IX_NewsItems_PublishedOn` (`PublishedOn`),
  KEY `IX_NewsItems_NewsSourceId` (`NewsSourceId`),
  FULLTEXT KEY `Title` (`Title`,`Description`),
  CONSTRAINT `FK_NewsItems_Categories_CategoryId` FOREIGN KEY (`CategoryId`) REFERENCES `Categories` (`CategoryId`),
  CONSTRAINT `FK_NewsItems_NewsSources_NewsSourceId` FOREIGN KEY (`NewsSourceId`) REFERENCES `NewsSources` (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=649802 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;


-- Dumping structure for table NewsSources
CREATE TABLE IF NOT EXISTS `NewsSources` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `Name` varchar(500) COLLATE utf8mb4_unicode_ci NOT NULL,
  `Url` varchar(500) COLLATE utf8mb4_unicode_ci NOT NULL,
  `UrlSlug` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `LogoUrl` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=55 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;



#StackBounty: #sql #regex #scala #join #apache-spark Spark Scala: SQL rlike vs Custom UDF

Bounty: 50

I have a scenario where 10K+ regular expressions are stored in a table, along with various other columns, and these need to be joined against an incoming dataset. Initially I was using the Spark SQL rlike method as below, and it was able to handle the load as long as incoming record counts stayed below 50K.

PS: The regular expression reference data is a broadcasted dataset.

dataset.join(regexDataset.value, expr("input_column rlike regular_exp_column"))

Then I wrote a custom UDF to transform them using Scala’s native regex search, as below.

  1. The val below collects the reference data as an array of tuples:
val regexPreCalcArray: Array[(Int, Regex)] = {
        regexDataset.value
            .select( "col_1", "regex_column")
            .collect
            .map(row => (row.get(0).asInstanceOf[Int],row.get(1).toString.r))
    }

Implementation of the regex-matching UDF:

    def findMatchingPatterns(regexDSArray: Array[(Int,Regex)]): UserDefinedFunction = {
        udf((input_column: String) => {
            for {
                text <- Option(input_column)
                matches = regexDSArray.filter(regexDSValue => if (regexDSValue._2.findFirstIn(text).isEmpty) false else true)
                if matches.nonEmpty
            } yield matches.map(x => x._1).min
        }, IntegerType)
    }

The join is done as below: the UDF returns a unique ID from the reference data (the minimum one, in case of multiple regex matches), which is then joined against the reference data on that unique ID to retrieve the other columns needed for the result:

dataset.withColumn("min_unique_id", findMatchingPatterns(regexPreCalcArray)($"input_column"))
.join(regexDataset.value, $"min_unique_id" === $"unique_id" , "left")

But this too gets very slow, with skew in execution (one executor task runs for a very long time), when the record count increases above 1M. Spark suggests not to use UDFs as they degrade performance. Are there any other best practices I should apply here, or is there a better API for Scala regex matching than what I’ve written? Any suggestions to do this efficiently would be very helpful.
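As an aside (not from the post): since only the minimum matching ID is needed, sorting the precalculated array by ID and stopping at the first hit avoids testing all 10K+ patterns for rows that match early. A hedged sketch:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.util.matching.Regex

// Patterns sorted ascending by unique ID, so the first match is also the minimum.
def findMinMatchingId(sortedPatterns: Array[(Int, Regex)]): UserDefinedFunction =
  udf { (input: String) =>
    Option(input).flatMap { text =>
      sortedPatterns.collectFirst {
        case (id, re) if re.findFirstIn(text).isDefined => id
      }
    }
  }

Beyond that, the single long-running task suggests uneven input partitions, so a repartition before applying the UDF may help.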



#StackBounty: #mysql #join mysql – find duplicate sets of data grouped by foreign key

Bounty: 50

I have a large MySQL table (500,000 records). I need to find sets of data that have all the same attributes (KEY, NAME, and VALUE). Some sets have 20 attributes per KEY, others 397.
The table looks like this. KEY is a foreign key.

ID, KEY, NAME, VALUE
1   87   Color  Red
2   87   Size   Big
3   87   Weight 6

4   85   Color  Red
5   85   Size   Big
6   85   Weight 6

7   96   Color  Red
8   96   Size   Small
9   96   Weight 7

I’m trying to write a query where, given KEY = 87, it finds that KEY 85 matches all the attributes and values of KEY 87. There are 397 different attributes rather than just the 3 I show here. I can do this in PHP, but it’s sloppy, and I want to learn more about MySQL; I just can’t get my head around it. I appreciate any help in advance.
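Not from the post, but a common approach as a starting point: build a per-KEY fingerprint of all NAME=VALUE pairs with GROUP_CONCAT and join fingerprints against each other. The table name attributes is an assumption (the post doesn’t name the table), and group_concat_max_len must be raised, since 397 attributes easily exceed the 1024-byte default:

SET SESSION group_concat_max_len = 1000000;

SELECT t2.`KEY`
FROM (
    SELECT `KEY`,
           GROUP_CONCAT(CONCAT(NAME, '=', VALUE) ORDER BY NAME, VALUE SEPARATOR '|') AS fingerprint
    FROM attributes
    GROUP BY `KEY`
) AS t1
JOIN (
    SELECT `KEY`,
           GROUP_CONCAT(CONCAT(NAME, '=', VALUE) ORDER BY NAME, VALUE SEPARATOR '|') AS fingerprint
    FROM attributes
    GROUP BY `KEY`
) AS t2
  ON t2.fingerprint = t1.fingerprint
 AND t2.`KEY` <> t1.`KEY`
WHERE t1.`KEY` = 87;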



#StackBounty: #ruby #ruby-on-rails #join #active-record Update join table using list of checkboxes in Rails

Bounty: 50

I have Gig and Singer Active Record models (standard, no customization just yet) with a many-to-many relationship through a generic join table which holds nothing but the respective IDs of Gig and Singer. My form sends a given gig ID and all the singers who will be attending, denoted with checkboxes. I need the ability to check or uncheck singers. The following code works, but it does so by removing all the singers from a gig and re-adding them. This feels hacky… is there a better way? (I think this is all the code necessary, but let me know if you need me to add anything.)

class GigSingersController < ApplicationController

    def create
        gig = Gig.find(params[:gig_id])
        singer_ids = params[:singer_ids] # [1, 4, 5,]
        gig.singers = []
        singer_ids.each do |id|
            singer = Singer.find(id)
            gig.singers << singer
        end
        redirect_to gigs_path
    end
end

EDIT:

As requested in the comments, here are the schema and relevant models, although as I said, they are completely generic. Perhaps I didn’t do a good job of making my question clear: is the best way to create these relationships, when using checkboxes, to remove all existing ones and recreate them from the boxes currently checked, thereby removing any that the user unchecked on an edit?

ActiveRecord::Schema.define(version: 2019_07_19_195106) do

  create_table "gig_singers", force: :cascade do |t|
    t.integer "gig_id"
    t.integer "singer_id"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
  end

  create_table "gigs", force: :cascade do |t|
    t.string "name"
    t.text "notes"
    t.datetime "datetime"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
  end

  create_table "singers", force: :cascade do |t|
    t.string "name"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.boolean "active"
  end
end

class Gig < ApplicationRecord
    has_many :gig_singers
    has_many :singers, through: :gig_singers
end

class GigSinger < ApplicationRecord
    belongs_to :gig
    belongs_to :singer
end

class Singer < ApplicationRecord
    has_many :gig_singers
    has_many :gigs, through: :gig_singers

end
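For comparison, a minimal sketch of the diffing alternative (not from the post): the singer_ids= writer that has_many :through provides lets Active Record compute the difference itself, inserting join rows only for newly checked singers and deleting only the ones that were unchecked:

class GigSingersController < ApplicationController
  def create
    gig = Gig.find(params[:gig_id])
    # Array(...) also covers the case where no boxes are checked and the
    # :singer_ids key is absent from params.
    gig.singer_ids = Array(params[:singer_ids])
    redirect_to gigs_path
  end
end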



#StackBounty: #scala #join #activerecord #group-by Write join query with groupby in Scala ActiveRecord

Bounty: 150

I am trying to write a specific query in Scala ActiveRecord, but it always returns nothing. I have read the wiki on the GitHub page, but it does not contain a lot of info on this. The query I am trying to write is:

SELECT e.name, e.id, COUNT(pt.pass_id) AS pass_count, e.start_date, e.total_passes_to_offer
FROM events e
INNER JOIN passes p ON e.id = p.event_id
INNER JOIN pass_tickets pt ON p.id = pt.pass_id
WHERE e.partner_id = 198
GROUP BY e.name, e.id

What I have tried is:

Event.joins[Pass, PassTicket](
                (event, pass, passTicket) => (event.id === pass.eventId, pass.id === passTicket.passId)
            ).where(
                (event, _, _) => event.partnerId === partnerId
            ).select(
                (event, pass, _) => (event.name, event.id, PassTicket.where(_.passId === pass.id).count, event.startDate, event.totalPassesToOffer)
            ).groupBy( data => data._2)

But first, the return type becomes a map, not a list. And second, when executed, it doesn’t return anything, even though the data exists and the raw SQL executes fine.

