#StackBounty: #relational-theory #database-design Database design: Normalizing a "(many-to-many)-to-many" relationship

Bounty: 150

Short version

I have to add a fixed number of additional properties to each pair in an existing many-to-many join. Looking at the diagrams below, which of Options 1-4 is the best way to accomplish this by extending the Base Case, given their respective advantages and disadvantages? Or is there a better alternative I haven’t considered here?

Longer version

I currently have two tables in a many-to-many relationship, via an intermediate join table. I now need to add additional links to properties that belong to the pair of existing objects. I have a fixed number of these properties for each pair, though one entry in the property table may apply to multiple pairs (or even be used multiple times for one pair). I’m trying to determine the best way to do this, and am having trouble sorting out how to think of the situation. Semantically it seems as if I can describe it as any of the following equally well:

  1. One pair linked to one set of a fixed number of additional properties
  2. One pair linked to many additional properties
  3. Many (two) objects linked to one set of properties
  4. Many objects linked to many properties

Example

I have two object types, X and Y, each with unique IDs, and a linking table objx_objy with columns x_id and y_id, which together form the primary key for the link. Each X can be related to many Ys, and vice versa. This is the setup for my existing many-to-many relationship.

Base Case

[Diagram: Base case]
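As a rough sketch of the base case, assuming SQLite driven through Python's sqlite3 module (the name columns and the sample rows are hypothetical, added only so the query has something to return):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE objx (id INTEGER PRIMARY KEY, name TEXT);  -- name is hypothetical
    CREATE TABLE objy (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE objx_objy (
        x_id INTEGER NOT NULL REFERENCES objx(id),
        y_id INTEGER NOT NULL REFERENCES objy(id),
        PRIMARY KEY (x_id, y_id)  -- each (X, Y) pair can appear only once
    );
    INSERT INTO objx VALUES (1, 'X1'), (2, 'X2');
    INSERT INTO objy VALUES (1, 'Y1'), (2, 'Y2');
    INSERT INTO objx_objy VALUES (1, 1), (1, 2), (2, 1);
""")

# List every related (X, Y) pair through the join table.
pairs = conn.execute("""
    SELECT objx.name, objy.name FROM objx
      INNER JOIN objx_objy ON objx_objy.x_id = objx.id
      INNER JOIN objy ON objy.id = objx_objy.y_id
    ORDER BY objx.id, objy.id
""").fetchall()

# Inserting an existing pair again violates the composite primary key.
try:
    conn.execute("INSERT INTO objx_objy VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The composite primary key is what guarantees one row per pair here, which is the property the options below either keep or give up.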

Now additionally I have a set of properties defined in another table, and a set of conditions under which a given (X,Y) pair should have property P. The number of conditions is fixed, and the same for all pairs. They basically say “In situation C1, pair (X1,Y1) has property P1”, “In situation C2, pair (X1,Y1) has property P2”, and so on, for three situations/conditions for each pair in the join table.

Option 1

In my current situation there are exactly three such conditions, and I have no reason to expect that to increase, so one possibility is to add columns c1_p_id, c2_p_id, and c3_p_id to objx_objy, specifying which property p_id to use in each of the three cases for a given x_id and y_id.

[Diagram: Option 1]

This doesn’t seem like a great idea to me, because it complicates the SQL needed to select all properties applied to an object, and doesn’t readily scale to more conditions. However, it does enforce the requirement of a fixed number of conditions per (X,Y) pair. In fact, it is the only option here that does so.
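A minimal sketch of Option 1, again assuming SQLite via Python's sqlite3. The prop table name, its label column, and the NOT NULL constraints are my assumptions; NOT NULL is one way the "exactly three" requirement could be enforced. The query shows the extra SQL needed to list all properties as rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE objx (id INTEGER PRIMARY KEY);
    CREATE TABLE objy (id INTEGER PRIMARY KEY);
    CREATE TABLE prop (id INTEGER PRIMARY KEY, label TEXT);  -- hypothetical property table
    CREATE TABLE objx_objy (
        x_id    INTEGER NOT NULL REFERENCES objx(id),
        y_id    INTEGER NOT NULL REFERENCES objy(id),
        c1_p_id INTEGER NOT NULL REFERENCES prop(id),
        c2_p_id INTEGER NOT NULL REFERENCES prop(id),
        c3_p_id INTEGER NOT NULL REFERENCES prop(id),
        PRIMARY KEY (x_id, y_id)
    );
    INSERT INTO objx VALUES (1);
    INSERT INTO objy VALUES (1), (2);
    INSERT INTO prop VALUES (10, 'P1'), (20, 'P2');
    -- P1 applies under conditions 1 and 3, P2 under condition 2
    INSERT INTO objx_objy VALUES (1, 1, 10, 20, 10);
""")

# Listing every property of a pair means un-pivoting the three columns by hand.
rows = conn.execute("""
    SELECT x_id, y_id, 1 AS cond, c1_p_id AS p_id FROM objx_objy
    UNION ALL
    SELECT x_id, y_id, 2, c2_p_id FROM objx_objy
    UNION ALL
    SELECT x_id, y_id, 3, c3_p_id FROM objx_objy
    ORDER BY x_id, y_id, cond
""").fetchall()

# A pair with a missing condition is rejected by the NOT NULL constraints.
try:
    conn.execute("INSERT INTO objx_objy VALUES (1, 2, 10, NULL, NULL)")
    incomplete_rejected = False
except sqlite3.IntegrityError:
    incomplete_rejected = True
```

Adding a fourth condition would mean a new column plus another UNION ALL branch in every such query, which is the scaling concern.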

Option 2

Create a condition table cond, and add the condition ID to the primary key of the join table.

[Diagram: Option 2]

One downside to this is that it doesn’t specify the number of conditions for each pair. Another is that when I am only considering the initial relationship, with something such as

SELECT objx.*, objy.* FROM objx
  INNER JOIN objx_objy ON objx_objy.x_id = objx.id
  INNER JOIN objy ON objy.id = objx_objy.y_id

I then have to add a DISTINCT clause to avoid duplicate entries. This design seems to lose the fact that each pair should exist only once.
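To make the duplication concrete, here is a small sketch of Option 2, assuming SQLite via Python's sqlite3. The prop table, the name columns, and the placement of p_id in the join table are my reading of the design, not something confirmed elsewhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE objx (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE objy (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prop (id INTEGER PRIMARY KEY);  -- hypothetical property table
    CREATE TABLE cond (id INTEGER PRIMARY KEY);
    CREATE TABLE objx_objy (
        x_id    INTEGER NOT NULL REFERENCES objx(id),
        y_id    INTEGER NOT NULL REFERENCES objy(id),
        cond_id INTEGER NOT NULL REFERENCES cond(id),
        p_id    INTEGER NOT NULL REFERENCES prop(id),
        PRIMARY KEY (x_id, y_id, cond_id)  -- the pair alone is no longer unique
    );
    INSERT INTO objx VALUES (1, 'X1');
    INSERT INTO objy VALUES (1, 'Y1');
    INSERT INTO prop VALUES (10), (20);
    INSERT INTO cond VALUES (1), (2), (3);
    INSERT INTO objx_objy VALUES (1, 1, 1, 10), (1, 1, 2, 20), (1, 1, 3, 10);
""")

JOIN_SQL = """
    SELECT {distinct} objx.name, objy.name FROM objx
      INNER JOIN objx_objy ON objx_objy.x_id = objx.id
      INNER JOIN objy ON objy.id = objx_objy.y_id
"""

# Without DISTINCT the single pair comes back once per condition.
plain = conn.execute(JOIN_SQL.format(distinct="")).fetchall()
deduped = conn.execute(JOIN_SQL.format(distinct="DISTINCT")).fetchall()
```

With three conditions per pair, the plain query returns each pair three times, and only the DISTINCT version restores one row per pair.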

Option 3

Create a new ‘pair ID’ in the join table, and then have a second link table between the first one and the properties and conditions.

[Diagram: Option 3]

This seems to have the fewest disadvantages, other than the lack of enforcing a fixed number of conditions for each pair. Does it make sense though to create a new ID that identifies nothing other than existing IDs?
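A sketch of Option 3 under the same assumptions (SQLite via Python's sqlite3). The xy_cond_prop table name, the prop table, and the UNIQUE constraint on (x_id, y_id) are hypothetical choices of mine for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE objx (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE objy (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prop (id INTEGER PRIMARY KEY);
    CREATE TABLE cond (id INTEGER PRIMARY KEY);
    CREATE TABLE objx_objy (
        xy_id INTEGER PRIMARY KEY,          -- new surrogate pair ID
        x_id  INTEGER NOT NULL REFERENCES objx(id),
        y_id  INTEGER NOT NULL REFERENCES objy(id),
        UNIQUE (x_id, y_id)                 -- still one row per pair
    );
    CREATE TABLE xy_cond_prop (             -- hypothetical name for the second link table
        xy_id   INTEGER NOT NULL REFERENCES objx_objy(xy_id),
        cond_id INTEGER NOT NULL REFERENCES cond(id),
        p_id    INTEGER NOT NULL REFERENCES prop(id),
        PRIMARY KEY (xy_id, cond_id)        -- at most one property per condition per pair
    );
    INSERT INTO objx VALUES (1, 'X1');
    INSERT INTO objy VALUES (1, 'Y1');
    INSERT INTO prop VALUES (10), (20);
    INSERT INTO cond VALUES (1), (2), (3);
    INSERT INTO objx_objy VALUES (100, 1, 1);
    INSERT INTO xy_cond_prop VALUES (100, 1, 10), (100, 2, 20), (100, 3, 10);
""")

# The original pair query needs no DISTINCT: the join table is unchanged in shape.
pairs = conn.execute("""
    SELECT objx.name, objy.name FROM objx
      INNER JOIN objx_objy ON objx_objy.x_id = objx.id
      INNER JOIN objy ON objy.id = objx_objy.y_id
""").fetchall()

# Properties for one pair come from the second link table.
props = conn.execute("""
    SELECT cond_id, p_id FROM xy_cond_prop
      INNER JOIN objx_objy USING (xy_id)
    WHERE x_id = 1 AND y_id = 1
    ORDER BY cond_id
""").fetchall()
```

The primary key on (xy_id, cond_id) caps each pair at one property per condition, but nothing in this sketch forces all three rows to be present.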

Option 4 (3b)

Basically the same as Option 3, but without the creation of the additional ID field. This is accomplished by putting both original IDs in the new join table, so it contains x_id and y_id fields, instead of xy_id.

[Diagram: Option 4]

An additional advantage to this form is that it doesn’t alter the existing tables (though they aren’t in production yet). However, it basically duplicates an entire table multiple times (or feels that way, anyway) so also doesn’t seem ideal.
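A sketch of Option 4, assuming SQLite via Python's sqlite3. The objx_objy_cond_prop table name and the prop table are hypothetical, and the composite foreign key is my assumption about how the new table would point back at existing pairs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce the FK below
conn.executescript("""
    CREATE TABLE objx (id INTEGER PRIMARY KEY);
    CREATE TABLE objy (id INTEGER PRIMARY KEY);
    CREATE TABLE prop (id INTEGER PRIMARY KEY);
    CREATE TABLE cond (id INTEGER PRIMARY KEY);
    CREATE TABLE objx_objy (                -- existing join table, unchanged
        x_id INTEGER NOT NULL REFERENCES objx(id),
        y_id INTEGER NOT NULL REFERENCES objy(id),
        PRIMARY KEY (x_id, y_id)
    );
    CREATE TABLE objx_objy_cond_prop (      -- hypothetical name for the new link table
        x_id    INTEGER NOT NULL,
        y_id    INTEGER NOT NULL,
        cond_id INTEGER NOT NULL REFERENCES cond(id),
        p_id    INTEGER NOT NULL REFERENCES prop(id),
        PRIMARY KEY (x_id, y_id, cond_id),
        FOREIGN KEY (x_id, y_id) REFERENCES objx_objy(x_id, y_id)
    );
    INSERT INTO objx VALUES (1), (2);
    INSERT INTO objy VALUES (1), (2);
    INSERT INTO prop VALUES (10), (20);
    INSERT INTO cond VALUES (1), (2), (3);
    INSERT INTO objx_objy VALUES (1, 1);
    INSERT INTO objx_objy_cond_prop VALUES (1, 1, 1, 10), (1, 1, 2, 20), (1, 1, 3, 10);
""")

props = conn.execute("""
    SELECT cond_id, p_id FROM objx_objy_cond_prop
    WHERE x_id = 1 AND y_id = 1
    ORDER BY cond_id
""").fetchall()

# A property row for a pair that is not in objx_objy violates the composite FK.
try:
    conn.execute("INSERT INTO objx_objy_cond_prop VALUES (2, 2, 1, 10)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Compared with Option 3, the pair's properties can be read without joining back to objx_objy, at the cost of repeating x_id and y_id in every property row.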

Summary

My feeling is that Options 3 and 4 are similar enough that I could go with either one. I probably would have by now if not for the requirement of a small, fixed number of links to properties, which makes Option 1 seem more reasonable than it otherwise would be. Based on some very limited testing, adding a DISTINCT clause to my queries doesn’t seem to impact performance in this situation, but I’m not sure that Option 2 represents the situation as well as the others, because of the inherent duplication caused by placing the same (X,Y) pairs in multiple rows of the link table.

Is one of these options my best way forward, or is there another structure I should consider?

