#StackBounty: #machine-learning #ranking #xgboost What exactly is the query group "qid" in XGBoost?

Bounty: 50

The XGBoost documentation says that for ranking applications we can specify query group IDs (qid) in the training dataset, as in the following snippet:

1 qid:1 101:1.2 102:0.03
0 qid:1 1:2.1 10001:300 10002:400
0 qid:2 0:1.3 1:0.3
1 qid:2 0:0.01 1:0.3
0 qid:3 0:0.2 1:0.3
1 qid:3 3:-0.1 10:-0.3
0 qid:3 6:0.2 10:0.15
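To make the format concrete, here is a minimal sketch that parses the lines above with plain Python and counts how many consecutive rows share each qid. The helper name `parse_line` is my own and not part of XGBoost. The resulting list of group sizes is, as far as I know, the same information that `DMatrix.set_group` expects; the sketch assumes rows of the same query are contiguous.

```python
from itertools import groupby

# The snippet from the documentation, one string per row.
lines = [
    "1 qid:1 101:1.2 102:0.03",
    "0 qid:1 1:2.1 10001:300 10002:400",
    "0 qid:2 0:1.3 1:0.3",
    "1 qid:2 0:0.01 1:0.3",
    "0 qid:3 0:0.2 1:0.3",
    "1 qid:3 3:-0.1 10:-0.3",
    "0 qid:3 6:0.2 10:0.15",
]

def parse_line(line):
    """Hypothetical helper: split a row into (label, qid, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    qid = int(tokens[1].split(":", 1)[1])
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features

rows = [parse_line(line) for line in lines]
qids = [qid for _, qid, _ in rows]

# Count consecutive rows with the same qid (assumes each query's rows are contiguous).
group_sizes = [len(list(group)) for _, group in groupby(qids)]
print(group_sizes)  # [2, 2, 3]
```

Note that qid is parsed separately from the features here: it marks which rows belong together, it is not a feature column.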

I have a couple of questions regarding qids. Assume the standard learning-to-rank (LTR) setup: a set of search queries and documents, represented by query, document and query-document features:

1) Let's say we have qids in our training file. Does that mean the optimization is performed only on a per-query basis, that all other features specified are treated as document features, and that cross-query learning won't happen?

2) Let's assume that queries are represented by query features. Should we still specify qids in the training file, or should we just list the query, document and query-document features?

UPDATE:

So far I have the following explanation, but I don't know how correct it is:

Each row in the training set is for a query-document pair, so each row contains query, document and query-document features. If we specify qid as a unique ID for each query (= query group), then we can assign a weight to each of these query groups. If the weight of a query group is large, XGBoost will try to get the ranking right for that group first.

From a file in the XGBoost repo:

weights = np.array([1.0, 2.0, 3.0, 4.0])
...
dtrain = xgboost.DMatrix(X, label=y, weight=weights)
...
# Since we give weights 1, 2, 3, 4 to the four query groups,
# the ranking predictor will first try to correctly sort the last query group
# before correctly sorting other groups.

and also:

In ranking task, one weight is assigned to each query group
(not each data point). This is because we only care about the
relative ordering of data points within each group, so it
doesn't make sense to assign weights to individual data points.
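As a sanity check on the "one weight per group" rule, here is a small sketch in plain Python. The qid column is made up for illustration (four groups, to match the four weights in the quoted file); only the weights 1.0 to 4.0 come from the quote.

```python
from itertools import groupby

# Hypothetical qid column with four query groups; rows of a query are contiguous.
qids = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]

# One weight per query group, as in the quoted file.
group_weights = [1.0, 2.0, 3.0, 4.0]

group_sizes = [len(list(group)) for _, group in groupby(qids)]

# The number of weights should match the number of groups, not the number of rows.
assert len(group_weights) == len(group_sizes)
assert len(group_weights) != len(qids)

# Expanded only to inspect which weight each row's group carries.
# Per the quoted comment, the per-group list is what gets passed as `weight`.
row_weights = [
    weight
    for weight, size in zip(group_weights, group_sizes)
    for _ in range(size)
]
print(group_sizes)
print(row_weights)
```

So with 10 rows and 4 queries, the weight vector has 4 entries, and every row of the last group shares the weight 4.0.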

UPDATE 2:

Found this link. Given

3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D 
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A 
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D 
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D

the following set of pairwise constraints is generated (examples are referred to by the info-string after the # character):

1A>1B, 1A>1C, 1A>1D, 1B>1C, 1B>1D, 2B>2A, 2B>2C, 2B>2D, 3C>3A, 3C>3B, 3C>3D, 3B>3A, 3B>3D, 3A>3D
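The list above can be reproduced with a short sketch: within each qid, compare every pair of examples and emit a constraint only when the relevance labels differ. This is just my reading of how the pairs are formed, not code taken from the linked page or from XGBoost, and the function name `pairwise_constraints` is my own.

```python
from itertools import combinations, groupby

# (relevance label, qid, info-string) for each row of the example above.
examples = [
    (3, 1, "1A"), (2, 1, "1B"), (1, 1, "1C"), (1, 1, "1D"),
    (1, 2, "2A"), (2, 2, "2B"), (1, 2, "2C"), (1, 2, "2D"),
    (2, 3, "3A"), (3, 3, "3B"), (4, 3, "3C"), (1, 3, "3D"),
]

def pairwise_constraints(examples):
    """Return 'X>Y' for every same-qid pair where X has the higher label."""
    constraints = []
    # Group consecutive rows by qid; pairs are never formed across groups.
    for _, group in groupby(examples, key=lambda example: example[1]):
        for (label_a, _, name_a), (label_b, _, name_b) in combinations(list(group), 2):
            if label_a > label_b:
                constraints.append(f"{name_a}>{name_b}")
            elif label_b > label_a:
                constraints.append(f"{name_b}>{name_a}")
            # Equal labels (e.g. 1C and 1D) give no constraint.
    return constraints

constraints = pairwise_constraints(examples)
print(len(constraints))  # 14
print(", ".join(constraints))
```

The order of the printed pairs differs slightly from the list above, but the set of 14 constraints is the same. Note that 3A (label 2) and 1B (label 2) are never compared, even though their labels are equal, because they belong to different queries.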

So qid seems to specify groups such that relevance values can be compared within each group but can't be directly compared between groups (including during training). So we need qids during training, but we don't need them as input during inference.
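To illustrate the inference side of this, here is a sketch with a stand-in scorer. The linear `score` function and its weights are purely hypothetical and have nothing to do with what XGBoost learns; the point is only that each row is scored from its features alone, and the grouping is supplied by whoever collects the candidate documents for one query.

```python
def score(features, weights):
    """Hypothetical stand-in for a trained ranker: a score from features only."""
    return sum(weights.get(index, 0.0) * value for index, value in features.items())

# Made-up model weights, keyed by feature index.
weights = {1: 1.0, 4: 2.0}

# Candidate documents for ONE incoming query. No qid appears anywhere:
# the caller already knows these rows belong together.
candidates = {
    "doc_a": {1: 1.0, 4: 0.2},
    "doc_b": {1: 0.0, 4: 0.1},
    "doc_c": {1: 0.0, 4: 0.4},
}

# Score each candidate independently, then sort within this query only.
ranked = sorted(
    candidates,
    key=lambda name: score(candidates[name], weights),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

Scores produced this way are only meaningful relative to other candidates of the same query, which mirrors the within-group comparison used in training.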

