*Bounty: 50*

I am trying to predict what percentage (or proportion) of a task each worker has completed, given the time left until the task’s deadline, and I’m looking for help on how to approach modeling this.

I have historic data which contains “worker ids” (`WorkerID`) that uniquely identify each worker, the number of days left to complete the task or `DaysToDeadline` (e.g. 25, 24, 23, etc.), and the percentage of work completed at the given number of days to deadline (`PercentComplete`).

Generally speaking, the percentage completed increases over time, but it can sometimes revert to a smaller value if, for example, the worker makes a mistake during the task and has to redo previously completed work. If a worker completes a task early, he can begin work on another task, so his “percent completed” can actually go above 100% and is recorded as such. In addition, there is not necessarily an equal number of data points for each worker, since some workers start on the task earlier or later than others.

My sample data looks like this:

```
WorkerID DaysToDeadline PercentComplete
1 25 0
1 24 2
1 23 2
1 22 5
1 21 10
2 25 5
2 24 6
2 23 7
2 22 10
2 21 7
2 20 10
3 25 0
3 24 5
3 23 0
4 25 10
4 24 20
4 23 25
4 22 26
4 21 30
4 20 50
4 19 66
4 18 80
4 17 96
4 16 100
4 15 106
```
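To make the quirks described above concrete (occasional setbacks, values above 100%, and unequal numbers of observations per worker), here is a small Python sketch that checks them on the sample data. The helper names `load_rows` and `summarize` are just my own illustrative names, not part of any library.

```python
from collections import defaultdict

# Sample rows copied from the table above:
# WorkerID DaysToDeadline PercentComplete
SAMPLE = """
1 25 0
1 24 2
1 23 2
1 22 5
1 21 10
2 25 5
2 24 6
2 23 7
2 22 10
2 21 7
2 20 10
3 25 0
3 24 5
3 23 0
4 25 10
4 24 20
4 23 25
4 22 26
4 21 30
4 20 50
4 19 66
4 18 80
4 17 96
4 16 100
4 15 106
"""


def load_rows(text):
    """Parse whitespace-separated rows into (worker, days, percent) tuples."""
    return [tuple(int(x) for x in line.split())
            for line in text.strip().splitlines()]


def summarize(rows):
    """Per worker: observation count, whether progress ever dropped, max percent."""
    by_worker = defaultdict(list)
    for worker, days, pct in rows:
        by_worker[worker].append((days, pct))

    summary = {}
    for worker, obs in by_worker.items():
        # Chronological order: more days to the deadline means earlier in time.
        obs.sort(key=lambda o: -o[0])
        pcts = [p for _, p in obs]
        summary[worker] = {
            "n_obs": len(pcts),
            "has_setback": any(b < a for a, b in zip(pcts, pcts[1:])),
            "max_percent": max(pcts),
        }
    return summary


rows = load_rows(SAMPLE)
summary = summarize(rows)
for worker, stats in sorted(summary.items()):
    print(worker, stats)
```

In this sample, workers 2 and 3 each have at least one setback, worker 4 goes above 100%, and the number of observations per worker ranges from 3 to 11.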

Since I need to make individual-level predictions and obtain confidence intervals for these predictions, I was thinking about using some sort of generalized linear mixed model in which I treat worker ID as a random effect, days to deadline as a fixed effect, and percent complete as the response. I thought about using a logistic or beta family model, but since I can get values like 105%, I don’t think this would be appropriate. So, **I’m looking for suggestions on how to approach this.** I’m ideally looking for a regression approach, but I would be open to others, such as machine learning approaches, too; I’m just more familiar with regression. Thanks.
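For concreteness, the simplest version of the mixed model I have in mind could be written roughly as follows. This is only a sketch: the symbols are my own notation, and the Gaussian error and the worker-specific slope are assumptions on my part rather than something I have settled on.

```latex
y_{ij} = \beta_0 + \beta_1 d_{ij} + b_{0i} + b_{1i} d_{ij} + \varepsilon_{ij},
\qquad
(b_{0i}, b_{1i})^\top \sim \mathcal{N}(0, \Sigma),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is `PercentComplete` for worker $i$ at observation $j$, $d_{ij}$ is the corresponding `DaysToDeadline`, $\beta_0$ and $\beta_1$ are the fixed intercept and slope, and $b_{0i}$ and $b_{1i}$ are worker-specific random deviations from them. A Gaussian response would not object to values above 100%, but it also would not keep predictions within a sensible range, which is part of why I am unsure about it.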

**UPDATE:**

If it’s too difficult to suggest a modeling approach to this problem because the percentages can exceed 100 (e.g. 105%), I’d be amenable to simply truncating or redefining the task completion percentage so that 100% is the highest possible value.
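The truncation I have in mind would look something like the sketch below. The function names are hypothetical. The second helper is an assumption on my part: it applies a commonly used "squeeze" transformation, (y(n − 1) + 0.5) / n, that moves exact 0s and 1s strictly inside (0, 1), which I understand a beta family model would need.

```python
def truncate_percent(percent, cap=100.0):
    """Cap a completion percentage at `cap`; smaller values are unchanged."""
    return min(percent, cap)


def to_open_unit_interval(percent, n, cap=100.0):
    """Map a percentage to a proportion strictly inside (0, 1).

    `n` is the total number of observations in the data set. The squeeze
    (y * (n - 1) + 0.5) / n is an assumed preprocessing step, not something
    already present in my data.
    """
    y = truncate_percent(percent, cap) / cap
    return (y * (n - 1) + 0.5) / n


# Example with a few values from the sample data (25 rows in total).
for pct in (0, 50, 100, 106):
    print(pct, truncate_percent(pct), to_open_unit_interval(pct, n=25))
```

With this, 106% and 100% are treated identically, so any information about work done beyond the original task would be lost.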