#StackBounty: #mixed-model #multiple-regression #survival #panel-data #time-varying-covariate Include step function into Joint longitud…

Bounty: 50

I am fitting a joint longitudinal and time-to-event model on production data with the aim of making dynamic predictions of the assembly time of a machine. I am using the JMbayes R package.

Among the time-dependent variables in the longitudinal part of the model I have a dummy variable that indicates whether the mechanical part of the assembly is finished. This variable is 0 up to a certain time point and becomes 1 from the moment the mechanical part is completed until the time of the event (the assembly has been completed). So it is essentially a single step function.

Currently I am fitting a binomial (logit link) longitudinal model for this variable, with a fixed effect for time and a random intercept for the order number of the machine being assembled.
I am linking this longitudinal model to the time-to-event component through the "current value" association.
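For reference, my current setup looks roughly like this (a sketch only; the names mech_done, day, order_id, assembly_time, event, longDat and survDat are placeholders for my actual columns and data frames):

```r
library(JMbayes)
library(survival)

# longitudinal submodel: binary indicator that the mechanical part is finished,
# logit link, fixed effect for time, random intercept per order number
mixed <- mvglmer(list(mech_done ~ day + (1 | order_id)),
                 data = longDat, families = list(binomial))

# survival submodel: time to completed assembly (one row per order)
coxFit <- coxph(Surv(assembly_time, event) ~ 1,
                data = survDat, model = TRUE)

# joint model, linked through the (default) "current value" association
jointFit <- mvJointModelBayes(mixed, coxFit, timeVar = "day")
```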

The idea is that if I complete the mechanical part of the assembly ahead of the historical average, the total assembly time is likely to be lower; conversely, if I spend much more time on it, the total assembly time is likely to be higher.

I am sure this is not the right way to include this kind of variable in the model, for two reasons. First, the longitudinal submodel ignores the fact that the variable is a binary step function over time that is monotone non-decreasing. Second, linking it to the time-to-event component through the current value association seems questionable, since the impact of this variable depends on time: it is always 0 during the first days of assembly and only becomes informative from a certain time onwards.

Could you help me correct the model definition?


Get this bounty!!!

#StackBounty: #econometrics #panel-data #causality #fixed-effects-model Do unbalanced observations contribute to identification in fixe…

Bounty: 50

I am quite confused on how to interpret my regression results.

I want to estimate the effect of tariff changes $\tau_{st}$ in sector $s$, period $t$ on the employment of firm $i$. I have two time periods (4 years apart) and want to use time and/or firm fixed effects. This is the base model:

$$y_{ist} = \beta \tau_{st} + \lambda_t + \mu_i + \epsilon_{ist}$$

The thing is that not all $i$ firms are observed in both time periods. There are either leavers, entrants, or continuing firms.

I have two questions related to the sources of identification of $\beta$. Ideally, I would want to use cross-sectoral variation in tariff changes to identify it. Here they are:

1 – Do leavers and entrants contribute to the estimation if I include $\mu_i$? How so? I understand they would if I had $\mu_s$ instead. If not, would I get the same estimates by just dropping such observations?

2 – There are a few firms that, although present in both $t=1$ and $t=2$, change sector $s$ from one period to the next. Do these units also help with the identification of $\beta$ if I use the model above?

If I wanted to consider cases in which a firm may react to the policy change by moving to a sector with higher tariffs on the following period, which fixed effect structure would do so?
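For concreteness, this is roughly how I estimate the specification above and how I have been probing question 1 empirically (a sketch; df, y, tau, firm, year and sector are placeholder names):

```r
library(fixest)

# baseline two-way fixed effects specification
m_all <- feols(y ~ tau | firm + year, data = df, cluster = ~ sector)

# drop firms observed in only one period (leavers/entrants) and re-estimate;
# with firm fixed effects these singletons have no within-variation left
n_obs   <- table(df$firm)
stayers <- names(n_obs)[n_obs > 1]
m_stay  <- feols(y ~ tau | firm + year,
                 data = df[df$firm %in% stayers, ], cluster = ~ sector)

etable(m_all, m_stay)  # compare the two estimates of beta
```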


Get this bounty!!!

#StackBounty: #panel-data #dataset #multilevel-analysis Changing the time metric for longitudinal data

Bounty: 50

I have some longitudinal data. I've done longitudinal analysis before, but I have never changed the time metric, so I wanted to run my process by you.

Edits for clarity:
I have repeated-measures data collected over about 2 months, but the study has to do with COVID – thus, time (and time passing) is an important component. People beginning the study on May 14th, for example, may be quite different from people coming in on June 1st in terms of our variables. I want to restructure the analysis to examine the effects of time: I want to go from a relatively time-balanced setup (time 1, time 2, time 3) that is agnostic to the actual intake time, to an analysis that takes into account the specific dates on which each individual's 5 time points were collected – an individually varying times of observation scenario. I propose recoding each participant's 5 timepoints into 'days since the beginning of the study' and using that as my time metric. I plan on using a linear mixed-effects model with this new time metric as my 'time' covariate.

I go into a few more details of how I want to go about restructuring this below. But TLDR: I want to know a) whether this is defensible and b) whether my method of doing so, described below, makes sense.

Original:
Details:

5 data collections, spaced equally every 7 days. So t1 = intake, t2 = day 7, t3 = day 14, t4 = day 21, t5 = day 28.
Sample size ~1500, of course some missing data due to attrition as time goes on.
Participants were allowed to begin the study over the course of approximately a month – and there is a fairly good distribution of intakes across that month where the survey was open.

Instead of analyzing change just across measurement occasion, where the X-axis is t1, t2, t3, t4, t5, I would like to rescale the time metric to capture the actual day within the whole period that data were collected, and to analyze change across time that way rather than being agnostic to the actual date – turning the X-axis into Day 1, Day 2, …, Day 60. This is because I have reason to believe that change on my outcome variable will be a function of time passing.

But as you might imagine, when conceptualized this way (as days), not every day will be common to all participants (i.e., some started on day 3, some on day 30, and everything in between). So the data become more like a time-unstructured data set – thus I will examine change over time with a growth curve fit as a mixed-effects model.

Here is how I intend to go about doing this time metric change (a sketch of steps 3–4 in R follows the list):
Step 1: create variables that show y scores across all ~60 possible days.
Step 2: recode the existing 5 measurement occasions for each participant into data organized by 'day' rather than (t1, t2, t3, t4, t5), based on date of intake. E.g., someone who began the study on day 1 has their first timepoint now labelled as 'day 1 y', whereas someone who began the study on day 15 has their first timepoint labelled as 'day 15 y' in the data set (and their subsequent timepoints 7 days later, i.e., 'day 22').
Step 3: restructure the data to person-period format (using participant IDs).
Step 4: run the growth curve (with time now representing day, ranging from 1 to 60), with intercept and time as random effects, using a mixed-effects model.
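Here is a minimal sketch of what I mean, assuming a person-period data frame long with columns id, date (Date class), and y (names are placeholders):

```r
library(lme4)

# days since the start of the study (1 to ~60) as the new time metric
long$day <- as.numeric(long$date - min(long$date)) + 1

# growth curve: random intercept and random slope for day per participant
fit <- lmer(y ~ day + (1 + day | id), data = long)
summary(fit)
```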

TLDR: I want to switch to an ‘individually varying time metric’ (Grim et al., 2017). I’ve recoded my data to change the time-metric from measurement occasion to ‘day’ to capture change over time. Is what I have done appropriate/correct?

OR would it just make more sense to include date (operationalized as day1, day2…etc.) as a covariate using the original metric?

Any help would be very much appreciated!

Below is a visual example of what I did, using some made-up random numbers:

[image: example data with made-up values illustrating the recoding]

Then pairwise restructure.


Get this bounty!!!

#StackBounty: #r #econometrics #panel-data #difference-in-difference #treatment-effect Panel data 'binary' treatment with multi…

Bounty: 50

I am investigating the impact of a county-level policy on crime outcomes. I am using a two-way fixed effects estimator. My exposure (i.e., treatment) is a static binary variable. The binary treatment dummy switches on and off for some units. Of the subset of treated counties, some were exposed multiple times, and some were exposed only once or twice. I do have a large subset of "never-receivers" as well. I observe all counties across 120 months. Here are my concerns.

The policy variable turns on and off at rather odd times. In other words, treatment might ‘turn on’ (i.e., switch from 0 to 1) for some counties for several months before switching off (i.e., treatment reverses). For a subset of adopter counties, treatment might begin on March 1st 2005 and end on June 30th 2005. Now, in the following year, some counties were treated again—but the intervention starts on April 14th 2006 and ends on August 7th 2006. This pattern repeats with irregular on and off periods.

In previous evaluations I have assessed county-level crime outcomes across months. But with the irregular exposure periods, I suppose I could disaggregate the time dimension to a lower level such as a weekly series. However, doing so introduces more 0 counts across weeks. This isn’t necessarily a problem, but using a linear model to assess crime rates (i.e., a log-transformed crime rate) becomes problematic.

Question 1: Should I simply disaggregate down to a smaller sub-unit of time (i.e., week/day) so that the main treatment dummy better delineates each intervention's start and end time? If I use a county-month panel, then I won't capture all treatment epochs precisely, which knowingly introduces measurement error.

Question 2: Some crime outcomes show low cardinality across weeks. Assessing a log-transformed outcome with many zero counts will likely affect estimation if I use a linear panel data model. Thus, I suppose if the county-week panel is the way to go then I should substitute the linear model with a Poisson model and use the residential county population size as an offset. Any thoughts? My decision to do this is rather ad hoc.
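For concreteness, the county-week Poisson alternative I have in mind would look roughly like this (a sketch using fixest; panel, crimes, policy, county, week and pop are placeholder names):

```r
library(fixest)

# Poisson with county and week fixed effects and a population exposure offset
pois_fit <- fepois(crimes ~ policy | county + week,
                   offset  = ~ log(pop),
                   data    = panel,
                   cluster = ~ county)
summary(pois_fit)
```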

Question 3: A model like the one I am proposing will, in theory, get a lot of feedback from the previous period. I don't necessarily want to add a lagged version of the outcome on the right-hand side because I am aware of the dangers. But is it safe to argue that some of the endogeneity concerns wane as $T$ gets very large? Also, what concerns do I face by including only a second-order lag? The basic specification is below:

$$
\log(y_{it}) = \theta\, y_{i,t-1} + \sum_i \text{County}_i + \sum_t \text{Week}_t + \delta\, \text{Policy}_{it} + \log(\text{Pop}_{it}),
$$

where $y_{it}$ is a very rare crime outcome; it is even more rare, so to speak, if I use a weekly time series. The right-hand side includes a lag of that outcome, fixed effects for counties and weeks, respectively, and the main policy variable. The last term is simply a population offset.

In sum, the lag might introduce more problems than it alleviates. I have also found few practical fixes using glm() or pglm() in R.

If it’s of any help, I mainly work in R and the lubridate package is my best friend.


Get this bounty!!!

#StackBounty: #machine-learning #time-series #cross-validation #panel-data #validation Model validation with multiple time series (sort…

Bounty: 50

I have a dataset of the following form:

client_id | date       | client_attr_1 | client_attr_2 | client_attr3 | money_spend
1         | 2020-01-01 |           123 |           321 |          188 |      150.24
1         | 2020-01-02 |           123 |           321 |          188 |       18.25
1         | 2020-01-03 |           123 |           321 |          188 |       12.34
2         | 2020-01-02 |           233 |           421 |          181 |       10.10
2         | 2020-01-03 |           233 |           421 |          181 |       20.00
2         | 2020-01-04 |           233 |           421 |          181 |       11.12
2         | 2020-01-01 |           233 |           421 |          181 |       18.36
3         | 2020-02-01 |           723 |           301 |          255 |        1.14
3         | 2020-02-01 |           723 |           301 |          255 |        1.19

My goal is to predict money spend (money_spend) for new clients, day by day.

The goal of the validation procedure is to get a model performance that is not biased by group/time leakage.

I can imagine that an ideal validation scheme, one that would reflect the actual prediction-time situation for this problem, would take the following into account:

  1. Groups – clients: ensure that a client's observations are not in the train and validation sets at the same time.
  2. Time – make sure that the model is not trained on future clients and evaluated on clients from the past, to avoid look-ahead bias.

I find this a bit inconvenient, as it requires implementing a custom validation procedure that could cause additional problems (e.g., highly different train/test sizes with repeated validation). Therefore, I'd like to drop the second requirement. For that to be reasonable, I believe what I need to check is whether the actual time series (spend given date) of different clients are somehow dependent (correlated) on the same dates (I assume this will not be the case).
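For reference, the scheme that respects both constraints (the one I find inconvenient) would look roughly like this in base R, assuming a data frame df shaped like the sample above:

```r
# order clients by their first observed date, then split by client,
# putting earlier clients in train and later clients in validation
first_date <- tapply(df$date, df$client_id, min)
clients    <- names(sort(first_date))
n_train    <- floor(0.8 * length(clients))

train_ids <- clients[seq_len(n_train)]
valid_ids <- clients[-seq_len(n_train)]

train <- df[df$client_id %in% train_ids, ]   # no client overlap with valid
valid <- df[df$client_id %in% valid_ids, ]   # only clients entering later
```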

Now the questions are:

  1. Is it the right thing to check?
  2. Is comparing time series of different clients on the same dates enough?
  3. Is there a better/proper way to assess such dependency?
  4. Perhaps I do not need to validate that, or anything else, for reasons I'm not seeing?


Get this bounty!!!

#StackBounty: #regression #time-series #panel-data Time series model for multiple different series observations

Bounty: 50

I have a set of $n$ machines that emit some sensor data. Each machine is started at some point, telemetry is collected every minute for some time, and then the machine is stopped. So, instead of one long time series, I have many short time series. Note that some machines might run for an hour, others for 50 minutes, others for 40 minutes, and so on. Within the hour, there will be some seasonal patterns.

Now, I want to fit a time-series model that gives me 95% confidence bands for the sensor value at each time instant since a machine starts. Also, tomorrow I will get a new machine I haven't seen before, but one that is expected to behave like the $n$ machines I trained on. For each minute and the sensor value it produces, I want to estimate a p-value: the probability of seeing that observation if the machine were no different from the $n$ machines I saw in the past.
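To make the desired output concrete, here is a rough sketch of the kind of bands and p-values I am after, using a pooled smooth over minutes-since-start with a machine random effect (mgcv; the data frame dat and its columns machine, minute, y are placeholders):

```r
library(mgcv)

# dat: one row per machine x minute, columns machine (factor), minute, y
fit <- gam(y ~ s(minute) + s(machine, bs = "re"), data = dat, method = "REML")

# approximate 95% band per minute for a "typical" new machine
newd   <- data.frame(minute = 1:60, machine = dat$machine[1])  # RE excluded below
pr     <- predict(fit, newd, se.fit = TRUE, exclude = "s(machine)")
tot_sd <- sqrt(pr$se.fit^2 + fit$sig2)   # smooth uncertainty + residual variance
band   <- data.frame(minute = newd$minute,
                     lower  = pr$fit - 1.96 * tot_sd,
                     upper  = pr$fit + 1.96 * tot_sd)

# two-sided p-value for a new reading y_new observed at minute m (normal approx.)
p_val <- function(y_new, m) 2 * pnorm(-abs((y_new - pr$fit[m]) / tot_sd[m]))
```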

What time series models can be good fits for this use-case? As a stretch goal, each of the $n$ machines might have some feature vectors associated with them. Is it possible to take the features into account?


Some thoughts: perhaps we can combine the $n$ time series into one long time series and treat the intervals for which the shorter ones don't have values as missing? But then the question becomes: in what order should we combine them?


Some examples of the panel data:

https://1drv.ms/x/s!AiY4k2EqE618gakcA0pRg0SAvQl_UA?e=ir93Db


Get this bounty!!!