#StackBounty: #time-series #hypothesis-testing #panel-data #geography Testing Hypothesis with Time series and Location Data

Bounty: 50

I have data on house prices, along with these variables:

  1. Location
     i) Latitude-longitude
     ii) City and state

  2. Attributes of the house
     i) Number of bedrooms and bathrooms (could be taken as a proxy for size?)

  3. Year built and the price of the house in that year (so I essentially have the price of each house at the time it was built, hence time-series data)

Now I want to test the effect of these variables on price, individually and in combination.

What statistical tests can be used for this? Say, have prices increased in the last 3-4 years?

How exactly can I use the latitude-longitude data (since it is more granular than just the city) to check whether house prices depend on location?
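One possible way (not taken from the question) to let location enter at the latitude-longitude level is a hedonic regression with a two-dimensional spatial smooth. A minimal sketch with the mgcv package, assuming a hypothetical data frame houses with columns price, bedrooms, bathrooms, year_built, lat and lon:

library(mgcv)  # generalized additive models with spatial smooths

fit <- gam(log(price) ~ bedrooms + bathrooms + year_built + s(lat, lon),
           data = houses)
summary(fit)   # the approximate test for s(lat, lon) indicates whether location
               # adds explanatory power beyond the other covariates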

Looking for some suggestions.


Get this bounty!!!

#StackBounty: #hypothesis-testing #statistical-significance #canonical-correlation Statistical significance test for comparing two cano…

Bounty: 100

I have a colleague who is comparing several different treatments of data via canonical correlation analysis. In other words, given some time-varying signal $a(t)$, he extracts some vector of features $v_1(t)$. He then supposes that this is a predictor for some other vector $p(t)$. To check this he computes the [first] canonical correlation coefficient, $R_1 = \text{CCA}(v_1(t), p(t))$. He then tries a new, improved feature extractor, $v_2(t)$, and again computes $R_2 = \text{CCA}(v_2(t), p(t))$.

I can find tests for whether $R_1$ and $R_2$ are individually different from zero. But what about a test of whether $R_1$ and $R_2$ are significantly different from each other?
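For concreteness, a minimal sketch of the setup in base R, assuming v1, v2 and p are numeric matrices whose rows are the time points; the bootstrap at the end is just one possible way to compare the two coefficients, not something taken from the question:

R1 <- cancor(v1, p)$cor[1]   # first canonical correlation for feature set 1
R2 <- cancor(v2, p)$cor[1]   # first canonical correlation for feature set 2

# One option: resample rows jointly and inspect the distribution of R1 - R2
# (for time series, a block bootstrap would better respect serial dependence)
boot_diff <- replicate(2000, {
  i <- sample(nrow(p), replace = TRUE)
  cancor(v1[i, , drop = FALSE], p[i, , drop = FALSE])$cor[1] -
    cancor(v2[i, , drop = FALSE], p[i, , drop = FALSE])$cor[1]
})
quantile(boot_diff, c(0.025, 0.975))  # rough interval for R1 - R2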

Asked on behalf of a colleague, but I’d also be interested in the answer.


Get this bounty!!!

#StackBounty: #hypothesis-testing #multiple-comparisons #fishers-exact Statistical testing for cohort analysis and retention: did I do …

Bounty: 50

A common problem in tech startups is cohort analysis when A/B testing new app features: between two dates, two different sets of users who join in that period experience two different versions of the app (A and B). To decide which version is better, the statistic we collect is the number of users $N_{t_0,\Delta t}$ who joined on day $t_0$ and used the app $\Delta t$ days later. The data typically look like this:

Group A

Cohort  Day0    Day1    Day2    Day3
25 July 2614    351     140     152
26 July 2819    571     210
27 July 2261    415

Group B

Cohort  Day0    Day1    Day2    Day3
25 July 2608    411     151     148
26 July 2822    592     264
27 July 2301    444

Now we would like to run a significance test to find out whether one feature is better than the other. The way I did this was to run Fisher’s exact test for each $(t_0, \Delta t)$ pair, constructing the corresponding contingency table to obtain a p-value, and then combine all of the p-values using Fisher’s method.
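For reference, a minimal sketch of one such per-pair test and of Fisher’s method for combining the p-values, using the Day-1 counts of the 25 July cohorts above as one plausible reconstruction of the contingency table:

tab <- matrix(c(351, 2614 - 351,    # Group A: retained / not retained on Day 1
                411, 2608 - 411),   # Group B: retained / not retained on Day 1
              nrow = 2, byrow = TRUE)
fisher.test(tab, alternative = "greater")$p.value  # one-sided: A better than B
fisher.test(tab, alternative = "less")$p.value     # one-sided: A worse than B

# Fisher's method for combining the per-pair one-sided p-values
fisher_method <- function(p) pchisq(-2 * sum(log(p)), df = 2 * length(p),
                                    lower.tail = FALSE)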

The result looks like this:

Testing Group A BETTER AT TIMES than Group B:

Cohort 1, day 1 test p-value: 0.992
Cohort 1, day 2 test p-value: 0.772
Cohort 1, day 3 test p-value: 0.437
Cohort 2, day 1 test p-value: 0.759
Cohort 2, day 2 test p-value: 0.996
Cohort 3, day 1 test p-value: 0.803

Combined p-value: 0.994


Testing if Group A WORSE AT TIMES than Group B:

Cohort 1, day 1 test p-value: 0.009
Cohort 1, day 2 test p-value: 0.267
Cohort 1, day 3 test p-value: 0.609
Cohort 2, day 1 test p-value: 0.262
Cohort 2, day 2 test p-value: 0.006
Cohort 3, day 1 test p-value: 0.219

Combined p-value: 0.004

Does this seem like a reasonable way to do this analysis and decide significance? We want to write an article and share the code, and would like to make sure I’m not doing anything wrong. (One thing I’d like to note here is that it’s not quite survival analysis, as the number of users in a cohort can actually increase from one day to another, which I have observed in real data)


Get this bounty!!!

#StackBounty: #time-series #hypothesis-testing #gaussian-process #kolmogorov-smirnov #model-comparison How to test if the process that …

Bounty: 100

Problem

I have time-series data generated by a machine over two disjoint periods of time – roughly one month in 2016 and another month in 2018.

It is hypothesized by domain experts that at each time step $t$,
an observed variable $Y^t$ can be explained by another set of observed variables, $X_1^t, \ldots, X_d^t$.

How can I test whether this process has changed over time?
Note that I am not trying to test if the distribution for the variable $Y$ has changed over time. I want to test if the relationship between the $X_i$s and $Y$ has changed over time.

Current approach

Suppose I fit a time-series model (e.g., a Gaussian Process) on the data from 2016 to predict $Y^t$ given $X_1^t, \ldots, X_d^t$ as a way to model the underlying process that generated $Y^t$.

The domain experts have suggested that maybe we can use this model to predict $Y^t$ given the $X^t$s from 2018 and somehow use the residuals to infer whether the model (representing the 2016 process) still describes the process in 2018. I am uncertain how to continue from this point.

What I’m considering

  1. Should I test if the residuals from 2016 and 2018 are generated from the same distribution, or perform a goodness-of-fit test using something like the Kolmogorov-Smirnov test (a sketch of this comparison appears after this list)? My concern with this approach is that the out-of-sample data from 2018 are likely to have larger errors than the in-sample training data from 2016, so this test will likely give rise to false positives. Is there any way to adjust/account for this effect?

  2. Should I fit two models, one for 2016 and another for 2018, and use some way to test that these two models are “same” or “different”? If so, how can I do this?

  3. I saw some posts on cointegration. But I do not fully understand this concept. Is this relevant?
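A minimal sketch of the residual comparison in point 1, using a plain linear model as a stand-in for the Gaussian Process, and hypothetical data frames d2016 and d2018 with a column y and the predictor columns:

fit_2016 <- lm(y ~ ., data = d2016)                       # model of the 2016 process
res_2016 <- residuals(fit_2016)                           # in-sample residuals, 2016
res_2018 <- d2018$y - predict(fit_2016, newdata = d2018)  # out-of-sample residuals, 2018
# note: res_2018 is out of sample, so some inflation is expected (see point 1)
ks.test(res_2016, res_2018)                               # two-sample Kolmogorov-Smirnov test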

In general, how might one approach this type of problem? I’ve tried searching for this online, but maybe due to a lack of precision in my query (I’m not familiar with this area), I’m not getting many relevant results. I would appreciate even simple hints/comments on the topic(s)/keywords to search for, or books/papers to look through.


Get this bounty!!!

#StackBounty: #hypothesis-testing #classification #confidence-interval How to measure confidence in classifier of non-independent data?

Bounty: 100

I have some noisy high dimensional data, and each data point has a “score”. The scores are roughly normally distributed. Some scores are known and some are unknown; I want to separate the unknown points into two groups, based on whether I think the score is positive or not.

I have a black box which, given some data points and their scores, gives me a hyperplane correctly separating the points (if one exists).

I separate the points with known score into two disjoint sets for training and validation respectively.

Then, repeatedly (say k times), I do the following:

  • Randomly select m data points with positive score and n points with negative score from the training set (for some fixed positive values of m and n).
  • Use the black box to (try to) get a separating hyperplane for these sampled points.
  • If I get a hyperplane back, save it.

Now I have some hyperplanes (say I have 0 < k’ <= k of them).

I use these hyperplanes to separate the validation set. I select the hyperplane which correctly classifies the most points as having positive or negative score (number of correct positives + number of correct negatives).

My question is: How can I measure the statistical confidence that the finally selected hyperplane is better than random?

Here’s what I’ve done so far:

Say there are n points in the validation set. If a hyperplane correctly classifies a point with probability p, and this is independent for all the points, we can use a binomial distribution.

Let F be the cdf of the binomial distribution. Let X be the number of correctly classified points in the validation set (so we are assuming X ~ B(n, p)). Then P(X <= x) = F(x).

Now, we have k’ hyperplanes. Let’s assume these can be represented as k’ IID variables X1, X2, …, Xk’.

Now P(max(X1, X2, …, Xk’) <= x) = F(x)^k’.

Let’s say a random hyperplane is one as above where p equals the proportion of positive scores in the total (so if it’s three quarters positive, p = 0.75).

Sticking some numbers in: let p = 0.5 for simplicity, and suppose I want to check whether the selected hyperplane is better than random with probability > 0.95.

If n = 2000, I need to classify 1080 correctly to have confidence greater than 0.95 that this classifier is better than random (I think, unless I did the calculation wrong).
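A minimal sketch of that calculation (k’ is not stated in the question, so the value below is purely illustrative):

n <- 2000; p <- 0.5; kprime <- 10   # kprime (number of hyperplanes) is an illustrative guess
xs <- 0:n
# P(the best of kprime independent random classifiers gets fewer than x right) = F(x - 1)^kprime
p_all_below <- pbinom(xs - 1, size = n, prob = p)^kprime
xs[which(p_all_below > 0.95)[1]]    # smallest x giving confidence > 0.95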

However, if the points themselves are not independent, this doesn’t work. Suppose many of the points are identical so the effective size of the set is much smaller than n. If n = 20, you need to get 18 correct for 0.95 confidence; extrapolating that suggests you’d need 1800/2000.

I am sure that the points are not independent, but I’m not sure in what way, or how to go about measuring that and accounting for it in a calculation similar to the above.


Get this bounty!!!

#StackBounty: #hypothesis-testing Multiple hypothesis testing

Bounty: 50

Suppose I have time-series data for $(X, Y_i)$, independent and dependent variables respectively, which vary over time $t = 1, \ldots, 1000$. I have data on $Y_i$ for 100 individuals/cities/etc., $i = 1, \ldots, 100$. So each of these 100 $Y$’s is a time series with 1000 data points. $X$ is common across all individuals and also varies over time.

I run 100 regressions of $Y_i$ on $X$, one for each $i$. I find that one of those regressions is significant at, say, the $\alpha = 10\%$ level.

Since the probability, under a binomial with $m = 100$ trials and success probability $\alpha = 10\%$, of having at least one success (one false positive) is almost 1, this does not mean much. The probability that there is at least one false positive among my 100 regressions is almost 1.

A common way to address this is a Bonferroni-type correction, which would demand the use of an $\alpha / m$ significance level.

However, the probability of finding 15 or more false positives is about $7.25\%$, which is actually lower than $10\%$. Therefore, if I had found 15 or more significant coefficients across my 100 regressions, I could be relatively confident that I was not facing a chance result.
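Both figures can be checked directly from the binomial distribution:

m <- 100; alpha <- 0.10
1 - pbinom(0,  m, alpha)   # P(at least one false positive) = 1 - 0.9^100, essentially 1
1 - pbinom(14, m, alpha)   # P(15 or more false positives), the 7.25% quoted above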

Is this reasoning justified or not? In fact, in the sense of decreasing the type I error, my reasoning suggests I should be more confident that I am not getting false positives if I find 15/100 significant coefficients at the $10\%$ level than if I get 1/1 significant coefficient in a single regression at the $10\%$ level. In that case, I would be inclined to say that a Bonferroni correction would not make sense (it would be unnecessary).


Get this bounty!!!

#StackBounty: #hypothesis-testing #classification #confidence-interval How to measure confidence in classifier chosen from several avai…

Bounty: 100

I have some noisy high dimensional data, and each data point has a “score”. The scores are roughly normally distributed. Some are known and some are unknown; I want to separate the unknown points into two groups, based on whether I think the score is positive or not.

I have a black box which, given some data points and their scores, gives me a hyperplane correctly separating the points (if one exists).

I separate the points with known score into two disjoint sets for training and validation respectively.

Then, repeatedly (say k times), I do the following:

  • Randomly select m data points with positive score and n points with negative score from the training set (for some fixed positive values for m and n).
  • Use the black box to (try to) get a separating hyperplane for these sampled points.
  • If I get a hyperplane back, save it.

Now I have some hyperplanes (say I have 0 < k’ <= k of them).

I use these hyperplanes to separate the validation set. I rank them either by the average score of validation data points categorised as positive, or by recall (total positively categorised/total positive validation points)[1]. Then I select the top ranking hyperplane, and use this to label my data with unknown scores.
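A minimal sketch of that ranking step under assumed representations (none of these names come from the question): planes is a list of hyperplanes, each with a normal vector w and offset b, Xval is the validation feature matrix, and score holds the validation scores; recall is computed here as correctly classified positives over all true positives:

classify <- function(pl, X) drop(X %*% pl$w + pl$b) > 0       # TRUE = predicted positive
stats <- sapply(planes, function(pl) {
  pred <- classify(pl, Xval)
  c(mean_score = mean(score[pred]),                           # average score of points called positive
    recall     = sum(pred & score > 0) / sum(score > 0))
})
best <- planes[[which.max(stats["recall", ])]]                # or rank by "mean_score" instead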

My question is: How can I measure the statistical confidence that the finally selected hyperplane is better than random?

I have a vague idea of how to test the significance of a single classifier (using a t-test, maybe?), but I am not sure how this is affected by the classifier being the “best” of several.

[1]: I’m not sure if the choice of ranking scheme between these two makes a difference to the confidence calculation. I haven’t decided which ranking method to use, so I mentioned both as possibilities.


Get this bounty!!!

#StackBounty: #hypothesis-testing #sample-size Intuition – Impact of baseline conversion rate on sample size

Bounty: 50

In an A/B test we calculate the needed sample size before we run the test. The required sample size is dependent on the significance level, the power, the minimum detectable effect (MDE) and the baseline conversion rate.

Let’s say we set those values to

  • Significance level: 5 %
  • Power: 80 %
  • Relative MDE: 2 %

And plug them into a sample size calculator.

For different baseline conversion rates we get different sample sizes. The higher the baseline, the lower the sample size.

  • Baseline 10 %: 354,139 per variant
  • Baseline 20 %: 157,328 per variant
  • Baseline 30 %: 91,725 per variant
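The pattern can be reproduced with base R’s power.prop.test(); the exact counts differ slightly from the calculator quoted above, depending on the approximation used:

sapply(c(0.10, 0.20, 0.30), function(p)
  ceiling(power.prop.test(p1 = p, p2 = p * 1.02,   # 2 % relative MDE
                          sig.level = 0.05, power = 0.80)$n))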

The relative change we are trying to detect stays the same. I am trying to get an intuition for why the required sample size shrinks as the baseline grows.


Get this bounty!!!

#StackBounty: #r #hypothesis-testing #repeated-measures #nested-data #plm comparing groups in repeated measures FE models, with a neste…

Bounty: 50

I have estimated some repeated measures Fixed Effects models, with a nested error component, using plm::plm(). I would now like to

  1. test if the full models are significantly different, i.e. $$H_0: \beta_{Female} = \beta_{Male}$$ where $\beta_{Female}$ is the full model for Females and $\beta_{Male}$ is the full model for Males, and
  2. subsequently test selected regression coefficients between the two groups, i.e. $$H_0: \beta_{Female == year1.5} = \beta_{Male == year1.5}$$ where $\beta_{Female == year1.5}$ is the regression coefficient for females at year1.5, and $\beta_{Male == year1.5}$ is the regression coefficient for males at year1.5.

I will illustrate the situation using the below working example,

First, some packages needed,

# install.packages(c("plm","texreg","tidyverse","lmtest"), dependencies = TRUE)
library(plm); library(lmtest); require(tidyverse)

Second, some data preparation,

data(egsingle, package = "mlmRev")
dta <- egsingle %>% mutate(Female = recode(female, .default = 0L, `Female` = 1L))

Third, I estimate a separate model for each gender in the data:

MoSpc <- as.formula(math ~ Female + size + year)
dfMo = dta %>% group_by(female) %>%
    do(fitMo = plm(update(MoSpc, . ~ . -Female), 
       data = ., index = c("childid", "year", "schoolid"), model="within") )

Fourth, let’s look at the two estimated models:

texreg::screenreg(dfMo[[2]], custom.model.names = paste0('FE: ', dfMo[[1]]))
#> ===================================
#>            FE: Female   FE: Male   
#> -----------------------------------
#> year-1.5      0.79 ***     0.88 ***
#>              (0.07)       (0.10)   
#> year-0.5      1.80 ***     1.88 ***
#>              (0.07)       (0.10)   
#> year0.5       2.51 ***     2.56 ***
#>              (0.08)       (0.10)   
#> year1.5       3.04 ***     3.17 ***
#>              (0.08)       (0.10)   
#> year2.5       3.84 ***     3.98 ***
#>              (0.08)       (0.10)   
#> -----------------------------------
#> R^2           0.77         0.79    
#> Adj. R^2      0.70         0.72    
#> Num. obs.  3545         3685       
#> ===================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05

Now I want to test whether these two (linear OLS) models are significantly different, cf. point 1 above. Looking around SO and the internet, some answers suggest that I need to use plm::pFtest(), also suggested here, which I have tried, but I’m not convinced, and I wonder if someone here has experience with this and could help me.

I tried,

plm::pFtest(dfMo[[1,2]], dfMo[[2,2]])
# >
# > F test for individual effects
# >
# >data:  update(MoSpc, . ~ . - Female)
# >F = -0.30494, df1 = 113, df2 = 2693, p-value = 1
# >alternative hypothesis: significant effects

Second, I am interested in comparing regression coefficients between the two groups, cf. point 2 above. Say, is the Female estimate for year1.5 of 3.04 significantly different from the Male estimate of 3.17?
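One common way to approach point 2 (a suggestion, not something from the question) is a Wald-type z test on the difference of the two coefficients, treating the female and male samples as independent; a sketch using the fits above:

sf <- summary(dfMo[[1, 2]])$coefficients   # Female model (cf. the screenreg output)
sm <- summary(dfMo[[2, 2]])$coefficients   # Male model
z  <- (sf["year1.5", "Estimate"] - sm["year1.5", "Estimate"]) /
      sqrt(sf["year1.5", "Std. Error"]^2 + sm["year1.5", "Std. Error"]^2)
2 * pnorm(-abs(z))                         # two-sided p-value for the year1.5 comparison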

Please ask if any of the above is not clear and I will be happy to elaborate. Any help will be greatly appreciated!

I realize this question is a bit programming-like, but I initially posted it on SO. However, DWin was kind enough to point out that the question belonged on CrossValidated and migrated it here.



Get this bounty!!!

#StackBounty: #hypothesis-testing #loss-functions Utility or loss functions and statistical testing

Bounty: 50

Which would be better: a statistical method that yields false positives on 5% of occasions and false negatives on 20%, or one that yields 10% false positives but only 5% false negatives?

The answer, of course, has to begin with “It depends…”. But what factors is it dependent upon, and how are those factors taken into account in real world application of hypothesis testing?

I would say that it depends upon some kind of utility function that takes into account circumstances and consequences of decisions, but I have never noticed any specification or discussion of such a function in research papers in my area of basic pharmacology, and I assume that they are similarly absent from research papers from many other areas of science. Does that matter?

It would be safe to assume that researchers are responsible for the experimental design and analysis in most of the research papers that I read, but at least sometimes a statistician will be consulted (usually after the data are in hand). Do statisticians discuss loss functions with researchers before advising on or performing a data analysis, or do they just use one that is unconsidered and implicit?


Get this bounty!!!