#StackBounty: #r #statistics #distribution #sampling #statistical-sampling How to implement rejection sampling in R?

Bounty: 200

I have a dataset of genes, each with its gene length. I want to sample from these genes according to their gene length distribution using rejection sampling, because the dataset contains too many genes that are too short to enter further analysis (but I can't justify simply setting a cut-off to remove them). I have one dataset of genes with gene lengths to sample from, and another dataset of gene lengths that I want to use as the proposal distribution for the rejection sampling of the first dataset.

An example of my data looks like this:

#df1 data to sample from:
Gene  Length
Gene1  5
Gene2  6
Gene3  400000
Gene4  1000
Gene5  25000
Gene6  10
Gene7  50
Gene8  4
Gene9  100
Gene10 2000

My proposal dataset:

Gene  Length
Gene1  5000
Gene2  60000
Gene3  400000
Gene4  1000
Gene5  25000
Gene6  10000
Gene7  50000
Gene8  4000
Gene9  1000
Gene10 2000

I don’t have any statistics background, and I am trying to perform rejection sampling (my overall aim is to obtain a sample of genes containing fewer extremely short ones to take on to further analysis).

To do rejection sampling I am trying this from a tutorial I found here:

X = df1$Length
U = df2$Length

accept = c()
count = 1

pi_x <- function(x) {
  (3/2)*(x^3) + (11/8)*(x^2) + (1/6)*x + (1/12)
}

while (count <= 50 & length(accept) < 50) {
  test_u = U[count]
  test_x = pi_x(X[count]) / (3.125 * dunif(X[count], 0, 1))
  if (test_u <= test_x) {
    accept = rbind(accept, X[count])
  }
  count = count + 1
}
My problem with this is that it only selects 25 genes (my ideal sampling range for further analysis is 50-100 genes), and most of those 25 genes are still very short after sampling. Do I need to transform X in some way before running this rejection sampling code? My actual df1 has 800 genes with a skewed, beta-like distribution of gene lengths (most are very short). Or am I completely missing something else in my understanding? Any guidance at all would be appreciated.

Input data:

df1 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5L, 
6L, 400000L, 1000L, 25000L, 10L, 50L, 4L, 100L, 2000L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

df2 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5000L, 
60000L, 400000L, 1000L, 25000L, 10000L, 50000L, 40000L, 1000L, 2000L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))


I have also tried:

sampled <- data.frame(proposal = df2$Length)
sampled$targetDensity <- dbeta(sampled$proposal, 3,6)

maxDens = max(sampled$targetDensity, na.rm = T)
sampled$accepted = ifelse(df1$Length < sampled$targetDensity / maxDens, TRUE, FALSE)

hist(sampled$proposal[sampled$accepted], freq = F, col = "grey", breaks = 100)
curve(dbeta(x, 3,6),0,1, add =T, col = "red")

But I am certain I am not using dbeta() correctly, since sampled$targetDensity comes out as all zeros. Is there a way I can fix this? I have tried changing the parameters of dbeta() but haven’t had any success.
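The zeros come from the fact that a Beta density is zero outside [0, 1], while the gene lengths are in the thousands. As a cross-check on the logic (independent of the R code), here is a minimal pure-Python sketch of the same idea: min-max scale the lengths onto [0, 1] so that a Beta(3, 6) target (the one tried with dbeta() above) has non-zero density, then accept each candidate with probability f(x)/M. The scaling choice and the constants here are illustrative assumptions, not part of the original tutorial.

```python
import random

def beta36_pdf(x):
    # Beta(3, 6) density; B(3, 6) = 2!*5!/8! = 1/168, so f(x) = 168 x^2 (1-x)^5
    return 168.0 * x**2 * (1.0 - x)**5 if 0.0 <= x <= 1.0 else 0.0

def rejection_sample(lengths, n_keep, seed=1, max_tries=100000):
    rng = random.Random(seed)
    lo, hi = min(lengths), max(lengths)
    m = max(beta36_pdf(i / 1000.0) for i in range(1001))  # envelope constant M
    kept = []
    tries = 0
    while len(kept) < n_keep and tries < max_tries:
        tries += 1
        x = rng.choice(lengths)                 # draw a candidate gene length
        x01 = (x - lo) / (hi - lo)              # min-max scale onto [0, 1]
        if rng.random() < beta36_pdf(x01) / m:  # accept with prob f(x01)/M
            kept.append(x)
    return kept
```

Because the acceptance probability follows the target density, very short genes (scaled near 0, where the Beta(3, 6) density vanishes) are rarely accepted, which is the behavior the question is after.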

Get this bounty!!!

#StackBounty: #php #math #statistics #logic Scoring system for collection of numbers (serial#)

Bounty: 50

I’m looking for a formula (PHP) that I can use to assign a score based on the rarity of a number inside a collection, Serial #1 being the rarest.

For example, I have a few collections:

  • Collection 1 – Total number of items = 10 (Serial #1 to Serial #10)
  • Collection 2 – Total number of items = 100 (Serial #1 to Serial #100)
  • Collection 3 – Total number of items = 3500 (Serial #1 to Serial #3500)

Based on these 3 example sets, Collection 1 is considered the rarest collection because only 10 items are available in the set.

In addition, each item is assigned its serial number, #1 being the best (rarest).

Here is the table of how I visualize the scoring system.

Collection 1

| Serial#| Score  |
|:------:| :-----:|
| 1      | 1000   |
| 2      | 900    |
| 3      | 800    |
| 4      | 700    |
| 5      | 600    |
| 6      | 500    |
| 7      | 400    |
| 8      | 300    |
| 9      | 200    |
| 10     | 100    |

Collection 2

| Serial#| Score |
|:------:| :----:|
| 1      | 800   |
| 2      | 700   |
| 3      | 600   |
| 4      | 500   |
| ...    | ...   |
| 99     | 12    |
| 100    | 10    |

I just made up those scores just for representation.

With the tables above, Serial #1 in Collection 1 has a higher score compared to Serial #1 in Collection 2 because of the rarity of Collection 1.

  1. "I have a few collections ranging from 1 of a kind (rarest of them all), 10, 20, 150, 350, 1000, 5000, 10000, 50000. What score is this item supposed to get then?"

  For the rarest 1-of-1, the score will be based on the scores of the other Serial #1s. If, for example, Serial #1 in the collection with 10 items gets 1000 points, then I can give the 1-of-1 5000 points (this one doesn't need to be calculated).

  2. "Are all the scores inside a collection supposed to add up to the same value?"

  No. A collection doesn't need to add up to the same value as the other collections.

  3. "Your 'made up' scores aren't really helpful in explaining what kind of scoring logic you want to apply here in the first place."

  In the example tables (made-up scores), I just want to show that the serial numbers score differently: a higher serial number gets a lower score than a lower serial number, and the scores also differ from other collections with the same serial number. That is why I categorized them by collection. Collections with lower item counts are considered rarer than those with higher item counts.

  4. "But Serial #1 is supposed to have a higher score than Serial #2 inside the collection? If so, then what would that be based on? Just the ascending order of those #1, #2, etc.?"

  Maybe it will be based on the ascending order.

  5. "All 10 items in collection 1 could get the same score?"

  I don’t have any preference on the exact score a serial number gets. Serial #1 can get X score as long as it is relative to the rarity of the collection. Serial #1's score will not be the same across collections unless they belong to the same collection rarity.
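One formula consistent with these answers (the constant and the linear descent are my own illustrative assumptions, not something specified above) weights each collection by 1/size and then descends linearly by serial number. Sketched in Python; it ports to PHP line for line:

```python
def score(serial, collection_size, base=10000):
    # Collection rarity weight is 1/collection_size; within a collection
    # the score descends linearly from serial #1 down to the last serial.
    # base is a free scaling constant.
    return base * (collection_size - serial + 1) / collection_size ** 2
```

With base = 10000 this happens to reproduce the Collection 1 table exactly (Serial #1 scores 1000, Serial #10 scores 100), and Serial #1 of a 100-item collection scores 100, below Serial #1 of the rarer 10-item collection. The PHP equivalent would be `$score = $base * ($size - $serial + 1) / ($size ** 2);`.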

Get this bounty!!!

#StackBounty: #postgresql #statistics #database-size How do I measure the overhead of a Postgres extended statistics object?

Bounty: 50

Some solutions to poor query performance suggest creating extended statistics objects to help the query planner better understand the shape of the data. Because these statistics objects are not created by default, I assume that they have some non-negligible overhead.

What kinds of overhead are present, what statistics object parameters factor in to the overhead, and can I quantitatively measure what the overhead of a specific statistics object is?

Some possible kinds of overhead I’ve thought of:

  • Increased CPU usage
    • when inserting
    • when querying
    • when analyzing
  • Increased memory usage
  • Increased disk usage
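For the disk-usage component specifically, one rough way to measure it (assuming PostgreSQL 12+, where the built statistics live in the pg_statistic_ext_data catalog) is to ask pg_column_size() about each stored datum. A sketch that just assembles that catalog query; the helper function is hypothetical, not a packaged tool:

```python
def stats_overhead_query():
    """Returns SQL that reports the stored size, in bytes, of each
    extended-statistics object's built data (PostgreSQL 12+)."""
    return """
    SELECT s.stxname,
           pg_column_size(d.stxdndistinct)    AS ndistinct_bytes,
           pg_column_size(d.stxddependencies) AS dependencies_bytes,
           pg_column_size(d.stxdmcv)          AS mcv_bytes
    FROM pg_statistic_ext s
    JOIN pg_statistic_ext_data d ON d.stxoid = s.oid;
    """
```

CPU overhead at analyze time can be bracketed by timing ANALYZE with and without the statistics object, and planning-time overhead shows up in the planning-time line of EXPLAIN ANALYZE.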

Get this bounty!!!

#StackBounty: #google-apps #statistics #regression-analysis Perform a multiple linear regression in Google Sheets

Bounty: 100

I’m attempting to run a multiple linear regression in Google Sheets, but the only add-on I can find to do this (XLMiner) has been disabled by Google pending review. It also looks like the review process has taken over 8 months, so I don’t anticipate it will be available anytime soon.

Are there alternative add-ons, or a formula I could create in the sheet itself in order to perform a multiple linear regression?
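The arithmetic behind multiple linear regression is just the normal equations, which Google Sheets can evaluate with its built-in matrix functions (MMULT, MINVERSE, TRANSPOSE) or, more directly, with LINEST, which accepts multiple predictor columns. A minimal sketch of that arithmetic in pure Python (the helper name `linreg` is mine):

```python
def linreg(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.
    X is a list of predictor rows; a 1 is prepended for the intercept."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Build the normal equations A b = c.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Solve by Gauss-Jordan elimination (X'X is positive definite,
    # so the pivots are non-zero without row swaps).
    for i in range(k):
        p = A[i][i]
        A[i] = [v / p for v in A[i]]
        c[i] /= p
        for j in range(k):
            if j != i:
                f = A[j][i]
                A[j] = [vj - f * vi for vj, vi in zip(A[j], A[i])]
                c[j] -= f * c[i]
    return c  # [intercept, coef1, coef2, ...]
```

In a sheet, the equivalent would be something like `=LINEST(A2:A100, B2:D100)`, or the explicit normal-equations form `=MMULT(MINVERSE(MMULT(TRANSPOSE(X), X)), MMULT(TRANSPOSE(X), Y))` with X including a column of 1s for the intercept.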

Get this bounty!!!

#StackBounty: #machine-learning #statistics #random-forest #prediction Understand the equations of quantile regression forest (Meinshau…

Bounty: 50

I am trying to implement a quantile regression forest (https://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf).

But, I have some difficulties to understand how the quantiles are computed. I will try to summarize the part of interest in order to then explain exactly what I don’t understand.

Let there be $n$ independent observations $(X_i, Y_i)$. A tree $T$ parametrized with a realization $\theta$ of a random variable $\Theta$ is denoted by $T(\theta)$.

  • Grow $k$ trees $T(\theta_t)$, $t = 1, \ldots, k$, as in random forests. However, for every leaf of every tree, take note of all observations in this leaf, not just their average.
  • For a given $X = x$, drop $x$ down all trees. Compute the weight $\omega_i(x, \theta_t)$ of observation $i \in \{1, \ldots, n\}$ for every tree as in (4). Compute the weight $\omega_i(x)$ for every observation $i \in \{1, \ldots, n\}$ as an average over $\omega_i(x, \theta_t)$, $t = 1, \ldots, k$, as in (5).
  • Compute the estimate of the distribution function as in (6) for all $y \in \mathbb{R}$.

Where equations (4), (5), (6) are given below.

$$ \omega_i(x, \theta_t) = \frac{\mathbf{1}\{ X_i \in R(x, \theta_t) \}}{\#\{ j : X_j \in R(x, \theta_t) \}} \quad (4)$$

$$ \omega_i(x) = k^{-1} \sum_{t=1}^k \omega_i(x, \theta_t) \quad (5)$$

$$ \hat{F}(y \mid X = x) = \sum_{i=1}^n \omega_i(x) \, \mathbf{1}\{Y_i \leq y\} \quad (6) $$

Where $R(x, \theta_t)$ denotes the rectangular area corresponding to the unique leaf of the tree $T(\theta_t)$ that $x$ belongs to.

I can compute (4) and (5), but I don’t understand how to compute (6) and then estimate quantiles. I would also add that I don’t see where the "all observations in leaves" from the first step of the algorithm are used.

Can someone give me some pointers for understanding this algorithm? Any help would be appreciated.
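For what it's worth, (6) is just a weighted empirical CDF, and the stored per-leaf observations enter only through the weights: in (4), every training point that lands in the same leaf as $x$ gets weight one over the leaf size. A quantile estimate is then the smallest $Y_i$ at which this CDF reaches level $q$. A minimal sketch, assuming the weights from (5) are already computed:

```python
def weighted_quantile(Y, w, q):
    """Eq. (6): F(y) = sum_i w_i * 1{Y_i <= y}; the q-quantile estimate
    is min{ y : F(y) >= q }, searched over the observed Y_i."""
    pairs = sorted(zip(Y, w))   # scan observations in increasing order of Y
    acc = 0.0
    for y, wi in pairs:
        acc += wi               # F evaluated at this Y_i
        if acc >= q:
            return y
    return pairs[-1][0]         # guard in case weights sum to slightly < 1
```

With uniform weights this reduces to the usual empirical quantile, which is a quick way to sanity-check an implementation.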

Get this bounty!!!

#StackBounty: #charts #libreoffice #libreoffice-calc #cells #statistics Libre Office – Single Column Pie Chart

Bounty: 50

I want to create a pie chart for a single column that looks like this:

I tried to select the range and click Insert -> Chart -> Pie Chart, but regardless of what I try, I always get an empty chart.
Not quite sure what I am doing wrong.

Can someone guide me through the steps to create a pie chart that shows what percentage is Open-Source, what percentage is proprietary, and what percentage is ??? (unknown)? Empty cells should be ignored.

Also, is it possible to process the entry that has both Open-Source and Proprietary as one of each? Basically, treat it as two entries, if that makes sense.
I know there are already plenty of posts about similar things, but most of them address multi-column data, and the others didn’t help me much.

Get this bounty!!!

#StackBounty: #intelligence #statistics #iq Why does IQ have a truncated normal distribution?

Bounty: 50

I know that the IQ statistic is designed to have a mean of 100, and that you’ll certainly never find someone with an IQ below 1 or above 300, but that tells us very little about the variance or general shape of the distribution. So why is it that every graph of IQ scores I’ve seen appears to be a truncated normal distribution? Is it some property of the test design, some property of the test subjects, or some deep theorem of statistics that I’ve overlooked?

Get this bounty!!!

#StackBounty: #python #pandas #statistics #scipy A/B testing using chi-square to calculate the significance in an elegant way

Bounty: 50

The definition of ABtest


For the input data, each pair of rows is a group, such as rows 0 and 1, or rows 2 and 3. I want to test whether each group is significant using a chi-square test.

  • If the result is significant and every value in the group is greater than or equal to 5, it returns the p-value; otherwise it returns ‘not significant’.
  • If any value in the group is less than 5, it returns ‘sample size too small’.

My question is:

  • My code seems long and inelegant, and I want to make it simpler.

Notice: please don’t use the column grp as raw input to rewrite this with a groupby method, because the raw data is not actually named a1 ca1, b1 cb1. But I make sure each pair of rows (e.g. 0 and 1, 2 and 3) is a group.


import numpy as np
from scipy.stats import chi2_contingency
from itertools import chain, repeat

def abtest(df, tot, convert):
    df = df.copy()
    df['nonconvert'] = df[tot] - df[convert]
    grp = np.split(df.index.values, df.index.size // 2)
    rst = []
    for g in grp:
        obs = df.loc[g][[convert, 'nonconvert']].values
        if (obs >= 5).all():
            _, p, _, _ = chi2_contingency(obs)
            if p < 0.05:
                rst.append(p)
            else:
                rst.append('not significant')
        else:
            rst.append('sample size too small')
    rate = tot + '_' + convert + '_' + 'test'
    df[rate] = list(chain.from_iterable(zip(*repeat(rst, 2))))
    del df['nonconvert']
    return df

df = abtest(df=df,tot='tot',convert='read')
df = abtest(df=df,tot='tot',convert='lgn')
df = abtest(df=df,tot='lgn',convert='read')


   grp    tot  lgn  read
0   a1   6300  360    30
1  ca1   2300   60     7
2   b1  26300  148     6
3  cb1  10501   15     3
4   c1  74600   36     2
5  cc1  26000    6     1


   grp    tot  lgn  read          tot_read_test     tot_lgn_test          lgn_read_test
0   a1   6300  360    30        not significant      4.68208e-09        not significant
1  ca1   2300   60     7        not significant      4.68208e-09        not significant
2   b1  26300  148     6  sample size too small      7.01275e-08  sample size too small
3  cb1  10501   15     3  sample size too small      7.01275e-08  sample size too small
4   c1  74600   36     2  sample size too small  not significant  sample size too small
5  cc1  26000    6     1  sample size too small  not significant  sample size too small
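One way to shorten the function is to factor the per-pair decision rule into a standalone helper, so the branching reads in one place. The sketch below also swaps chi2_contingency for the closed-form 2x2 chi-square with Yates' continuity correction (scipy's default for 2x2 tables); the helper name `classify` is mine:

```python
from math import erfc, sqrt

def classify(pair, alpha=0.05):
    """pair: [[conv1, nonconv1], [conv2, nonconv2]].
    Mirrors the rules above: a cell below 5 -> 'sample size too small',
    else the p-value when significant, else 'not significant'."""
    (a, b), (c, d) = pair
    if min(a, b, c, d) < 5:
        return 'sample size too small'
    n = a + b + c + d
    # 2x2 chi-square with Yates' continuity correction.
    num = n * max(abs(a * d - b * c) - n / 2.0, 0.0) ** 2
    chi2 = num / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2.0))  # chi-square survival function for df = 1
    return p if p < alpha else 'not significant'
```

On the sample data this reproduces the output above, e.g. the a1/ca1 pair for tot vs read is 'not significant', while tot vs lgn for the same pair yields a p-value around 5e-9.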

Get this bounty!!!

#StackBounty: #user-management #statistics Getting dates associated to acct sa output stats

Bounty: 150

I’m trying to get stats on global CPU usage of a server on a per-user basis so I installed GNU Accounting (acct) (on Ubuntu server 16.04).

When asking for stats with sa -m, it takes data from the current /var/log/account/pacct, which is rotated by the acct service. I understand that I can give the archived rotated files to it as input if I want statistics on older entries.

The problem is that sa gives no dates on the statistics that are displayed, and I couldn’t find a way to add dates. Having them would allow me to get statistics on a per-day or per-week basis.

Get this bounty!!!

#StackBounty: #python #statistics #visualization #simulation Find if the two datasets are close to each other

Bounty: 50

I have the following three datasets.


data_a is the real data and the other two are simulated. Here I am trying to check which one (data_b or data_c) is closest to, or most closely resembles, data_a.
Currently I am doing it visually and with the ks_2samp test (Python).


I graphed the CDF of the real data vs the CDF of the simulated data and tried to judge visually which one is the closest.

cdf of data_a vs data_b

Above is the cdf of data_a vs cdf of data_b

Above is the cdf of data_a vs cdf of data_c

So from visual inspection one might say that data_c is closer to data_a than data_b is, but this is still not rigorous.

KS Test

The second method is the KS test, where I tested data_a against data_b as well as data_a against data_c.

>>> stats.ks_2samp(data_a,data_b)
Ks_2sampResult(statistic=0.5923076923076923, pvalue=0.02134674813035231)
>>> stats.ks_2samp(data_a,data_c)
Ks_2sampResult(statistic=0.4692307692307692, pvalue=0.11575018162481227)

From the above we can see that the statistic is lower when we test data_a against data_c, so data_c should be closer to data_a than data_b is. I didn’t consider the p-value, as it would not be appropriate to treat this as a hypothesis test and use the p-value obtained, because the test is designed with the null hypothesis predetermined.
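As a sanity check on the numbers above, the KS statistic itself is easy to reproduce by hand: it is the largest vertical gap between the two empirical CDFs. A dependency-free sketch:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(s, x):  # fraction of sample s that is <= x
        return bisect.bisect_right(s, x) / len(s)

    # The gap can only change at observed values, so scanning them suffices.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))
```

On identical samples it returns 0.0 and on fully separated ones 1.0; getting a smaller value for (data_a, data_c) than for (data_a, data_b) reproduces the ks_2samp ordering above.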

So my question here is whether I am doing this correctly, and also whether there is a better way to do it. Thank you!

Get this bounty!!!