#StackBounty: #classification #dataset #active-learning How to do Data acquisition focused on improving accuracy on hold-out test set?

Bounty: 50

I have the task of building a model with 95% accuracy for a classification problem. I have training data and a hold-out data set, and I have the opportunity to request data of a particular class, with desired characteristics, to achieve this objective.

What method should I use to plan the data acquisition through another team? I am currently at 86% accuracy. I use LightGBM for model development and would consider parameter tuning and ensembling with XGBoost and TabNet, but I think I need better data to achieve higher accuracy. Feature engineering is also in play.

Also note that it is a multi-class classification problem.
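To decide which class to request data for, one option (not from the question, just a common starting point) is to compute per-class recall on the hold-out set: the classes with the lowest recall are the natural targets for acquisition. A minimal sketch in plain Python; the labels below are toy data:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Per-class recall on the hold-out set: classes with low recall
    are the ones to prioritize when requesting new data."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}

# Toy example: class "b" under-performs, so request more "b"-like data.
y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "a", "b", "c", "a", "c"]
print(per_class_recall(y_true, y_pred))  # {'a': 1.0, 'b': 0.333..., 'c': 1.0}
```

The same numbers come out of `sklearn.metrics.classification_report`; the point is to ground the acquisition request in where the current model actually fails.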

Get this bounty!!!

#StackBounty: #dataset #data-cleaning #convolutional-neural-network Do I need to manually trim 300 videos?

Bounty: 50

I wish to train a model that detects the breed of a dog based on video input. I have a dataset containing 10 classes with 30 videos in each class. The problem is that for each of these videos, the dog is not present throughout the course of the video. The following are examples of 2 videos from the dataset:

Video 1: Video of backyard (first 5 seconds) –> Dog appears (15 seconds) –> Video of surrounding buildings (3 seconds)

Video 2: Video of grass (first 8 seconds) –> Dog appears (3 seconds) –> Video of nearby people (4 seconds)

I presume that my CNN would pick up irrelevant features from these segments and hence give incorrect outputs if I trained the model on the videos as is. Do I need to manually trim each of the 300 videos so they show only the part where the dog appears, or is there an easier way to approach this problem?
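An alternative to hand-trimming is to run a pretrained object detector (e.g., any COCO-trained model, where "dog" is a class) over each video's frames and keep only the longest contiguous run of frames where a dog is detected. The per-frame detector call is assumed here; the trimming logic itself is simple:

```python
def longest_true_run(flags):
    """Given per-frame booleans (True = dog detected by some pretrained
    detector), return (start, end) frame indices of the longest run."""
    best = (0, -1)  # empty run
    start = None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    if start is not None and len(flags) - start > best[1] - best[0] + 1:
        best = (start, len(flags) - 1)
    return best

# Video 1 from the question, one flag per second:
# 5 s backyard -> 15 s dog -> 3 s buildings
flags = [False] * 5 + [True] * 15 + [False] * 3
print(longest_true_run(flags))  # (5, 19)
```

The resulting (start, end) interval can then be cut out with any video tool (e.g., OpenCV or ffmpeg), so none of the 300 videos needs manual editing.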

Get this bounty!!!

#StackBounty: #neural-networks #dataset #reinforcement-learning Feeding "parallel" dataset during the training phase

Bounty: 50

I plan to work with Reinforcement Learning to predict stock price movement. For a stock like TSLA, some training features might be the pivot price values and the set of differences between consecutive pivot points.

I would like my model to capture the general essence of the stock market. In other words, if I build my dataset only on TSLA and then try to predict the price movement of FB with that model, it won’t work, for many reasons. So if I want my model to predict the price movement of any stock, I have to build a dataset using all types of stocks. For the purpose of this question, instead of a dataset using all stocks, I will use only three: TSLA, FB and AMZN. I generate two years of data for each of TSLA, FB and AMZN and pass them back to back to my model, so in this example I pass 6 years of data for training. If I start with FB, the model will learn and memorize some patterns from the FB features. The problem is that when the model then trains on the AMZN features, it starts to forget what it learned from the FB dataset.

Is there a way to parallelize the training on several stocks at the same time to avoid this forgetting issue? Instead of my action being a single real value, it would be an action vector whose size depends on the number of parallel stocks.
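A common remedy for the forgetting described above (short of a full parallel action vector) is to stop feeding the stocks back to back and instead interleave them, so every mini-batch mixes samples from all stocks. A rough sketch; the dataset contents are placeholders:

```python
import random

def interleaved_batches(datasets, batch_size, seed=0):
    """Pool samples from all stocks and shuffle, so each training batch
    mixes TSLA, FB and AMZN instead of presenting the stocks back to
    back (sequential presentation invites catastrophic forgetting)."""
    pooled = [(name, x) for name, data in datasets.items() for x in data]
    random.Random(seed).shuffle(pooled)
    return [pooled[i:i + batch_size] for i in range(0, len(pooled), batch_size)]

# Placeholder feature streams standing in for 2 years per stock
datasets = {"TSLA": range(4), "FB": range(4), "AMZN": range(4)}
batches = interleaved_batches(datasets, batch_size=3)
```

Each batch now carries a (stock, sample) tag, so the model sees all three stocks throughout training rather than 2 years of one followed by 2 years of another.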


Get this bounty!!!

#StackBounty: #dataset #object-detection #multi-instance-learning Where can I find free multi-instance single-label datasets for object…

Bounty: 50

I’m trying to find free multi-instance single-label datasets for object detection online.

By "multi-instance and single-label" I mean that each image contains only objects belonging to one class, but can contain more than one object of that class.

I found a lot of datasets for multi-label, but none for single-label.

Any ideas are highly appreciated; thanks in advance!

Get this bounty!!!

#StackBounty: #panel-data #dataset #matching How to get the 3-digit SIC code from Datastream and merge ISIC to SIC?

Bounty: 50

While trying to replicate a paper, I ran into a problem merging two datasets.

When merging an ISIC-based dataset (dataset A) to a SIC-based Datastream dataset (dataset B), I had trouble with the three-digit SIC code.

A paper shows how to match these two datasets:

Export data used to construct Export market leniency laws measure
comes from CEPII TradeProd Database that has bilateral trade flows for
more than 200 countries at ISIC industry level over 1980-2006. We
match them to the three-digit SIC and average over the respective
values within the three-digit SIC in case multiple three-digit ISIC
codes match to three-digit SIC codes.

I have two questions as below:

1> How can I get the 3-digit SIC code from Datastream? Thomson Reuters describes item WC07023 as "SIC Code 3", but when I look at it, I see mainly 4-digit codes. I show a part of my dataset B below:

[screenshot of dataset B, showing columns WC07022 and WC07023]

As can be seen, column WC07023, and even column WC07022 (SIC Code 2), contain 4-digit codes rather than three-digit ones.
And this is the ISIC 3-digit code from dataset A:

[screenshot of dataset A, showing the 3-digit ISIC codes]

2> Could you also explain the author’s approach in this sentence? It is quite ambiguous to me:

We match them to the three-digit SIC and average over the respective
values within the three-digit SIC in case multiple three-digit ISIC
codes match to three-digit SIC codes.
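My reading of the quoted sentence, as a sketch: truncate the 4-digit Datastream codes to their first three digits (question 1), then, wherever several 3-digit ISIC codes map to the same 3-digit SIC code, average the values within that SIC code (question 2). The concordance mapping and values below are entirely hypothetical:

```python
def truncate_to_3_digits(code):
    """If WC07023 returns 4-digit SIC codes, keep the first three digits."""
    return str(code)[:3]

def average_within_sic3(isic_to_sic3, isic_values):
    """When several 3-digit ISIC codes map to the same 3-digit SIC code,
    average their values, as the quoted paper describes."""
    sums, counts = {}, {}
    for isic, value in isic_values.items():
        sic3 = isic_to_sic3[isic]
        sums[sic3] = sums.get(sic3, 0.0) + value
        counts[sic3] = counts.get(sic3, 0) + 1
    return {sic3: sums[sic3] / counts[sic3] for sic3 in sums}

# Hypothetical concordance: ISIC 151 and 152 both fall into SIC 201
mapping = {"151": "201", "152": "201", "153": "202"}
values = {"151": 10.0, "152": 20.0, "153": 5.0}
print(average_within_sic3(mapping, values))  # {'201': 15.0, '202': 5.0}
```

Whether truncation of WC07023 is the correct way to obtain 3-digit SIC codes is an assumption; an official ISIC-SIC concordance table would be the safer source for the mapping itself.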

Get this bounty!!!

#StackBounty: #dataset #data Public dataset for account to account payments

Bounty: 50

I’m looking for a dataset which contains account to account payments (bank transfers). Ideally, this dataset would contain labeled data for transactions or accounts known to be victims of phishing attacks. In this scenario, the account holder enters and authenticates the transfer but has been tricked into making an undesired purchase or into sending the funds to an undesired recipient.

This could be a public dataset, or optionally I could collaborate confidentially on a private dataset and would sign the necessary confidentiality agreements.

I’ve looked at the repositories listed here already:
Publicly Available Datasets

I do know this credit card fraud dataset well, and it’s the closest to what I’m searching for, but it does not meet the requirements above: https://www.kaggle.com/mlg-ulb/creditcardfraud

For experts in this area, a more technical way to describe this fraud scenario is "authorized push payments".

Get this bounty!!!

#StackBounty: #probability #distributions #sampling #dataset How to sample from different datasets such that they have similar distribu…

Bounty: 50

I have data from multiple datasets, summarized in the boxplot below:

[boxplot of the 7 datasets]

In the above figure, I have data from 7 different datasets. I am looking for a sampling strategy without replacement such that the samples from each dataset have similar distributions (for example, the same mean and standard deviation, or some other measure of similarity). It is fine if the strategy discards a number of data points from multiple sites.
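One crude strategy (an illustration, not the only option): restrict every dataset to their common value range, then subsample equal numbers without replacement. Quantile or moment matching would be refinements of the same idea. The values below are toy data:

```python
import random

def sample_common_range(datasets, n_per_set, seed=0):
    """Keep only values inside the range shared by all datasets, then
    draw n_per_set values without replacement from each. This discards
    data (as the question allows) so the remaining samples from every
    site cover the same support."""
    lo = max(min(d) for d in datasets)   # largest minimum
    hi = min(max(d) for d in datasets)   # smallest maximum
    rng = random.Random(seed)
    out = []
    for d in datasets:
        kept = [v for v in d if lo <= v <= hi]
        out.append(rng.sample(kept, min(n_per_set, len(kept))))
    return out

sets = [[1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 7, 8], [2, 3, 4, 5, 9]]
samples = sample_common_range(sets, n_per_set=3)
```

After this step, one could additionally check that the per-site sample means and standard deviations agree within a tolerance, and resample with a different seed if they do not.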

Get this bounty!!!

#StackBounty: #machine-learning #deep-learning #dataset (Crowdsourced) Dataset with label/annotation metadata like duration/quality

Bounty: 50

For a research project, I’m looking for datasets (the more the better), potentially crowdsourced, that go beyond basic feature vectors + labels and additionally include some metadata about the labels. Specifically, I’d like the annotation time, or some other cost metric, and, if the dataset was crowdsourced, the individual labels per annotator. On top of that, some measure of the quality of the labels by different annotators would be helpful as well. The dataset domain is not relevant.

Also I’m not sure if https://datascience.stackexchange.com is the right place to ask for that, but maybe you can help me.

So far I could only find this dataset.

Get this bounty!!!

#StackBounty: #normal-distribution #dataset #stratification Stratified shuffle for normally distributed target variable

Bounty: 50

When splitting data for a classification problem one is advised to use stratified shuffling in case the target variable is skewed toward a certain class. Indeed, Sklearn has a function for that.

Suppose now that we are splitting the data with respect to a target variable T that is normally distributed. Is there any similar tool or technique that could split the data set into train/val sets so that the mean/variance of T is preserved as much as possible?

I understand that for large enough sets this is already the case, but I am interested in practical applications where splitting the data into train/test/val sets skews the mean and variance by a lot.
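In the absence of a dedicated tool, the usual workaround is to discretize T into quantile bins and stratify on the bins; with sklearn/pandas that is roughly `train_test_split(X, y, stratify=pd.qcut(y, n_bins))`. A self-contained sketch of the same idea (bin count and fractions are illustrative):

```python
import random

def stratified_split_continuous(y, n_bins=4, test_frac=0.25, seed=0):
    """Discretize the continuous target into quantile bins, then sample
    the test set within each bin, so train and test preserve the mean
    and spread of y much better than a fully random split."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    size = len(y) // n_bins
    test_idx = []
    for b in range(n_bins):
        end = (b + 1) * size if b < n_bins - 1 else len(y)
        bin_idx = order[b * size:end]               # one quantile bin
        k = max(1, round(len(bin_idx) * test_frac)) # test picks per bin
        test_idx += rng.sample(bin_idx, k)
    train_idx = [i for i in range(len(y)) if i not in set(test_idx)]
    return train_idx, test_idx

rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(100)]
train_idx, test_idx = stratified_split_continuous(y)
```

For small samples, increasing `n_bins` tightens the match between the train and test distributions of T, at the cost of fewer points per bin.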

Get this bounty!!!

#StackBounty: #panel-data #dataset #multilevel-analysis Changing the time metric for longitudinal data

Bounty: 50

I have some longitudinal data. I’ve done longitudinal analysis before, but I have never changed the time metric, so I wanted to run my process by you.

Edits for clarity:
I have repeated measures data collected over about 2 months, but the study has to do with COVID – thus time (and time passing) is an important component. People beginning the study on May 14th, for example, may be quite different from people coming in on June 1st in terms of our variables. I want to restructure the analysis to examine the effects of time: to go from a relatively time-balanced scenario (time 1, time 2, time 3) that is agnostic to the actual intake time, to one that takes into account the specific dates on which each individual’s 5 time points were collected – an individually varying times of observation scenario. I propose restructuring the data by recoding each participant’s 5 timepoints into ‘days since the beginning of the study’ and including that as my time metric. I plan on using a linear mixed-effects model with this new time metric as my ‘time’ covariate.

I go into a few more details of the specific way I want to go about restructuring this below. But TLDR: I want to know a) whether this is defensible and b) whether my method of doing so makes sense below.


5 data collections, spaced equally every 7 days. So t1= intake, t2= day 7, t3 = day 14, t4 = day 21, t5 = day 28.
Sample size ~1500, of course some missing data due to attrition as time goes on.
Participants were allowed to begin the study over the course of approximately a month – and there is a fairly good distribution of intakes across that month where the survey was open.

Instead of analyzing change just across measurement occasion, where the X-axis is t1, t2, t3, t4, t5, I would like to rescale the time metric to capture the actual day within the whole period that data was collected, and to analyze change across time that way rather than being agnostic to the actual date – turning the X-axis into Day 1, Day 2, …, Day 60. This is because I have reason to believe that change on my outcome variable will be a function of time passing.

But as you might imagine, when conceptualized this way (as days), not every day will be common to all participants (i.e., some started on day 3, some on day 30, and everything in between) – more like a time-unstructured data set. Thus I will examine change over time with a growth curve in a mixed-effects model.

Here is how I intend to go about doing this time metric change:
Step 1: create variables that show y scores across all ~60 possible days.
Step 2: recode existing 5 measurement occasion data from each participant into data organized by ‘day’ rather than (t1, t2, t3 ,t4, t5) based on date of intake. E.g., someone who began the study on day 1 has their first timepoint now labelled as ‘day 1 Y’, whereas someone who began the study on day 15 has their first timepoint labelled as ‘day 15 y’ in the data set (and their subsequent timepoints 7 days later i.e., ‘day 21’).
Step 3: restructure data to person period format (using participant IDs).
Step 4: run growth curve (with time now representing day and ranges from 1-60), with intercept and time as random effects using mixed effects model.
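Step 2 above can be expressed as a small helper, treating the intake occasion as offset 0 and each subsequent occasion as 7 days later (the function name and conventions are illustrative, not from any package):

```python
def occasions_to_study_days(intake_day, n_occasions=5, spacing=7):
    """Convert a participant's intake day (their day within the study
    window) and their weekly measurement occasions into 'days since the
    beginning of the study' for use as the time covariate."""
    return [intake_day + spacing * t for t in range(n_occasions)]

# Participant who began the study on day 1:
print(occasions_to_study_days(1))  # [1, 8, 15, 22, 29]
```

Applying this per participant and then stacking the results gives the person-period format of Step 3, with the new study-day variable ready to enter the mixed-effects model as the time covariate.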

TLDR: I want to switch to an ‘individually varying time metric’ (Grim et al., 2017). I’ve recoded my data to change the time-metric from measurement occasion to ‘day’ to capture change over time. Is what I have done appropriate/correct?

OR would it just make more sense to include date (operationalized as day1, day2…etc.) as a covariate using the original metric?

Any help would be very much appreciated!

Below is a visual example of what I did, using some made-up random numbers:

[table showing the recoding from the five measurement occasions to study days]

Then pairwise restructure.

Get this bounty!!!