We are building a tool to assess the risk of language disorders in children. Our training data set comprises 600 participants, 100 in each of six age brackets (4 to 4.5 years, 4.5 to 5 years, and so on). We have built a report engine that generates more than 100 feature results per participant; examples of features include total words spoken, sentences used, nouns, adverbs, fundamental frequency, and decibel level.
For each feature we have now generated (1) the average and (2) the standard deviation across the 100 participants in each age bracket. As you can imagine, some features are more reliable than others.
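As a minimal sketch of this first step, the per-bracket averages and standard deviations can be computed in one pass over a feature matrix. The data here is randomly generated and the shapes are illustrative, not our actual feature set:

```python
import numpy as np

# Hypothetical stand-in for one age bracket's data:
# 100 participants x 5 features (the real matrix would be 100 x 100+).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 5))

feature_means = X.mean(axis=0)        # average of each feature over the bracket
feature_stds = X.std(axis=0, ddof=1)  # sample standard deviation (ddof=1)

print(feature_means.shape, feature_stds.shape)
```

Using `ddof=1` gives the unbiased sample standard deviation, which is the appropriate estimator when the 100 participants are a sample of the bracket's population.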
Here is what we want to do:
A.1. Identify a mechanism to determine which features are more reliable than others. Our working idea: the smaller a feature's standard deviation, the more tightly clustered its signal, and hence the more reliable the feature. We are open to other methods of quantifying reliability for each feature.
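One caveat with comparing raw standard deviations: the features live on very different scales (word counts vs. Hz vs. dB), so a feature measured in large units will always look "noisier". A common alternative is the coefficient of variation (CV = std / mean), which normalizes dispersion by the mean. A sketch with illustrative numbers:

```python
import numpy as np

# Illustrative per-feature statistics for one age bracket (not real data).
feature_means = np.array([120.0, 15.0, 220.0, 65.0])
feature_stds = np.array([30.0, 2.0, 80.0, 5.0])

# Coefficient of variation: dimensionless, so comparable across features
# that are measured in different units. Smaller CV -> more tightly
# clustered -> treated here as more reliable.
cv = feature_stds / np.abs(feature_means)

reliability_rank = np.argsort(cv)  # indices of features, most reliable first
print(cv.round(3))
print(reliability_rank)
```

With these numbers, feature 3 (CV ≈ 0.077) ranks as most reliable even though its raw standard deviation is larger than feature 1's.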
A.2. Devise a mechanism to assign weights to features according to the reliability found in step A.1.
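One standard way to turn a reliability measure into weights is inverse-variance weighting: weight each feature proportionally to 1/CV², then normalize so the weights sum to 1. This is a sketch of one reasonable choice, not the only one:

```python
import numpy as np

# Coefficients of variation from step A.1 (illustrative values).
cv = np.array([0.25, 0.1333, 0.3636, 0.0769])

# Inverse-variance-style weighting: more tightly clustered features
# (smaller CV) receive larger weights; normalize to sum to 1.
raw = 1.0 / cv**2
weights = raw / raw.sum()

print(weights.round(3))
```

Normalizing to sum to 1 keeps the later aggregate score on a predictable scale regardless of how many features survive the reliability screen.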
A.3. Apply a method to extract an aggregate score across all 100+ features for each age bracket (4 to 4.5 years treated separately from 4.5 to 5 years, and so on) that can be represented on a scale of 1 to 10. For example, children in the 5 to 5.5 bracket might average 5.25 on that scale, while children in the 6 to 6.5 bracket average 6.15.
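One possible aggregation, sketched under the assumption that per-feature z-scores are a sensible common currency: z-score each feature against the bracket's mean and standard deviation, take the reliability-weighted sum, then linearly map the result onto [1, 10] by clipping at ±3 standard units. The function name and cutoffs are our own illustrative choices:

```python
import numpy as np

def aggregate_score(x, means, stds, weights, lo=-3.0, hi=3.0):
    """Weighted z-score aggregate, linearly mapped onto a 1-10 scale."""
    z = (x - means) / stds                 # per-feature z-scores
    agg = float(np.dot(weights, z))       # weighted aggregate, in z-units
    agg = np.clip(agg, lo, hi)            # clip extremes at +/-3 SD
    return 1.0 + 9.0 * (agg - lo) / (hi - lo)

# Illustrative bracket statistics and weights (3 features for brevity).
means = np.array([120.0, 15.0, 220.0])
stds = np.array([30.0, 2.0, 80.0])
weights = np.array([0.3, 0.5, 0.2])

# A child exactly at the bracket averages lands at the midpoint, 5.5.
print(aggregate_score(means, means, stds, weights))
```

With this construction, the training-sample average for every bracket sits at 5.5 by definition; if instead each bracket should have its own average (5.25 vs. 6.15 as in the example above), the features would need to be z-scored against a pooled reference rather than the bracket's own statistics.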
B. Scoring new test-takers on a scale of 1 to 10:
When a new test-taker undertakes the same stimulus test, we can generate the same 100+ features. For each feature individually, we can also compute the difference between the test-taker's value and the average of that feature across the 100 training samples in the matching age bracket.
B.1. We would like a way to aggregate the test-taker's results across all 100+ features onto the same 1-to-10 scale used for the 100 training samples in that age bracket.
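Scoring a new test-taker then reuses the bracket statistics and weights stored from part A. Expressing each per-feature difference in standard-deviation units (a z-score) makes the 100+ features comparable before they are combined; the numbers below are illustrative:

```python
import numpy as np

# Bracket statistics and reliability weights from part A (illustrative).
means = np.array([120.0, 15.0, 220.0])
stds = np.array([30.0, 2.0, 80.0])
weights = np.array([0.3, 0.5, 0.2])

new_taker = np.array([90.0, 14.0, 300.0])  # new test-taker's raw feature values

z = (new_taker - means) / stds             # per-feature deviation in SD units
agg = np.clip(np.dot(weights, z), -3.0, 3.0)
score_1_to_10 = 1.0 + 9.0 * (agg + 3.0) / 6.0

print(round(score_1_to_10, 3))
```

Because the same mapping is applied to test-takers and training samples alike, the new score is directly comparable to the bracket's training-sample distribution on the 1-to-10 scale.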
B.2. We want to display how far the test-taker's performance deviates from the standard performance of the training samples.
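For the deviation display, a simple option is a per-feature table of z-scores, flagging any feature more than 2 standard deviations from the training-sample average. Feature names and values here are purely illustrative:

```python
import numpy as np

# Illustrative feature names, bracket statistics, and a new test-taker.
features = ["total_words", "mean_f0_hz", "sentence_count"]
means = np.array([120.0, 220.0, 15.0])
stds = np.array([30.0, 80.0, 2.0])
new_taker = np.array([45.0, 250.0, 14.0])

z = (new_taker - means) / stds  # signed deviation, in SD units

for name, zi in zip(features, z):
    flag = "  <-- outside 2 SD" if abs(zi) > 2 else ""
    print(f"{name:15s} z = {zi:+.2f}{flag}")
```

The sign of each z-score tells the clinician the direction of the deviation (below or above the bracket norm), which a single aggregate score cannot convey.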