Is there a common method used to determine how many training samples are required to train a classifier (an LDA in this case) to obtain a minimum threshold generalization accuracy?
I am asking because I would like to minimize the calibration time usually required in a brain-computer interface.
The search term you are looking for is "learning curve", which gives the (average) model performance as function of the training sample size.
Learning curves depend on a lot of things, e.g. the classification method, the complexity of the classifier, and how well the classes are separated.
(I think for two-class LDA you may be able to derive some theoretical power calculations, but the crucial fact is always whether your data actually meets the "equal COV multivariate normal" assumption. I'd go for some simulations, both under the LDA assumptions and by resampling your already existing data.)
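A minimal sketch of how such a learning curve can be estimated empirically, assuming scikit-learn is available; the synthetic two-class data here is purely illustrative and stands in for real calibration recordings:

```python
# Sketch: empirical learning curve for an LDA classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import learning_curve

# Illustrative stand-in for real (e.g. BCI calibration) data.
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=4, random_state=0)

# Cross-validated accuracy as a function of the training sample size.
train_sizes, _, test_scores = learning_curve(
    LinearDiscriminantAnalysis(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, scores in zip(train_sizes, test_scores):
    print(f"n = {n:3d}: accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across the cross-validation folds (the `+/-` term) gives a first impression of the model-to-model variation at each sample size, not only the average performance.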
There are two aspects to the performance of a classifier trained on a finite sample size n: the average (expected) performance, and the variation in performance between models trained on different samples of that size.
Another aspect that you may need to take into account is that it is usually not enough to train a good classifier: you also need to prove that the classifier is good (or good enough). So you should also plan the sample size needed for validation with a given precision. If you need to give these results as a fraction of successes among so many test cases (e.g. producer's or consumer's accuracy / precision / sensitivity / positive predictive value), and the underlying classification task is rather easy, this can require more independent cases than the training of a good model does.
As a rule of thumb, for training, the sample size is usually discussed in relation to model complexity (number of cases : number of variates), whereas absolute bounds on the test sample size can be given for a required precision of the performance measurement.
Here's a paper where we explain these things in more detail and also discuss how to construct learning curves:
Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.

This is the "teaser", showing an easy classification problem (we actually have one easy distinction like this in our classification problem, but other classes are far more difficult to distinguish):
We did not try to extrapolate to larger training sample sizes to determine how many more training cases are needed, because the test sample sizes are our bottleneck, and larger training sample sizes would let us construct more complex models, so extrapolation is questionable. For the kind of data sets I have, I'd approach this iteratively: measure a batch of new cases, show how much things improved, measure more cases, and so on.
This may be different for you, but the paper contains literature references to papers using extrapolation to higher sample sizes in order to estimate the required number of samples.
Asking about training sample size implies you are going to hold back data for model validation. This is an unstable process requiring a huge sample size. Strong internal validation with the bootstrap is often preferred. If you choose that path, you only need to compute one sample size. As @cbeleites so nicely stated, this is often an "events per candidate variable" assessment, but you need a minimum of 96 observations to accurately predict the probability of a binary outcome even if there are no features to be examined [this is to achieve a margin of error of 0.1, with 0.95 confidence, in estimating the actual marginal probability that Y=1].
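The bootstrap validation mentioned above can be sketched as an optimism-corrected estimate (Efron-style); scikit-learn is assumed, and the LDA model and toy data are illustrative rather than anything from this thread:

```python
# Sketch: optimism-corrected bootstrap validation, as an alternative
# to holding back a separate test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Apparent performance: trained and evaluated on the same data (too optimistic).
apparent = LinearDiscriminantAnalysis().fit(X, y).score(X, y)

optimism = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))              # bootstrap resample
    model = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    # Optimism = performance on the resample minus performance on the originals.
    optimism.append(model.score(X[idx], y[idx]) - model.score(X, y))

corrected = apparent - np.mean(optimism)
print(f"apparent {apparent:.3f}, optimism-corrected {corrected:.3f}")
```

All n cases are used for both model building and validation, which is what lets this approach get by with a smaller total sample than a train/test split.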
It is important to consider proper scoring rules for accuracy assessment (e.g., Brier score and log likelihood/deviance). Also make sure you really want to classify observations as opposed to estimating membership probability. The latter is almost always more useful as it allows a gray zone.
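A small sketch of why proper scoring rules matter, assuming scikit-learn's metrics; the toy labels and probabilities are made up to show the contrast with plain thresholded accuracy:

```python
# Sketch: proper scoring rules vs. accuracy on predicted probabilities.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_true = np.array([0, 0, 1, 1])
p_sharp = np.array([0.1, 0.2, 0.8, 0.9])     # confident, well-calibrated
p_hedged = np.array([0.4, 0.45, 0.55, 0.6])  # same ranking, but hedged

# Both give identical accuracy at a 0.5 threshold...
print(((p_sharp > 0.5) == y_true).mean(), ((p_hedged > 0.5) == y_true).mean())
# ...but the proper scores (lower = better) distinguish the calibration.
print(brier_score_loss(y_true, p_sharp), brier_score_loss(y_true, p_hedged))
print(log_loss(y_true, p_sharp), log_loss(y_true, p_hedged))
```

Working with the probabilities directly also gives you the gray zone mentioned above: cases with membership probability near 0.5 can be flagged as "don't know" instead of being forced into a class.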