NumPy: How to split a dataset (array) into training and test datasets, e.g. for cross-validation?

How can I randomly split a NumPy array into training and test/validation datasets?


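If you want to split the dataset into two parts, you can use numpy.random.shuffle, which shuffles the rows in place. If you need to keep track of the indices, use numpy.random.permutation instead: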


import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or


import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
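Keeping the indices is mostly useful when a separate label array has to be split the same way as the features. The following is a minimal sketch of that idea, not part of the original answer: the helper name `split_train_test`, the `test_fraction` and `seed` parameters, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np

def split_train_test(x, y, test_fraction=0.2, seed=0):
    """Shuffle row indices once and apply the same split to x and y."""
    rng = np.random.default_rng(seed)      # seeded generator for reproducibility
    indices = rng.permutation(x.shape[0])  # random ordering of the row indices
    n_test = int(round(test_fraction * x.shape[0]))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (hypothetical data)
y = np.arange(10)                  # one label per sample
x_train, x_test, y_train, y_test = split_train_test(x, y, test_fraction=0.2)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```

Because a single permutation indexes both arrays, every row of x stays paired with its label after the split.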

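There are many ways to repeatedly partition the same dataset for cross-validation. One strategy is to resample from the dataset with replacement: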


import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]
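Note that the snippet above draws the training and test indices independently, so the same row can land in both sets. A common variant, which is a swapped-in technique and not what the answer itself does, is a bootstrap split with an out-of-bag test set: the training rows are drawn with replacement and the test set is whatever was never drawn. A minimal sketch, assuming the seed and array shape shown:

```python
import numpy as np

rng = np.random.default_rng(0)    # seed chosen only to make the sketch repeatable
x = rng.random((100, 5))          # x is your dataset
n = x.shape[0]

# Draw n row indices with replacement: some rows repeat, others are never picked.
train_idx = rng.integers(0, n, size=n)
# Out-of-bag rows: indices that never appear in train_idx.
test_idx = np.setdiff1d(np.arange(n), train_idx)

training, test = x[train_idx], x[test_idx]
print(training.shape, test.shape)
```

The size of the test set varies from draw to draw, which is why no fixed output is shown here.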

Finally, scikit-learn (formerly scikits.learn) contains several cross-validation methods (k-fold, leave-n-out, stratified-k-fold, ...).
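To illustrate what a k-fold scheme does, here is a rough NumPy-only sketch. It is not the library's implementation: `k_fold_indices` is a hypothetical helper, and it assumes k is at least 2.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation (k >= 2)."""
    rng = np.random.default_rng(seed)
    # Shuffle the indices once, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]                                    # held-out fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # all other folds
        yield train_idx, test_idx

x = np.random.rand(100, 5)   # x is your dataset
for train_idx, test_idx in k_fold_indices(x.shape[0], k=5):
    training, test = x[train_idx], x[test_idx]
    print(training.shape, test.shape)  # (80, 5) (20, 5) on every iteration
```

Each row ends up in the test set exactly once, which is the property that distinguishes k-fold from repeated random resampling.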

Another option is to use scikit-learn. As described on the scikit wiki, you can just use the following instructions:


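import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)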

I wrote a function for my own project (it doesn't use numpy, though):


def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        # every chunks-th element, starting at offset i
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result
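One way such a function could be used for cross-validation is to hold out one chunk at a time and train on the rest. The sketch below is an assumption about the intended use, not something the answer spells out: `cross_validation_splits` is a hypothetical helper, and `partition` is restated compactly so the block runs on its own.

```python
def partition(seq, chunks):
    """Split seq into `chunks` lists by taking every chunks-th element."""
    return [list(seq[i::chunks]) for i in range(chunks)]

def cross_validation_splits(seq, chunks):
    """Yield (training, test) pairs, holding out one chunk at a time."""
    parts = partition(seq, chunks)
    for i, test in enumerate(parts):
        # Flatten every chunk except the held-out one into the training list.
        training = [item for j, part in enumerate(parts) if j != i for item in part]
        yield training, test

data = list(range(10))   # hypothetical toy data
for training, test in cross_validation_splits(data, 5):
    print(training, test)
# first line printed: [1, 6, 2, 7, 3, 8, 4, 9] [0, 5]
```

Since partition does not shuffle, ordered data should be shuffled first (for example with random.shuffle) so that the chunks are not biased by the original ordering.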

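You can also do a stratified split into training and test sets, so that both sets better reflect the class proportions of the original dataset.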


import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and
    testing sets are preserved (stratified sampling).
    '''

    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        # indices of all observations belonging to this class
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))

        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True

    return train_inds, test_inds

y = np.array([1, 1, 2, 2, 3, 3])
train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:


[1 2 3]
[1 2 3]

...