python-3.x - How to check whether all elements of a list are present in a Pandas column

I have a DataFrame and a list:


import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                   'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [], ['c', 'a', 'd'], ['c', 'd'], ['a']]})



names = ['a','c']



I want to keep only the rows in which both 'a' and 'c' are present in the char column (order does not matter).

The expected output is:


        char  id
1  [a, b, c]   2
2     [a, c]   3
5  [c, a, d]   6



I tried:


true_indices = []
for idx, row in df.iterrows():
    # keep the index when every name occurs in this row's list
    if all(name in row['char'] for name in names):
        true_indices.append(idx)

ids = df[df.index.isin(true_indices)]



This gives the correct output, but it is too slow for large datasets, so I am looking for a more efficient solution.


You can iterate over the rows of df.char and keep those for which names is a subset:


names = set(['a', 'c'])
# boolean mask: True where every name occurs in the row's list
m = [names.issubset(i) for i in df.char.values.tolist()]

print(df[m])



   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]
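The reason this works without converting each row is that set.issubset accepts any iterable, so the lists stored in char can be passed in directly. A minimal sketch in plain Python, using a few rows copied from the sample data (the variable name rows is only for illustration):

```python
names = {'a', 'c'}

# A few rows from the sample data; set.issubset accepts any iterable,
# so plain lists can be passed without converting them to sets first.
rows = [['a', 'b'], ['a', 'b', 'c'], ['c', 'a', 'd'], []]

# True only when every element of names appears in the row;
# the order of elements inside the row does not matter.
mask = [names.issubset(row) for row in rows]
print(mask)  # [False, True, True, False]
```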



Using pd.Series.apply (df['char'] is a Series):


df[df['char'].apply(lambda x: set(names).issubset(x))]



Output:


   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]



Try this:


import numpy as np

# rows lacking 'a' or 'c' get NaN in char and are then dropped
df['char'] = df['char'].apply(lambda x: x if ('a' in x and 'c' in x) else np.nan)
print(df.dropna())



Output:


   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]
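One thing to be aware of with this approach is that it overwrites df['char'], and df.dropna() also drops rows that have NaN in any other column. A minimal sketch of a non-destructive variant, assuming the same sample data and the same hard-coded letters, builds a boolean mask instead:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'],
             [], ['c', 'a', 'd'], ['c', 'd'], ['a']],
})

# Build a boolean Series instead of replacing values with NaN,
# so the original column is left untouched.
mask = df['char'].apply(lambda x: 'a' in x and 'c' in x)

# Select the matching rows; df itself still has all 8 rows.
result = df[mask]
print(result)
```

Because nothing is written back to df, the same frame can be filtered again with a different set of letters.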



Using a list comprehension with issubset:


mask = [set(names).issubset(x) for x in df['char']]
df = df[mask]
print(df)


   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]



Another solution, with Series.map:


df = df[df['char'].map(set(names).issubset)]
print(df)


   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]
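The sample data has a list in every row. If the column may also contain missing values (an assumption, not something stated in the question), passing issubset directly to map would fail with a TypeError on non-iterable entries. A minimal sketch of one way to guard against that, where contains_all is a hypothetical helper name:

```python
import pandas as pd

names = {'a', 'c'}

# Illustrative data with a missing entry in the char column.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'char': [['a', 'b', 'c'], None, ['c', 'd']],
})

def contains_all(value):
    # Hypothetical helper: anything that is not a list, tuple or set
    # (for example None or NaN) is treated as "no match".
    return isinstance(value, (list, tuple, set)) and names.issubset(value)

result = df[df['char'].map(contains_all)]
print(result)
```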



Performance depends on the number of rows and the number of matched values:


df = pd.concat([df] * 10000, ignore_index=True)



In [270]: %timeit df[df['char'].apply(lambda x: set(names).issubset(x))]
45.9 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [271]: %%timeit
     ...: names = set(['a','c'])
     ...: [names.issubset(set(row)) for _,row in df.char.iteritems()]
     ...:
46.7 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [272]: %%timeit
     ...: df[[set(names).issubset(x) for x in df['char']]]
     ...:
45.6 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [273]: %%timeit
     ...: df[df['char'].map(set(names).issubset)]
     ...:
18.3 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [274]: %%timeit
     ...: n = set(names)
     ...: df[df['char'].map(n.issubset)]
     ...:
16.6 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



In [279]: %%timeit
     ...: names = set(['a','c'])
     ...: m = [names.issubset(i) for i in df.char.values.tolist()]
     ...:
19.2 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
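The timings above come from an IPython session. To repeat the comparison as a plain script, something like the following sketch could be used. It relies on the standard-library timeit module instead of the %timeit magic, builds the set once for every variant, and leaves out the Series.iteritems variant because that method was removed in pandas 2.0 (Series.items is its replacement). The labels and the candidates dictionary are illustrative names, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'],
             [], ['c', 'a', 'd'], ['c', 'd'], ['a']],
})
# Same scaling as in the timings above: 8 rows repeated 10000 times.
df = pd.concat([df] * 10000, ignore_index=True)

names = ['a', 'c']
n = set(names)  # build the set once, outside the timed code

# Each candidate returns the filtered DataFrame.
candidates = {
    'apply + lambda': lambda: df[df['char'].apply(lambda x: n.issubset(x))],
    'list comprehension': lambda: df[[n.issubset(x) for x in df['char']]],
    'map(n.issubset)': lambda: df[df['char'].map(n.issubset)],
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats with 3 calls each, reported in ms per call.
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    timings[label] = best * 1000
    print(f'{label}: {timings[label]:.1f} ms per call')
```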



...