pandas - How to sort a Pandas DataFrame by a value inside a JSON field in Python


I have data like this in a Pandas DataFrame:


id     import_id            investor_id  loan_id     meta
35736  unremit_loss_100312  Q05          0051765139  {u'total_paid': u'75', u'total_expense': u'75'}
35737  unremit_loss_100313  Q06          0051765140  {u'total_paid': u'77', u'total_expense': u'78'}
35739  unremit_loss_100314  Q06          0051765141  {u'total_paid': u'80', u'total_expense': u'65'}



How can I sort the rows by the total_expense value, that is, by the total_expense key inside the meta field?

The output should be:


id     import_id            investor_id  loan_id     meta
35739  unremit_loss_100314  Q06          0051765141  {u'total_paid': u'80', u'total_expense': u'65'}
35736  unremit_loss_100312  Q05          0051765139  {u'total_paid': u'75', u'total_expense': u'75'}
35737  unremit_loss_100313  Q06          0051765140  {u'total_paid': u'77', u'total_expense': u'78'}




Setup and preprocessing


import ast
import numpy as np

# Parse the strings into dicts only if the column has not been parsed yet
if isinstance(df.at[0, 'meta'], str):
    df['meta'] = df['meta'].map(ast.literal_eval)
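The preprocessing step can be tried on a small frame built from the rows in the question. This is a minimal sketch that assumes the meta column was loaded as strings (for example from a CSV); the column layout is copied from the question, while the way the frame is constructed is only an illustration.

```python
import ast

import pandas as pd

# Rows copied from the question. Storing 'meta' as strings is an assumption
# about how the data was loaded.
df = pd.DataFrame({
    "id": [35736, 35737, 35739],
    "import_id": ["unremit_loss_100312", "unremit_loss_100313", "unremit_loss_100314"],
    "investor_id": ["Q05", "Q06", "Q06"],
    "loan_id": ["0051765139", "0051765140", "0051765141"],
    "meta": [
        "{u'total_paid': u'75', u'total_expense': u'75'}",
        "{u'total_paid': u'77', u'total_expense': u'78'}",
        "{u'total_paid': u'80', u'total_expense': u'65'}",
    ],
})

# Convert the string representation into real dicts, but only when needed.
if isinstance(df.at[0, "meta"], str):
    df["meta"] = df["meta"].map(ast.literal_eval)
```

After this step each cell of meta is a dict, so the lookups in the solutions below can read total_expense directly.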



Series.str.get with Series.argsort


df.iloc[df['meta'].str.get('total_expense').astype(int).argsort()]



   id     import_id            investor_id  loan_id   meta
2  35739  unremit_loss_100314  Q06          51765141  {'total_paid': '80', 'total_expense': '65'}
0  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '75'}
1  35737  unremit_loss_100313  Q06          51765140  {'total_paid': '77', 'total_expense': '78'}
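To see why argsort combined with iloc reorders the rows, it may help to look at the positions argsort returns. The sketch below uses a stand-in frame with a plain numeric column named expense (a hypothetical name used only for this illustration).

```python
import numpy as np
import pandas as pd

# Stand-in data: 'expense' plays the role of the extracted total_expense values.
df = pd.DataFrame({"id": [35736, 35737, 35739], "expense": [75, 78, 65]})

# argsort returns the positions that would put the values in ascending order.
order = np.argsort(df["expense"].to_numpy())

# iloc selects rows by position, so this yields the frame in sorted order.
sorted_df = df.iloc[order]
```

Here order holds the positions 2, 0, 1, which is exactly the index order seen in the output above.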



List comprehension


df.iloc[np.argsort([int(x.get('total_expense', '-1')) for x in df['meta']])]



   id     import_id            investor_id  loan_id   meta
2  35739  unremit_loss_100314  Q06          51765141  {'total_paid': '80', 'total_expense': '65'}
0  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '75'}
1  35737  unremit_loss_100313  Q06          51765140  {'total_paid': '77', 'total_expense': '78'}



If you need to handle NaN/missing data, use:


u = [
    int(x.get('total_expense', '-1')) if isinstance(x, dict) else -1
    for x in df['meta']
]
df.iloc[np.argsort(u)]



   id     import_id            investor_id  loan_id   meta
2  35739  unremit_loss_100314  Q06          51765141  {'total_paid': '80', 'total_expense': '65'}
0  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '75'}
1  35737  unremit_loss_100313  Q06          51765140  {'total_paid': '77', 'total_expense': '78'}
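As a variation on the missing-data handling, one could keep missing entries as NaN instead of substituting -1 and let pandas decide where they go. This is only a sketch: the helper expense_of and the sample rows are made up for illustration, and it places rows without a usable value last rather than first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "meta": [
        {"total_paid": "75", "total_expense": "75"},
        {"total_paid": "75"},                        # key missing
        np.nan,                                      # whole entry missing
        {"total_paid": "80", "total_expense": "65"},
    ],
})


def expense_of(x):
    """Return total_expense when x is a dict that has it, otherwise None."""
    return x.get("total_expense") if isinstance(x, dict) else None


# Missing values become NaN after the numeric conversion.
expense = pd.to_numeric(df["meta"].map(expense_of), errors="coerce")

# Sort the numeric values, pushing NaN to the end, then reorder the frame.
result = df.loc[expense.sort_values(na_position="last").index]
```

Passing na_position="first" instead would mimic the behaviour of the -1 placeholder.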



Use:


print(df)
   id     import_id            investor_id  loan_id   meta
0  35736  unremit_loss_100312  Q05          51765139  {u'total_paid': u'75', u'total_expense': u'75'}
1  35736  unremit_loss_100312  Q05          51765139  {u'total_paid': u'75', u'total_expense': u'20'}
2  35736  unremit_loss_100312  Q05          51765139  {u'total_paid': u'75', u'total_expense': u'100'}

import ast

df['meta'] = df['meta'].apply(ast.literal_eval)

df = df.iloc[df['meta'].str['total_expense'].astype(int).argsort()]

print(df)
   id     import_id            investor_id  loan_id   meta
1  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '20'}
0  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '75'}
2  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '100'}
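The astype(int) step matters because the values inside meta are strings, and strings sort character by character. A small sketch with the same three values shows the difference (the variable names are illustrative only):

```python
import pandas as pd

expense = pd.Series(["75", "20", "100"])

# Sorting the raw strings compares them lexicographically.
as_text = expense.sort_values().tolist()

# Converting to integers first gives the numeric order.
as_int = expense.astype(int).sort_values().tolist()
```

Sorted as text, "100" comes before "20"; sorted as integers, 100 comes last as expected.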



If some rows lack the total_expense key, replace the missing values with an integer such as -1 so that those rows sort to the first position:


print(df)
   id     import_id            investor_id  loan_id   meta
0  35736  unremit_loss_100312  Q05          51765139  {u'total_paid': u'75', u'total_expense': u'75'}
1  35736  unremit_loss_100312  Q05          51765139  {u'total_paid': u'75', u'total_expense': u'20'}
2  35736  unremit_loss_100312  Q05          51765139  {u'total_paid': u'75'}

df['meta'] = df['meta'].apply(ast.literal_eval)

df = df.iloc[df['meta'].str.get('total_expense').fillna(-1).astype(int).argsort()]
print(df)
   id     import_id            investor_id  loan_id   meta
2  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75'}
1  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '20'}
0  35736  unremit_loss_100312  Q05          51765139  {'total_paid': '75', 'total_expense': '75'}



Another solution:


df['new'] = df['meta'].str.get('total_expense').astype(int)
df = df.sort_values('new').drop('new', axis=1)
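A similar result can probably be had without a temporary column by passing a key function to sort_values. This is a sketch that assumes pandas 1.1 or later, where the key parameter is available, and that every row has a total_expense entry; the sample frame is reduced to two columns for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [35736, 35737, 35739],
    "meta": [
        {"total_paid": "75", "total_expense": "75"},
        {"total_paid": "77", "total_expense": "78"},
        {"total_paid": "80", "total_expense": "65"},
    ],
})

# The key function receives the 'meta' column as a Series and must return
# a Series of the same length holding the values to sort by.
result = df.sort_values(
    "meta",
    key=lambda s: s.map(lambda d: int(d["total_expense"])),
)
```

The frame keeps its original columns, so there is nothing to drop afterwards.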




df = pd.concat([df, df['meta'].apply(pd.Series)], axis = 1).drop(columns ='meta').sort_values(by = 'total_expense')



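df['meta'].apply(pd.Series) turns the dicts in the meta column into a DataFrame of their own. We can concatenate it with the original frame, drop the meta column (since it is now redundant), and then sort the rows by total_expense. The expanded values are still strings, so convert them to numbers first if a numeric order is needed.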
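The expansion can be sketched on a small frame as follows. The sample values are chosen so that the numeric conversion makes a visible difference; the pd.to_numeric step is an addition for illustration and is not part of the one-liner above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "meta": [
        {"total_paid": "75", "total_expense": "75"},
        {"total_paid": "75", "total_expense": "20"},
        {"total_paid": "75", "total_expense": "100"},
    ],
})

# One column per dict key, aligned on the same index as df.
expanded = df["meta"].apply(pd.Series)

# The dict values are strings; convert before sorting numerically.
expanded["total_expense"] = pd.to_numeric(expanded["total_expense"])

result = (
    pd.concat([df, expanded], axis=1)
    .sort_values(by="total_expense")
    .drop(columns="meta")
)
```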

Edit:


df = pd.concat([df, df['meta'].apply(pd.Series)], axis = 1).sort_values(by = 'total_expense').drop(columns = ['total_paid', 'total_expense'])



If you want the result to look like the original, just drop the concatenated columns after sorting.

Edit 2:

Found a better way to do this without using apply:


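from pandas.io.json import json_normalize  # pandas 1.0 and later expose this as pd.json_normalize

# Parentheses let the method chain span several lines
df = (
    pd.concat([df, json_normalize(df['meta'])], axis=1)
    .sort_values(by='total_expense')
    .drop(columns=['total_paid', 'total_expense'])
)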
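One detail worth watching with this approach is index alignment: the normalized frame gets a fresh 0..n-1 index, while pd.concat with axis=1 aligns on index labels. The sketch below assumes pandas 1.0 or later (for pd.json_normalize) and uses a deliberately non-default index to show the realignment; the integer conversion is an added step so the sort is numeric.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [35736, 35737, 35739],
        "meta": [
            {"total_paid": "75", "total_expense": "75"},
            {"total_paid": "77", "total_expense": "78"},
            {"total_paid": "80", "total_expense": "65"},
        ],
    },
    index=[10, 11, 12],  # non-default index, chosen for illustration
)

# Flatten the dicts; passing a plain list works across pandas versions.
flat = pd.json_normalize(df["meta"].tolist())

# Reuse the original index so that concat lines the rows up correctly.
flat.index = df.index
flat["total_expense"] = flat["total_expense"].astype(int)

result = (
    pd.concat([df, flat], axis=1)
    .sort_values(by="total_expense")
    .drop(columns=list(flat.columns))
)
```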



Using a regular expression:


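# Columns in the copied text are separated by two or more whitespace characters
df = pd.read_clipboard(sep=r'\s\s+')
pattern = r"""u'total_expense': u'([0-9.]+)'"""
# meta is still a string here, so the value is pulled out with a capture group
df['total_expense'] = df.meta.str.extract(pattern)
df.sort_values('total_expense')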
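Since read_clipboard depends on what is currently on the clipboard, the regular-expression approach may be easier to try on a frame built by hand. In this sketch the frame construction, expand=False and the float conversion are assumptions added for illustration; the pattern is the one from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [35736, 35737, 35739],
    "meta": [
        "{u'total_paid': u'75', u'total_expense': u'75'}",
        "{u'total_paid': u'77', u'total_expense': u'78'}",
        "{u'total_paid': u'80', u'total_expense': u'65'}",
    ],
})

pattern = r"u'total_expense': u'([0-9.]+)'"

# expand=False returns a Series for a single capture group; converting to
# float makes the sort numeric rather than character by character.
df["total_expense"] = df["meta"].str.extract(pattern, expand=False).astype(float)

result = df.sort_values("total_expense")
```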



Using apply:


df['total_expense'] = df.meta.apply(eval).apply(
    lambda x: x.get('total_expense', -1))
df.sort_values('total_expense')



Output:


   id     import_id            investor_id  loan_id   meta                                             total_expense
2  35739  unremit_loss_100314  Q06          51765141  {u'total_paid': u'80', u'total_expense': u'65'}  65
0  35736  unremit_loss_100312  Q05          51765139  {u'total_paid': u'75', u'total_expense': u'75'}  75
1  35737  unremit_loss_100313  Q06          51765140  {u'total_paid': u'77', u'total_expense': u'78'}  78



...