
CDC in Pandas - Or how do I get the last relevant row?

Updated: Nov 23, 2022

If you have a CDC (change data capture) scenario in pandas and want the latest, most up-to-date row per entity, this post shows how to get it.

CDC means that you have an initial dataset and, on top of it, a series of deltas (changes). The terminology comes from database management (Change_data_capture).

You can use the following example to do so:


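import pandas as pd
# Load the data: the initial rows plus the later change rows
df = pd.read_csv('/Users/geva/Documents/data/business-price-indexes-september-2022-quarter-csv.csv', low_memory=False)
# True where a row's Period equals the latest Period of its Series_reference
idx = df.groupby('Series_reference')['Period'].transform('max') == df['Period']
# Keep only the most recent row(s) per Series_reference
df[idx]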
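
To try the idea without the CSV file, here is a minimal sketch on a small made-up DataFrame. The values and the `value` column are assumptions for illustration only; `Series_reference` and `Period` are the column names from the example. It also shows an alternative (sort, then drop duplicates) that is not part of the original example.

```python
import pandas as pd

# Made-up CDC data: initial rows plus later changes for the same entities
df = pd.DataFrame({
    'Series_reference': ['A', 'A', 'B', 'B', 'B'],
    'Period': [2022.03, 2022.06, 2022.03, 2022.06, 2022.09],
    'value': [100, 110, 200, 210, 220],  # hypothetical column
})

# Boolean mask: True where the row's Period is the max for its entity
idx = df.groupby('Series_reference')['Period'].transform('max') == df['Period']
latest = df[idx]

# Alternative: sort by Period and keep the last row per entity.
# Unlike the mask, this keeps exactly one row per entity even when
# two rows share the same latest Period.
latest_alt = (
    df.sort_values('Period')
      .drop_duplicates('Series_reference', keep='last')
)

print(latest)
```

The mask version keeps every row that ties on the latest `Period`, so pick the variant that matches how your deltas are delivered.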

Please note: sometimes the date of the row is not in the data itself. In that case you can take it from the date of the file or the S3 bucket and add it to the data as a column (thanks to Gidi K. for supplying the use case).
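
As a rough sketch of that idea for a local file, the snippet below adds the file's modification time as a new column. The helper name `read_with_file_date`, the `file_date` column, and the demo data are all hypothetical, and it assumes the modification time is a good enough proxy for the date of the rows.

```python
import os
import tempfile

import pandas as pd


def read_with_file_date(path):
    """Read a CSV and add the file's modification time as 'file_date'."""
    df = pd.read_csv(path)
    # Assumption: the file's modification time stands in for the row date
    df['file_date'] = pd.Timestamp(os.path.getmtime(path), unit='s')
    return df


# Tiny demo with a throwaway file (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'delta.csv')
    pd.DataFrame(
        {'Series_reference': ['A', 'B'], 'value': [1, 2]}
    ).to_csv(path, index=False)
    demo = read_with_file_date(path)

print(demo)
```

After concatenating several such files, `file_date` could take the place of `Period` in the groupby shown earlier.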


the full example can be found here:






©2019 by Big Data. Proudly created with Wix.com
