Skip to content

Instantly share code, notes, and snippets.

@conormm
Last active December 3, 2024 17:11
Show Gist options
  • Save conormm/fd8b1980c28dd21cfaf6975c86c74d07 to your computer and use it in GitHub Desktop.
Save conormm/fd8b1980c28dd21cfaf6975c86c74d07 to your computer and use it in GitHub Desktop.
R to Python: Data wrangling with dplyr and pandas

R to python data wrangling snippets

The dplyr package in R makes data wrangling significantly easier. The beauty of dplyr is that, by design, the options available are limited. Specifically, a set of key verbs form the core of the package. Using these verbs you can solve a wide range of data problems effectively in a shorter timeframe. Whilse transitioning to Python I have greatly missed the ease with which I can think through and solve problems using dplyr in R. The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package).

dplyr is organised around six key verbs:


  • filter : subset a dataframe according to condition(s) in a variable(s)
  • select : choose a specific variable or set of variables
  • arrange : order dataframe by index or variable
  • group_by : create a grouped dataframe
  • summarise : reduce variable to summary variable (e.g. mean)
  • mutate : transform dataframe by adding new variables

The excellent pandas package in Python easily allows you to implement all of these actions (and much, much more!). Below are some snippets to highlight some of the more basic conversions.

@Conormacd

June 8th 2018 update: All of this code still works in pandas and should ease the transition from R, but for those interested in getting the most out of the package I strongly recommend this series on modern pandas https://tomaugspurger.github.io/modern-1-intro.html

Filter

R

filter(df, var > 20000 & var < 30000) 
filter(df, var == 'string') # df %>% filter(var != 'string')
df %>% filter(var != 'string')
df %>% group_by(group) %>% filter(sum(var) > 2000000)

Python

df[(df['var'] > 20000) & (df['var'] < 30000)]
df[df['var'] == 'string']
df[df['var'] != 'string']
df.groupby('group').filter(lambda x: sum(x['var']) > 2000000)

Select

R

select(df, var1, var2)
select(df, -var3)

Python

df[['var1', 'var2']]
df.drop('var3', 1)

Arrange

R

arrange(df, var1)
arrange(df, desc(var1))

Python

df.sort_values('var1')
df.sort_values('var1', ascending=False)

Grouping

R

df %>% group_by(group) 
df %>% group_by(group1, group2)
df %>% ungroup()

Python

df.groupby('group1')
df.groupby(['group1', 'group2'])
df.reset_index() / or when grouping: df.groupby('group1', as_index=False)

Summarise / Aggregate df by group

R

df %>% group_by(group) %>% summarise(mean_var1 = mean(var1))
df %>% 
  group_by(group1, group2) %>% 
  summarise(mean_var1 = mean(var1), 
            sum_var1 = sum(var1), 
            count_var1 = n())
                                              
df %>% 
  group_by(group1, group2) %>% 
  summarise(mean_var1 = mean(var1),
            sum_2 = sum(var2),
            var3 = first(var3))

Python

df.groupby('group1')['var1'].agg({'mean_col' : np.mean()}) # pass dict to specifiy column name

df.groupby(['group1', 'group2'])['var1].agg(['mean', 'sum', 'count']) # for count also consider 'size'. size will return n for NaN values also, whereas 'count' will not.

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({
  "var1" : ["mean"], 
  "var2" : ["sum"], 
  "var3" : ["first"]
  })
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})

Mutate / transform df by group

R

df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))

Python

df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1)

Distinct

R

df %>% distinct()
df %>% distinct(col1) # returns dataframe with unique values of col1

Python

df.drop_duplicates()
df.drop_duplicates(subset='col1') # returns dataframe with unique values of col1

Sample

R

sample_n(df, 100)
sample_frac(df, 0.5)

Python

df.sample(100)
df.sample(frac=0.5)       
@vincentei
Copy link

vincentei commented Jun 11, 2018

Nice work!

line 77:

df.groupby(['group1', 'group2'])['var1]'].agg(['mean', 'sum', 'count'])

should it be?

df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count'])

@adrienpacifico
Copy link

The spirit of filter(df, var > 20000 & var < 30000) in R would be closer to df.query("(var>2000) & (var < 3000)") in pandas than to
df[(df['var'] > 20000) & (df['var'] < 30000)] (see pandas comparison with R doc ).

@conormm big thank you for that gist !

@conormm
Copy link
Author

conormm commented Jul 13, 2018

hey @adrienpacifico Yes, agree on .query() - it wasn't a function when I put this together. I'll update this soon.

@adrienpacifico
Copy link

Great, btw it would be nice to have performance comparison between the two solutions (I'm a pandas user, I don't know a lot about R an dplyer ;-) ).

@rfche704
Copy link

rfche704 commented Aug 11, 2018

Thanks so much - really helpful post!!!
With this and GGplot for python - a huge amount of the R functionality can be used

@astrowonk
Copy link

Super useful. Thank you!

@plnnr
Copy link

plnnr commented Sep 3, 2018

Thank you so much! I'm learning Python after having just learned R, and this is really helpful to quickly get me off the ground and keep me feeling motivated! Also wow @rfche704, I had no idea ggplot was available in Python :)

@pybokeh
Copy link

pybokeh commented Sep 25, 2018

Could always use plotnine https://github.com/has2k1/plotnine Also by same repo, there is dplyr clone: https://github.com/has2k1/plydata

@Dhirendramohanjha
Copy link

Thank You so much! This is very useful us

@sanjaynagi
Copy link

As an R user that is Learning Python, this is so useful. Thanks.

@JerryWho
Copy link

JerryWho commented Jun 1, 2019

Thank you so much!
How do I summarize several values, i.e.

df %>% group_by(group1, group2) %>% summarise(mean_var1 = mean(var1),
sum_2 = sum(var2),
var3 = first(var3))

@conormm
Copy link
Author

conormm commented Jun 5, 2019

In pandas:

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean"], "var2" : ["sum"], "var3" : ["first"]})
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})

@epetrovski
Copy link

df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1)

Unfortunately, I don't think this will work since grouped data frames do not have an .assign() method. The same thing could be done with .apply() however. And if you want to reassign the resulting column to the original data frame - like dplyr does - you wold also need to do .reset_index(drop = True).

@ffgama
Copy link

ffgama commented Dec 19, 2019

Congrats

@manuel-zambelli-sdg
Copy link

I love you

@sandropenha
Copy link

Thank you!
I'm a R user and this is helping me a lot in python.

@paladinic
Copy link

thumbs up!!

@stephzhang3
Copy link

stephzhang3 commented Jul 8, 2020

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:

df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)

But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

@epetrovski
Copy link

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:

df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)

But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.

A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

@samukweku
Copy link

In pandas:

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean"], "var2" : ["sum"], "var3" : ["first"]})
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})

In Pandas 1.0, you can use named aggregation, reducing the number of steps for new column names:

df.groupby(["group1", "group2"]).agg(var1_mean=("var1","mean"), var2_sum=("var2","sum),var3_first=("var3","first"))

@samukweku
Copy link

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:
df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)
But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.

A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

You dont need pipe for this, simply transform the column and assign to the new column name :

   df['mean_var1'] = df.groupby('group').var1.transform('mean')

@epetrovski
Copy link

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:
df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)
But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.
A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

You dont need pipe for this, simply transform the column and assign to the new column name :

   df['mean_var1'] = df.groupby('group').var1.transform('mean')

Fair point. But using .pipe() will allow you to transform several columns at once making it analogous to mutate() which the post is about.

@samukweku
Copy link

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:
df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)
But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.
A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

You dont need pipe for this, simply transform the column and assign to the new column name :

   df['mean_var1'] = df.groupby('group').var1.transform('mean')

Fair point. But using .pipe() will allow you to transform several columns at once making it analogous to mutate() which the post is about.

I respectfully disagree. I'll show some examples below on what I believe should be the way in Pandas to do it; please correct me where the code is wrong or suboptimal :

print(df )
      tag     val1    val2
0	B	0	0
1	A	1	2
2	A	2	4
3	B	3	6
4	B	4	8

grp = df.groupby("tag")

Let's create new columns and assign to the dataframe :

df.assign(val1_mean=grp["val1"].transform("mean"), 
          val2_sum=grp["val2"].transform("sum")
          )
	tag	val1	val2	val1_mean	val2_sum
0	B	0	0	2.333333	14
1	A	1	2	1.500000	6
2	A	2	4	1.500000	6
3	B	3	6	2.333333	14
4	B	4	8	2.333333	14

If you are applying the same function to the columns, you can use the unpacking method :
df[["val1_mean", "val2_mean"]] = grp[["val1", "val2"]].transform("mean")

print(df)

tag	val1	val2	val1_mean	val2_mean
0	B	0	0	2.333333	4.666667
1	A	1	2	1.500000	3.000000
2	A	2	4	1.500000	3.000000
3	B	3	6	2.333333	4.666667
4	B	4	8	2.333333	4.666667

I do not see the need for pipe for these scenarios. Pipe, in my opinion, comes into play when you need to reuse the groupby object. like
grp.pipe(lambda x: x.sum() - x.mean()).

The groupby object is reused in the sum and mean before the difference is applied. I dont see the effectiveness of calling transform inside pipe when transform can do the job easily. Of course, you may have a use case where pipe does the job better than transform - I would love to learn about that. Cheers.

@epetrovski
Copy link

@samukweku, you say that "Pipe, in my opinion, comes into play when you need to reuse the groupby object."

So in my use case I need to do something like this, where I take the sum of one variable and add to it the minimum of another but using the same grouping:

df['res'] = df.groupby('id').pipe(lambda x: x.var1.transform('sum') + x.var2.transform('min'))

Anyways, this is going into too much depth here. I think it's fair to say that there are several ways of accomplishing a group_by() %>% mutate() in pandas but df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1) which the post suggests isn't one of them and that's the most important point.

@samukweku
Copy link

Ahh! @epetrovski, totally correct there. Your example makes sense. Cheers.

@minhsphuc12
Copy link

Hi
df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1)
would result in
AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'
Can you update your code? @conormm

@pwwang
Copy link

pwwang commented Apr 30, 2021

datar ports dplyr and other packages to python for you.

It follows the original R packages' API design.

@minhsphuc12
With datar, you can do:

from datar import f
from datar.dplyr import group_by, mutate
from datar.base import mean

df >> group_by(f.group) >> mutate(mean_var1 = mean(f.var1))
# compared to the R syntax:
# df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))

@roboes
Copy link

roboes commented Jun 27, 2021

For "Mutate / transform df by group" I do the following:

# df example
df = pd.DataFrame(data = {'group': ['A', 'A', 'B', 'B'], 'var1': [1, 4, 2, 2]})

# Mutate / transform df by group
df.assign(mean_var1 = lambda x: x.groupby('group')['var1'].transform('mean'))

@samukweku
Copy link

@RobertoG09, i think selecting var1 before transform might be faster, since you are transforming one column, as against all columns before selecting your traget column : df.assign(mean_var1 = lambda x: x.groupby('group')['var1'].transform('mean'))

@roboes
Copy link

roboes commented Aug 3, 2021

@samukweku this makes sense. I have updated the original comment to reflect your feedback. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment