Dataframe manipulation sequence - GroupBy Agg, Melt, Unstack

Problem

From a Pandas DataFrame, massage the DataFrame into a format where order Count and Total Amount could be determined for each Vendor and each Vendor-Buyer combination.

:::python

>> df = pd.DataFrame(data=
    {'Vendor': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C',
            'D', 'D', 'E', 'E', 'E', 'E', 'E'],      
    'Buyer':['BU1', 'BU2', 'BU2', 'BU2', 'BU2', 'BU1', 'BU2', 'BU2', 'BU2',
            'BU1', 'BU2', 'BU1', 'BU2', 'BU2', 'BU1', 'BU2'],
    'Amount':[1, 5, 10, 20, 40, 103, 234, 13, 15, 30,
        23, 156, 60, 97, 23, 14]})

>> df

   Vendor Buyer  Amount
0       A   BU1       1
1       A   BU2       5
2       A   BU2      10
3       B   BU2      20
4       B   BU2      40
5       C   BU1     103
6       C   BU2     234
7       C   BU2      13
8       C   BU2      15
9       D   BU1      30
10      D   BU2      23
11      E   BU1     156
12      E   BU2      60
13      E   BU2      97
14      E   BU1      23
15      E   BU2      14

What I did

First, compute the aggregate values (Count, Total Amount ($)) for Vendor-Buyer combinations by using GroupBy aggregation (groupby().agg()):

:::python

>> agg_df = \
    df.groupby(
        ['Vendor', 'Buyer']).agg(**{
            'Count': pd.NamedAgg(column='Amount', aggfunc='count'),
            'Total Amount ($)': pd.NamedAgg(column='Amount', aggfunc=np.sum)
        })

>> agg_df

              Count  Total Amount ($)
Vendor Buyer
A      BU1        1                 1
       BU2        2                15
B      BU2        2                60
C      BU1        1               103
       BU2        3               262
D      BU1        1                30
       BU2        1                23
E      BU1        2               179
       BU2        3               171

Reset index of DataFrame in order to sort the data by Buyer and Vendor using sort_values.

>> agg_df_2 = agg_df.reset_index()

>> agg_df_2

    Vendor Buyer  Count  Total Amount ($)
0      A   BU1      1                 1
1      A   BU2      2                15
2      B   BU2      2                60
3      C   BU1      1               103
4      C   BU2      3               262
5      D   BU1      1                30
6      D   BU2      1                23
7      E   BU1      2               179
8      E   BU2      3               171

>> agg_df_3 = agg_df_2.sort_values(['Buyer', 'Vendor'])

>> agg_df_3

    Vendor Buyer  Count  Total Amount ($)
0      A   BU1      1                 1
3      C   BU1      1               103
5      D   BU1      1                30
7      E   BU1      2               179
1      A   BU2      2                15
2      B   BU2      2                60
4      C   BU2      3               262
6      D   BU2      1                23
8      E   BU2      3               171

DataFrame.melt "unpivots" a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis. This leaves two non-identifier columns, 'variable' and 'value', where 'variable' contains the measured variables and 'value' contains the values corresponding to the measured variables.

>> df_melt = agg_df_3.melt(
        id_vars=['Vendor', 'Buyer'],
        value_vars=['Count', 'Total Amount ($)']
        )

>> df_melt

    Vendor Buyer          variable  value
0       A   BU1             Count      1
1       C   BU1             Count      1
2       D   BU1             Count      1
3       E   BU1             Count      2
4       A   BU2             Count      2
5       B   BU2             Count      2
6       C   BU2             Count      3
7       D   BU2             Count      1
8       E   BU2             Count      3
9       A   BU1  Total Amount ($)      1
10      C   BU1  Total Amount ($)    103
11      D   BU1  Total Amount ($)     30
12      E   BU1  Total Amount ($)    179
13      A   BU2  Total Amount ($)     15
14      B   BU2  Total Amount ($)     60
15      C   BU2  Total Amount ($)    262
16      D   BU2  Total Amount ($)     23
17      E   BU2  Total Amount ($)    171

DataFrame.unstack "pivots" a level of the hierarchial index labels to the column axis and returns a reshaped DataFrame with a new level of column labels, whose inner-most level consists of the pivoted index labels. This implies that a DataFrame which does not have indexing on the row axis will be "pivoted" to a Series.

>> df_multiindex = df_melt.set_index(
                    ['Vendor', 'Buyer', 'variable']
                    )

>> df_multiindex

                                value
Vendor Buyer variable
A      BU1   Count                 1
C      BU1   Count                 1
D      BU1   Count                 1
E      BU1   Count                 2
A      BU2   Count                 2
B      BU2   Count                 2
C      BU2   Count                 3
D      BU2   Count                 1
E      BU2   Count                 3
A      BU1   Total Amount ($)      1
C      BU1   Total Amount ($)    103
D      BU1   Total Amount ($)     30
E      BU1   Total Amount ($)    179
A      BU2   Total Amount ($)     15
B      BU2   Total Amount ($)     60
C      BU2   Total Amount ($)    262
D      BU2   Total Amount ($)     23
E      BU2   Total Amount ($)    171

>> df_multiindex.unstack(level=-2)

                             value
Buyer                      BU1    BU2
Vendor variable
A      Count               1.0    2.0
       Total Amount ($)    1.0   15.0
B      Count               NaN    2.0
       Total Amount ($)    NaN   60.0
C      Count               1.0    3.0
       Total Amount ($)  103.0  262.0
D      Count               1.0    1.0
       Total Amount ($)   30.0   23.0
E      Count               2.0    3.0
       Total Amount ($)  179.0  171.0

>>> df_multiindex.unstack(level=-1)

                 value
variable     Count Total Amount ($)
Vendor Buyer
A      BU1       1                1
       BU2       2               15
B      BU2       2               60
C      BU1       1              103
       BU2       3              262
D      BU1       1               30
       BU2       1               23
E      BU1       2              179
       BU2       3              171

>>> df_multiindex.unstack(level=0)

                       value
Vendor                     A     B      C     D      E
Buyer variable
BU1   Count              1.0   NaN    1.0   1.0    2.0
      Total Amount ($)   1.0   NaN  103.0  30.0  179.0
BU2   Count              2.0   2.0    3.0   1.0    3.0
      Total Amount ($)  15.0  60.0  262.0  23.0  171.0

Since I am interested in calculating the overall Count and Total Amount for each Vendor, I pivot the Buyer level to the column axis and retain the Vendor level in the row axis, and calculate the sum along the column axis.

>> df_unstack = df_multiindex.unstack(level=-2)

>> df_unstack['Total'] = df_unstack.sum(axis=1)

>>> df_unstack
                         value         Total
Buyer                      BU1    BU2
Vendor variable
A      Count               1.0    2.0    3.0
       Total Amount ($)    1.0   15.0   16.0
B      Count               NaN    2.0    2.0
       Total Amount ($)    NaN   60.0   60.0
C      Count               1.0    3.0    4.0
       Total Amount ($)  103.0  262.0  365.0
D      Count               1.0    1.0    2.0
       Total Amount ($)   30.0   23.0   53.0
E      Count               2.0    3.0    5.0
       Total Amount ($)  179.0  171.0  350.0

Today I Learnt - Data Engineering Diaries

Dataframe manipulation sequence - GroupBy Agg, Melt, Unstack

Problem

What I did

References

Problem

What I did

References

links

social