Retention with Google Analytics Data API and plotly

User retention is one of the most important metrics for any app. To define retention we need to define a cohort. In the case of weekly retention, a cohort is all users who started using the app in a given week, and the users will be retained only if they are actively using the app the following weeks. So if 100 users started using the app last week and from those users only 50 of them have used the app this week, we’d have a 50% retention for last week’s cohort.

Google Analytics exposes the retention data through the GA4 API, which we will use for this blog. Here’s the Quickstart Guide to set up the API access and credentials. Once the API is set up, we’ll install the Python client library google-analytics-data and we can start making API calls. Note that this will only work for a GA4 property, if you have a UA property, see this guide.

First we need to import the Client and the Types we need for a weekly retention report. We also need pandas for dataframe manipulation and plotly.express for plotting.

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import RunReportRequest
from google.analytics.data_v1beta.types import DateRange
from google.analytics.data_v1beta.types import Dimension
from google.analytics.data_v1beta.types import Metric
from google.analytics.data_v1beta.types import CohortSpec, CohortsRange, Cohort
import pandas as pd
import plotly.express as px

Once we have all imports ready, we need to set up the client and our report request.

property_id = "YOUR-PROPERTY-ID"
client = BetaAnalyticsDataClient(credentials=your_credentials)
end_offset = 5
# Week starts on sunday, ends on saturday
saturdays = pd.date_range('2022-01-01', dt.date.today(), freq='W-SAT')
end_dates = [saturday.strftime("%Y-%m-%d") for saturday in saturdays]
start_dates = [(saturday - dt.timedelta(6)).strftime("%Y-%m-%d") for saturday in saturdays]
cohorts = [Cohort(name=start,
                  dimension='firstSessionDate',
                  date_range=DateRange(start_date=start, end_date=end))
           for start, end in zip(start_dates, end_dates)]
request = RunReportRequest(
    property=f"properties/{property_id}",
    dimensions=[Dimension(name="cohort"), Dimension(name='cohortNthWeek')],
    metrics=[Metric(name="cohortActiveUsers")],
    cohort_spec=CohortSpec(cohorts=cohorts,
		     cohorts_range=CohortsRange(granularity='WEEKLY',
                            end_offset=end_offset),
        )
    )

The RunReportRequest constructor will construct a report request with the parameters we supply. For weekly retention we need the cohortNthWeek dimension and the cohortActiveUsers metric. Next we need to define the cohort specification. for that we use the CohortSpec object. We supply the cohorts we are interested in using the cohorts parameter. This takes a list of Cohort objects that we define as follows. We want to display a 5 week retention and the week starts on Sunday and ends on Saturday. Our first week will start on 26-12-2021. From that we can create start and end dates we need for the date_range parameter. The cohort dimension we’re interested in is the firstSessionDate and the name of the cohorts will be the start date of each cohort, which we will use later on to index our dataframe. For the cohorts_range parameter we need to specify the granularity and the end_offset. The granularity can be DAILY ,WEEKLY or MONTHLY. We will use WEEKLY as that’s the granularity we’re interested in. The end_offset will be 5 weeks, so we will see at most 5 weeks of retention for each cohort. Once we’re done setting up the report we can run the report. The response will be limited to 10000 rows unless we set the limit parameter, which can be set to 100000 at most. For more rows, use pagination.

response = client.run_report(request)

The response will be in JSON format and will include rows of the resulting report. Each row will have the metrics and dimensions for each cohort and for each date range. In order to visualise the retention data we need a dataframe indexed by the cohort start date and with columns from Week 0 to Week 5. We will build a simple list of lists dataframe using the response data. Each list will represent one cohort.

columns = ['cohort_date'] + [f'Week {n}' for n in range(end_offset + 1)]
df = []
for date in start_dates:
    cohort = [date]
    for row in response.rows:
        if row.dimension_values[0].value == date:
            cohort.append(row.metric_values[0].value)
    df.append(cohort)
df = pd.DataFrame(df, columns=columns)
df = df.set_index('cohort_date')

Because the response is not sorted by cohort, we look up each start date and the metric values to our row. The dataframe will look like this.

cohort_date Week 0 Week 1 Week 2 Week 3 Week 4 Week 5                                       
2021-12-26     619    239    168    146    143     84
2022-01-02     709    217    167    139    100   None
2022-01-09     678    217    156    102   None   None
2022-01-16     899    261    138   None   None   None
2022-01-23    1415    303   None   None   None   None

We want to show retention as a percentage but we also want to keep the number of users because that is important information to show on a retention plot. We will keep a copy of the original dataframe to use as custom data for the plot and divide all columns by the first column to get the fraction of users retained.

df = df.astype('float')
df_custom = df.copy()
df = df.divide(df['Week 0'], axis = 0)

The final dataframe will look like this:

cohort_date  Week 0    Week 1    Week 2    Week 3    Week 4    Week 5                                                     
2021-12-26      1.0  0.386107  0.271405  0.235864  0.231018  0.135703
2022-01-02      1.0  0.306065  0.235543  0.196051  0.141044       NaN
2022-01-09      1.0  0.320059  0.230088  0.150442       NaN       NaN
2022-01-16      1.0  0.290323  0.153504       NaN       NaN       NaN
2022-01-23      1.0  0.214134       NaN       NaN       NaN       NaN

Now the fun part: visualisation. We will use plotly to visualise this table as a heatmap. Plotly express has a great plot for this called imshow(). Alternatively you can use go.Heatmap, which is the underlying plot in imshow()

fig = px.imshow(df,
                labels=dict(y='Cohort Start Date', x='Weeks', color='Retention'),
                x=df.columns)
fig.update(data=[dict(customdata=df_custom,
                      hovertemplate='Week: %{x} <br>'
                                    'Retention: %{z:.2%} <br>'
                                    'Cohort Start Date: %{y}<br>'
                                    'Users: %{customdata}'
			 	    '<extra></extra>')
		  ])
fig.update_layout(dict(title='Weekly Retention',
		 plot_bgcolor='rgba(0,0,0,0)',
		 height = 500,
		 width = 800))
fig.update_traces(showscale = False,
		coloraxis = None,
		colorscale = 'burgyl',
		xgap = 2,
		ygap = 2)
fig.show()

In order to display the number of users as well as the percentage retained, we need to add custom data to our plot and use it in the hovertemplate. The <extra></extra>part removes the trace name from the hover box. After some formatting we get the final plot.

In this blog, we saw how to work with cohort retention data from the GA4 API and how to visualise the retention data using plotly. There are many other metrics and dimensions that can be explored with the GA4 API, hope this blog was useful and will help you with your own GA4 property analytics.

By Zeynep Bicer
Published
Categorized as blog

Leave a comment

Your email address will not be published. Required fields are marked *