Web-scraping tool to extract public activities data from Strava Clubs (without Strava's API) using Selenium library in Python.

roboes/strava-club-scraper

Strava Club Scraper

Description

This web-scraping tool extracts activity data from Strava Clubs to fill the gaps left by the standard Strava API. The main features are:

  • Strava Club Activities scraper: imports "Recent Activity" entries (public activities, or activities that the user has access to) to a dataset (requires a Strava account).
  • Strava Club Leaderboard scraper: imports current and previous week leaderboard information (including athletes' id) to a dataset (requires a Strava account).
  • Strava Club Members scraper: imports all members that joined a Strava Club (including athletes' id) to a dataset (requires a Strava account).
  • Strava Club to Google Sheets importer: automatically retrieves data and updates the Strava Club Activities, Leaderboard and/or Members dataset(s) in a Google Sheets file (requires a Google API key).

Strava API

This tool does not rely on the Strava API. Strava's API has become very limited in recent years. For List Club Activities, it returns only the following variables: athlete variables: resource_state, firstname and lastname (first letter only); activity variables: name, distance, moving_time, elapsed_time, total_elevation_gain, type and workout_type.

Given that the Strava API does not offer an athlete id variable, athletes with the same first name and the same first letter of the last name cannot be told apart.
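The ambiguity can be illustrated with a minimal sketch. The athlete names are made up, and the exact display format ("Anna S.") is an assumption for illustration only; the point is that a first name plus a last-name initial is not a unique key.

```python
from collections import defaultdict


def api_display_key(firstname, lastname):
    """Mimic the identifying data described above: first name plus last-name initial.

    The "Anna S." formatting is an assumption made for this illustration.
    """
    return f"{firstname} {lastname[:1]}."


# Hypothetical club members (invented names).
athletes = [("Anna", "Schmidt"), ("Anna", "Sauer"), ("Ben", "Meyer")]

# Group athletes by the only identifying key available without an athlete id.
groups = defaultdict(list)
for firstname, lastname in athletes:
    groups[api_display_key(firstname, lastname)].append((firstname, lastname))

# Keys shared by more than one athlete cannot be resolved without an id.
collisions = {key: members for key, members in groups.items() if len(members) > 1}
print(collisions)
```

Scraping the club pages avoids this problem because the leaderboard and members datasets include the athletes' id.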

Limitations

  • Strava Club Activities scraper: the main limitation of this tool is that Strava's dashboard activity feed shows only a limited number of activities. Scrolling to the bottom of the page is not endless; after a few scrolls, the warning "No more recent activity available. To see your full activity history, visit your Profile or Training Calendar." is shown. Strava accepts a num_entries URL query string (e.g. https://www.strava.com/dashboard?club_id=319098&feed_type=club&num_entries=1000), but it does not necessarily load the requested number of activity entries into the feed. The tool also requires that the activities to be scraped are either public or accessible to the account doing the scraping (because that account follows the athlete or owns the activity).

  • Strava Club Leaderboard scraper: the club leaderboards include only data for the current and previous week; no historical data is provided by Strava. Additionally, club leaderboards display only the weekly top 100 members (Source).
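As a rough illustration of the num_entries query string mentioned above, the feed URL can be assembled as follows. The helper name club_feed_url is hypothetical and not part of this tool; the URL and its three query parameters are taken from the example in this section.

```python
from urllib.parse import urlencode

DASHBOARD_URL = "https://www.strava.com/dashboard"


def club_feed_url(club_id, num_entries=1000):
    """Build the club feed URL (hypothetical helper, not part of this tool).

    Strava may still load fewer entries than requested via num_entries.
    """
    query = urlencode(
        {"club_id": club_id, "feed_type": "club", "num_entries": num_entries}
    )
    return f"{DASHBOARD_URL}?{query}"


print(club_feed_url("319098"))
```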

To work around these limitations, this tool offers a Google Sheets integration that updates/increments the scraped Strava Club data (activities, leaderboard, members) while keeping previously scraped data that can no longer be accessed in the Strava Club.
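The update/increment idea can be sketched in plain Python as below. This is not the tool's actual implementation, which this README does not describe; the function name merge_scrapes and the activity_id key column are assumptions chosen for illustration.

```python
def merge_scrapes(stored_rows, scraped_rows, key="activity_id"):
    """Merge a fresh scrape into previously stored rows (illustrative sketch).

    Rows are dicts. A freshly scraped row replaces a stored row with the same
    key; stored rows that no longer appear in the scrape are kept.
    """
    merged = {row[key]: row for row in stored_rows}
    for row in scraped_rows:
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical data: activity 1 has dropped out of the feed, activity 2 was
# edited, and activity 3 is new.
stored = [
    {"activity_id": 1, "distance_km": 5.0},
    {"activity_id": 2, "distance_km": 3.0},
]
scraped = [
    {"activity_id": 2, "distance_km": 4.0},
    {"activity_id": 3, "distance_km": 7.0},
]
print(merge_scrapes(stored, scraped))
```

Keying on a stable id means that re-running the scraper on a schedule does not duplicate rows, while older rows survive after Strava stops showing them.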

Usage

Use case

Strava allows users to create a Group Challenge, which is limited to 25 participants. To circumvent this limitation, one possible use case is to create one or multiple Strava Clubs (e.g. Cycling, Multisport, Run/Walk/Hike) and adapt this script to update/increment an existing Google Sheets sheet with the clubs' activities, leaderboard and members data. The script can be set up to run automatically on a schedule on cloud platform services such as GitHub Actions (see GitHub Actions Workflow .yaml template) and Railway (see Dockerfile template). To connect the script to a Google Sheets file, a Google Sheets API .json key is required and the file needs to be shared with a Service Account email address. The Google Sheets file can then be connected to a dashboard tool (e.g. Google Data Studio, Microsoft PowerBI).

Strava settings

This tool assumes that Strava's Display Preferences are set as follows:

  • Units & Measurements = "Kilometers and Kilograms"
  • Temperature = "Celsius"
  • Feed Ordering = "Latest Activities" (chronological feed)

It also assumes that your Strava display language is English (US). To change the language, log in to Strava and, in the bottom right-hand corner of any page, select English (US) from the drop-down menu (more on this here).

Python dependencies

python -m pip install python-dateutil geopy google-api-python-client google-auth lxml pandas selenium webdriver-manager

Functions

strava_club_activities

strava_club_activities(club_ids, filter_activities_type, filter_date_min, filter_date_max, timezone='UTC')

Description

  • Scrapes and imports activities belonging to one or multiple Strava Club(s) (public activities or activities that the scraping account has access to) to a dataset.

Parameters

  • club_ids: str list. List of Strava Club ids from which the tool should scrape data (e.g. club_ids=['445017', '1045852']).
  • filter_activities_type: str list, default: None. List of activities type filter (e.g. filter_activities_type=['E-Bike Ride', 'Hike', 'Ride', 'Run', 'Walk']).
  • filter_date_min: str. Start date filter (e.g. filter_date_min='2023-06-05').
  • filter_date_max: str. End date filter (e.g. filter_date_max='2023-07-30').
  • timezone: str or timezone object, default: 'UTC'.
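To make the filter parameters concrete, here is a minimal sketch of how a type filter and a date range could be applied to scraped rows. It is not the tool's implementation: the column names activity_type and activity_date are hypothetical, and treating both date bounds as inclusive is an assumption, since this README does not state it.

```python
from datetime import date


def filter_activities(
    activities,
    filter_activities_type=None,
    filter_date_min=None,
    filter_date_max=None,
):
    """Keep rows matching the type list and the date range (illustrative sketch).

    Dates are 'YYYY-MM-DD' strings; both bounds are assumed to be inclusive.
    """
    date_min = date.fromisoformat(filter_date_min) if filter_date_min else None
    date_max = date.fromisoformat(filter_date_max) if filter_date_max else None
    kept = []
    for activity in activities:
        day = date.fromisoformat(activity["activity_date"])
        if (
            filter_activities_type is not None
            and activity["activity_type"] not in filter_activities_type
        ):
            continue
        if date_min is not None and day < date_min:
            continue
        if date_max is not None and day > date_max:
            continue
        kept.append(activity)
    return kept


# Hypothetical scraped rows.
rows = [
    {"activity_type": "Run", "activity_date": "2023-06-05"},
    {"activity_type": "Swim", "activity_date": "2023-06-10"},
    {"activity_type": "Ride", "activity_date": "2023-08-01"},
]
print(filter_activities(rows, ["Ride", "Run"], "2023-06-05", "2023-07-30"))
```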

strava_club_members

strava_club_members(club_ids, club_members_teams=None, timezone='UTC')

Description

  • Scrapes and imports members of one or multiple Strava Club(s) to a dataset.

Parameters

  • club_ids: str list. List of Strava Club ids from which the tool should scrape data (e.g. club_ids=['445017', '1045852']).
  • club_members_teams: dict, default: None. Option to assign athlete_id values to one or multiple teams (stored in the athlete_team column). An athlete_id assigned to multiple teams gets its unique team assignments as a comma-separated value.
  • timezone: str or timezone object, default: 'UTC'.

Example of club_members_teams:

club_members_teams={
    'Team A': ['1234, 5678'],
    'Team B': ['1234, 12345'],
}
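The resulting athlete_team value can be pictured with the sketch below. It is an illustration rather than the tool's code: the helper name athlete_team_column is hypothetical, and it accepts both comma-separated strings (as in the example above) and individual ids.

```python
def athlete_team_column(club_members_teams):
    """Map each athlete_id to its comma-separated teams (illustrative sketch)."""
    assignments = {}
    for team, entries in club_members_teams.items():
        for entry in entries:
            # An entry may hold a single id or several comma-separated ids.
            for athlete_id in str(entry).split(","):
                athlete_id = athlete_id.strip()
                if not athlete_id:
                    continue
                teams = assignments.setdefault(athlete_id, [])
                if team not in teams:  # keep team assignments unique
                    teams.append(team)
    return {athlete_id: ", ".join(teams) for athlete_id, teams in assignments.items()}


club_members_teams = {
    "Team A": ["1234, 5678"],
    "Team B": ["1234, 12345"],
}
print(athlete_team_column(club_members_teams))
```

With the example above, athlete 1234 belongs to both teams, so its value is the two team names joined by a comma.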

strava_club_leaderboard

strava_club_leaderboard(club_ids, filter_date_min, filter_date_max, timezone='UTC')

Description

  • Scrapes and imports the leaderboard of one or multiple Strava Club(s) to a dataset.

Parameters

  • club_ids: str list. List of Strava Club ids from which the tool should scrape data (e.g. club_ids=['445017', '1045852']).
  • filter_date_min: str. Start date filter (e.g. filter_date_min='2023-06-05').
  • filter_date_max: str. End date filter (e.g. filter_date_max='2023-07-30').
  • timezone: str or timezone object, default: 'UTC'.

strava_club_to_google_sheets

strava_club_to_google_sheets(df, sheet_id, sheet_name)

Description

  • Updates/increments a Google Sheets sheet with the given dataset.

Parameters

  • df: DataFrame. Input dataset to be updated/incremented in a specified Google Sheets sheet.
  • sheet_id: str. Google Sheets file id.
  • sheet_name: str. Google Sheets sheet/tab where the data should be updated/incremented.

execution_time_to_google_sheets

execution_time_to_google_sheets(sheet_id, sheet_name, timezone='UTC')

Description

  • Writes the time at which the code was executed to a Google Sheets sheet.

Parameters

  • sheet_id: str. Google Sheets file id.
  • sheet_name: str. Google Sheets sheet/tab where the data should be updated/incremented.
  • timezone: str or timezone object, default: 'UTC'.
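The timezone parameter is documented as "str or timezone object". A minimal sketch of how such a value could be resolved and turned into an execution timestamp follows. The helper names and the timestamp format are assumptions; the tool's real output format is not stated in this README.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo


def resolve_timezone(tz="UTC"):
    """Accept a tzinfo object or a timezone name and return a tzinfo object."""
    if isinstance(tz, tzinfo):
        return tz
    if tz.upper() == "UTC":
        return timezone.utc  # avoids needing the tz database for plain UTC
    return ZoneInfo(tz)  # e.g. 'Europe/Berlin'; requires tz data to be installed


def execution_timestamp(tz="UTC", now=None):
    """Format the execution time in the given timezone (format is an assumption)."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return moment.astimezone(resolve_timezone(tz)).strftime("%Y-%m-%d %H:%M:%S")


fixed = datetime(2023, 6, 5, 12, 0, 0, tzinfo=timezone.utc)
print(execution_timestamp("UTC", now=fixed))
print(execution_timestamp(timezone(timedelta(hours=2)), now=fixed))
```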

strava_export_gpx

strava_export_activities(activities_id, file_type)

Description

  • Exports a list of activities (given by activity_id) to GPS files.

Parameters

  • activities_id: int list or str list. List of activity_id to be exported (e.g. activities_id=[696657036, 696657037]).
  • file_type: str, default: '.gpx'. Activity export format. The '.gpx' format uses Strava's built-in export feature, while '.tcx' uses the Sauce for Strava Chrome Extension (which must be installed on Selenium's WebDriver to work). Strava's built-in .gpx export includes only trackpoints (with latitude and longitude); these exports can be post-processed with GPSBabel, for example to convert them to other GPS file types (e.g. .tcx) and to add fake timestamps ("faketimes") (see gps_tools.sh).
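To show what a trackpoint-only export contains, the sketch below reads latitude/longitude pairs from GPX text using the standard library. The sample document is made up, and the sketch assumes the file uses the GPX 1.1 namespace.

```python
import xml.etree.ElementTree as ET

# GPX 1.1 namespace; assumed here to be the one used by the exported files.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_trackpoints(gpx_text):
    """Return the (latitude, longitude) pairs of all trackpoints in a GPX document."""
    root = ET.fromstring(gpx_text)
    return [
        (float(point.get("lat")), float(point.get("lon")))
        for point in root.iterfind(".//gpx:trkpt", GPX_NS)
    ]


# Made-up sample with two trackpoints and no timestamps.
sample = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    "<trk><trkseg>"
    '<trkpt lat="52.5200" lon="13.4050"/>'
    '<trkpt lat="52.5205" lon="13.4061"/>'
    "</trkseg></trk></gpx>"
)
print(read_trackpoints(sample))
```

Because such trackpoints carry no time element, speed or pace cannot be derived from the file as exported, which is why the fake-timestamp step with GPSBabel is mentioned above.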

selenium_webdriver_quit

selenium_webdriver_quit()

Description

  • Terminates the WebDriver session.

Parameters

  • None.

Legal

Please note that the use of this code/tool may not comply with Strava's Terms of Service (especially the "Distributing, or disclosing any part of the Services in any medium, including without limitation by any automated or non-automated “scraping”" term) and Strava's API Agreement (especially the "You may not use web scraping, web harvesting, or web data extraction methods to extract data from the Strava Platform" term). Use this tool at your own risk.

See also

Strava Club Tracker: Tool that generates a progress tracker/dashboard for Club activities (relies on Strava's API) (HTML, PHP).

StravaClubActivities: Tool that downloads Club activities and generates a .csv for processing virtual race events (relies on Strava's API) (Ruby).
