gdscrapeR: scrape Glassdoor company reviews in R

ABOUT

gdscrapeR is an R package that scrapes company reviews from Glassdoor using a single function: get_reviews. It returns a data frame of review text that can be further prepped for text-analytics learning projects.

INSTALL & LOAD

Install the latest version from GitHub:

install.packages("devtools")
devtools::install_github("mguideng/gdscrapeR")

library(gdscrapeR)

USAGE

Example

The URL to scrape reviews for the awesome SpaceX company is: www.glassdoor.com/Reviews/SpaceX-Reviews-E40371.htm.


Function

Pass the company number to the get_reviews function. The company number is a string representing a company's unique ID. Find it by navigating to a company's Glassdoor reviews page and checking the URL for the characters between "Reviews-" and ".htm" (it usually starts with an "E" followed by digits).
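The company number can also be pulled straight from the URL with a regular expression. A minimal sketch using stringr (loaded later in this README); the url value here is just an illustration:

# Extract the company number (e.g., "E40371") from a Glassdoor reviews URL
library(stringr)
url <- "https://www.glassdoor.com/Reviews/SpaceX-Reviews-E40371.htm"
companyNum <- str_extract(url, "(?<=Reviews-)E\\d+(?=\\.htm)")
companyNum
#> [1] "E40371"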

# Create data frame of: Date, Summary, Rating, Title, Pros, Cons, Helpful
df <- get_reviews(companyNum = "E40371")

This will scrape the following variables:

  • Date - when the review was posted
  • Summary - e.g., "Great People"
  • Rating - star rating between 1.0 and 5.0
  • Title - e.g., "Current Employee - Manager in Hawthorne, CA"
  • Pros - upsides of the workplace
  • Cons - downsides of the workplace
  • Helpful - count of users who marked the review helpful, if any
  • (and other info related to the source link)

Result

(screenshot of the resulting data frame)
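A quick way to inspect what came back before prepping it further:

# Check the structure and first few rows of the scraped data frame
str(df)
head(df)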

PREP FOR TEXT ANALYTICS

RegEx & Export

Use regular expressions to clean and extract additional variables and then export:

#### REGEX ####
# Package
library(stringr)    # pattern matching functions

# Add: PriKey (uniquely identify rows 1 to N, sorted from first to last review by date)
df$rev.pk <- seq_len(nrow(df))

# Extract: Year, Status, Position, Location 
df$rev.year <- as.numeric(sub(".*, ","", df$rev.date))

df$rev.stat <- str_extract(df$rev.title, ".+?(?= Employee -)")

df$rev.pos <- str_replace_all(df$rev.title, ".* Employee - |\\sin .*|\\s$", "")

df$rev.loc <- sub(".*\\sin ", "", df$rev.title)
# Fall back to "Not Given" when no location was extracted
df$rev.loc <- ifelse(grepl("Former Employee|Current Employee|^\\s*$", df$rev.loc),
                     "Not Given", df$rev.loc)

# Clean: Pros, Cons, Helpful
df$rev.pros <- gsub("&amp;", "&", df$rev.pros)

df$rev.cons <- gsub("&amp;", "&", df$rev.cons)

df$rev.helpf <- as.numeric(gsub("\\D", "", df$rev.helpf))

#### EXPORT ####
write.csv(df, "df-results.csv", row.names = F)

Exploration ideas

gdscrapeR was made for learning purposes. Analyze the unstructured text, extract relevant information, and transform it into useful insights.

  • Apply Natural Language Processing (NLP) methods to show what is being written about the most (a starter sketch follows this list).
  • Run sentiment analysis: categorize the text to determine whether a review reads as positive, negative, or neutral, as a way of gauging the emotions and attitudes of employees. Here's a sample project: "Text Mining Company Reviews (in R) - Case of MBB Consulting".
  • Create a metrics profile for a company to track how its star-rating distribution changes over time.
  • The "Text Mining with R" book by Julia Silge and David Robinson is highly recommended for further ideas.

If you find this package useful, feel free to star ⭐ it. Thanks for visiting ❤️ .

NOTES

  • Uses the rvest and purrr packages to make it easy to scrape company reviews into a data frame.
  • A common issue with scraping is the need to keep up with changes made to a website's pages. The Glassdoor site will change; errors due to CSS selector changes are reported as "Could not scrape data from website.".
  • Be polite.
    • A system sleeper is built in so there will be delays to slow down the scraper (expect ~1.5 minutes for every 100 reviews).
    • Also, saving the data frame to avoid redundant scraping sessions is suggested (a minimal sketch follows these notes).
  • To contact maintainer: [imlearningthethings at gmail].
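A minimal sketch of the caching suggestion above; the file name is just an example:

# Save the scraped data frame once after a session...
saveRDS(df, "spacex-reviews.rds")

# ...and reload it in later sessions instead of re-scraping
df <- readRDS("spacex-reviews.rds")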