A webscraper API using Python to extract clothing information from thredup's website for petite clothing

tas09009/thredup-scraper-api

Thredup Scraper API

standard-readme compliant

Thredup implemented Cloudflare to block all bots. I may look into libraries to bypass it, but for now this project has been tabled.

Thredup Scraper API is a command-line, Python-based web scraper that uses Beautiful Soup to extract clothing information into a CSV file. Later, the project will be migrated to a back-end framework for use as an API.

Background

There are two ways to reduce your carbon footprint when it comes to clothing:

  • buying used whenever possible
  • choosing natural fabrics (wool, silk, cotton, etc.) over man-made fabrics

This project is an attempt to combine the two ways by scraping for sustainable fabrics from the world's largest online consignment store.

But clothing is environmentally damaging even AFTER it's been purchased. For example, washing polyester or any other plastic-based clothing in a washing machine releases microplastics that end up in the ocean. Of course, the environmental damage also depends on the company, manufacturer, materials process (Ex: recycled polyester), quality, etc. In addition, non-natural fibers are less comfortable and less breathable, and they fall apart more quickly than stronger fabrics such as linen, wool, silk, etc.

The basic rule to follow is: used natural-fabric clothing > new clothing.

This program makes the following distinctions between good and bad fabrics:

| Good fabrics | Bad fabrics |
| --- | --- |
| cotton | polyester |
| silk | polyamide |
| wool | acrylic |
| merino wool | fabric not found* |
| alpaca | No Fabric Content* |
| linen | |
| hemp | |
| bamboo | |
| tencel | |

*Many items on the site don't have fabric information, so the program assumes the worst-case scenario.
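The good/bad split above can be sketched as a small classifier. This is a minimal illustration, not the project's actual code: the function name `classify_materials`, the rule that any bad fabric in a blend marks the whole item as bad, and the "unknown" result for fabrics outside both lists are all assumptions.

```python
# Fabric lists taken from the table above.
GOOD_FABRICS = {"cotton", "silk", "wool", "merino wool", "alpaca",
                "linen", "hemp", "bamboo", "tencel"}
BAD_FABRICS = {"polyester", "polyamide", "acrylic",
               "fabric not found", "no fabric content"}


def classify_materials(materials):
    """Classify a materials string as "good", "bad", or "unknown".

    Assumptions (not confirmed by the project): a blend containing any
    bad fabric is "bad", and a fabric in neither list is "unknown".
    """
    text = (materials or "").strip().lower()
    if not text:
        # Missing fabric information: assume the worst case.
        return "bad"
    if any(bad in text for bad in BAD_FABRICS):
        return "bad"
    if any(good in text for good in GOOD_FABRICS):
        return "good"
    return "unknown"


if __name__ == "__main__":
    for sample in ["100% Cotton", "60% Cotton, 40% Polyester", ""]:
        print(repr(sample), "->", classify_materials(sample))
```

Substring matching keeps the sketch short, at the cost of also treating variants such as recycled polyester as bad.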

Install

  1. You can either clone the project by running git clone https://github.com/tas09009/thredup-scraper-api.git in your terminal or fork the project in order to contribute later: See Contributing below.
  2. Set up your Python virtual environment by running python3 -m venv venv in that directory, then run source venv/bin/activate to activate it. Alternatively, create a conda environment.
  3. Install Python requirements with pip install -r requirements.txt.

Usage

Run the program by typing python code/thredup_fullscrape.py. The terminal will then ask for the following three inputs:

  • url of thredup
  • number of pages to be scraped
  • file name and location to save csv output

Beautiful Soup pulls all product links from a search page (50 per page) and then parses each product link to pull the following information:

| Information to be extracted | Function | Example |
| --- | --- | --- |
| Link | url of each item on a page | Item_Link |
| Category Type | clothing type | Tops |
| Image_Link | front picture of item | Picture_Link |
| Description | distinct features | 'Crew neckline', 'Color blocked detail', 'Long sleeve', 'Blue' |
| Materials | fabric content and its percentage | 100% Cotton |
| Size | item size | Size XS |
| Measurement | measurements depend on item itself | 28" Chest, 22" Length |
| Price | price | 3.99 |
| Brand | name brand | Tommy Hilfiger |

Picture of the exported data: basic_scrape_table_image. You can also look in "datasets/test_runs" for more CSV examples.
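As a rough illustration of the export step, the sketch below writes rows with the columns listed above using Python's standard `csv` module. The function name `items_to_csv`, the example row, and storing the description as one `; `-joined string are assumptions for illustration; the real script may lay the file out differently.

```python
import csv
import io

# Column names taken from the table above.
COLUMNS = ["Link", "Category Type", "Image_Link", "Description",
           "Materials", "Size", "Measurement", "Price", "Brand"]

# Hypothetical row built from the example values in the table.
EXAMPLE_ROW = {
    "Link": "Item_Link",
    "Category Type": "Tops",
    "Image_Link": "Picture_Link",
    "Description": "Crew neckline; Color blocked detail; Long sleeve; Blue",
    "Materials": "100% Cotton",
    "Size": "Size XS",
    "Measurement": '28" Chest, 22" Length',
    "Price": "3.99",
    "Brand": "Tommy Hilfiger",
}


def items_to_csv(rows):
    """Render a list of item dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)  # write the whole batch in one call
    return buffer.getvalue()


if __name__ == "__main__":
    print(items_to_csv([EXAMPLE_ROW]))
```

To write to disk instead, the same `DictWriter` can wrap a file opened with `open(path, "w", newline="")`. The `csv` module quotes the measurement field automatically because it contains commas and double quotes.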

FYIs:

  • This project uses neither rotating proxies nor HTTP headers, due to time and money constraints. Instead, the code waits 5 to 10 seconds before each request.
  • Scraping one page (50 items) takes 6 to 8 minutes.
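A delay like the one described above could look roughly like this. The names `polite_pause` and `estimate_page_minutes` are hypothetical, and the injectable `sleep` argument exists only so the sketch can be exercised without actually waiting.

```python
import random
import time


def polite_pause(min_seconds=5.0, max_seconds=10.0, sleep=time.sleep):
    """Sleep for a random 5-10 second interval and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def estimate_page_minutes(items_per_page=50, min_seconds=5.0, max_seconds=10.0):
    """Return (lowest, highest) minutes spent on delays alone for one page."""
    lowest = items_per_page * min_seconds / 60
    highest = items_per_page * max_seconds / 60
    return lowest, highest


if __name__ == "__main__":
    low, high = estimate_page_minutes()
    print(f"Delays alone: {low:.1f} to {high:.1f} minutes per page")
```

With 50 items per page, the delays alone add up to somewhere between roughly 4 and 8 minutes, which is in line with the 6 to 8 minutes quoted above once request time is included.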

Additional Web Scraping Scripts

This project also contains standalone Python scripts, separate from the database project, in the '/code/additional_modules' directory. They are:

thredup_tabs.py

Scrapes a given number of items within a search page to filter out clothing by the following "Materials":

  • Polyester
  • Acrylic
  • Fabric details not available
  • No Fabric Content

All results (that don't contain the forbidden words) are opened in a new tab for viewing.

Usage

  • input: url of current page
  • output: new Chrome tabs open one by one, showing only items whose fabrics contain none of the banned words, with a 3-second delay per tab
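The filtering idea behind thredup_tabs.py can be sketched as follows. This is not the script itself: the function names, the `(url, materials)` input shape, and the example.com URLs are hypothetical, and the standard `webbrowser` module opens the system default browser rather than Chrome specifically.

```python
import time
import webbrowser

# Banned "Materials" values from the list above, lowercased for matching.
BANNED_WORDS = ("polyester", "acrylic",
                "fabric details not available", "no fabric content")


def is_allowed(materials):
    """True when the materials text contains none of the banned words."""
    text = materials.lower()
    return not any(word in text for word in BANNED_WORDS)


def allowed_links(items):
    """Keep the URLs of (url, materials) pairs that pass the filter."""
    return [url for url, materials in items if is_allowed(materials)]


def open_in_tabs(urls, delay_seconds=3.0, sleep=time.sleep,
                 opener=webbrowser.open_new_tab):
    """Open each URL in a new browser tab, pausing between tabs."""
    for url in urls:
        opener(url)
        sleep(delay_seconds)


if __name__ == "__main__":
    scraped = [
        ("https://example.com/item-a", "100% Wool"),
        ("https://example.com/item-b", "80% Polyester, 20% Cotton"),
    ]
    print(allowed_links(scraped))  # pass the result to open_in_tabs to view
```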

thredup_fav.py

Removes all "sold" items from favorites list

Usage

  • input: url of "favorites" page
  • output: CLI notifying when items have been removed. Refresh page to see updates.

More to be added

Contributing

Please follow along this excellent step-by-step guide to learn how to contribute to an open-source project

License


Road Map

The web scraping code can be made more efficient, for example by scraping multiple elements from one CSS tag rather than the whole page. Right now, it's been built to work. The code will be updated at a later time. See the Projects Board for the latest status.

The following sections include further research, plans, etc.

Future Goals

Make it as easy as possible to buy second hand clothing

  • Python library to include for scraping:
    • thredup
    • poshmark
    • ebay
    • heroine
    • etsy
    • The Real Real (luxury)
    • Vestiaire Collective (luxury)
    • local thrift stores (how to get them online?)
  • expand to men's clothing. Ex: grailed
  • include a WHY section
  • If a company has a store (ex: amour vert, reformation, etc.) then try on their clothes and remember their sizes
  • order an item or two from them, then buy the used version online
  • clothing websites should have a "used section" where you can sell items back to them; Eileen Fisher now has this

Questions to Answer:

  • what percentage of clothing is considered "environmentally damaging" i.e. made of "banned" products
  • how many items are correctly sorted in their category?
    • Ex: clicked on casual dresses and many formal work dresses showed up
  • how many items are missing categories such as "accents" and "pattern"
    • how many have a tag such as "3/4 sleeve" but don't belong to any category
  • sizes vary per clothing item
    • Ex: size 00 and 0 for top but 2 for bottoms. But website cannot differentiate
  • data may need to be cleaned up prior to putting into database?
    • links will need to be made beforehand
  • Machine Learning
    • Classify sweaters as actual sweaters?
    • Pick clothing based on fashion styles. Ex: boho, chic, grunge, etc.
  • where does Viscose actually fall into place?
  • Some items sold are using 'recycled polyester' such as this Eileen Fisher Trenchcoat
  • how much of the clothing is fast fashion? obviously only in the petite category
  • Other thredup projects:
    • Thredup: a project to extract data from the website and do statistical calculations on it
    • Thredup-Cart-Refresher: refreshes items inside the Thredup account's cart
    • WebCrawler-ThredUp: a web crawler to scrape data from ThredUp products into a database
    • build a seasonal wardrobe with 5 items under $100 or $200? Use Vetta for ideas

Website inconsistencies

  • limited filters within the "petite" category such as
    • not able to search by fabrics
      • Ex: linen/cotton combination
      • Ex: 100% wool
    • not able to filter out fabrics
      • Ex: no polyester
      • Ex: no polyester or acrylic
  • shop by style on thredup's website. All the links go to the same link for all womens clothing
  • Search shorts: rompers are also displayed and identified as dresses
  • two tops are exactly the same but have different descriptions. Here and here
    • this blue top and white top are similar to the red tops above. Again, different descriptions
  • when jumping between different categories, the "sort by" method changes to "Recently Discounted" by default
  • Only product filters all clothing items have in common are:
    • color
    • pattern
    • accents
  • This project will be helpful to only those who are petite but eventually should expand to the others as well
  • Catch microplastics in washing machine (if you have to buy polyester) with:
  • Express casual pants - amour vert knockoff
  • Thredup's classes, ids, and div tags all have unintuitive names. Other websites' labels make much more sense
  • thredup.com/robots.txt
  • Tutorial: Web Scraping and BeautifulSoup exactly what I'm doing
  • Integrate IP addresses Web scraping with Python - 3 medium articles
  • robots.txt: rules of scraping such as frequency and specific pages
  • Thredup doesn't have an API, not for Python at least
  • Thredup should have a database of the top 10 brands and their measurements and it should automatically pull from that when a brand is matched
  • item picture - high resolution only
  • website link
    • very difficult to pull; none of the links would appear. Realized that, without an account login, the search results display in order of "Recently Discounted", as opposed to "Newest First", which is how I was searching while logged in
      • Organized by "Newest First", which makes re-running the code much easier: the database can be updated by web scraping until the first item already in the database is found
  • All item details
    • Description: dictionary with 6 to 8 keywords. These are values only. Need keys from search results link (left column). All values match a key to the columns on the left
    • Pull all keys from the columns first, then match their values based on the item description
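The key/value matching described in the last two bullets could be sketched like this. Everything here is illustrative: the filter names and values in `EXAMPLE_FILTERS` are made up, and in practice they would come from the left-hand filter column of the search results page.

```python
# Hypothetical filter column: key -> values offered under that key.
EXAMPLE_FILTERS = {
    "Neckline": ["Crew neckline", "V-neck"],
    "Accents": ["Color blocked detail"],
    "Sleeve length": ["Long sleeve", "3/4 sleeve"],
    "Color": ["Blue", "Red"],
}


def build_value_index(filters):
    """Invert {key: [values]} into {lowercased value: key}."""
    index = {}
    for key, values in filters.items():
        for value in values:
            index[value.lower()] = key
    return index


def label_description(description, filters):
    """Attach a key to each description value.

    Returns (labeled, unmatched): labeled maps key -> values found in the
    description; unmatched lists values that belong to no known key.
    """
    index = build_value_index(filters)
    labeled, unmatched = {}, []
    for value in description:
        key = index.get(value.lower())
        if key is None:
            unmatched.append(value)
        else:
            labeled.setdefault(key, []).append(value)
    return labeled, unmatched


if __name__ == "__main__":
    item = ["Crew neckline", "Color blocked detail", "Long sleeve", "Blue"]
    print(label_description(item, EXAMPLE_FILTERS))
```

The `unmatched` list also gives a direct way to count tags that do not belong to any category, one of the questions raised earlier.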

Lessons Learned

  • Search by petite first, then sort. Rather than search all and then filter by petite. In the case of searching by a specific fabric (Ex: 100% merino wool), it's easier to search within the petite clothing and then filter out by fabric.
  • difficult to loop through different clothing types and multiple pages within a clothing type. Easier for now to search for one clothing type at a time.
  • filter out clothing by fabrics (polyester, polyamide, etc.)
    • second layer of filter for rayon, nylon, etc.
  • sort out clothing specifically by fabrics (wool, linen, etc.)
  • importing functions: caused circular dependencies
  • Can't use the Beautiful Soup HTML parser for Thredup because I cannot extract the hrefs for all the items from the site. I have no idea where they are!
    • XML will be the way to go: all items are in a grid, with the 2nd-to-last number increasing for each item.
  • writing each item as its own row would take too much memory and time; it's better to write 50 rows at a time

Clothing Sustainability Issues & Ideas

  • detailed email: all that is inconvenient + link to my thredup library. Would love to recommend the website to friends and others if these issues are fixed!
  • sizing and fabrics are usually incorrect, which is problematic since my main filter is by fabric. I have returned a few items in the past, but for some I only noticed the discrepancy later, when it was too late to return them
  • retail value incorrect. One blog mentioned this
  • need more feedback loops from customers
  • When I switch to Petite, all filters are reset
  • Cannot search by material
  • cannot search by eco-friendly materials either
  • email: not recommendations based on style and fabrics
  • Ask for access to API - read the docs
  • read their engineering blog
  • ML to create goody box
    • don't like how I don't know what I'm getting
    • display items within hours
    • choose what you like, or get similar recommendations
      • Suggested item: red turtleneck sweater. Suggested alternatives: different color, fabric, mockneck, etc.
  • thredup monthly renting? similar to rent the runway? already do Goody Boxes and Rescues
    • How happy are people with the Goody Boxes and Rescues? Online research
      • Thredup must have this data in their yearly report?
    • see ML info above
    • ML to take monthly feedback to learn how to improve next time
      • better sizing
      • style (if people preferred blazers over sweaters)
    • create an ML model and test it (buy all the items as a "goody box" bundle)
      • return and give feedback to ML model. Test again with another order
        • verify: sizing, color matching to original picture, fabric, original retail price estimation, cut accuracy
  • Blog post about using Goody Box. Send her two shipments and give some feedback to ML model?
  • Huge Rescue Mystery Box prefers Free People. Preferences can be prioritized?
  • competitor to Prime wardrobe but better
  • possibility of adding men's clothing
  • phase out bad fabrics. Over time when people donate and their items are logged into their accounts, a warning sign should come up saying we will not take polyester after this point
  • get local thrift stores online as well. With thredup's help?
  • Blog post: 11 ETHICAL OR SUSTAINABLE CLOTHING BRANDS LIKE EVERLANE is wildly inaccurate. write a comment on the site
  • Blog post: 6 MUST-WATCH DOCUMENTARIES TO LEARN ABOUT SUSTAINABLE FASHION. The True Cost mentions Uniqlo as "fast fashion", which it is. But the blog post above recommends it as ethical and sustainable clothing
  • What are the best channels to reach out to Thredup?
  • How trustworthy is Good On You? Are there other websites with more brand ratings? Tried looking up several brands on Good On You and it didn't have them
  • What if a local thrift store scanned the clothing tag, and it matched with the brand and page of the item with the picture?
  • Build a database from web-scraping and save all important info:
    • picture
    • link to item
    • details: material

Resources:
