How I built software to scrape 47,000+ Shopify stores - tools, pipeline, and lessons learned


This is the methodology behind my 47,000+ Shopify stores study. If you're here for the findings, go there. If you want to know how it was built - read on.

Where the data came from

I sourced all store URLs from PublicWWW, a search engine that indexes website source code. I wrote a regex to find every URL whose HTML contains myshopify.com - the fingerprint every Shopify store leaves in its code regardless of whether it uses a custom domain. (Still, I didn't take that at face value. I built a separate step dedicated to validating each Shopify store, as you'll see shortly.)
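
A minimal Python sketch of that fingerprint check. The pattern and the function name are illustrative assumptions, not the exact regex used for the PublicWWW search.

```python
import re

# Hypothetical pattern: a store handle followed by .myshopify.com.
# Handles are assumed to consist of letters, digits, and hyphens.
MYSHOPIFY_RE = re.compile(r"\b([a-z0-9][a-z0-9-]*)\.myshopify\.com\b", re.IGNORECASE)


def find_myshopify_handles(html: str) -> set:
    """Return the distinct *.myshopify.com handles mentioned in a page's HTML."""
    return {match.group(1).lower() for match in MYSHOPIFY_RE.finditer(html)}


sample = '<script>Shopify.shop = "cool-store.myshopify.com";</script>'
print(find_myshopify_handles(sample))  # {'cool-store'}
```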

That search returned roughly 150,000 raw URLs. Most of them were dead, closed, or outdated. I deliberately chose PublicWWW over other sources because it skews toward more established stores. I wanted to avoid very recent stores and aimed at more mature merchants with real data worth analyzing.

That bias is worth knowing: this dataset does not represent the newest wave of Shopify stores, and that's intentional. Most stores in this dataset are at least 1 year old, though I don't have concrete data on this yet.

Step 0 - Validation - ~100h

Before touching any store, every URL went through a validation step. A store only moves forward if it passes all four checks:

Has window.Shopify.theme - confirms it's an active Shopify storefront
Is not closed - some stores are permanently shut, but the domain still resolves or redirects to some kind of domain-sales site
Is not password-protected - password-locked stores return no usable data
Has at least one product - stores with no products were excluded (surprisingly, there were quite a few)

After validation, roughly 60,000 stores passed - about 40% of the raw list. The rest were discarded. No duplicates were allowed at any point in the pipeline; I was strict about that from the start.
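
The checks and the no-duplicates rule could be sketched roughly like this. The StoreSnapshot record, the closed-store marker text, and the /password redirect test are assumptions made for illustration; the post does not describe the exact heuristics the pipeline used.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumed wording of the "closed store" page; adjust to what you observe.
CLOSED_MARKERS = ("this shop is currently unavailable",)


@dataclass
class StoreSnapshot:
    """Hypothetical record of what one fetch returned for a store."""
    url: str            # URL from the raw list
    final_url: str      # URL after redirects
    html: str           # storefront HTML
    product_count: int  # products seen on the first page of the product feed


def normalize_host(url: str) -> str:
    """Lowercase the host and drop a leading 'www.' so duplicates collapse."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_valid_store(snap: StoreSnapshot) -> bool:
    """Apply the storefront, closed, password, and product checks."""
    has_theme = "Shopify.theme" in snap.html
    is_closed = any(marker in snap.html.lower() for marker in CLOSED_MARKERS)
    # Assumption: locked stores redirect visitors to a /password page.
    is_locked = urlparse(snap.final_url).path.rstrip("/") == "/password"
    return has_theme and not is_closed and not is_locked and snap.product_count > 0


def validate_all(snapshots):
    """Yield valid stores, skipping any host that was already seen."""
    seen = set()
    for snap in snapshots:
        host = normalize_host(snap.url)
        if host in seen:
            continue
        seen.add(host)
        if is_valid_store(snap):
            yield snap
```

Deduplicating on a normalized host rather than the raw URL is one simple way to keep www and non-www variants of the same store from being counted twice.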

Step 1 - Discovery (themes + apps) - ~250h

This is where it gets interesting and where most of the engineering time went.

Themes

Theme detection is straightforward. Shopify exposes theme data in window.Shopify.theme, so I fetch the store's HTML and read the name, version, and metadata directly from that object. Clean and reliable.
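
As a rough sketch, reading that object from fetched HTML might look like the following. It assumes the page contains an inline assignment of the form Shopify.theme = {...} with a JSON object literal on the right-hand side; the sample values are made up.

```python
import json
import re
from typing import Optional

# Assumption: the storefront HTML has an inline `Shopify.theme = {...}` assignment.
THEME_ASSIGN_RE = re.compile(r"Shopify\.theme\s*=\s*(?=\{)")


def extract_theme(html: str) -> Optional[dict]:
    """Return the parsed Shopify.theme object, or None if it is absent or malformed."""
    match = THEME_ASSIGN_RE.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        theme, _end = json.JSONDecoder().raw_decode(html[match.end():])
    except json.JSONDecodeError:
        return None
    return theme if isinstance(theme, dict) else None


sample = '<script>Shopify.theme = {"name":"Dawn","schema_version":"12.0.0"};</script>'
print(extract_theme(sample))
```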

Apps - the hard part

App detection is a different problem entirely. Shopify doesn't expose a list of installed apps anywhere in the storefront. You have to infer them from the scripts, stylesheets, and injected code the apps leave behind.

My approach was to build a reference library of app fingerprints from scratch. For each app I wanted to detect, I went to its Shopify App Store page, found its demo store, and inspected the <script> tags. I wrote a function that strips all native Shopify scripts (checkout scripts, CDN assets, etc.) and isolates only the third-party app-injected scripts.
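
A minimal sketch of that filtering idea in Python, using only the standard library. The NATIVE_PREFIXES list is a hypothetical, deliberately incomplete example of what "native Shopify" might mean; the real filter would need a longer list built from inspection.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical, non-exhaustive prefixes (host + path) treated as native Shopify.
NATIVE_PREFIXES = (
    "cdn.shopify.com/shopifycloud/",
    "shop.app/",
)


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def third_party_scripts(html: str) -> list:
    """Return script URLs that are neither relative nor on a native prefix."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    kept = []
    for src in collector.sources:
        parsed = urlparse(src)
        if not parsed.netloc:
            continue  # relative path: served by the store itself
        location = parsed.netloc.lower() + parsed.path
        if location.startswith(NATIVE_PREFIXES):
            continue
        kept.append(src)
    return kept
```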

From there I built a selector for each app. For example, for Judge.me I used something like:

script[src*="judge"][src*="me"][src*=".js"]

Matching on several short substrings instead of one exact string keeps the selector from missing naming variations like judge-me or judge_me. I then validated each selector against stores I already knew had that app installed before adding it to the library.
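
For comparison, the same "every substring must appear in src" rule can be written without a browser. This is a sketch only: the FINGERPRINTS table holds a single entry mirroring the selector above, and the URLs in the usage lines are invented.

```python
# Each app maps to substrings that must ALL appear in a script's src,
# mirroring a chain of CSS [src*="..."] attribute selectors.
FINGERPRINTS = {
    "Judge.me": ("judge", "me", ".js"),
}


def detect_apps(script_srcs, fingerprints=FINGERPRINTS) -> set:
    """Return the apps whose fingerprint matches at least one script URL."""
    detected = set()
    for app, parts in fingerprints.items():
        if any(all(part in src for part in parts) for src in script_srcs):
            detected.add(app)
    return detected


print(detect_apps(["https://cdn.example.com/judge-me/widget.js"]))  # {'Judge.me'}
print(detect_apps(["https://cdn.example.com/reviews/widget.js"]))   # set()
```

The trade-off is that loose substrings can also match unrelated scripts, which is why checking each fingerprint against known installs matters.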

Some apps were harder to detect because they're not injected via <script> tags - they're baked into the theme's Liquid files and only appear in the rendered DOM after the page loads. For those I had no choice but to use Puppeteer, which adds significant overhead since you're spinning up a full headless browser for each store instead of doing a simple HTML fetch.

In hindsight, I think it's possible to reliably detect most apps with a simpler BeautifulSoup + Python pipeline and avoid Puppeteer entirely - but I didn't fully think that through during this project. It's one of the things I'll improve in the next version.

My goal is to have a "set and forget" type of scraper where I just feed it stores and it handles everything properly and efficiently. Maybe that's utopian, though.

Step 2 - PageSpeed Insights - ~395h

After discovery, every valid store went through Google's PageSpeed Insights API. I measured four variants per store:

Home page - mobile
Home page - desktop
Product page - mobile
Product page - desktop

To get reliable numbers, I ran each variant 5 times and averaged the results. PSI scores can vary run to run, especially on mobile, so a single measurement isn't trustworthy at scale.
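
A sketch of the run-and-average idea against the PageSpeed Insights v5 endpoint. It assumes the averaged number is the Lighthouse performance score, which the post does not state explicitly; the function names are hypothetical, and the fetch parameter exists only so the averaging logic can be exercised without network access.

```python
import json
import statistics
import urllib.parse
import urllib.request

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"


def build_psi_url(page_url: str, strategy: str, api_key: str) -> str:
    """Build the request URL; strategy is 'mobile' or 'desktop'."""
    query = urllib.parse.urlencode({"url": page_url, "strategy": strategy, "key": api_key})
    return f"{PSI_ENDPOINT}?{query}"


def performance_score(payload: dict):
    """Return the performance score on a 0-100 scale, or None if missing."""
    try:
        score = payload["lighthouseResult"]["categories"]["performance"]["score"]
    except (KeyError, TypeError):
        return None
    return None if score is None else score * 100


def fetch_psi(page_url: str, strategy: str, api_key: str) -> dict:
    """Perform one live PSI request and return the decoded JSON body."""
    url = build_psi_url(page_url, strategy, api_key)
    with urllib.request.urlopen(url, timeout=120) as response:
        return json.load(response)


def average_score(page_url, strategy, api_key, runs=5, fetch=fetch_psi):
    """Average the usable scores from `runs` measurements; None if all failed."""
    scores = []
    for _ in range(runs):
        score = performance_score(fetch(page_url, strategy, api_key))
        if score is not None:
            scores.append(score)
    return statistics.mean(scores) if scores else None
```

Skipping runs that return no score, rather than counting them as zero, keeps one failed request from dragging down a store's average.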

That means 20 API calls per store. Across 47,420 stores that's roughly 948,400 total PSI calls.

Each store took around 30 seconds to fully process. At that rate: 47,420 stores × 30 seconds = 1,422,600 seconds - roughly 395 hours of PSI processing alone. I rotated across 50+ API keys to avoid rate limits and keep the pipeline running continuously.
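
The arithmetic above, plus one simple way to rotate keys, as a short sketch. The round-robin scheme and the placeholder key names are assumptions; the post only says keys were rotated.

```python
from itertools import cycle

STORES = 47_420
VARIANTS = 4            # home/product page x mobile/desktop
RUNS_PER_VARIANT = 5
SECONDS_PER_STORE = 30

calls_per_store = VARIANTS * RUNS_PER_VARIANT      # 20
total_calls = STORES * calls_per_store             # 948,400
total_hours = STORES * SECONDS_PER_STORE / 3600    # roughly 395

# Round-robin rotation: each request takes the next key in the pool.
# The key names are placeholders, not real credentials.
key_pool = cycle(["KEY_A", "KEY_B", "KEY_C"])


def next_key() -> str:
    """Return the next API key in round-robin order."""
    return next(key_pool)


print(calls_per_store, total_calls, round(total_hours))  # 20 948400 395
```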

Step 3 - Niche classification (the hacky part) - ~50h

I needed to classify every store into a niche category. I tried doing it programmatically - parsing product descriptions, tags, domain names, and meta content - but the results were too unreliable. Too many edge cases, too much noise.

So I did something I'm not entirely proud of, but it worked.

I already have a Cursor Pro subscription, which gives unlimited access to their AI in "auto" mode. So I built a RobotJS script that automated the following loop:

Fetch the full HTML of each store and save it to a local folder
Send batches of 20 HTML files to Cursor's AI context
Ask it to classify each store into one of my predefined niches based on the HTML content
Loop overnight, unsupervised
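
The batching part of that loop is easy to sketch on its own. This is Python rather than the RobotJS script described above, it covers only grouping files and composing a request, and the niche names and file names are invented for illustration.

```python
from itertools import islice

BATCH_SIZE = 20  # from the post: 20 HTML files per request


def batched(items, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def build_prompt(file_names, niches) -> str:
    """Compose one classification request for a batch of saved HTML files."""
    lines = ["Classify each store into exactly one of: " + ", ".join(niches) + "."]
    lines += [f"- {name}" for name in file_names]
    return "\n".join(lines)


files = [f"store_{i}.html" for i in range(45)]
for batch in batched(files):
    prompt = build_prompt(batch, ["Fashion", "Beauty", "Home"])
    print(len(batch), prompt.splitlines()[0])
```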

I also tested passing the URL directly and having Cursor use web fetch to classify live - that worked fine too, but sending the pre-fetched HTML felt faster. I didn't benchmark it properly though.

I have a video of it running overnight. It's one of those things that looks completely unhinged and also somehow works perfectly.

If I redo this, I'll integrate a proper AI classification step directly into the pipeline via API - cleaner, faster, no desktop automation required. But for this project, it got the job done.

Data cleaning

Raw scraped data is never clean. A few things I had to handle:

Failed PSI scores - some stores returned null or error responses. These were removed programmatically before analysis.
Duplicate theme column - a bug in my discovery pipeline was writing the theme name to two separate columns, which was inflating the database size significantly. I caught it and cleaned it before the final export.
Product count cap - Shopify's product endpoints are paginated. My scraper fetched the first page, which caps at 30 products per store. Any store showing "30" in the product count column likely has 30 or more - the real number is unknown. I flagged this in the analysis.
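
Those three clean-up rules could look something like this on a list of row dictionaries. All column names here (psi_mobile_home, theme_name_copy, product_count, and so on) are hypothetical stand-ins, since the post does not list the dataset's actual columns.

```python
PRODUCT_PAGE_CAP = 30  # first page of the product feed, per the post


def clean_rows(rows, psi_columns, duplicate_column="theme_name_copy"):
    """Drop rows with failed PSI scores, remove the duplicate column, flag capped counts."""
    cleaned = []
    for row in rows:
        # Rule 1: discard stores where any PSI measurement is missing.
        if any(row.get(column) is None for column in psi_columns):
            continue
        # Rule 2: drop the column that duplicated the theme name.
        row = {key: value for key, value in row.items() if key != duplicate_column}
        # Rule 3: a count at the cap means "30 or more", so flag it.
        row["product_count_capped"] = row.get("product_count", 0) >= PRODUCT_PAGE_CAP
        cleaned.append(row)
    return cleaned


rows = [
    {"url": "a.example", "psi_mobile_home": 61.0, "theme_name": "Dawn",
     "theme_name_copy": "Dawn", "product_count": 30},
    {"url": "b.example", "psi_mobile_home": None, "theme_name": "Sense",
     "theme_name_copy": "Sense", "product_count": 4},
]
print(clean_rows(rows, psi_columns=["psi_mobile_home"]))
```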

The final clean dataset: 47,420 stores, 73 columns, ~300MB.

Hardware

I used two machines for this:

- My personal laptop: an Acer Predator Helios 300 (i7-9750H, 32GB RAM, Nvidia RTX 2060)
- My portable gaming device: a Legion Go (AMD Ryzen Z1 Extreme, 16GB RAM)

I installed my scraping software on both and split the URLs between them so the work would finish faster.

Nothing special. The PSI step is almost entirely network-bound so the hardware barely matters there, but having 32GB of RAM made handling the full dataset in memory much smoother during the analysis phase.

On scraping at scale

If you're somewhat technical, you already know that writing a scraper and scraping a few hundred stores is the easy part. Managing thousands of stores over weeks is where things get complicated.

What happens when a store that was valid yesterday is now closed? Do you retry it? How do you flag it as permanently dead vs. temporarily unavailable? What if the PSI API returns a weird result for one store - does it corrupt the batch? How do you resume a 47,000-store pipeline after a crash without starting from zero? What if you forgot to fetch some specific data - is it worth it to re-run everything?

Most of the real engineering in this project wasn't the scraping itself - it was building the state management, the retry logic, the checks that I wasn't leaving valuable data unscraped, and the data integrity checks that kept the pipeline trustworthy across months of intermittent runs.
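
One common way to get resumability and retry tracking is a small SQLite table with one row per store. The sketch below shows that general pattern, not the state layer this project used; the table layout, status names, and the retry budget of 3 are all assumptions.

```python
import sqlite3

MAX_ATTEMPTS = 3  # assumed retry budget before a store is marked dead


def open_state(path=":memory:"):
    """Open (or create) the state database; use a file path to survive crashes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS stores ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " attempts INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, urls):
    """Add URLs; the primary key silently rejects duplicates."""
    db.executemany("INSERT OR IGNORE INTO stores (url) VALUES (?)", [(u,) for u in urls])
    db.commit()


def next_pending(db):
    """Return the next unfinished URL, least-retried first, or None when done."""
    row = db.execute(
        "SELECT url FROM stores WHERE status = 'pending' ORDER BY attempts, url LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def record_result(db, url, ok):
    """Mark success, or count a failure and retire the store after MAX_ATTEMPTS."""
    if ok:
        db.execute("UPDATE stores SET status = 'done' WHERE url = ?", (url,))
    else:
        db.execute(
            "UPDATE stores SET attempts = attempts + 1,"
            " status = CASE WHEN attempts + 1 >= ? THEN 'dead' ELSE 'pending' END"
            " WHERE url = ?",
            (MAX_ATTEMPTS, url),
        )
    db.commit()
```

Because every result is committed as it arrives, restarting after a crash is just a matter of calling next_pending again; nothing already marked done or dead is reprocessed.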

And even after all that, there was still the data cleaning and analysis phase.

If you're curious about what I actually found by analyzing all of this - the themes, niches, speed scores, apps, and more - the full study is here: 47,000+ Shopify stores analyzed. 800 hours. Here's the data.

If you'd like to try the scraper for yourself, I have published a Chrome Extension with most of its logic built in, so you can try it on a few Shopify stores.
