How to scrape data from any website or mobile app

The 6-Step Data Scraping System Every Engineer Should Master

Hey, I'm Vincent. Welcome to the first edition of my newsletter, Profitable Programming, where I explore the intersection of programming, entrepreneurship, and turning technical skills into profitable ventures. After quitting my job in 2024 to focus on my side businesses, I decided to document this journey transparently – sharing both successes and learning opportunities along the way.

Today, I'm diving into a skill that has been fundamental to my career and entrepreneurial success: data scraping. I've built scrapers for everything from high-growth B2C scale-ups to my own bootstrapped businesses. Back when I was 14 years old, I was already scraping video game data to build my own analytics platforms. At Joko, I engineered a large-scale coupon catalog system that worked across thousands of differently structured websites. Since then, I've launched two successful businesses centered around transforming public data into user-friendly products.

Through these experiences, I've developed a battle-tested system that works for virtually any data extraction challenge you might encounter. I call it the S.C.R.A.P.E system, and before diving into how it works, let me explain why this skill is worth mastering.

Why you might want to master scalable data extraction

The “AI revolution” powered by today's LLMs, arguably the most significant breakthrough of the past decade, was built thanks to the ability to extract data from public online sources. Access to fresh, structured data isn't just nice to have; it's a competitive weapon that transforms businesses. Here's what mastering this skill unlocks:

  • Monetization Opportunities: Companies pay premium rates for specialized, structured datasets that solve business problems. A simple business model I’ve seen work multiple times—scraping valuable data and exposing it through a paid API—can generate large profits with minimal effort. Consider how many successful businesses have been built by simply organizing LinkedIn data more effectively than LinkedIn itself: Lemlist, Waalaxy, Scraping.io, …

  • AI Enhancement: Your products gain an immediate competitive advantage when powered by proprietary data. While competitors use generic AI training, your systems can answer industry-specific questions by connecting to exclusive datasets through techniques like retrieval-augmented generation (RAG), delivering insights others simply cannot match. This was one of our competitive advantages at Joe AI, a startup I previously worked at. Having verticalized, real-estate-specific data allowed us to beat the competition in France.

  • Programmatic SEO: The highest-ROI customer acquisition channel remains organic search. With an industry-specific dataset, you can automatically generate thousands of targeted pages that precisely match search intent, dominating your niche in Google results. When combined with AI content generation, a single scraped dataset can power enough unique pages to drive six-figure monthly traffic. I built my latest SaaS product, BlogSEO, entirely around this proven concept—helping businesses scale their content marketing by connecting AI systems to valuable data sources for truly differentiated content that ranks.

Understanding data sources

Now that you understand the value of data scraping, let's get into the technical foundations. The SCRAPE system begins with a fundamental truth: any data that appears on your screen must come from somewhere accessible.

In other words, any information visible in a digital interface, whether a stock price chart in a mobile app or real estate listings on a website, comes from an accessible source. These sources typically fall into two categories:

  1. External Data Sources (99% of cases): Information stored in databases, remote servers, or web services, usually exposed through APIs

  2. Native Data: Hard-coded constants within application code (rare, because hard-coding data is very limiting, but occasionally encountered)

Most data you'll encounter is hosted on remote servers and exposed through APIs, which means we can communicate with those APIs just as the intended client would.

The SCRAPE system

1. Sniff the Network Traffic

The first step involves intercepting and analyzing the data exchange between client and server:

For Websites:

  • Open Chrome DevTools (F12) or equivalent browser tools

  • Navigate to the "Network" tab

  • Check "Preserve log" and clear previous requests (the clear button or CTRL + L)

  • Browse to the page containing your target data and perform necessary actions

  • Observe the network requests triggered during your interaction

For Mobile Apps:

  • Install a proxy server like Charles or mitmproxy on your computer

  • Configure SSL proxying by setting up certificates (You can ask your favorite chat assistant for more detailed instructions)

  • Connect your smartphone to the proxy via network settings

  • Install SSL certificates if required

  • Restart the target app and navigate to screens displaying your data of interest

  • Monitor the captured traffic for relevant requests

2. Cram Through Captured Traffic

Once you've captured the client-server communication:

  • Filter out noise (analytics requests, tracking pixels, etc.)

  • Identify which domain hosts the API endpoints

  • Examine request paths and response payloads for relevance. In Chrome’s Network tab, CMD/CTRL + F lets you search the captured payloads for a specific value you see displayed in the UI

  • For server-side rendered websites, you won’t see any API call containing the value you’re looking for apart from the GET request fetching the HTML document itself. In that case, parse the HTML directly with a lightweight tool like the jsdom package (a minimal sketch follows below). This setup is increasingly common, as popular web frameworks like NextJS make server-side rendering the default behavior.
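
To make that concrete, here is a minimal sketch of the HTML-parsing approach, assuming Node 18+ (for the built-in fetch) and the jsdom package; the URL and the .listing-title selector are placeholders for whatever you identified in the Network tab:

```typescript
// Minimal server-side HTML scraping sketch.
// Assumptions: Node 18+ (global fetch) and the jsdom package (npm install jsdom).
// The URL and the .listing-title selector are placeholders.
import { JSDOM } from "jsdom";

async function scrapeTitles(url: string): Promise<string[]> {
  // Fetch the server-rendered HTML document, just like the browser would
  const res = await fetch(url, {
    headers: { "User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)" },
  });
  if (!res.ok) throw new Error(`Unexpected status ${res.status} for ${url}`);

  // Parse it with a lightweight DOM implementation, no headless browser needed
  const { document } = new JSDOM(await res.text()).window;
  return [...document.querySelectorAll(".listing-title")].map(
    (el) => el.textContent?.trim() ?? ""
  );
}

scrapeTitles("https://example.com/listings").then(console.log);
```

You get the familiar DOM API without spinning up a headless browser, a point we will come back to in the pitfalls section.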

3. Reverse-Engineer the API

This investigative phase requires experimentation and pattern recognition:

  • Test modifications to request parameters to understand their effect on responses

  • Use the browser console to experiment with fetch requests for web scrapers (see the sketch after this list)

  • Extract working requests as cURL commands to verify functionality outside the browser (in Chrome: right click > Copy > Copy as cURL)

  • Design the sequence of requests that will form the foundation of your scraper

  • Be mindful of cached responses when experimenting, especially in mobile apps where caching is common
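
For instance, here is the kind of quick experiment you might run from the browser console; the /api/listings endpoint and its page and perPage parameters are hypothetical stand-ins for whatever you actually found in the Network tab:

```typescript
// A console experiment against a hypothetical endpoint found in the Network tab.
// Change one parameter at a time (page, perPage, filters, ...) and compare responses.
const res = await fetch("https://example.com/api/listings?page=2&perPage=50", {
  headers: {
    Accept: "application/json",
    // If the copied cURL command included auth headers (Authorization, Cookie, ...),
    // reproduce them here, otherwise the API may answer with 401/403.
  },
});
console.log(res.status, await res.json());
```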

4. Analyze and Standardize

With a reliable data extraction method established, you are now ready to normalize the fetched data and store it in your own database:

  • Map the response fields to your target schema (this usually mirrors your database schema; see the sketch after this list)

  • Implement data validation and error handling

  • Standardize field formats (dates, currencies, units)
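
As an illustration, here is a minimal mapping-and-validation sketch; the raw field names (price_cents, listed_at) and the Listing target shape are hypothetical and should be adapted to the real payload and to your database schema:

```typescript
// Minimal mapping/validation sketch. The raw field names (price_cents, listed_at)
// and the Listing target shape are hypothetical placeholders.
interface Listing {
  id: string;
  priceEur: number;
  listedAt: Date;
}

function toListing(raw: Record<string, unknown>): Listing {
  const priceCents = Number(raw.price_cents);
  const listedAt = new Date(String(raw.listed_at));

  // Fail loudly on malformed records instead of silently storing garbage
  if (!raw.id || Number.isNaN(priceCents) || Number.isNaN(listedAt.getTime())) {
    throw new Error(`Invalid listing payload: ${JSON.stringify(raw)}`);
  }

  // Standardized units and formats: euros as decimals, dates as Date objects
  return { id: String(raw.id), priceEur: priceCents / 100, listedAt };
}
```

Rejecting malformed records at this boundary keeps bad data out of your database and makes upstream API changes visible immediately.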

5. Plug and Deploy

Operationalizing your scraper involves:

  • Deploying it: for systems of reasonable size, I recommend serverless environments like AWS Lambda for cost efficiency and simplicity, as they come with a generous free tier

  • Setting up CRON schedules for periodic execution: I do this with AWS EventBridge, declared as infrastructure-as-code in a serverless.yml file in my repository (see the sketch below)
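
For reference, here is a minimal sketch of what such a serverless.yml could look like with the Serverless Framework; the service name, handler path, runtime, and schedule are placeholders to adapt to your own setup:

```yaml
# Minimal sketch of a serverless.yml for the Serverless Framework.
# Service name, handler path, runtime, and schedule are placeholders.
service: my-scraper

provider:
  name: aws
  runtime: nodejs20.x
  region: eu-west-1

functions:
  scrape:
    handler: src/handler.run      # the exported entry point of your scraper
    timeout: 300                  # scrapers usually need more than the 6s default
    events:
      - schedule: rate(6 hours)   # creates the EventBridge rule for you
```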

6. Enhance and Bypass

Post-deployment optimization includes:

  • Implementing observability tools like Sentry to track failed requests

  • Addressing authentication challenges through Bearer token header management and cookie jars

  • Implementing proxy rotation for high-volume scraping. Cheap services like webshare are often enough to bypass rate limitations

  • Adding fault tolerance with smart retry mechanisms where relevant (a minimal sketch follows this list)
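
Here is a minimal retry sketch, assuming Node 18+’s built-in fetch; the backoff values and the choice to retry only on network errors, 429s, and 5xx responses are illustrative defaults, not a universal policy:

```typescript
// Minimal retry sketch, assuming Node 18+'s built-in fetch.
// Backoff values and the "retry on network error / 429 / 5xx" policy are
// illustrative choices, not a universal recommendation.
async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    // Network errors (DNS failures, resets, ...) are treated as retryable
    const res = await fetch(url, init).catch(() => null);
    if (res?.ok) return res;

    const retryable = !res || res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= maxRetries) {
      throw new Error(
        `Request failed after ${attempt + 1} attempt(s): ${res?.status ?? "network error"}`
      );
    }

    // Exponential backoff with a little jitter to avoid hammering the server
    const delayMs = 2 ** attempt * 1000 + Math.random() * 250;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

Proxy rotation and observability hooks can then be layered around this single entry point rather than scattered across the codebase.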

Common pitfalls: the browser automation trap

Many beginners gravitate toward browser automation tools like Puppeteer, Selenium, or Playwright instead of crafting reverse-engineered API requests. Puppeteer, for example, is a popular browser automation library developed by Google that lets you navigate webpages with the Chromium browser and execute JavaScript DOM commands like document.querySelectorAll() to read data from the rendered page. While these tools offer a lower barrier to entry, they come with significant drawbacks:

  • Resource Intensive: Web browsers consume substantial memory and CPU because the entire browser rendering and API stack has to run

  • Deployment Challenges: Running headless browsers in production environments is much more complex than deploying a service which performs simple network requests.

  • Cost Inefficiency: Higher computational requirements lead to increased infrastructure costs. What’s more, deploying Chromium instances in serverless environments is notoriously hard, so you cannot benefit from the “only pay for what you use” pricing models of cloud providers.

  • Fragility: DOM changes and rendering inconsistencies cause frequent breakages

  • Data Structure Issues: Extracting data from rendered pages often requires complex selectors and transformations

Direct API communication is almost always more efficient, resilient, and cost-effective when implemented correctly. And as highlighted previously for server-side rendered websites, even HTML-based extraction only needs a simple HTTP GET request paired with a lightweight DOM parsing library, which delivers better results without the overhead of a full browser stack.

Data scraping, when executed correctly, offers tremendous value if you want to build great, profitable products. By following the SCRAPE system, you can build reliable, efficient data pipelines that transform raw web content into actionable intelligence.