Revolutionizing Web Scraping: A Leap Forward with Multimodal Language Models

Somnath Banerjee
5 min readDec 10, 2023

Authors: Somnath Banerjee and Matt Herich

Web scraping, the extraction of data from websites, is a crucial practice in various industries for gathering valuable data and insights. From retailers analyzing competitor pricing to sales teams prospecting leads on LinkedIn, web scraping fuels valuable insights and informed decision-making.

Despite its widespread use, web scraping comes with challenges, especially in maintaining the scraper’s accuracy amidst HTML page changes. Teams often struggle to keep up with modifications in HTML and CSS classes, which can render scrapers ineffective.

Enter the transformative potential of multimodal large language models (LLMs) like OpenAI’s GPT-4 and Google’s Gemini. These advanced models possess image understanding capabilities, enabling them to extract data directly from website screenshots, bypassing the need for complex HTML/CSS parsing. By utilizing screenshots of webpages, these LLMs can replace hundreds or thousands of lines of code with just a few natural language sentences, revolutionizing the web scraping landscape. This paradigm shift dramatically simplifies the web scraping technology and reduces development and maintenance costs significantly.

In this article, we do a comparative analysis between GPT-4 and BARD (currently powered by Gemini Pro) in the following information extraction tasks across Sales and Real Estate use cases.

Sales use cases

  1. Extract customer names from a company homepage
  2. Obtain company information from a logo
  3. Extract company information from LinkedIn Sales Navigator
  4. Extract people information from LinkedIn Sales Navigator

Real Estate use cases

  1. Extract property information from a Zillow listing
  2. Extract property information from an Airbnb listing

In many of the information extraction tasks GPT-4 outperforms BARD. Only in the case of obtaining company information from a given logo BARD performs much better.

Conclusion

The capability to extract information directly from webpage screenshots represents a groundbreaking shift in technology. This innovation holds the potential to streamline web scraping processes, boosting efficiency, and substantially reducing costs. GPT-4 continues to stand out as the leading model for extracting information from web pages, while BARD showcases exceptional abilities in recognizing company logos, thereby broadening the range of potential applications for this remarkable technology.

Appendix

Provided below are the exact prompts and screenshots used for readers to explore. You need ChatGPT plus account to run the GPT-4 tests. Keep in mind that GPT-4 and BARD are not deterministic, and results may vary based on the dynamic nature of these models and potential changes in the image when downloaded from medium platform.

Extract customer names from a company homepage

Prompt

Attached screenshot is a company homepage. At the bottom of the page there are logos of their customers. Extract all the customer names corresponding to those logos

GPT-4

Note: We need to update the prompt asking ChatGPT not to use code interpreter

BARD

Note: GPT-4 got all the customer names correct whereas BARD got DCP Midstream wrong

Obtain company information from a logo

Prompt

Which company's logo is this?

GPT-4

BARD

We found GPT-4 often fails in this tasks, whereas BARD excels at recognizing company logos.

Interestingly we found that Google image search performs significantly better when searching by company logos, suggesting it could provide better training data for Gemini. Below is a comparison of Google and Bing searches for the Opsera logo.

Extract company information from LinkedIn Sales Navigator

Prompt


Extract all the texts from the screenshot. The texts are a list of company name, linkedin industry, number of employees and 1 line about the company. Extract all the texts you can and put it in a list format

GPT-4

BARD

Extract people information from LinkedIn Sales Navigator

Prompt

Extract all the texts from the screenshot. The texts are a list of person names, their title, time in current role, time in current company, a short description about them and their experience. Extract all the texts you can and put it in a list format

GPT-4

BARD

Extract property information from a Zillow listing

Prompt


Attached is the screenshot of a property listing page on Zillow. Extract property price, address, beds, bath, square feet, HOA information from the screenshot

GPT-4

BARD

Unable to extract information from this page

Extract property information from an Airbnb listing

Prompt

Attached is a screenshot of an Airbnb listing. Extract nightly rate, number of reviews, rating, city state, guests, beds and baths from the screenshot.

GPT-4

BARD

--

--