Real Estate Scraping

The world’s leading real estate sites are a treasure trove of valuable data. The database of a popular US real estate site might contain information on more than 100M homes, including homes for sale, homes for rent, and even homes not currently on the market. Some sites also publish rent and property value estimates, such as Zillow’s “Zestimates”. Scraping this data helps owners and customers plan better by estimating how property prices may move in the coming years.

When it comes to buying or renting a property, price comparison is usually the first thing that comes to mind. Housing sites let you compare prices across all listings in an area and provide basic information such as the type of house, the number of rooms, the size, and a short description.

Why Crawl Data from Real Estate Sites?

Large property listing companies cover entire regions and work with millions of properties. If you are a real estate agent, crawling data from a major real estate listing website is more practical than trying to gather it manually.

You can also build machine learning models to predict property prices, then compare your predictions with Zillow™’s Zestimates™ to see which is closer to the actual values.
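One simple way to judge which estimate is closer is to compute the mean absolute error against known sale prices. The sketch below assumes you already have three lists of equal length. The function name and every figure in it are made up for illustration and are not real Zillow data.

```python
def mean_absolute_error(estimates, actual_prices):
    """Average absolute gap between estimates and actual sale prices."""
    if len(estimates) != len(actual_prices) or not actual_prices:
        raise ValueError("Need two non-empty lists of equal length")
    total_gap = sum(abs(e - a) for e, a in zip(estimates, actual_prices))
    return total_gap / len(actual_prices)


# Hypothetical numbers, for illustration only
actual_prices = [300000, 720000, 450000]
model_predictions = [310000, 700000, 455000]
zestimates = [295000, 735000, 430000]

model_error = mean_absolute_error(model_predictions, actual_prices)
zestimate_error = mean_absolute_error(zestimates, actual_prices)
closer = "model" if model_error < zestimate_error else "Zestimate"
print(f"Model MAE: {model_error:.0f}, Zestimate MAE: {zestimate_error:.0f}, closer: {closer}")
```

Mean absolute error is only one possible yardstick. A percentage-based error may be fairer when the properties span very different price ranges.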

How to Scrape Real Estate Data Using Python?

If you have followed any of our previous “How to crawl or scrape” articles, you may already have the necessary setup on your computer. If not, I recommend following this article to set up Python, its packages, and a text editor before you get your hands dirty with the code.

Where is The Code?

Without further ado, here is Python code for scraping real estate data that will help you extract information from a property listing website. After the code, I will show you how to run it and what you will get once you do.

[code language="python"]
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Ignore SSL certificate errors (acceptable for a tutorial, not for production)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user
url = input('Enter Zillow House Listing Url- ')

# Make the website believe that you are accessing it using a Mozilla browser
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()

# Create a BeautifulSoup object of the HTML page for easy extraction of data
soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
property_json = {}
property_json['Details_Broad'] = {}
property_json['Address'] = {}

# Extract the title of the property listing (first <title> tag only)
for title in soup.find_all('title'):
    property_json['Title'] = title.text.strip()
    break

# Extract the short description from the description meta tag
for meta in soup.find_all('meta', attrs={'name': 'description'}):
    property_json['Detail_Short'] = meta['content'].strip()

# Extract the long description (missing on some listings)
for div in soup.find_all('div', attrs={'class': 'character-count-truncated'}):
    property_json['Details_Broad']['Description'] = div.text.strip()

# The first JSON-LD script holds rooms, size, and address; the second holds price and image
for (i, script) in enumerate(soup.find_all('script',
                                           attrs={'type': 'application/ld+json'})):
    if i == 0:
        json_data = json.loads(script.text)
        property_json['Details_Broad']['Number of Rooms'] = json_data['numberOfRooms']
        property_json['Details_Broad']['Floor Size (in sqft)'] = json_data['floorSize']['value']
        property_json['Address']['Street'] = json_data['address']['streetAddress']
        property_json['Address']['Locality'] = json_data['address']['addressLocality']
        property_json['Address']['Region'] = json_data['address']['addressRegion']
        property_json['Address']['Postal Code'] = json_data['address']['postalCode']
    if i == 1:
        json_data = json.loads(script.text)
        property_json['Price in $'] = json_data['offers']['price']
        property_json['Image'] = json_data['image']
        break

# Save the extracted fields as JSON
with open('data.json', 'w') as outfile:
    json.dump(property_json, outfile, indent=4)

# Save the full HTML so that more data points can be explored later
with open('output_file.html', 'wb') as file:
    file.write(html)

print('----------Extraction of data is complete. Check json file.----------')
[/code]

To run the code given above, save it in a file with the .py extension, such as propertyScraper.py. Then, from the terminal, run the command:

[code language="bash"]
python propertyScraper.py
[/code]

When you run it, you will be prompted to enter the URL of a property listing. This is the webpage that the program will crawl for data. We used two links and scraped the data of two properties. Here are the links:

  1. https://www.zillow.com/homedetails/638-Grant-Ave-North-Baldwin-NY-11510/31220792_zpid/
  2. https://www.zillow.com/homedetails/10-Walnut-St-Arlington-MA-02476/56401372_zpid/

The JSON files obtained by running the code on the given links are shown in a later section.

Code Explanation

Before going into what the code returns, it is important to understand the code itself. As usual, we first request the given URL and capture the entire HTML, which we convert into a BeautifulSoup object. We then extract specific divs, scripts, titles, and other tags with specific attributes: the title comes from the title tag, the short description from the description meta tag, the long description from a div, and the remaining fields from the JSON embedded in script tags of type application/ld+json. This lets us pinpoint exactly the information we want to extract from a page.

You can see that we have also extracted an image link for each property. This is deliberate: for something like real estate, images are just as valuable as the other information. While we have extracted several fields from the real estate listing pages, the HTML page contains many more data points. Hence we also save the HTML content locally so that you can go through it and crawl more information.
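The script-tag extraction step can be tried without any network access. The sketch below uses Python’s built-in html.parser module instead of BeautifulSoup so that it runs with no extra packages. The JsonLdCollector class is a hypothetical helper, and SAMPLE_HTML is invented markup that only mimics the shape of a listing page.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self.in_json_ld = False
        self.buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_json_ld = True
            self.buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them
        if self.in_json_ld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_json_ld:
            self.documents.append(json.loads("".join(self.buffer)))
            self.in_json_ld = False


# Hypothetical page fragment, not copied from a real listing
SAMPLE_HTML = """
<html><head><title>1 Example St | Demo</title>
<script type="application/ld+json">
{"numberOfRooms": 3, "address": {"streetAddress": "1 Example St", "postalCode": "00000"}}
</script>
<script type="application/ld+json">
{"offers": {"price": 123456}, "image": "https://example.com/photo.jpg"}
</script>
</head><body></body></html>
"""

collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
details, offer = collector.documents
print(details["address"]["streetAddress"], offer["offers"]["price"])
```

Relying on the position of each script tag is fragile. If a site reorders its tags, checking which keys each parsed document contains would be a safer way to tell them apart.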
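As one example of mining the saved HTML further, the sketch below lists every image URL found in img tags. It again uses the standard-library html.parser module. ImageCollector and collect_image_urls are hypothetical names, and the sample markup is invented, so treat this as a starting point and not as a description of how any real listing page is built.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Records the src attribute of every <img> tag, skipping duplicates."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src not in self.urls:
                self.urls.append(src)


def collect_image_urls(html_text):
    parser = ImageCollector()
    parser.feed(html_text)
    return parser.urls


# Hypothetical markup standing in for the saved page
sample = (
    '<div><img src="https://example.com/a.jpg">'
    '<img src="https://example.com/b.jpg"/>'
    '<img src="https://example.com/a.jpg"></div>'
)
image_urls = collect_image_urls(sample)
print(image_urls)
```

To run it on the saved page, read output_file.html with open('output_file.html', encoding='utf-8') and pass the text to collect_image_urls. Pages that load images lazily may keep the URL in a different attribute, which you would need to check in the saved HTML.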

Some of The House Listings that We Scraped

As mentioned before, we crawled a few property listings to show you how the data scraped by Python looks in JSON format. Each JSON can be matched to its property through the address and title fields. Now let’s talk about the data points that we scraped.

We got an image of the property. Many images are available on each listing page, but we took only the top image for each listing. We also got the price (in $) that the property is listed at, its title, and a short description that helps you form a mental picture of it.

Along with this, we scraped the address, broken down into four separate parts: the street, the locality, the region, and the postal code. Another details field has multiple subfields: the number of rooms, the floor size, and a long description. As we found once we scraped multiple pages, the long description is missing in some cases.

[code language="json"]
{
    "Details_Broad": {
        "Number of Rooms": 4,
        "Floor Size (in sqft)": "1,728"
    },
    "Address": {
        "Street": "638 Grant Ave",
        "Locality": "North baldwin",
        "Region": "NY",
        "Postal Code": "11510"
    },
    "Title": "638 Grant Ave, North Baldwin, NY 11510 | MLS #3137924 | Zillow",
    "Detail_Short": "638 Grant Ave , North baldwin, NY 11510-1332 is a single-family home listed for-sale at $299,000. The 1,728 sq. ft. home is a 4 bed, 2.0 bath property. Find 31 photos of the 638 Grant Ave home on Zillow. View more property details, sales history and Zestimate data on Zillow. MLS # 3137924",
    "Price in $": 299000,
    "Image": "https://photos.zillowstatic.com/p_h/ISzz1p7wk4ktye1000000000.jpg"
}
[/code]

[code language="json"]
{
    "Details_Broad": {
        "Description": "Three dormer single family home situated in Arlington's Brattle neighborhood between Arlington Heights and Arlington Center. Built in the 1920s this home offers beautiful period details, hard wood floors, beamed ceilings, fireplaced living room with private sunroom, a formal dining room, three large bedrooms, an office and two full baths. The potential of enhancing this property to expand living space and personalize to your personal taste is exceptional. Close to Minuteman Commuter Bikeway, Rt 77 and 79 Bus lines, schools, shopping and restaurants. Virtual staging and virtual renovation photos provided to help you visualize.",
        "Number of Rooms": 4,
        "Floor Size (in sqft)": "2,224"
    },
    "Address": {
        "Street": "10 Walnut St",
        "Locality": "Arlington",
        "Region": "MA",
        "Postal Code": "02476"
    },
    "Title": "10 Walnut St, Arlington, MA 02476 | MLS #72515880 | Zillow",
    "Detail_Short": "10 Walnut St , Arlington, MA 02476-6116 is a single-family home listed for-sale at $725,000. The 2,224 sq. ft. home is a 4 bed, 2.0 bath property. Find 34 photos of the 10 Walnut St home on Zillow. View more property details, sales history and Zestimate data on Zillow. MLS # 72515880",
    "Price in $": 725000,
    "Image": "https://photos.zillowstatic.com/p_h/ISifzwig3xt2re1000000000.jpg"
}
[/code]
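Two details of these records matter if you plan to analyse them: the floor size is stored as a string with a thousands separator, and the long description may be absent. The sketch below shows one possible way to flatten a record while handling both. The normalize_listing function and its output keys are hypothetical, and it assumes the floor size is always a whole number.

```python
def normalize_listing(record):
    """Return a flat copy of a scraped record with numeric fields converted."""
    details = record.get("Details_Broad", {})
    address = record.get("Address", {})
    floor_size = details.get("Floor Size (in sqft)")
    if isinstance(floor_size, str):
        # "1,728" -> 1728 (assumes a whole number of square feet)
        floor_size = int(floor_size.replace(",", ""))
    return {
        "street": address.get("Street"),
        "postal_code": address.get("Postal Code"),
        "rooms": details.get("Number of Rooms"),
        "floor_size_sqft": floor_size,
        "price_usd": record.get("Price in $"),
        # Some listings have no long description, so this may be None
        "description": details.get("Description"),
    }


# Trimmed copy of the 638 Grant Ave record from this article
scraped = {
    "Details_Broad": {"Number of Rooms": 4, "Floor Size (in sqft)": "1,728"},
    "Address": {"Street": "638 Grant Ave", "Postal Code": "11510"},
    "Price in $": 299000,
}
clean = normalize_listing(scraped)
print(clean["floor_size_sqft"], clean["description"])
```

Postal codes are kept as strings on purpose, since values such as 02476 would lose their leading zero if converted to integers.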

Scraping Real Estate Data on A Large Scale

Code like this lets you crawl details for only a few specific properties at a time, so you could use it to check up manually on properties you are interested in from time to time. However, if you are looking to target a specific region in the USA, or internationally, you would need an expert web scraping service provider or data scraping tools to help you gather property listings from a number of websites.

PromptCloud, as a leading web scraping provider, believes that web scraping solutions should be hassle-free and should contain only two steps: the client gives the requirement and receives clean data.

Also, note that we have used both “crawler” and “scraper” in this blog. Don’t be confused: the two mean more or less the same thing, but if you want to know more, you can check our blog on web data crawling vs web data scraping.
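For a handful of listings, a small loop with a pause between requests and basic error handling may be enough. The sketch below is a minimal illustration of that idea and not a production crawler. scrape_many, fake_scrape, and the example.com URLs are all hypothetical, and the stand-in scraper exists only so that the sketch runs offline.

```python
import time


def scrape_many(urls, scrape_one, delay_seconds=2.0):
    """Run scrape_one on each URL, pausing between requests and collecting failures."""
    results, failures = {}, {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # space out the requests
        try:
            results[url] = scrape_one(url)
        except Exception as error:  # one bad page should not stop the batch
            failures[url] = str(error)
    return results, failures


# Stand-in scraper so the sketch runs offline; swap in real fetching logic
def fake_scrape(url):
    if "broken" in url:
        raise ValueError("could not parse listing")
    return {"Title": f"Listing at {url}"}


results, failures = scrape_many(
    [
        "https://example.com/home-1",
        "https://example.com/broken",
        "https://example.com/home-2",
    ],
    fake_scrape,
    delay_seconds=0,
)
print(len(results), "scraped,", len(failures), "failed")
```

Passing the scraping function in as an argument keeps the loop independent of any one site. It does nothing about rate limits, blocking, or a site’s terms of use, which are the problems that grow with scale.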


Disclaimer: The code present in our tutorial is only for learning purposes. We will not be responsible for the way it is used, and there will be no liability from our end for any adverse effect of the source code. The mere presence of this code on our site does not imply that we promote scraping or crawling the websites mentioned in the article. The sole purpose of this tutorial is to showcase the technique of writing web scrapers for leading web portals. We are not obligated to provide any support for the code; however, we encourage you to add your questions and feedback in the comment section so that we can check and respond periodically.
