Scraping Real Estate Data Using Python

August 12, 2019 | Category: Blog
Last Updated on by Tarun

The world’s leading real estate sites are a treasure trove of valuable data. The database of a popular real estate site targeting the US might contain information on more than 100 million homes, including homes for sale, homes for rent, and even homes not currently on the market. Zillow, for example, also provides rent and property value estimates called “Zestimates”. These help owners and customers alike plan better by estimating what a property might be worth in the next one, five, or even ten years.

When it comes to buying or renting properties, the first thing that comes to mind is price comparison. These housing sites let you compare a property’s price with all other listings in the same area, and they show basic information such as the type of house, the number of rooms, the size, and a short description. You can even get a new estimate for a property after a recent change (for example, a swimming pool added in the backyard or a remodeled kitchen). These portals also provide multiple APIs for developer networks.

Zillow has also partnered with Microsoft to provide a bird’s eye view of famous properties. These views consist of pictures taken from airplanes, which are far superior to the ones taken from satellites.

Why crawl data from real estate sites?

The large property listing companies target an entire nation and work on millions of properties. But if you are a real estate agent, or if you are setting up shop and targeting a specific state or region, it is better to crawl the data from a major real estate listing website than to try to gather it yourself.

You can also build Machine Learning models to predict the prices of properties and compare your predictions with Zillow™’s Zestimates™ and see which one is better or closer to real values.
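One simple way to judge which estimate is "closer to real values" is the mean absolute error against actual sale prices. The sketch below assumes you already have three aligned lists of prices. All numbers are made-up placeholders, not real listing data, and the function name is hypothetical.

```python
def mean_absolute_error(estimates, actual_prices):
    """Average absolute gap between estimates and the prices actually paid."""
    if len(estimates) != len(actual_prices) or not actual_prices:
        raise ValueError("inputs must be non-empty and of equal length")
    total_gap = sum(abs(e - a) for e, a in zip(estimates, actual_prices))
    return total_gap / len(actual_prices)


# Placeholder values for illustration only (not real listings).
actual_prices = [300000, 720000, 450000]
model_predictions = [310000, 700000, 455000]
reference_estimates = [295000, 735000, 430000]

model_mae = mean_absolute_error(model_predictions, actual_prices)
reference_mae = mean_absolute_error(reference_estimates, actual_prices)

# The lower error is the estimate that sits closer to the real values.
closer = "model" if model_mae < reference_mae else "reference"
print(f"model MAE: {model_mae:.0f}, reference MAE: {reference_mae:.0f}, closer: {closer}")
```

Mean absolute error is only one possible yardstick; it is used here because it is expressed in dollars and is easy to read.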

How to set things up?

If you have followed any of our previous “How to crawl” articles, you might already have the necessary setup ready on your computer. If you have not, I recommend following this article to set up Python, its packages, and a text editor before you get your hands dirty with the code.

Where is the code?

Without much ado, here is the code for the scraper that will help you extract information from a property listing. It is written in Python. Afterwards, I will show you how to run it and what you get once you do.

[code language=”python”]
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# For ignoring SSL certificate errors

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user

url = input('Enter Zillow House Listing Url- ')

# Making the website believe that you are accessing it using a Mozilla browser

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()  # pass ctx so the SSL settings above take effect

# Creating a BeautifulSoup object of the html page for easy extraction of data.

soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')  # bytes, written to disk at the end
property_json = {}
property_json['Details_Broad'] = {}
property_json['Address'] = {}

# Extract Title of the property listing

for title in soup.find_all('title'):
    property_json['Title'] = title.text.strip()
    break

# Short description from the <meta name="description"> tag

for meta in soup.find_all('meta', attrs={'name': 'description'}):
    property_json['Detail_Short'] = meta['content'].strip()

# Long description, present only on some listings

for div in soup.find_all('div', attrs={'class': 'character-count-truncated'}):
    property_json['Details_Broad']['Description'] = div.text.strip()

# The first JSON-LD script is read for rooms, size and address;
# the second one is read for price and image.

for (i, script) in enumerate(soup.find_all('script',
                             attrs={'type': 'application/ld+json'})):
    if i == 0:
        json_data = json.loads(script.text)
        property_json['Details_Broad']['Number of Rooms'] = json_data['numberOfRooms']
        property_json['Details_Broad']['Floor Size (in sqft)'] = json_data['floorSize']['value']
        property_json['Address']['Street'] = json_data['address']['streetAddress']
        property_json['Address']['Locality'] = json_data['address']['addressLocality']
        property_json['Address']['Region'] = json_data['address']['addressRegion']
        property_json['Address']['Postal Code'] = json_data['address']['postalCode']
    if i == 1:
        json_data = json.loads(script.text)
        property_json['Price in $'] = json_data['offers']['price']
        property_json['Image'] = json_data['image']
        break

# Save the extracted fields as JSON

with open('data.json', 'w') as outfile:
    json.dump(property_json, outfile, indent=4)

# Save the full page so more data points can be extracted later

with open('output_file.html', 'wb') as file:
    file.write(html)

print('----------Extraction of data is complete. Check json file.----------')
[/code]

To run the code given above, save it in a file with the .py extension, such as propertyScraper.py. Once that is done, run the following command from the terminal:

[code language=”python”] python propertyScraper.py
[/code]

When you run it, you will be prompted to enter the URL of a property listing. This is the webpage that will actually be scraped by the program. We have used two links and scraped the data of two properties. Here are the links –

  1. https://www.zillow.com/homedetails/638-Grant-Ave-North-Baldwin-NY-11510/31220792_zpid/
  2. https://www.zillow.com/homedetails/10-Walnut-St-Arlington-MA-02476/56401372_zpid/

The JSON files obtained by running the code on the given links are shown in a later section.
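Since the script always writes to data.json, each run overwrites the previous result. One way around that, sketched below, is to derive a per-listing file name from the URL. This assumes listing URLs keep the number followed by `_zpid` seen in the two links above; the helper names are hypothetical and not part of the scraper.

```python
import re


def zpid_from_url(url):
    """Return the numeric listing id that precedes '_zpid' in the URL, or None."""
    match = re.search(r"/(\d+)_zpid", url)
    return match.group(1) if match else None


def output_name(url):
    """Build a per-listing JSON file name, falling back to data.json."""
    zpid = zpid_from_url(url)
    return f"data_{zpid}.json" if zpid else "data.json"


urls = [
    "https://www.zillow.com/homedetails/638-Grant-Ave-North-Baldwin-NY-11510/31220792_zpid/",
    "https://www.zillow.com/homedetails/10-Walnut-St-Arlington-MA-02476/56401372_zpid/",
]
for url in urls:
    print(output_name(url))
```

The returned name could then be passed to `open()` in place of the fixed 'data.json' string.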

Code explanation

Before going into how the code runs and what it returns, it is important to understand the code itself. As usual, we first hit the given URL and capture the entire HTML, which we convert into a BeautifulSoup object. Once that is done, we extract specific divs, scripts, titles, and other tags with specific attributes. This way we are able to pinpoint the specific information that we want to extract from a page.
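To make the idea of picking tags by their attributes concrete without any network access, here is a small sketch that pulls `application/ld+json` script blocks out of an HTML string. It deliberately uses the standard library's `html.parser` module instead of BeautifulSoup so that it runs with no extra packages. The class name and the sample markup are invented for illustration and are far simpler than a real listing page.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []          # parsed JSON objects, in page order
        self._in_json_ld = False  # True while inside a matching script tag
        self._buffer = []         # text chunks of the current script tag

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


# Invented sample markup, shaped like the fields the scraper reads.
sample_html = """
<html><head><title>Sample listing</title>
<script type="application/ld+json">
{"numberOfRooms": 4, "address": {"addressRegion": "NY"}}
</script>
<script type="application/ld+json">
{"offers": {"price": 299000}}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(sample_html)
print(extractor.blocks[0]["numberOfRooms"], extractor.blocks[1]["offers"]["price"])
```

As in the scraper, the first block carries the room and address details and the second carries the price, so the order of the blocks matters.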

You can see that we have also extracted an image link for each property. This has been done deliberately, since for something like real estate, images are just as valuable as other information. While we have extracted several fields from the real estate listing pages, note that the HTML page contains many more data points. Hence we also save the HTML content locally so that you can go through it and extract more information.
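Listing pages do not always carry every field, so a direct lookup such as `json_data['floorSize']['value']` can raise a `KeyError` on some pages. A minimal defensive sketch, using a hypothetical helper named `dig` that is not part of the scraper, walks nested dictionaries and returns a default when any key is absent.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Invented sample shaped like the JSON-LD the scraper reads; floorSize is absent.
json_data = {"numberOfRooms": 4, "address": {"streetAddress": "638 Grant Ave"}}

street = dig(json_data, "address", "streetAddress")
floor_size = dig(json_data, "floorSize", "value", default="unknown")
print(street, floor_size)
```

Whether a missing field should become a placeholder value or be left out of the output entirely is a design choice that depends on how you plan to use the data.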

Some of the house listings that we scraped

As mentioned before, we scraped a few property listings to show you how the scraped data looks in JSON format. You can tell which property each JSON belongs to from its title and address fields. Now let’s talk about the data points that we scraped.

We got an image of the property (although many images are available for each property on a listing page, we took only one, the top image of each listing). We also got the price (in $) that it is listed at, the title of the property, and a description that helps you form a mental picture of the property.

Along with this, we scraped the address, broken down into four separate parts: the street, the locality, the region, and the postal code. We have another details field that has multiple subfields, such as the number of rooms, the floor size, and a long description. In certain cases, the description is missing as we found out once we scraped multiple pages. 

[code language=”python”]
{
    "Details_Broad": {
        "Number of Rooms": 4,
        "Floor Size (in sqft)": "1,728"
    },
    "Address": {
        "Street": "638 Grant Ave",
        "Locality": "North baldwin",
        "Region": "NY",
        "Postal Code": "11510"
    },
    "Title": "638 Grant Ave, North Baldwin, NY 11510 | MLS #3137924 | Zillow",
    "Detail_Short": "638 Grant Ave , North baldwin, NY 11510-1332 is a single-family home listed for-sale at $299,000. The 1,728 sq. ft. home is a 4 bed, 2.0 bath property. Find 31 photos of the 638 Grant Ave home on Zillow. View more property details, sales history and Zestimate data on Zillow. MLS # 3137924",
    "Price in $": 299000,
    "Image": "https://photos.zillowstatic.com/p_h/ISzz1p7wk4ktye1000000000.jpg"
}
[/code]

[code language=”python”]
{
    "Details_Broad": {
        "Description": "Three dormer single family home situated in Arlington’s Brattle neighborhood between Arlington Heights and Arlington Center. Built in the 1920s this home offers beautiful period details, hard wood floors, beamed ceilings, fireplaced living room with private sunroom, a formal dining room, three large bedrooms, an office and two full baths. The potential of enhancing this property to expand living space and personalize to your personal taste is exceptional. Close to Minuteman Commuter Bikeway, Rt 77 and 79 Bus lines, schools, shopping and restaurants. Virtual staging and virtual renovation photos provided to help you visualize.",
        "Number of Rooms": 4,
        "Floor Size (in sqft)": "2,224"
    },
    "Address": {
        "Street": "10 Walnut St",
        "Locality": "Arlington",
        "Region": "MA",
        "Postal Code": "02476"
    },
    "Title": "10 Walnut St, Arlington, MA 02476 | MLS #72515880 | Zillow",
    "Detail_Short": "10 Walnut St , Arlington, MA 02476-6116 is a single-family home listed for-sale at $725,000. The 2,224 sq. ft. home is a 4 bed, 2.0 bath property. Find 34 photos of the 10 Walnut St home on Zillow. View more property details, sales history and Zestimate data on Zillow. MLS # 72515880",
    "Price in $": 725000,
    "Image": "https://photos.zillowstatic.com/p_h/ISifzwig3xt2re1000000000.jpg"
}
[/code]
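Note that the floor size comes back as a string with a thousands separator, while the price is a number. If you want to compute something like price per square foot, the string has to be converted first. Here is a small sketch using the values from the first JSON above; the helper names are hypothetical.

```python
def parse_sqft(text):
    """Convert a floor-size string such as '1,728' to an integer."""
    return int(str(text).replace(",", "").strip())


def price_per_sqft(record):
    """Price divided by floor size, using the field names the scraper writes."""
    sqft = parse_sqft(record["Details_Broad"]["Floor Size (in sqft)"])
    return record["Price in $"] / sqft


# Only the two fields needed here, copied from the first listing's JSON.
record = {
    "Details_Broad": {"Floor Size (in sqft)": "1,728"},
    "Price in $": 299000,
}
print(round(price_per_sqft(record), 2))
```

In practice the record would be loaded from data.json with `json.load` rather than typed in by hand.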

Scraping real estate data on a large scale

Using code like this, you can crawl details for a few specific real estate properties, and you could manually check up on properties you are interested in from time to time. However, if you are looking to target a specific region in the USA, or internationally, you’d need an expert web scraping service provider to help you gather property listings from a number of websites. PromptCloud, as a leading web scraping provider, believes that web scraping solutions should be hassle-free and should contain only two steps: the client gives the requirement and receives clean data.


Disclaimer: The code present in our tutorial is only for learning purposes. We will not be responsible for the way it is used and there will be no liability from our end for any adverse effect of the source code. The mere presence of this code on our site does not imply that we promote scraping or crawl the websites mentioned in the article. The sole purpose of this tutorial is to showcase the technique of writing web scrapers for leading web portals. We are not obligated to deliver any support for the code, however, we encourage you to add your questions and feedback in the comment section so that we may check and respond at certain intervals.

© Promptcloud 2009-2020 / All rights reserved.