How to Scrape Real Estate Listings from Trulia using Python

September 14, 2018 | Category: Blog

Last Updated on by Prithwi Mondal

Trulia is a real estate website that set up shop in 2005, initially with listings in California. It gives buyers and renters across the United States an overview of both the home and its neighborhood. It offers several features that are usually unavailable on real estate websites, such as detailed pointers about the locality and a map view of the area that marks local places of importance, locations of past crimes, schools, and other public buildings.

Trulia also shows the prices at which a house has previously changed hands, along with an option to compare the price of the house you select with houses that have the same or different features in different ZIP codes across the states. Besides the real estate details, brokers are listed along with their cell numbers, so that it is easy for you to make that dream property yours. Trulia's map of local crimes is built on data collected from SpotCrime.com and CrimeReports.com, which in turn collect and classify data from law enforcement reports and news reports. Every property listed on Trulia has information on nearby amenities, department stores, and local schools.

An interactive map with commuter and transit data shows the driving or commute times to the property from any given location in the United States. The commute times are rendered as a map view to give a realistic picture.

The best features of Trulia are the following:

  1. What Locals Say: A recent Trulia feature that lets home buyers, sellers, and renters see what locals think of their neighborhood. The information is gathered through polls, surveys, and independent reviews.
  2. Trulia Neighborhoods: A recently launched feature that helps people get more information about the area around a property listing on the website. It includes original photography, descriptions, facts about the area, and even drone footage.
  3. Local Legal Protections: A service that provides information on local nondiscrimination laws that apply to housing, employment, and public accommodations. This data is shown beside property listings to make it easier for a more diverse crowd to find adequate accommodation in a comfortable environment.

So why should you be interested in scraping data from Trulia?

Why should you use a scraping solution to get data from Trulia? Well, I definitely hope that you do not want a team of ten people copy-pasting data day and night when a scraping engine can do a year's worth of manual work in a day.

How to get started with scraping Trulia?

For installation and setup, refer to our similar article in which we discussed how to crawl data from a leading travel portal. Once you have installed Python and the other dependencies, along with the code editor Atom, come back to this article and read on.

Where is the code?

In case you are tired of the text, let's go right to the code. Although the code is given below, you can also download it from the link and get down to business. You can run it with the python command, as you might have seen in the other scraping tutorials. Once you have downloaded the program, go to its location in the command prompt and run the command:

python trulia_extractor.py

It will prompt you to enter the link for a Trulia property listing. Once the extraction is complete, a confirmation message is shown, and you can check your folder for the JSON file and the HTML file that were created.

H:\Python_Algorithmic_Problems\Scraping_assignments\Trulia-Data-Extraction>python trulia_extractor.py
Enter Trulia Property Listing Url- https://www.trulia.com/p/ny/brooklyn/327-101st-st-1a-brooklyn-ny-11209--2180131215
----------Extraction of data is complete. Check json file.----------

[code language="python"]

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# For ignoring SSL certificate errors (only sensible for quick experiments)

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user

url = input('Enter Trulia Property Listing Url- ')

# Making the website believe that you are accessing it using a Mozilla browser

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()

# Creating a BeautifulSoup object of the html page for easy extraction of data.

soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
product_json = {}

# This code block will get you a one liner description of the listed property

for meta in soup.find_all('meta', attrs={'name': 'description'}):
    try:
        product_json['description'] = meta['content']
        break
    except KeyError:
        pass

# This code block will get you the link of the listed property

for link in soup.find_all('link', attrs={'rel': 'canonical'}):
    try:
        product_json['link'] = link['href']
        break
    except KeyError:
        pass

# This code block will get you the price and the currency of the listed property

for script in soup.find_all('script', attrs={'type': 'application/ld+json'}):
    # The embedded data is JSON, so json.loads is the right parser for it
    try:
        details_json = json.loads(script.string or '')
    except ValueError:
        continue
    # Skip JSON-LD blocks that do not describe an offer
    if not isinstance(details_json, dict) or 'offers' not in details_json:
        continue
    product_json['price'] = {}
    product_json['price']['amount'] = details_json['offers']['price']
    product_json['price']['currency'] = details_json['offers']['priceCurrency']

# This code block will get you the detailed description of the listed property

for paragraph in soup.find_all('p', attrs={'id': 'propertyDescription'}):
    product_json['broad-description'] = paragraph.text.strip()
product_json['overview'] = []

# This code block will get you the important points regarding the listed property

for divs in soup.find_all('div',
                          attrs={'data-auto-test-id': 'home-details-overview'}):
    for divs_second in divs.find_all('div'):
        for uls in divs_second.find_all('ul'):
            for lis in uls.find_all('li', string=True, recursive=False):
                product_json['overview'].append(lis.text.strip())

# Creates a json file with all the information that you extracted

with open('house_details.json', 'w') as outfile:
    json.dump(product_json, outfile, indent=4)

# Creates an html file in your local folder with the html content of the page you parsed.

with open('output_file.html', 'wb') as file:
    file.write(html)

print('----------Extraction of data is complete. Check json file.----------')
[/code]

If you enter the URL mentioned in the example, you will get this JSON saved in your folder:

Here is the link to download and compare the JSON.

[code language="json"]

{
    "description": "327 101st St #1A, Brooklyn, NY is a 3 bed, 3 bath, 1302 sq ft home in foreclosure. Sign in to Trulia to receive all foreclosure information.",
    "link": "https://www.trulia.com/p/ny/brooklyn/327-101st-st-1a-brooklyn-ny-11209--2180131215",
    "price": {
        "amount": "510000",
        "currency": "USD"
    },
    "broad-description": "Very Large Duplex Unit with 1st floor featuring a Finished Recreational Room, an Entertainment Room and a Half Bathroom. Second Level Features 2 Bedrooms, 2 Full Bathrooms, a Living Room/Dining Room and an Outdoor Space. There is Verrazano Bridge views.\n View our Foreclosure Guides",
    "overview": [
        "Condo",
        "3 Beds",
        "3 Baths",
        "Built in 2006",
        "5 days on Trulia",
        "1,302 sqft",
        "$392/sqft",
        "143 views"
    ]
}
[/code]
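
As a quick sanity check on the extracted values, the sketch below recomputes the price per square foot from a trimmed copy of the JSON above. The helper name `price_per_sqft` is hypothetical, and the code assumes the area always appears in `overview` as an entry that ends in " sqft"; neither is part of the original script.

```python
import json

# Trimmed copy of the JSON shown above (only the fields used here).
sample = json.loads("""
{
    "price": {"amount": "510000", "currency": "USD"},
    "overview": ["Condo", "3 Beds", "3 Baths", "1,302 sqft", "$392/sqft"]
}
""")


def price_per_sqft(listing):
    """Return price / area, or None if no area entry is found.

    Assumes the area appears in 'overview' as an entry like '1,302 sqft'.
    The '$392/sqft' entry ends in '/sqft', so it does not match.
    """
    amount = float(listing["price"]["amount"])
    for entry in listing.get("overview", []):
        if entry.endswith(" sqft"):
            area = float(entry.split()[0].replace(",", ""))
            return amount / area
    return None


print(round(price_per_sqft(sample)))  # prints 392
```

The result agrees with the "$392/sqft" entry that Trulia itself reports for this listing.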

Code explained

In case you want to understand the code, you can read the comments, and to understand how the different modules work, you will need to look up their documentation. The most important part here is bs4, that is, BeautifulSoup. BeautifulSoup exists because a lot of HTML on the internet is not well-formed but is still functional. Browsers render such pages as expected, with only minor, rarely occurring errors, but anyone who tries to parse the same HTML with a strict parser hits a roadblock: errors saying that the HTML is not well-formed. Converting the HTML into a tree or any other data structure fails with the same error. The person then has to sit and clean HTML written by a developer in some other part of the world, which delays the real objective. To make things easier for coders, BeautifulSoup accepts any HTML passed to it and creates a BeautifulSoup object with nodes and attributes that you can traverse very easily, almost like you traverse a tree.
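
To illustrate the point about HTML that is not well-formed, here is a minimal sketch. The markup is made up for this example: the `<b>` tag is never closed and the closing `</body>` and `</html>` tags are missing. It assumes the same `html.parser` backend that the script above uses.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: <b> is never closed, and the
# closing </body> and </html> tags are missing.
broken_html = (
    "<html><body><div class='card'>"
    "<p id='propertyDescription'>Large duplex <b>with views</p>"
    "</div>"
)

soup = BeautifulSoup(broken_html, "html.parser")

# The tree can still be searched as if the markup were clean.
paragraph = soup.find("p", attrs={"id": "propertyDescription"})
print(paragraph.get_text())
```

No exception is raised for the missing tags; the parser closes the open elements on its own and the paragraph text is still reachable.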

For example, when I write the code:

[code language="python"]

for paragraph in soup.find_all('p', attrs={'id': 'propertyDescription'}):
    product_json['broad-description'] = paragraph.text.strip()
[/code]

Here I am extracting the text within a <p> tag whose id is propertyDescription. Simple, isn't it? To understand more, check out the BeautifulSoup website, and explore on your own to extract more data from the HTML file that is also created when you run the program. Here is the link for the HTML generated on running the code with the link provided above.
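
One possible starting point for that exploration is to list what the page's JSON-LD blocks contain. The sketch below is only an illustration: `json_ld_blocks` is a hypothetical helper, and `sample_html` is a tiny stand-in for the saved `output_file.html` so that the example runs on its own.

```python
import json

from bs4 import BeautifulSoup


def json_ld_blocks(html_text):
    """Return every JSON-LD block in the page that parses as a dict."""
    soup = BeautifulSoup(html_text, "html.parser")
    blocks = []
    for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict):
            blocks.append(data)
    return blocks


# Tiny stand-in for output_file.html (assumed structure, not real page data).
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "offers": {"price": "510000", "priceCurrency": "USD"}}
</script>
</head><body><p id="propertyDescription">Very Large Duplex Unit</p></body></html>
"""

for block in json_ld_blocks(sample_html):
    print(sorted(block.keys()))
```

To run it against the real file, read `output_file.html` with `open('output_file.html', encoding='utf-8').read()` and pass the text to the helper. Any key it prints is a candidate for another entry in `product_json`.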

So what data did we get from Trulia?

So what were we able to extract using this simple code? If you look at the JSON closely, you can see that we have extracted quite a bit.

First, we got the description, which is a sort of header that you can use for the listing, then the link in case you need it for any reason, followed by the price broken into the amount and the currency. The broad description is the owner's description, which paints a picture of what the house is like. The overview contains a number of key aspects. Why are they not in a key: value format? Well, that is because no two houses necessarily have the same aspects or things to boast of. That is why this field is a list of important features that a prospective buyer might take an interest in. It may include points such as the number of beds and bathrooms, when the house was built, how long it has been listed on Trulia, the total area, the price per square foot, the number of people who have viewed the listing to date, and more.
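
If you do need a few of these aspects as keys, one option is to pattern-match the entries you care about and leave the rest in a list. The sketch below is an assumption-laden illustration: the function name `summarize_overview` and the four patterns are guesses based on the sample output above, not something Trulia guarantees.

```python
import re


def summarize_overview(overview):
    """Pull a few common fields out of the free-form overview list.

    Entries that match none of the patterns are kept under 'other'.
    """
    patterns = {
        "beds": re.compile(r"^(\d+) Beds?$"),
        "baths": re.compile(r"^(\d+) Baths?$"),
        "year_built": re.compile(r"^Built in (\d{4})$"),
        "sqft": re.compile(r"^([\d,]+) sqft$"),
    }
    summary = {"other": []}
    for entry in overview:
        for key, pattern in patterns.items():
            match = pattern.match(entry)
            if match:
                summary[key] = int(match.group(1).replace(",", ""))
                break
        else:
            # No pattern matched this entry.
            summary["other"].append(entry)
    return summary


overview = ["Condo", "3 Beds", "3 Baths", "Built in 2006",
            "5 days on Trulia", "1,302 sqft", "$392/sqft", "143 views"]
print(summarize_overview(overview))
```

Keeping an `other` bucket means that nothing is lost when a listing boasts of something the patterns do not anticipate.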

Keep in mind that these values change over time. If you run the program on a listing one day, you might not get the same JSON that you got from the same listing the day before.
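
If you save the JSON from each run, a small comparison shows which fields moved between two days. This is a minimal sketch: `changed_fields` is a hypothetical helper, and the "today" values are made up purely for illustration.

```python
def changed_fields(old, new):
    """Return {key: (old_value, new_value)} for top-level keys that differ."""
    keys = set(old) | set(new)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


yesterday = {
    "price": {"amount": "510000", "currency": "USD"},
    "overview": ["Condo", "5 days on Trulia", "143 views"],
}
# Made-up values standing in for a second run on the following day.
today = {
    "price": {"amount": "510000", "currency": "USD"},
    "overview": ["Condo", "6 days on Trulia", "150 views"],
}

print(list(changed_fields(yesterday, today)))  # prints ['overview']
```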

Using this code in business

This Trulia scraping setup can be used in your business in several ways. You can create a CSV of listing links and use an automation script to run the code on each row of the CSV. Better still, you could build a system that grabs all the listings for a location when the location is fed into it, and then runs this code to collect the data for each listing. This can easily be achieved using the expertise of scraping service providers such as PromptCloud. Such a database has many uses. Providing services to people trying to find accommodation is the one that comes to mind first, but there are less conventional uses that can actually make you more money. You could collect data across different locations and build a prediction model that helps people decide how much to sell their house for. To build this model, you would need data such as when a listed house was built, how many times it has changed hands, its area, number of bedrooms, locality, and more. Most of this data is available on Trulia.
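
The CSV-driven run described above could look roughly like the sketch below. It assumes a CSV with a single `url` column and a hypothetical `scrape_listing(url)` function that wraps the extraction code from this article and returns the `product_json` dict. The example URLs are placeholders, and a stand-in scraper is used so that the sketch runs without touching the network.

```python
import csv
import io
import json


def scrape_all(csv_file, scrape_listing):
    """Run scrape_listing(url) for every row of a CSV with a 'url' column."""
    results, failures = [], []
    for row in csv.DictReader(csv_file):
        url = row["url"].strip()
        if not url:
            continue
        try:
            results.append(scrape_listing(url))
        except Exception as error:  # keep going if one listing fails
            failures.append((url, str(error)))
    return results, failures


def fake_scrape(url):
    """Stand-in for the real extraction code; returns a minimal dict."""
    return {"link": url}


# Placeholder links; in practice this would be open('listings.csv', newline='').
listings_csv = io.StringIO(
    "url\n"
    "https://www.trulia.com/p/example-1\n"
    "https://www.trulia.com/p/example-2\n"
)

results, failures = scrape_all(listings_csv, fake_scrape)
print(json.dumps(results, indent=4))
```

Collecting failures separately lets a long run finish even when a few listings have been taken down. In a real run it would also be sensible to pause between requests, for example with `time.sleep`.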

Data is money in today's data-driven economy, and making the most of all the freely available data on the internet can prove very profitable in any avenue of business that you decide to venture into. I will sign off on that note and leave you to ponder it.


Disclaimer: The code provided in this tutorial is only for learning purposes. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code.
