When it comes to online Q&A forums, the first name that comes to mind is Quora. Ever since it opened its doors to the public on June 21st, 2010, it has been everything from the agony aunt to the all-wise and all-knowing uncle for millions across the globe. There are few topics, if any at all, that have not been touched by Quorans in these few years. Whether you are planning a holiday or cannot decide which college to go to, Quora always has answers. But these answers need to be taken with a pinch of salt, since not every user on the platform has the same level of knowledge on the matter at hand.
To tackle this, Quora has upvotes and downvotes, and usually, if you search for a specific question, the answer with the most upvotes comes up at the top: a democratic system that works better than most real-life democracies.
Now when it comes to searching for answers, most people go to Quora, be it on the website or the app, and type their questions. However, you can write your own web scraper to scrape data from Quora by entering questions, and this is what we will be showing you today.
Where Is The Code?
Like a lot of other DIY web scraping articles that we have published before, we shall be using Python 3.7 and the BeautifulSoup library to scrape our data and save it in a JSON file. The only other thing that you will need is a decent text editor. We have used PyCharm, which is a full-blown IDE, but you can also use Atom, since it comes with multiple plugins and is more lightweight.
To start with the code, we import the libraries that we will need, both internal and external. Next, we set the SSL context's verify mode to “CERT_NONE” and its check_hostname flag to False, to avoid SSL certificate errors when we start scraping data. Once this is done, our setup is complete, and we can accept a question from the user. For this demo, we supplied the question “Should I move to London?” when prompted.
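The SSL setup described above can be sketched roughly as follows. This is a minimal sketch, not the original code of this article; the helper name `make_unverified_context` is hypothetical. Note that `check_hostname` has to be switched off before `verify_mode` is set to `CERT_NONE`, otherwise Python raises a `ValueError`.

```python
import ssl


def make_unverified_context() -> ssl.SSLContext:
    """Build an SSL context that skips certificate verification.

    Hypothetical helper for illustration. It disables certificate
    checks, which is only reasonable for a quick local experiment.
    """
    ctx = ssl.create_default_context()
    # Order matters: hostname checking must be disabled first.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


ctx = make_unverified_context()

# In an interactive run the question would come from the user, e.g.:
#   question = input("Enter your question: ")
question = "Should I move to London?"
```

The resulting `ctx` can later be passed to `urllib.request.urlopen` through its `context` argument.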
We create the Quora URL using this question. A simple string manipulation converts the question to this URL: https://www.quora.com/Should-I-move-to-London?
This string manipulation is required since Quora formats its URLs in this manner.
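As a rough illustration, the conversion could look like the sketch below. The function name `build_quora_url` is hypothetical, and the rule it applies (replace each run of whitespace with a hyphen and keep everything else unchanged) is an assumption inferred from the example URL above.

```python
QUORA_BASE = "https://www.quora.com/"


def build_quora_url(question: str) -> str:
    """Turn a typed question into a Quora-style URL.

    Assumed rule: words are joined with hyphens; punctuation such as
    the trailing question mark is left untouched.
    """
    slug = "-".join(question.split())
    return QUORA_BASE + slug


url = build_quora_url("Should I move to London?")
print(url)
```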
Once we have created the URL, we use the built-in Request class from urllib to request the webpage, and we make sure to set a Firefox User-Agent in the header so that the website cannot easily tell that we are accessing it from a piece of code. This part is important, since most websites block scrapers: if you omit the header, your IP will likely be blocked, and further action may be taken against you.
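A minimal sketch of this step is shown below. The User-Agent string is only an example of a Firefox-style value, and the helper names `build_request` and `fetch_html` are hypothetical. The fetch function is defined but not called here, so the snippet runs without touching the network.

```python
from urllib.request import Request, urlopen

# Example Firefox-style User-Agent; any current Firefox string would do.
FIREFOX_UA = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0"
)


def build_request(url: str) -> Request:
    """Create a request that identifies itself as a Firefox browser."""
    return Request(url, headers={"User-Agent": FIREFOX_UA})


def fetch_html(url: str, context=None) -> str:
    """Download the page and return its HTML as text.

    `context` is an optional ssl.SSLContext, such as the unverified
    context described earlier.
    """
    with urlopen(build_request(url), context=context) as response:
        return response.read().decode("utf-8", errors="replace")


request = build_request("https://www.quora.com/Should-I-move-to-London?")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```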
After we have obtained the webpage in HTML format and stored it in a variable, we need to convert it to a BeautifulSoup object so that it is easier to parse and extract data from. We then extract the question from the first “title” tag on the page. We need to remove “ – Quora” from it, since every title ends with this string. Scraping the answers is slightly more complicated. You need to extract the JSON stored in the “script” element whose “type” attribute is “application/ld+json”. Once you have obtained this JSON, you will find a list of answers with multiple fields. While several fields are given for each answer, we have extracted the most important ones:
- The date on which the answer was written.
- The answer itself.
- The number of upvotes that it received.
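The extraction steps above can be sketched as follows. To keep the sketch runnable with the standard library alone, it swaps BeautifulSoup for Python's built-in `HTMLParser`; with BeautifulSoup the same two lookups would typically be `soup.find("title")` and `soup.find("script", {"type": "application/ld+json"})`. The JSON field names (`mainEntity`, `suggestedAnswer`, `acceptedAnswer`, `text`, `upvoteCount`, `dateCreated`) are an assumption based on the schema.org question-and-answer vocabulary, the output keys are hypothetical, and the sample HTML is a made-up placeholder rather than real Quora data.

```python
import json
from html.parser import HTMLParser


class TitleAndJsonLdParser(HTMLParser):
    """Collect the first <title> text and the first JSON-LD <script> body."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.json_ld = None
        self._capture = None  # name of the field currently being read
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        is_json_ld = dict(attrs).get("type") == "application/ld+json"
        if tag == "title" and self.title is None:
            self._capture, self._chunks = "title", []
        elif tag == "script" and is_json_ld and self.json_ld is None:
            self._capture, self._chunks = "json_ld", []

    def handle_data(self, data):
        if self._capture:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._capture == "title" and tag == "title":
            self.title, self._capture = "".join(self._chunks), None
        elif self._capture == "json_ld" and tag == "script":
            self.json_ld, self._capture = "".join(self._chunks), None


def clean_title(title, suffixes=(" – Quora", " - Quora")):
    """Strip the trailing site name (the hyphen variant is an assumption)."""
    for suffix in suffixes:
        if title.endswith(suffix):
            return title[: -len(suffix)]
    return title


def _as_list(value):
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def extract_answers(json_ld_text):
    """Keep only the date, text and upvote count of each answer."""
    entity = json.loads(json_ld_text).get("mainEntity", {})
    raw = _as_list(entity.get("acceptedAnswer")) + _as_list(entity.get("suggestedAnswer"))
    return [
        {"date": a.get("dateCreated"), "answer": a.get("text"), "upvotes": a.get("upvoteCount")}
        for a in raw
    ]


# Placeholder page, not real scraped content.
SAMPLE_HTML = """<html><head><title>Should I move to London? – Quora</title>
<script type="application/ld+json">
{"@type": "QAPage", "mainEntity": {"@type": "Question", "suggestedAnswer": [
  {"@type": "Answer", "text": "placeholder answer text",
   "upvoteCount": 3, "dateCreated": "2020-01-01T00:00:00Z"}
]}}
</script></head><body></body></html>"""

parser = TitleAndJsonLdParser()
parser.feed(SAMPLE_HTML)
result = {"question": clean_title(parser.title), "answers": extract_answers(parser.json_ld)}
print(result)
```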
Once the data extraction is completed, we can append it to a list of answers and save the final list in a JSON file.
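Saving the result might look like the sketch below. The key names (`question`, `answers`, `date`, `answer`, `upvotes`), the file name, and the helper `save_answers` are all hypothetical, and the single answer is a placeholder value, not scraped data.

```python
import json
import os
import tempfile


def save_answers(question, answers, path):
    """Write the question and its list of answers to a JSON file."""
    payload = {"question": question, "answers": answers}
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-English answers readable in the file.
        json.dump(payload, fh, indent=2, ensure_ascii=False)


answers = []
answers.append({"date": "2020-01-01", "answer": "placeholder answer text", "upvotes": 0})

# A temporary directory keeps the demo from cluttering the working folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "quora_answers.json")
    save_answers("Should I move to London?", answers, out_path)
    with open(out_path, encoding="utf-8") as fh:
        saved = json.load(fh)

print(saved["question"], len(saved["answers"]))
```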
Understanding The Output:
The JSON file given below contains some of the answers that were scraped from the HTML page when we ran the code with the question mentioned in the last section. As you can see, the JSON has two fields: the question and the answers. Each answer consists of the three parameters that we mentioned earlier. Many answers were scraped for this particular question, but we have only shown a few of them below. Feel free to run the code yourself and check all the answers to this question, or any other.
The Limitations Of Scraping Content From Quora:
While this might look like a perfect solution for finding the answers to any question on Quora, like every other piece of DIY code, it comes with multiple limitations. One important aspect is that not every question you type will exist on Quora. Unless you put a proper try-except block around the code that fetches the answers, your code will break every time you type a question that does not exist. You might also need to type your question multiple times to find which version exists, at which point you would rather just look it up on Quora. A better implementation would be to find the existing question that most closely matches the one you entered.
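Both ideas can be sketched in a few lines. The first function wraps the fetch in a try-except block so a missing page returns `None` instead of crashing; that a non-existent question produces an HTTP error such as 404 is an assumption. The second uses the standard `difflib` module to pick the closest match from a list of known questions; where that list would come from is left open, and the candidates here are made up. The helper names and the simulated openers are hypothetical and only exist so the sketch runs offline.

```python
import difflib
import io
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def safe_fetch(url, opener=urlopen):
    """Return the page HTML, or None when the page cannot be fetched."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(request) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:  # must come first: HTTPError subclasses URLError
        print(f"Request failed with HTTP {err.code}: {url}")
    except URLError as err:
        print(f"Could not reach the server: {err.reason}")
    return None


def closest_question(typed, known_questions, cutoff=0.6):
    """Return the known question most similar to the typed one, if any."""
    matches = difflib.get_close_matches(typed, known_questions, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Simulated openers standing in for the network.
def _missing_page(request):
    raise HTTPError(request.full_url, 404, "Not Found", None, None)


def _existing_page(request):
    return io.BytesIO(b"<html><title>ok</title></html>")


missing = safe_fetch("https://www.quora.com/No-such-question", opener=_missing_page)
found = safe_fetch("https://www.quora.com/Should-I-move-to-London?", opener=_existing_page)
best = closest_question(
    "Should I move to London",
    ["Should I move to London?", "Is London expensive?"],
)
```

Calling `safe_fetch(url)` without the `opener` argument uses the real `urlopen`.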
Another aspect to consider is the ethics of scraping data and how you choose to use it. You need to make sure that you go through the site's robots.txt file, and scrape and use data accordingly. Any commercial use of this code can lead to legal issues, and using the collected data for anything other than research purposes may also cause problems.
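One way to check robots.txt rules from code is Python's standard `urllib.robotparser` module, sketched below. The rules in `SAMPLE_ROBOTS` are an invented example and not Quora's actual robots.txt, and the helper name `is_allowed` and the bot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Check whether the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(SAMPLE_ROBOTS, "my-research-bot", "https://www.quora.com/Should-I-move-to-London?")
blocked = is_allowed(SAMPLE_ROBOTS, "my-research-bot", "https://www.quora.com/private/page")
print(allowed, blocked)
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing a string.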