Did you know that there are 12 factors to be considered while acquiring data from the web? If no, fret not! Download our free guide on web data acquisition to get started!
Scrape Content From Quora
When it comes to searching for answers, most people go to Quora, be it using their website, or the app, and type their questions. However, you can write your web-scraper to scrape Quora using Python by entering questions, and this is what we will be showing you today.
Like a lot of other DIY web scraping articles that we have published before, we shall be using Python3.7 and the BeautifulSoup library to crawl Quora data and save it in a JSON file. Using this code you would be able to scrape and extract Quora answers and questions easily. The only other thing that you will be needing is a decent text editor. We have used PyCharm, which is a full-blown IDE, but you can also use Atom since it comes with multiple plugins and is more lightweight.
So to start with the code, we begin by importing the libraries that we will be needing, both internal and external. Once done, we need to make sure that we set the SSL certificate’s verify-mode to “CERT_NONE”, and check hostname to False, to avoid getting SSL certificate errors when we start scraping data. Once this is done, our setup is complete, and we can accept a question from the user. For this demo, we supplied the following value when this question was asked.
We create the Quora URL using this question. Does a simple string manipulation convert the question to this URL
Once we have created the URL, we use the inbuilt Request function from urllib to hit the webpage and make sure that we add Firefox in the header, so that the website is not able to track that we are accessing it from a piece of code. This part is important since most websites block scrapers and if you miss the header. Your IP will likely be blocked, and further actions can be initiated against you.
After we have obtained the webpage in HTML format and stored it in a variable. We need to convert it to a BeautifulSoup object so that it is easier to parse and extract data from. Then extract the question on the webpage from the first “title” tag on the page. We need to remove “ – Quora” from it since all titles come with the following string. Scraping the answer is slightly more complicated. You need to extract the JSON stored in the element of type “script” having the value for “type” as “application/ld+json”. Once you have obtained this JSON, you shall find a list of answers with multiple fields. While few fields are given for each answer. We have extracted the most important ones-
Once the data extraction is completed, we can append it to a list of answers and save the final list in a JSON file.
The JSON file given below contains some of the answers that were scraped from the HTML page when we ran the code with the question mentioned in the last section. As you can see, the JSON has two fields, the question, and the answers. Each answer consists of the three parameters that we mentioned earlier. While the number of answers scraped for this particular question were many. We have only shown a few of them below. Feel free to run the code yourself and check all the answers to this question, or any other.
While this might look like a perfect solution to finding the answers to any question on Quora. Like every other piece of DIY code, it comes with multiple limitations. One important aspect is that not every question you type will exist in Quora. You will have your code break every time you type a question that does not exist. At the same time, you might need to type your question multiple times to find which version exists. A better implementation would be to find the question that matches the one you entered closest.
Another aspect to consider is one related to the qualms of scraping Quora data and how you choose to use it. You need to make sure that you go through the robot.txt file and scrape data, and use it accordingly. Any commercial use of this code can lead you to legal issues. And using the data collected for anything other than research purposes may also cause problems.
Your email address will not be published. Required fields are marked *
Save my name, email, and website in this browser for the next time I comment.