Abhisek Roy

If the town is talking about a video, it has to be on YouTube; if the town is going crazy over an article, it has to be on Facebook; and if the whole town is discussing the latest photos, they just have to be on Instagram. Here is how to scrape Instagram data.

Founded in 2010, Instagram has become the largest photo-sharing app worldwide, and its website regularly features among the top 15 sites in the Alexa rankings. From its early days, when only square (1×1) pictures could be uploaded, Instagram (lovingly called ‘insta’ by most) has come a long way. You can upload almost any media on Instagram these days: sets of photos, videos, dubsmashes, and more. However, the most popular things on Instagram are celebrity profiles and hashtags. In fact, Wikipedia lists the top Instagram profiles, the ones with the most followers around the globe, and the figures run to millions of followers checking out their posts every single day. Hashtags are another important feature of Instagram. They group similar images together, so anyone who searches for a hashtag on Instagram is bound to find images with a visual or literal connection to it. A simple example: “#hot” might lead you to images of things burning and fire, as well as images of people who think they are attractive.

How to Scrape Instagram Data

Today we will look into scraping data from Instagram, more specifically from Instagram pages. First, we will scrape the number of followers, the number of accounts followed, and the number of posts for some top Instagram profiles. We will use a text file containing links to the profiles. You can add more profiles to the list, and the program will print the details for the new links as well.

Second, we will scrape the links of images that carry a given hashtag, which you provide, and save them to a text file. Here too we will keep a list of words in a text file and read it in Python. For each word, we will form the hashtag #word, search for it on Instagram, and collect the links of the images associated with that hashtag.

Both activities have certain limitations: you cannot get details of private profiles, and only a certain number of links will be downloaded, because Instagram caps how many it returns.

As usual, before we begin, I will once again request you to check out this link to set up Python and install a text editor. I prefer Atom, and if you follow the link, you will find the download link and installation steps for it. Also install the Python packages mentioned there using pip. In case I ask you to install some more packages, those can be installed on the go.

Extracting Instagram Data of top Instagrammers

Getting down to the code, here is a link for the given program. What we have basically done is read the links from the given text file, create a BeautifulSoup object from the HTML fetched for each link, and extract some specific information from the page. Here is a link to the text file that we have used. These are some of the top users as listed on Wikipedia. I copied the links manually from Wikipedia; that could have been done through scraping as well, but I will leave that for another DIY article.

[code language="python"]
#!/usr/bin/python
# -*- coding: utf-8 -*-
import ssl
import urllib.request

from bs4 import BeautifulSoup


class Insta_Info_Scraper:

    def getinfo(self, url):
        # Download the profile page and parse its HTML.
        html = urllib.request.urlopen(url, context=self.ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        # The og:description meta tag holds the follower, following and post counts.
        data = soup.find_all('meta', attrs={'property': 'og:description'})
        text = data[0].get('content').split()
        # The last three words are taken as the user's name and handle.
        user = '%s %s %s' % (text[-3], text[-2], text[-1])
        followers = text[0]
        following = text[2]
        posts = text[4]
        print('User:', user)
        print('Followers:', followers)
        print('Following:', following)
        print('Posts:', posts)
        print('---------------------------')

    def main(self):
        # Note: this SSL context skips certificate verification.
        self.ctx = ssl.create_default_context()
        self.ctx.check_hostname = False
        self.ctx.verify_mode = ssl.CERT_NONE

        # users.txt holds one profile URL per line.
        with open('users.txt') as f:
            self.content = f.readlines()
        self.content = [x.strip() for x in self.content]
        for url in self.content:
            self.getinfo(url)


if __name__ == '__main__':
    obj = Insta_Info_Scraper()
    obj.main()
[/code]
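The index arithmetic in getinfo depends on the exact wording of the og:description text. As a minimal sketch, assuming the content looks like the sample string below (an assumption inferred from the indices the script uses and the output it prints, not a format Instagram documents), a regular expression makes that dependence explicit and returns None instead of crashing when the wording changes. The names `parse_description`, `PATTERN` and `SAMPLE` are hypothetical.

```python
import re

# Assumed shape of the og:description content; Instagram may change it at any time.
SAMPLE = ("144.1m Followers, 49 Following, 1,468 Posts - "
          "See Instagram photos and videos from Selena Gomez (@selenagomez)")

PATTERN = re.compile(
    r"^(?P<followers>\S+) Followers, "
    r"(?P<following>\S+) Following, "
    r"(?P<posts>\S+) Posts - .* from (?P<user>.+)$"
)


def parse_description(content):
    """Return a dict of profile fields, or None if the text does not match."""
    match = PATTERN.match(content.strip())
    if match is None:
        return None
    return {
        "User": match.group("user"),
        "Followers": match.group("followers"),
        "Following": match.group("following"),
        "Posts": match.group("posts"),
    }


if __name__ == "__main__":
    print(parse_description(SAMPLE))
```

Unlike taking the last three words, this approach does not assume that every display name has exactly two words before the handle.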

Keep the program and the text file in the same folder and run this command:

python insta_info_scraper.py

This is what will come up in the command prompt:

[code language="python"]

H:\Python_Algorithmic_Problems\Scraping_assignments\Instagram-Data-Extractor>python insta_info_scraper.py
User: Selena Gomez (@selenagomez)
Followers: 144.1m
Following: 49
Posts: 1,468
---------------------------
User: Cristiano Ronaldo (@cristiano)
Followers: 143.1m
Following: 416
Posts: 2,366
---------------------------
User: Ariana Grande (@arianagrande)
Followers: 130.5m
Following: 1,348
Posts: 3,669
---------------------------
User: Taylor Swift (@taylorswift)
Followers: 112.1m
Following: 0
Posts: 233
---------------------------
User: 👻 neymarjr (@neymarjr)
Followers: 103.4m
Following: 817
Posts: 4,263
---------------------------
User: Justin Bieber (@justinbieber)
Followers: 102.5m
Following: 92
Posts: 4,367
---------------------------
[/code]

Although this is command prompt output, you could save the details to a text file using:

python insta_info_scraper.py > info.txt

This sends the output to a text file. Better still, you could save it all to a JSON file. The following version of the program does that:

[code language="python"]

#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import ssl
import urllib.request

from bs4 import BeautifulSoup


class Insta_Info_Scraper:

    def getinfo(self, url):
        # Download the profile page and parse its HTML.
        html = urllib.request.urlopen(url, context=self.ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        # The og:description meta tag holds the follower, following and post counts.
        data = soup.find_all('meta', attrs={'property': 'og:description'})
        text = data[0].get('content').split()
        # The last three words are taken as the user's name and handle.
        user = '%s %s %s' % (text[-3], text[-2], text[-1])
        followers = text[0]
        following = text[2]
        posts = text[4]
        # Collect the fields in a dictionary instead of printing them.
        info = {}
        info["User"] = user
        info["Followers"] = followers
        info["Following"] = following
        info["Posts"] = posts
        self.info_arr.append(info)

    def main(self):
        # Note: this SSL context skips certificate verification.
        self.ctx = ssl.create_default_context()
        self.ctx.check_hostname = False
        self.ctx.verify_mode = ssl.CERT_NONE
        self.info_arr = []

        # users.txt holds one profile URL per line.
        with open('users.txt') as f:
            self.content = f.readlines()
        self.content = [x.strip() for x in self.content]
        for url in self.content:
            self.getinfo(url)
        # Write all collected profiles to a single JSON file.
        with open('info.json', 'w') as outfile:
            json.dump(self.info_arr, outfile, indent=4)
        print("Json file containing required info is created…………")


if __name__ == '__main__':
    obj = Insta_Info_Scraper()
    obj.main()
[/code]

When you run this code, you will see only a single statement printed:

[code language="python"]

H:\Python_Algorithmic_Problems\Scraping_assignments\Instagram-Data-Extractor>python insta_info_scraper_json_format.py
Json file containing required info is created…………
[/code]

At the same time, a JSON file called info.json will be created in your folder.

The JSON will look like this:

[code language="python"]

[
    {
        "User": "Selena Gomez (@selenagomez)",
        "Followers": "144.1m",
        "Following": "49",
        "Posts": "1,468"
    },
    {
        "User": "Cristiano Ronaldo (@cristiano)",
        "Followers": "143.1m",
        "Following": "416",
        "Posts": "2,366"
    },
    {
        "User": "Ariana Grande (@arianagrande)",
        "Followers": "130.5m",
        "Following": "1,348",
        "Posts": "3,669"
    },
    {
        "User": "Taylor Swift (@taylorswift)",
        "Followers": "112.1m",
        "Following": "0",
        "Posts": "233"
    },
    {
        "User": "\ud83d\udc7b neymarjr (@neymarjr)",
        "Followers": "103.4m",
        "Following": "817",
        "Posts": "4,263"
    },
    {
        "User": "Justin Bieber (@justinbieber)",
        "Followers": "102.5m",
        "Following": "92",
        "Posts": "4,367"
    }
]
[/code]
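The neymarjr entry stores the emoji in the profile name as an escape sequence rather than as the character itself. That comes from the json module's default ensure_ascii=True setting. The short sketch below, which uses a hand-written `profile` dictionary as a stand-in for scraped data, shows both forms and confirms that they decode to the same string.

```python
import json

# Stand-in for one scraped record; the name starts with the ghost emoji U+1F47B.
profile = {"User": "\U0001F47B neymarjr (@neymarjr)"}

escaped = json.dumps(profile)                       # default: ensure_ascii=True
readable = json.dumps(profile, ensure_ascii=False)  # keeps the emoji as a character

print(escaped)

# Both forms decode back to the same Python dictionary.
print(json.loads(escaped) == json.loads(readable) == profile)
```

If you prefer the readable form in info.json, pass ensure_ascii=False to json.dump and open the output file with encoding='utf-8'.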

You can see that the data is the same, but it is in a more manageable and usable format. All that we did was store the extracted data in a different format. But imagine you are writing a program or creating an app that uses data extracted from a thousand Instagram profiles: which will be easier to use in your code, a JSON file or a text file? Data cleaning is often overlooked, but it is a very important step. In fact, many data scientists have confessed that although they do work on complex models and simulations, data extraction, cleaning, and rearrangement make up almost seventy percent of their work. By now you have seen how to scrape Instagram profile data.
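As one small cleaning step, the counts in the JSON are still strings such as "144.1m" and "1,468", which cannot be sorted or summed numerically. The sketch below converts them to integers. The helper name `parse_count` and the "k" and "b" suffixes are assumptions of mine; only the "m" suffix appears in the output above.

```python
from decimal import Decimal

# Assumed suffixes; only "m" is confirmed by the scraped output.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text):
    """Convert strings such as '144.1m' or '1,468' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned and cleaned[-1] in SUFFIXES:
        multiplier = SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal avoids floating-point rounding surprises on values like 144.1.
    return int(Decimal(cleaned) * multiplier)


records = [
    {"User": "Taylor Swift (@taylorswift)", "Followers": "112.1m",
     "Following": "0", "Posts": "233"},
    {"User": "Selena Gomez (@selenagomez)", "Followers": "144.1m",
     "Following": "49", "Posts": "1,468"},
]

for record in records:
    for key in ("Followers", "Following", "Posts"):
        record[key] = parse_count(record[key])

# Sorting by follower count now works numerically.
records.sort(key=lambda r: r["Followers"], reverse=True)
print(records[0]["User"], records[0]["Followers"])
```

Keep in mind that the abbreviated figures are already rounded by Instagram, so the integers are approximations.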

Extracting Image Links for Given Hashtags

Images, rather than info from celebrity accounts, are the main resource on Instagram. And what better way to gather images for clustering or for building a machine learning model than to use Instagram hashtags and collect the images that carry a particular hashtag?

Here is the code:

[code language="python"]

#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import ssl
import urllib.request

from bs4 import BeautifulSoup


class Insta_Image_Links_Scraper:

    def getlinks(self, hashtag, url):
        html = urllib.request.urlopen(url, context=self.ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        # Find the script tag that assigns the page data to window._sharedData.
        # The "t and" guard skips script tags that have no text.
        script = soup.find('script', string=lambda t:
                           t and t.startswith('window._sharedData'))
        # Keep only the JSON that follows "window._sharedData = ".
        page_json = script.text.split(' = ', 1)[1].rstrip(';')
        data = json.loads(page_json)
        print('Scraping links with #' + hashtag + '………..')
        # Append one thumbnail link per line to <hashtag>.txt.
        with open(hashtag + '.txt', 'a') as hs:
            for post in data['entry_data']['TagPage'][0]['graphql'][
                    'hashtag']['edge_hashtag_to_media']['edges']:
                image_src = post['node']['thumbnail_resources'][1]['src']
                hs.write(image_src + '\n')

    def main(self):
        # Note: this SSL context skips certificate verification.
        self.ctx = ssl.create_default_context()
        self.ctx.check_hostname = False
        self.ctx.verify_mode = ssl.CERT_NONE

        # hashtag_list.txt holds one word per line.
        with open('hashtag_list.txt') as f:
            self.content = f.readlines()
        self.content = [x.strip() for x in self.content]
        for hashtag in self.content:
            self.getlinks(hashtag,
                          'https://www.instagram.com/explore/tags/'
                          + hashtag + '/')


if __name__ == '__main__':
    obj = Insta_Image_Links_Scraper()
    obj.main()
[/code]
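The long chain of keys in getlinks raises an exception as soon as one level of the page data is missing. As a hedged sketch, the nested lookup can be isolated in a function and tried offline on a hand-made dictionary. The helper `extract_image_links`, its `thumbnail_index` parameter and the `sample` data with placeholder URLs are all hypothetical; only the key names come from the program above.

```python
def extract_image_links(data, thumbnail_index=1):
    """Collect thumbnail URLs from a parsed window._sharedData dictionary."""
    try:
        edges = (data["entry_data"]["TagPage"][0]["graphql"]["hashtag"]
                 ["edge_hashtag_to_media"]["edges"])
    except (KeyError, IndexError, TypeError):
        # The layout changed or the page is not a hashtag page.
        return []

    links = []
    for post in edges:
        thumbnails = post.get("node", {}).get("thumbnail_resources", [])
        # Skip posts that do not have a thumbnail at the requested position.
        if len(thumbnails) > thumbnail_index:
            links.append(thumbnails[thumbnail_index]["src"])
    return links


# Hand-made sample with the same nesting; the URLs are placeholders.
sample = {"entry_data": {"TagPage": [{"graphql": {"hashtag": {
    "edge_hashtag_to_media": {"edges": [
        {"node": {"thumbnail_resources": [
            {"src": "https://example.com/a_150.jpg"},
            {"src": "https://example.com/a_240.jpg"}]}},
        {"node": {"thumbnail_resources": []}},
    ]}}}}]}}

print(extract_image_links(sample))
print(extract_image_links({}))
```

Returning an empty list keeps a batch run going when one hashtag page has an unexpected structure.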

For the program, we used a text file with a few hashtags chosen at random: mustang, nature, nike, football and fifa. You can change the words in the text file or add new ones; they will be used to form hashtags, and the image links for them will be downloaded as well.
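Since the word list is edited by hand, it can pick up blank lines, stray spaces, or a leading "#". The sketch below shows one way to normalise the words before building the explore URLs. The function `build_tag_urls` is a hypothetical helper, and lowercasing the words is my own choice so that the output file names stay consistent.

```python
from urllib.parse import quote

BASE_URL = "https://www.instagram.com/explore/tags/"


def build_tag_urls(words):
    """Map each non-empty word to a (hashtag, explore-page URL) pair."""
    urls = []
    for word in words:
        tag = word.strip().lstrip("#").lower()
        if not tag:
            continue  # skip blank lines
        # quote() percent-encodes characters that are not URL-safe.
        urls.append((tag, BASE_URL + quote(tag) + "/"))
    return urls


words = ["mustang", "#Nature", "", "  nike \n"]
for tag, url in build_tag_urls(words):
    print(tag, url)
```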

When you run the given program, this is what you will see in the command prompt:

[code language="python"]
H:\Python_Algorithmic_Problems\Scraping_assignments\Instagram-Data-Extractor>python insta_image_link_scraper.py
Scraping links with #mustang………..
Scraping links with #nature………..
Scraping links with #nike………..
Scraping links with #football………..
Scraping links with #fifa………..
[/code]

At the same time, you will see five files created in your current folder:

mustang.txt, nature.txt, nike.txt, football.txt, fifa.txt

I have not uploaded the files that were generated for me, for security reasons. In any case, when you run the program, you will get a different set of links: the latest ones related to each hashtag.

Why Make the Effort?

Now you might be thinking: why make the effort, why not just download a few images from Google whenever I need them? Well, it depends on what you are trying to achieve. If you need a handful of images for an article, you can go with Google Images. But if you are trying to get all the images related to an event, a person, or something else, you can use the hashtag to get images from Instagram. This way you will get a huge set of images. You can then write a program to remove identical and similar images, separate images with text from images without text, and use both types appropriately for your analysis.
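As a rough illustration of the first clean-up step, the sketch below drops exact duplicates by hashing the image bytes. It is only a starting point: the function `drop_exact_duplicates` is a hypothetical helper, the byte strings are stand-ins for downloaded files, and images that are merely similar (resized or re-encoded copies) will not be caught this way.

```python
import hashlib


def drop_exact_duplicates(images):
    """Return the names of images whose bytes have not been seen before."""
    seen = set()
    unique = []
    for name, payload in images:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen:
            continue  # same bytes as an earlier image
        seen.add(digest)
        unique.append(name)
    return unique


# Stand-in byte strings; real use would read the downloaded image files.
images = [
    ("a.jpg", b"\xff\xd8first"),
    ("b.jpg", b"\xff\xd8second"),
    ("c.jpg", b"\xff\xd8first"),
]
print(drop_exact_duplicates(images))
```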

Starting with SIFT and SURF (two well-known feature-detection algorithms), image recognition and image processing techniques have evolved a great deal. From speeding cars to motion-sensing cameras, many systems use images to make inferences. With Instagram’s almost unlimited supply of images, your research project could go a long way.

How to Integrate Scraping Systems With Your Business?

Scraping data from a website like Instagram to your exact requirements can be a tiresome task and can take your mind off more important things, such as your business. Without experienced R and Python developers, it might take you months to build a complete system that gets you the data you need from Instagram, since the website has high security and keeps changing its layout. For complete solutions, you could instead contact an experienced web scraping team such as PromptCloud. All you need to do is give us your requirements, and you will be handed the data in a format that fits your business. This saves you time, energy, money, and manpower, and leaves you free to take up harder business challenges while we worry about the tech.


Disclaimer: The code provided in this tutorial is only for learning purposes. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code.
