If so, you're already familiar with web scraping.
But, while this can certainly be useful, there's much more to web scraping than grabbing a few title tags: it can be used to extract almost any data from almost any web page in seconds.
The question is: what data would you need to extract and why?
In this post, I'll aim to answer these questions by showing you 6 web scraping hacks:
- How to find content "evangelists" in website comments
- How to collect prospects' data from "expert roundups"
- How to remove junk "guest post" prospects
- How to analyze performance of your blog categories
- How to choose the right content for Reddit
- How to build relationships with those who love your content
I've also automated as much of the process as possible to make things less daunting for those new to web scraping.
But first, let's talk a bit more about web scraping and how it works.
A basic introduction to web scraping
Let's assume that you want to extract the titles from your competitors' 50 most recent blog posts.
You could visit each website individually, check the HTML, locate the title tag, then copy/paste that data to wherever you needed it (e.g. a spreadsheet).
But, this would be very time-consuming and boring.
That's why it's much easier to scrape the data we want using a computer application (i.e. a web scraper).
In general, there are two ways to "scrape" the data you're looking for:
- Using a path-based system (e.g. XPath/CSS selectors);
- Using a search pattern (e.g. Regex)
XPath/CSS (i.e. a path-based system) is the best way to scrape most types of data.
For example, let's assume that we wanted to scrape the h1 tag from a simple document like the one below (the heading text and fruit names are just placeholders):
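```html
<html>
  <body>
    <h1>This is a heading</h1>
    <ul class="fruit">
      <li>Apple</li>
      <li>Orange</li>
      <li>Banana</li>
    </ul>
    <ul>
      <li>This is the first item in the list</li>
      <li>This is the second item in the list</li>
      <li>This is the third item in the list</li>
    </ul>
  </body>
</html>
```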
We can see that the h1 is nested in the body tag, which is nested under the html tag. Here's how to write this as XPath/CSS:
- XPath: /html/body/h1
- CSS selector: html > body > h1
But what if we wanted to scrape the list of fruit instead?
You might guess something like: //ul/li (XPath), or ul > li (CSS), right?
Sure, this would work. But because there are actually two unordered lists (ul) in the document, this would scrape both the list of fruit AND all list items in the second list.
However, we can reference the class of the ul to grab only what we want:
- XPath: //ul[@class="fruit"]/li
- CSS selector: ul.fruit > li
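If you'd like to see these queries in action outside of a browser, here's a minimal sketch in Python using the lxml library, run against the example document above (the CSS line needs the cssselect package installed):

```python
from lxml import html

# The example document from above, embedded as a string for the demo
doc = html.fromstring("""
<html>
  <body>
    <h1>This is a heading</h1>
    <ul class="fruit">
      <li>Apple</li>
      <li>Orange</li>
      <li>Banana</li>
    </ul>
    <ul>
      <li>This is the first item in the list</li>
      <li>This is the second item in the list</li>
      <li>This is the third item in the list</li>
    </ul>
  </body>
</html>
""")

# XPath: the h1, then only the list items inside the "fruit" list
print(doc.xpath("/html/body/h1/text()"))            # ['This is a heading']
print(doc.xpath('//ul[@class="fruit"]/li/text()'))  # ['Apple', 'Orange', 'Banana']

# The equivalent CSS selector
print([li.text for li in doc.cssselect("ul.fruit > li")])
```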
Regex, on the other hand, uses search patterns (rather than paths) to find every matching instance within a document.
This is useful whenever path-based searches won't cut the mustard.
For example, let's assume that we wanted to scrape the words "first," "second," and "third" from the other unordered list in our document.
There's no way to grab just these words using path-based queries, but we could use this regex pattern to match what we need:
<li>This is the (.*) item in the list<\/li>
This would search the document for list items (li) containing "This is the [ANY WORD] item in the list" AND extract only [ANY WORD] from that phrase.
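Here's the same pattern run in Python, as a quick sanity check (using the re module against the relevant part of the example document):

```python
import re

# The second unordered list from the example document
html_snippet = """
<li>This is the first item in the list</li>
<li>This is the second item in the list</li>
<li>This is the third item in the list</li>
"""

# Capture whatever sits between "This is the" and "item in the list"
matches = re.findall(r"<li>This is the (.*) item in the list</li>", html_snippet)
print(matches)  # ['first', 'second', 'third']
```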
Here are a few useful XPath/CSS/Regex resources:
- Regexr.com: Learn, build and test Regex;
- W3Schools XPath tutorial;
And the scraping tools used throughout this post:
- Scraper (a free Chrome extension);
- Screaming Frog SEO Spider;
- Import.io;
- Google Sheets (via the IMPORTXML function)
OK, let's get started with a few web scraping hacks!
1. Find "evangelists" who may be interested in reading your new content by scraping existing website comments
Most people who comment on WordPress blogs will do so using their name and website.
You can spot them in any comments section: they're the commenters whose names are hyperlinked.
But what use is this?
Well, let's assume that you've just published a post about X and you're looking for people who would be interested in reading it.
Here's a simple way to find them (that involves a bit of scraping):
- Find a similar post on your website (e.g. if your new post is about link building, find a previous post you wrote about SEO/link building; just make sure it has a decent amount of comments);
- Scrape the names + websites of all commenters;
- Reach out and tell them about your new content.
Here's how to scrape them:
Go to the comments section, then right-click any top-level comment and select "Scrape similar…" (note: you will need to install the Scraper Chrome Extension for this).
This should bring up a neat scraped list of commenters' names + websites.
Make a copy of this Google Sheet, then hit "Copy to clipboard," and paste them into the tab labeled "1. START HERE".
Go to the tab labeled "2. NAMES + WEBSITES" and use the Google Sheets hunter.io add-on to find the email addresses for your prospects.
You can then reach out to these people and tell them about your new/updated post.
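If you'd rather script the comment scrape than use the Chrome extension, here's a minimal sketch of the same idea. It assumes the default WordPress comment markup, where each commenter sits inside a .comment-author block; themes vary, so the selector may need tweaking, and the URL is just a placeholder:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: swap in the post whose comments you want to scrape
url = "https://example.com/blog/some-popular-post/"
soup = BeautifulSoup(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text, "html.parser")

commenters = []
# In default WordPress themes, each commenter sits in a ".comment-author" block,
# and commenters who left a website have their name wrapped in a link
for author in soup.select(".comment-author"):
    link = author.select_one("a")
    if link:  # only keep commenters who linked a website
        commenters.append({"name": link.get_text(strip=True), "website": link.get("href")})

for person in commenters:
    print(person["name"], person["website"])
```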
IMPORTANT: We advise being very careful with this strategy. Remember, these people may have left a comment, but they didn't opt into your email list. That could have been for a number of reasons, but chances are they were only really interested in this post. We therefore recommend using this strategy only to tell commenters about updates to the post and/or other new posts that are similar. In other words, don't email people about stuff they're unlikely to care about!
Here's the spreadsheet with sample data.
2. Find people willing to contribute to your posts by scraping existing "expert roundups"
"Expert" roundups are WAY overdone.
But, this doesn't mean that including advice/insights/quotes from knowledgeable industry figures within your content is a bad idea; it can add a lot of value.
In fact, we did exactly this in our recent guide to learning SEO.
But, while it's easy to find "experts" you may want to reach out to, it's important to remember that not everyone responds positively to such requests. Some people are too busy, while others simply despise all forms of "cold" outreach.
So, rather than guessing who might be interested in providing a quote/opinion/etc for your upcoming post, let's instead reach out to those with a track record of responding positively to such requests by:
- Finding existing "expert roundups" (or any post containing "expert" advice/opinions/etc) in your industry;
- Scraping the names + websites of all contributors;
- Building a list of people who are most likely to respond to your request.
Let's give it a shot with this expert roundup post from Nikolay Stoyanov.
First, we need to understand the structure/format of the data we want to scrape. In this instance, it appears to be a full name followed by a hyperlinked website.
HTML-wise, this is all wrapped in a <strong> tag.
Because we want both the name (i.e. text) and the website (i.e. link) from within this <strong> tag, we're going to use the Scraper extension to pull both the "text()" and the "a/@href" values with XPath.
Don't worry if the raw data looks a little messy; this will get cleaned up automatically in a second.
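If you prefer scripting this step, here's a minimal sketch of the same extraction in Python. The URL is just a placeholder, and it assumes each contributor is a linked name wrapped in a <strong> tag, as in the example above:

```python
import requests
from lxml import html

# Placeholder URL: swap in the expert roundup you want to scrape
url = "https://example.com/seo-expert-roundup/"
doc = html.fromstring(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).content)

contributors = []
# Each contributor: a <strong> tag that contains a link (the name + website)
for strong in doc.xpath("//strong[a]"):
    name = strong.text_content().strip()
    website = strong.xpath("a/@href")[0]
    contributors.append((name, website))

for name, website in contributors:
    print(name, website)
```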
Next, make a copy of this Google Sheet, hit "Copy to clipboard," then paste the raw data into the first tab (i.e. "1. START HERE").
Repeat this process for as many roundup posts as you like.
Finally, navigate to the second tab in the Google Sheet (i.e. "2. NAMES + DOMAINS") and you'll see a neat list of all contributors ordered by # of occurrences.
Here are 9 ways to find the email addresses for everyone on your list.
IMPORTANT: Always research any prospects before reaching out with questions/requests. And DON'T spam them!
Here's the spreadsheet with sample data.
3. Remove junk "guest post" prospects by scraping RSS feeds
Blogs that haven't published anything for a while are unlikely to respond to guest post pitches.
Why? Because the blogger has probably lost interest in their blog.
That's why I always check the publish dates on a blog's most recent posts before pitching them.
(If they haven't posted for more than a few weeks, I don't bother contacting them.)
However, with a bit of scraping know-how, this process can be automated. Here's how:
- Find the RSS feed for the blog;
- Scrape the "pubDate" from the feed
Most blogs' RSS feeds can be found at domain.com/feed/, which makes finding the RSS feed for a list of blogs as simple as adding "/feed/" to each URL.
For example, the RSS feed for the Ahrefs blog can be found at https://ahrefs.com/blog/feed/
You can then use XPath within the IMPORTXML function in Google Sheets to scrape the pubDate element:
=IMPORTXML("https://ahrefs.com/blog/feed/", "//pubDate")
This will scrape every pubDate element in the RSS feed, giving you a list of publishing dates for the most recent 5-10 blog posts for that blog.
But how do you do this for an entire list of blogs?
Well, I've made another Google Sheet that automates the process for you. Just paste a list of blog URLs (e.g. https://ahrefs.com/blog) into the first tab (i.e. "1. ENTER BLOG URLs") and you should see something like this appear in the "RESULTS" tab:
It tells you:
- The date of the most recent post;
- How many days/weeks/months ago that was;
- Average # of days/weeks/months between posts (i.e. how often they post, on average)
This is super-useful information for choosing who to pitch guest posts to.
For example, you can see that we publish a new post every 11 days on average, meaning that Ahrefs would definitely be a great blog to pitch to if you were in the SEO/marketing industry 🙂
Here's the spreadsheet with sample data.
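If you'd rather run this check outside of Google Sheets, here's a minimal sketch in Python that does roughly the same thing (the blog list is just an example): it pulls each feed's pubDate values and works out how recently, and how often, each blog publishes.

```python
import requests
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from xml.etree import ElementTree

# Example list of blogs to check (swap in your own prospects)
blogs = ["https://ahrefs.com/blog"]

for blog in blogs:
    feed = requests.get(blog.rstrip("/") + "/feed/", headers={"User-Agent": "Mozilla/5.0"})
    tree = ElementTree.fromstring(feed.content)

    # <pubDate> elements in an RSS feed hold RFC 822 date strings
    dates = sorted(
        (parsedate_to_datetime(el.text) for el in tree.iter("pubDate") if el.text),
        reverse=True,
    )
    if not dates:
        print(blog, "- no pubDate elements found")
        continue

    days_since_last = (datetime.now(timezone.utc) - dates[0]).days
    print(blog, "- most recent post:", days_since_last, "days ago")

    gaps = [(newer - older).days for newer, older in zip(dates, dates[1:])]
    if gaps:
        print(blog, "- posts roughly every", round(sum(gaps) / len(gaps)), "days")
```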
Recommended reading: An In-Depth Look at Guest Blogging in 2016 (Case Studies, Data & Tips)
4. Find out what type of content performs best on your blog by scraping post categories
Many bloggers will have a general sense of what resonates with their audience.
But as an SEO/marketer, I prefer to rely on cold hard data.
When it comes to blog content, data can help answer questions that aren't instantly obvious, such as:
- Do some topics get shared more than others?
- Are there specific topics that attract more backlinks than others?
- Are some authors more popular than others?
In this section, I'll show you exactly how to answer these questions for your blog by combining a single Ahrefs export with a simple scrape. You'll even be able to auto-generate accompanying graphs.
Here's the process:
- Export the "top content" report from Ahrefs Site Explorer;
- Scrape categories for all the blog posts;
- Analyse the data in Google Sheets (hint: I've included a template that does this automagically!)
To begin, we need to grab the top pages report from Ahrefs; let's use ahrefs.com/blog for our example.
Site Explorer > Enter ahrefs.com/blog > Pages > Top Content > Export as .csv
Next, make a copy of this Google Sheet then paste all data from the Top Content .csv export into cell A1 of the first tab (i.e. "1. Ahrefs Export").
Now comes the scraping…
Open up one of the URLs from the "Content URL" column and locate the category under which the post was published.
We now need to figure out the XPath for this HTML element, so right-click and hit "Inspect" to view the HTML.
In this instance, we can see that the post category is contained within a <div> with the class "post-category", which is nested within the <header> tag. This means our XPath would be:
//header/div[@class="post-category"]
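For reference, here's roughly what that extraction looks like as a standalone script: a minimal sketch that loops over the URLs from your Ahrefs export, pulls the category with the XPath above, and tallies the results (the URLs shown are placeholders, and the XPath only fits blogs using this exact markup):

```python
from collections import Counter

import requests
from lxml import html

# Placeholder list: paste in the URLs from the "Content URL" column of your Ahrefs export
urls = [
    "https://ahrefs.com/blog/some-post/",
    "https://ahrefs.com/blog/another-post/",
]

category_counts = Counter()
for url in urls:
    doc = html.fromstring(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).content)
    # Same XPath as above: the category div nested inside the <header> tag
    categories = doc.xpath('//header/div[@class="post-category"]/text()')
    if categories:
        category_counts[categories[0].strip()] += 1

for category, count in category_counts.most_common():
    print(category, count)
```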
To scrape the post category for every post at scale, we can use Screaming Frog; here's how:
- Open Screaming Frog and go to "Mode" > "List";
- Go to "Configuration" > "Spider" and uncheck all the boxes (like this);
- Go to "Configuration" > "Custom" > "Extraction" > "Extractor 1" and paste in your XPath (e.g. //header/div[@class="post-category"]). Make sure you choose "XPath" as the scraper mode and "Extract Text" as the extractor mode (like this);
- Copy/paste all URLs from the "Content URL" column into Screaming Frog, and start the scrape.
Once complete, head to the "Custom" tab, filter by "Extraction" and you'll see the extracted data for each URL.
Hit "Export", then copy all the data in the .csv into the next tab in the Google Sheet (i.e. "2. SF extraction").
Go to the final tab in the Google Sheet (i.e. "RESULTS") and you'll see a bunch of data + accompanying graphs.
Here's the spreadsheet with sample data.
5. Promote only the RIGHT kind of content on Reddit (by looking at what has already performed well)
Redditors despise self-promotion.
In fact, any lazy attempts to self-promote via the platform are usually met with a barrage of mockery and foul language.
But here's the thing:
Redditors have nothing against you sharing something with them; you just need to make sure it's something they actually care about.
The best way to do this is to scrape (and analyze) what they liked in the past, then share more of that type of content with them.
Here's the process:
- Choose a subreddit (e.g. /r/Entrepreneur);
- Scrape the top 1000 posts of all time;
- Analyse the data and act accordingly (yep, I've included a Google Sheet that does this for you!)
OK, first things first, make a copy of this Google Sheet + enter the subreddit you want to analyze. You should then see a formatted link to that subreddit's top posts appear alongside it.
This takes you to a page showing the top 25 posts of all time for that subreddit.
However, this page only shows the top 25 posts. We're going to analyze the top 1,000, so we need to use a scraping tool to scrape multiple pages of results.
Reddit actually makes this rather difficult but Import.io (free up to 500 queries per month, which is plenty) can do this with ease.
Here's what we're going to scrape from these pages:
- Rank;
- Score/upvotes;
- Title;
- Submitting user;
- Comments;
- Link flair (optional, as this is not available on all subreddits… it's also more obvious on some subreddits than others; learn more here)
OK, let's stick with /r/Entrepreneur for our example…
Go to Import.io > sign up > new extractor > paste in the link from the Google Sheet (shown above)
Click "Go".
Import.io will now work its magic and extract a bunch of data from the page.
Hit "Save" (but don't run it yet!)
Right now, the extractor is only set up to scrape the top 25 posts. You need to add the other URLs (from the tab labeled "2. MORE LINKS" in the Google Sheet) to scrape the rest.
Add these under the "Settings" tab for your extractor.
Hit "Save URLs" then run the extractor.
Download the .csv once complete.
Copy/paste all data from the .csv into the sheet labeled "3. IMPORT.IO EXPORT" in the spreadsheet.
Finally, go to the "RESULTS" sheet and enter a keyword; it will then kick back some neat stats showing how interested that subreddit is likely to be in your topic.
Here's the spreadsheet with sample data.
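Side note: if you'd rather skip Import.io altogether, Reddit also exposes its listing pages as JSON (at the time of writing, appending .json to a subreddit's top-posts URL works, with up to 100 posts per request). Here's a minimal sketch of that approach; the subreddit and User-Agent string are just examples, and heavy use may get rate-limited:

```python
import requests

subreddit = "Entrepreneur"
url = f"https://www.reddit.com/r/{subreddit}/top.json"
headers = {"User-Agent": "top-posts-research-script"}  # a descriptive User-Agent helps avoid blocks

posts, after = [], None
while len(posts) < 1000:
    # t=all -> top posts of all time; "after" pages through the listing 100 posts at a time
    params = {"t": "all", "limit": 100, "after": after}
    data = requests.get(url, headers=headers, params=params).json()["data"]

    for child in data["children"]:
        post = child["data"]
        posts.append({
            "title": post["title"],
            "score": post["score"],
            "author": post["author"],
            "comments": post["num_comments"],
            "flair": post.get("link_flair_text"),
        })

    after = data.get("after")
    if not after:  # no more pages
        break

print(len(posts), "posts scraped")
```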
6. Build relationships with people who are already fans of your content
Most tweets will drive ZERO traffic to your website.
That's why "begging for tweets" from anyone and everyone is a terrible idea.
However, that's not to say all tweets are worthless; it's still worth reaching out to those who are likely to send real traffic to your website.
Here's a workflow for doing this (note: it includes a bit of Twitter scraping):
- Scrape and add all Twitter mentions to a spreadsheet (using IFTTT);
- Scrape the number of followers for the people who've shared a lot of your stuff;
- Find contact details, then reach out and build relationships with these people.
OK, so first, make a copy of this Google Sheet.
IMPORTANT: You MUST make a copy of this on the root of your Google Drive (i.e. not in a subfolder). It MUST also be named exactly "My Twitter Mentions".
Next, turn this recipe on within your IFTTT account (you'll need to connect your Twitter + Google Drive accounts to IFTTT in order to do this).
What does this recipe do? Basically, every time someone mentions you on Twitter, it'll scrape the following information and add it to a new row in the spreadsheet:
- Twitter handle (of the person who mentioned you);
- Their tweet;
- Tweet link;
- Time/date they tweeted
And if you go to the second sheet in the spreadsheet (i.e. the one labeled "1.Tweets"), you'll see the people who've mentioned you and tweeted your links most often.
But the fact that they've mentioned you a number of times doesn't necessarily indicate that they'll drive any real traffic to your website.
So, you now want to scrape the number of followers each of these people has.
You can do this with CSS selectors using Screaming Frog.
Just set your search depth to "0" (see here), then use these settings under the custom extractor:
Here's each CSS selector (for clarification):
- Twitter Name: h1
- Twitter Handle: h2 > a > span > b
- Followers: li.ProfileNav-item.ProfileNav-item--followers > a > span.ProfileNav-value
- Website: div.ProfileHeaderCard > div.ProfileHeaderCard-url > span.ProfileHeaderCard-urlText.u-dir > a
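If you'd rather not fire up Screaming Frog for this step, here's a minimal sketch of the same extraction in Python, reusing the selectors above. Note that Twitter changes its markup over time, so these selectors may need updating, and profile pages may not always render fully for simple HTTP requests:

```python
import requests
from bs4 import BeautifulSoup


def grab(soup, selector):
    """Return the text of the first element matching the CSS selector, or an empty string."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ""


# Placeholder list: paste in the profile URLs collected in the spreadsheet
profiles = ["https://twitter.com/ahrefs"]

for url in profiles:
    soup = BeautifulSoup(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text, "html.parser")
    print({
        "name": grab(soup, "h1"),
        "handle": grab(soup, "h2 > a > span > b"),
        "followers": grab(soup, "li.ProfileNav-item.ProfileNav-item--followers > a > span.ProfileNav-value"),
        "website": grab(soup, "div.ProfileHeaderCard > div.ProfileHeaderCard-url > span.ProfileHeaderCard-urlText.u-dir > a"),
    })
```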
Copy/paste all the Twitter links from the spreadsheet into Screaming Frog and run it.
Once finished, go to:
Custom > Extraction > Export
Open the exported .csv, then copy/paste all the data into the next tab in the sheet (i.e. the one labeled "2. SF Export").
Lastly, go to the final tab (i.e. "3. RESULTS") and you'll see a list of everyone who's mentioned you along with a bunch of other information including:
- # of times they tweeted about you,
- # of followers
- Their website (where applicable)
Because these people have already shared your content in the past, and also have a good number of followers, it's worth reaching out and building relationships with them.
Here's the spreadsheet with sample data.
Final thoughts
Web scraping is crazily powerful.
All you need is some basic XPath/CSS/Regex knowledge (along with a web scraping tool, of course) and it's possible to scrape almost anything from any website in a matter of seconds.
I'm a firm believer that the best way to learn is by doing, so I highly recommend that you spend some time replicating the experiments above. This will also teach you to pay attention to things that could easily be automated with web scraping in the future.
So, play around with the tools/ideas above and let me know what you come up with in the comments section below 🙂