{"id":17473,"date":"2017-10-16T23:45:50","date_gmt":"2017-10-17T07:45:50","guid":{"rendered":"https:\/\/ahrefs.com\/blog\/?p=17473"},"modified":"2023-07-19T03:32:33","modified_gmt":"2023-07-19T08:32:33","slug":"web-scraping-for-marketers","status":"publish","type":"post","link":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/","title":{"rendered":"6 Actionable Web Scraping Hacks for White Hat Marketers"},"content":{"rendered":"<div class=\"intro-txt\"> Have you ever used a program like <a href=\"https:\/\/www.screamingfrog.co.uk\/seo-spider\/\" target=\"_blank\" rel=\"noopener\">Screaming Frog<\/a>&nbsp;to extract metadata (e.g. title\/description\/etc.) from a bunch of web pages in&nbsp;bulk?&nbsp;<\/div>\n<p>If so, you\u2019re <em>already<\/em>&nbsp;familiar with web scraping.<\/p>\n<p>But, while this can certainly be useful, there\u2019s much more to web scraping than grabbing a few title tags\u2014it can actually be used to extract <em>any<\/em>&nbsp;data from <em>any<\/em>&nbsp;web page in seconds.<\/p>\n<p>The question is: <em>what<\/em>&nbsp;data would you need to extract and <em>why<\/em>?<\/p>\n<p>In this post, I\u2019ll aim to answer these questions by showing you 6 web scraping hacks:<\/p>\n<ol>\n<li>How to find content \u201cevangelists\u201d in website comments<\/li>\n<li>How to collect prospects\u2019 data from \u201cexpert roundups\u201d<\/li>\n<li>How to remove junk \u201cguest post\u201d prospects<\/li>\n<li>How to analyze performance of your blog categories<\/li>\n<li>How to choose the right content for Reddit<\/li>\n<li>How to build relationships with those who love your content<\/li>\n<\/ol>\n<p>I\u2019ve also automated as much of the process as possible to make things less daunting for those new to web scraping.<\/p>\n<p>But first, let\u2019s talk a bit more about web scraping and how it&nbsp;works.<\/p>\n<h2>A basic introduction to web scraping<\/h2>\n<p>Let\u2019s assume that you want to extract the titles from your competitors\u2019 
50 most recent blog&nbsp;posts.<\/p>\n<p>You could visit each website individually, check the HTML, locate the title tag, then copy\/paste that data to wherever you needed it (e.g. a spreadsheet).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"899\" height=\"215\" class=\"wp-image-17435\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/view-source_https___ahrefs_com_blog_asking-for-tweets_.png\" alt=\"view source https ahrefs com blog asking for tweets\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/view-source_https___ahrefs_com_blog_asking-for-tweets_.png 899w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/view-source_https___ahrefs_com_blog_asking-for-tweets_-768x184.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/view-source_https___ahrefs_com_blog_asking-for-tweets_-680x163.png 680w\" sizes=\"auto, (max-width: 899px) 100vw, 899px\"><\/p>\n<p>But, this would be <em>very<\/em>&nbsp;time-consuming and boring.<\/p>\n<p>That\u2019s why it\u2019s much easier to scrape the data we want using a computer application (i.e. web scraper).<\/p>\n<p>In general, there are two ways to \u201cscrape\u201d the data you\u2019re looking for:<\/p>\n<ol>\n<li>Using a path-based system (e.g. XPath\/CSS selectors);<\/li>\n<li>Using a search pattern (e.g.&nbsp;Regex)<\/li>\n<\/ol>\n<p>XPath\/CSS (i.e. 
path-based system) is the best way to scrape most types of&nbsp;data.<\/p>\n<p>For example, let\u2019s assume that we wanted to scrape the <a href=\"https:\/\/ahrefs.com\/blog\/h1-tag\/\"><em>h1<\/em>&nbsp;tag<\/a> from this document:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"899\" height=\"444\" class=\"wp-image-17453\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-h1.png\" alt=\"HTML h1\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-h1.png 899w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-h1-768x379.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-h1-680x336.png 680w\" sizes=\"auto, (max-width: 899px) 100vw, 899px\"><\/p>\n<p>We can see that the <em>h1<\/em>&nbsp;is nested in the <em>body <\/em>tag, which is nested under the <em>html<\/em>&nbsp;tag\u2014here\u2019s how to write this as XPath\/CSS:<\/p>\n<ul>\n<li><strong>XPath:<\/strong>&nbsp;\/html\/body\/h1<\/li>\n<li><strong>CSS selector:<\/strong>&nbsp;html &gt; body &gt;&nbsp;h1<\/li>\n<\/ul>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> Because there is only one h1 tag in the document, we don\u2019t actually need to give the full path. 
Instead, we can just tell the scraper to find all instances of h1 throughout the document with \u201c\/\/h1\u201d for XPath, and simply \u201ch1\u201d&nbsp;for CSS.&nbsp;<\/div>\n<p>But what if we wanted to scrape the list of fruit instead?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"899\" height=\"444\" class=\"wp-image-17451\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-fruit.png\" alt=\"html fruit\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-fruit.png 899w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-fruit-768x379.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-fruit-680x336.png 680w\" sizes=\"auto, (max-width: 899px) 100vw, 899px\"><\/p>\n<p>You might guess something like: <em>\/\/ul\/li<\/em>&nbsp;(XPath), or <em>ul &gt; li<\/em>&nbsp;(CSS), right?<\/p>\n<p>This would work, but because there are two unordered lists (ul) in the document, it would scrape the list of fruit AND every list item in the second list.<\/p>\n<p>However, we can reference the <em>class<\/em>&nbsp;of the <em>ul<\/em>&nbsp;to grab only what we&nbsp;want:<\/p>\n<ul>\n<li><strong>XPath:<\/strong>&nbsp;\/\/ul[@class='fruit']\/li<\/li>\n<li><strong>CSS selector:<\/strong>&nbsp;ul.fruit &gt;&nbsp;li<\/li>\n<\/ul>\n<p>Regex, on the other hand, uses search patterns (rather than paths) to find <em>every<\/em>&nbsp;matching instance within a document.<\/p>\n<p>This is useful whenever path-based searches won\u2019t cut the mustard.<\/p>\n<p>For example, let\u2019s assume that we wanted to scrape the words \u201cfirst,\u201d \u201csecond,\u201d and \u201cthird\u201d from the other unordered list in our document.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"899\" height=\"444\" class=\"wp-image-17447\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-regex.png\" alt=\"html regex\" 
srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-regex.png 899w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-regex-768x379.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-regex-680x336.png 680w\" sizes=\"auto, (max-width: 899px) 100vw, 899px\"><\/p>\n<p>There\u2019s no way to grab <em>just<\/em>&nbsp;these words using path-based queries, but we could use this regex pattern to match what we&nbsp;need:<\/p>\n<p><strong>&lt;li&gt;This is the (.*) item in the list&lt;\\\/li&gt;<\/strong><\/p>\n<p>This would search the document for list items (<em>li<\/em>) containing <em>\u201cThis is the [ANY WORD] item in the list\u201d <\/em>AND extract <em>only<\/em>&nbsp;[ANY WORD] from that phrase.<\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> Because regex doesn\u2019t use the structured nature of HTML\/XML files, results are often less accurate than they are with CSS\/XPath. You should only&nbsp;use Regex when XPath\/CSS isn\u2019t a viable option.&nbsp;<\/div>\n<p>Here are a few useful XPath\/CSS\/Regex resources:<\/p>\n<ul>\n<li><u><a href=\"http:\/\/regexr.com\/\" target=\"_blank\" rel=\"noopener\">Regexr.com<\/a><\/u>&nbsp;\u2014 Learn, build and test&nbsp;Regex;<\/li>\n<li><u><a href=\"https:\/\/www.w3schools.com\/xml\/xpath_intro.asp\" target=\"_blank\" rel=\"noopener\">W3Schools XPath tutorial<\/a><\/u>.<\/li>\n<\/ul>\n<p>And scraping tools:<\/p>\n<ul>\n<li><u><a href=\"http:\/\/urlprofiler.com\/\" target=\"_blank\" rel=\"noopener\">URL Profiler<\/a><\/u><\/li>\n<li><u><a href=\"https:\/\/www.screamingfrog.co.uk\/seo-spider\/\" target=\"_blank\" rel=\"noopener\">Screaming Frog<\/a><\/u><\/li>\n<li><u><a href=\"https:\/\/chrome.google.com\/webstore\/detail\/scraper\/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en\" target=\"_blank\" rel=\"noopener\">Scraper (Chrome Extension)<\/a><\/u><\/li>\n<li><u><a href=\"http:\/\/seotoolsforexcel.com\/\" target=\"_blank\" 
rel=\"noopener\">SeoTools for&nbsp;Excel<\/a><\/u><\/li>\n<li><u><a href=\"https:\/\/www.import.io\/\" target=\"_blank\" rel=\"noopener\">Import.io<\/a><\/u><\/li>\n<\/ul>\n<p>OK, let\u2019s get started with a few web scraping hacks!<\/p>\n<h2>1. Find \u201cevangelists\u201d who may be interested in reading your new content by scraping existing website comments<\/h2>\n<p>Most people who comment on WordPress blogs will do so using their name and website.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"901\" height=\"327\" class=\"wp-image-17464\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/wordpress-comment-name-website.png\" alt=\"wordpress comment name website\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/wordpress-comment-name-website.png 901w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/wordpress-comment-name-website-768x279.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/wordpress-comment-name-website-680x247.png 680w\" sizes=\"auto, (max-width: 901px) 100vw, 901px\"><\/p>\n<p>You can spot these in any comments section as they\u2019re the hyperlinked comments.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"262\" class=\"wp-image-17457\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/hyperlinked-comment.png\" alt=\"hyperlinked comment\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/hyperlinked-comment.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/hyperlinked-comment-768x224.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/hyperlinked-comment-680x198.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>But what use is&nbsp;this?<\/p>\n<p>Well, let\u2019s assume that you\u2019ve just published a post about X and you\u2019re looking for people who would be interested in reading it.<\/p>\n<p>Here\u2019s a simple way to find them (that involves a bit of 
scraping):<\/p>\n<ol>\n<li>Find a similar post on your website (e.g. if your new post is about link building, find a previous post you wrote about SEO\/link building, and make sure it has a decent number of comments);<\/li>\n<li>Scrape the names + websites of all commenters;<\/li>\n<li>Reach out and tell them about your new content.<\/li>\n<\/ol>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> This works well because these people (a) are existing fans of your&nbsp;work, and (b) loved one of your previous posts on the topic so much that they left a comment. So, while this is still \u201ccold\u201d pitching, they are much more likely to be interested in your content than complete strangers would be.&nbsp;<\/div>\n<p>Here\u2019s how to scrape them:<\/p>\n<p>Go to the comments section, then right-click any top-level comment and select \u201cScrape similar\u2026\u201d (note: you will need to install the <u><a href=\"https:\/\/chrome.google.com\/webstore\/detail\/scraper\/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en\" target=\"_blank\" rel=\"noopener\">Scraper Chrome Extension<\/a><\/u>&nbsp;for this).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"476\" class=\"wp-image-17439\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-comments.png\" alt=\"scrape similar comments\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-comments.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-comments-768x406.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-comments-680x360.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>This should bring up a neat scraped list of commenters\u2019 names + websites.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"899\" height=\"430\" class=\"wp-image-17440\" 
src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-done.png\" alt=\"scrape similar done\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-done.png 899w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-done-768x367.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/scrape-similar-done-680x325.png 680w\" sizes=\"auto, (max-width: 899px) 100vw, 899px\"><\/p>\n<p>Make a copy of <u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1NDMdwvwEVg3T7pzT9BRfBuBCJueHAgTYyOWis-XkLsk\/edit?usp=sharing\" target=\"_blank\" rel=\"noopener\">this Google Sheet<\/a><\/u>, then hit \u201cCopy to clipboard,\u201d and paste them into the tab labeled \u201c1. START&nbsp;HERE\u201d.<\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> If you have multiple pages of comments, you\u2019ll have to repeat this process for&nbsp;each.&nbsp;<\/div>\n<p>Go to the tab labeled \u201c2. NAMES + WEBSITES\u201d and use the <u><a href=\"https:\/\/hunter.io\/sheets\" target=\"_blank\" rel=\"noopener\">Google Sheets hunter.io add-on<\/a><\/u>&nbsp;to find the email addresses for your prospects.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"319\" class=\"wp-image-17450\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/email-addresses.gif\" alt=\"email addresses\"><\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> Hunter.io won\u2019t succeed with all your prospects so <u><a href=\"https:\/\/ahrefs.com\/blog\/find-email-address\/\" target=\"_blank\" rel=\"noopener\">here are more actionable ways to find email addresses<\/a> <\/u><\/div>\n<p>You can then reach out to these people and tell them about your new\/updated post.<\/p>\n<p><strong>IMPORTANT<\/strong>: We advise being <em>very<\/em>&nbsp;careful with this strategy. 
Remember, these people may have left a comment, but they <em>didn\u2019t<\/em>&nbsp;opt into your email list. That could have been for a number of reasons, but chances are they were only really interested in this post. We, therefore, recommend using this strategy <em>only<\/em> to tell commenters about the updates to the post and\/or other new posts that are similar. In other words, don\u2019t email people about stuff they\u2019re unlikely to care&nbsp;about!<\/p>\n<p><u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/16RoiXcELZn6MI3RlC4dus0c2rnVthZEvg9kQucURsVU\/copy\" target=\"_blank\" rel=\"noopener\">Here\u2019s the spreadsheet with sample data<\/a><\/u>.<\/p>\n<h2>2. Find people willing to contribute to your posts by scraping existing \u201cexpert roundups\u201d<\/h2>\n<p>\u201cExpert\u201d roundups are WAY overdone.<\/p>\n<p>But, this doesn\u2019t mean that including advice\/insights\/quotes from knowledgeable industry figures within your content is a bad idea; it <em>can<\/em>&nbsp;add a lot of&nbsp;value.<\/p>\n<p>In fact, we did exactly this in <u><a href=\"https:\/\/ahrefs.com\/blog\/learn-seo\/\" target=\"_blank\" rel=\"noopener\">our recent guide to learning SEO<\/a><\/u>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"474\" class=\"wp-image-17461\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/how-to-learn-seo-in-2017-experts.png\" alt=\"how to learn seo in 2017 experts\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/how-to-learn-seo-in-2017-experts.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/how-to-learn-seo-in-2017-experts-768x404.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/how-to-learn-seo-in-2017-experts-680x358.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>But, while it\u2019s easy to find \u201cexperts\u201d you may want to reach out to, it\u2019s important to remember that not everyone responds 
positively to such requests. Some people are too busy, while others simply despise all forms of \u201ccold\u201d outreach.<\/p>\n<p>So, rather than guessing who might be interested in providing a quote\/opinion\/etc for your upcoming post, let\u2019s instead reach out to those with a track record of responding positively to such requests by:<\/p>\n<ol>\n<li>Finding existing \u201cexpert roundups\u201d (or any post containing \u201cexpert\u201d advice\/opinions\/etc) in your industry;<\/li>\n<li>Scraping the names + websites of all contributors;<\/li>\n<li>Building a list of people who are most likely to respond to your request.<\/li>\n<\/ol>\n<p>Let\u2019s give it a shot with this <u><a href=\"https:\/\/niksto.com\/penguin-penalty\/\" target=\"_blank\" rel=\"noopener\">expert roundup post from Nikolay Stoyanov<\/a><\/u>.<\/p>\n<p>First, we need to understand the structure\/format of the data we want to scrape. In this instance, it appears to be <em>full name<\/em>&nbsp;followed by a hyperlinked <em>website<\/em>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"185\" class=\"wp-image-17468\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/tim-soulo-expert-roundup.png\" alt=\"tim soulo expert roundup\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/tim-soulo-expert-roundup.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/tim-soulo-expert-roundup-768x158.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/tim-soulo-expert-roundup-680x140.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>HTML-wise, this is all wrapped in a <em>&lt;strong&gt;<\/em>&nbsp;tag.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"207\" class=\"wp-image-17462\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-inspect-chrome.png\" alt=\"html inspect chrome\" 
srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-inspect-chrome.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-inspect-chrome-768x177.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-inspect-chrome-680x156.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> You can inspect the HTML for any on-page element by right-clicking on it and hitting \u201cInspect\u201d in Chrome.&nbsp;<\/div>\n<p>Because we want both the names (i.e. text) and website (i.e. link) from within this <em>&lt;strong&gt;<\/em>&nbsp;tag, we\u2019re going to use the <u><a href=\"https:\/\/chrome.google.com\/webstore\/detail\/scraper\/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en\" target=\"_blank\" rel=\"noopener\">Scraper extension<\/a><\/u>&nbsp;to scrape for the \u201ctext()\u201d and \u201ca\/@href\u201d using XPath, like&nbsp;this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"443\" class=\"wp-image-17460\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/strong-scraper.png\" alt=\"strong scraper\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/strong-scraper.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/strong-scraper-768x378.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/strong-scraper-680x335.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Don\u2019t worry if your data is a little messy (as it is above); this will get cleaned up automatically in a second.<\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> For those unfamiliar with XPath syntax, I recommend using <u><a href=\"http:\/\/ricostacruz.com\/cheatsheets\/xpath\" target=\"_blank\" rel=\"noopener\">this cheat sheet<\/a><\/u>. 
Assuming you have basic HTML knowledge, this should be enough to help you understand how to extract the data <em>you<\/em>&nbsp;want from a web&nbsp;page.&nbsp;<\/div>\n<p>Next, make a copy of <u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/11bN0vk8Y9sz1cl3ngCovWS3TH0dq_UooFOYZ2Lgxem0\/copy\" target=\"_blank\" rel=\"noopener\">this Google Sheet<\/a><\/u>, hit \u201cCopy to clipboard\u201d in Scraper, then paste the raw data into the first tab (i.e. \u201c1. START&nbsp;HERE\u201d).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"254\" class=\"wp-image-17442\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/raw-data-from-scraper.png\" alt=\"raw data from scraper\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/raw-data-from-scraper.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/raw-data-from-scraper-768x217.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/raw-data-from-scraper-680x192.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Repeat this process for as many roundup posts as you&nbsp;like.<\/p>\n<p>Finally, navigate to the second tab in the Google Sheet (i.e. \u201c2. 
NAMES + DOMAINS\u201d) and you\u2019ll see a neat list of all contributors ordered by # of occurrences.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"901\" height=\"255\" class=\"wp-image-17463\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/roundup-scraping-final-tab.png\" alt=\"roundup scraping final tab\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/roundup-scraping-final-tab.png 901w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/roundup-scraping-final-tab-768x217.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/roundup-scraping-final-tab-680x192.png 680w\" sizes=\"auto, (max-width: 901px) 100vw, 901px\"><\/p>\n<p>Here are&nbsp;<u><a href=\"https:\/\/ahrefs.com\/blog\/find-email-address\/\" target=\"_blank\" rel=\"noopener\">9 ways to find the email addresses for everyone on your list<\/a><\/u>.<\/p>\n<p><strong>IMPORTANT<\/strong>: Always research any prospects before reaching out with questions\/requests. And DON\u2019T spam&nbsp;them!<\/p>\n<p><u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1fnid0I8AIId_MhywBMlOvAYa0EEV4nKUSzvB8O_mpkQ\/copy\" target=\"_blank\" rel=\"noopener\">Here\u2019s the spreadsheet with sample data<\/a><\/u>.<\/p>\n<h2>3. Remove junk \u201cguest post\u201d prospects by scraping RSS&nbsp;feeds<\/h2>\n<p>Blogs that haven\u2019t published anything for a while are unlikely to&nbsp;respond to guest post pitches.<\/p>\n<p>Why? 
Because the blogger has <em>probably<\/em>&nbsp;lost interest in their&nbsp;blog.<\/p>\n<p>That\u2019s why I <em>always<\/em>&nbsp;check the publish dates of their most recent posts before pitching them.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"824\" height=\"243\" class=\"wp-image-17445\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/guest-post-recently.png\" alt=\"guest post recently\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/guest-post-recently.png 824w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/guest-post-recently-768x226.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/guest-post-recently-680x201.png 680w\" sizes=\"auto, (max-width: 824px) 100vw, 824px\"><\/p>\n<p>(If&nbsp;they haven\u2019t posted for more than a few weeks, I don\u2019t bother contacting them.)<\/p>\n<p>However, with a bit of scraping know-how, this process can be automated. Here\u2019s how:<\/p>\n<ol>\n<li>Find the RSS feed for the&nbsp;blog;<\/li>\n<li>Scrape the \u201c<em>pubDate<\/em>\u201d from the&nbsp;feed.<\/li>\n<\/ol>\n<p>Most blogs\u2019 RSS feeds can be found at <em>domain.com\/feed\/<\/em>, which makes finding the RSS feed for a list of blogs as simple as adding \u201c\/feed\/\u201d to the&nbsp;URL.<\/p>\n<p>For example, the RSS feed for the Ahrefs blog can be found at <u><a href=\"https:\/\/ahrefs.com\/blog\/feed\/\" target=\"_blank\" rel=\"noopener\">https:\/\/ahrefs.com\/blog\/feed\/<\/a><\/u>.<\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> This won\u2019t work for every blog. Some bloggers use other services such as FeedBurner to create RSS feeds. 
It will, however, work for&nbsp;most.&nbsp;<\/div>\n<p>You can then use XPath within the <em>IMPORTXML<\/em>&nbsp;function in Google Sheets to scrape the <em>pubDate<\/em>&nbsp;element:<\/p>\n<p><em>=IMPORTXML(\"<\/em>https:\/\/ahrefs.com\/blog\/feed\/<em>\", \"<\/em><strong>\/\/pubDate<\/strong><em>\")<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"267\" class=\"wp-image-17456\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/pubdate-google-sheets.gif\" alt=\"pubdate google sheets\"><\/p>\n<p>This will scrape every <em>pubDate<\/em>&nbsp;element in the RSS feed, giving you a list of publishing dates for the most recent 5-10 blog posts for that&nbsp;blog.<\/p>\n<p>But how do you do this for an entire list of&nbsp;blogs?<\/p>\n<p>Well, I\u2019ve made <u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1ok5YO8H8tyN-pi1O8WtdYj1fLmjRVc5jZ0p42X2tl5M\/copy\" target=\"_blank\" rel=\"noopener\">another Google Sheet<\/a><\/u>&nbsp;that automates the process for you: just paste a list of blog URLs (e.g. <u><a href=\"https:\/\/ahrefs.com\/blog\" target=\"_blank\" rel=\"noopener\">https:\/\/ahrefs.com\/blog<\/a><\/u>) into the first tab (i.e. \u201c1. 
ENTER BLOG URLs\u201d) and you should see something like this appear in the \u201cRESULTS\u201d tab:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"189\" class=\"wp-image-17471\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/rss-google-sheets.png\" alt=\"rss google sheets\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/rss-google-sheets.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/rss-google-sheets-768x161.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/rss-google-sheets-680x143.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>It tells&nbsp;you:<\/p>\n<ul>\n<li>The date of the most recent post;<\/li>\n<li>How many days\/weeks\/months ago that&nbsp;was;<\/li>\n<li>Average # of days\/weeks\/months between posts (i.e. how often they post, on average)<\/li>\n<\/ul>\n<p>This is super-useful information for choosing who to pitch guest posts&nbsp;to.<\/p>\n<p>For example, you can see that we publish a new post every 11 days on average, meaning that Ahrefs would definitely be a great blog to pitch to if you were in the SEO\/marketing industry&nbsp;\ud83d\ude42<\/p>\n<p><u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1PXe3Bg3-ocRV1RoyqqShV4PUXNqLpt-MPlzSXHbv4LU\/copy\" target=\"_blank\" rel=\"noopener\">Here\u2019s the spreadsheet with sample data<\/a><\/u>.<\/p>\n<p><em>Recommended reading: <\/em><em><a href=\"https:\/\/ahrefs.com\/blog\/guest-blogging\/\" target=\"_blank\" rel=\"noopener\">An In-Depth Look at Guest Blogging in 2016 (Case Studies, Data &amp;&nbsp;Tips)<\/a><\/em><\/p>\n<h2>4. 
Find out what type of content performs best on your blog by scraping post categories<\/h2>\n<p>Many bloggers will have a general sense of what resonates with their audience.<\/p>\n<p>But as an SEO\/marketer, I prefer to rely on cold hard&nbsp;data.<\/p>\n<p>When it comes to blog content, data can help answer questions that aren\u2019t instantly obvious, such&nbsp;as:<\/p>\n<ul>\n<li>Do some topics get shared more than others?<\/li>\n<li>Are there specific topics that attract more backlinks than others?<\/li>\n<li>Are some authors more popular than others?<\/li>\n<\/ul>\n<p>In this section, I\u2019ll show you exactly how to answer these questions for your blog by combining a single Ahrefs export with a simple scrape. You\u2019ll even be able to auto-generate visual data representations like&nbsp;this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"558\" class=\"wp-image-17436\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-graph.png\" alt=\"blog data graph\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-graph.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-graph-768x476.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-graph-680x422.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Here\u2019s the process:<\/p>\n<ol>\n<li>Export the \u201ctop content\u201d report from <u><a href=\"https:\/\/ahrefs.com\/site-explorer\" target=\"_blank\" rel=\"noopener\">Ahrefs Site Explorer<\/a><\/u>;<\/li>\n<li>Scrape categories for all the blog&nbsp;posts;<\/li>\n<li>Analyse the data in Google Sheets (hint: I\u2019ve included a <u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1xg6iO8_5bAAhQ9_vzzcC6Yu6VWX3jWoQzw67ypmB1d8\/copy\" target=\"_blank\" rel=\"noopener\">template<\/a><\/u>&nbsp;that does this automagically!)<\/li>\n<\/ol>\n<p>To begin, we need to grab the top pages report from Ahrefs\u2014let\u2019s use 
ahrefs.com\/blog for our example.<\/p>\n<p><em>Site Explorer &gt; Enter ahrefs.com\/blog &gt; Pages &gt; Top Content &gt; Export as .csv<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"509\" class=\"wp-image-17455\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/ahrefs-site-explorer-top-content.png\" alt=\"ahrefs site explorer top content\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/ahrefs-site-explorer-top-content.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/ahrefs-site-explorer-top-content-768x434.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/ahrefs-site-explorer-top-content-680x385.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> Don\u2019t export more than 1,000 rows for this. It won\u2019t work with this spreadsheet.&nbsp;<\/div>\n<p>Next, make a copy of <u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1xg6iO8_5bAAhQ9_vzzcC6Yu6VWX3jWoQzw67ypmB1d8\/copy\" rel=\"noopener\">this Google Sheet<\/a><\/u>&nbsp;then paste all data from the Top Content .csv export into cell A1 of the first tab (i.e. \u201c1. 
Ahrefs Export\u201d).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"772\" height=\"354\" class=\"wp-image-17458\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-content-analysis.gif\" alt=\"blog content analysis\"><\/p>\n<p>Now comes the scraping\u2026<\/p>\n<p>Open up one of the URLs from the \u201cContent URL\u201d column and locate the category under which the post was published.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"224\" class=\"wp-image-17459\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-post-category.png\" alt=\"blog post category\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-post-category.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-post-category-768x191.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-post-category-680x169.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>We now need to figure out the XPath for this HTML element, so right-click and hit \u201cInspect\u201d to view the&nbsp;HTML.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"180\" class=\"wp-image-17454\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-post-category.png\" alt=\"html post category\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-post-category.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-post-category-768x154.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/html-post-category-680x136.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>In this instance, we can see that the post category is contained within a &lt;div&gt; with the class \u201cpost-category\u201d, which is nested within the &lt;header&gt; tag. 
This means our XPath would&nbsp;be:<\/p>\n<p><em>\/\/header\/div[@class='post-category']<\/em><\/p>\n<p>Now that we know this, we can use <u><a href=\"https:\/\/www.screamingfrog.co.uk\/seo-spider\/\" target=\"_blank\" rel=\"noopener\">Screaming Frog<\/a><\/u>&nbsp;to scrape the post category for each post; here\u2019s how:<\/p>\n<ol>\n<li>Open Screaming Frog and go to <em>\u201cMode\u201d &gt; \u201cList\u201d<\/em>;<\/li>\n<li>Go to <em>\u201cConfiguration\u201d &gt; \u201cSpider\u201d<\/em>&nbsp;and uncheck all the boxes (<a href=\"https:\/\/imgur.com\/a\/ZjYdE\" target=\"_blank\" rel=\"noopener\">like this<\/a>);<\/li>\n<li>Go to \u201cConfiguration\u201d &gt; \u201cCustom\u201d &gt; \u201cExtraction\u201d &gt; \u201cExtractor 1\u201d and paste in your XPath (e.g. <em>\/\/header\/div[@class='post-category']<\/em>). Make sure you choose \u201cXPath\u201d as the scraper mode and \u201cExtract Text\u201d as the extractor mode (<a href=\"https:\/\/imgur.com\/a\/6TNJU\" target=\"_blank\" rel=\"noopener\">like this<\/a>);<\/li>\n<li>Copy\/paste all URLs from the \u201cContent URL\u201d column into Screaming Frog, and start the scrape.<\/li>\n<\/ol>\n<p>Once complete, head to the&nbsp;\u201cCustom\u201d tab, filter by \u201cExtraction\u201d and you\u2019ll see the extracted data for each&nbsp;URL.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"909\" height=\"435\" class=\"wp-image-17438\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extracted-data.png\" alt=\"screaming frog extracted data\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extracted-data.png 909w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extracted-data-768x368.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extracted-data-680x325.png 680w\" sizes=\"auto, (max-width: 909px) 100vw, 909px\"><\/p>\n<p>Hit \u201cExport\u201d, then copy all the data in the .csv 
into the next tab in the Google Sheet (i.e. \u201c2. SF extraction\u201d).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"752\" height=\"278\" class=\"wp-image-17441\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/sf-scrape.gif\" alt=\"sf scrape\"><\/p>\n<p>Go to the final tab in the Google Sheet (i.e. \u201cRESULTS\u201d) and you\u2019ll see a bunch of data + accompanying graphs.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"349\" class=\"wp-image-17437\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-complete.png\" alt=\"blog data complete\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-complete.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-complete-768x298.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/blog-data-complete-680x264.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> In order for this process to give actionable insights, it\u2019s important that your blog posts are well-categorized. I think it\u2019s fair to say that our categorization at Ahrefs could do with some additional work, so take the results above with a pinch of&nbsp;salt.&nbsp;<\/div>\n<p><u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1uSxFro7NvCkt9AzRArmXsVll0WE-wQtR6aL1trZlEDY\/copy\" target=\"_blank\" rel=\"noopener\">Here\u2019s the spreadsheet with sample data<\/a><\/u>.<\/p>\n<h2>5. 
Promote only the RIGHT kind of content on Reddit (by looking at what has already performed well)<\/h2>\n<p>Redditors despise&nbsp;self-promotion.<\/p>\n<p>In fact, any lazy attempts to self-promote via the platform are <em>usually<\/em>&nbsp;<u><a href=\"https:\/\/www.webpagefx.com\/marketing-guides\/how-to-use-reddit-for-marketing\/famous-reddit-marketing-failures.html\" target=\"_blank\" rel=\"noopener\">met with a barrage of mockery and foul language<\/a><\/u>.<\/p>\n<p>But here\u2019s the&nbsp;thing:<\/p>\n<p>Redditors have <em>nothing<\/em>&nbsp;against you sharing something with them; you just need to make sure it\u2019s something they <em>actually<\/em>&nbsp;care about.<\/p>\n<p>The best way to do this is to scrape (and analyze) what they liked in the past, then share more of that type of content with&nbsp;them.<\/p>\n<p>Here\u2019s the process:<\/p>\n<ol>\n<li>Choose a subreddit (e.g. \/r\/Entrepreneur);<\/li>\n<li>Scrape the top 1,000 posts of all&nbsp;time;<\/li>\n<li>Analyze the data and act accordingly (yep, I\u2019ve included a Google Sheet that does this for&nbsp;you!)<\/li>\n<\/ol>\n<p>OK, first things first, make a copy of <u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1Xo8Er8lPoFFJNpr4AWhlxHX-gNqbqMvcqTe_EieXgn8\/copy\" target=\"_blank\" rel=\"noopener\">this Google Sheet<\/a><\/u>&nbsp;and enter the subreddit you want to analyze. 
You should then see a formatted link to that subreddit\u2019s top posts appear alongside it.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"112\" class=\"wp-image-17452\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/Reddit_Analysis_-_Google_Sheets_and_Screaming_Frog_SEO_Spider_8_1_-_List_Mode__Pasted_.png\" alt=\"Reddit Analysis Google Sheets and Screaming Frog SEO Spider 8 1 List Mode Pasted\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/Reddit_Analysis_-_Google_Sheets_and_Screaming_Frog_SEO_Spider_8_1_-_List_Mode__Pasted_.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/Reddit_Analysis_-_Google_Sheets_and_Screaming_Frog_SEO_Spider_8_1_-_List_Mode__Pasted_-768x96.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/Reddit_Analysis_-_Google_Sheets_and_Screaming_Frog_SEO_Spider_8_1_-_List_Mode__Pasted_-680x85.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>This takes you to a page showing the top 25 posts of all time for that subreddit.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"435\" class=\"wp-image-17448\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/top-posts-reddit.png\" alt=\"top posts reddit\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/top-posts-reddit.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/top-posts-reddit-768x371.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/top-posts-reddit-680x329.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>However, this page only shows the top 25 posts. 
We\u2019re going to analyze the top 1,000, so we need to use a scraping tool to scrape multiple pages of results.<\/p>\n<p>Reddit actually makes this rather difficult, but <u><a href=\"https:\/\/www.import.io\/\" target=\"_blank\" rel=\"noopener\">Import.io<\/a><\/u>&nbsp;(free up to 500 queries per month, which is plenty) can do this with&nbsp;ease.<\/p>\n<p>Here\u2019s what we\u2019re going to scrape from these pages (hint: click the links to see an example of each data&nbsp;point):<\/p>\n<ul>\n<li><u><a href=\"https:\/\/imgur.com\/a\/iQXFq\" target=\"_blank\" rel=\"noopener\">Rank<\/a><\/u>;<\/li>\n<li><u><a href=\"https:\/\/imgur.com\/a\/rRegn\" target=\"_blank\" rel=\"noopener\">Score\/upvotes<\/a><\/u>;<\/li>\n<li><u><a href=\"https:\/\/imgur.com\/a\/mAKgl\" target=\"_blank\" rel=\"noopener\">Title<\/a><\/u>;<\/li>\n<li><u><a href=\"https:\/\/imgur.com\/a\/cmAPj\" target=\"_blank\" rel=\"noopener\">User submitted by<\/a><\/u>;<\/li>\n<li><u><a href=\"https:\/\/imgur.com\/a\/Q3Gxu\" target=\"_blank\" rel=\"noopener\">Comments<\/a><\/u>;<\/li>\n<li><u><a href=\"https:\/\/imgur.com\/a\/mMSab\" target=\"_blank\" rel=\"noopener\">Link flair<\/a><\/u>&nbsp;(optional as this is not available on all subreddits\u2026it\u2019s also <u><a href=\"https:\/\/imgur.com\/a\/f2bj9\" target=\"_blank\" rel=\"noopener\">more obvious on some subreddits than others<\/a><\/u>\u2014learn more <u><a href=\"https:\/\/www.reddit.com\/r\/help\/comments\/3tbuml\/whats_a_flair\/\" target=\"_blank\" rel=\"noopener\">here<\/a><\/u>)<\/li>\n<\/ul>\n<p>OK, let\u2019s stick with \/r\/Entrepreneur for our example\u2026<\/p>\n<p><em>Go to Import.io &gt; sign up &gt; new extractor &gt; paste in the link from the Google Sheet (shown above)<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"208\" class=\"wp-image-17443\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-url.png\" alt=\"import io url\" 
srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-url.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-url-768x177.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-url-680x157.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Click \u201cGo\u201d.<\/p>\n<p>Import.io will now work its magic and extract a bunch of data from the&nbsp;page.<\/p>\n<div class=\"sidenote\"><div class=\"sidenote-title\">Sidenote.<\/div> It does sometimes extract pointless data so it\u2019s worth deleting any columns that aren\u2019t needed within the \u201cedit\u201d tab. Just remember to keep the data mentioned above in the right&nbsp;order.&nbsp;<\/div>\n<p>Hit \u201cSave\u201d (but don\u2019t run it&nbsp;yet!)<\/p>\n<p>Right now, the extractor is only set up to scrape the top 25 posts. You need to add the other URLs (from the tab labeled \u201c2. MORE LINKS\u201d in the Google Sheet) to scrape the&nbsp;rest.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"353\" class=\"wp-image-17467\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/reddit-analysis-sheet.png\" alt=\"reddit analysis sheet\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/reddit-analysis-sheet.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/reddit-analysis-sheet-768x301.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/reddit-analysis-sheet-680x267.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Add these under the \u201cSettings\u201d tab for your extractor.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"525\" class=\"wp-image-17449\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-add-urls.png\" alt=\"import io add urls\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-add-urls.png 900w, 
https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-add-urls-768x448.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-add-urls-680x397.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Hit \u201cSave URLs\u201d then run the extractor.<\/p>\n<p>Download the .csv once complete.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"157\" class=\"wp-image-17469\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-done.png\" alt=\"import io done\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-done.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-done-768x134.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/import-io-done-680x119.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Copy\/paste all data from the .csv into the sheet labeled \u201c3. IMPORT.IO EXPORT\u201d in the spreadsheet.<\/p>\n<p>Finally, go to the \u201cRESULTS\u201d sheet and enter a keyword\u2014it will then kick back some neat stats showing how interested that subreddit is likely to be in your&nbsp;topic.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"227\" class=\"wp-image-17446\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/keyword-analysis-reddit.png\" alt=\"keyword analysis reddit\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/keyword-analysis-reddit.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/keyword-analysis-reddit-768x194.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/keyword-analysis-reddit-680x172.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p><u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1WHc1Cr5BRAdVqz3s9apKZ6lyUZlu_omYpMOD9jX8O60\/copy\" target=\"_blank\" rel=\"noopener\">Here\u2019s the spreadsheet with sample 
data<\/a><\/u>.<\/p>\n<h2>6. Build relationships with people who are already fans of your content<\/h2>\n<p>Most tweets will drive ZERO traffic to your website.<\/p>\n<p>That\u2019s why \u201cbegging for tweets\u201d from anyone and everyone is a terrible idea.<\/p>\n<p>However, that\u2019s not to say <em>all<\/em>&nbsp;tweets are worthless\u2014it\u2019s still worth reaching out to those who are likely to send <em>real<\/em>&nbsp;traffic to your website.<\/p>\n<p>Here\u2019s a workflow for doing this (note: it includes a bit of Twitter scraping):<\/p>\n<ol>\n<li>Scrape and add all Twitter mentions to a spreadsheet (using IFTTT);<\/li>\n<li>Scrape the number of followers for the people who\u2019ve shared a lot of your&nbsp;stuff;<\/li>\n<li>Find contact details, then reach out and build relationships with these people.<\/li>\n<\/ol>\n<p>OK, so first, make a copy of <u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/17SX9MN2O1M2Ek7Fr22s_HvpbHlKezFOQ8AvBiFX3cu4\/copy\" target=\"_blank\" rel=\"noopener\">this Google Sheet<\/a><\/u>.<\/p>\n<p><strong>IMPORTANT:<\/strong>&nbsp;You MUST make a copy of this on the root of your Google Drive (i.e. not in a subfolder). 
It MUST also be named exactly \u201cMy Twitter Mentions\u201d.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"89\" class=\"wp-image-17466\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/google-drive-my-twitter-mentions.png\" alt=\"google drive my twitter mentions\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/google-drive-my-twitter-mentions.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/google-drive-my-twitter-mentions-768x76.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/google-drive-my-twitter-mentions-680x67.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Next, turn <u><a href=\"https:\/\/ifttt.com\/applets\/wewzNmGa-twitter-mentions-ahrefs\" target=\"_blank\" rel=\"noopener\">this recipe<\/a><\/u>&nbsp;on within your IFTTT account (you\u2019ll need to connect your Twitter + Google Drive accounts to IFTTT in order to do&nbsp;this).<\/p>\n<p>What does this recipe do? Basically, every time someone mentions you on Twitter, it\u2019ll scrape the following information and add it to a new row in the spreadsheet:<\/p>\n<ul>\n<li>Twitter handle (of the person who mentioned you);<\/li>\n<li>Their tweet;<\/li>\n<li>Tweet link;<\/li>\n<li>Time\/date they tweeted<\/li>\n<\/ul>\n<p>And if you go to the second sheet in the spreadsheet (i.e. 
the one labeled \u201c1.Tweets\u201d), you\u2019ll see the people who\u2019ve mentioned you and tweeted a link of yours the highest number of&nbsp;times.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"899\" height=\"275\" class=\"wp-image-17470\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-mentions.png\" alt=\"twitter mentions\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-mentions.png 899w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-mentions-768x235.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-mentions-680x208.png 680w\" sizes=\"auto, (max-width: 899px) 100vw, 899px\"><\/p>\n<p>But, the fact that they\u2019ve mentioned you a number of times doesn\u2019t necessarily indicate that they\u2019ll drive any <em>real<\/em>&nbsp;traffic to your website.<\/p>\n<p>So, you now want to scrape the number of followers each of these people has.<\/p>\n<p>You can do this with CSS selectors using Screaming Frog.<\/p>\n<p>Just set your search depth to \u201c0\u201d (see <a href=\"https:\/\/imgur.com\/a\/T2N9V\" target=\"_blank\" rel=\"noopener\">here<\/a>), then use these settings under the custom extractor:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"156\" class=\"wp-image-17465\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extractor-settings.png\" alt=\"screaming frog extractor settings\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extractor-settings.png 900w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extractor-settings-768x133.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-extractor-settings-680x118.png 680w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\"><\/p>\n<p>Here\u2019s each CSS selector (for clarification):<\/p>\n<ol>\n<li><strong>Twitter 
Name:<\/strong>&nbsp;h1<\/li>\n<li><strong>Twitter Handle:<\/strong>&nbsp;h2 &gt; a &gt; span &gt;&nbsp;b<\/li>\n<li><strong>Followers:<\/strong>&nbsp;li.ProfileNav-item.ProfileNav-item--followers &gt; a &gt; span.ProfileNav-value<\/li>\n<li><strong>Website<\/strong>: div.ProfileHeaderCard &gt; div.ProfileHeaderCard-url &gt; span.ProfileHeaderCard-urlText.u-dir &gt;&nbsp;a<\/li>\n<\/ol>\n<p>Copy\/paste all the Twitter links from the spreadsheet into Screaming Frog and run&nbsp;it.<\/p>\n<p>Once finished, go&nbsp;to:<\/p>\n<p><em>Custom &gt; Extraction &gt; Export<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1021\" height=\"104\" class=\"wp-image-17444\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-custom-extraction.png\" alt=\"screaming frog custom extraction\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-custom-extraction.png 1021w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-custom-extraction-768x78.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/screaming-frog-custom-extraction-680x69.png 680w\" sizes=\"auto, (max-width: 1021px) 100vw, 1021px\"><\/p>\n<p>Open the exported .csv, then copy\/paste all the data into the next tab in the sheet (i.e. the one labeled \u201c2. SF Export\u201d).<\/p>\n<p>Lastly, go to the final tab (i.e. \u201c3. 
RESULTS\u201d) and you\u2019ll see a list of everyone who\u2019s mentioned you along with a bunch of other information including:<\/p>\n<ul>\n<li># of times they tweeted about&nbsp;you;<\/li>\n<li># of followers;<\/li>\n<li>Their website (where applicable)<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"814\" height=\"244\" class=\"wp-image-17472\" src=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-results.png\" alt=\"twitter results\" srcset=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-results.png 814w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-results-768x230.png 768w, https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/twitter-results-680x204.png 680w\" sizes=\"auto, (max-width: 814px) 100vw, 814px\"><\/p>\n<p>Because these people have already shared your content in the past, and also have a good number of followers, it\u2019s worth reaching out and building relationships with&nbsp;them.<\/p>\n<p><u><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1amjQX7Y9ZcS7VkjucE9PBmvrKvhQ0VAjdGx0jg-d_uo\/copy\" target=\"_blank\" rel=\"noopener\">Here\u2019s the spreadsheet with sample data<\/a><\/u>.<\/p>\n<h2>Final thoughts<\/h2>\n<p>Web scraping is <em>crazily<\/em>&nbsp;powerful.<\/p>\n<p>All you need is some basic XPath\/CSS\/Regex knowledge (along with a web scraping tool, of course) and it\u2019s possible to scrape <em>anything<\/em>&nbsp;from <em>any<\/em>&nbsp;website in a matter of seconds.<\/p>\n<p>I\u2019m a firm believer that the best way to learn is by <em>doing<\/em>, so I highly recommend that you spend some time replicating the experiments above. 
This will also teach you to pay attention to things that could easily be automated with web scraping in future.<\/p>\n<p>So, play around with the tools\/ideas above and let me know what you come up with in the comments section below&nbsp;\ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If so, you\u2019re already&nbsp;familiar with web scraping. But, while this can certainly be useful, there\u2019s much more to web scraping than grabbing a few title tags\u2014it can actually be used to extract any&nbsp;data from any&nbsp;web page in seconds. The question<span class=\"ellipsis\">\u2026<\/span><\/p>\n<div class=\"read-more\">Read more \u203a<\/div>\n<p><!-- end of .read-more --><\/p>\n","protected":false},"author":114,"featured_media":17508,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"wp_typography_post_enhancements_disabled":false,"footnotes":""},"categories":[390],"tags":[],"coauthors":[336],"class_list":["post-17473","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-marketing","odd"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>6 Actionable Web Scraping Hacks for White Hat Marketers<\/title>\n<meta name=\"description\" content=\"Web scraping allows you to extract any data from any web page in seconds. 
Take these 6 practical applications of web scraping and use them in your marketing\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"6 Actionable Web Scraping Hacks for White Hat Marketers\" \/>\n<meta property=\"og:description\" content=\"Web scraping allows you to extract any data from any web page in seconds. Take these 6 practical applications of web scraping and use them in your marketing\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/\" \/>\n<meta property=\"og:site_name\" content=\"SEO Blog by Ahrefs\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Ahrefs\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-10-17T07:45:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-07-19T08:32:33+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/FB-web-scraping-for-marketers.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"952\" \/>\n\t<meta property=\"og:image:height\" content=\"498\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Joshua Hardwick\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@JoshuaCHardwick\" \/>\n<meta name=\"twitter:site\" content=\"@ahrefs\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/\"},\"author\":{\"name\":\"Joshua Hardwick\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#\\\/schema\\\/person\\\/e6a89cbde8e750d22996aa26e213e712\"},\"headline\":\"6 Actionable Web Scraping Hacks for White Hat Marketers\",\"datePublished\":\"2017-10-17T07:45:50+00:00\",\"dateModified\":\"2023-07-19T08:32:33+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/\"},\"wordCount\":3614,\"commentCount\":31,\"publisher\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2017\\\/10\\\/FB-web-scraping-for-marketers.jpg\",\"articleSection\":[\"General Marketing\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/\",\"url\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/\",\"name\":\"6 Actionable Web Scraping Hacks for White Hat 
Marketers\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2017\\\/10\\\/FB-web-scraping-for-marketers.jpg\",\"datePublished\":\"2017-10-17T07:45:50+00:00\",\"dateModified\":\"2023-07-19T08:32:33+00:00\",\"description\":\"Web scraping allows you to extract any data from any web page in seconds. Take these 6 practical applications of web scraping and use them in your marketing\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/web-scraping-for-marketers\\\/#primaryimage\",\"url\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2017\\\/10\\\/FB-web-scraping-for-marketers.jpg\",\"contentUrl\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2017\\\/10\\\/FB-web-scraping-for-marketers.jpg\",\"width\":952,\"height\":498},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/\",\"name\":\"SEO Blog by Ahrefs\",\"description\":\"Link Building Strategies &amp; SEO 
Tips\",\"publisher\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#organization\",\"name\":\"Ahrefs\",\"url\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/06\\\/ahrefs-logo.png\",\"contentUrl\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/06\\\/ahrefs-logo.png\",\"width\":2048,\"height\":768,\"caption\":\"Ahrefs\"},\"image\":{\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/Ahrefs\\\/\",\"https:\\\/\\\/x.com\\\/ahrefs\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/ahrefs\\\/\",\"https:\\\/\\\/www.youtube.com\\\/c\\\/ahrefscom\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/#\\\/schema\\\/person\\\/e6a89cbde8e750d22996aa26e213e712\",\"name\":\"Joshua Hardwick\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/10\\\/meme.jpg109e89523fcea81015d3cc08c79f9036\",\"url\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/10\\\/meme.jpg\",\"contentUrl\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/10\\\/meme.jpg\",\"caption\":\"Joshua Hardwick\"},\"description\":\"Head of Content @ Ahrefs (or, in plain English, I'm the guy responsible for ensuring that every blog post we publish is 
EPIC).\",\"sameAs\":[\"https:\\\/\\\/x.com\\\/JoshuaCHardwick\"],\"url\":\"https:\\\/\\\/ahrefs.com\\\/blog\\\/author\\\/joshua-hardwick\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"6 Actionable Web Scraping Hacks for White Hat Marketers","description":"Web scraping allows you to extract any data from any web page in seconds. Take these 6 practical applications of web scraping and use them in your marketing","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/","og_locale":"en_US","og_type":"article","og_title":"6 Actionable Web Scraping Hacks for White Hat Marketers","og_description":"Web scraping allows you to extract any data from any web page in seconds. Take these 6 practical applications of web scraping and use them in your marketing","og_url":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/","og_site_name":"SEO Blog by Ahrefs","article_publisher":"https:\/\/www.facebook.com\/Ahrefs\/","article_published_time":"2017-10-17T07:45:50+00:00","article_modified_time":"2023-07-19T08:32:33+00:00","og_image":[{"width":952,"height":498,"url":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/FB-web-scraping-for-marketers.jpg","type":"image\/jpeg"}],"author":"Joshua Hardwick","twitter_card":"summary_large_image","twitter_creator":"@JoshuaCHardwick","twitter_site":"@ahrefs","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/#article","isPartOf":{"@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/"},"author":{"name":"Joshua Hardwick","@id":"https:\/\/ahrefs.com\/blog\/#\/schema\/person\/e6a89cbde8e750d22996aa26e213e712"},"headline":"6 Actionable Web Scraping Hacks for White Hat 
Marketers","datePublished":"2017-10-17T07:45:50+00:00","dateModified":"2023-07-19T08:32:33+00:00","mainEntityOfPage":{"@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/"},"wordCount":3614,"commentCount":31,"publisher":{"@id":"https:\/\/ahrefs.com\/blog\/#organization"},"image":{"@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/#primaryimage"},"thumbnailUrl":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/FB-web-scraping-for-marketers.jpg","articleSection":["General Marketing"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/","url":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/","name":"6 Actionable Web Scraping Hacks for White Hat Marketers","isPartOf":{"@id":"https:\/\/ahrefs.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/#primaryimage"},"image":{"@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/#primaryimage"},"thumbnailUrl":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/FB-web-scraping-for-marketers.jpg","datePublished":"2017-10-17T07:45:50+00:00","dateModified":"2023-07-19T08:32:33+00:00","description":"Web scraping allows you to extract any data from any web page in seconds. 
Take these 6 practical applications of web scraping and use them in your marketing","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ahrefs.com\/blog\/web-scraping-for-marketers\/#primaryimage","url":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/FB-web-scraping-for-marketers.jpg","contentUrl":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2017\/10\/FB-web-scraping-for-marketers.jpg","width":952,"height":498},{"@type":"WebSite","@id":"https:\/\/ahrefs.com\/blog\/#website","url":"https:\/\/ahrefs.com\/blog\/","name":"SEO Blog by Ahrefs","description":"Link Building Strategies &amp; SEO Tips","publisher":{"@id":"https:\/\/ahrefs.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ahrefs.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ahrefs.com\/blog\/#organization","name":"Ahrefs","url":"https:\/\/ahrefs.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ahrefs.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2023\/06\/ahrefs-logo.png","contentUrl":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2023\/06\/ahrefs-logo.png","width":2048,"height":768,"caption":"Ahrefs"},"image":{"@id":"https:\/\/ahrefs.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Ahrefs\/","https:\/\/x.com\/ahrefs","https:\/\/www.linkedin.com\/company\/ahrefs\/","https:\/\/www.youtube.com\/c\/ahrefscom"]},{"@type":"Person","@id":"https:\/\/ahrefs.com\/blog\/#\/schema\/person\/e6a89cbde8e750d22996aa26e213e712","name":"Joshua 
Hardwick","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2019\/10\/meme.jpg109e89523fcea81015d3cc08c79f9036","url":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2019\/10\/meme.jpg","contentUrl":"https:\/\/ahrefs.com\/blog\/wp-content\/uploads\/2019\/10\/meme.jpg","caption":"Joshua Hardwick"},"description":"Head of Content @ Ahrefs (or, in plain English, I'm the guy responsible for ensuring that every blog post we publish is EPIC).","sameAs":["https:\/\/x.com\/JoshuaCHardwick"],"url":"https:\/\/ahrefs.com\/blog\/author\/joshua-hardwick\/"}]}},"as_json":null,"as_tables":null,"as_images":null,"json_reviewers":[],"as_coauthors":[],"as_post_info":null,"as_sticky":null,"as_hreflang":null,"_links":{"self":[{"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/posts\/17473","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/users\/114"}],"replies":[{"embeddable":true,"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/comments?post=17473"}],"version-history":[{"count":0,"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/posts\/17473\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/media\/17508"}],"wp:attachment":[{"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/media?parent=17473"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/categories?post=17473"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/tags?post=17473"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/ahrefs.com\/blog\/wp-json\/wp\/v2\/coauthors?post=17473"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}