6 Actionable Web Scraping Hacks for White Hat Marketers

Have you ever used a program like Screaming Frog to extract metadata (e.g. title/description/etc.) from a bunch of web pages in bulk? If so, you're already familiar with web scraping.

But while this can certainly be useful, there's much more to web scraping than grabbing a few title tags: it can be used to extract any data from any web page in seconds. The question is: what data would you need to extract, and why?

In this post, I'll aim to answer these questions by showing you 6 web scraping hacks:

- How to find content evangelists in website comments
- How to collect prospects' data from expert roundups
- How to remove junk guest post prospects
- How to analyze the performance of your blog categories
- How to choose the right content for Reddit
- How to build relationships with those who love your content

I've also automated as much of the process as possible to make things less daunting for those new to web scraping. But first, let's talk a bit more about web scraping and how it works.

A basic introduction to web scraping

Let's assume that you want to extract the titles from your competitors' 50 most recent blog posts. You could visit each website individually, check the HTML, locate the title tag, then copy/paste that data to wherever you needed it (e.g. a spreadsheet).
But this would be very time-consuming and boring. That's why it's much easier to scrape the data we want using a computer application (i.e. a web scraper). In general, there are two ways to scrape the data you're looking for:

- Using a path-based system (e.g. XPath/CSS selectors);
- Using a search pattern (e.g. Regex).
XPath/CSS (i.e. the path-based system) is the best way to scrape most types of data. For example, let's assume that we wanted to scrape the h1 tag from this document:
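To have something concrete to experiment with, here's a minimal Python sketch using only the standard library. The markup is an assumed stand-in for the example document (one h1, a fruit list and a second list); the heading text and fruit names are placeholders of mine, not the original content.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the example document. It is well-formed,
# so the standard-library XML parser can read it directly.
DOC = """<html>
  <body>
    <h1>Example page</h1>
    <ul class="fruit">
      <li>Apple</li>
      <li>Orange</li>
      <li>Banana</li>
    </ul>
    <ul>
      <li>This is the first item in the list</li>
      <li>This is the second item in the list</li>
      <li>This is the third item in the list</li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# Walk the full path: html -> body -> h1 (the path is relative to <html>).
h1_by_path = root.find("body/h1").text

# Or look for h1 elements anywhere in the document.
h1_anywhere = [el.text for el in root.iter("h1")]

print(h1_by_path)
print(h1_anywhere)
```

ElementTree only understands a small subset of XPath and needs well-formed markup, so treat this as a way to try out paths rather than as a production scraper.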
We can see that the h1 is nested in the body tag, which is nested under the html tag. Here's how to write this as XPath/CSS:

- XPath: /html/body/h1
- CSS selector: html > body > h1

Sidenote. Because there is only one h1 tag in the document, we don't actually need to give the full path. Instead, we can just tell the scraper to find all instances of h1 throughout the document with //h1 for XPath, and simply h1 for CSS.

But what if we wanted to scrape the list of fruit instead?
You might guess something like //ul/li (XPath) or ul > li (CSS), right? Sure, this would work. But because there are actually two unordered lists (ul) in the document, this would scrape both the list of fruit AND all list items in the second list. However, we can reference the class of the ul to grab only what we want:

- XPath: //ul[@class='fruit']/li
- CSS selector: ul.fruit > li

Regex, on the other hand, uses search patterns (rather than paths) to find every matching instance within a document. This is useful whenever path-based searches won't cut the mustard. For example, let's assume that we wanted to scrape the words "first", "second", and "third" from the other unordered list in our document.
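Here's a rough Python sketch of both approaches, again using only the standard library. The markup is an assumed stand-in for the example document, so treat the fruit names and the exact list wording as placeholders. It first shows why the class filter matters, then pulls the ordinal words out of the second list with a search pattern.

```python
import re
import xml.etree.ElementTree as ET

# Assumed stand-in for the example document (placeholder content).
DOC = """<html>
  <body>
    <h1>Example page</h1>
    <ul class="fruit">
      <li>Apple</li>
      <li>Orange</li>
      <li>Banana</li>
    </ul>
    <ul>
      <li>This is the first item in the list</li>
      <li>This is the second item in the list</li>
      <li>This is the third item in the list</li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(DOC)

# Path-based, no filter: every <li> under any <ul> (both lists).
all_items = [li.text for li in root.findall(".//ul/li")]

# Path-based, filtered on the class attribute: only the fruit.
fruit = [li.text for li in root.findall(".//ul[@class='fruit']/li")]

# Pattern-based: capture only the text between the two fixed phrases.
# Each <li> sits on its own line, and "." does not match newlines by
# default, so the greedy (.*) cannot run across list items.
ordinals = re.findall(r"<li>This is the (.*) item in the list</li>", DOC)

print(len(all_items), fruit, ordinals)
```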
There's no way to grab just these words using path-based queries, but we could use this regex pattern to match what we need:

<li>This is the (.*) item in the list</li>

This would search the document for list items (li) containing "This is the [ANY WORD] item in the list" AND extract only [ANY WORD] from that phrase.

Sidenote. Because regex doesn't use the structured nature of HTML/XML files, results are often less accurate than they are with CSS/XPath. You should only use Regex when XPath/CSS isn't a viable option.

OK, let's get started with a few web scraping hacks!

1. Find evangelists who may be interested in reading your new content by scraping existing website comments

Most people who comment on WordPress blogs will do so using their name and website.
You can spot these in any comments section: they're the commenters whose names are hyperlinked.
But what use is this?

Well, let's assume that you've just published a post about X and you're looking for people who would be interested in reading it. Here's a simple way to find them (that involves a bit of scraping):

- Find a similar post on your website (e.g. if your new post is about link building, find a previous post you wrote about SEO/link building; just make sure it has a decent number of comments);
- Scrape the names + websites of all commenters;
- Reach out and tell them about your new content.

Sidenote. This works well because these people (a) are existing fans of your work, and (b) loved one of your previous posts on the topic so much that they left a comment. So, while this is still cold pitching, the likelihood of them being interested in your content is much higher than when pitching to complete strangers.

Here's how to scrape them:

Go to the comments section, then right-click any top-level comment and select "Scrape similar" (note: you will need to install the Scraper Chrome Extension for this).
This should bring up a neat scraped list of commenters' names + websites.
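If you'd rather script this step, here's a minimal sketch in Python using only the standard library. It assumes the theme marks up commenter links the way many WordPress themes do (an anchor with class "url" holding the commenter's name); that markup, the CommenterParser class and the sample HTML are my assumptions for illustration, so check your own page's HTML first.

```python
from html.parser import HTMLParser


class CommenterParser(HTMLParser):
    """Collect (name, website) pairs from links that carry class="url"."""

    def __init__(self):
        super().__init__()
        self.commenters = []  # finished (name, website) pairs
        self._href = None     # href of the link currently being read
        self._text = []       # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "url" in classes and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.commenters.append((name, self._href))
            self._href = None


# Hypothetical comment markup; the second commenter left no website.
SAMPLE = """
<ol class="comment-list">
  <li class="comment">
    <cite class="fn"><a href="https://example.com" class="url">Jane Doe</a></cite>
    <p>Great post!</p>
  </li>
  <li class="comment">
    <cite class="fn">Anonymous</cite>
    <p>Thanks.</p>
  </li>
  <li class="comment">
    <cite class="fn"><a href="https://example.org/blog" class="url">Sam Lee</a></cite>
    <p>Very useful.</p>
  </li>
</ol>
"""

parser = CommenterParser()
parser.feed(SAMPLE)
print(parser.commenters)
```

Commenters who didn't leave a website are skipped, which matches the goal here: you only want people you can actually look up.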
Make a copy of this Google Sheet, then hit "Copy to clipboard" and paste the scraped data into the tab labeled "1. START HERE".

Sidenote. If you have multiple pages of comments, you'll have to repeat this process for each.

Go to the tab labeled "2. NAMES + WEBSITES" and use the Google Sheets hunter.io add-on to find the email addresses for your prospects.
You can then reach out to these people and tell them about your new/updated post.

IMPORTANT: We advise being very careful with this strategy. Remember, these people may have left a comment, but they didn't opt into your email list. That could have been for a number of reasons, but chances are they were only really interested in this post. We therefore recommend using this strategy only to tell commenters about updates to the post and/or other new posts that are similar. In other words, don't email people about stuff they're unlikely to care about!

Here's the spreadsheet with sample data.

2. Find people willing to contribute to your posts by scraping existing expert roundups

Expert roundups are WAY overdone. But this doesn't mean that including advice/insights/quotes from knowledgeable industry figures within your content is a bad idea; it can add a lot of value. In fact, we did exactly this in our recent guide to learning SEO.
But while it's easy to find experts you may want to reach out to, it's important to remember that not everyone responds positively to such requests. Some people are too busy, while others simply despise all forms of cold outreach.

So, rather than guessing who might be interested in providing a quote/opinion/etc. for your upcoming post, let's instead reach out to those with a track record of responding positively to such requests by:

- Finding existing expert roundups (or any post containing expert advice/opinions/etc.) in your industry;
- Scraping the names + websites of all contributors;
- Building a list of people who are most likely to respond to your request.

Let's give it a shot with this expert roundup post from Nikolay Stoyanov.

First, we need to understand the structure/format of the data we want to scrape. In this instance, it appears to be a full name followed by a hyperlinked website.
HTML-wise, this is all wrapped in a tag.
Sidenote. You can inspect the HTML for any on-page element by right-clicking on it and hitting "Inspect" in Chrome.

Because we want both the names (i.e. text) and websites (i.e. links) from within this tag, we're going to use the Scraper extension to scrape for text() and a/@href using XPath, like this:
Don't worry if your data is a little messy (as it is above); this will get cleaned up automatically in a second.

Sidenote. For those unfamiliar with XPath syntax, I recommend using this cheat sheet. Assuming you have basic HTML knowledge, this should be enough to help you understand how to extract the data you want from a web page.

Next, make a copy of this Google Sheet, hit "Copy to clipboard", then paste the raw data into the first tab (i.e. "1. START HERE").
Repeat this process for as many roundup posts as you like.

Finally, navigate to the second tab in the Google Sheet (i.e. "2. NAMES + DOMAINS") and you'll see a neat list of all contributors ordered by # of occurrences.
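The counting step is easy to reproduce outside the spreadsheet. Here's a small Python sketch of the idea, not the sheet's actual formulas: it normalizes each website to a bare domain, then counts how often each (name, domain) pair shows up across roundups. The function names and sample rows are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def rank_contributors(rows):
    """Count each (name, domain) pair and return them most frequent first."""
    counts = Counter((name.strip(), domain_of(url)) for name, url in rows)
    return counts.most_common()


# Hypothetical rows scraped from several roundup posts.
rows = [
    ("Jane Doe", "https://www.example.com/about"),
    ("Sam Lee", "http://example.org"),
    ("Jane Doe", "https://example.com/"),
    ("Jane Doe", "https://example.com/blog"),
    ("Sam Lee", "https://www.example.org/contact"),
    ("Ana Ruiz", "https://example.net"),
]

ranked = rank_contributors(rows)
for (name, domain), occurrences in ranked:
    print(name, domain, occurrences)
```

People who appear in several roundups float to the top, and those are the ones with a track record of saying yes.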
Here are 9 ways to find the email addresses for everyone on your list.

IMPORTANT: Always research any prospects before reaching out with questions/requests. And DON'T spam them!

Here's the spreadsheet with sample data.

3. Remove junk guest post prospects by scraping RSS feeds

Blogs that haven't published anything for a while are unlikely to respond to guest post pitches. Why? Because the blogger has probably lost interest in their blog. That's why I always check the publish dates on their few most recent posts before pitching them.
(If they haven't posted for more than a few weeks, I don't bother contacting them.)

However, with a bit of scraping know-how, this process can be automated. Here's how:

- Find the RSS feed for the blog;
- Scrape the pubDate from the feed.

Most blogs' RSS feeds can be found at domain.com/feed/, which makes finding the RSS feed for a list of blogs as simple as adding /feed/ to the URL. For example, the RSS feed for the Ahrefs blog can be found at https://ahrefs.com/blog/feed/

Sidenote. This won't work for every blog. Some bloggers use other services such as FeedBurner to create RSS feeds. It will, however, work for most.

You can then use XPath within the IMPORTXML function in Google Sheets to scrape the pubDate element:

=IMPORTXML("https://ahrefs.com/blog/feed/", "//pubDate")
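Outside Google Sheets, the same idea can be sketched in Python with the standard library. This is an illustration under assumptions, not how the spreadsheet works: the feed below is a made-up sample embedded as a string so the example runs offline, and the publish_dates and average_gap_days helpers are hypothetical names.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up RSS sample; a real feed would be downloaded from /feed/.
FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item><title>Post C</title><pubDate>Wed, 31 May 2017 10:00:00 +0000</pubDate></item>
    <item><title>Post B</title><pubDate>Wed, 24 May 2017 10:00:00 +0000</pubDate></item>
    <item><title>Post A</title><pubDate>Wed, 10 May 2017 10:00:00 +0000</pubDate></item>
  </channel>
</rss>"""


def publish_dates(feed_xml):
    """Return every pubDate in the feed as a datetime, newest first."""
    root = ET.fromstring(feed_xml)
    dates = [parsedate_to_datetime(el.text) for el in root.iter("pubDate")]
    return sorted(dates, reverse=True)


def average_gap_days(dates):
    """Average number of days between consecutive posts (None if < 2 posts)."""
    if len(dates) < 2:
        return None
    gaps = [
        (newer - older).total_seconds() / 86400
        for newer, older in zip(dates, dates[1:])
    ]
    return sum(gaps) / len(gaps)


dates = publish_dates(FEED)
latest = dates[0]
avg_gap = average_gap_days(dates)
print(latest.date(), avg_gap)
```

To run this against a live blog, you would fetch the feed text first (for example with urllib.request.urlopen) and pass it to publish_dates.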
The formula scrapes every pubDate element in the RSS feed, giving you a list of publishing dates for the 5 to 10 most recent posts on that blog.

But how do you do this for an entire list of blogs? Well, I've made another Google Sheet that automates the process for you: just paste a list of blog URLs (e.g. https://ahrefs.com/blog) into the first tab (i.e. "1. ENTER BLOG URLs") and you should see something like this appear in the "RESULTS" tab:
The sheet tells you:

- The date of the most recent post;
- How many days/weeks/months ago that was;
- The average # of days/weeks/months between posts (i.e. how often they post, on average).

This is super-useful information for choosing who to pitch guest posts to. For example, you can see that we publish a new post every 11 days on average, meaning that Ahrefs would definitely be a great blog to pitch to if you were in the SEO/marketing industry.

Here's the spreadsheet with sample data.

Recommended reading: An In-Depth Look at Guest Blogging in 2016 (Case Studies, Data & Tips)

4. Find out what type of content performs best on your blog by scraping post categories

Many bloggers will have a general sense of what resonates with their audience. But as an SEO/marketer, I prefer to rely on cold hard data. When it comes to blog content, data can help answer questions that aren't instantly obvious, such as:

- Do some topics get shared more than others?
- Are there specific topics that attract more backlinks than others?
- Are some authors more popular than others?

In this section, I'll show you exactly how to answer these questions for your blog by combining a single Ahrefs export with a simple scrape. You'll even be able to auto-generate visual data representations like this:
Here's the process:

- Export the Top Content report from Ahrefs Site Explorer;
- Scrape categories for all the blog posts;
- Analyze the data in Google Sheets (hint: I've included a template that does this automagically!)

To begin, we need to grab the Top Content report from Ahrefs. Let's use ahrefs.com/blog for our example.

Site Explorer > Enter ahrefs.com/blog > Pages > Top Content > Export as .csv
Sidenote. Don't export more than 1,000 rows for this. It won't work with this spreadsheet.

Next, make a copy of this Google Sheet, then paste all data from the Top Content .csv export into cell A1 of the first tab (i.e. "1. Ahrefs Export").
Now comes the scraping.

Open up one of the URLs from the "Content URL" column and locate the category under which the post was published.
We now need to figure out the XPath for this HTML element, so right-click and hit Inspect to view the HTML.
In this instance, we can see that the post category is contained within a div with the class "post-category", which is nested within the header tag. This means our XPath would be:

//header/div[@class='post-category']

Now that we know this, we can use Screaming Frog to scrape the post category for each post. Here's how:

- Open Screaming Frog and go to Mode > List;
- Go to Configuration > Spider and uncheck all the boxes (like this);
- Go to Configuration > Custom > Extraction > Extractor 1 and paste in your XPath (e.g. //header/div[@class='post-category']). Make sure you choose "XPath" as the scraper mode and "Extract Text" as the extractor mode (like this);
- Copy/paste all URLs from the "Content URL" column into Screaming Frog, and start the scrape.

Once complete, head to the "Custom" tab, filter by "Extraction" and you'll see the extracted data for each URL.
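For a quick sanity check of an XPath like this before running a full crawl, you can try it against a snippet of markup in Python. The page below is a hypothetical, well-formed stand-in (real pages are rarely valid XML, so this is only a way to test the path logic, not a replacement for Screaming Frog).

```python
import xml.etree.ElementTree as ET

# Hypothetical post markup: one category inside <header>, and a
# look-alike div elsewhere on the page that should NOT be matched.
PAGE = """<html>
  <body>
    <header>
      <div class="post-category">Link Building</div>
      <h1>Example post title</h1>
    </header>
    <div class="post-category">Sidebar widget</div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative, so //header/... becomes .//header/...
matches = root.findall(".//header/div[@class='post-category']")
categories = [el.text.strip() for el in matches]

print(categories)
```

Anchoring the path on header is what keeps the second div out of the results; a bare //div[@class='post-category'] would pick up both.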
Hit "Export", then copy all the data in the .csv into the next tab in the Google Sheet (i.e. "2. SF extraction").
Go to the final tab in the Google Sheet (i.e. "RESULTS") and you'll see a bunch of data + accompanying graphs.
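If you're curious what that analysis boils down to, here's a minimal Python sketch of the general idea: join the two exports on URL, then average each metric per category. It is not the template's actual logic. Apart from "Content URL", the column names and all the sample values are assumptions of mine, so adjust them to match your real exports.

```python
import csv
import io
from collections import defaultdict

# Hypothetical slice of the Ahrefs export (column names assumed).
TOP_CONTENT_CSV = """Content URL,Total Shares,Referring Domains
https://example.com/post-a,120,10
https://example.com/post-b,80,30
https://example.com/post-c,40,5
"""

# Hypothetical slice of the scraper export (column names assumed).
EXTRACTION_CSV = """Address,Category
https://example.com/post-a,SEO
https://example.com/post-b,Link Building
https://example.com/post-c,SEO
"""


def averages_by_category(top_content_csv, extraction_csv):
    """Join the two exports on URL and average each metric per category."""
    category_of = {
        row["Address"]: row["Category"]
        for row in csv.DictReader(io.StringIO(extraction_csv))
    }
    totals = defaultdict(lambda: {"posts": 0, "shares": 0, "domains": 0})
    for row in csv.DictReader(io.StringIO(top_content_csv)):
        bucket = totals[category_of.get(row["Content URL"], "Uncategorized")]
        bucket["posts"] += 1
        bucket["shares"] += int(row["Total Shares"])
        bucket["domains"] += int(row["Referring Domains"])
    return {
        category: {
            "posts": b["posts"],
            "avg_shares": b["shares"] / b["posts"],
            "avg_domains": b["domains"] / b["posts"],
        }
        for category, b in totals.items()
    }


report = averages_by_category(TOP_CONTENT_CSV, EXTRACTION_CSV)
for category, stats in report.items():
    print(category, stats)
```

Averages rather than totals are used so that a category with lots of posts doesn't win just by volume.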
Sidenote. In order for this process to give actionable insights, it's important that your blog posts are well categorized. I think it's fair to say that our categorization at Ahrefs could do with some additional work, so take the results above with a pinch of salt.

Here's the spreadsheet with sample data.

5. Promote only the RIGHT kind of content on Reddit (by looking at what has already performed well)

Redditors despise self-promotion. In fact, any lazy attempts to self-promote via the platform are usually met with a barrage of mockery and foul language.

But here's the thing: Redditors have nothing against you sharing something with them; you just need to make sure it's something they actually care about. The best way to do this is to scrape (and analyze) what they liked in the past, then share more of that type of content with them. Here's the process:

- Choose a subreddit (e.g. /r/Entrepreneur);
- Scrape the top 1,000 posts of all time;
- Analyze the data and act accordingly (yep, I've included a Google Sheet that does this for you!)

OK, first things first: make a copy of this Google Sheet and enter the subreddit you want to analyze. You should then see a formatted link to that subreddit's top posts appear alongside it.
This takes you to a page showing the top 25 posts of all time for that subreddit.
However, this page only shows the top 25 posts. We're going to analyze the top 1,000, so we need to use a scraping tool to scrape multiple pages of results. Reddit actually makes this rather difficult, but Import.io (free up to 500 queries per month, which is plenty) can do this with ease.

Here's what we're going to scrape from these pages (hint: click the links to see an example of each data point):

OK, let's stick with /r/Entrepreneur for our example.

Go to Import.io > sign up > new extractor > paste in the link from the Google Sheet (shown above)
Click Go. Import.io will now work its magic and extract a bunch of data from the page.
Sidenote. It does sometimes extract pointless data, so it's worth deleting any columns that aren't needed within the "edit" tab. Just remember to keep the data mentioned above in the right order.

Hit "Save" (but don't run it yet!). Right now, the extractor is only set up to scrape the top 25 posts. You need to add the other URLs (from the tab labeled "2. MORE LINKS" in the Google Sheet) to scrape the rest.
Add these under the Settings tab for your extractor.
Hit Save URLs then run the extractor. Download the .csv once complete.
Copy/paste all data from the .csv into the sheet labeled "3. IMPORT.IO EXPORT" in the spreadsheet.

Finally, go to the "RESULTS" sheet and enter a keyword. It will then kick back some neat stats showing how interested that subreddit is likely to be in your topic.
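To give a feel for the kind of stats involved, here's a rough Python sketch. It is my own illustration rather than the sheet's formulas: it assumes each scraped post has a title and an upvote count, and compares posts mentioning a keyword against the subreddit's top posts overall. The keyword_stats function, field names and sample posts are all hypothetical.

```python
def keyword_stats(posts, keyword):
    """Compare posts whose title contains the keyword against all posts."""
    if not posts:
        return None
    needle = keyword.lower()
    # Plain substring match: "seo" would also match inside longer words.
    matching = [p for p in posts if needle in p["title"].lower()]
    avg_all = sum(p["upvotes"] for p in posts) / len(posts)
    avg_matching = (
        sum(p["upvotes"] for p in matching) / len(matching) if matching else 0.0
    )
    return {
        "matching_posts": len(matching),
        "share_of_top_posts": len(matching) / len(posts),
        "avg_upvotes_matching": avg_matching,
        "avg_upvotes_all": avg_all,
    }


# Hypothetical scraped rows (titles and upvote counts are made up).
posts = [
    {"title": "How I grew my SaaS with SEO", "upvotes": 900},
    {"title": "Lessons from my first startup", "upvotes": 1500},
    {"title": "SEO checklist for new founders", "upvotes": 600},
    {"title": "AMA: bootstrapped to $1M in revenue", "upvotes": 3000},
]

stats = keyword_stats(posts, "seo")
print(stats)
```

A keyword that shows up often among the top posts, with upvotes at or above the overall average, is a decent sign the subreddit cares about that topic.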
Here's the spreadsheet with sample data.

6. Build relationships with people who are already fans of your content

Most tweets will drive ZERO traffic to your website. That's why begging for tweets from anyone and everyone is a terrible idea (note: I proved this in my recent case study where tweets sent no traffic whatsoever to my website).

However, that's not to say all tweets are worthless: it's still worth reaching out to those who are likely to send real traffic to your website. Here's a workflow for doing this (note: it includes a bit of Twitter scraping):

- Scrape and add all Twitter mentions to a spreadsheet (using IFTTT);
- Scrape the number of followers for the people who've shared a lot of your stuff;
- Find contact details, then reach out and build relationships with these people.

OK, so first, make a copy of this Google Sheet.

IMPORTANT: You MUST make a copy of this on the root of your Google Drive (i.e. not in a subfolder). It MUST also be named exactly "My Twitter Mentions".
Next, turn this recipe on within your IFTTT account (you'll need to connect your Twitter + Google Drive accounts to IFTTT in order to do this).
What does this recipe do? Basically, every time someone mentions you on Twitter, it'll scrape the following information and add it to a new row in the spreadsheet:

- Twitter handle (of the person who mentioned you);
- Their tweet;
- Tweet link;
- Time/date they tweeted.

And if you go to the second sheet in the spreadsheet (i.e. the one labeled "1. Tweets"), you'll see the people who've mentioned you and tweeted a link of yours the highest number of times.
But the fact that they've mentioned you a number of times doesn't necessarily indicate that they'll drive any real traffic to your website. So, you now want to scrape the number of followers each of these people has.

You can do this with CSS selectors using Screaming Frog. Just set your search depth to 0 (see here), then use these settings under the custom extractor:
Here's each CSS selector (for clarification):

- Twitter Name: h1
- Twitter Handle: h2 > a > span > b
- Followers: li.ProfileNav-item.ProfileNav-item--followers > a > span.ProfileNav-value
- Website: div.ProfileHeaderCard > div.ProfileHeaderCard-url > span.ProfileHeaderCard-urlText.u-dir > a

Copy/paste all the Twitter links from the spreadsheet into Screaming Frog and run it. Once finished, go to:

Custom > Extraction > Export
Open the exported .csv, then copy/paste all the data into the next tab in the sheet (i.e. the one labeled "2. SF Export").

Lastly, go to the final tab (i.e. "3. RESULTS") and you'll see a list of everyone who's mentioned you along with a bunch of other information, including:

- # of times they tweeted about you;
- # of followers;
- Their website (where applicable).
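The combining step can be sketched in a few lines of Python. This is an illustration under assumptions, not what the sheet does internally: it counts mentions per handle, attaches a follower count, and sorts by audience size. The function names, the minimum-mentions threshold, and the idea that follower counts may arrive abbreviated (e.g. "12.3K") are all mine.

```python
from collections import Counter


def parse_followers(text):
    """Turn strings like '1,234', '12.3K' or '1.2M' into an integer."""
    text = text.strip().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1:].upper()
    if suffix in multipliers:
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)


def rank_fans(mentions, followers, min_mentions=2):
    """Keep handles with enough mentions, sorted by follower count."""
    counts = Counter(handle.lower() for handle, _tweet in mentions)
    fans = [
        (handle, n, parse_followers(followers[handle]))
        for handle, n in counts.items()
        if n >= min_mentions and handle in followers
    ]
    return sorted(fans, key=lambda fan: fan[2], reverse=True)


# Hypothetical rows: (handle, tweet text) pairs from the mentions sheet.
mentions = [
    ("@alice", "Great guide from @you"),
    ("@bob", "Reading @you on scraping"),
    ("@Alice", "Another gem by @you"),
    ("@carol", "Nice one @you"),
    ("@alice", "Sharing @you again"),
    ("@bob", "More from @you"),
]

# Hypothetical scraped follower counts, keyed by lowercase handle.
followers = {"@alice": "1,234", "@bob": "12.3K", "@carol": "1.2M"}

fans = rank_fans(mentions, followers)
print(fans)
```

The threshold filters out one-off mentions, so the list that's left is people who both share your stuff repeatedly and have an audience worth reaching.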
Because these people have already shared your content in the past, and also have a good number of followers, it's worth reaching out and building relationships with them.

Here's the spreadsheet with sample data.

Final thoughts

Web scraping is crazily powerful. All you need is some basic XPath/CSS/Regex knowledge (along with a web scraping tool, of course) and it's possible to scrape anything from any website in a matter of seconds.

I'm a firm believer that the best way to learn is by doing, so I highly recommend that you spend some time replicating the experiments above. This will also teach you to pay attention to things that could easily be automated with web scraping in future.

So, play around with the tools/ideas above and let me know what you come up with in the comments section below.

https://ahrefs.com/blog/web-scraping-for-marketers/