Scraping (Michael Bauer)


http://unurl.org/scco

Getting data off the web by scraping

Introduction

Many times data is not easily accessible, although it does exist. As much as we wish everything was available in CSV or the format of our choice, most data is published in different forms on the web. What if you want to combine the data with other datasets and explore it independently? Scraping to the rescue!

Scraping is the method of extracting data hidden in documents, such as web pages and PDFs, and making it usable for further processing. It is among the most useful skills if you set out to investigate data, and most of the time it's not especially challenging. For the simplest kinds of scraping you don't even need to know how to write code.

This example relies heavily on Google Chrome for the first part. Some things work well with other browsers, however we will be using one specific browser extension that is only available for Chrome. If you can't install Chrome, don't worry: the principles remain similar.

You will need:

● Google Chrome browser: https://www.google.com/intl/en/chrome/browser/
● OpenRefine from http://openrefine.org
● Websites:
  ○ http://www.senado.gov.co/el-senado/senadores (note: this link no longer works)
  ○ http://en.wikipedia.org/wiki/List_of_Presidents_of_the_Chamber_of_Representatives_of_Colombia


Detour: a short introduction to HTML

Getting data from websites might seem a little complicated at first, but rest assured: once you've done it a couple of times it will start to feel familiar. To extract data from websites we need to peek under the hood and look at the underlying HTML code. Don't worry, you don't need to understand every detail of it to be able to do this.

HTML is the acronym for Hypertext Markup Language and is the language used to describe (mark up) web pages. It is the underlying language used to structure web page content. HTML itself does not determine the way things look; it only helps to classify and structure content. So let's peek at some websites.

Walkthrough: Exploring HTML with Google Chrome

1. Open a website containing the data you are interested in, e.g. the list of senators at http://www.senado.gov.co/el-senado/senadores, in Chrome.
2. Scroll down to the table.
3. Right click on one of the entries.
4. Select "Inspect Element".

5. Chrome will open a second area at the bottom of the page showing the underlying HTML code, focused on the element you clicked.


6. The pointy brackets are the HTML tags.
7. Now move your mouse up and down and notice how Chrome tells you which element is which.
8. You can expand and collapse certain sections by clicking on the triangles.
9. Did you notice something? Every row in the long list is within one <tr></tr> section. <tr> indicates a table row.
10. The data fields are in <td></td> tags; td indicates table data. So we're dealing with a table here?
11. If you scroll up the list you'll notice a <table> element, followed by a <tbody> element, so yes, this is a proper HTML table (a small example of this structure follows below).

12. Go ahead and explore! HTML is no mystery. If you want to know more about it and how to build webpages with it, visit the School of Webcraft for a gentle introduction.

Task: Identify a website and look at its HTML code using Inspect Element. Did you find something interesting?
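To make the structure from steps 9 to 11 concrete, here is a minimal sketch of such a table in HTML (the cell contents are invented for illustration; the real page holds the senators' data):

  <table>
    <tbody>
      <tr>                 <!-- one table row -->
        <td>Nombre</td>    <!-- a data cell -->
        <td>Partido</td>   <!-- another data cell -->
      </tr>
      <!-- ...one <tr> per row of the table... -->
    </tbody>
  </table>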


Extracting a table from a webpage using Google Spreadsheets

Knowing the structure of a website is the first step towards extracting and using the data. Let's get our data into a spreadsheet so we can use it further. An easy way to do this is provided by a special formula in Google Spreadsheets.

Walkthrough: Importing HTML tables into Google Spreadsheets

1. Go to http://drive.google.com, log in and create a new spreadsheet.
2. Edit cell A1 (the top left cell).
3. Let's import the table of senators from http://www.senado.gov.co/el-senado/senadores.
4. Enter the following formula into the cell: =importHTML("http://www.senado.gov.co/el-senado/senadores","table",1) (The last number indicates which table in the document to import; just try different values and find the matching one. In the Spanish locale, use ; instead of , to separate the arguments.)

5. Press Enter.
6. Wait for a while and watch a table magically appear.
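If the table that appears is not the one you want, the last argument is probably pointing at a different table on the page; just try the next index. A quick sketch (the 2 here is only a guess, the right value depends on how many tables the page contains):

  =importHTML("http://www.senado.gov.co/el-senado/senadores","table",2)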


Congratulations! You liberated the data from the website.

Task: Find a website with a table and scrape the information from it. Share your result on http://datahub.io (make sure to tag your dataset with schoolofdata.org).

Scraping websites using the Scraper extension for Chrome

If you are using Google Chrome there is a browser extension for scraping web pages. It's called "Scraper" and it is easy to use. It will help you scrape a website's content and upload the results to Google Docs.

Walkthrough: Scraping a website with the Scraper extension

1. Open Google Chrome.


2. Open the Scraper extension page in the Chrome Web Store: https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en
3. Click the "Add to Chrome" button.
4. Now let's go back to the list of senators.
5. Open or reload http://www.senado.gov.co/el-senado/senadores.
6. Now mark one row.
7. Right click and select "Scrape similar...".

8. A new window will appear: the Scraper console.


9. In the Scraper console you will see the scraped content.
10. Click on "Save to Google Docs..." to save the scraped content as a Google Spreadsheet.

Easy, wasn't it? Now let's do something a little more complicated.

Walkthrough: Extended scraping with the Scraper extension

1. Did you notice how the table not only contains the names of the senators, but also a link to an extended profile page? Let's get those links as well!
2. The first thing we need to do is inspect the page to understand its structure. Remember: the <tr> elements are the rows and the <td> elements are the data cells of the table.
3. If we look at a name entry, the name is within an <a> tag. The <a> tag marks the link and has an attribute called "href" which points to the profile page (a sketch of such a row follows below).
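A linked name cell might look roughly like this (the href path, name and email are made up; use Inspect Element to see the real markup):

  <tr>
    <td><a href="/el-senado/senadores/ejemplo-perfil">Nombre Ejemplo</a></td>
    <td>correo@ejemplo.gov.co</td>
  </tr>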

4. Let's get this out with the Scraper extension. We now need the left side of the Scraper console.
5. See the column definitions on the left? We'll add a new column by clicking the green (+) next to the first column.


6. Now we are asked to enter an "XPath" query. XPath is a query language that tells the computer which parts of the website you're interested in.
7. First we need to start like the existing column 1: "*[1]" tells the computer to take the data from the first column.
8. Then we need to add the link we saw above (the <a> tag); we do this by appending "/a".
9. In the end, we want the href attribute of the link, so we finish with "/@href".

10. When we press Enter, the new table will contain the column we just added.
11. Notice how the email column also contains a lot of extra markup? The email address is also a link, so simply adding "/a" to the end of that column's XPath will clean it up.

12. This way we can extract even more advanced information from a web page.
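Putting steps 7 to 9 together, the XPath for the new column reads (assuming, as above, that the linked name sits in the first cell of each row):

  *[1]/a/@href

For the email column from step 11, keep whatever XPath that column already has and simply append /a to it.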


Task: Find a webpage with information you are interested in and scrape it! Don't forget to post your results on http://datahub.io.

Scraping more than one webpage: Refine

Until now we've only scraped data from a single webpage. What if there are more? Refine to the rescue! Refine is a program made to effectively clean up messy data, such as the table we extracted above. However, it can also load additional information from web pages, for example from the profile pages of our senators.

Walkthrough: Scraping more information about the senators with Refine

1. Get a CSV export of the spreadsheet we just scraped (if you didn't manage to create one, get it from here).
2. Start Refine.
3. Create a new project and select the CSV you just downloaded for upload.

4. Click "Next" to get to the preview screen.
5. Tell Refine to interpret your data as CSV and to ignore the first line of the file.

6. If some characters appear garbled, click in the "Character encoding" text box and select "UTF-8".


7. Now click on "Create Project" to create the project.
8. Perfect. Let's give the columns reasonable names, e.g. the column containing the link or the one containing the email address. Do so by opening the column options (the blue triangle next to the column name) and selecting "Edit column - Rename this column".

9. Very good. Now let's load the profile pages into Refine!
10. See how the links do not contain the domain of the website we're looking at? This is because they were defined as "relative" links; we'll need to add the domain name when downloading the pages.
11. To download web pages into Refine, select the column options of the "Link" column and choose "Edit column - Add column by fetching URLs".


12. A new menu will pop up. First, let's give the new column a name; I call it "Page".
13. Next, let's reduce the "throttle delay" to about 50 ms; the websites of the senate should be able to stand some load ;)
14. Then we need to tell Refine where to get the data from. Right now it just takes everything in the column, so we have to alter the expression.
15. Expressions in Refine are nothing else than formulas in spreadsheets. They work a little differently, but if you can write formulas in spreadsheets, you can do this.
16. We simply have to prepend "http://www.senado.gov.co" to the URL, so our expression will be: 'http://www.senado.gov.co' + value (see the sketch below). NOTE: don't forget the quotes around the domain, and type the expression by hand, since copying and pasting might cause problems.
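For reference, the complete fetching expression from step 16 is (written in Refine's expression language; value stands for the relative link already stored in the cell):

  'http://www.senado.gov.co' + value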


17. Click "OK" to start the download.
18. Downloading all the pages will take a while.
19. In the meantime, open one of the pages to take a closer look at it: http://www.senado.gov.co/el-senado/senadores/213-directorio/106-aguilar-hurtado-nerthink-mauricio
20. See how it contains an address, a picture and, on some profiles, links to social media accounts?
21. We'll want to extract this information.
22. When Refine has stopped downloading you'll see a "Page" column containing a lot of HTML code. Don't worry, we'll use Refine to handle this.
23. Let's extract the address. If you inspect the page you'll find the address within one <p> tag with the id "contact-street"; we can use this to extract the address (a rough sketch of the markup follows below).
24. In Refine, select the column options for the "Page" column and select "Edit column - Add column based on this column".
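Based on the ids mentioned in this walkthrough, the relevant part of a profile page looks roughly like this (the address text and the social media URLs are invented placeholders):

  <p id="contact-street">Carrera 7 No. 8-68, Bogotá</p>
  <div id="contact-misc">
    <a href="https://twitter.com/ejemplo">Twitter</a>
    <a href="https://www.facebook.com/ejemplo">Facebook</a>
  </div>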


25. You will now see an expression menu like the one we used previously. Name the column "Street Address" and then let's work on the expression.
26. The first thing we need to do is tell Refine that this page is HTML. We do this by writing "value.parseHtml()".
27. Then we want to select the <p> tag with the id "contact-street"; we do this by appending ".select('p#contact-street')".
28. This will return a list, but we want text back, so we append ".toString()".
29. The full expression will look like this:
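Putting steps 26 to 28 together, the expression to enter is:

  value.parseHtml().select('p#contact-street').toString()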


30. Perfect. If you click "OK", Refine will extract the street address of all senators.
31. Now let's get the social media contacts. They are in a <div> with the id "contact-misc", and each contact has its own link.
32. Select "Edit column - Add column based on this column" as before.
33. Our expression is going to be very similar to the one above. First we tell Refine this is HTML using "value.parseHtml()".
34. Then we select the div; however, since we know each contact is a link, we append "a" to the selector. The expression so far is "value.parseHtml().select('div#contact-misc a')".
35. Now we want to join the list together with semicolons, so we end up with a list of profiles.
36. The full expression looks like this:
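Putting steps 33 to 35 together, the full expression should look roughly like this (join glues the list together with the separator you pass it):

  value.parseHtml().select('div#contact-misc a').join(';')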


37. Click "OK" to extract the links.
38. Notice how both new columns still contain HTML tags? Let's remove them.
39. To do so, select "Edit cells - Transform..." from the options of the respective column (e.g. the "Street Address" column).
40. Refine expects an expression describing how to change the cell value; the expression we use to clean out the HTML is: value.replace(/\<.*?\>/,'') (a quick before/after example follows below).
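As a quick illustration of what this transform does (the address is an invented example), a cell containing

  <p id="contact-street">Carrera 7 No. 8-68, Bogotá</p>

becomes

  Carrera 7 No. 8-68, Bogotá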


41. Do this for both columns. We have now extracted additional information about the senators!

Summary:

In this course we've covered web scraping and how to extract data from websites. The main purpose of scraping is to convert semi-structured data into structured data and make it easily usable for further processing. While this is a relatively simple task with a bit of programming, for single webpages it is also feasible without any programming at all. We've introduced =importHTML, the Scraper extension and Refine for your scraping needs.

Got stuck? Need help? Visit http://ask.schoolofdata.org!

Further Reading

See blogpost: http://schoolofdata.org/2012/12/04/the-webpage-is-the-api-scraping-resources/

Recipes:
● http://schoolofdata.org/handbook/recipes/introduction-to-html/
● http://schoolofdata.org/handbook/recipes/liberating-html-tables/
● http://schoolofdata.org/handbook/recipes/scraper-extension-for-chrome/
● http://schoolofdata.org/handbook/recipes/scraping-multiple-pages-with-refine-and-scraper/
● http://schoolofdata.org/handbook/recipes/extracting-data-from-pdf-with-tabula/

Paul Bradshaw: Scraping for Journalists, https://leanpub.com/scrapingforjournalists (ebook)

Want to learn more Data Skills? Visit: http://schoolofdata.org or http://escueladedatos.org


