By Scraper

How to web scrape with Python?

One of our friends recently opened VK (a Russian social network) and saw that it was showing her memories: whom she befriended 3 years ago, what she posted 5 years ago, and so on.



She became interested in analyzing the other data the social network had accumulated about her over 10 years, and asked us to tell her more about web scraping. So we decided to write this article.

Before you can analyze data and build beautiful graphs, you need to get the data. Unfortunately, many services do not have a public API, so you have to parse HTML pages. In this article we will talk about how to scrape a website.


Stages


We divide the task into two parts:

1) download and save the HTML pages;

2) parse the HTML into a format convenient for further analysis (CSV, JSON, pandas DataFrame, etc.).
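
As a rough illustration of what the second stage should produce, here is a minimal sketch that serializes parsed records to JSON and CSV using only the standard library. The `records` list and its `date` and `text` fields are hypothetical placeholders, not data from a real page.

```python
import csv
import io
import json

# Hypothetical records, shaped the way a parser might return them in stage 2.
records = [
    {"date": "2015-03-01", "text": "First post"},
    {"date": "2017-06-12", "text": "Second post"},
]


def to_json(rows):
    """Serialize a list of dicts to a JSON string."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_csv(rows):
    """Serialize a list of dicts to CSV text; the first row's keys become the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A list of dicts like this can also be passed directly to `pandas.DataFrame(records)` if you prefer to continue the analysis in pandas.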


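There are many Python libraries for sending HTTP requests. The best known are the standard library's urllib (urllib and urllib2 in Python 2) and Requests. You also need to choose a library for parsing HTML. The most common options are re, BeautifulSoup and lxml; Scrapy is a complete scraping framework that covers both downloading and parsing.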



Data loading


So, first you need to download the data. The obvious first attempt is to request the page by URL and save it to a local file, but often the site recognizes us as a robot and refuses to show the data.

However, the browser will help us. Along with each request, the browser sends headers such as User-Agent and Cookie, among a number of others. Often it is enough to pass a browser-like User-Agent header to get the necessary data.
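
The idea of sending a browser-like User-Agent can be sketched as follows. This version uses the standard library's `urllib.request` so that it runs without extra packages. The `USER_AGENT` string, the helper names `build_request` and `save_page`, and the example URL are all assumptions made for illustration; copy the real header value from your own browser's developer tools.

```python
import urllib.request
from pathlib import Path

# Assumed value for illustration; replace it with the string your browser sends.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url):
    """Create a request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def save_page(url, path):
    """Download the page at `url` and write its raw bytes to `path`."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        Path(path).write_bytes(response.read())


# Example call (hypothetical URL and file name):
# save_page("https://example.com/", "page.html")
```

With the Requests library the same thing is usually written as `requests.get(url, headers={"User-Agent": USER_AGENT})`. Some sites also check cookies or other headers, so this alone may not be enough.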


Parsing


XPath is a query language for XML and XHTML documents. We will use XPath selectors with the lxml library.

Now let's move on to extracting data from the HTML. The easiest way to understand how the page is structured is to use the Inspect Element function in the browser.
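
Here is a minimal sketch of XPath selection with lxml. The `PAGE` markup, the class names `post`, `date` and `text`, and the function `parse_posts` are invented for the example; on a real site you would take the element and class names from Inspect Element.

```python
from lxml import html

# Hypothetical markup standing in for a saved page.
PAGE = """
<html><body>
  <div class="post"><span class="date">2015-03-01</span><p class="text">First post</p></div>
  <div class="post"><span class="date">2017-06-12</span><p class="text">Second post</p></div>
</body></html>
"""


def parse_posts(source):
    """Return one dict per <div class="post"> found in the HTML source."""
    tree = html.fromstring(source)
    posts = []
    for node in tree.xpath('//div[@class="post"]'):
        posts.append({
            # string(...) returns the text content of the first matching node.
            "date": node.xpath('string(.//span[@class="date"])').strip(),
            "text": node.xpath('string(.//p[@class="text"])').strip(),
        })
    return posts
```

Note that `@class="post"` matches only elements whose class attribute is exactly `post`; for elements with several classes, `contains(@class, "post")` is a common, if looser, alternative.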


To sum up, in this article we explained how to scrape websites, got acquainted with the Requests, BeautifulSoup and lxml libraries, and described how to get data from the Internet in a form suitable for further analysis.


If you still have questions, ask them in the comments!
