
Webscraping with AutoHotkey and Python

Searching for novel data for a big data Power BI showcase, I came across the car sites Bilbasen and Bilzonen.

How do we extract data from a site without any knowledge of its API?
Python, urllib and BeautifulSoup are commonly described as the standard tools for webscraping, although they lack the features and rendering capabilities of a real browser. There seems to be a race for the ultimate webscraping Python package, even though the web is constantly changing and most data are text based.

In the sections below I will cover three different data acquisition strategies.
In conclusion I suggest you combine tools and prepare to learn the intricacies of regular expressions.

We will use AutoHotkey, spreadsheets and Python with packages such as re, requests and csv.

Scraping Bilbasen with AutoHotkey and the Python packages requests, re and csv

You can download script based automation software on this page:
Once installed, you write AutoHotkey scripts in a text editor and save them as files with the extension .ahk
A webpage can be saved using the syntax UrlDownloadToFile, http://websiteurl, sourcefilename.txt
Put one such line per webpage in a file to download several pages, then compile and run the script from a Windows folder using the right-click menu. The resulting text files are direct downloads of the html source and are thus a high quality source of information for many purposes. Combine a spreadsheet with the CONCATENATE text function to generate your ahk script file.

UrlDownloadToFile,, BilbasenFrontPage.txt

In practice we search for a particular brand of car (Volkswagen) on Bilbasen and traverse the search results, which gives us 119 pages to scrape for links to specific cars. The search result pages have a very similar structure, which makes it easy to prepare the lines for an AutoHotkey script in a spreadsheet.
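If you prefer to skip the spreadsheet, the same script lines can be generated in Python. This is only a sketch: the URL template is a placeholder (the real search address is not reproduced here), and the function name build_ahk_lines and its parameters are made up for illustration.

```python
def build_ahk_lines(url_template, first_page, last_page, prefix="Bilbasen"):
    """Return one UrlDownloadToFile line per search result page.

    url_template is a hypothetical pattern with a {page} placeholder;
    replace it with the real search URL of the site you scrape.
    """
    lines = []
    for page in range(first_page, last_page + 1):
        url = url_template.format(page=page)
        lines.append("UrlDownloadToFile, {}, {}{}.txt".format(url, prefix, page))
    return lines


# Placeholder URL, not the real search address
ahk_lines = build_ahk_lines("http://example.com/search?brand=vw&page={page}", 2, 119)

# Show the first two generated lines
print("\n".join(ahk_lines[:2]))
```

Write the lines to a file with the .ahk extension and run it as described above.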

From the html source files we extract data in Python using simple iteration techniques and basic Python packages such as requests and re. The requests package is preferred to the urllib packages for compatibility reasons. Perl-like regular expressions are available through re.

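import requests
import re

link = []  # links to individual car pages, used further below
for i in range(2, 119):
    try:
        # Read a search result page saved by the AutoHotkey script
        with open('Bilbasen'+str(i)+'.txt') as f:
            listhtml = f.read().replace('\n', ' ')
        # The step that extracts car links from listhtml into link is missing from the source
    except OSError:
        print('Error reading file Bilbasen'+str(i)+'.txt')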

It can be useful to remove whitespace and other obnoxious characters before processing. This is done with the replace method. In the example above listhtml is a Python object with variables and methods to call. Remember that Python is an object oriented scripting language; a list of an object's variables and methods is given by the dir(Object_Name) command. Indentation indicates scope and is a very important aspect of Python syntax, so please be careful when you copy-paste Python code. There are many pitfalls for newbies.
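As a small illustration of the cleaning step, here is a sketch that uses only the replace method. The helper name clean_html and the sample fragment are made up for the example.

```python
def clean_html(text):
    """Turn line breaks and tabs into spaces, then collapse repeated spaces."""
    for unwanted in ("\r", "\n", "\t"):
        text = text.replace(unwanted, " ")
    while "  " in text:
        text = text.replace("  ", " ")
    return text.strip()


sample = "<td>\n\tVolkswagen,   Golf\r\n</td>"   # made-up fragment
cleaned = clean_html(sample)
print(cleaned)

# dir() lists the variables and methods of an object; a str has replace
print("replace" in dir(cleaned))
```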

We proceed to the actual webscraping process, in which I use simple regular expressions to extract data. The variables for each car are collected in a list and stored in a comma separated file.

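import csv

with open('vw.csv', 'w', newline='') as csvfile:
    specswriter = csv.writer(csvfile, delimiter=';',quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for urlbil in link:
        try:
            # Assumption: the page is fetched with requests; the original call is missing
            page = requests.get(urlbil).text
        except requests.RequestException:
            print("Failed to read presentation page")
            continue
        # Placeholder: the string searched in the original is missing;
        # narrow this down to the comma separated car heading of the page
        heading = page
        try:
            manufacturer = re.search('[^,]*(?=,)', heading, re.M).group().replace(" ",'')
        except AttributeError:
            manufacturer = ''
            print("Could not retrieve manufacturer")
        try:
            model = re.search('(?<=,)[^,]*(?=,)', heading, re.M).group().replace(" ",'')
        except AttributeError:
            model = ''
            print("Could not retrieve model")
        try:
            variant = re.search(r'(?<=%s,).*(?=,\sBenzin|,\sDiesel)' % model, heading, re.M).group()
        except AttributeError:
            variant = ''
            print("Could not retrieve variant")
        # Fuel, year, average price and equipment follow the same try/except pattern
        # ("Could not retrieve fuel", "... year", "... average price", "... equipment");
        # their regular expressions are missing from the source
        specswriter.writerow([manufacturer, model, variant])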

Scraping Bilzonen: Selenium vs AutoHotkey

Accessing a site with requests is a process with information loss, especially if the site requires JavaScript or other rendering capabilities when the source files are parsed. Selenium's WebDriver provides webscraping results closer to a browser's rendering, but slows down data acquisition. On the Bilzonen site we do not obtain the needed information using requests and turn to two different strategies: a traditional Python-Selenium strategy, and an alternative in which source files are gathered with AutoHotkey.


Selenium is marketed as an attractive multi-platform webscraping tool. However, the slowdown due to communication overhead may require optimization of the Python code, an expensive and time-consuming process that is vulnerable to software changes and upgrades.

from selenium import webdriver


There are loads of webpages with tutorials on optimal Selenium development environments. Many are continuously updated, most require local customization of guidelines, software packages and operating system.


Data acquisition comparable to the initial attempt on Bilbasen is obtained if automation is placed outside your Python environment. This may be the most important phase in the initial data extraction process, and it is a faster alternative to Selenium.

Let's generate a list of urls to individual presentation pages in Python from the initial websearch files gathered with AHK and Python.


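link = []  # urls to individual presentation pages
for i in range(2, 400):
    try:
        # Saved search result files are numbered in steps of 12
        with open('Bilzonen'+str((i-1)*12)+'.txt') as f:
            listhtml = f.read()
        # The step that extracts presentation page urls from listhtml into link is missing from the source
    except OSError:
        print('Error reading file Bilzonen'+str((i-1)*12)+'.txt')

with open('VWpages.csv', 'w', newline='') as csvfile:
    vwswriter = csv.writer(csvfile, delimiter=';',quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for count, url in enumerate(link):  # count numbers the output files
        vwswriter.writerow(["UrlDownloadToFile,"+url+", VW_"+str(count)+".txt"])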


Copy the generated file to a separate folder, compile and run using the right button menu.
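The step that pulls the urls out of the saved search pages is not shown above. One way it might look is sketched below, using a deliberately simple href pattern. The pattern, the "/car/" filter and the sample page are made up; the real pages will need their own expression.

```python
import re

# Matches the value of href="..." attributes; a deliberately simple pattern
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def extract_links(html, must_contain):
    """Return unique hrefs containing must_contain, in order of appearance."""
    seen = set()
    links = []
    for href in HREF_PATTERN.findall(html):
        if must_contain in href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Made-up fragment of a saved search result page
sample_page = (
    '<a href="/car/vw-golf-123">Golf</a>'
    '<a href="/about">About</a>'
    '<a href="/car/vw-polo-456">Polo</a>'
    '<a href="/car/vw-golf-123">Golf again</a>'
)
link = extract_links(sample_page, "/car/")
print(link)
```

Dropping duplicates matters here, since the same car is often linked several times on a search result page.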


Presentation page extracts are obtained within minutes rather than hours when AHK scripting is compared with Selenium. Furthermore, the obtained files contain the needed variables in an easy-to-access format.
A general rule of thumb is to gather data source files beforehand, continuously and outside the Python development environment, and to stick to core Python programming combined with regular expressions when processing the data.

