Web Scraping and Black Lives Matter: analyzing patterns in ad placement

In the fall of 2019, our high school computer science teacher recruited us into a GGCS (Girls Go CyberStart) competition group because we were all in STEM-focused classes and interested in computer science. By February, we began having weekly meetings in a classroom provided by LoisLab, a local educational nonprofit, to solve the cybersecurity puzzles and learn more about computer science in general. After the competition concluded, we didn’t want to stop our weekly meetings. We began looking for a new project to work on together over the summer.

One of our mentors sent us an NPR interview about advertising companies filtering where their ads appear. This interview highlighted certain companies preventing their ads from running alongside ‘polarizing’ topics, including the Black Lives Matter movement. Given our combined interests in civil rights and computer science, we decided to explore this further. We wondered if it might be possible to verify this information ourselves, and if we could accomplish this kind of project with our existing skill set.

We believed that we could build a tool that would allow us to gather advertisement data from any web page. We could then apply that process to sets of articles — for example, some containing Black Lives Matter keywords and some not — to see if we could observe any patterns.

We needed a way to capture the advertiser information through code so we could automate the process and collect a large volume of data. However, it was difficult to locate banners and advertisers by analyzing the HTML because this information was obscured inside the JavaScript. We then turned to Selenium, a project that provides a set of tools for automating browser functions from code. We hoped that its built-in features would let us isolate and identify elements by tag or class name, but the labels for the banner ads varied on each page load. In the end, we settled on a process that clicks on the ads at estimated XY coordinates, using a combination of Selenium, the Chrome WebDriver, and PyAutoGUI. This allowed us to navigate to news articles, click on the ads, and store each advertiser’s web address in a dictionary. We stored and updated our code on GitHub as we worked through all of these iterations. You can find a link to a clean copy here.
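In rough outline, the approach looks like the sketch below. The function names and the coordinate estimate are illustrative, not the exact code from our repository; Selenium and PyAutoGUI are imported inside the browsing function so the pure helpers stay importable without them installed.

```python
from collections import Counter


def guess_banner_coords(window_width, window_height):
    """Estimate where a banner ad sits on screen.

    This is a deliberately simple placeholder guess (center-right,
    upper area); the real coordinates depend on the page layout.
    """
    return window_width // 2, window_height // 4


def visit_and_click(article_url, clicks=5):
    """Open an article, click at the estimated ad position, and tally
    which advertiser site each click lands on."""
    # Third-party dependencies, imported lazily on purpose
    import pyautogui
    from selenium import webdriver
    from urllib.parse import urlparse

    driver = webdriver.Chrome()  # requires ChromeDriver on PATH
    counts = Counter()
    try:
        for _ in range(clicks):
            driver.get(article_url)
            x, y = guess_banner_coords(*pyautogui.size())
            pyautogui.click(x, y)  # click at the estimated ad position
            # Assume the ad opened in a new tab; record its domain
            driver.switch_to.window(driver.window_handles[-1])
            counts[urlparse(driver.current_url).netloc] += 1
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
    finally:
        driver.quit()
    return dict(counts)
```

Clicking by screen coordinates is fragile, but it sidesteps the problem of ad elements whose tags and class names change on every page load.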

Our code running, automating a browser session, and collecting banner ad visits

When the program finishes, it returns a dictionary listing each advertiser’s URL and the number of times it was visited. Below is an example of the data we collected by running our code on an NPR article about the Black Lives Matter movement for approximately twelve hours.

{'foreverspin.com': 1887, 'business.comcast.com': 3, 'www.npr.org': 5, 'www.penguinrandomhouse.com': 38, 'www.getrealmaine.com': 140, 'www.brimstoneconsulting.com': 1, 'www.ecmcfoundation.org': 2, 'topshelf.nprpresents.org': 4, 'www.barracuda.com': 10, 'www.cpb.org': 1, 'www.macfound.org': 1, 'www.ikonpass.com': 3, 'mellon.org': 2, 'one.npr.org': 5, 'www.capitalone.com': 3, 'www.dnb.com': 3, 'avtech.com': 1, 'www.ddcf.org': 1, 'sloan.org': 1, 'www.amazon.com': 1, 'www.td.com': 1, 'mejuri.com': 1, 'www.alonetogether.com': 1}

The program successfully runs and returns a dictionary, but it is clear that our data is heavily skewed toward one particular advertiser. In this instance, the most prominent advertiser is Forever Spin, but the dominant advertiser varies between runs. While we are still unsure of the reason for these results, we are considering a few possible explanations:

  • The website could be tracking our visits using cookies
  • The site could be monitoring our traffic using our IP address
  • The results could actually be representative of typical visits to the article
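One quick way to quantify this skew is to compute what share of all recorded visits the top advertiser accounts for. The sketch below does this on an abbreviated subset of the sample data above:

```python
# Abbreviated subset of the results dictionary from our sample run
results = {
    "foreverspin.com": 1887,
    "business.comcast.com": 3,
    "www.npr.org": 5,
    "www.penguinrandomhouse.com": 38,
    "www.getrealmaine.com": 140,
    "www.barracuda.com": 10,
}

total = sum(results.values())
# Find the advertiser with the largest visit count
top_site, top_count = max(results.items(), key=lambda kv: kv[1])
share = top_count / total

print(f"{top_site} accounts for {share:.0%} of {total} recorded visits")
```

Even on this subset, a single advertiser dominates the counts, which is what prompted the explanations listed above.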

Although we have been able to gather some data, we have not collected enough to draw any conclusions; we would need a much larger data set to report results or findings. If you are interested in helping us with this fun project, we would appreciate it if you would take a look at the code on GitHub and follow the instructions, which explain how to run our code and share your results.

Questions? Reach us at info@loislab.org. We look forward to hearing from you.