Client-side scripting kata: Basic scraping

Basic scraping

Perform basic web scraping using the browser's developer tools

The way of kata

A kata is an exercise for muscle memory. It's not intended to fill your brain with information but to train your fingers to react. The information is there to give you the why, but your fingers need to learn the how.

The material on this page is presented in a specific order, from the least specific to the most technical. You will learn the most by jumping in as soon as you have some idea of what you should do. Once you're done, read the rest of the material and check your solution.

All katas are designed to be doable without third-party libraries (in fact, part of the point is to learn how to do what these libraries do).

To make the best of katas, observe the following rules.

Don't rush.
When stuck, take a break and do something unrelated.
Do not copy/paste code. Always retype everything.
Do not use AI tools to generate code.
Try to do something that wasn't in the instructions, experiment.
Repeat the kata from time to time, even if you think you've got it.
You have mastered the kata once you are able to complete it without thinking too much.

Remember, the goal is not to get it done, but to get some practice.

Introduction

In this exercise, you will practice basic web scraping using just the browser's developer tools.

Skills you will acquire

Analyzing the page structure
Locating elements
Generating an object URL
Generating a data URL
Triggering file downloads programmatically
Cleaning data

Objective

Create an object that maps extensions to their media types based on the provided page
Download the object as a JSON file

Check your solution

Copy the contents of the downloaded JSON file into the developer tools console and check that it can be parsed without errors

Materials

Common MIME types page (MDN)

Keep in mind

Media type is also known as MIME type. These terms are synonyms.

To execute a larger script on the page, you will probably want to write the code in multi-line format rather than execute it one line at a time. In Chrome and Chromium-based browsers, the Snippets feature can come in handy. You can find it under the Sources tab in the developer tools.

Screenshot: the Snippets tab in the left sidebar under Sources

In Firefox, there is a multi-line mode that is activated by clicking the icon at the far right of the console or by using the Ctrl+B (Cmd+B on macOS) shortcut.

Screenshot: the multi-line icon in Firefox, located along the right edge of the console

The first thing you want to do is find a selector that uniquely identifies either the table containing the necessary information or its rows. You will then iterate through all the rows and grab the extension and media type from each one.

The use case for this data is looking up media type by file extension, so you want to create an object that can be used for that.
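One possible shape for that lookup object is sketched below. Treat it as an illustration only: the helper name buildLookup, the sample rows, and the cleaning rules (splitting a cell that lists several extensions, keeping only the first token of the media type cell) are assumptions. Inspect the real table to see what cleaning it actually needs.

```javascript
// Hypothetical helper: turns [extensionText, mediaTypeText] pairs
// into an object keyed by extension.
function buildLookup(pairs) {
  const lookup = {};
  for (const [extensionText, mediaTypeText] of pairs) {
    // Assumption: the media type is the first whitespace-separated token in its cell.
    const mediaType = mediaTypeText.trim().split(/\s+/)[0];
    // Assumption: a cell may list several extensions separated by commas or spaces.
    const extensions = extensionText.split(/[\s,]+/).filter(Boolean);
    for (const extension of extensions) {
      lookup[extension.toLowerCase()] = mediaType;
    }
  }
  return lookup;
}

// Made-up sample rows, only for trying the function out.
const sampleLookup = buildLookup([
  ['.htm, .html', 'text/html'],
  [' .json ', 'application/json'],
]);
console.log(sampleLookup['.html']); // text/html
```

With an object like this, looking up a media type is a plain property access, which is why an object fits the use case better than an array of rows.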

To trigger a download, you can simply trigger the click event on a link. This can be done using the .click() method on the link node. This will activate the default behavior of the link. You can set the href and download attributes to define what this will do.
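A minimal sketch of that idea follows. The helper name triggerDownload and the file name in the example call are made up for illustration. The doc parameter exists only so the function can be tried with a stand-in object; in the console you would leave it out.

```javascript
// Hypothetical helper: asks the browser to save whatever `href` points to
// under the suggested name `filename`.
function triggerDownload(href, filename, doc = document) {
  const link = doc.createElement('a');
  link.href = href;         // what to download
  link.download = filename; // suggested file name; tells the browser to save rather than navigate
  link.click();             // activates the link's default behavior
  return link;
}

// Example call for the console (the file name is an arbitrary choice):
// triggerDownload('data:text/plain,hello', 'hello.txt');
```

The link never has to be visible for this to work. If nothing happens in your browser, one thing to try is appending the link to document.body before calling click().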

There are two ways to convert file contents into a URL that you can use with links. One is a data URL and the other is an object URL. Try both approaches.
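The two approaches could look roughly like this. The helper names toDataUrl and toObjectUrl and the sample object are assumptions made for the sketch; Blob, URL.createObjectURL(), URL.revokeObjectURL(), and encodeURIComponent() are standard browser APIs.

```javascript
// Hypothetical helpers; the names are illustrative.

// Data URL: the whole file content is encoded into the URL itself.
function toDataUrl(text, mediaType = 'application/json') {
  return `data:${mediaType};charset=utf-8,${encodeURIComponent(text)}`;
}

// Object URL: a short URL that refers to a Blob held in memory.
function toObjectUrl(text, mediaType = 'application/json') {
  const blob = new Blob([text], { type: mediaType });
  return URL.createObjectURL(blob);
}

// Made-up sample content.
const sampleJson = JSON.stringify({ '.json': 'application/json' }, null, 2);
console.log(toDataUrl(sampleJson));

// In the browser:
// const url = toObjectUrl(sampleJson);
// ...use url as the link's href, then release the Blob when done:
// URL.revokeObjectURL(url);
```

A data URL grows with the content, while an object URL stays short but is only valid until it is revoked or the page goes away.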

When verifying the output, you don't need to call JSON.parse() (though there is nothing wrong with that). You can also simply paste the output into the console and evaluate it as code. Keep in mind that every valid JSON text is also a valid JavaScript expression, but not every JavaScript object literal is valid JSON.
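If you do want the stricter JSON.parse() check, a small sketch could look like this. The helper name isValidJson and both sample strings are made up for illustration.

```javascript
// Hypothetical helper: returns true when `text` parses as JSON.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (error) {
    return false;
  }
}

console.log(isValidJson('{".json": "application/json"}')); // true
console.log(isValidJson("{'.json': 'application/json'}")); // false
```

The second sample uses single quotes, which JavaScript accepts in an object literal but JSON does not. This is the kind of mistake that pasting into the console would not catch.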

Hints & spoilers

Spoiler: iterating table rows

In Chrome, you can right-click the element you want to reach in the Elements panel, then select 'Copy' and 'Copy JS Path' in the menu. This copies to the clipboard some code that looks like this:

document.querySelector('#content > article > div > figure > table > tbody > tr:nth-child(1)')

You can then tweak this code so you can iterate all rows. For example:

document.querySelectorAll('#content > article > div > figure > table > tbody > tr')
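Once you have the rows, reading the cell text from each one could be sketched as follows. The helper name readRows is made up, and the default column positions (extension in the first cell, media type in the third) are an assumption about the table layout. Inspect the page and adjust them if the columns differ.

```javascript
// Hypothetical helper: collects [extensionText, mediaTypeText] pairs from table rows.
// Assumption: column 0 holds the extension and column 2 the media type.
function readRows(rows, extensionColumn = 0, mediaTypeColumn = 2) {
  const pairs = [];
  for (const row of rows) {
    const cells = row.cells; // all cells of the row, both td and th
    // Skip rows that do not have enough cells.
    if (cells.length <= Math.max(extensionColumn, mediaTypeColumn)) continue;
    pairs.push([
      cells[extensionColumn].textContent.trim(),
      cells[mediaTypeColumn].textContent.trim(),
    ]);
  }
  return pairs;
}

// In the console, using the selector shown above:
// readRows(document.querySelectorAll('#content > article > div > figure > table > tbody > tr'));
```

The NodeList returned by querySelectorAll() can be used directly in a for...of loop, so there is no need to convert it to an array first.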

Hajime, the duck guy