Server-side scripting kata: File indexing

File indexing

Create an index of files in a folder

The way of kata

Kata is an exercise for muscle memory. It's not intended to fill your brain with information but train your fingers to react. The information is there to give you the why, but your fingers need to learn the how.

The material on this page is presented in a specific order — from least specific to highly technical. You will learn the most by jumping in as soon as you have some idea of what you should do. Once you're done, read the rest of the material and check your solution.

All katas are designed to be doable without using 3rd party libraries (and, in fact, the point is to also learn how to do what these libraries do).

To make the best of katas, observe the following rules.

Don't rush.
When stuck, take a break and do something unrelated.
Do not copy/paste code. Always retype everything.
Do not use AI tools to generate code.
Try to do something that wasn't in the instructions, experiment.
Repeat the kata from time to time, even if you think you've got it.
You have mastered the kata once you are able to complete it without thinking too much.

Remember, the goal is not to get it done, but to get some practice.

Introduction

In this exercise, you will practice traversing and indexing a folder.

Skills you will acquire

Obtaining a list of files and folders within a folder
Reading file contents
Path operations
Hashing
Synchronization of asynchronous operations

Objective

Collect information about the files within a folder, including its subfolders
Print to console the following information:
- Relative path of the file
- File's media type
- File's modification date
- The file's content hash

Check your solution

Create a folder with an assortment of files including files of different types (text, code, images, SVG, JSON files, videos, etc.)
Include some nested folders with files in them
The output shows all files found in the folder, including those found in nested folders
The output shows the correct file media type for each file
The output shows the file's modification date
The output shows the file's content hash as SHA-256 hexdigest

Keep in mind

Filesystem operations are I/O (input/output) operations. As such, they can be performed asynchronously as well as synchronously. The asynchronous API is available in the traditional callback-based style or with promises. You should try all three options and note the difference in performance.
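As a rough sketch of what the three styles look like, here is the same directory listing done synchronously, with a callback, and with a promise. The temporary folder and the a.txt file are assumptions made only so the snippet can run on its own; in the kata you would point these calls at the folder you are indexing.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical sample folder, created only so the example is runnable
const sampleDir = fs.mkdtempSync(path.join(os.tmpdir(), 'kata-'))
fs.writeFileSync(path.join(sampleDir, 'a.txt'), 'hello')

// 1. Synchronous: the call blocks until the listing is ready
const namesSync = fs.readdirSync(sampleDir)
console.log('sync:', namesSync)

// 2. Callback-based: the listing arrives later, inside the callback
fs.readdir(sampleDir, function (err, names) {
    if (err) throw err
    console.log('callback:', names)
})

// 3. Promise-based: the same operation through fs.promises
const listing = fs.promises.readdir(sampleDir)
listing.then(function (names) {
    console.log('promise:', names)
})
```

The same three flavors exist for most fs functions, so whichever style you pick for listing can also be used for reading and stat calls.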

When using the asynchronous API, you will be reading multiple files in parallel. Since you want all operations to complete before the next step (outputting to console), you will need to synchronize the async operations.

For callback-style calls, synchronization means that you want to catch the last callback that is executed. Since the exact timing of the last callback is unknown ahead of time, you will want to set up some kind of counter that you check each time a callback finishes, and move to the next step once the counter reaches a certain number (e.g., 0 or total number of expected calls). Here's an example:

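var fs = require('fs')

// fileList (an array of objects with a path property) and nextStep
// are assumed to be defined elsewhere
var filesRemaining = fileList.length
var result = []
fileList.forEach(function (file, i) {
    fs.readFile(file.path, function (err, content) {
        // Stop on the first failed read instead of silently continuing
        if (err) throw err
        // Do something with content and add to result array
        filesRemaining--
        if (!filesRemaining) nextStep(result)
    })
})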

In the above example, you ideally want the order of results in the result array to match the order of the files in fileList. Rather than using result.push(thing), you can use the i variable to assign each result to a specific index: result[i] = thing.

Since this kind of synchronization is common, there are libraries out there that do it for you. Try writing a function or a class that abstracts this pattern, and then also find a library or two and try them and see how they compare to your solution. Think about how much code gets saved (or not saved) with these canned solutions.
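Once you have written your own version, you may want to compare it with something like the sketch below. The parallel() function and its signature are hypothetical, not taken from any particular library; it assumes each task is a function that accepts a Node-style callback (err, value).

```javascript
// Hypothetical helper: starts every task at once and calls done(err, results)
// exactly once, either after the last task finishes or on the first error.
function parallel(tasks, done) {
    var remaining = tasks.length
    var results = []
    var failed = false
    // With no tasks, no callback would ever fire, so finish immediately
    if (remaining === 0) return done(null, results)
    tasks.forEach(function (task, i) {
        task(function (err, value) {
            if (failed) return
            if (err) {
                failed = true
                return done(err)
            }
            // Store by index so results keep the order of the tasks
            results[i] = value
            remaining--
            if (remaining === 0) done(null, results)
        })
    })
}

// Usage, with timers standing in for file reads
parallel([
    function (cb) { setTimeout(function () { cb(null, 'slow') }, 20) },
    function (cb) { setTimeout(function () { cb(null, 'fast') }, 5) }
], function (err, results) {
    if (err) throw err
    console.log(results)
})
```

Even though the second task finishes first, the results keep the order of the tasks because each value is stored by index.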

For the promise-based API, you can use Promise.all(). Compare the promise-based solution to the callback-based solution. What are the advantages and disadvantages that you see with each approach?
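For comparison, a minimal promise-based sketch might look like this. The temporary folder, the file names, and the shape of fileList are assumptions made so the snippet runs on its own.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical sample files, created only so the example is runnable
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'kata-'))
const fileList = ['one.txt', 'two.txt'].map(function (name) {
    const filePath = path.join(dir, name)
    fs.writeFileSync(filePath, 'contents of ' + name)
    return { path: filePath }
})

// Start every read right away; Promise.all() resolves once all of them
// have finished, with the values in the same order as fileList
const allRead = Promise.all(fileList.map(function (file) {
    return fs.promises.readFile(file.path, 'utf8')
}))

allRead.then(function (contents) {
    console.log(contents)
}).catch(function (err) {
    // Promise.all() rejects as soon as any single read fails
    console.error('At least one read failed:', err)
})
```

Note that there is no counter and no index bookkeeping here: Promise.all() takes care of both the waiting and the ordering.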

The path module offers functions for working with paths. These functions work on the paths as text, not the physical paths on disk. Note that your objective is to generate paths relative to the folder you are indexing. For instance, if the full path to the file is /foo/bar/baz/qux.txt and the folder you are indexing is /foo/bar, the relative path of the file is baz/qux.txt. The full path is called an 'absolute' path.
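The example paths above can be tried out directly. This sketch uses path.posix so that the result is the same on every operating system; in your own solution, the plain path functions are probably what you want.

```javascript
const path = require('path')

// The folder being indexed and the absolute path of one file in it
const root = '/foo/bar'
const absolute = '/foo/bar/baz/qux.txt'

// Purely textual operation: neither path has to exist on disk
const relative = path.posix.relative(root, absolute)
console.log(relative) // baz/qux.txt

// Going the other way: folder + relative path gives the absolute path again
const rebuilt = path.posix.join(root, relative)
console.log(rebuilt) // /foo/bar/baz/qux.txt
```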

The file's media type (MIME type) can be determined based on its extension. Depending on where the files in the folder come from, we may also sometimes employ more advanced techniques to determine the file type — for example by reading the magic number (file signature). In this exercise, you will assume that the files come from a trusted source and the extension represents the correct media type. If you are curious, you could also implement the media type detection using magic numbers, though.
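One possible shape for the extension-based lookup is sketched below. The mediaTypeFor() helper and the contents of the table are assumptions for illustration; the table is deliberately tiny, and a complete solution needs a much longer list of media types.

```javascript
const path = require('path')

// A small hand-written table; extend it with the types you need
const MEDIA_TYPES = {
    '.txt': 'text/plain',
    '.html': 'text/html',
    '.js': 'text/javascript',
    '.json': 'application/json',
    '.svg': 'image/svg+xml',
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.mp4': 'video/mp4'
}

// Hypothetical helper: looks up the media type by file extension and
// falls back to a generic binary type for anything it doesn't know
function mediaTypeFor(filePath) {
    const ext = path.extname(filePath).toLowerCase()
    return MEDIA_TYPES[ext] || 'application/octet-stream'
}

console.log(mediaTypeFor('images/logo.SVG')) // image/svg+xml
console.log(mediaTypeFor('README'))          // application/octet-stream
```

Lowercasing the extension is a design choice here, so that logo.SVG and logo.svg are treated the same way.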

The modification time of the file is obtained using the fs.stat() function.
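A minimal callback-style sketch of reading the modification time follows. The temporary file is an assumption made only so the snippet can run on its own.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical sample file, created only so the example is runnable
const sampleDir = fs.mkdtempSync(path.join(os.tmpdir(), 'kata-'))
const samplePath = path.join(sampleDir, 'note.txt')
fs.writeFileSync(samplePath, 'hello')

fs.stat(samplePath, function (err, stats) {
    if (err) throw err
    // stats.mtime is a Date object; stats.mtimeMs holds the same
    // moment as a number of milliseconds
    console.log(stats.mtime.toISOString())
    // The same object also tells files and folders apart
    console.log(stats.isFile(), stats.isDirectory())
})
```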

To generate the file hash, you can use the createHash() function of the Node.js crypto module. Your aim is to create a SHA-256 hash, specifically the hexdigest of the SHA-256 hash, for each file's contents. This hash can be used in various ways, such as determining that two files have the same contents, or that a file's content has been modified. When working with web servers, we might use such hashes, for example, for ETag headers.
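As a small sketch, hashing a buffer of content could look like this. The sha256Hex() helper name is hypothetical; in your solution, the content would be whatever you read from the file.

```javascript
const crypto = require('crypto')

// Hypothetical helper: returns the SHA-256 hexdigest of a Buffer or string
function sha256Hex(content) {
    return crypto.createHash('sha256').update(content).digest('hex')
}

// The same content always produces the same 64-character digest
console.log(sha256Hex(Buffer.from('hello')))
// 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```

Because the digest depends only on the content, two files with identical contents produce the same hash regardless of their names or locations.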

Basic scraping (to obtain media types)

Hajime, the duck guy
