File indexing
Create an index of files in a folder
The way of kata
Kata is an exercise for muscle memory. It's not intended to fill your brain with information but train your fingers to react. The information is there to give you the why, but your fingers need to learn the how.
The material on this page is presented in a specific order — from least specific to highly technical. You will learn the most by jumping in as soon as you have some idea of what you should do. Once you're done, read the rest of the material and check your solution.
All katas are designed to be doable without third-party libraries (in fact, part of the point is to learn how to do what these libraries do).
To make the best of katas, observe the following rules.
- Don't rush.
- When stuck, take a break and do something unrelated.
- Do not copy/paste code. Always retype everything.
- Do not use AI tools to generate code.
- Try to do something that wasn't in the instructions, experiment.
- Repeat the kata from time to time, even if you think you've got it.
- You have mastered the kata once you are able to complete it without thinking too much.
Remember, the goal is not to get it done, but to get some practice.
Introduction
In this exercise, you will practice traversing and indexing a folder.
Skills you will acquire
- Obtaining a list of files and folders within a folder
- Reading file contents
- Path operations
- Hashing
- Synchronization of asynchronous operations
Objective
- Collect information about the files within a folder, including its subfolders
- Print to console the following information:
- Relative path of the file
- File's media type
- File's modification date
- The file's content hash
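If you are unsure how to start collecting files from nested folders, the following is one possible sketch using the synchronous API. The function name listFiles is made up for this illustration, and the synchronous calls are used only to keep it short; the asynchronous variants are left for you to work out.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper: recursively collect the paths of all files
// under `dir`. Each returned path starts with `dir`.
function listFiles(dir) {
  let files = []
  // withFileTypes makes readdirSync return entries that know whether
  // they are files or directories, so no extra stat call is needed.
  fs.readdirSync(dir, { withFileTypes: true }).forEach(function (entry) {
    const fullPath = path.join(dir, entry.name)
    if (entry.isDirectory()) {
      // Descend into the subfolder and merge what was found there.
      files = files.concat(listFiles(fullPath))
    } else if (entry.isFile()) {
      files.push(fullPath)
    }
  })
  return files
}
```

Calling listFiles('.') would return the files in the current folder and all of its subfolders.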
Check your solution
- Create a folder with an assortment of files including files of different types (text, code, images, SVG, JSON files, videos, etc.)
- Include some nested folders with files in them
- The output shows all files found in the folder, including those found in nested folders
- The output shows the correct file media type for each file
- The output shows the file's modification date
- The output shows the file's content hash as SHA-256 hexdigest
Keep in mind
Filesystem operations are I/O (input/output) operations. As such, they can be performed synchronously or asynchronously. The asynchronous API is available in the traditional callback-based style and in a promise-based style. You should try all three options and note the difference in performance.
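As a rough illustration of the three styles, the sketch below reads the same file synchronously, with a callback, and with a promise. The temporary file, its contents, and the helper names readWithCallback and readWithPromise are made up for the example.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical sample file, created only so the example can run anywhere.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'kata-'))
const samplePath = path.join(dir, 'sample.txt')
fs.writeFileSync(samplePath, 'hello')

// 1. Synchronous: blocks until the file has been read.
const syncContent = fs.readFileSync(samplePath, 'utf8')

// 2. Callback style: returns immediately; `done(err, content)` runs later.
function readWithCallback(filePath, done) {
  fs.readFile(filePath, 'utf8', done)
}

// 3. Promise style: returns a promise that resolves with the content.
function readWithPromise(filePath) {
  return fs.promises.readFile(filePath, 'utf8')
}

readWithCallback(samplePath, function (err, content) {
  if (err) throw err
  console.log('callback:', content)
})
readWithPromise(samplePath).then(function (content) {
  console.log('promise:', content)
})
console.log('sync:', syncContent)
```

The synchronous line is printed first because the other two only deliver their results after the current script has finished running.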
When using the asynchronous API, you will be reading multiple files in parallel. Since you want all operations to complete before the next step (outputting to console), you will need to synchronize the async operations.
For callback-style calls, synchronization means that you want to catch the last callback that is executed. Since the exact timing of the last callback is unknown ahead of time, you will want to set up some kind of counter that you check each time a callback finishes, and move to the next step once the counter reaches a certain number (e.g., 0 or total number of expected calls). Here's an example:
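var fs = require('fs')

// fileList and nextStep are assumed to be defined elsewhere
var filesRemaining = fileList.length
var result = []
// With no files there are no callbacks, so move on right away
if (!filesRemaining) nextStep(result)
fileList.forEach(function (file, i) {
  fs.readFile(file.path, function (err, content) {
    if (err) throw err // Replace with proper error handling
    // Do something with content and add to result array
    filesRemaining--
    // The last callback to finish triggers the next step
    if (!filesRemaining) nextStep(result)
  })
})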
In the above example, you ideally want the order of results in the result
array to match the order of the files in fileList. Rather than using
result.push(thing), you can use the i variable to assign the result to a
specific index: result[i] = thing.
Since this kind of synchronization is common, there are libraries out there that do it for you. Try writing a function or a class that abstracts this pattern, and then also find a library or two and try them and see how they compare to your solution. Think about how much code gets saved (or not saved) with these canned solutions.
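One possible shape for such an abstraction is sketched below. The helper name mapParallel and its signature are invented for this illustration and do not come from any particular library; treat it as a starting point to compare your own version against.

```javascript
// Hypothetical helper: run `worker(item, cb)` on every item in parallel and
// call `done(err, results)` once, with results in the same order as `items`.
function mapParallel(items, worker, done) {
  const results = new Array(items.length)
  let remaining = items.length
  let failed = false

  // Nothing to wait for, so finish on the next tick.
  if (remaining === 0) return process.nextTick(done, null, results)

  items.forEach(function (item, i) {
    worker(item, function (err, value) {
      if (failed) return // an earlier error was already reported
      if (err) {
        failed = true
        return done(err)
      }
      results[i] = value // index, not push, to preserve the order
      remaining--
      if (remaining === 0) done(null, results)
    })
  })
}

// Usage: earlier items finish later, yet the order is preserved.
function slowDouble(n, cb) {
  setTimeout(function () { cb(null, n * 2) }, 30 - n * 10)
}
mapParallel([1, 2, 3], slowDouble, function (err, doubled) {
  if (err) throw err
  console.log(doubled) // [ 2, 4, 6 ]
})
```

Because fs.readFile(path, callback) has the same shape as the worker, it could be passed to this helper directly.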
For the promise-based API, you can use Promise.all(). Compare the promise-based
solution to the callback-based solution. What advantages and disadvantages do
you see with each approach?
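A minimal sketch of the Promise.all() approach might look like the following. The function name readAll and the sample files are assumptions made for the example.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Read every file in `paths` in parallel. Promise.all resolves with the
// contents in the same order as `paths`, or rejects on the first error.
function readAll(paths) {
  return Promise.all(paths.map(function (p) {
    return fs.promises.readFile(p, 'utf8')
  }))
}

// Hypothetical sample files, created only so the example can run.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'kata-'))
const paths = ['a.txt', 'b.txt'].map(function (name) {
  const p = path.join(dir, name)
  fs.writeFileSync(p, 'content of ' + name)
  return p
})

readAll(paths).then(function (contents) {
  console.log(contents)
}).catch(function (err) {
  console.error('reading failed:', err.message)
})
```

Note that no counter is needed here: Promise.all() keeps track of the pending operations and of the result order for you.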
The path module offers functions for working with paths. These functions work
on the paths as text, not the physical paths on disk. Your objective is to
generate paths relative to the folder in which the files are located. For
instance, if the full path to the file is /foo/bar/baz/qux.txt and the folder
you are indexing is /foo/bar, the relative path of the file is baz/qux.txt.
The full path is called an 'absolute' path.
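The example paths above can be tried out with path.relative(). The sketch uses path.posix so that the result is the same on every platform; the plain path module follows the conventions of the operating system the code runs on.

```javascript
const path = require('path')

const root = '/foo/bar'
const absolute = '/foo/bar/baz/qux.txt'

// Purely textual: neither path has to exist on disk.
const relative = path.posix.relative(root, absolute)
console.log(relative) // baz/qux.txt

// Going the other way: join the root and the relative path.
console.log(path.posix.join(root, relative)) // /foo/bar/baz/qux.txt
```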
The file's media type (MIME type) can be determined based on its extension. Depending on where the files in the folder come from, we may also sometimes employ more advanced techniques to determine the file type — for example by reading the magic number (file signature). In this exercise, you will assume that the files come from a trusted source and the extension represents the correct media type. If you are curious, you could also implement the media type detection using magic numbers, though.
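Extension-based detection can be as simple as a lookup table. The sketch below is deliberately incomplete: the table, the function name mediaType, and the choice of application/octet-stream as the fallback for unknown extensions are assumptions of this example.

```javascript
const path = require('path')

// A deliberately small lookup table; extend it to cover the files you index.
const MEDIA_TYPES = {
  '.txt': 'text/plain',
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.json': 'application/json',
  '.svg': 'image/svg+xml',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.mp4': 'video/mp4'
}

// Fallback for unknown or missing extensions (an assumption of this sketch).
const DEFAULT_TYPE = 'application/octet-stream'

function mediaType(filePath) {
  // extname returns the extension including the dot, or '' if there is none.
  const ext = path.extname(filePath).toLowerCase()
  return MEDIA_TYPES[ext] || DEFAULT_TYPE
}

console.log(mediaType('baz/qux.txt')) // text/plain
console.log(mediaType('photo.PNG'))   // image/png
console.log(mediaType('Makefile'))    // application/octet-stream
```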
The modification time of the file is obtained using the fs.stat() function.
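As a small sketch, the callback passed to fs.stat() receives a stats object whose mtime property is a Date. The wrapper name modifiedAt is made up for the example, and process.execPath (the Node.js binary) is used only because it is a file that is certain to exist.

```javascript
const fs = require('fs')

// Hypothetical wrapper: look up the modification time of a file.
function modifiedAt(filePath, done) {
  fs.stat(filePath, function (err, stats) {
    if (err) return done(err)
    done(null, stats.mtime) // a Date object
  })
}

modifiedAt(process.execPath, function (err, mtime) {
  if (err) throw err
  console.log(mtime.toISOString())
})
```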
To generate the file hash, you can use the Node.js crypto module's
createHash() function. Your aim is to create a SHA-256 hash (specifically, the
hexdigest of the SHA-256 hash) for each file's contents. This hash can be used
in various ways, such as determining that two files have the same contents, or
that file content has been modified. When working with web servers, we might
use it, for example, for ETag headers.
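A minimal sketch of computing the hexdigest is shown below. The function name sha256Hex is made up for the example; in the kata, the content would be the Buffer you get from reading a file.

```javascript
const crypto = require('crypto')

// Return the SHA-256 hexdigest of a string or Buffer.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex')
}

console.log(sha256Hex('abc'))
// ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

Identical contents always produce the same digest, which is what makes the hash usable for detecting duplicates and modifications.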