Slicing Large Bodies of Text Into Chapter Files

I have a pet project involving a speed reader. I use a nifty API to provide a reading box that flashes the text to the user one word at a time. You can read text faster by moving the words instead of moving your eyes. Speed reading changed my life when I discovered it. I became interested in reading again. I use it every day.

There are different kinds of writing, so there are different kinds of reading. It’s really easy to speed read an article or a blog post. Classical literature calls for a much slower pace. I used the speed reader to get through assignments in college. One time the assignment was a standard book report. I chose to read Strange Case of Dr Jekyll and Mr Hyde, and I was able to read it twice. For another assignment, I used the speed reader to read through an entire 300-page dissertation.

I like the speed reader so much that I made it my goal to get a copy of the top 100 books from Project Gutenberg and make them available via speed reader. I made an app, called Public Library, to serve the content. Given the way the API works, it is easiest to store each book as a collection of chapter files.

The Problem

I work on this project so sporadically that I admit it’s mostly been sitting on the shelf. I started off with three books and got up to 12. I was handling the text manually so I could make sure it was formatted correctly. I put off adding new books for a while because it was tedious work. Recently it occurred to me how to automate slicing the text into chapter files.

The Solution

Since I know the names of the chapters, I search the body of text to find the line where each one occurs. Then I collect the lines of text between each chapter name and the next, and save each group as its own file. Doing this programmatically has turned a process that took an hour or more into about five minutes.
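The search-and-slice step described above can be sketched in a few lines of Python. This is only an illustration, not the code from the project: the function name split_into_chapters and the sample text are made up for the example, and it assumes each chapter name sits on a line of its own and appears only once (so a table of contents would need to be stripped first).

```python
def split_into_chapters(lines, chapter_titles):
    """Return a list of (title, chapter_lines) pairs, in order of appearance."""
    wanted = set(chapter_titles)
    # Record the line index at which each known chapter title appears.
    starts = [
        (i, line.strip())
        for i, line in enumerate(lines)
        if line.strip() in wanted
    ]
    chapters = []
    for n, (start, title) in enumerate(starts):
        # A chapter runs until the next title, or to the end of the text.
        end = starts[n + 1][0] if n + 1 < len(starts) else len(lines)
        chapters.append((title, lines[start + 1:end]))
    return chapters


# Tiny sample standing in for a full book.
sample = [
    "THE GOLDEN BIRD",
    "A certain king had a beautiful garden.",
    "",
    "HANS IN LUCK",
    "Some men are born to good luck.",
]
chapters = split_into_chapters(sample, ["THE GOLDEN BIRD", "HANS IN LUCK"])
```

Matching on whole stripped lines, rather than searching for the title anywhere in a line, keeps a sentence that merely mentions a chapter name from being mistaken for a chapter boundary.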

The first book I have sliced so far is Grimms’ Fairy Tales. It went from 9,057 lines of text to 62 chapter files. You can see why I needed something better than copying and pasting by hand if I was ever going to get to all of the top one hundred books. I have two more books lined up and a whole list of others still to get.

The Chapter Splicer

You can fork the ChapterSplicer from my GitHub. It is a class that reads a text file and stores the text. It searches the text for occurrences of keywords, in this case chapter names. Then it breaks up the text according to the location of the chapter names and saves each chapter’s text to a target folder.
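For readers who want a feel for the shape of such a class before looking at the repository, here is a rough stand-in. It is not the actual ChapterSplicer: the class name ChapterSplicerSketch, its method names, and the numbered file naming scheme are all assumptions made for this example, and it again assumes each chapter name appears once on its own line.

```python
import re
import tempfile
from pathlib import Path


class ChapterSplicerSketch:
    """Illustrative stand-in; not the actual ChapterSplicer from the repository."""

    def __init__(self, source_path, chapter_titles):
        self.chapter_titles = list(chapter_titles)
        # Read the whole book once and keep it as a list of lines.
        self.lines = Path(source_path).read_text(encoding="utf-8").splitlines()

    def find_chapter_starts(self):
        """Return (line_index, title) pairs for every title found in the text."""
        wanted = set(self.chapter_titles)
        return [
            (i, line.strip())
            for i, line in enumerate(self.lines)
            if line.strip() in wanted
        ]

    def save_chapters(self, target_dir):
        """Write one file per chapter into target_dir and return the paths."""
        target = Path(target_dir)
        target.mkdir(parents=True, exist_ok=True)
        starts = self.find_chapter_starts()
        written = []
        for n, (start, title) in enumerate(starts):
            end = starts[n + 1][0] if n + 1 < len(starts) else len(self.lines)
            # Build a filesystem-friendly name such as "01_chapter_one.txt".
            slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
            path = target / f"{n + 1:02d}_{slug}.txt"
            body = "\n".join(self.lines[start + 1:end]).strip() + "\n"
            path.write_text(body, encoding="utf-8")
            written.append(path)
        return written


# Usage on a tiny made-up book written to a temporary folder.
workdir = Path(tempfile.mkdtemp())
book = workdir / "book.txt"
book.write_text(
    "CHAPTER ONE\nFirst body.\n\nCHAPTER TWO\nSecond body.\n",
    encoding="utf-8",
)
splicer = ChapterSplicerSketch(book, ["CHAPTER ONE", "CHAPTER TWO"])
saved = splicer.save_chapters(workdir / "chapters")
```

Prefixing each file name with its chapter number keeps the files in reading order when the folder is listed alphabetically, which is handy if the app loads chapters by file name.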