AI Pipeline to Efficiently Summarize Kindle Highlights


Key Takeaways

1. A Kindle user developed an AI pipeline to summarize their reading highlights, thereby optimizing information retention.

2. The process uses TXT and SQLite files to extract highlights, with preprocessing to organize and deduplicate the data.

3. An open-source AI model is used to generate summaries, with a final export in Markdown format for easy integration into tools like Obsidian.

💡 Why it matters: This automated method allows readers to better synthesize and retain information from their digital readings, thereby enriching their learning experience.


The reading experience on Kindle is often enriching, but it can also be frustrating when it comes to retaining information. Indeed, the author of this article confesses that he can only remember about 10% of the information he consumes. To address this issue, he regularly re-reads his highlights or summarizes the book based on them, which helps him better understand the content.

However, one problem persists: the tendency to highlight excessively. By "excessively," the author means a quantity so large that the highlights can no longer be considered "key notes." This overload makes the synthesis process long and tedious, and it is often abandoned along the way.

Recently, after reading a particularly enjoyable book, the author found himself in this situation. Not wanting to spend a large portion of his free time manually summarizing, he decided to automate this process using his skills in technology and data. The result was satisfactory, and he chose to share his approach so that others could benefit from it.

Note: The author uses a rather old Kindle, but the method described should work on newer models. A slightly different approach is also possible on newer Kindles, as explained in the article.

Objective

The goal of this project is to generate a summary from Kindle highlights. To do this, the author envisioned a simple pipeline for a single book, consisting of the following steps:

Retrieve the highlights from the book.
Generate the summary with a Generalized Automatic Summary (GAS) or a similar method.
Export the summary.

The first step varies depending on how the data is structured, requiring specific preprocessing.

1. Data Retrieval and Processing

The author sought a way to extract highlights from his Kindle, knowing that they are stored there. He opted for a method that works with both books purchased from the Kindle store and PDFs or files sent from his laptop.

He decided not to use existing software to extract the data, preferring to rely solely on his e-reader and laptop, connected via a USB cable. Fortunately, no jailbreak is necessary, and two methods are available depending on the version of the Kindle:

All Kindles have a file in the documents folder named My Clippings.txt, which contains all the highlights made at any time on any book.
Newer Kindles also have an SQLite file in the system directory named annotations.db, which contains highlights in a more structured manner.

In this article, the author uses method 1 (My Clippings.txt) because his Kindle does not have the annotations.db database. However, if you have access to this database, it is recommended to use it as it offers better quality and requires less preprocessing.
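The article does not show the layout of annotations.db, so it may help to inspect the file before relying on it. The following is a minimal sketch that assumes only that the file is a regular SQLite database; the function name list_tables is illustrative, and no table or column names are assumed.

```python
import sqlite3
from pathlib import Path

def list_tables(db_path):
    """Return {table_name: [column names]} for every table in an SQLite file."""
    # Open read-only so the database copied from the Kindle is never modified.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Printing the returned dictionary for a copy of the database shows which tables and columns are available to query.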

Retrieving the highlights is as simple as reading the TXT file. Here are some key aspects and issues encountered with this method:

All books are grouped in the same file.
The exact definition of "highlight" by Amazon is unclear, but everything highlighted at any given time appears there, even if you delete or extend it. The original remains in the TXT file.
There is a limit to highlighting: once exceeded, it is no longer possible to retrieve additional highlights. This restriction aims to prevent the complete highlighting of a book to avoid illegal sharing.

The anatomy of a highlight is as follows:

Book Title (Author Name)
Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020, at 11:25 PM

The highlighted text itself
==========
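The metadata line shown above carries the page and the location range. As a minimal sketch of how it could be picked apart, the regular expression below assumes the English wording from that example; the wording may differ by firmware or language, and the names META_RE and parse_metadata are illustrative.

```python
import re

# Assumed layout: an optional "page N | " part followed by "Location A-B" or "Location A".
META_RE = re.compile(
    r"(?:page (?P<page>\d+) \| )?"
    r"Location (?P<start>\d+)(?:-(?P<end>\d+))?"
)

def parse_metadata(line):
    """Return page and location range from a metadata line, or None if no location is found."""
    match = META_RE.search(line)
    if match is None:
        return None
    start = int(match.group("start"))
    # A single location ("Location 2212") is treated as a one-position range.
    end = int(match.group("end")) if match.group("end") else start
    page = int(match.group("page")) if match.group("page") else None
    return {"page": page, "start": start, "end": end}

line = "Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020"
print(parse_metadata(line))  # {'page': 145, 'start': 2212, 'end': 2212}
```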

The first step is to parse the highlights, and this is where the Python code comes into play:

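import re
from pathlib import Path

def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    # Entries in My Clippings.txt are separated by a line of ten "=" characters.
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        # A usable entry has a title line, a metadata line, and some text.
        if len(lines) < 3:
            continue
        book = lines[0]
        # Skip notes and bookmarks; only highlights are kept.
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        # The start of the location range is used as the sort key later.
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {"book": book, "location": location, "text": text}
        )
    return highlights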

This function, given the path of the highlights file, splits the text into separate entries and then iterates through them. For each entry, it extracts the book title, location, and highlighted text.

This final structure (a list of dictionaries) makes filtering by book easier:

[h for h in highlights if book_name.lower() in h["book"].lower()]

Once filtered, the highlights need to be ordered. Since highlights are appended to the TXT file as they are made, the file order reflects when each passage was highlighted, not where it sits in the text.

The author wants the highlights to appear in the same order as in the book, so sorting is necessary:

sorted(highlights, key=lambda x: x["location"])
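To see filtering and sorting work together, here is a small self-contained sketch on made-up data. The sample entries and the helper name select_book are illustrative and not part of the author's script.

```python
# Made-up entries in the order they might appear in the clippings file.
highlights = [
    {"book": "Book A (Author One)", "location": 300, "text": "third"},
    {"book": "Book B (Author Two)", "location": 50, "text": "other book"},
    {"book": "Book A (Author One)", "location": 120, "text": "first"},
    {"book": "Book A (Author One)", "location": 210, "text": "second"},
]

def select_book(highlights, book_name):
    """Keep one book's highlights and return them in reading order."""
    needle = book_name.lower()
    # Case-insensitive substring match on the title line.
    matching = [h for h in highlights if needle in h["book"].lower()]
    return sorted(matching, key=lambda h: h["location"])

ordered = select_book(highlights, "book a")
print([h["text"] for h in ordered])  # ['first', 'second', 'third']
```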

By checking the highlights file, one can find duplicate highlights (or duplicate sub-highlights). This occurs because each time a highlight is modified (for example, if not all targeted words were included), it is counted as new. Therefore, there may be several very similar highlights in the TXT file.

To manage this, deduplication is applied:

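def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        # Compare against the highlights kept so far.
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            # The new text is contained in a kept one: drop the new text.
            if text in c["text"]:
                duplicate = True
                break
            # The new text extends a kept one: keep the longer version.
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            # Add the highlight to the list
            clean.append(h)
    return clean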

This method is simple yet effective, essentially checking for consecutive highlights with the same text (or part of it) and keeping the longest.
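A substring check will miss two versions of a highlight whose boundaries are shifted at both ends, since neither text contains the other. As a possible complement (this is a swapped-in technique, not the author's method), the sketch below uses difflib.SequenceMatcher from the standard library; the 0.9 threshold and the function names are assumptions to tune.

```python
from difflib import SequenceMatcher

def is_near_duplicate(a, b, threshold=0.9):
    """True when two texts are almost identical (similarity ratio between 0 and 1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def deduplicate_fuzzy(highlights, threshold=0.9):
    """Keep the longest text among near-identical highlights, preserving order."""
    kept = []
    for h in highlights:
        for i, k in enumerate(kept):
            if is_near_duplicate(h["text"], k["text"], threshold):
                # Replace the kept entry only if the new text is longer.
                if len(h["text"]) > len(k["text"]):
                    kept[i] = h
                break
        else:
            # No near-duplicate found among the kept entries.
            kept.append(h)
    return kept
```

Pairwise comparison is quadratic in the number of highlights, which should be acceptable for a single book but is worth keeping in mind for larger inputs.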

At this point, the book's highlights are sorted and deduplicated, and preprocessing could stop here. However, the author likes to highlight section titles as well, since this makes it possible to assign each highlight to its section in the summary.

But the current code cannot distinguish between a true highlight and a section title. Here’s how the author solved this problem:

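# has_chapter_prefix() and STOPWORDS are helpers defined elsewhere in the script.
def is_probable_title(text):
    text = text.strip()
    # Titles are short and do not end like a sentence.
    if not text or len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # Chapter-style prefix
    if has_chapter_prefix(text):
        return True
    # Capitalization ratio
    capitalized = sum(1 for w in words if w[0].isupper())
    cap_ratio = capitalized / len(words)
    # Stopword ratio
    stopword_count = sum(1 for w in words if w.lower() in STOPWORDS)
    stop_ratio = stopword_count / len(words)
    # Each satisfied criterion adds one point; two points are enough.
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2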

This function uses a heuristic based on length, trailing punctuation, capitalization, stop words, and chapter-style prefixes to decide whether a highlight is a title. It is called in a loop over all highlights, and the result is a list of section dictionaries, each with two keys:

title: the title of the section.
highlights: the highlights of the section.
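The grouping loop itself is not shown, so the following is only a sketch of how it might look. It assumes highlights are already sorted by location, takes the title test as a parameter, and stores highlight texts as plain strings. The has_chapter_prefix helper here is a hypothetical stand-in (the article calls a function of that name without showing it), and group_into_sections and the "Untitled" default are illustrative.

```python
import re

# Hypothetical stand-in for the helper the article references but does not show.
# It is deliberately crude: any text starting with these words plus a token matches.
CHAPTER_PREFIX = re.compile(r"^(chapter|part|section)\s+\w+", re.IGNORECASE)

def has_chapter_prefix(text):
    return bool(CHAPTER_PREFIX.match(text.strip()))

def group_into_sections(highlights, is_title, default_title="Untitled"):
    """Split location-sorted highlights into sections, opening a new one at each title."""
    sections = []
    current = {"title": default_title, "highlights": []}
    for h in highlights:
        text = h["text"]
        if is_title(text):
            # Close the running section; sections without highlights are dropped.
            if current["highlights"]:
                sections.append(current)
            current = {"title": text, "highlights": []}
        else:
            current["highlights"].append(text)
    if current["highlights"]:
        sections.append(current)
    return sections
```

In the full pipeline, is_probable_title would be passed as the is_title argument; the simpler prefix check is enough to try the grouping on sample data.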

2. AI Model and Output

To keep this project accessible to everyone, the author chose to use an open-source AI model. Ollama turned out to be an ideal option for running this project locally, as it ensures that the data remains private and allows for offline model execution.

Once installed, the code is relatively simple. Although the author is not a prompt engineer, he managed to achieve satisfactory results with the following code:

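import subprocess
def summarize_with_ollama(text, model):
    # The requested structure that follows "with:" is elided here.
    prompt = "You are summarizing a book based on the reader's highlights. Produce a structured summary with:"
    # Send the prompt and the highlights to the model on standard input.
    result = subprocess.run(
        ["ollama", "run", model],
        input=f"{prompt}\n\n{text}",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()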

This code works partly due to intensive data preprocessing, but also because it leverages existing models.
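Because the model receives plain text, the sections built during preprocessing have to be flattened before they are passed in. The sketch below shows one way to do that, assuming each section is a dictionary with "title" and "highlights" keys holding strings; the function name, the "Section:" label, and the optional character cap are illustrative choices.

```python
def sections_to_text(sections, max_chars=None):
    """Flatten sections into a plain-text outline that can be appended to a prompt."""
    blocks = []
    for section in sections:
        lines = [f"Section: {section['title']}"]
        lines += [f"- {h}" for h in section["highlights"]]
        blocks.append("\n".join(lines))
    text = "\n\n".join(blocks)
    # Crude length guard: cutting by characters is an assumption, not a token count.
    if max_chars is not None and len(text) > max_chars:
        text = text[:max_chars]
    return text

sample = [
    {"title": "One", "highlights": ["a", "b"]},
    {"title": "Two", "highlights": ["c"]},
]
print(sections_to_text(sample))
```

Keeping the section titles in the text gives the model the book's structure, which is the reason for detecting titles in the first place.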

Once the summary is generated, the author likes to export it in Markdown format, which is particularly useful for those using Obsidian. Here’s how to do it:

def export_markdown(book, sections, summary, output):
    # Requires: from pathlib import Path
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n---\n\n"
    # Append the generated summary under its own heading.
    md += "## Book Summary\n\n"
    md += f"{summary.strip()}\n"
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")

Thus, the author manages to transform his highlights into a complete Markdown summary (directly in Obsidian if desired) with fewer than 300 lines of Python code!

Complete Code and Test

Here is the complete code, in case you want to copy and paste it. It contains what we have seen plus some helper functions and argument processing:

from pathlib import Path
import re
import subprocess

# ---------- PARSE CLIPPINGS ----------
def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        # A usable entry has a title line, a metadata line, and some text.
        if len(lines) < 3:
            continue
        book = lines[0]
        # Skip notes and bookmarks; only highlights are kept.
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {"book": book, "location": location, "text": text}
        )
    return highlights

# ---------- FILTER BOOK ----------
def filter_book(highlights, book_name):
    return [h for h in highlights if book_name.lower() in h["book"].lower()]

# ---------- SORT ----------
def sort_by_location(highlights):
    return sorted(highlights, key=lambda x: x["location"])

# ---------- DEDUPLICATE ----------
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        # Compare against the highlights kept so far.
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            # The new text extends a kept one: keep the longer version.
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            # Add the highlight to the list
            clean.append(h)
    return clean

This code forms the basis of the author's AI pipeline for transforming Kindle highlights into structured summaries.
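The argument processing mentioned earlier is not shown, so the following is only a sketch of what a command-line entry point could look like with argparse. All flag names and defaults here are assumptions, not the author's interface.

```python
import argparse

def build_parser():
    """Command-line options for the pipeline (hypothetical flag names)."""
    parser = argparse.ArgumentParser(
        description="Summarize the Kindle highlights of one book."
    )
    parser.add_argument("--clippings", default="My Clippings.txt",
                        help="path to the clippings file copied from the Kindle")
    parser.add_argument("--book", required=True,
                        help="title, or part of it, of the book to summarize")
    parser.add_argument("--model", required=True,
                        help="name of a locally installed Ollama model")
    parser.add_argument("--output", default="summary.md",
                        help="path of the Markdown file to write")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

The parsed values would then be handed to the functions above in order: parse, filter, sort, deduplicate, group into sections, summarize, and export.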