| title | tags |
|---|---|
| Categorizing book quotes in clusters | llm, markdown, clusters, sqlite |
When I'm reading a theory book, I often take notes: it helps me remember the contents, makes it easier to get back to the book later on, and overall it helps me organize my thoughts.
I was looking for an excuse to use LLM embeddings, to better understand what they are and how they work, so I took a stab at categorizing my quotes into different groups (called clusters).
Here is what I did:

- Extract the quotes from the markdown files and put them in a SQLite database;
- Create an embedding for each quote. Embeddings are a numerical vector representation of the content, which can be compared with other embeddings to measure how similar two pieces of content are (see the toy example after this list);
- Run a clustering algorithm (k-means) on the embeddings to find the groups;
- Organize these quotes in a new markdown document.
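To make the comparison idea concrete, here is a toy sketch (mine, not part of the actual pipeline): an embedding is just a vector of numbers, and two embeddings can be compared with cosine similarity.

```python
import math

def cosine_similarity(a, b):
    # 1.0 means "same direction", values near 0 mean unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two made-up 4-dimensional embeddings; real models output hundreds of
# dimensions, but the comparison works the same way.
print(cosine_similarity([0.1, 0.9, 0.2, 0.4], [0.15, 0.8, 0.3, 0.35]))  # close to 1
print(cosine_similarity([0.1, 0.9, 0.2, 0.4], [0.9, 0.0, 0.7, 0.1]))    # much lower
```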
I went with the `llm` command-line tool, its `embed-multi` command, and the `llm-cluster` plugin to group the notes together, and I wrote a Python script to glue everything together.
In the end, I'm happy to have learnt how to make this work, but… the end results aren't as good as I expected them to be, unfortunately. Maybe that's because creating these clusters is where I actually learn, and automating it doesn't bring much value to me.
Grouping the quotes manually and removing the ones that repeat themselves seems to lead to a more precise and "to the point" document.
That being said, here's how I did it. The main goal was to understand how it works!
## Extracting quotes from markdown files
First, I extracted the quotes and put them in a local SQLite database. Here is the Python script I used:
```python
def extract_quotes(input_file):
    with open(input_file, "r") as file:
        quote_lines = []
        quotes = []
        for line in file:
            if line.startswith(">"):
                quote_lines.append(line.rstrip("\n"))
            else:
                if quote_lines:
                    quote = "\n".join(quote_lines).strip()
                    quotes.append(quote)
                    quote_lines = []
        # Don't lose the last quote when the file ends with one
        if quote_lines:
            quotes.append("\n".join(quote_lines).strip())
        return quotes
```
This reads the file line by line and groups together consecutive lines starting with a `>`. I'm not even using a Markdown parser here. I went with Python because it seemed easier to handle multi-line quotes.
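For illustration, given a made-up notes file like this one, the function returns one string per block of consecutive quote lines (the file and output below are mine, just to show the behaviour):

```python
# Contents of a hypothetical notes.md:
#
#   Some commentary about the chapter.
#
#   > First quote, first line.
#   > First quote, second line.
#
#   More commentary.
#
#   > Second quote.

quotes = extract_quotes("notes.md")
print(quotes)
# ['> First quote, first line.\n> First quote, second line.', '> Second quote.']
```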
Then, I insert all the quotes into a local database:
```python
import os
import sqlite3

def recreate_database(db_path):
    # Start from a blank database on every run
    if os.path.exists(db_path):
        os.remove(db_path)
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("CREATE TABLE quotes (id INTEGER PRIMARY KEY, content TEXT)")
    conn.commit()
    return conn
```
Once the SQLite database is created, insert each quote into it:
```python
def insert_quotes_to_db(conn, quotes):
    cur = conn.cursor()
    for quote in quotes:
        cur.execute("INSERT INTO quotes (content) VALUES (?)", (quote,))
    conn.commit()
```
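As a side note, since this is a single batch of inserts, sqlite3's `executemany` would do the same work in one call; a minor variation on the same function:

```python
def insert_quotes_to_db(conn, quotes):
    # Same effect as the loop above: one INSERT per quote, one commit
    conn.executemany(
        "INSERT INTO quotes (content) VALUES (?)",
        [(quote,) for quote in quotes],
    )
    conn.commit()
```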
Bringing everything together like this:
```python
import click

@click.command()
@click.argument("input_markdown_file", type=click.Path(exists=True))
def main(input_markdown_file):
    """Process Markdown files and generate output with clustered quotes."""
    conn = recreate_database("quotes.db")
    quotes = extract_quotes(input_markdown_file)
    insert_quotes_to_db(conn, quotes)

if __name__ == "__main__":
    main()
```
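With the `if __name__ == "__main__"` guard in place, running the script (named `group-quotes.py`, as in the full version included at the end) against a notes file looks like this; the notes file name is just an example:

```bash
python group-quotes.py my-book-notes.md
```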
Alternatively, you can create the database with `sqlite-utils` and populate it with a loop (but multi-line quotes aren't taken into account. Python wins.):
```bash
sqlite-utils create-table quotes.db quotes id integer content text --pk=id --replace

grep '^>' "$INPUT_MARKDOWN_FILE" | while IFS= read -r line; do
  echo "$line" | jq -R '{"content": . }' -j | sqlite-utils insert quotes.db quotes -
done
```
## Getting the clusters
That's really where the "magic" happens. Now that we have our local database, we can use `llm` to create the embeddings, and then the clusters:
```bash
llm embed-multi quotes -d quotes.db --sql "SELECT id, content FROM quotes WHERE content <> ''" --store
```
Avoiding the empty lines is mandatory, otherwise the OpenAI API fails without much explanation (and unless I missed it, at the moment `llm` doesn't generate embeddings with local models). It actually took me some time to figure out why the API calls were failing.
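As a quick sanity check that the embeddings were stored (my own addition, not required by the pipeline), the collection can be queried with `llm similar`, which returns the stored quotes closest to an arbitrary piece of text:

```bash
llm similar quotes -d quotes.db -c "les relations amoureuses" -n 3
```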
Then, we generate the clusters and the summaries:
```bash
llm cluster quotes 5 -d quotes.db --summary --prompt "Titre court pour l'ensemble de ces citations"
```

The prompt asks for "a short title for this set of quotes"; I kept it in French since the quotes themselves are. This outputs something like:
```json
[
  {
    "id": "0",
    "items": [
      {
        "id": "1",
        "content": "> En se contentant de leur coller l'\u00e9tiquette d'oppresseurs et de les rejeter, nous \u00e9vitions de mont"
      },
      {
        "id": "10",
        "content": "> Toute personne qui essaie de vivre l'amour avec un partenaire d\u00e9pourvu de conscience affective sou"
      },
      // <snip>
    ],
    "summary": "Dynamiques émotionelles dans les relations genrées"
  },
  // etc.
]
```
## Output as markdown
The last part is to put everything back together and print it to `STDOUT`. It's a simple loop which looks up each item in the database, prints the summary of each group, and then the quotes.
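As a rough sketch of that loop (assuming the JSON from `llm cluster` was saved to a file; the function name and details here are mine, the real script below may differ):

```python
import json
import sqlite3

def print_clusters_as_markdown(db_path, clusters_path):
    conn = sqlite3.connect(db_path)
    with open(clusters_path) as f:
        clusters = json.load(f)
    for cluster in clusters:
        # The generated summary becomes a section title
        print(f"## {cluster['summary']}\n")
        for item in cluster["items"]:
            # Look up the full text in the database, since `llm cluster`
            # truncates the content field (ids come back as strings)
            row = conn.execute(
                "SELECT content FROM quotes WHERE id = ?", (int(item["id"]),)
            ).fetchone()
            if row:
                print(f"{row[0]}\n")
```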
You can find the full script included below:
{!extra/scripts/group-quotes.py!}