---
title: Categorizing book quotes in clusters
tags: llm, markdown, clusters, sqlite
---
When I'm reading a theory book, I often take notes: it helps me remember the contents, makes it easier to get back to the book later on, and overall helps me organize my thoughts.
I was looking for an excuse to use LLM embeddings, to better understand what they are and how they work, so I took a stab at categorizing my quotes into different groups (called clusters).
Here is what I did:
1. Extract the quotes from the Markdown files, and put them in a SQLite database;
2. Create an embedding for each of the quotes. Embeddings are a numerical (vector) representation of the content, and can be compared with the embeddings of other contents (see the sketch after this list);
3. Run a clustering algorithm (the `llm-cluster` plugin uses [k-means](https://en.wikipedia.org/wiki/K-means_clustering)) on the embeddings, to find the groups;
4. Organize these quotes in a new Markdown document.
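To make the "compare" part of step 2 concrete, here is a minimal sketch of comparing two embeddings with cosine similarity. The vectors here are made up; real embeddings have hundreds or thousands of dimensions:
```python
import math

def cosine_similarity(a, b):
    # Close to 1.0: the two contents are similar; close to 0.0: unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" of two quotes.
quote_a = [0.2, 0.8, 0.1]
quote_b = [0.25, 0.75, 0.05]
print(cosine_similarity(quote_a, quote_b))  # prints ~0.99: very similar
```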
I went with the [llm](https://llm.datasette.io/) command-line tool and its `embed-multi` feature, plus the [`llm-cluster` plugin](https://github.com/simonw/llm-cluster) to group the notes, and I wrote a Python script to glue everything together.
In the end, I'm happy to have learnt how to make this work, but… the end results aren't as good as I expected them to be, unfortunately. Maybe that's because creating these clusters myself is where I actually learn, and automating it doesn't bring much value to me.
Grouping the quotes manually and removing the ones that repeat themselves seems to lead to a more precise and "to the point" document.
That being said, here's how I did it. The main goal, after all, was to understand how it works!
## Extracting quotes from markdown files
First, I extracted the quotes to put them in a local SQLite database. Here is the Python code I used:
```python
def extract_quotes(input_file):
    quotes = []
    quote_lines = []
    with open(input_file, "r") as file:
        for line in file:
            if line.startswith(">"):
                quote_lines.append(line.rstrip())
            elif quote_lines:
                quotes.append("\n".join(quote_lines).strip())
                quote_lines = []
    # Don't lose a quote that ends on the last line of the file.
    if quote_lines:
        quotes.append("\n".join(quote_lines).strip())
    return quotes
```
This reads the file line by line and groups together consecutive lines starting with a `>`. I'm not even using a Markdown parser here. I went with Python because it seemed easier to handle multi-line quotes.
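As a quick illustration, with a hypothetical `notes.md` file:
```python
# notes.md contains:
#   Some prose.
#   > First line of a quote
#   > second line of the same quote
#   More prose.
#   > Another quote
quotes = extract_quotes("notes.md")
print(quotes)
# ['> First line of a quote\n> second line of the same quote', '> Another quote']
```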
Then, I created a local database to hold them:
```python
import os
import sqlite3

def recreate_database(db_path):
    # Start from scratch on each run.
    if os.path.exists(db_path):
        os.remove(db_path)
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("CREATE TABLE quotes (id INTEGER PRIMARY KEY, content TEXT)")
    conn.commit()
    return conn
```
Once the SQLite database is created, insert each quote into it:
```python
def insert_quotes_to_db(conn, quotes):
    cur = conn.cursor()
    for quote in quotes:
        cur.execute("INSERT INTO quotes (content) VALUES (?)", (quote,))
    conn.commit()
```
Bringing everything together like this:
```python
import click

@click.command()
@click.argument("input_markdown_file", type=click.Path(exists=True))
def main(input_markdown_file):
    """Process Markdown files and generate output with clustered quotes."""
    conn = recreate_database("quotes.db")
    quotes = extract_quotes(input_markdown_file)
    insert_quotes_to_db(conn, quotes)

if __name__ == "__main__":
    main()
```
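Assuming the script is saved as `extract_quotes.py` (the name is mine), you can run it and check the result with `sqlite3`:
```bash
python extract_quotes.py my-notes.md
sqlite3 quotes.db "SELECT count(*) FROM quotes;"
```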
Alternatively, you can create the database with `sqlite-utils` and populate it with a loop (but multi-line quotes aren't handled here; Python wins):
```bash
sqlite-utils create-table quotes.db quotes id integer content text --pk=id --replace
grep '^>' "$INPUT_MARKDOWN_FILE" | while IFS= read -r line; do
echo "$line" | jq -R '{"content": . }' -j | sqlite-utils insert quotes.db quotes -
done
```
---
## Getting the clusters
That's really where the "magic" happens. Now that we have our local database, we can use `llm` to create the embeddings, and then the clusters:
```bash
llm embed-multi quotes -d quotes.db --sql "SELECT id, content FROM quotes WHERE content <> ''" --store
```
Skipping the empty lines is mandatory; otherwise the OpenAI API fails without much explanation (and unless I missed it, at the moment `llm` doesn't generate embeddings with local models). It actually took me some time to figure out why the API calls were failing.
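As a sanity check, `llm` can search the stored embeddings for quotes similar to an arbitrary string (the query text here is just an example):
```bash
llm similar quotes -d quotes.db -c "l'amour et les relations"
```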
Then, we generate the clusters (5 of them here) and their summaries. The prompt is French for "Short title for this set of quotes":
```bash
llm cluster quotes 5 -d quotes.db --summary --prompt "Titre court pour l'ensemble de ces citations"
```
Which outputs something like this:
```json
[
{
"id": "0",
"items": [
{
"id": "1",
"content": "> En se contentant de leur coller l'\u00e9tiquette d'oppresseurs et de les rejeter, nous \u00e9vitions de mont"
},
{
"id": "10",
"content": "> Toute personne qui essaie de vivre l'amour avec un partenaire d\u00e9pourvu de conscience affective sou"
},
// <snip>
],
"summary": "Dynamiques émotionelles dans les relations genrées"
},
// etc.
]
```
## Output as markdown
The last part is to put everything back together and print it to `STDOUT`. It's a simple loop that looks up each item in the database (as you can see above, the cluster output truncates the quotes), prints each group's summary, and then its quotes.
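Condensed, that loop looks something like this (a sketch, assuming the `llm cluster` output was saved to a `clusters.json` file):
```python
import json
import sqlite3

conn = sqlite3.connect("quotes.db")
cur = conn.cursor()

with open("clusters.json") as f:
    clusters = json.load(f)

for cluster in clusters:
    # The generated summary becomes the section title...
    print(f"## {cluster['summary']}\n")
    for item in cluster["items"]:
        # ...and each quote is fetched back in full from the database.
        cur.execute("SELECT content FROM quotes WHERE id = ?", (int(item["id"]),))
        row = cur.fetchone()
        if row:
            print(row[0] + "\n")
```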
You can find [the full script here](/extra/scripts/group-quotes.py), included below:
```python
{!extra/scripts/group-quotes.py!}
```