I figured that being able to do simple image analysis with an LLM would open up a lot of possibilities, so I went ahead and built something.

Setting up the environment

Heads up: this is written with a Mac in mind!

Ollama

First, install Ollama

brew install --cask ollama

Start the server

ollama serve

LLaVA

Install LLaVA (the first run downloads the model, then opens an interactive prompt)

ollama run llava
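
Before moving on, it can help to confirm that the server is reachable and that the model really was downloaded. Here is a minimal sketch, assuming Ollama is listening on its default port 11434 and lists installed models at /api/tags. The helper names `has_model` and `fetch_installed_models` are made up for this example, and the sample payload is hand-written.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
TAGS_URL = "http://localhost:11434/api/tags"


def has_model(tags_payload, name):
    """Return True if a model called `name` (any tag) is in the /api/tags payload."""
    for model in tags_payload.get("models", []):
        # Installed names look like "llava:latest"; compare the part before ":".
        if model.get("name", "").split(":")[0] == name:
            return True
    return False


def fetch_installed_models(url=TAGS_URL, timeout=5):
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Offline demonstration with a hand-written payload shaped like the real one.
sample = {"models": [{"name": "llava:latest"}, {"name": "llama3:8b"}]}
print(has_model(sample, "llava"))
print(has_model(sample, "mistral"))
```

With the server running, `has_model(fetch_installed_models(), "llava")` should tell you whether the download finished.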

Implementing it in Python

Create a working directory (any name will do)

mkdir ~/ollama-llava

Create a directory to hold the images you want to load

mkdir -p ~/ollama-llava/media

Create a file named image_analysis.py

import requests
import base64
import json

def encode_image(image_path):
    """Read an image file and return its contents as a Base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def analyze_image(image_path):
    """Send the image to LLaVA through Ollama and print the streamed description."""
    base64_image = encode_image(image_path)

    # Request body: model name, prompt, and the Base64-encoded image
    data = {
        'model': 'llava',
        'prompt': 'Explain in detail what you see in this image.',
        'images': [base64_image]
    }

    # stream=True lets us read the reply piece by piece as it is generated
    response = requests.post('http://localhost:11434/api/generate',
                             headers={'Content-Type': 'application/json'},
                             json=data,
                             stream=True)

    if response.status_code == 200:
        full_response = ''
        # Each non-empty line is one JSON object holding a fragment of the answer
        for line in response.iter_lines():
            if line:
                json_response = json.loads(line)
                if 'response' in json_response:
                    full_response += json_response['response']
                    print(json_response['response'], end='', flush=True)
                # The final chunk is marked with done=true
                if json_response.get('done', False):
                    break
        return full_response
    else:
        # Show the error here; otherwise the caller would never display it
        print(f"Error: {response.status_code} - {response.text}")
        return None

if __name__ == "__main__":
    # Specify the image path directly
    image_path = "/Users/USER/ollama-llava/media/image.jpg"

    print(f"Analyzing image: {image_path}")
    result = analyze_image(image_path)
    if result is not None:
        print("\nAnalysis complete.")

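I had our trusty Claude explain it:

This code is a Python script that does the following:

1. Defines an `encode_image` function that Base64-encodes an image file.

2. Implements an `analyze_image` function that performs the image analysis:
   - Base64-encodes the image
   - Sends a request to the Ollama API
   - Receives the response as a stream and prints the result

3. Main routine:
   - Specifies the path to the image file
   - Calls `analyze_image` to run the image analysis
   - Displays the result

The script analyzes the specified image through the Ollama API and describes its contents in detail.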
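
The two core steps in that explanation, building the request body and stitching the streamed reply back together, can be tried without a running server. This is a rough sketch: `build_payload` and `collect_stream` are hypothetical helper names, and the fake stream is hand-written to imitate the newline-delimited JSON chunks that the script reads with `iter_lines()`.

```python
import base64
import json


def build_payload(image_bytes, prompt, model="llava"):
    """Build the JSON body for /api/generate with one Base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("utf-8")],
    }


def collect_stream(lines):
    """Join the 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done", False):
            break  # the final chunk carries done=true
    return "".join(parts)


# Hand-written chunks imitating what the server streams back.
fake_stream = [
    b'{"response": "A blue ", "done": false}',
    b"",
    b'{"response": "cat.", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(fake_stream))
print(build_payload(b"abc", "Describe this.")["images"])
```

Keeping the parsing separate from the HTTP call like this makes it easy to test the logic on canned data before pointing it at the real API.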

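Put an image named image.jpg in ~/ollama-llava/media (and replace USER in the script's path with your own username), run the code, and you're good to go 👍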
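
If you would rather not hard-code the username, something like the following might work. It is only a sketch, assuming the directory layout created earlier; `resolve_image` is a hypothetical helper, not part of the original script.

```python
from pathlib import Path


def resolve_image(name="image.jpg", base=None):
    """Return the path to `name` inside <base>/ollama-llava/media, or raise."""
    # Default to the current user's home directory
    base = Path(base) if base is not None else Path.home()
    image_path = base / "ollama-llava" / "media" / name
    if not image_path.is_file():
        raise FileNotFoundError(f"Put an image at {image_path} first")
    return image_path


try:
    print(f"Found image: {resolve_image()}")
except FileNotFoundError as err:
    print(err)
```

The returned `Path` can be passed straight to `open()`, so it could stand in for the hard-coded `image_path` string.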

Running it

As a test, let's feed it an image of Doraemon

Doraemon

Source: ドラえもんと仲間たち

python image_analysis.py

Output:

In the center of a white background, a blue cartoon character, Doraemon from the Japanese anime series, is captured in mid-run. The character's body is facing towards the right side of the image, but it's head and tail are angled to the left, giving the impression that Doraemon is turning its attention from one direction to another. Doraemon is not just running; it's also engaged in an activity. It's holding a yellow pencil with its right hand, suggesting some form of drawing or writing. The character's head is adorned with a red bow tie, adding a touch of color and personality to the image. A notable detail about Doraemon is its open mouth. It seems to be laughing or shouting, possibly in surprise or excitement. This adds a dynamic element to the otherwise static image. The simplicity of the background allows Doraemon to stand out as the main focus of this image. The white background also provides a stark contrast to Doraemon's blue body, further emphasizing its presence in the frame.
Analysis complete.

Japanese translation via DeepL:

白い背景の中央で、日本のアニメシリーズに登場する青いキャラクター、ドラえもんが疾走している。ドラえもんの体は画像の右側に向いているが、頭と尻尾は左側に傾いており、ドラえもんがある方向から別の方向へと注意を向けているように見える。ドラえもんはただ走っているだけでなく、活動もしている。右手には黄色い鉛筆を持ち、絵を描いたり文字を書いたりしているようだ。ドラえもんの頭には赤い蝶ネクタイが付けられ、色彩と個性が加えられている。ドラえもんの特筆すべき点は、口を開けていることだ。驚いているのか、興奮しているのか、笑っているようにも、叫んでいるようにも見える。これは、静的なイメージにダイナミックな要素を加えている。背景がシンプルなので、ドラえもんがこの画像の主役として際立っている。また、白い背景はドラえもんの青い体とのコントラストを際立たせ、フレーム内での存在感をさらに強調している。分析完了。

...Isn't that pretty amazing?

Let's try another image. Can it read this one, with its striking hand sign?

榎原依那

Source: Instagram

Output:

The image is a photograph featuring a person in the foreground making an "OK" hand gesture and smiling. This individual appears to be standing on a sandy beach, with the ocean visible in the background under a partly cloudy sky. The person has short hair, wears a red bikini top and blue bottoms, and is wearing a bracelet or watch on their right wrist. The person's pose suggests they are engaging with the viewer or camera, as they are making the universal "OK" sign which is commonly used to indicate approval or all-rightness. There is no visible text within the image to provide additional context or information. The style of the image is a casual, candid photograph that captures a moment of leisure or relaxation.

Japanese translation via DeepL:

画像は、手前の人物が「OK」のジェスチャーをして微笑んでいる写真。この人物は砂浜に立っているように見え、背景には一部曇り空の下に海が見える。

この人物はショートヘアで、赤いビキニのトップスと青いボトムスを着ており、右手首にブレスレットか腕時計をしている。
この人物のポーズは、一般的に承認や問題ないことを示すのに使われる「OK」のサインをしていることから、視聴者やカメラと関わっていることを示唆している。
画像内には、追加的な文脈や情報を提供するテキストは見えない。
画像のスタイルは、レジャーやリラックスの瞬間を捉えた、カジュアルで率直な写真である。

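This is so much fun! Amazing!!

For example, I could see building a system that, behind the scenes of a service you run, attaches LLM-generated tags to images and uses them for recommendations.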
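
To make the tagging idea a bit more concrete, here is a toy sketch. It assumes you swap the prompt for something like "List five tags for this image, separated by commas" and that the model actually replies in that format, which is not guaranteed. The function names, the sample replies, and the choice of Jaccard similarity for ranking are all my own illustration, not something tested against real LLaVA output.

```python
def parse_tags(model_output):
    """Turn a comma-separated reply such as 'Beach, sea , smile' into clean tags."""
    tags = []
    for raw in model_output.split(","):
        tag = raw.strip().lower().rstrip(".")
        if tag and tag not in tags:
            tags.append(tag)
    return tags


def jaccard(a, b):
    """Share of tags two images have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(tags_by_image, query_image, top_n=3):
    """Rank the other images by tag overlap with `query_image`."""
    query_tags = tags_by_image[query_image]
    scored = [
        (jaccard(query_tags, tags), name)
        for name, tags in tags_by_image.items()
        if name != query_image
    ]
    # Highest score first; ties broken by file name for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_n] if score > 0]


# Made-up replies standing in for real LLaVA output.
catalog = {
    "beach1.jpg": parse_tags("Beach, sea, smile, OK sign."),
    "beach2.jpg": parse_tags("beach, sea, sunset"),
    "robot.jpg": parse_tags("cartoon, blue, robot"),
}
print(recommend(catalog, "beach1.jpg"))
```

A real service would probably store the tags in a database and use something smarter than set overlap, but this is enough to see the shape of the idea.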

Image + LLM opens up all sorts of possibilities, and I have a very strong feeling that some fun services are on the way.

Easy local image analysis with LLaVA + Ollama
