PostgreSQL Deep Dive: PostgreSQL 9.5日本語マニュアルの検索システムをリリースしました（追記あり）

PostgreSQL 9.5の日本語マニュアルの検索システムをリリースしたので、ご紹介します。

PostgreSQL 9.5 マニュアル検索
http://snaga-lab.mybluemix.net/static/pgdoc/index.html

少し前からPostgreSQLのマニュアルを細かく調べる必要性が出てきたのですが、ご存じの通り、PostgreSQLのオンラインのマニュアルはGoogleと相性が良くありません。

本当はgrep -cでもいいくらいの機能なのですが、公開されているフォーマットがHTML、マニュアルのソースファイルはSGMLファイルなので、実際にそのままgrepしても、見栄え的にあまり嬉しくありません。

そのため、自分の開発の練習もかねてWebアプリとして作ってみました。

■マニュアル検索システムの機能

検索システムのURLは以下です。

PostgreSQL 9.5 マニュアル検索
http://snaga-lab.mybluemix.net/static/pgdoc/index.html

検索対象となるのは、日本PostgreSQLユーザ会が翻訳して以下で公開しているPostgreSQL 9.5の日本語マニュアルで、リリースノートと索引を除いたページです。

PostgreSQL 9.5.4文書
http://www.postgresql.jp/document/9.5/html/index.html

マニュアルのページ自体は、日本PostgreSQLユーザ会のマニュアルをオンラインで参照する形になっています。

検索キーワードを入力すると、キーワードが合致した回数が多いページから順に表示します。検索結果のページタイトルの右側に表示されている数字は、キーワードが合致した回数です。1回でも合致したページはすべて表示します。

また、複数のキーワードをスペースで区切って入力することで、OR検索またはAND検索ができます。OR検索では、「いずれかのキーワードを含むページ」を表示します。AND検索では「すべてのキーワードを含むページ」を表示します。

OR検索の場合は「単語単語」のように単にスペースで区切って入力してください。AND検索の場合には「単語 +単語」のように、いずれかの単語の最初にプラス記号を付加してください。プラス記号を付加するとAND検索に切り替わります。

「性能」や「パフォーマンス」などのように言い換えた言葉で記載されている個所をすべて確認したい場合にはOR検索を使うといいでしょう。一方で、「ログ」、「アーカイブ」のように異なる単語で絞り込む場合には、AND検索を使うといいでしょう。

単語の出現回数をもとにランキングする仕組みであるため、ページの長さでnormalize（単語の出現回数を文書の長さで割る）しようかとも思ったのですが、「長いページである」というのもひとつの情報であり、そのままにした方がより価値があると考えたためnormalizeはしていません。

■検索の仕組み

この検索システムは、Python用のWebフレームワークであるFlaskを使ったWebアプリケーションとして実装されています。

以前のエントリで紹介したように、PostgreSQLのマニュアルをwgetで取得、データベースに格納して、HTMLからプレーンテキストに変換し、それを検索対象としています。

TF-IDFでデータベース内の類似テキストを検索する Part 2 （実践編）
http://pgsqldeepdive.blogspot.jp/2016/07/tf-idf-part-2.html

今回のテーブルは以下の通りです。

testdb=> \d pgdoc
                            Table "public.pgdoc"
  Column  |  Type   |                       Modifiers
----------+---------+-------------------------------------------------------
 docid    | integer | not null default nextval('pgdoc_docid_seq'::regclass)
 filename | text    | not null
 html     | text    | not null
 plain    | text    |
 title    | text    |
Indexes:
    "pgdoc_pkey" PRIMARY KEY, btree (docid)

ページごとのスコア（キーワード出現回数）を取得するために、ドキュメントとキーワードのtext配列を渡すと、合致した回数をスコアとして返却するPL/PythonのSQL関数を作成します。

CREATE OR REPLACE FUNCTION pgdoc_score(doc text, q text[])
  RETURNS float8
AS $$
  import re

  score = 0
  for t in q:
    f = re.findall(t, doc, flags=re.IGNORECASE)
    score += len(f)

  return score
$$
LANGUAGE 'plpython2u';

この関数は、以下のように大文字小文字を違いを無視して、キーワードが合致した回数を返却します。

testdb=> SELECT pgdoc_score('foo foo bar bar bar', '{"foo", "BAR"}');
 pgdoc_score
-------------
           5
(1 row)

testdb=>

このSQL関数を使って、指定したキーワードに対するスコアを取得し、それをスコア順に並べ替えて表示しています。

testdb=> SELECT docid,filename,title,pgdoc_score(plain, '{WAL}') FROM pgdoc ORDER BY 4 DESC LIMIT 5;
 docid |            filename             |                           title                            | pgdoc_score
-------+---------------------------------+------------------------------------------------------------+-------------
  1181 | continuous-archiving.html       | 24.3. 継続的アーカイブとポイントインタイムリカバリ（PITR） |          85
   178 | runtime-config-wal.html         | 18.5. ログ先行書き込み（WAL）                              |          77
    69 | warm-standby.html               | 25.2. ログシッピングスタンバイサーバ                       |          70
   907 | wal-configuration.html          | 29.4. WALの設定                                            |          56
  1011 | runtime-config-replication.html | 18.6. レプリケーション                                     |          40
(5 rows)

■pg_bigmを試してみる

なお、全文検索ということでpg_bigmを使ってみたのですが、これくらいの文章量だと少なすぎてインデックスをうまく使ってくれませんでした。実行プランを見ても、シーケンシャルスキャンの方がコストが低いと判断されているようです。

testdb=> CREATE INDEX pgdoc_plain_idx ON pgdoc USING GIN (plain gin_bigm_ops);
CREATE INDEX
testdb=> EXPLAIN ANALYZE SELECT
testdb->   docid,
testdb->   title,
testdb->   filename,
testdb->   pgdoc_score(plain,'{ログ,アーカイブ}')
testdb-> FROM
testdb->   pgdoc
testdb-> WHERE
testdb->   docid in (SELECT docid FROM pgdoc WHERE (plain ILIKE '%ログ%' AND plain ILIKE '%アーカイブ%')  AND filename NOT LIKE 'release-%' AND filename <> 'bookindex.html')
testdb-> ORDER BY
testdb->   4 DESC;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=170.75..170.75 rows=1 width=755) (actual time=201.327..201.362 rows=38 loops=1)
   Sort Key: (pgdoc_score(pgdoc.plain, '{ログ,アーカイブ}'::text[]))
   Sort Method: quicksort  Memory: 30kB
   ->  Nested Loop  (cost=0.28..170.74 rows=1 width=755) (actual time=2.166..201.201 rows=38 loops=1)
         ->  Seq Scan on pgdoc pgdoc_1  (cost=0.00..162.18 rows=1 width=4) (actual time=1.387..169.877 rows=38 loops=1)
               Filter: ((plain ~~* '%ログ%'::text) AND (plain ~~* '%アーカイブ%'::text) AND (filename !~~ 'release-%'::text) AND (filename <> 'bookindex.html'::text))
               Rows Removed by Filter: 1271
         ->  Index Scan using pgdoc_pkey on pgdoc  (cost=0.28..8.30 rows=1 width=755) (actual time=0.004..0.006 rows=1 loops=38)
               Index Cond: (docid = pgdoc_1.docid)
 Planning time: 1.654 ms
 Execution time: 201.443 ms
(11 行)

enable_seqscanパラメータをoffにして強制的にインデックスを使うようにしてみましたが、それでもpg_bigmのインデックスではなく主キーのインデックスが使われてしまいました。しかも推定コストはこちらの方が高く、実際の実行時間もほとんど変わりませんでした。

testdb=> set enable_seqscan TO off;
SET
testdb=> EXPLAIN ANALYZE SELECT
testdb->   docid,
testdb->   title,
testdb->   filename,
testdb->   pgdoc_score(plain,'{ログ,アーカイブ}')
testdb-> FROM
testdb->   pgdoc
testdb-> WHERE
testdb->   docid in (SELECT docid FROM pgdoc WHERE (plain ILIKE '%ログ%' AND plain ILIKE '%アーカイブ%')  AND filename NOT LIKE 'release-%' AND filename <> 'bookindex.html')
testdb-> ORDER BY
testdb->   4 DESC;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=381.05..381.05 rows=1 width=755) (actual time=200.820..200.854 rows=38 loops=1)
   Sort Key: (pgdoc_score(pgdoc.plain, '{ログ,アーカイブ}'::text[]))
   Sort Method: quicksort  Memory: 30kB
   ->  Nested Loop  (cost=0.56..381.04 rows=1 width=755) (actual time=3.918..200.702 rows=38 loops=1)
         ->  Index Scan using pgdoc_pkey on pgdoc pgdoc_1  (cost=0.28..372.48 rows=1 width=4) (actual time=3.136..169.352 rows=38 loops=1)
               Filter: ((plain ~~* '%ログ%'::text) AND (plain ~~* '%アーカイブ%'::text) AND (filename !~~ 'release-%'::text) AND (filename <> 'bookindex.html'::text))
               Rows Removed by Filter: 1271
         ->  Index Scan using pgdoc_pkey on pgdoc  (cost=0.28..8.30 rows=1 width=755) (actual time=0.004..0.005 rows=1 loops=38)
               Index Cond: (docid = pgdoc_1.docid)
 Planning time: 1.672 ms
 Execution time: 200.934 ms
(11 行)

（10/02 14:25追記）

と思っていたのですが、よくよく調べてみたら、pg_bigmは「LIKE」には対応しているけれども、「ILIKE」には対応していないようです。

pg_bigm 1.1 ドキュメント
http://pgbigm.osdn.jp/pg_bigm-1-1.html

それが原因で、pg_bigmのインデックスが使われなかったようです。

そのため、検索対象となるカラム plain のアルファベットをすべて小文字に変換して、クエリで検索する文字列もすべて小文字に揃えた上で、「ILIKE」から「LIKE」に変更したら、無事にpg_bigmのインデックスを使ってくれるようになりました。

testdb=> EXPLAIN ANALYZE SELECT
testdb->   docid,
testdb->   title,
testdb->   filename,
testdb->   pgdoc_score(plain,'{ログ,アーカイブ}')
testdb-> FROM
testdb->   pgdoc
testdb-> WHERE
testdb->   docid in (SELECT docid FROM pgdoc WHERE (plain LIKE '%ログ%' AND plain LIKE '%アーカイブ%')  AND filename NOT LIKE 'release-%' AND filename <> 'bookindex.html')
testdb-> ORDER BY
testdb->   4 DESC;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=76.59..76.59 rows=1 width=625) (actual time=53.287..53.321 rows=38 loops=1)
   Sort Key: (pgdoc_score(pgdoc.plain, '{ログ,アーカイブ}'::text[]))
   Sort Method: quicksort  Memory: 30kB
   ->  Nested Loop  (cost=64.28..76.58 rows=1 width=625) (actual time=1.553..53.171 rows=38 loops=1)
         ->  Bitmap Heap Scan on pgdoc pgdoc_1  (cost=64.00..68.02 rows=1 width=4) (actual time=0.427..21.555 rows=38 loops=1)
               Recheck Cond: ((plain ~~ '%ログ%'::text) AND (plain ~~ '%アーカイブ%'::text))
               Rows Removed by Index Recheck: 2
               Filter: ((filename !~~ 'release-%'::text) AND (filename <> 'bookindex.html'::text))
               Rows Removed by Filter: 45
               Heap Blocks: exact=38
               ->  Bitmap Index Scan on pgdoc_plain_idx  (cost=0.00..64.00 rows=1 width=0) (actual time=0.124..0.124 rows=85 loops=1)
                     Index Cond: ((plain ~~ '%ログ%'::text) AND (plain ~~ '%アーカイブ%'::text))
         ->  Index Scan using pgdoc_pkey on pgdoc  (cost=0.28..8.30 rows=1 width=625) (actual time=0.004..0.006 rows=1 loops=38)
               Index Cond: (docid = pgdoc_1.docid)
 Planning time: 1.264 ms
 Execution time: 53.417 ms
(16 rows)

というわけで、検索も高速化（200ms→50ms）したことなので、pg_bigmを使うように修正しました。

■まとめ

今回は、PostgreSQLで構築したPostgreSQLの日本語マニュアルの検索システムをご紹介しました。

PostgreSQLを使いこんでくると、パラメータ名などでマニュアルを隅々まで検索したい、といったニーズが出てくることがあります。

Googleなどの検索エンジンでざっくりと検索するだけでは検索の精度が足りないケースが出てくると思いますので、そのような場合に活用していただければと思います。

では、また。

PostgreSQL Deep Dive

2016年10月1日

PostgreSQL 9.5日本語マニュアルの検索システムをリリースしました（追記あり）

■マニュアル検索システムの機能

■検索の仕組み

■pg_bigmを試してみる

■まとめ

0 件のコメント:

コメントを投稿