A CGI script that uses mecab to count how often each noun appears in a text

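The program does exactly what the title says. It needs mecab, the morphological analysis software, installed with UTF-8 as its default character encoding. The script also has to sit directly under cgi-bin, because the form posts to /cgi-bin/split_and_count.py.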
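
One way to check the encoding requirement is to ask mecab for its dictionary information. The sketch below is Python 3 and assumes that `mecab -D` prints a `charset:` line; `parse_charset`, `mecab_charset`, and the sample text are made up for illustration, so treat it as a starting point rather than a tested recipe.

```python
import shutil
import subprocess


def parse_charset(dictionary_info):
    """Return the value of the first 'charset:' line, or None if absent."""
    for line in dictionary_info.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "charset":
            return value.strip()
    return None


def mecab_charset():
    """Ask the installed mecab for its dictionary charset (None if not found)."""
    if shutil.which("mecab") is None:
        return None
    # Assumption: `mecab -D` prints dictionary info including a charset line.
    result = subprocess.run(["mecab", "-D"], capture_output=True)
    return parse_charset(result.stdout.decode("utf-8", "replace"))


# Hand-written sample of the assumed layout, not real mecab output.
SAMPLE_INFO = "filename:\t/path/to/sys.dic\ncharset:\tutf8\n"
print(parse_charset(SAMPLE_INFO))
```

If this reports something other than utf8 (euc-jp, for instance), the script here would presumably produce garbled output.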

mecab is fun. I wonder how it works under the hood.

As a first test, I fed it a Japanese text of Genesis that I picked up somewhere.
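
The script that follows depends on the shape of mecab's output: one token per line, made of the surface form, a tab, and comma-separated features, with `EOS` closing each sentence. Here is a rough Python 3 sketch of picking the nouns out of that output. The sample lines are hand-written in the layout the IPA dictionary is assumed to use, not captured from a real run, and `extract_nouns` is a name invented for this example.

```python
# Hand-written sample in the assumed IPA-dictionary layout
# (surface, TAB, comma-separated features); not captured from a real run.
SAMPLE_OUTPUT = (
    "神\t名詞,一般,*,*,*,*,神,カミ,カミ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "お\t接頭詞,名詞接続,*,*,*,*,お,オ,オ\n"
    "言葉\t名詞,一般,*,*,*,*,言葉,コトバ,コトバ\n"
    "EOS\n"
)


def extract_nouns(mecab_output):
    """Return the surface forms whose first feature (part of speech) is 名詞."""
    nouns = []
    for line in mecab_output.splitlines():
        surface, tab, features = line.partition("\t")
        if not tab:
            continue  # "EOS" and blank lines have no feature column
        if features.split(",")[0] == "名詞":
            nouns.append(surface)
    return nouns


print(extract_nouns(SAMPLE_OUTPUT))
```

With this sample the result is 神 and 言葉. The prefix お is skipped even though its features contain 名詞接続, because only the first feature is compared.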

#! /usr/bin/python
# coding: utf-8
# split_and_count.py

"""Run morphological analysis with mecab and display how often each noun appears.
"""
import popen2
import cgi
import cgitb; cgitb.enable()
form = cgi.FieldStorage()
text = form.getvalue("text", "")

CMD = "mecab | grep -v EOS"
TMPL = """Content-Type: text/html; charset=utf-8

<html>
  <head>
    <title>split and count</title>
  </head>
  <body>
    <h1>split and count</h1>
    <form action="/cgi-bin/split_and_count.py" method="post">
      <textarea name="text"></textarea>
      <input type="submit" value="split"/>
    </form>
    <table>
    <tr><th>word</th><th>count</th></tr>
    %(words)s
    </table>
  </body>
</html>
"""

def wordcount(lst):
    """Return a dict mapping each word in lst to its number of occurrences."""
    dic = {}
    for n in lst:
        if n in dic:
            dic[n] += 1
        else:
            dic[n] = 1
    return dic

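def helper(sout):
    """Return the surface form of every noun in mecab's output."""
    lst = []
    for line in sout.read().splitlines():
        # Each line is "surface<TAB>feature,feature,..." and the first feature
        # is the part of speech. Comparing it exactly avoids picking up lines
        # that merely contain "名詞", such as the "名詞接続" prefix subtype.
        surface, tab, features = line.partition("\t")
        if tab and features.split(",")[0] == "名詞":
            lst.append(surface)
    return lst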

def main():
    sout,sin = popen2.popen2(CMD)
    sin.write(text)
    sin.close()
    
    c = helper(sout)
    dic = wordcount(c)
    
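    # Build the table rows, most frequent noun first.
    rows = []
    for n in sorted(dic, key=dic.get, reverse=True):
        # Escape the word so markup in the submitted text is not rendered.
        rows.append("<tr><td>" + cgi.escape(n) + "</td><td>" + str(dic[n]) + "</td></tr>")
    print TMPL % {"words": "".join(rows)}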

main()
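
For anyone on Python 3, where `popen2` no longer exists, the same pipeline could be sketched roughly as follows. This is a rewrite under assumptions, not the script above: `run_mecab` and `render_rows` are names invented here, `subprocess.run` stands in for `popen2`, and `collections.Counter` stands in for the hand-rolled dictionary. `run_mecab` needs mecab on the PATH, so the demo line at the bottom only exercises the rendering.

```python
import html
import subprocess
from collections import Counter


def run_mecab(text):
    """Send text to mecab on stdin and return its output as a string."""
    result = subprocess.run(
        ["mecab"],
        input=text.encode("utf-8"),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def render_rows(nouns):
    """Build <tr> rows, most frequent noun first, with HTML-escaped words."""
    rows = []
    for word, count in Counter(nouns).most_common():
        rows.append("<tr><td>%s</td><td>%d</td></tr>" % (html.escape(word), count))
    return "".join(rows)


# Demo on a hand-made noun list; no mecab call is involved here.
print(render_rows(["神", "地", "神", "<b>"]))
```

`subprocess.run` writes the input and collects the output together, so a long text should not stall on a full pipe, and `most_common()` returns the words already sorted by count.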