Updating and configuring the SudachiPy dictionary with SudachiDict_full

The regular update of the Sudachi dictionary has been released. https://twitter.com/sorami/status/1210148792189640704

So I'm updating the SudachiPy dictionary as well, using SudachiDict_full. This is a small variation on "Step 2: Install SudachiDict_core" under Easy Setup in the SudachiPy README.md on GitHub: just replace "core" with "full". From the terminal:

% pip3 install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_full-20191224.tar.gz
% sudachipy link -t full

Those two lines should be all it takes. Even SudachiDict_full is just under 125 MB as a tar.gz, so to me it doesn't feel like a particularly large dictionary.
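As a quick sanity check after the install, a sketch like the following can list which Sudachi dictionary packages Python can see. The module names `sudachidict_small`, `sudachidict_core`, and `sudachidict_full` are my assumption about how the dictionary packages are named once installed, and `installed_dict_packages` is a made-up helper name for illustration.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed module names of the Sudachi dictionary packages (not confirmed here).
ASSUMED_DICT_MODULES = ("sudachidict_small", "sudachidict_core", "sudachidict_full")


def installed_dict_packages(candidates: Iterable[str] = ASSUMED_DICT_MODULES) -> List[str]:
    """Return the candidate module names that can be found on the import path."""
    return [name for name in candidates if find_spec(name) is not None]
```

Calling `installed_dict_packages()` from `python3` should include `sudachidict_full` if the pip install above succeeded and the package really uses that module name.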

Note that the "Install dict packages" section of the SudachiPy README.md on GitHub still shows the older `SudachiDict_full-20190718.tar.gz`. I haven't tested whether `$ pip3 install SudachiDict_full-20191224.tar.gz` works.

Trying it out

% python3
Python 3.7.5 (default, Nov  1 2019, 02:16:32)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sudachipy import tokenizer
>>> from sudachipy import dictionary
>>>
>>>
>>> tokenizer_obj = dictionary.Dictionary().create()
>>> mode = tokenizer.Tokenizer.SplitMode.C
>>> [m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
['国家公務員']
>>> [m.surface() for m in tokenizer_obj.tokenize("外国人参政権", mode)]
['外国人参政権']
>>> mode = tokenizer.Tokenizer.SplitMode.B
>>> [m.surface() for m in tokenizer_obj.tokenize("外国人参政権", mode)]
['外国人', '参政権']
>>> mode = tokenizer.Tokenizer.SplitMode.A
>>> [m.surface() for m in tokenizer_obj.tokenize("外国人参政権", mode)]
['外国', '人', '参政', '権']
