Query.jl and dplyr-style piping

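The DataFrames.jl documentation introduces Query.jl. Query.jl is, at heart, a package for querying data held in a DataFrame or a similar structure, in the style of Language INtegrated Query, or LINQ (apparently pronounced "link").

R's dplyr is a package specialized for manipulating data frames, and I expect many readers know it and use it.

To my surprise, Query.jl has apparently supported dplyr-style piping since version 0.7.x, released in September 2017.

When I tried Query.jl around 2016, it felt awfully slow. Now that dplyr-style piping is available, I wanted to give it another try. Here I focus on filter (@where), which extracts the rows that satisfy a given condition.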

First, a refresher on DataFrames.jl

To start, let's build a 10⁸ × 3 DataFrame of random numbers.

using DataFrames

df = DataFrame(
  col1 = randn(10^8),
  col2 = randn(10^8),
  col3 = randn(10^8)
)

Next, let's extract the rows where col1 > 0 and col2 > 0.

df[collect(df[:col1].>0) .& collect(df[:col2].>0) , :]

or

df[.&(collect(df[:col1].>0) , collect(df[:col2].>0)),:]

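Either form works. A while back it was faster to pull out the indices of the true rows with findall(), but in general it now seems to make little difference. findall() might still pay off when true values are extremely rare.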
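To make the two approaches concrete, here is a minimal sketch on a small DataFrame. It assumes a current DataFrames.jl, where columns are written df.col1 rather than the older df[:col1] used in this post, and the fixed seed and the 1,000-row size are my own choices for illustration.

```julia
using DataFrames, Random

Random.seed!(1)   # fixed seed so the sketch is repeatable
df = DataFrame(col1 = randn(1_000), col2 = randn(1_000), col3 = randn(1_000))

# Boolean mask: one Bool per row.
mask = (df.col1 .> 0) .& (df.col2 .> 0)

by_mask  = df[mask, :]            # index the rows with the mask directly
by_index = df[findall(mask), :]   # convert the mask to row indices first

# Both routes select exactly the same rows.
@assert by_mask == by_index
```

The only difference is whether the row selector is a vector of Bool or a vector of integer indices, which is why any speed gap depends on how many rows are true.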

Now, after a warm-up run, let's measure the speed with @time, simply repeating the measurement five times in a for loop.

julia> for i in 1:5
       @time df[collect(df[:col1].>0) .& collect(df[:col2].>0),:];
       end
  2.203278 seconds (13.83 k allocations: 799.406 MiB, 7.78% gc time)
  2.281177 seconds (108 allocations: 798.697 MiB, 13.19% gc time)
  2.179442 seconds (108 allocations: 798.697 MiB, 13.22% gc time)
  2.158846 seconds (108 allocations: 798.697 MiB, 13.55% gc time)
  2.176902 seconds (108 allocations: 798.697 MiB, 13.30% gc time)

As you would expect, 100 million rows is hard work.
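The measurement pattern used above (one warm-up run, then several timed runs) can be wrapped in a small helper. This is only a sketch: time_runs is a hypothetical name of mine, not something from DataFrames.jl or Query.jl, and the BenchmarkTools.jl package with its @btime macro would be the more careful choice for serious benchmarking.

```julia
# Hypothetical helper: run `f` once so compilation is not timed,
# then record the wall-clock seconds of `n` further runs.
function time_runs(f; n = 5)
    f()                                   # warm-up run, not recorded
    return [@elapsed(f()) for _ in 1:n]   # seconds per run
end

x = randn(10^6)
times = time_runs(() -> x[x .> 0])
println("min = ", minimum(times), " s, max = ", maximum(times), " s")
```

Reporting the minimum as well as the spread makes it easier to tell real differences apart from garbage-collection noise.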

A while back plain Arrays were the faster option, so let's try an Array as well.

julia> ar = [randn(10^8) randn(10^8) randn(10^8)];

julia> for i in 1:5
         @time ar[.&( collect(ar[:,1].>0) ,  collect(ar[:,2].>0) ) ,:];
       end
  3.772590 seconds (13.80 k allocations: 2.271 GiB, 10.08% gc time)
  3.240747 seconds (82 allocations: 2.270 GiB, 10.89% gc time)
  3.222589 seconds (82 allocations: 2.270 GiB, 12.49% gc time)
  3.231135 seconds (82 allocations: 2.270 GiB, 12.79% gc time)
  3.182824 seconds (82 allocations: 2.270 GiB, 12.44% gc time)

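Surprisingly, the DataFrame beats the Array comfortably: about 2.2 seconds versus about 3.2 seconds per run, with roughly 0.8 GiB allocated instead of 2.27 GiB. DataFrames seem to have been making steady progress.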
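One thing worth knowing when reading the Array numbers is that a slice such as ar[:, 1] makes a copy of the column before the comparison runs. The sketch below shows the same mask built with view, which refers to the column without copying. I did not time this variant in the post, so treat it as an assumption about where some of the extra allocation may come from, not as a measured result.

```julia
ar = randn(10^5, 3)

# As in the post: each `ar[:, k]` copies the column first.
mask_copy = (ar[:, 1] .> 0) .& (ar[:, 2] .> 0)

# Assumed alternative: `view` refers to the column without copying it.
mask_view = (view(ar, :, 1) .> 0) .& (view(ar, :, 2) .> 0)

@assert mask_copy == mask_view   # same rows are selected either way
selected = ar[mask_view, :]      # the result itself is still a new matrix
```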

Filtering with Query.jl

Now, at last, let's see how fast Query.jl is.

The data is generated the same way as before:

df = DataFrame(
  col1 = randn(10^8),
  col2 = randn(10^8),
  col3 = randn(10^8)
)

From this we again extract the rows where col1 > 0 and col2 > 0. Since the feature is there, let's use dplyr-style piping.

julia> for i in 1:5
       @time df |> @filter(_.col1 .>0 && _.col2 .>0)  |> DataFrame
       end
 11.511096 seconds (100.02 M allocations: 1.518 GiB, 9.33% gc time)
  8.618113 seconds (100.00 M allocations: 1.516 GiB, 8.71% gc time)
  8.800657 seconds (100.00 M allocations: 1.516 GiB, 8.63% gc time)
  8.711341 seconds (100.00 M allocations: 1.516 GiB, 8.85% gc time)
  8.646488 seconds (100.00 M allocations: 1.516 GiB, 8.73% gc time)

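Oh dear. It is still tremendously slow: about 8.6 seconds per run, roughly four times slower than the plain indexing above.

Pulling ourselves together, let's try again with the original LINQ-style syntax.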

julia> @time x = @from i in df begin
           @where i.col1>0 && i.col2>0
           @select i
           @collect DataFrame
         end
  8.839257 seconds (100.02 M allocations: 1.517 GiB, 8.85% gc time)

Oh dear. It is still far too slow. It looks usable for smaller DataFrames, but awkward for large ones.
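The pipe form and the LINQ form used above are two spellings of the same query. Here is a minimal sketch on a four-row DataFrame whose values I made up so that the result is easy to check by eye.

```julia
using DataFrames, Query

df = DataFrame(col1 = [1.0, -1.0, 2.0, 3.0],
               col2 = [0.5, 2.0, -0.5, 1.5],
               col3 = [10.0, 20.0, 30.0, 40.0])

# dplyr-style pipe
piped = df |> @filter(_.col1 > 0 && _.col2 > 0) |> DataFrame

# LINQ-style query
linq = @from i in df begin
    @where i.col1 > 0 && i.col2 > 0
    @select i
    @collect DataFrame
end

# Only the first and last rows have both col1 > 0 and col2 > 0.
@assert piped == linq
```

Since both forms describe the same query, it is not surprising that their timings above are nearly identical.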

Now suppose that, by a slip of the keyboard, we forget to collect the result into a DataFrame.

julia> for i in 1:5
       @time df |> @filter(_.col1 .>0 && _.col2 .>0) # |> DataFrame
       end
  0.011512 seconds (2.50 k allocations: 142.140 KiB)
  0.000281 seconds (120 allocations: 6.219 KiB)
  0.000164 seconds (120 allocations: 6.219 KiB)
  0.000162 seconds (120 allocations: 6.219 KiB)
  0.000349 seconds (120 allocations: 6.219 KiB)

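The speed is in a different league. Displaying the result shows that the filter really is applied. Note, though, that at this point Query.jl has only built a lazy query: rows are filtered when the result is iterated, and the display below reads just the first ten of them.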

julia> df |> @filter(_.col1 .>0 && _.col2 .>0)
?x3 query result
col1      │ col2      │ col3
──────────┼───────────┼──────────
4.28397   │ 0.512331  │ 1.89934
0.225391  │ 0.346354  │ -0.736344
0.243745  │ 0.893401  │ -0.462783
0.288021  │ 1.93481   │ -0.209131
0.844744  │ 0.0102264 │ 0.878457
0.601574  │ 1.02324   │ 1.54026
0.0409227 │ 0.437636  │ 1.24912
1.38892   │ 2.04037   │ -1.8799
1.30475   │ 0.588223  │ 1.19938
0.664451  │ 0.235347  │ -1.42207
... with more rows

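Judging from this, building the query costs almost nothing, and nearly all of the roughly 8.6 seconds goes into walking through the rows and reassembling them into a DataFrame.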
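The two steps can be timed separately. The sketch below assumes a much smaller DataFrame (10^5 rows, my choice) so that it runs quickly. The first run of each step also includes compilation, so the printed numbers are only indicative.

```julia
using DataFrames, Query

df = DataFrame(col1 = randn(10^5), col2 = randn(10^5), col3 = randn(10^5))

# Step 1: build the query. No rows are visited yet.
t_build = @elapsed begin
    q = df |> @filter(_.col1 > 0 && _.col2 > 0)
end

# Step 2: materialize. Every row is tested and copied here.
t_collect = @elapsed begin
    result = q |> DataFrame
end

println("build:   ", t_build, " s")
println("collect: ", t_collect, " s")
```

Here q is a query object, not a DataFrame, which is why the step that produces it does not depend on the number of rows.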

Undeterred, let's also time IndexedTables from the relatively new JuliaDB

Let's run the same filter on an IndexedTables table from JuliaDB, which is still relatively new.

julia> using JuliaDB

julia> t = table(randn(10^8), randn(10^8), randn(10^8),  names=[:a, :b, :c])
Table with 100000000 rows, 3 columns:
a          b          c
────────────────────────────────
0.724895   0.370638   -0.322566
-0.64493   0.836357   0.0599931
-0.474399  -0.362728  0.247372
-1.61543   0.315327   -2.03863
-0.899386  0.65499    -0.296671
1.41488    -1.5424    -0.41358
0.398743   -1.15674   -0.966664
-1.26013   -1.56635   -1.35638
0.57865    -0.637002  -1.00945
1.2417     -0.929902  -0.514331
⋮
-1.21047   0.995362   -1.84056
0.630685   -0.714683  0.298112
0.915369   0.81218    -0.0479515
0.244752   1.57568    -0.814339
1.70909    -1.16757   -0.0482597
0.600579   -1.2224    -0.960865
-0.232848  1.40478    -0.284237
0.867175   -0.363188  -0.249756

Just building the table takes quite a while.

Now, once again:

julia> for i in 1:5
         @time filter(p -> p.a>0 && p.b>0 , t);
       end

  2.666423 seconds (12.50 k allocations: 858.927 MiB, 7.39% gc time)
  2.660388 seconds (99 allocations: 858.295 MiB, 7.32% gc time)
  2.609718 seconds (99 allocations: 858.295 MiB, 7.37% gc time)
  2.622066 seconds (99 allocations: 858.295 MiB, 7.45% gc time)
  2.934680 seconds (99 allocations: 858.295 MiB, 7.30% gc time)

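At about 2.6 seconds per run, this is slightly slower than the 2.2 seconds for plain DataFrame indexing. For now, unless you are aiming for a relational database, there seems to be no need to reach for IndexedTables.

Summary

Building a query inside Query.jl was breathtakingly fast. I hope the DataFrame format settles down soon and that turning a query result back into a DataFrame gets faster.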
