【Pandas DataFrame】forループで1行/1列ずつ取得 [iterrows, itertuples, items]

たっきん(Twitter)です！

今回は、DataFrameからforループを使って１行または、１列ずつデータを取り出す方法について説明してきます。

時系列データの順次処理が必要な場合など、forループを使う機会も多いと思いますので、ここでしっかり使い方を覚えていきましょう！

また、Seriesからforループを使ってデータを１ずつ取り出す方法についても説明していきます。

forループで順番にデータを取得するには専用のメソッドが用意されており、次の表の通りになります。

【DataFrame】

メソッド	用途
`DataFrame.iterrows()`	１行ずつ`Series`型で取り出し
`DataFrame.itertuples()`	１行ずつ`tuple`型で取り出し
`DataFrame.items()`	１列ずつ`Series`型で取り出し

【Series】

メソッド	用途
`Series.items()`	１ずつ要素を取り出し

DataFrameから１列ずつ取得するメソッド、Seriesから１要素ずつ取り出すメソッドとして、iteritems()がありましたが、version 1.5.0より非推奨となりました。
将来的には削除されるメソッドになるので、使用しないようにしましょう！

【API Reference】
　・pandas.DataFrame.iteritems
　・pandas.Series.iteritems

これらのメソッド１つ１つについて、サンプルコードを使って説明していきます。

Pandasに関する記事を多数書いてます

Pandas記事全容（目次）になります。
Pandasについて、他に困りごとある場合はこちらで答えが見つかるかもしれませんので、よかったら参考にしてみてください。

pandasデータ処理ドリル Pythonによるデータサイエンスの腕試し

created by Rinker

「DataFrame」のforループ処理
1. １行ずつ取り出し
  1. iterrows()
  2. itertuples()
2. １列ずつ取り出し
  1. items()
「Series」のforループ処理
処理速度の比較
参考

「DataFrame」のforループ処理

DataFrameには１行ずつデータを取り出すメソッドと１列ずつデータを取り出すメソッドが用意されています。

それぞれサンプルコードを使って動作を確認していきましょう！

動作を確認するにあたり、次のDataFrameから行・列をforループを使って１つずつ取り出していく流れで説明していきます。

import pandas as pd

ohlc_data = [
    ["2020/4/6", 108.432, 119.393, 98.403, 109.252],
    ["2020/4/7", 109.252, 119.294, 98.680, 108.799],
    ["2020/4/8", 108.799, 119.109, 98.514, 108.854],
]

df = pd.DataFrame(ohlc_data, columns=["datetime", "open", "high", "low", "close"])
df.set_index("datetime", inplace=True)
print(df)
"""
             open     high     low    close
datetime                                   
2020/4/6  108.432  119.393  98.403  109.252
2020/4/7  109.252  119.294  98.680  108.799
2020/4/8  108.799  119.109  98.514  108.854
"""

１行ずつ取り出し

１行ずつ取り出してforループ処理するにはiterrows()とitertuples()があります。

違いを次の表にまとめました。

メソッド	取得データ型	処理速度
`iterrows()`	`Series`	遅い
`itertuples()`	タプル	速い

処理速度についてですが、行数が膨大になってくると２つのメソッド間で顕著にその差が表れてきます。
どのくらい差があるかは本記事で後述しているのでそちらを参照してほしいのですが、膨大なデータを扱う場合、特別な理由がない限りは処理速度が速いitertuples()を使うようにしたほうが良いです。

iterrows()

iterrows()を使うと各行のインデックス（行ラベル）とSeries型の行データのペア(index, Series)が取得できます。

【API reference】pandas.DataFrame.iterrows

Series型で取得できるため、取り出した行データにSeriesのメソッドをすぐに適用できるメリットがありますが、その代償としてitertuples()よりも取り出し速度が遅いです。

# １行ずつ取り出して処理する [iterrows()]
counter = 0
for index, sr_row in df.iterrows():
    print("\n*** Loop:[{}] ********************".format(counter))
    print("--- index -----")
    print(index)
    print("--- content -----")
    print(sr_row)
    print("--- 型を確認 -----")
    print(type(sr_row))
    counter += 1
"""
*** Loop:[0] ********************
--- index -----
2020/4/6
--- content -----
open     108.432
high     119.393
low       98.403
close    109.252
Name: 2020/4/6, dtype: float64
--- 型を確認 -----
<class 'pandas.core.series.Series'>

*** Loop:[1] ********************
--- index -----
2020/4/7
--- content -----
open     109.252
high     119.294
low       98.680
close    108.799
Name: 2020/4/7, dtype: float64
--- 型を確認 -----
<class 'pandas.core.series.Series'>

*** Loop:[2] ********************
--- index -----
2020/4/8
--- content -----
open     108.799
high     119.109
low       98.514
close    108.854
Name: 2020/4/8, dtype: float64
--- 型を確認 -----
<class 'pandas.core.series.Series'>
"""

itertuples()

itertuples()を使うと各行のデータを名前付きタプルで取得できます。

【API reference】pandas.DataFrame.itertuples

iterrows()よりも取り出し速度が速いため、取り出した行データを参照するだけの処理であれば、こちらを使用した方が良いです。

名前付きタプルのため、インデックスではなく、名前で参照できる点も良いですね！

# １行ずつ取り出して処理する [itertuples()]
counter = 0
for n_tuple in df.itertuples():
    print("\n*** Loop:[{}] ********************".format(counter))
    print("--- 型を確認 -----")
    print(type(n_tuple))
    print("--- 名前で取得 -----")
    print("index: {}".format(n_tuple.Index))
    print("open : {}".format(n_tuple.open))
    print("high : {}".format(n_tuple.high))
    print("low  : {}".format(n_tuple.low))
    print("close: {}".format(n_tuple.close))
    print("--- インデックスで取得 -----")
    print("index: {}".format(n_tuple[0]))
    print("open : {}".format(n_tuple[1]))
    print("high : {}".format(n_tuple[2]))
    print("low  : {}".format(n_tuple[3]))
    print("close: {}".format(n_tuple[4]))
    counter += 1
"""
*** Loop:[0] ********************
--- 型を確認 -----
<class 'pandas.core.frame.Pandas'>
--- 名前で取得 -----
index: 2020/4/6
open : 108.432
high : 119.393
low  : 98.403
close: 109.252
--- インデックスで取得 -----
index: 2020/4/6
open : 108.432
high : 119.393
low  : 98.403
close: 109.252

*** Loop:[1] ********************
--- 型を確認 -----
<class 'pandas.core.frame.Pandas'>
--- 名前で取得 -----
index: 2020/4/7
open : 109.252
high : 119.294
low  : 98.68
close: 108.799
--- インデックスで取得 -----
index: 2020/4/7
open : 109.252
high : 119.294
low  : 98.68
close: 108.799

*** Loop:[2] ********************
--- 型を確認 -----
<class 'pandas.core.frame.Pandas'>
--- 名前で取得 -----
index: 2020/4/8
open : 108.799
high : 119.109
low  : 98.514
close: 108.854
--- インデックスで取得 -----
index: 2020/4/8
open : 108.799
high : 119.109
low  : 98.514
close: 108.854
"""

ちなみに、引数にname=Noneを指定すると通常のタプルとして取得できます。

取り出し速度も名前付きタプルと比較して若干早くなりますが、インデックスでしかデータ参照ができなくなるため、あまりお勧めはしません。

# １行ずつ取り出して処理する [itertuples(name=None)]
counter = 0
for r_tuple in df.itertuples(name=None):
    print("\n*** Loop:[{}] ********************".format(counter))
    print("--- 型を確認 -----")
    print(type(r_tuple))
    print("--- インデックスで取得 -----")
    print("index: {}".format(r_tuple[0]))
    print("open : {}".format(r_tuple[1]))
    print("high : {}".format(r_tuple[2]))
    print("low  : {}".format(r_tuple[3]))
    print("close: {}".format(r_tuple[4]))
    counter += 1
"""
*** Loop:[0] ********************
--- 型を確認 -----
<class 'tuple'>
--- インデックスで取得 -----
index: 2020/4/6
open : 108.432
high : 119.393
low  : 98.403
close: 109.252

*** Loop:[1] ********************
--- 型を確認 -----
<class 'tuple'>
--- インデックスで取得 -----
index: 2020/4/7
open : 109.252
high : 119.294
low  : 98.68
close: 108.799

*** Loop:[2] ********************
--- 型を確認 -----
<class 'tuple'>
--- インデックスで取得 -----
index: 2020/4/8
open : 108.799
high : 119.109
low  : 98.514
close: 108.854
"""

名前付きタプルの“名前で参照”するメリットはデータの格納位置を把握する必要がないため、変更容易性が高くなることです。
インデックスで参照していると、仕様変更でデータの配置が変わった場合、インデックス参照している個所を全て見直すハメになり、バグを生みやすくなります。

１列ずつ取り出し

１行ずつ取り出してforループ処理するにはitems()メソッドを使います。

取得データ型はSeriesになるため、iterrows()の列取得版メソッドになります。

しかし、現時点でタプル型で取得するitertuples()の列取得版メソッドは存在しません。

items()

items()を使うと各列のカラム（列ラベル）とSeries型の列データのペア(label, Series)が取得できます。

【API reference】pandas.DataFrame.items

counter = 0
for label, sr_col in df.items():
    print("\n*** Loop:[{}] ********************".format(counter))
    print("--- label -----")
    print(label)
    print("--- content -----")
    print(sr_col)
    print("--- 型を確認 -----")
    print(type(sr_col))
    counter += 1
"""
*** Loop:[0] ********************
--- label -----
open
--- content -----
datetime
2020/4/6    108.432
2020/4/7    109.252
2020/4/8    108.799
Name: open, dtype: float64
--- 型を確認 -----
<class 'pandas.core.series.Series'>

*** Loop:[1] ********************
--- label -----
high
--- content -----
datetime
2020/4/6    119.393
2020/4/7    119.294
2020/4/8    119.109
Name: high, dtype: float64
--- 型を確認 -----
<class 'pandas.core.series.Series'>

*** Loop:[2] ********************
--- label -----
low
--- content -----
datetime
2020/4/6    98.403
2020/4/7    98.680
2020/4/8    98.514
Name: low, dtype: float64
--- 型を確認 -----
<class 'pandas.core.series.Series'>

*** Loop:[3] ********************
--- label -----
close
--- content -----
datetime
2020/4/6    109.252
2020/4/7    108.799
2020/4/8    108.854
Name: close, dtype: float64
--- 型を確認 -----
<class 'pandas.core.series.Series'>
"""

pandasデータ処理ドリル Pythonによるデータサイエンスの腕試し

created by Rinker

「Series」のforループ処理

Seriesにも１つずつ要素データを取り出すメソッドitems()が用意されてます。

items()を使うと各行のインデックス（ラベル）と要素データの値のペア(index, value)が取得できます。

【API reference】pandas.Series.items

こちらも、次のSeriesから順に各要素を取り出す例をサンプルコードを使って動作を確認していきましょう！

import pandas as pd

# Seriesの生成
o_data = [108.432, 109.252, 108.799]
o_index = ["2020/4/6", "2020/4/7", "2020/4/8"]
sr = pd.Series(data=o_data, index=o_index, name="open")
print(sr)
"""
2020/4/6    108.432
2020/4/7    109.252
2020/4/8    108.799
Name: open, dtype: float64
"""

# 要素を１つずつ取り出し
for index, value in sr.items():
    print("index: {}, value: {}".format(index, value))
"""
index: 2020/4/6, value: 108.432
index: 2020/4/7, value: 109.252
index: 2020/4/8, value: 108.799
"""

処理速度の比較

ここまで、DataFrame.iterrows()は処理速度が遅いとか、DataFrame.itertuples()は処理速度が速いとか説明してきましたが、実際のどの程度速度差が出るのか確認してみましょう！

今回、比較用として次のような膨大なOHLC(open, high, low, close)データを用意しました！

import pandas as pd
import datetime as dt
import numpy as np
import time

indexes = pd.date_range(
    start=dt.datetime(2000, 1, 1, 0, 0),
    end=dt.datetime(2023, 12, 31, 23, 0),
    freq="1H"
)
mat = np.arange(indexes.size * 4).reshape((indexes.size, 4)) / 100
df = pd.DataFrame(
    mat,
    columns=["open", "high", "low", "close"],
    index=indexes
)
print(df)
""""
                        open     high      low    close
2000-01-01 00:00:00     0.00     0.01     0.02     0.03
2000-01-01 01:00:00     0.04     0.05     0.06     0.07
2000-01-01 02:00:00     0.08     0.09     0.10     0.11
2000-01-01 03:00:00     0.12     0.13     0.14     0.15
2000-01-01 04:00:00     0.16     0.17     0.18     0.19
...                      ...      ...      ...      ...
2023-12-31 19:00:00  8415.16  8415.17  8415.18  8415.19
2023-12-31 20:00:00  8415.20  8415.21  8415.22  8415.23
2023-12-31 21:00:00  8415.24  8415.25  8415.26  8415.27
2023-12-31 22:00:00  8415.28  8415.29  8415.30  8415.31
2023-12-31 23:00:00  8415.32  8415.33  8415.34  8415.35

[210384 rows x 4 columns]
""""

行数は210384行です。

取り出した各行データからopenとcloseの値を参照する処理で比較してみます。

次のコードで実測してみました。

# forループで１行ずつ処理(iterrows)
start = time.perf_counter()
for index, series in df.iterrows():
    last_op = series.iat[0]
    last_cl = series.iat[3]
duration = time.perf_counter() - start
print("処理時間:{:.3f} last_op:{} last_cl:{}".format(duration, last_op, last_cl))
"""
処理時間:8.356 last_op:8415.32 last_cl:8415.35
"""

# forループで１行ずつ処理(itertuples) ※名前付きタプル
start = time.perf_counter()
for n_tuple in df.itertuples():
    last_op = n_tuple[1]
    last_cl = n_tuple[4]
duration = time.perf_counter() - start
print("処理時間:{:.3f} last_op:{} last_cl:{}".format(duration, last_op, last_cl))
"""
処理時間:0.386 last_op:8415.32 last_cl:8415.35
"""

# forループで１行ずつ処理(itertuples) ※通常タプル
start = time.perf_counter()
for r_tuple in df.itertuples(name=None):
    last_op = r_tuple[1]
    last_cl = r_tuple[4]
duration = time.perf_counter() - start
print("処理時間:{:.3f} last_op:{} last_cl:{}".format(duration, last_op, last_cl))
"""
処理時間:0.290 last_op:8415.32 last_cl:8415.35
"""

結果です。

【処理時間】

iterrows()：8.356[s]
itertuples()：0.386[s]
itertuples(name=None)：0.290[s]

この結果を見てもわかる通り、iterrows()の処理時間はitertuples()よりも20倍以上かかることがわかりました。

また、itertuples()では、名前付きタプルよりも、通常タプル(name=None)のほうが若干高速であったことも確認できました。

さいごに、おまけ程度になりますがさらに高速化する方法もあります。

参照したい列を事前にリストで取り出しておき、zip()やインデックスで参照する方法です。

コードと実測結果は次のようになります。

（この方法は実質pandasを使用していないことになりますが・・・）

# forループで１行ずつ処理 <さらに高速化１>
open_list = df["open"].to_list()
close_list = df["close"].to_list()
start = time.perf_counter()
for op, cl in zip(open_list, close_list):
    last_op = op
    last_cl = cl
duration = time.perf_counter() - start
print("処理時間:{:.3f} last_op:{} last_cl:{}".format(duration, last_op, last_cl))
"""
処理時間:0.039 last_op:8415.32 last_cl:8415.35
"""

# forループで１行ずつ処理 <さらに高速化２>
open_list = df["open"].to_list()
close_list = df["close"].to_list()
start = time.perf_counter()
for index in range(df.shape[0]):
    last_op = open_list[index]
    last_cl = close_list[index]
duration = time.perf_counter() - start
print("処理時間:{:.3f} last_op:{} last_cl:{}".format(duration, last_op, last_cl))
"""
処理時間:0.037 last_op:8415.32 last_cl:8415.35
"""

各処理時間のまとめになりますので参考に。