NumPy 載入文字檔


之前的資料,都是由程式生成,如果是來自檔案的資料呢?例如,不少資料會存為 CSV,像是你可以至臺灣證券交易所,以 CSV 格式下載發行量加權股價指數歷史資料,檔案是 Big5 編碼…Orz

若是以純文字編輯器開啟,會有以下內容:

"110年01月 發行量加權股價指數歷史資料"
"日期","開盤指數","最高指數","最低指數","收盤指數",
"110/01/04","14,720.25","14,937.13","14,720.25","14,902.03",
"110/01/05","14,913.64","15,000.03","14,861.99","15,000.03",
"110/01/06","15,145.85","15,197.68","14,837.00","14,983.13",
"110/01/07","15,059.52","15,270.40","15,049.86","15,214.00",
"110/01/08","15,365.13","15,463.95","15,275.38","15,463.95",
"110/01/11","15,425.58","15,557.30","15,395.73","15,557.30",
"110/01/12","15,549.53","15,642.03","15,421.23","15,500.70",
"110/01/13","15,550.16","15,778.52","15,550.16","15,769.98",
"110/01/14","15,651.20","15,760.15","15,620.96","15,707.19",
"110/01/15","15,987.16","16,041.59","15,615.11","15,616.39",
"110/01/18","15,676.45","15,676.45","15,320.97","15,612.00",
"110/01/19","15,716.64","15,928.14","15,716.64","15,877.37",
"110/01/20","15,934.85","16,004.32","15,745.48","15,806.18",
"110/01/21","15,775.73","16,238.46","15,775.73","16,153.77",
"110/01/22","15,984.89","16,138.04","15,973.02","16,019.03",
"110/01/25","16,006.21","16,014.96","15,772.63","15,946.54",
"110/01/26","15,955.16","16,023.87","15,589.21","15,658.85",
"110/01/27","15,711.76","15,837.47","15,642.12","15,701.45",
"110/01/28","15,519.85","15,557.02","15,367.45","15,415.88",
"110/01/29","15,544.36","15,603.41","15,138.31","15,138.31",

想讀取 CSV 檔案,可以使用 Python 內建的 csv 模組:

>>> import csv
>>> with open('MI_5MINS_HIST.csv', encoding = 'Big5') as f:
...     print([row for row in csv.reader(f)])
...
[['110年01月 發行量加權股價指數歷史資料'], ['日期', '開盤指數', '最高指數', '最低指數', '收盤指數', ''], ['110/01/04', '14,720.25', '14,937.13', '14,720.25', '14,902.03', ''], ['110/01/05', '14,913.64', '15,000.03', '14,861.99', '15,000.03', ''], ['110/01/06', '15,145.85', '15,197.68', '14,837.00', '14,983.13', ''], ['110/01/07', '15,059.52', '15,270.40', '15,049.86', '15,214.00', ''], ['110/01/08', '15,365.13', '15,463.95', '15,275.38', '15,463.95', ''], ['110/01/11', '15,425.58', '15,557.30', '15,395.73', '15,557.30', ''], ['110/01/12', '15,549.53', '15,642.03', '15,421.23', '15,500.70', ''], ['110/01/13', '15,550.16', '15,778.52', '15,550.16', '15,769.98', ''], ['110/01/14', '15,651.20', '15,760.15', '15,620.96', '15,707.19', ''], ['110/01/15', '15,987.16', '16,041.59', '15,615.11', '15,616.39', ''], ['110/01/18', '15,676.45', '15,676.45', '15,320.97', '15,612.00', ''], ['110/01/19', '15,716.64', '15,928.14', '15,716.64', '15,877.37', ''], ['110/01/20', '15,934.85', '16,004.32', '15,745.48', '15,806.18', ''], ['110/01/21', '15,775.73', '16,238.46', '15,775.73', '16,153.77', ''], ['110/01/22', '15,984.89', '16,138.04', '15,973.02', '16,019.03', ''], ['110/01/25', '16,006.21', '16,014.96', '15,772.63', '15,946.54', ''], ['110/01/26', '15,955.16', '16,023.87', '15,589.21', '15,658.85', ''], ['110/01/27', '15,711.76', '15,837.47', '15,642.12', '15,701.45', ''], ['110/01/28', '15,519.85', '15,557.02', '15,367.45', '15,415.88', ''], ['110/01/29', '15,544.36', '15,603.41', '15,138.31', '15,138.31', '']]

如果只使用 NumPy,做得到嗎?NumPy 有個 loadtxt 函式,如果你的資料使用空白、換行排列儲存在 data.csv,例如:

123 456 789
987 654 321
135 789 321

那麼可以如下讀入:

import numpy as np

data = np.loadtxt('data.csv')
print(data)

loadtxt 預設會以空白、換行來切出列行,並轉換為 np.float64 形態,結果會是:

[[123. 456. 789.]
 [987. 654. 321.]
 [135. 789. 321.]]

如果文字檔中使用其他分隔符號(例如逗號),可以使用 delimiter 來指定,轉換後的型態使用 dtype 指定(可以指定單一形態或多個形態),文字編碼使用 encoding 指定,而像方才的發行量加權股價指數歷史資料,首兩行只是文字描述,讀取後可以跳過,這可以使用 skiprows 指定行數。

現在問題來了,發行量加權股價指數歷史資料中,以逗號區隔,指數資料使用的數字格式中也有逗號作為千分位分隔,該怎麼切資料呢?

方式之一是將每一列當成字串讀入,並結合 csv 模組剖析每一列字串,例如:

import numpy as np
import csv

def parse_csv(s):
    return list(csv.reader([s[:-1]]))    # s[:-1] 是因為不需要最後一個逗號

parse_csv = np.frompyfunc(parse_csv, 1, 1)

hist = np.loadtxt('MI_5MINS_HIST.csv', 
                  encoding = 'Big5', dtype = np.str, skiprows = 2)

# parse_csv(hist) 傳回的陣列中,每個元素是 list
# 使用 np.concatenate 將這些 list 串接為二維陣列
hist = np.concatenate(parse_csv(hist))   

另一個方式是透過 fromregex,它可以指定規則表示式,捕捉想要的資料:

import numpy as np

hist = np.fromregex('MI_5MINS_HIST.csv', 
                    r'"(.+)","([\d,.]+)","(.+)","(.+)","(.+)",', 
                    encoding = 'Big5', dtype = np.str)

因為規則表示式 "(.+)","([\d,.]+)","(.+)","(.+)","(.+)",,其中的 "([\d,.]+)" 只有指數格式才會符合,也就不需要自行略過中文的列數。

假設資料更進一步地轉換整理,方才談到的 loadtxt 可以指定 converter,可指定各行的資料該如何轉換,作為示範,底下故意用較笨拙的方式來讀出歷史資料:

import numpy as np

# 指定行 0 與 4 的轉換器
converters = {
    0: lambda s: s[1:],  # 去頭
    4: lambda s: s[:-2]  # 去尾
}

hist = np.loadtxt('MI_5MINS_HIST.csv', 
                  encoding = 'Big5', dtype = np.str, skiprows = 2, 
                  delimiter = '","',       # 故意指定 "," 作為分隔符
                  converters = converters)

接下來的問題是,如何取得各行?根據〈NumPy 陣列索引〉,可以使用 [:,n] 的方式來取得第 n 行的資料,例如,若結合 fromregex,想取得日期與開盤指數,可以如下:

import numpy as np

hist = np.fromregex('MI_5MINS_HIST.csv', 
                    r'"(.+)","([\d,.]+)","(.+)","(.+)","(.+)",', 
                    encoding = 'Big5', dtype = np.str)

date = hist[:,0]
opening = hist[:,1]

print(date)
print(opening)

這可以顯示以下的結果:

['110/01/04' '110/01/05' '110/01/06' '110/01/07' '110/01/08' '110/01/11'
 '110/01/12' '110/01/13' '110/01/14' '110/01/15' '110/01/18' '110/01/19'
 '110/01/20' '110/01/21' '110/01/22' '110/01/25' '110/01/26' '110/01/27'
 '110/01/28' '110/01/29']
['14,720.25' '14,913.64' '15,145.85' '15,059.52' '15,365.13' '15,425.58'
 '15,549.53' '15,550.16' '15,651.20' '15,987.16' '15,676.45' '15,716.64'
 '15,934.85' '15,775.73' '15,984.89' '16,006.21' '15,955.16' '15,711.76'
 '15,519.85' '15,544.36']