textChunkDS
Syntax
textChunkDS(filename, chunkSize, [delimiter], [schema], [skipRows=0])
Arguments
filename is a string indicating the input text file name with its absolute path. Currently only .csv files are supported.
chunkSize is an integer between 1 and 2048 (MB) indicating the size of a file chunk.
delimiter (optional) is a STRING scalar indicating the table column separator. It can consist of one or more characters, with the default being a comma (',').
schema (optional) a table. It can have the following columns, among which "name" and "type" columns are required.
column name | data type | meaning |
---|---|---|
name | STRING scalar | column name |
type | STRING scalar | data type |
format | STRING scalar | the format of temporal columns |
col | INT scalar or vector | the columns to be loaded |
skipRows (optional) is an integer between 0 and 1024 indicating the rows in the beginning of the text file to be ignored. The default value is 0.
Details
To load an extremely large text file into DolphinDB database, we can first use
function textChunkDS
to divide the text file into multiple data
sources with the size of each data source specified by chunkSize, then use
function mr to load data.
When loading data files in DolphinDB, a random sample of the data is analyzed to
determine the data type for each column. However, this sampling method does not
always accurately determine the column types. It is recommend to use the
extractTextSchema
function to check the schema of the input
file before loading the data. You can specify the intended data type for each column
in the "type" field of the schema. For date or time columns particularly, if
DolphinDB does not recognize the correct data types, you need to set the temporal
type in the "type" field, and provide the date/time format string (e.g.
"MM/dd/yyyy") in the "format" field. Refer to Parsing and Format of Temporal Variables for temproal formats in
DolphinDB.
Examples
Use the following script to generate the data file of about 3.2GB:
n=30000000
workDir = "/home/DolphinDB"
if(!exists(workDir)) mkdir(workDir)
trades=table(rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym, 2000.01.01+rand(365,n) as date, 10.0+rand(2.0,n) as price1, 100.0+rand(20.0,n) as price2, 1000.0+rand(200.0,n) as price3, 10000.0+rand(2000.0,n) as price4, 10000.0+rand(3000.0,n) as price5, 10000.0+rand(4000.0,n) as price6, rand(10,n) as qty1, rand(100,n) as qty2, rand(1000,n) as qty3, rand(10000,n) as qty4, rand(10000,n) as qty5, rand(10000,n) as qty6)
trades.saveText(workDir + "/trades.txt");
Load the file into a DFS database with functions textChunkDS
and
mr
:
db=database("dfs://db1",VALUE, `IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S)
pt=db.createPartitionedTable(trades,`pt,`sym)
ds=textChunkDS(workDir + "/trades.txt",500)
mr(ds,append!{pt},,,false)
mr
to false, otherwise an exception might be
thrown.