mirror of
https://github.com/saymrwulf/zipline.git
synced 2026-05-16 21:10:11 +00:00
368 lines
14 KiB
ReStructuredText
368 lines
14 KiB
ReStructuredText
.. _data-bundles:
|
|
|
|
Data Bundles
|
|
------------
|
|
|
|
A data bundle is a collection of pricing data, adjustment data, and an asset
|
|
database. Bundles allow us to preload all of the data we will need to run
|
|
backtests and store the data for future runs.
|
|
|
|
.. _bundles-command:
|
|
|
|
Discovering Available Bundles
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Zipline comes with a few bundles by default as well as the ability to register
|
|
new bundles. To see which bundles we have available, we may run the
|
|
``bundles`` command, for example:
|
|
|
|
.. code-block:: bash
|
|
|
|
$ zipline bundles
|
|
my-custom-bundle 2016-05-05 20:35:19.809398
|
|
my-custom-bundle 2016-05-05 20:34:53.654082
|
|
my-custom-bundle 2016-05-05 20:34:48.401767
|
|
quandl <no ingestions>
|
|
quantopian-quandl 2016-05-05 20:06:40.894956
|
|
|
|
The output here shows that there are 3 bundles available:
|
|
|
|
- ``my-custom-bundle`` (added by the user)
|
|
- ``quandl`` (provided by zipline, though deprecated)
|
|
- ``quantopian-quandl`` (provided by zipline, the default bundle)
|
|
|
|
The dates and times next to the name show the times when the data for this
|
|
bundle was ingested. We have run three different ingestions for
|
|
``my-custom-bundle``. We have never ingested any data for the ``quandl`` bundle
|
|
so it just shows ``<no ingestions>`` instead. Finally, there is only one
|
|
ingestion for ``quantopian-quandl``.
|
|
|
|
.. _ingesting-data:
|
|
|
|
Ingesting Data
|
|
~~~~~~~~~~~~~~
|
|
|
|
The first step to using a data bundle is to ingest the data.
|
|
The ingestion process will invoke some custom bundle command and then write the data to a standard location that zipline can find.
|
|
By default the location where ingested data will be written is ``$ZIPLINE_ROOT/data/<bundle>`` where by default ``ZIPLINE_ROOT=~/.zipline``.
|
|
The ingestion step may take some time as it could involve downloading and processing a lot of data.
|
|
To ingest a bundle, run:
|
|
|
|
.. code-block:: bash
|
|
|
|
$ zipline ingest [-b <bundle>]
|
|
|
|
|
|
where ``<bundle>`` is the name of the bundle to ingest, defaulting to ``quantopian-quandl``.
|
|
|
|
Old Data
|
|
~~~~~~~~
|
|
|
|
When the ``ingest`` command is used it will write the new data to a subdirectory
|
|
of ``$ZIPLINE_ROOT/data/<bundle>`` which is named with the current date. This
|
|
makes it possible to look at older data or even run backtests with the older
|
|
copies. Running a backtest with an old ingestion makes it easier to reproduce
|
|
backtest results later.
|
|
|
|
One drawback of saving all of the data by default is that the data directory
|
|
may grow quite large even if you do not want to use the data. As shown earlier,
|
|
we can list all of the ingestions with the :ref:`bundles command
|
|
<bundles-command>`. To solve the problem of leaking old data there is another
|
|
command: ``clean``, which will clear data bundles based on some time
|
|
constraints.
|
|
|
|
For example:
|
|
|
|
.. code-block:: bash
|
|
|
|
# clean everything older than <date>
|
|
$ zipline clean [-b <bundle>] --before <date>
|
|
|
|
# clean everything newer than <date>
|
|
$ zipline clean [-b <bundle>] --after <date>
|
|
|
|
# keep everything in the range of [before, after] and delete the rest
|
|
$ zipline clean [-b <bundle>] --before <date> --after <after>
|
|
|
|
# clean all but the last <int> runs
|
|
$ zipline clean [-b <bundle>] --keep-last <int>
|
|
|
|
|
|
Running Backtests with Data Bundles
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Now that the data has been ingested we can use it to run backtests with the
|
|
``run`` command. The bundle to use can be specified with the ``--bundle`` option
|
|
like:
|
|
|
|
.. code-block:: bash
|
|
|
|
$ zipline run --bundle <bundle> --algofile algo.py ...
|
|
|
|
|
|
We may also specify the date to use to look up the bundle data with the
|
|
``--bundle-timestamp`` option. Setting the ``--bundle-timestamp`` will cause
|
|
``run`` to use the most recent bundle ingestion that is less than or equal to
|
|
the ``bundle-timestamp``. This is how we can run backtests with older data.
|
|
``bundle-timestamp`` uses a less-than-or-equal-to relationship so that we can
|
|
specify the date that we ran an old backtest and get the same data that would
|
|
have been available to us on that date. The ``bundle-timestamp`` defaults to
|
|
the current day to use the most recent data.
|
|
|
|
Default Data Bundles
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. _quandl-data-bundle:
|
|
|
|
Quandl WIKI Bundle
|
|
``````````````````
|
|
|
|
By default zipline comes with the ``quantopian-quandl`` data bundle which uses quandl's `WIKI dataset <https://www.quandl.com/data/WIKI>`_.
|
|
The quandl data bundle includes daily pricing data, splits, cash dividends, and asset metadata.
|
|
Quantopian has ingested the data from quandl and rebundled it to make ingestion much faster.
|
|
To ingest the ``quantopian-quandl`` data bundle, run either of the following commands:
|
|
|
|
.. code-block:: bash
|
|
|
|
$ zipline ingest -b quantopian-quandl
|
|
$ zipline ingest
|
|
|
|
Either command should only take a few seconds to download the data.
|
|
|
|
.. note::
|
|
|
|
Quandl has discontinued this dataset.
|
|
The dataset is no longer updating, but is reasonable for trying out Zipline without setting up your own dataset.
|
|
|
|
Writing a New Bundle
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Data bundles exist to make it easy to use different data sources with
|
|
zipline. To add a new bundle, one must implement an ``ingest`` function.
|
|
|
|
The ``ingest`` function is responsible for loading the data into memory and
|
|
passing it to a set of writer objects provided by zipline to convert the data to
|
|
zipline's internal format. The ingest function may work by downloading data from
|
|
a remote location like the ``quandl`` bundle or it may just
|
|
load files that are already on the machine. The function is provided with
|
|
writers that will write the data to the correct location transactionally. If an
|
|
ingestion fails part way through the bundle will not be written in an incomplete
|
|
state.
|
|
|
|
The signature of the ingest function should be:
|
|
|
|
.. code-block:: python
|
|
|
|
ingest(environ,
|
|
asset_db_writer,
|
|
minute_bar_writer,
|
|
daily_bar_writer,
|
|
adjustment_writer,
|
|
calendar,
|
|
start_session,
|
|
end_session,
|
|
cache,
|
|
show_progress,
|
|
output_dir)
|
|
|
|
``environ``
|
|
```````````
|
|
|
|
``environ`` is a mapping representing the environment variables to use. This is
|
|
where any custom arguments needed for the ingestion should be passed, for
|
|
example: the ``quandl`` bundle uses the environment to pass the API key and the
|
|
download retry attempt count.
|
|
|
|
``asset_db_writer``
|
|
```````````````````
|
|
|
|
``asset_db_writer`` is an instance of :class:`~zipline.assets.AssetDBWriter`.
|
|
This is the writer for the asset metadata which provides the asset lifetimes and
|
|
the symbol to asset id (sid) mapping. This may also contain the asset name,
|
|
exchange and a few other columns. To write data, invoke
|
|
:meth:`~zipline.assets.AssetDBWriter.write` with dataframes for the various
|
|
pieces of metadata. More information about the format of the data exists in the
|
|
docs for write.
|
|
|
|
``minute_bar_writer``
|
|
`````````````````````
|
|
|
|
``minute_bar_writer`` is an instance of
|
|
:class:`~zipline.data.minute_bars.BcolzMinuteBarWriter`. This writer is used to
|
|
convert data to zipline's internal bcolz format to later be read by a
|
|
:class:`~zipline.data.minute_bars.BcolzMinuteBarReader`. If minute data is
|
|
provided, users should call
|
|
:meth:`~zipline.data.minute_bars.BcolzMinuteBarWriter.write` with an iterable of
|
|
(sid, dataframe) tuples. The ``show_progress`` argument should also be forwarded
|
|
to this method. If the data source does not provide minute level data, then
|
|
there is no need to call the write method. It is also acceptable to pass an
|
|
empty iterator to :meth:`~zipline.data.minute_bars.BcolzMinuteBarWriter.write`
|
|
to signal that there is no minutely data.
|
|
|
|
.. note::
|
|
|
|
The data passed to
|
|
:meth:`~zipline.data.minute_bars.BcolzMinuteBarWriter.write` may be a lazy
|
|
iterator or generator to avoid loading all of the minute data into memory at
|
|
a single time. A given sid may also appear multiple times in the data as long
|
|
as the dates are strictly increasing.
|
|
|
|
``daily_bar_writer``
|
|
````````````````````
|
|
|
|
``daily_bar_writer`` is an instance of
|
|
:class:`~zipline.data.bcolz_daily_bars.BcolzDailyBarWriter`. This writer is
|
|
used to convert data into zipline's internal bcolz format to later be read by a
|
|
:class:`~zipline.data.bcolz_daily_bars.BcolzDailyBarReader`. If daily data is
|
|
provided, users should call
|
|
:meth:`~zipline.data.minute_bars.BcolzDailyBarWriter.write` with an iterable of
|
|
(sid dataframe) tuples. The ``show_progress`` argument should also be forwarded
|
|
to this method. If the data source does not provide daily data, then there is
|
|
no need to call the write method. It is also acceptable to pass an empty
|
|
iterable to :meth:`~zipline.data.minute_bars.BcolzMinuteBarWriter.write` to
|
|
signal that there is no daily data. If no daily data is provided but minute data
|
|
is provided, a daily rollup will happen to service daily history requests.
|
|
|
|
.. note::
|
|
|
|
Like the ``minute_bar_writer``, the data passed to
|
|
:meth:`~zipline.data.minute_bars.BcolzMinuteBarWriter.write` may be a lazy
|
|
iterable or generator to avoid loading all of the data into memory at once.
|
|
Unlike the ``minute_bar_writer``, a sid may only appear once in the data
|
|
iterable.
|
|
|
|
``adjustment_writer``
|
|
`````````````````````
|
|
|
|
``adjustment_writer`` is an instance of
|
|
:class:`~zipline.data.adjustments.SQLiteAdjustmentWriter`. This writer is
|
|
used to store splits, mergers, dividends, and stock dividends. The data should
|
|
be provided as dataframes and passed to
|
|
:meth:`~zipline.data.adjustments.SQLiteAdjustmentWriter.write`. Each of
|
|
these fields are optional, but the writer can accept as much of the data as you
|
|
have.
|
|
|
|
``calendar``
|
|
````````````
|
|
|
|
``calendar`` is an instance of
|
|
:class:`zipline.utils.calendars.TradingCalendar`. The calendar is provided to
|
|
help some bundles generate queries for the days needed.
|
|
|
|
``start_session``
|
|
`````````````````
|
|
|
|
``start_session`` is a :class:`pandas.Timestamp` object indicating the first
|
|
day that the bundle should load data for.
|
|
|
|
``end_session``
|
|
```````````````
|
|
|
|
``end_session`` is a :class:`pandas.Timestamp` object indicating the last day
|
|
that the bundle should load data for.
|
|
|
|
``cache``
|
|
`````````
|
|
|
|
``cache`` is an instance of :class:`~zipline.utils.cache.dataframe_cache`. This
|
|
object is a mapping from strings to dataframes. This object is provided in case
|
|
an ingestion crashes part way through. The idea is that the ingest function
|
|
should check the cache for raw data, if it doesn't exist in the cache, it should
|
|
acquire it and then store it in the cache. Then it can parse and write the
|
|
data. The cache will be cleared only after a successful load, this prevents the
|
|
ingest function from needing to re-download all the data if there is some bug in
|
|
the parsing. If it is very fast to get the data, for example if it is coming
|
|
from another local file, then there is no need to use this cache.
|
|
|
|
``show_progress``
|
|
`````````````````
|
|
|
|
``show_progress`` is a boolean indicating that the user would like to receive
|
|
feedback about the ingest function's progress fetching and writing the
|
|
data. Some examples for where to show how many files you have downloaded out of
|
|
the total needed, or how far into some data conversion the ingest function
|
|
is. One tool that may help with implementing ``show_progress`` for a loop is
|
|
:class:`~zipline.utils.cli.maybe_show_progress`. This argument should always be
|
|
forwarded to ``minute_bar_writer.write`` and ``daily_bar_writer.write``.
|
|
|
|
|
|
``output_dir``
|
|
``````````````
|
|
|
|
``output_dir`` is a string representing the file path where all the data will be
|
|
written. ``output_dir`` will be some subdirectory of ``$ZIPLINE_ROOT`` and will
|
|
contain the time of the start of the current ingestion. This can be used to
|
|
directly move resources here if for some reason your ingest function can produce
|
|
it's own outputs without the writers. For example, the ``quantopian:quandl``
|
|
bundle uses this to directly untar the bundle into the ``output_dir``.
|
|
|
|
Ingesting Data from .csv Files
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Zipline provides a bundle called ``csvdir``, which allows users to ingest data
|
|
from ``.csv`` files. The format of the files should be in OHLCV format, with dates,
|
|
dividends, and splits. A sample is provided below. There are other samples for testing
|
|
purposes in ``zipline/tests/resources/csvdir_samples``.
|
|
|
|
.. code-block:: text
|
|
|
|
date,open,high,low,close,volume,dividend,split
|
|
2012-01-03,58.485714,58.92857,58.42857,58.747143,75555200,0.0,1.0
|
|
2012-01-04,58.57143,59.240002,58.468571,59.062859,65005500,0.0,1.0
|
|
2012-01-05,59.278572,59.792858,58.952858,59.718571,67817400,0.0,1.0
|
|
2012-01-06,59.967144,60.392857,59.888573,60.342857,79573200,0.0,1.0
|
|
2012-01-09,60.785713,61.107143,60.192856,60.247143,98506100,0.0,1.0
|
|
2012-01-10,60.844284,60.857143,60.214287,60.462856,64549100,0.0,1.0
|
|
2012-01-11,60.382858,60.407143,59.901428,60.364285,53771200,0.0,1.0
|
|
|
|
Once you have your data in the correct format, you can edit your ``extension.py`` file in
|
|
``~/.zipline/extension.py`` and import the csvdir bundle, along with ``pandas``.
|
|
|
|
.. code-block:: python
|
|
|
|
import pandas as pd
|
|
|
|
from zipline.data.bundles import register
|
|
from zipline.data.bundles.csvdir import csvdir_equities
|
|
|
|
We'll then want to specify the start and end sessions of our bundle data:
|
|
|
|
.. code-block:: python
|
|
|
|
start_session = pd.Timestamp('2016-1-1', tz='utc')
|
|
end_session = pd.Timestamp('2018-1-1', tz='utc')
|
|
|
|
And then we can ``register()`` our bundle, and pass the location of the directory in which
|
|
our ``.csv`` files exist:
|
|
|
|
.. code-block:: python
|
|
|
|
register(
|
|
'custom-csvdir-bundle',
|
|
csvdir_equities(
|
|
['daily'],
|
|
'/path/to/your/csvs',
|
|
),
|
|
calendar_name='NYSE', # US equities
|
|
start_session=start_session,
|
|
end_session=end_session
|
|
)
|
|
|
|
To finally ingest our data, we can run:
|
|
|
|
.. code-block:: bash
|
|
|
|
$ zipline ingest -b custom-csvdir-bundle
|
|
Loading custom pricing data: [############------------------------] 33% | FAKE: sid 0
|
|
Loading custom pricing data: [########################------------] 66% | FAKE1: sid 1
|
|
Loading custom pricing data: [####################################] 100% | FAKE2: sid 2
|
|
Loading custom pricing data: [####################################] 100%
|
|
Merging daily equity files: [####################################]
|
|
|
|
# optionally, we can pass the location of our csvs via the command line
|
|
$ CSVDIR=/path/to/your/csvs zipline ingest -b custom-csvdir-bundle
|
|
|
|
|
|
If you would like to use equities that are not in the NYSE calendar, or the existing zipline calendars,
|
|
you can look at the ``Trading Calendar Tutorial`` to build a custom trading calendar that you can then pass
|
|
the name of to ``register()``.
|