Decoupled storage and compute
One of the features that OrioleDB provides is the ability to decouple storage and compute. Data is stored in a separate storage layer, such as S3, while the compute layer runs on a separate instance. This allows compute and storage resources to be scaled and managed independently.
This feature is currently experimental and should be used with caution.
S3 storage is not required to use OrioleDB. It is an optional feature that can be enabled by setting the appropriate configuration parameters.
S3 database storage
OrioleDB has experimental support for storing all table and materialized view data in an S3 bucket. This is useful for splitting compute and data storage instances, for increasing data safety, and for scaling or changing the architecture of compute instances while preserving all data.
Local storage caches the most frequently accessed data. It also ensures that inserts and updates run at the speed of writes to local storage rather than at the S3 transfer rate; data is synced to S3 asynchronously. Nevertheless, data integrity is guaranteed for all the data on S3 storage as well, so you can connect another empty PostgreSQL instance (initialized with the utility described below), with the OrioleDB extension configured to use S3 with the same bucket, and get back all the data from S3 in the state of the last PostgreSQL checkpoint.
To use S3 functionality, the following parameters should be set before creating orioledb tables and materialized views:
- `orioledb.s3_mode` -- whether to use S3 mode. It can be `on` or `off`. The default is `off`.
- `archive_library = 'orioledb'` -- set it to use S3 mode.
- `archive_mode = on` -- set it to use S3 mode.
- `orioledb.s3_region` -- specify the S3 region where the S3 bucket is created.
- `orioledb.s3_host` -- access endpoint address for the S3 bucket (without the `https://` prefix), e.g. `mybucket.s3-accelerate.amazonaws.com`.
- `orioledb.s3_accesskey` -- specify the AWS access key used to authenticate to the bucket.
- `orioledb.s3_secretkey` -- specify the AWS secret key used to authenticate to the bucket.
- `orioledb.s3_num_workers` -- specify the number of workers syncing data to the S3 bucket. More workers can make the sync faster; 20 is a recommended value that is enough in most cases.
- `orioledb.s3_desired_size` -- the total desired size of OrioleDB tables on local storage. Once this limit is exceeded, OrioleDB's background workers will begin evicting local data to the S3 bucket. This mechanism ensures efficient use of local storage and seamless data transfer to S3. Effective support for this limit requires a filesystem that supports sparse files.
- `max_worker_processes` -- PostgreSQL limit for the maximum number of workers. Should be set to accommodate the extra `orioledb.s3_num_workers` and all other Postgres workers. To start, set it to `orioledb.s3_num_workers` plus the previous `max_worker_processes` value.
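As a reference, a minimal `postgresql.conf` fragment could look like the sketch below. The bucket name, region, keys, and sizes are placeholders, and the unit format accepted by `orioledb.s3_desired_size` is an assumption; adjust the values to your environment.

```
# Sketch of an S3-enabled configuration (placeholder values)
shared_preload_libraries = 'orioledb'

orioledb.s3_mode = on
archive_mode = on
archive_library = 'orioledb'

orioledb.s3_region = 'eu-central-1'                        # placeholder region
orioledb.s3_host = 'mybucket.s3-accelerate.amazonaws.com'  # placeholder bucket endpoint
orioledb.s3_accesskey = '<your access key>'
orioledb.s3_secretkey = '<your secret key>'

orioledb.s3_num_workers = 20
orioledb.s3_desired_size = '20GB'   # assumed to accept standard PostgreSQL size units
max_worker_processes = 28           # previous value (e.g. 8) plus orioledb.s3_num_workers
```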
After setting the GUC parameters above, restart the postmaster. All tables and materialized views subsequently created using `orioledb` will be synced with the S3 bucket:
```sql
CREATE TABLE s3_test
(
    id int8 NOT NULL,
    value1 float8 NOT NULL,
    value2 text NOT NULL,
    PRIMARY KEY(id)
) USING orioledb;
```
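Once created, the table is used like any other PostgreSQL table; writes land in local storage first and are synced to the S3 bucket asynchronously. A quick smoke test with made-up sample data:

```sql
-- Insert some sample rows and read them back; syncing to S3 happens in the background
INSERT INTO s3_test (id, value1, value2)
SELECT i, random(), 'row ' || i
FROM generate_series(1, 1000) AS i;

SELECT count(*) FROM s3_test;
```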
In S3 mode, all tables and materialized views created with `USING orioledb` are synchronized with S3 incrementally; that is, only modified blocks are put into the S3 bucket. Tables and materialized views not created with `USING orioledb` are saved completely at every checkpoint. It is therefore recommended to use S3 mode when the majority of your data is stored with the OrioleDB engine.
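To check which relations use the OrioleDB table access method (and are therefore synced incrementally), you can query the standard PostgreSQL catalogs; this is plain catalog SQL rather than an OrioleDB-specific interface:

```sql
-- Relations stored with the orioledb access method
-- (relkind 'r' = table, 'm' = materialized view)
SELECT c.relname, c.relkind
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE am.amname = 'orioledb';
```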
For best results, it's recommended to turn on Transfer acceleration in the general AWS S3 bucket settings (the endpoint address will then be given with the `s3-accelerate.amazonaws.com` suffix) and to keep the bucket and the compute instance within the same AWS region. Even better is to use an AWS Directory bucket within the same AWS region and Availability Zone as the compute instance.
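Transfer acceleration can also be enabled from the AWS CLI instead of the console; `mybucket` below is a placeholder for your bucket name:

```bash
# Enable S3 Transfer Acceleration on the bucket (placeholder name)
aws s3api put-bucket-accelerate-configuration \
  --bucket mybucket \
  --accelerate-configuration Status=Enabled

# Verify that acceleration is enabled
aws s3api get-bucket-accelerate-configuration --bucket mybucket
```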
As mentioned above, S3 mode is currently experimental. The major limitations of this mode are the following:
- While OrioleDB tables and materialized views are stored incrementally in the S3 bucket, the history is kept forever. There is currently no mechanism to safely remove the old data.
- In a primary/replica setup, each instance should have a separate S3 bucket.
- The user is responsible for ensuring that only one instance is working with the same S3 bucket. Connecting multiple database instances simultaneously to the same bucket leads to data corruption.
All of the limitations above are temporary and will be removed in future releases.
S3 loader utility
The S3 loader utility downloads data from the S3 bucket to a specified directory on any local machine.
To use it, install `boto3` and `testgres` into your Python environment:

```bash
pip install boto3 testgres
```
Run the script with the same parameters as in your S3-enabled Postgres cluster configuration:
- `AWS_ACCESS_KEY_ID` - same as `orioledb.s3_accesskey`
- `AWS_SECRET_ACCESS_KEY` - same as `orioledb.s3_secretkey`
- `AWS_DEFAULT_REGION` - same as `orioledb.s3_region`
- `--endpoint` - same as `orioledb.s3_host`, but as a full URL with the `https://` prefix, e.g. `--endpoint=https://mybucket.s3-accelerate.amazonaws.com` or `--endpoint=https://mybucket.s3.amazonaws.com`
- `--bucket-name` - S3 bucket name from `orioledb.s3_host`, e.g. `--bucket-name=mybucket`
- `--data-dir` - destination directory on the local machine you want to write data to, e.g. `--data-dir=mydata/`
- `--verbose` - optionally print extended info
```bash
AWS_ACCESS_KEY_ID=<your access key> \
AWS_SECRET_ACCESS_KEY='<your secret key>' \
AWS_DEFAULT_REGION=<your region> \
python orioledb_s3_loader.py \
  --endpoint=https://<your-bucket-endpoint> \
  --bucket-name=<your-bucket-name> \
  --data-dir='orioledb_data' \
  --verbose
```