Amazon Redshift SQL translation guide
Data types
Users moving data between ClickHouse and Redshift will immediately notice that ClickHouse offers a more extensive range of types, which are also less restrictive. While Redshift requires users to specify possible string lengths, even if variable, ClickHouse removes this restriction and burden from the user by storing strings without encoding as bytes. The ClickHouse String type thus has no limits or length specification requirements.
Furthermore, users can exploit Arrays, Tuples, and Enums - absent from
Redshift as first-class citizens (although Arrays/Structs can be imitated
with SUPER
) and a common frustration of users. ClickHouse additionally
allows the persistence, either at query time or even in a table, of
aggregation states. This will enable data to be pre-aggregated, typically
using a materialized view, and can dramatically improve query performance
for common queries.
Below we map the equivalent ClickHouse type for each Redshift type:
Redshift | ClickHouse |
---|---|
SMALLINT | Int8 * |
INTEGER | Int32 * |
BIGINT | Int64 * |
DECIMAL | UInt128 , UInt256 , Int128 , Int256 , Decimal(P, S) , Decimal32(S) , Decimal64(S) , Decimal128(S) , Decimal256(S) - (high precision and ranges possible) |
REAL | Float32 |
DOUBLE PRECISION | Float64 |
BOOLEAN | Bool |
CHAR | String , FixedString |
VARCHAR ** | String |
DATE | Date32 |
TIMESTAMP | DateTime , DateTime64 |
TIMESTAMPTZ | DateTime , DateTime64 |
GEOMETRY | Geo Data Types |
GEOGRAPHY | Geo Data Types (less developed e.g. no coordinate systems - can be emulated with functions) |
HLLSKETCH | AggregateFunction(uniqHLL12, X) |
SUPER | Tuple , Nested , Array , JSON , Map |
TIME | DateTime , DateTime64 |
TIMETZ | DateTime , DateTime64 |
VARBYTE ** | String combined with Bit and Encoding functions |
UInt8
, UInt32
, UInt32
and UInt64
.
**ClickHouse’s String type is unlimited by default but can be constrained to specific lengths using Constraints.
DDL syntax
Sorting keys
Both ClickHouse and Redshift have the concept of a “sorting key”, which define
how data is sorted when being stored. Redshift defines the sorting key using the
SORTKEY
clause:
Comparatively, ClickHouse uses an ORDER BY
clause to specify the sort order:
In most cases, you can use the same sorting key columns and order in ClickHouse
as Redshift, assuming you are using the default COMPOUND
type. When data is
added to Redshift, you should run the VACUUM
and ANALYZE
commands to re-sort
newly added data and update the statistics for the query planner - otherwise, the
unsorted space grows. No such process is required for ClickHouse.
Redshift supports a couple of convenience features for sorting keys. The first is
automatic sorting keys (using SORTKEY AUTO
). While this may be appropriate for
getting started, explicit sorting keys ensure the best performance and storage
efficiency when the sorting key is optimal. The second is the INTERLEAVED
sort key,
which gives equal weight to a subset of columns in the sort key to improve
performance when a query uses one or more secondary sort columns. ClickHouse
supports explicit projections, which achieve the
same end-result with a slightly different setup.
Users should be aware that the “primary key” concept represents different things in ClickHouse and Redshift. In Redshift, the primary key resembles the traditional RDMS concept intended to enforce constraints. However, they are not strictly enforced in Redshift and instead act as hints for the query planner and data distribution among nodes. In ClickHouse, the primary key denotes columns used to construct the sparse primary index, used to ensure the data is ordered on disk, maximizing compression while avoiding pollution of the primary index and wasting memory.