ClickHouse as a Retention Analysis Tool: A Second-Level Query Scheme for One Billion Records
Principle and Application of Efficient Bitmap Compression
Retention function (retention)
Generally speaking, the retention rate is found by intersecting the sets of users seen on two different days, and doing this with a join is slow. If each user can be represented as a 32-bit unsigned integer and stored in a bitmap, then intersecting S1 and S2 becomes a direct bitwise comparison, which is much faster. RoaringBitmap also compresses the data, and its intersection is faster than a plain bitmap in most cases, so here we consider using RoaringBitmap to optimize the retention calculation.
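The idea can be sketched in plain Python, assuming users are already mapped to small non-negative integers. This is only an illustration: a Python int stands in for an uncompressed bitmap (not a RoaringBitmap), and the names to_bitmap and retention_rate are hypothetical.

```python
def to_bitmap(user_ids):
    """Pack non-negative integer user IDs into one Python int used as a bitset."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid  # set the bit at position uid
    return bitmap


def retention_rate(day1_bitmap, day2_bitmap):
    """Share of day-1 users who also appear on day 2."""
    base = bin(day1_bitmap).count("1")  # number of users on day 1
    if base == 0:
        return 0.0
    # The intersection is a single bitwise AND, with no join involved.
    retained = bin(day1_bitmap & day2_bitmap).count("1")
    return retained / base


day1 = to_bitmap([1, 2, 3, 5, 8])
day2 = to_bitmap([2, 3, 8, 13])
print(retention_rate(day1, day2))  # 0.6
```

Three of the five day-1 users (2, 3 and 8) return on day 2, so the rate is 0.6.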
For background, see the article on bitmap encoding (the application of bitmap encoding in a CDP):
/developer/news /683 175
Detailed steps for the user-selection function
(1). Generate user mapping.
A mapping table, mem_mapping_tf, is built to map every kind of UID to a globally unique 32-bit unsigned integer. Two problems arise here. The first is ID mapping (connecting identifiers across all data sources), which determines accuracy. Other identifiers, such as device IDs, have to be mapped as well. Doing this at all is easy; doing it well is very hard. ID mapping is a project in its own right (see how Sensors Data (Shence) does it; I have read about it before and it is worth knowing). The second is how the mapping table achieves global uniqueness (building the ID system):
How to build globally unique, contiguous numeric IDs for one billion users?
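A minimal single-process sketch of such a mapping is shown below. The source does not say how mem_mapping_tf is populated, so this is an assumption for illustration only: IDs are handed out contiguously in order of first appearance, and the class name IdMapper is hypothetical. A real deployment for a billion users would need a distributed or batch approach.

```python
UINT32_MAX = 2**32 - 1


class IdMapper:
    """Assigns each distinct uid a contiguous integer that fits in 32 bits."""

    def __init__(self):
        self._ids = {}  # uid -> assigned integer ID

    def get_or_assign(self, uid):
        if uid not in self._ids:
            next_id = len(self._ids)  # contiguous: 0, 1, 2, ...
            if next_id > UINT32_MAX:
                raise OverflowError("more users than a UInt32 can address")
            self._ids[uid] = next_id
        return self._ids[uid]


mapper = IdMapper()
print(mapper.get_or_assign("device-abc"))  # 0
print(mapper.get_or_assign("user-42"))     # 1
print(mapper.get_or_assign("device-abc"))  # 0 again: the mapping is stable
```

Contiguous IDs keep the bitmaps dense, which generally helps compression.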
(2). Data transformation
Map the uid in the raw behavior data to the oneid.
This transformation is done in Spark/Hive.
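The transformation amounts to joining the behavior data with the mapping table. The plain-Python stand-in below only shows the logic; it is not Spark or Hive code, and the function name map_events and the tuple layout are assumptions.

```python
def map_events(events, uid_to_oneid):
    """Replace uid with oneid in (uid, event_date) rows.

    Rows whose uid has no mapping are dropped, which mirrors an inner join.
    """
    mapped = []
    for uid, event_date in events:
        oneid = uid_to_oneid.get(uid)
        if oneid is not None:
            mapped.append((oneid, event_date))
    return mapped


uid_to_oneid = {"user-42": 0, "device-abc": 1}
events = [("user-42", "2020-01-01"), ("unknown", "2020-01-01"), ("device-abc", "2020-01-02")]
print(map_events(events, uid_to_oneid))  # [(0, '2020-01-01'), (1, '2020-01-02')]
```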
(3). Import into ClickHouse (ck) and compress the data.
What pitfalls might there be? I don't know them all yet. I lost data in ClickHouse some time ago ... and traced it to the primary key; other problems will only show up in practice.
(4). Query
Using bitmap functions in ClickHouse
The retention function takes a set of conditions as arguments, from 1 to 32 arguments of type UInt8, each indicating whether the event meets a certain condition.
Each condition is an expression that returns a UInt8 result. The function returns an array whose elements are:
1, the condition is met.
0, the condition is not met.
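These rules match ClickHouse's retention aggregate function, where the first element reports whether the first condition was met and each later element reports whether both the first condition and that condition were met. The Python emulation below is a sketch of those semantics for one user's events, not ClickHouse code; the helper on_day and the row layout are assumptions.

```python
def retention(rows, conditions):
    """Emulate retention(cond1, cond2, ...) over the events of one user."""
    if not 1 <= len(conditions) <= 32:
        raise ValueError("retention takes between 1 and 32 conditions")
    # A condition counts as met if any event in the group satisfies it.
    met = [any(cond(row) for row in rows) for cond in conditions]
    first = met[0]
    # Later elements are 1 only when the first condition was also met.
    return [int(first)] + [int(first and m) for m in met[1:]]


def on_day(day):
    """Build a condition that is true for events on the given date."""
    return lambda row: row["date"] == day


events = [{"date": "2020-01-01"}, {"date": "2020-01-03"}]
conditions = [on_day("2020-01-01"), on_day("2020-01-02"), on_day("2020-01-03")]
print(retention(events, conditions))  # [1, 0, 1]
```

Summing each array position over all users gives the number of users retained on each day.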
Compared with the bitmap functions, the retention function is more convenient. ...
Solving retention analysis through data modeling: the zipper table
Step 1: From the session table dw.traffic_aggr_session, compute the guids of the users who logged in today.
Step 2: Full join today's daily active list with yesterday's activity table. Calculation rules:
first_dt, guid and range_start follow the same rule: take yesterday's value if one exists, otherwise today's (in which case the user is new).
range_end rule: if the user logged in yesterday but not today, it is yesterday's date. If the user has no record from yesterday, it is today's value (new user). Otherwise it stays as it was yesterday (if the user does not log in today, an already closed interval remains unchanged).
Step 3: One case is not covered by the full join: a user seen before logs in again today (max(range_end) != '9999-12-31'), so a UNION ALL is needed.
Take this user's guid and first_dt from the activity table and join them with the daily active table.
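The three steps can be sketched as a daily update over an in-memory list. This is one reading of the rules above, not the author's SQL: an open interval is assumed to carry range_end '9999-12-31', the function name update_zipper is hypothetical, and dates are ISO strings.

```python
OPEN_END = "9999-12-31"  # assumed sentinel for an interval that is still open


def update_zipper(zipper, today_active, today, yesterday):
    """Return the zipper table after applying today's set of active guids."""
    result = []
    still_open = set()
    first_seen = {}
    for row in zipper:
        first_seen.setdefault(row["guid"], row["first_dt"])
        if row["range_end"] != OPEN_END:
            result.append(dict(row))  # closed interval: unchanged
        elif row["guid"] in today_active:
            still_open.add(row["guid"])
            result.append(dict(row))  # active again today: stays open
        else:
            result.append({**row, "range_end": yesterday})  # close the interval
    # New users and returning users (the UNION ALL case) start a new interval.
    for guid in sorted(today_active - still_open):
        result.append({
            "guid": guid,
            "first_dt": first_seen.get(guid, today),
            "range_start": today,
            "range_end": OPEN_END,
        })
    return result


zipper = [
    {"guid": "a", "first_dt": "2020-01-01", "range_start": "2020-01-01", "range_end": OPEN_END},
    {"guid": "b", "first_dt": "2020-01-01", "range_start": "2020-01-01", "range_end": "2020-01-01"},
    {"guid": "c", "first_dt": "2020-01-02", "range_start": "2020-01-02", "range_end": OPEN_END},
]
updated = update_zipper(zipper, {"a", "b", "d"}, "2020-01-03", "2020-01-02")
for row in updated:
    print(row)
```

Here a stays open, c is closed at 2020-01-02, b gets a second interval starting 2020-01-03 with its original first_dt, and d enters as a new user.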
Using bitmaps in ClickHouse