What is Skew?
Data skew occurs when the data is unevenly divided across the partitions
The equation used to calculate the skew of a data or flow partition -
The skew of a data or flow partition is the amount by which its size deviates from the average partition size, expressed as a percentage of the largest partition:
skew of a data or flow partition = ((partition size – average partition size) / size of largest partition) X 100
The skew of a data partition is negative if the size of the partition is smaller than the average partition size, and is positive if the size of the partition is larger. The sum of all skews for a dataset or flow is 0%.
For example, suppose a flow carrying 200 MB of data has four flow partitions of the following sizes: 30 MB, 40 MB, 50 MB, and 80 MB. This gives the flow an average partition size of 50 MB, with the largest partition carrying 80 MB. The first flow partition, partition 0, carries 30 MB of data. The following formula calculates the skew of partition 0:
skew of partition = (30 MB - 50 MB) / 80 MB = -20 MB / 80 MB = -0.25 X 100 = -25%
How to handle the skew issue?
Change the partitioning key to reduce the data skew on the flow between the components.
No comments:
Post a Comment