#393

UTF-8 Validation

Medium
ArrayBit ManipulationBit ManipulationArray
LeetCode ↗

Approaches

Brute ForceOptimal
Complexity Comparison
Brute ForceOptimal Solution
Time
O(n²)
O(n)
Space
O(1)
O(1)
💡

Intuition

Time O(n)Space O(1)

The optimal solution processes each byte in a single pass, checking the leading bits to determine the number of bytes in the current character and validating the continuation bytes accordingly. This reduces the time complexity significantly.

⚙️

Algorithm

5 steps
  1. 1Step 1: Initialize an index to track the current byte being processed.
  2. 2Step 2: Loop through each byte in the data array.
  3. 3Step 3: Determine the number of bytes for the current character based on the leading bits.
  4. 4Step 4: Validate the required continuation bytes based on the number of bytes determined.
  5. 5Step 5: If any validation fails, return false. If all bytes are processed successfully, return true.
solution.py27 lines
1# Full working Python code
2
3def validUtf8(data):
4    i = 0
5    while i < len(data):
6        byte = data[i]
7        if (byte & 0x80) == 0:
8            # 1-byte character
9            i += 1
10        elif (byte & 0xE0) == 0xC0:
11            # 2-byte character
12            if i + 1 >= len(data) or (data[i + 1] & 0xC0) != 0x80:
13                return False
14            i += 2
15        elif (byte & 0xF0) == 0xE0:
16            # 3-byte character
17            if i + 2 >= len(data) or (data[i + 1] & 0xC0) != 0x80 or (data[i + 2] & 0xC0) != 0x80:
18                return False
19            i += 3
20        elif (byte & 0xF8) == 0xF0:
21            # 4-byte character
22            if i + 3 >= len(data) or (data[i + 1] & 0xC0) != 0x80 or (data[i + 2] & 0xC0) != 0x80 or (data[i + 3] & 0xC0) != 0x80:
23                return False
24            i += 4
25        else:
26            return False
27    return True

Complexity note: The complexity is O(n) because we only traverse the array once, checking each byte and its continuation bytes in a single pass.

  • 1Understanding the bit manipulation is crucial for identifying the number of bytes in UTF-8 encoding.
  • 2The leading bits of each byte dictate how many continuation bytes are expected.

Solutions and explanations are original Tejav content. Problem titles © LeetCode — use the LeetCode button above for the full problem statement.