How do you solve Find Duplicate File in System on LeetCode?

Find Duplicate File in System (LeetCode #609) can be solved with a brute-force pairwise comparison or with a hash map. The hash map approach is the optimal one, with O(n) time complexity and O(n) space complexity. Key insight: a hash map keyed by file content groups duplicate files in a single pass.

What algorithm or data structure is used to solve Find Duplicate File in System?

Find Duplicate File in System uses the following concepts: Array, Hash Table, String. The recommended patterns to study are: Hash Map, Array.

What are the interview tips for Find Duplicate File in System?

Always clarify the input format and constraints before diving into the solution. Think about how you can reduce the number of comparisons needed. Practice explaining your thought process as you code.

What are common mistakes when solving Find Duplicate File in System?

Two mistakes are common: parsing the file name and content incorrectly from the input string, and failing to check whether the content already exists in the hash map before appending a path to it.

Find Duplicate File in System — LeetCode #609 (Medium)

Tags: Array, Hash Table, String

Related patterns: Hash Map, Array

Brute Force approach

Time complexity: O(n²). Space complexity: O(1).

The brute-force approach checks every file against every other file to find duplicates. This is straightforward but inefficient, especially with a large number of files.

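The time complexity is O(n²), where n is the number of files, because each file is compared with every other file; each comparison also costs time proportional to the length of the contents being compared. The space complexity is O(1) auxiliary space, not counting the parsed input or the results list, since no additional lookup structure is used.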

Step 1: Initialize an empty list to store the results.
Step 2: For each pair of files, compare their contents.
Step 3: If two files have the same content, add their paths to the same group in the results list, so that three or more identical files end up in one group.

1. Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
2. Compare '1.txt(abcd)' with '2.txt(efgh)' → No match.
3. Compare '1.txt(abcd)' with '3.txt(abcd)' → Match found! Add ['root/a/1.txt', 'root/c/3.txt'] to duplicates.
4. Compare '2.txt(efgh)' with both '4.txt(efgh)' files → Matches found! Add ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt'] to duplicates.
5. Continue until all pairs are checked.
6. Output: [['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt'], ['root/a/1.txt', 'root/c/3.txt']]
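The brute-force steps can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the input has already been parsed into (path, content) pairs, and the function name `group_duplicates_brute_force` is made up for this example.

```python
def group_duplicates_brute_force(files):
    """Group paths with identical content by comparing every pair of files.

    `files` is assumed to be a list of (full_path, content) tuples.
    """
    result = []
    for i, (path_i, content_i) in enumerate(files):
        # If an earlier file has the same content, this file was already
        # grouped when that earlier file was processed, so skip it.
        if any(files[j][1] == content_i for j in range(i)):
            continue
        group = [path_i]
        # Compare against every later file.
        for j in range(i + 1, len(files)):
            if files[j][1] == content_i:
                group.append(files[j][0])
        if len(group) > 1:
            result.append(group)
    return result


files = [
    ("root/a/1.txt", "abcd"),
    ("root/a/2.txt", "efgh"),
    ("root/c/3.txt", "abcd"),
    ("root/c/d/4.txt", "efgh"),
    ("root/4.txt", "efgh"),
]
print(group_duplicates_brute_force(files))
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]
```

Scanning earlier files instead of keeping a "seen" set keeps the auxiliary space constant, at the cost of more comparisons.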

Optimal Solution approach

Time complexity: O(n). Space complexity: O(n).

The optimal approach uses a hash map to group files by their content. This drastically reduces the number of comparisons needed, allowing us to efficiently find duplicates.

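The time complexity is O(n) because we process each file exactly once; more precisely, the work is linear in the total length of the input, since every content string has to be read and hashed. The space complexity is O(n) because the hash map stores the content and path of every file.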

Step 1: Initialize a hash map to store file contents as keys and their paths as values.
Step 2: For each directory, extract the files and their contents.
Step 3: For each file, add its path to the hash map under its content key.
Step 4: Collect all paths from the hash map that have more than one entry (duplicates).
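Step 2 is where most of the string handling happens, so a small sketch of the parsing may help. The helper name `parse_directory_entry` is hypothetical, and the code assumes the input format shown in the examples: a directory path followed by space-separated `name(content)` entries.

```python
def parse_directory_entry(entry):
    """Return (full_path, content) pairs for one input string."""
    # The first token is the directory; the rest are "name(content)" items.
    directory, *items = entry.split(" ")
    pairs = []
    for item in items:
        # partition splits at the first "(" only.
        name, _, rest = item.partition("(")
        content = rest[:-1]  # drop the closing ")"
        pairs.append((f"{directory}/{name}", content))
    return pairs


print(parse_directory_entry("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```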

1. Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)"]
2. Initialize contentMap: {}.
3. Process 'root/a 1.txt(abcd) 2.txt(efgh)': contentMap = {'abcd': ['root/a/1.txt'], 'efgh': ['root/a/2.txt']}
4. Process 'root/c 3.txt(abcd)': contentMap = {'abcd': ['root/a/1.txt', 'root/c/3.txt'], 'efgh': ['root/a/2.txt']}
5. Process 'root/c/d 4.txt(efgh)': contentMap = {'abcd': ['root/a/1.txt', 'root/c/3.txt'], 'efgh': ['root/a/2.txt', 'root/c/d/4.txt']}
6. Output: [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt']]

Key Insights

Using a hash map allows us to efficiently group files by content.
Identifying duplicates requires careful extraction of file names and contents.

Common Mistakes

Not correctly parsing the file name and content from the input string.
Failing to check if the content already exists in the hash map.
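The second mistake is easy to reproduce with a plain dictionary. The sketch below is illustrative only: `group_by_content` is a made-up name, and the input is assumed to be a list of already parsed (path, content) pairs.

```python
def group_by_content(files):
    """Map each content string to the list of paths that hold it."""
    content_map = {}
    for path, content in files:
        # Writing content_map[content].append(path) here would raise KeyError
        # the first time a content is seen, because the key does not exist yet.
        # setdefault creates the empty list on first sight, then appends.
        content_map.setdefault(content, []).append(path)
    return content_map


files = [
    ("root/a/1.txt", "abcd"),
    ("root/a/2.txt", "efgh"),
    ("root/c/3.txt", "abcd"),
]
print(group_by_content(files))
# {'abcd': ['root/a/1.txt', 'root/c/3.txt'], 'efgh': ['root/a/2.txt']}
```

Using `collections.defaultdict(list)` achieves the same effect without the explicit `setdefault` call.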

Interview Tips

Always clarify the input format and constraints before diving into the solution.
Think about how you can reduce the number of comparisons needed.
Practice explaining your thought process as you code.


Complexity Comparison

	Brute Force	Optimal Solution★
Time	O(n²)	O(n)
Space	O(1)	O(n)


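solution.py (Optimal Solution)

from collections import defaultdict

def findDuplicate(paths):
    # Map each file content to the full paths of the files that contain it.
    content_map = defaultdict(list)
    for path in paths:
        # The first token is the directory; the rest are "name(content)" entries.
        parts = path.split(' ')
        dir_path = parts[0]
        for file in parts[1:]:
            # Split at the first '(' and drop the closing ')'.
            name, content = file.split('(', 1)
            content_map[content[:-1]].append(f'{dir_path}/{name}')
    # Keep only contents shared by at least two files.
    return [group for group in content_map.values() if len(group) > 1]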


Solutions and explanations are original Tejav content. Problem titles © LeetCode — use the LeetCode button above for the full problem statement.