7587번 Anagrams

난이도: 실버 4
알고리즘 분류: 문자열, 해시맵
언어: C++
주차: 7주차
풀이 날짜: 2023년 1월 31일

문제

시간 제한	메모리 제한	제출	정답	맞힌 사람	정답 비율
1 초	128 MB	155	97	60	65.217%

문제

Two words are anagrams if they contain the same letters but in a different order. For example ant and tan are anagrams, but ant and ton are not.

In this problem you will be supplied with a list of words. All you have to do is to pick out the word with the most anagrams and report how many anagrams it has.

입력

Input consists of a number of lists. Each list begins with a number, n, which is the number of words in the list (0 < n <= 1000). The last line of input contains the number 0 – this marks the end of input and should not be processed. Each list contains n words, each on a single line. All words consist of lower case letters only. No word will contain more than 8 letters. Every list will contain at least one word that has an anagram in the list.

출력

Output consists of one word from each list, the word with the most anagrams within the list, followed by a space, followed by the number of anagrams. The word displayed will be the first occurrence in the list of the anagram. If more than one word has the same highest number of anagrams, display only the one that comes first in the list of words.

예제 입력 1


6
not
ant
tan
ton
cat
nat
4
time
timer
emit
mote
0

예제 출력 1


ant 2
time 1

풀이

전체 코드


#include <iostream>
#include <unordered_map>
#include <map>
#include <string>
#include <vector>

int main(void)
{
    std::cin.tie(nullptr);
    std::iostream::sync_with_stdio(false);

    int input = 0;
    /// 이 while문은 이벤트 루프임
    while(true) {
        std::cin >> input;
        if (input == 0) break;
        // string -> vector<string> mapping
        // key is anagram hash, and vector is a list of words that fit with anagram hash
        // for ex. key:"a1b2c3", then can be value = ["abbccc", "cabbcc", ...]
        std::unordered_map<std::string, std::vector<std::string> > anagrams;
        anagrams.reserve(input);

        int max_list_size = 0;
        std::string max_list_string = "";

        for (std::size_t i = 0; i < input; i++) {
            std::string text = "";
            std::cin >> text;

            std::map<char, int> alphabet_table;
            for (const auto c : text) {
                alphabet_table[c]++;
            }
            std::string hash_result = "";
            for (const auto& alphabet : alphabet_table) {
                hash_result += alphabet.first + std::to_string(alphabet.second);
            }

            anagrams[hash_result].emplace_back(text);
            if (anagrams[hash_result].size() > max_list_size) {
                max_list_size = anagrams[hash_result].size();
                max_list_string = hash_result;
            }
        }

        std::cout << anagrams[max_list_string].front() <<
            " " << anagrams[max_list_string].size() - 1 << std::endl;
    }

    return 0;
}

아나그램은, ANT와 NAT처럼 순서는 다르지만 같은 알파벳들로 구성되어 있는 단어입니다.

아나그램을 푸는 방법에는 크게 두 가지가 있습니다.

하나는 정렬을 이용하는 방법이고 하나는 해시를 이용하는 방법입니다.

정렬 방법은 직관적이고 쉽지만, m 길이의 n개 단어들에 대해 $O(nm\log m)$의 시간이 소요됩니다.

반면 해시는 구현이 조금 더 복잡하나 $O(nm)$의 시간을 보장합니다.

두 방법 모두 공간 복잡도는 $O(nm)$입니다.

두 방법 모두 해시맵을 이용하는 건 동일합니다.

방법 1. 정렬

[”abc”, “cde”, “bca”, “adc”, “dec”, “ba”] 리스트가 있다고 합시다.

이때 아나그램을 구하기 위해서 해시맵을 사용합니다. C++에는 std::unordered_map이 있고, 파이썬에는 dictionary가 있습니다. 또는 성능은 조금 떨어지나 std::map을 사용할 수도 있겠죠.

hashmap<std::string ⇒ std::string[] >

맨 처음 “abc”가 들어올 때, 이 단어를 정렬합니다. 그러면 똑같이 “abc”가 되겠죠? 이것을 key로 저장합니다. 그리고 원본인 “abc”를 value로 저장합니다.

hashmap[”abc”] = [”abc”]

두 번 째 “cde”가 들어옵니다. 이를 정렬하면 여전히 “cde”입니다. 이를 key로, 그리고 원본을 value로 저장합니다.
hashmap[”abc”] = [”abc”]
hashmap[”cde”] = [”cde”]
3. 세 번 째 “bca”가 들어옵니다. 이를 정렬하면 “abc”가 됩니다. 이건 1번의 키와 똑같아졌죠? “abc”를 key로, 원본을 value로 저장합니다.
hashmap[”abc”] = [”abc”, “bca”]
hashmap[”cde”] = [”cde”]
네 번 째 “adc”가 들어옵니다. 이를 정렬하면 “acd”가 됩니다.
hashmap[”abc”] = [”abc”, “bca”]
hashmap[”cde”] = [”cde”]
hashmap[”acd”] = [”adc”]
다섯 번 째 “dec”가 들어옵니다. 이를 정렬하면 “cde”가 됩니다. 두 번 째와 key가 같습니다.
hashmap[”abc”] = [”abc”, “bca”]
hashmap[”cde”] = [”cde”, “dec”]
hashmap[”acd”] = [”adc”]
여섯 번 째 “ba”가 들어옵니다. 이를 정렬하면 “ab”가 됩니다.
hashmap[”abc”] = [”abc”, “bca”]
hashmap[”cde”] = [”cde”, “dec”]
hashmap[”acd”] = [”adc”]
hashmap[”ab”] = [”ba”]

이렇게 하면 아나그램을 모두 구할 수 있습니다. “abc”의 아나그램은 “bca”이고, “cde”의 아나그램은 “dec”입니다.

그러나 이 방법은 최대의 단점이 있는데, 바로 정렬에 소요되는 시간입니다. 매 번 정렬 시마다 단어의 길이 m에 대해 ${m\log m}$의 시간이 소요됩니다.

이 방법보다 더 좋은 방법이 있습니다.

방법 2. 해시

[”abc”, “cde”, “bca”, “adc”, “dec”, “ba”] 리스트에 대해서, 각 텍스트를 구성하는 알파벳의 갯수를 세서 이를 key로 만드는 것입니다.

예를 들어 “abc”는 a가 하나, b가 하나, c가 하나 있으므로 “a1b1c1” 이런 식으로 만드는 겁니다.

“abc” → “a1b1c1”이므로, 이를 key로, 원본인 “abc”를 value로 저장하면
hashmap[”a1b1c1”] = [”abc”]
“cde” → “c1d1e1”이므로, 이를 key로, 원본인 “cde”를 value로 저장하면
hashmap[”a1b1c1”] = [”abc”]hashmap[”c1d1e1”] = [”cde”]
“bca” → “a1b1c1”
hashmap[”a1b1c1”] = [ “abc”, “bca” ]
hashmap[”c1d1e1”] = [”cde”]
…

이런 식으로 저장할 수 있습니다.

그러면 만약 “aab”, “aba”는 모두 “a2b1”이 key이므로 이 둘은 아나그램입니다.

반면 “aab”, “abc”는 key값이 각각 “a2b1”, “a1b1c1”이므로 이 둘은 아나그램이 아닙니다.

이렇게 되면 정렬할 필요 없이 $O(m)$의 시간만 필요합니다.

코드 분석

코드를 보면 hashmap을 다음과 같이 선언했습니다.

    
// string -> vector<string> mapping
// key is anagram hash, and vector is a list of words that fit with anagram hash
// for ex. key:"a1b2c3", then can be value = ["abbccc", "cabbcc", ...]
std::unordered_map<std::string, std::vector<std::string> > anagrams;
anagrams.reserve(input);

std::unordered_map을 이용하였고, 이는 key : string ⇒ value : string[] (스트링에서 스트링 배열로)의 해시맵입니다.

이제 텍스트를 입력받고, 이에 대한 해시를 추출할 차례입니다. 이때 주의할 점은 unordered_map이 아닌 일반 map을 사용해야 한다는 점입니다.

    
std::string text = "";
std::cin >> text;

std::map<char, int> alphabet_table;
for (const auto c : text) {
    alphabet_table[c]++;
}
std::string hash_result = "";
for (const auto& alphabet : alphabet_table) {
    hash_result += alphabet.first + std::to_string(alphabet.second);
}

왜냐면 unordered_map은 해시맵인데, 해시맵의 가장 큰 특징은 바로 정렬되지 않았다 입니다. 해시맵은 그 특성상 정렬이 되지 않아요. 그래서 해시맵을 이용하게 되면 조금 비효율적이게 됩니다. 위와 동일한 코드를 해시맵으로 바꾸면 막 “k1c3a1”, “a1c3k1” 이런 식으로 잘못 해시가 만들어질 수 있습니다.

맵은 기본적으로 그 키를 AVL tree에 집어넣기 때문에 key가 알아서 정렬이 됩니다. 다만 삽입 삭제 찾기에 O(1)의 시간이 필요한 해시맵과 달리 일반 맵은 $O(\log n)$의 시간은 걸려요.

alphabet_table을 만들고, 텍스트에서 해당 알파벳의 개수를 셉니다. 이후 해당 알파벳 테이블의 key(first)값 + value값을 더해 hash에 저장합니다.


anagrams[hash_result].emplace_back(text);
if (anagrams[hash_result].size() > max_list_size) {
    max_list_size = anagrams[hash_result].size();
    max_list_string = hash_result;
}

마지막으로 anagrams 해시맵에 결과를 키와 밸류 형태로 저장하면 끝입니다.

한편 이 문제는 가장 많은 아나그램 개수와 첫 번째 아나그램의 원본 텍스트를 출력하는 문제이므로, max_list_size와 max_list_string을 통해서 미리 답을 저장할 수 있습니다.

만약 max_list_size보다 anagrams[hash_result].size() 가 크다면 이를 새로 갱신하는 방식으로 최대값을 찾아갑니다.


	std::cout << anagrams[max_list_string].front() <<
    " " << **anagrams[max_list_string].size() - 1** << std::endl;

출력인데… 하… 이 문제에 사소한 애매함이 있습니다. 오류라고 하긴 그런데, 사이즈에서 -1을 해줘야 합니다. 그냥 size로 출력하면 안 돼요.

문제에서 왜 이걸 안 알려주는지… 저도 이거 없어서 뭐지? 했습니다. 정상적으로 풀었다면 위와 같이 사이즈를 -1해줘야 합니다.

제 생각에는 출제자가 아나그램의 개수에 자기 자신(원본)을 포함시키지는 않아서인 것 같아요.

예를 들어서 “abc”, “bca”가 있을 때 “abc”의 아나그램은 몇 개냐 하고 물으면 “bca” 하나이죠.

근데 저는 그냥 아나그램 개수를 구하는 거니깐 “abc”, “bca” 두 개라고 답한 꼴입니다.

총평 및 후기, 배운 점

아나그램을 이용한 재밌는 문제이고, 푸는 방법이 두 가지라서 더 재밌는 문제입니다.

참고로 위에서 소개한 방법 이외에 한 가지 더 있습니다. 방법 2와 마찬가지로 해시 테이블을 이용하는데, 이번에는 두 해시 테이블을 빼는 방법입니다. 이 경우 굳이 해시값 string을 만들지 않아도 되기에 더욱더 효율적인데, 이에 대한 자세한 구현은 추후에 시간이 나면 한 번 더 다루도록 하겠습니다.

문자열 문제가 풀다보면 상당히 재밌는 거 같습니다.

백준 [7587번 Anagrams]