4. Wh- Question


In Chapter 3 we generated True/False questions for each passage. In this chapter we generate Wh- questions with a T5 model and a BERT model. According to the Cambridge Dictionary, Wh- questions are questions that begin with what, when, where, who, whom, which, whose, why, or how. In this chapter we generate such Wh- questions in multiple-choice form.

Section 4.1 loads the essay data used as model input, and Section 4.2 builds a function that extracts answer words. Section 4.3 defines a class that generates Wh- questions, Section 4.4 a class that evaluates the questions, and Section 4.5 a class that generates distractors (wrong answers). Finally, Section 4.6 uses these classes to generate Wh- questions from real essay passages.

4.1 Downloading the Dataset

Before generating questions, we load the dataset saved in Chapter 2. This dataset was preprocessed so that its expressions were corrected before saving. You can either load the file you saved yourself in Chapter 2 or load the file stored in the Pseudo Lab GitHub repository. The CoNLL+BEA_corrected_essays.pkl file contains 170 passages in total, and these are the passages from which the Wh- questions are built.

!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils
!python Tutorial-Book-Utils/PL_data_loader.py --data NLP-QG
Cloning into 'Tutorial-Book-Utils'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 30 (delta 9), reused 18 (delta 5), pack-reused 0
Unpacking objects: 100% (30/30), done.
CoNLL+BEA_corrected_essays.pkl is done!
import pickle

file_name = "CoNLL+BEA_corrected_essays.pkl"
# Load the preprocessed essays; the file is closed automatically on exit.
with open(file_name, "rb") as open_file:
    data = pickle.load(open_file)
len(data)
170

Let's print the first passage as an example.

print(data[0])
Keeping the Secret of Genetic Testing What is genetic risk? Genetic risk refers  to your chance of inheriting a disorder or disease. People get certain diseases because of genetic changes. How much a genetic change tells us about your chance of developing a disorder is not always clear. If your genetic results indicate that you have gene changes associated with an increased risk of heart disease, it does not mean that you definitely will develop heart disease. The opposite is also true. If your genetic results show that you do not have changes associated with an increased risk of heart disease, it is still possible that you develop heart disease. However, for some rare diseases, people who have certain gene changes are guaranteed to develop the disease. When we are diagnosed with certain genetic diseases, are we suppose to disclose this result to our relatives? My answer is no. On one hand, we do not want this potential danger havingfrightening effects in our families' later lives. When people around us know that we have certain diseases, their attitude will easily change, whether caring for us too much or keeping away from us. And both are not what we want since most of us just want to live as normal people. Surrounded by such concerns, it is very likely that we are distracted and worry about these problems. It is a concern that will be with us during our whole life, because we  never know when the ''potential bomb'' will explode. On the other hand, if there are ways that can help us to control or cure the disease, we can go through these processes from the scope of the whole family. For  example, if exercising is helpful reducing family potential disease, we can always look for more chances for the family to do exerciseso we keep track of all family members health conditions. At the same time, we are prepared to know when there are other members who have got this disease. Here I want to share Forests'sview on this issue. 
Although some people feel that an individual who is found to carry a dominant gene for Huntington's disease has an ethical obligation to disclose that fact to his or her siblings, there currently is no legal requirement to do so. In fact, requiring someone to communicate his or her own genetic risk to family members who are therefore also at risk is considered by many to be ethically dubious." Nothing is absolutely right or wrong. If a certain  genetic test is very accurate and it is unavoidable and necessary to get treatment and tell  others, it is OK to disclose the result. Above all, life is more important than secrets.

Before defining any functions, we import the packages used in Chapter 4. Depending on your environment some packages may be missing, so please check and install them as needed. (The benepar package is not available in the Colab environment by default, so the code below installs it.)

benepar and nltk are packages for text parsing and tokenization. pandas provides the basic data frame, and numpy is used for numerical operations. torch is the framework for building models, and the transformers package bundles a variety of tokenizers and models.

!pip install -q benepar
import benepar
benepar.download('benepar_en3')
benepar_parser = benepar.Parser("benepar_en3")

import nltk
nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize

import pandas as pd
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    pipeline
)
[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Before modeling in earnest, you can check whether a GPU is available in your environment. If cuda is printed, a GPU is available. If you do not have a GPU, the code also runs on the CPU.

print('cuda' if torch.cuda.is_available() else 'cpu')
cuda

4.2 Extracting Answer Words

The first step in generating Wh- questions is to extract the words that will serve as answers from the passage. Each sentence of a passage is parsed into a tree with benepar_parser, and only the spans labeled as noun phrases (NP) are extracted and stored in a list.

def get_flattened(t):
    # Join the leaves of every child of the subtree into one space-separated string.
    if t is None:
        return None
    return " ".join(" ".join(x.leaves()) for x in t)

def get_NP(doc):
    # Parse each sentence and collect the text of every subtree labeled NP.
    answers = []
    trees = benepar_parser.parse_sents(sent_tokenize(doc))
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == "NP":
                answers.append(get_flattened(subtree))
    return answers
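
To make the traversal in get_NP easier to follow without downloading the parser model, here is a minimal sketch on a hand-written toy tree. The nested-tuple tree, the example sentence, and the helper names leaves and collect_phrases are assumptions made for illustration; the real code works on the tree objects returned by benepar_parser.

```python
# Toy constituency tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings. This is a stand-in for real parser output.
toy_tree = (
    "S",
    ("NP", ("DT", "the"), ("NN", "test")),
    ("VP", ("VBZ", "shows"),
           ("NP", ("DT", "a"), ("JJ", "genetic"), ("NN", "risk"))),
)

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def collect_phrases(node, label="NP"):
    """Collect the flattened text of every subtree carrying the given label."""
    if isinstance(node, str):
        return []
    found = []
    if node[0] == label:
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        found.extend(collect_phrases(child, label))
    return found

noun_phrases = collect_phrases(toy_tree)
print(noun_phrases)  # ['the test', 'a genetic risk']
```

Because every NP subtree is collected, nested noun phrases in a real parse would each appear as a separate answer candidate.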

4.3 Defining the Question Generation Class

Next we define a class that generates Wh- questions. The answer words extracted above are fed, together with the passage, into a T5-based Seq2SeqLM model to generate questions. The tokenizer and model used here are pretrained models provided by Hugging Face. The AutoModel API automatically finds and builds the model architecture that matches the pretrained weights you ask for. For example, if you pass the path of a T5 model it builds the T5 architecture and loads the weights into it, and if you pass the path of a BERT model it builds the BERT architecture and loads the weights. The QuestionGenerator class loads iarfmoose/t5-base-question-generator.

class QuestionGenerator():
    def __init__(self):
        QG_PRETRAINED = "iarfmoose/t5-base-question-generator"

        self.ANSWER_TOKEN = "<answer>"
        self.CONTEXT_TOKEN = "<context>"
        self.SEQ_LENGTH = 512
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.qg_tokenizer = AutoTokenizer.from_pretrained(QG_PRETRAINED, use_fast=False)
        self.qg_model = AutoModelForSeq2SeqLM.from_pretrained(QG_PRETRAINED)
        self.qg_model.to(self.device)
        self.qg_model.eval()

    def generate_question(self, answers):
        # NOTE: `passage` is not a parameter; it is read from the global variable
        # assigned in the loop in Section 4.6, so that loop must set it first.
        questions = []

        for ans in answers:
            # Input format: "<answer> {answer} <context> {passage}"
            qg_input = "{} {} {} {}".format(self.ANSWER_TOKEN, ans, self.CONTEXT_TOKEN, passage)

            encoded_input = self.qg_tokenizer(qg_input, padding='max_length', max_length=self.SEQ_LENGTH, truncation=True, return_tensors="pt").to(self.device)
            with torch.no_grad():
                output = self.qg_model.generate(input_ids=encoded_input["input_ids"])
            question = self.qg_tokenizer.decode(output[0], skip_special_tokens=True)
            questions.append(question)
        return questions
q_generator = QuestionGenerator()
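
The input string that generate_question builds for each answer can be inspected on its own, without loading the model. The sketch below copies the format string from the class; the helper name build_qg_input and the example sentence are assumptions used only for illustration.

```python
ANSWER_TOKEN = "<answer>"
CONTEXT_TOKEN = "<context>"

def build_qg_input(answer, passage):
    """Format one answer/passage pair the same way generate_question does."""
    return "{} {} {} {}".format(ANSWER_TOKEN, answer, CONTEXT_TOKEN, passage)

example = build_qg_input(
    "genetic risk",
    "Genetic risk refers to your chance of inheriting a disorder.",
)
print(example)
# <answer> genetic risk <context> Genetic risk refers to your chance of inheriting a disorder.
```

Since the whole passage follows the answer, long passages are cut off by the tokenizer at SEQ_LENGTH tokens (truncation=True in the class above).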

4.4 Defining the Question Evaluation Class

Next we create a class that evaluates the generated question-answer pairs. The semantic relatedness between a question and its answer is scored with a Sequence Classification model. The QAEvaluator class uses iarfmoose/bert-base-cased-qa-evaluator.

class QAEvaluator():
    def __init__(self):
        QAE_PRETRAINED = "iarfmoose/bert-base-cased-qa-evaluator"
        self.SEQ_LENGTH = 512

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.qae_tokenizer = AutoTokenizer.from_pretrained(QAE_PRETRAINED)
        self.qae_model = AutoModelForSequenceClassification.from_pretrained(QAE_PRETRAINED)
        self.qae_model.to(self.device)

    def encode_qa_pairs(self, questions, answers):
        encoded_pairs = []
        for i in range(len(questions)):
            encoded_qa = self.qae_tokenizer(text=questions[i], text_pair=answers[i], padding="max_length", max_length=self.SEQ_LENGTH, truncation=True, return_tensors="pt")
            encoded_pairs.append(encoded_qa.to(self.device))
        return encoded_pairs

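    def get_scores(self, encoded_qa_pairs):
        # Returns pair indices ranked from highest to lowest score, not the scores themselves.
        scores = {}
        self.qae_model.eval()
        with torch.no_grad():
            for i in range(len(encoded_qa_pairs)):
                # [0] selects the logits, [0] the single example in the batch, [1] the logit at class index 1.
                scores[i] = self.qae_model(**encoded_qa_pairs[i])[0][0][1]
        return [k for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)]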
qa_evaluator = QAEvaluator()
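
Note that get_scores returns positions, not score values. The ranking step can be sketched in isolation as follows; the function name rank_by_score and the numbers in toy_scores are made up for illustration and are not outputs of the evaluator model.

```python
def rank_by_score(scores):
    """Return the indices of `scores`, ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical scores for four question-answer pairs.
toy_scores = [0.2, 1.7, -0.5, 0.9]
ranking = rank_by_score(toy_scores)
print(ranking)  # [1, 3, 0, 2]
```

The first element of the ranking is the index of the best-scoring pair, which is how the list is consumed in Section 4.6.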

4.5 Defining the Distractor Generation Class

Finally, we define a class that generates distractors. Apart from the answer extracted in Section 4.2, three of the four choices must be wrong answers. To produce them, the first noun in the answer is extracted and masked, and a BERT model fills the masked position with a different word. The DistractorGenerator class uses bert-base-cased.

class DistractorGenerator():
    def __init__(self):
        self.unmasker = pipeline('fill-mask', model='bert-base-cased')  

    def generate_distractor(self, text, candidate, answers, NNs: list):
        # text: sentence containing the answer; candidate: rank (0-9) of the fill-mask prediction to use;
        # answers: the answer phrase as a string; NNs: noun tokens found in the answer.
        divided = word_tokenize(text)
        substitute_word = NNs[0]

        # Replace the first occurrence of the noun with [MASK].
        mask_index = divided.index(substitute_word)
        divided[mask_index] = '[MASK]'
        text = ' '.join(divided)
        unmasked_result = self.unmasker(text, top_k=10)[candidate]

        # Put the predicted token into the answer phrase in place of the noun.
        answers = answers.split(' ')
        answer_index = answers.index(substitute_word)
        answers[answer_index] = unmasked_result["token_str"]
        return " ".join(answers)
def get_NN(distractor):
    # Collect the leaves of noun-tagged subtrees (VB is included for an edge case).
    NNs = []
    tree = benepar_parser.parse(distractor)
    for subtree in tree.subtrees():
        if subtree.label() in ["NN", "NNP", "NNS", "VB"]:
            NNs.extend(subtree.leaves())
    return NNs
d_generator = DistractorGenerator()
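
The string manipulation inside generate_distractor can be tried without the BERT model. In this sketch the helper names mask_word and substitute, the example sentence, and the replacement word "new" are assumptions; in the real class the replacement comes from the fill-mask pipeline.

```python
def mask_word(tokens, word, mask_token="[MASK]"):
    """Replace the first occurrence of `word` in a token list with the mask token."""
    masked = list(tokens)
    masked[masked.index(word)] = mask_token
    return masked

def substitute(answer, old_word, new_word):
    """Swap the first occurrence of `old_word` in the answer phrase for `new_word`."""
    parts = answer.split(" ")
    parts[parts.index(old_word)] = new_word
    return " ".join(parts)

tokens = ["We", "went", "to", "a", "movie", "theatre", "."]
masked_sentence = " ".join(mask_word(tokens, "movie"))
print(masked_sentence)  # We went to a [MASK] theatre .

# "new" stands in for a token predicted by the fill-mask model.
distractor = substitute("a movie theatre", "movie", "new")
print(distractor)  # a new theatre
```

Both helpers rely on list.index, which raises ValueError when the word is not present; the same holds for the class above when the noun does not appear as a separate token in the sentence.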

4.6 Generating Wh- Questions

We now use the classes defined so far to generate Wh- questions for the passages. First we create a data frame to store the generated questions and choices, and select the 20 passages to use. (The 20 passage numbers were chosen arbitrarily.)

df_WHQuestions = pd.DataFrame({'id': np.zeros(20),
                               'passage': np.zeros(20),
                               'question': np.zeros(20),
                               'distractor_1': np.zeros(20),
                               'distractor_2': np.zeros(20),
                               'distractor_3': np.zeros(20),
                               'distractor_4': np.zeros(20)})

passage_id_list = [163, 28, 62, 57, 35, 26, 22, 151, 108, 55, 59, 129, 167, 143, 50, 161, 107, 56, 114, 71]

The classes are applied in the order in which they were defined. Along the way, one step restricts the answer to at most four words, and another step finds the sentence that contains the answer before the distractors are generated. The final question, answer, and distractors are then stored in the data frame. Note that the correct answer is stored in the distractor_1 column, and the three generated distractors go into distractor_2 through distractor_4.

df_idx = 0

for passage_id in passage_id_list:

    passage = data[passage_id]

    answers = get_NP(passage)

    questions = q_generator.generate_question(answers)

    encoded_qa_pairs = qa_evaluator.encode_qa_pairs(questions, answers)
    scores = qa_evaluator.get_scores(encoded_qa_pairs)
    
    # Use the highest-ranked answer that has at most 4 words.
    for i in range(len(scores)):
        index = scores[i]
        if len(answers[index].split(' ')) > 4:
            continue
        break
    
    sentences = nltk.sent_tokenize(passage)
    for sentence in sentences:
        if answers[index] in sentence:
            target_sentence = sentence

    NNs = get_NN(answers[index])

    distractors = []
    for i in range(3):
        distractors.append(d_generator.generate_distractor(target_sentence, 9-i, answers[index], NNs))
    
    df_WHQuestions.loc[df_idx] = [passage_id, passage, questions[index].split("?")[0] + "?", answers[index]] + distractors
    print(f"finished {passage_id}")
    df_idx += 1
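
The two selection steps in the loop above (picking a short answer and locating its sentence) can be sketched as standalone functions. The names pick_short_answer and find_sentence and the toy data are assumptions for illustration; the logic is meant to mirror the loop, including its fallbacks.

```python
def pick_short_answer(ranked_indices, answers, max_words=4):
    """Return the first ranked index whose answer has at most `max_words` words.

    If every answer is longer, the last ranked index is returned,
    which is what the loop above ends up with in that case.
    """
    chosen = None
    for idx in ranked_indices:
        chosen = idx
        if len(answers[idx].split(" ")) <= max_words:
            break
    return chosen

def find_sentence(sentences, answer):
    """Return the last sentence that contains the answer, or None if none does."""
    match = None
    for sentence in sentences:
        if answer in sentence:
            match = sentence
    return match

# Toy data: the best-ranked answer (index 0) is too long and gets skipped.
answers = ["a very long noun phrase with many words", "valuable experience", "the job"]
ranked = [0, 1, 2]
best = pick_short_answer(ranked, answers)

sentences = ["I am writing to apply for the job.", "I hope to gain valuable experience."]
target = find_sentence(sentences, answers[best])
print(best, "->", target)  # 1 -> I hope to gain valuable experience.
```

Unlike the loop above, find_sentence returns None when no sentence matches, so a caller can detect that case instead of silently reusing a sentence from a previous passage.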

The final results are shown below.

df_WHQuestions
 | id | passage | question | distractor_1 | distractor_2 | distractor_3 | distractor_4
0 | 163.0 | The waters of the culinary seas had been calm ... | What is the best definition of a mother who ca... | A mother who works | A cook who works | A kid who works | A waitress who works
1 | 28.0 | The world is increasingly becoming flat with a... | What are the disadvantages of social network s... | The cyber communication | The cyber network | The cyber environment | The cyber community
2 | 62.0 | The best places for y... | What country is the best place to visit for yo... | a different country | a different village | a different generation | a different town
3 | 57.0 | Puerquitour: A great experience for your mouth... | Where did we go to watch the movie Renoir? | a movie theatre | a new theatre | a big theatre | a pizza theatre
4 | 35.0 | Nowadays, social media sites are commonly used... | What are the advantages of using social media ... | Substantial costs | Substantial parts | Substantial events | Substantial frames
5 | 26.0 | Interpersonal skills, like any other skills re... | What are the two most popular social media app... | Skype and Facetime | Word and Facetime | Flash and Facetime | email and Facetime
6 | 22.0 | Nowadays, with the advancement of technology, ... | What is the reason why a person may not have a... | anymore contact | anymore interactions | anymore dealings | anymore sex
7 | 151.0 | In this century there have been many technolog... | What is the most important aspect of television? | The entertainment aspect | The technical aspect | The broadcasting aspect | The political aspect
8 | 108.0 | I met a friend about one week ago, and he aske... | What is the feeling that you have now? | an awful feeling | an awful question | an awful think | an awful thinking
9 | 55.0 | Dear Sir or Madam,\nI am writing to apply for ... | What do you hope to get from this job? | valuable experience | valuable ##s | valuable tips | valuable time
10 | 59.0 | Anna knew that it was going to be a very speci... | What was the day she was going to meet her mot... | a very special day | a very special experience | a very special week | a very special one
11 | 129.0 | On Britain's roads there is an ever-increasing... | What would be the easiest solution to this may... | an easy solution | an easy exit | an easy path | an easy reaction
12 | 167.0 | According the Lunde, 35% of homicide victims a... | How many statistics are greater than Lundes' t... | Statistics from 56 | Appeals from 56 | quotes from 56 | gains from 56
13 | 143.0 | "In Vitro fertilisation" is the fertilisation ... | What is the definition of a woman who is given... | a post-menopausal woman | a post-menopausal man | a post-menopausal mother | a post-menopausal survivor
14 | 50.0 | Dear Mrs. Ashby, \n\nYesterday I was in Green ... | What kind of food do you like to serve? | Italian pasta | Italian sandwiches | Italian sauce | Italian restaurants
15 | 161.0 | Computers have definitely affected peoples liv... | What program does he use to make the calculati... | the Communications program | the Communications software | the Communications Unit | the Communications System
16 | 107.0 | Cricket is my passion. I love playing, watchin... | What is the best way to learn more about cricket? | more about bowling | more about India | more about themselves | more about life
17 | 56.0 | Well, I would like to talk about my school lif... | What is the best place to study in the UK? | university | UCLA | USC | Purdue
18 | 114.0 | I have been learning English as a second langu... | How long ago did I decide to take the Cambridg... | One year | One hour | One decade | One weekend
19 | 71.0 | Glad to hear that you've been invited to att... | How long will you wait for a candidate who's l... | one more minute | one more person | one more night | one more opportunity

Looking at the generated questions and choices, the output appears to be reasonably grammatical. However, because the choices were built by simply extracting nouns, some questions are not very important to the passage and some choices are off the mark. Selecting the answers and distractors more carefully should produce more meaningful questions.

So far we have generated Wh- questions with pretrained models provided by Hugging Face. If you have a QA dataset such as SQuAD, try training on the domain you are interested in and use the result to generate specialized questions.

4.7 References