3. MC Questions

In the previous chapter, we downloaded the dataset used in this tutorial, visualized it, and preprocessed it. In this chapter, we use that dataset to practice building a model that generates MC (multiple-choice) questions.

Section 3.1 introduces the GPT-2 and BERT models used to generate the questions, and Section 3.2 loads the data and takes another look at it. Section 3.3 then summarizes the text data with a T5 model, and finally Section 3.4 filters the sentences in the text and generates MC questions with GPT-2 and BERT.

Generating MC questions can be described in five broad steps. First, the key sentences are extracted from the passage text and summarized. Second, the extracted key sentences are rewritten with similar words and phrases (paraphrasing). Third, the paraphrased sentences are parsed. Fourth, the parsed sentences and the GPT-2 model are used to generate false sentences. Finally, similarity is evaluated, and the sentences least similar to the correct sentence are used as the wrong answer choices. These steps can be followed in order from Section 3.2 onward.


3.1.1 GPT-2

GPT-2 is the second natural language processing model in OpenAI's GPT-N series. It is a Transformer-based language model with up to 1.5 billion parameters, trained on a dataset of 8 million web pages, and it is designed to predict the next word given the preceding words of a text. In the MC question generation task, we use GPT-2 to generate false sentences.
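To make "predicting the next word from the preceding words" concrete, here is a toy sketch that is not GPT-2: it only counts word pairs in a tiny corpus and predicts the most frequent follower. The corpus and the helper names `train_bigrams` and `predict_next` are illustrative assumptions; GPT-2 conditions on the whole preceding context with a Transformer rather than on a single word.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Tiny made-up corpus (assumption, for illustration only).
corpus = [
    "the microwave heats food",
    "the microwave heats water",
    "the oven bakes bread",
]
model = train_bigrams(corpus)
print(predict_next(model, "microwave"))  # heats
```

Feeding a partial sentence to a next-word predictor and letting it continue is the same idea the chapter relies on when GPT-2 completes a truncated true sentence into a false one.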

3.1.2 BERT

BERT stands for Bidirectional Encoder Representations from Transformers and is a natural language processing model developed by Google in 2018. BERT is a Transformer-based network for computing sentence embeddings or contextual word embeddings: when a sentence is split into tokens and fed into the network, it outputs a vector for the whole sentence and a vector corresponding to each word in the sentence. Tasks such as text classification can then be learned and performed on top of these vectors.
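The relationship between per-token vectors and a single sentence vector can be sketched with mean pooling, which is what the `mean-tokens` part of the `bert-base-nli-mean-tokens` model loaded later in this chapter refers to. The token vectors below are made-up numbers and `mean_pool` is a hypothetical helper, not a library function.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]

# Made-up 3-dimensional vectors for the tokens of a three-token sentence.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```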

  • Figure 3.1 Overall pre-training and fine-tuning procedures for BERT (Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)

For evaluating NLP task performance, there is a collection called the GLUE Benchmark (General Language Understanding Evaluation Benchmark), which ranks models by their performance across a variety of NLP tasks. On this benchmark, BERT outperformed other models such as OpenAI GPT by a wide margin and achieved the best performance at the time.

3.2 Downloading the Dataset

We load the dataset using the code from Chapter 2. The dataset consists of essay texts that have been corrected for grammar. These essays serve as the passages from which MC questions are created. We use the pickle package to read the data.
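As a minimal sketch of how pickle reads and writes such a file, the round trip below saves a small list and loads it back. The essay strings and the file name `essays_demo.pkl` are made up, and treating the real dataset as an indexable collection of strings is an assumption based on how it is used later.

```python
import os
import pickle
import tempfile

# Stand-in for the essay collection (assumption: indexable strings).
essays = ["First corrected essay.", "Second corrected essay."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "essays_demo.pkl")
    # Pickle files are written in binary mode ...
    with open(path, "wb") as f:
        pickle.dump(essays, f)
    # ... and read back in binary mode; the context manager closes the file.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(len(loaded), loaded[0])
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.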

import pickle

!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils
!python Tutorial-Book-Utils/PL_data_loader.py --data NLP-QG
file_name = "CoNLL+BEA_corrected_essays.pkl"
# the context manager closes the file once the data is loaded
with open(file_name, "rb") as open_file:
    data = pickle.load(open_file)
!pip install -q benepar
!pip install -q sentence_transformers

import requests
import json
import benepar
import string
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize
from string import punctuation
import re
from random import shuffle
import spacy
import warnings
import torch
import pandas as pd
import numpy as np
import scipy

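# spaCy 3 removed the 'en' shortcut, so the full model name is used here
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# resources needed by sent_tokenize and the parser
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
benepar.download('benepar_en3')

benepar_parser = benepar.Parser("benepar_en3")
device = 'cuda' if torch.cuda.is_available() else 'cpu'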
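def preprocess(sentences):
    # filter out sentences that contain quoted spans or question marks
    output = []
    for sent in sentences:
        single_quotes_present = len(re.findall(r"['][\w\s.:;,!?\\-]+[']",sent))>0
        double_quotes_present = len(re.findall(r'["][\w\s.:;,!?\\-]+["]',sent))>0
        question_present = "?" in sent
        if single_quotes_present or double_quotes_present or question_present :
            # the body of this branch is missing in the source; skipping the
            # sentence and keeping the rest (punctuation stripped) is assumed
            continue
        output.append(sent.strip(punctuation))
    return output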
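The filtering idea can be tried in isolation with the sketch below. It reuses the two regular expressions from `preprocess`; dropping the matching sentences and stripping punctuation from the rest is an assumption about the intended behaviour, and `filter_sentences` and the sample sentences are hypothetical.

```python
import re
from string import punctuation

def filter_sentences(sentences):
    """Keep sentences without quoted spans or question marks (assumed behaviour)."""
    kept = []
    for sent in sentences:
        has_single = re.search(r"['][\w\s.:;,!?\\-]+[']", sent) is not None
        has_double = re.search(r'["][\w\s.:;,!?\\-]+["]', sent) is not None
        if has_single or has_double or "?" in sent:
            continue
        kept.append(sent.strip(punctuation))
    return kept

examples = [
    'He said "stay calm" and left.',
    "Is this a question?",
    "The microwave heats food.",
]
print(filter_sentences(examples))  # ['The microwave heats food']
```

Note that the single-quote pattern also fires on sentences with two apostrophes, such as "It's Bob's car.", so those are filtered out as well.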
def get_flattened(t):
    sent_str_final = None
    if t is not None:
        sent_str = [" ".join(x.leaves()) for x in list(t)]
        sent_str_final = [" ".join(sent_str)]
        sent_str_final = sent_str_final[0]
    return sent_str_final
def get_termination_portion(main_string,sub_string):
    combined_sub_string = sub_string.replace(" ","")
    main_string_list = main_string.split()
    last_index = len(main_string_list)
    for i in range(last_index):
        check_string_list = main_string_list[i:]
        check_string = "".join(check_string_list)
        check_string = check_string.replace(" ","")
        if check_string == combined_sub_string:
            return " ".join(main_string_list[:i])
    return None
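A small usage sketch may help here: given a full sentence and a phrase that ends it, `get_termination_portion` returns the words that come before the phrase. The function is repeated (lightly condensed) so the example runs on its own, and the sample sentence is made up.

```python
def get_termination_portion(main_string, sub_string):
    """Return the part of main_string that precedes a trailing sub_string."""
    combined_sub_string = sub_string.replace(" ", "")
    main_string_list = main_string.split()
    for i in range(len(main_string_list)):
        # Compare with all spaces removed so tokenization differences do not matter.
        check_string = "".join(main_string_list[i:])
        if check_string == combined_sub_string:
            return " ".join(main_string_list[:i])
    return None

sentence = "The old man ate an apple"
print(get_termination_portion(sentence, "ate an apple"))  # The old man
print(get_termination_portion(sentence, "a red apple"))   # None
```

The returned prefix ("The old man") is the partial sentence that is later handed to GPT-2 to be completed differently.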
def get_right_most_VP_or_NP(parse_tree,last_NP = None,last_VP = None):
    if len(parse_tree.leaves()) == 1:
        return get_flattened(last_NP),get_flattened(last_VP)
    last_subtree = parse_tree[-1]
    if last_subtree.label() == "NP":
        last_NP = last_subtree
    elif last_subtree.label() == "VP":
        last_VP = last_subtree
    return get_right_most_VP_or_NP(last_subtree,last_NP,last_VP)
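The recursion above can be traced on a hand-built tree. The sketch below uses a tiny hypothetical `Node` class in place of the parser's tree objects (it only mimics `label()`, `leaves()` and access to the last child), and `right_most_np_vp` is a simplified restatement of the same walk, not the tutorial's function.

```python
class Node:
    """Tiny stand-in for a parse-tree node (hypothetical; not nltk.Tree)."""
    def __init__(self, label, children):
        self._label = label
        self.children = children  # list of Node objects or plain words

    def label(self):
        return self._label

    def leaves(self):
        words = []
        for child in self.children:
            words.extend(child.leaves() if isinstance(child, Node) else [child])
        return words

def flatten(node):
    return " ".join(node.leaves()) if node is not None else None

def right_most_np_vp(node, last_np=None, last_vp=None):
    """Follow the last child downwards, remembering the latest NP and VP seen."""
    if len(node.leaves()) == 1:
        return flatten(last_np), flatten(last_vp)
    last_child = node.children[-1]
    if last_child.label() == "NP":
        last_np = last_child
    elif last_child.label() == "VP":
        last_vp = last_child
    return right_most_np_vp(last_child, last_np, last_vp)

# (S (NP (DT The) (NN man)) (VP (VBD ate) (NP (DT an) (NN apple))))
tree = Node("S", [
    Node("NP", [Node("DT", ["The"]), Node("NN", ["man"])]),
    Node("VP", [Node("VBD", ["ate"]),
                Node("NP", [Node("DT", ["an"]), Node("NN", ["apple"])])]),
])
print(right_most_np_vp(tree))  # ('an apple', 'ate an apple')
```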

def get_sentence_completions(key_sentences):
    sentence_completion_dict = {}
    for individual_sentence in key_sentences:
        sentence = individual_sentence.rstrip('?:!.,;')
        tree = benepar_parser.parse(sentence)
        last_nounphrase, last_verbphrase =  get_right_most_VP_or_NP(tree)
        phrases= []
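        if last_verbphrase is not None:
            verbphrase_string = get_termination_portion(sentence,last_verbphrase)
            if verbphrase_string is not None:
                # branch body missing in the source; collecting the prefix is assumed
                phrases.append(verbphrase_string)
        if last_nounphrase is not None:
            nounphrase_string = get_termination_portion(sentence,last_nounphrase)
            if nounphrase_string is not None:
                phrases.append(nounphrase_string)  # assumed, as above
        longest_phrase =  sorted(phrases, key=len, reverse=True)
        if len(longest_phrase) == 2:
            first_sent_len = len(longest_phrase[0].split())
            second_sentence_len = len(longest_phrase[1].split())
            # drop the shorter prefix when it is more than 4 words shorter
            if (first_sent_len - second_sentence_len) > 4:
                del longest_phrase[1]
        if len(longest_phrase)>0:
            # assumed: map the sentence to its partial-sentence prefixes
            sentence_completion_dict[sentence]=longest_phrase

    return sentence_completion_dict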
def sort_by_similarity(original_sentence, generated_sentences_list):

    sentence_embeddings = bert_model.encode(generated_sentences_list)

    queries = [original_sentence]

    query_embeddings = bert_model.encode(queries)

    number_top_matches = len(generated_sentences_list)

    dissimilar_sentences = []

    for query, query_embedding in zip(queries, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]

        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])

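        for idx, distance in reversed(results[0:number_top_matches]):
            score = 1-distance  # cosine similarity = 1 - cosine distance
            if score < 0.99:
                # branch body missing in the source; keeping candidates that are
                # not near-duplicates of the original sentence is assumed
                dissimilar_sentences.append(generated_sentences_list[idx].strip())
    # prefer shorter sentences and return at most two of them
    sorted_dissimilar_sentences = sorted(dissimilar_sentences, key=len)
    return sorted_dissimilar_sentences[:2]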
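SciPy's `"cosine"` metric returns a cosine distance, which is why the code converts it with `score = 1 - distance` before comparing against 0.99. The sketch below shows the same thresholding with plain Python lists; the 2-dimensional vectors are made up, and `cosine_similarity` and `keep_dissimilar` are hypothetical helpers rather than the tutorial's functions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def keep_dissimilar(query_vec, candidates, threshold=0.99):
    """Return candidate texts whose similarity to the query is below threshold."""
    return [text for text, vec in candidates
            if cosine_similarity(query_vec, vec) < threshold]

query = [1.0, 0.0]
candidates = [
    ("near duplicate", [2.0, 0.0]),  # same direction, similarity 1.0
    ("related", [1.0, 1.0]),         # 45 degrees apart
    ("unrelated", [0.0, 3.0]),       # orthogonal, similarity 0.0
]
print(keep_dissimilar(query, candidates))  # ['related', 'unrelated']
```

A generated sentence that is almost identical to the true sentence would make a poor wrong answer, which is the motivation for discarding scores of 0.99 and above.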

def generate_sentences(partial_sentence,full_sentence):
    input_ids = gpt2_tokenizer.encode(partial_sentence, return_tensors='pt') # use tokenizer to encode
    input_ids = input_ids.to(device)
    maximum_length = len(partial_sentence.split())+80 

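    # Only `repetition_penalty` survives in the source; the other arguments
    # are assumptions chosen so that several candidates are sampled.
    sample_outputs = gpt2_model.generate(
        input_ids,
        do_sample=True,              # assumed
        max_length=maximum_length,   # assumed use of the value computed above
        num_return_sequences=10,     # assumed
        repetition_penalty  = 10.0,
    )
    generated_sentences = []
    for i, sample_output in enumerate(sample_outputs):
        decoded_sentences = gpt2_tokenizer.decode(sample_output, skip_special_tokens=True)
        decoded_sentences_list = tokenize.sent_tokenize(decoded_sentences)
        generated_sentences.append(decoded_sentences_list[0]) # takes the first sentence
    # sort_by_similarity returns at most two sentences
    top_sentences = sort_by_similarity(full_sentence, generated_sentences)
    return top_sentences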
## load models
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

summarize_tokenizer = AutoTokenizer.from_pretrained("t5-small")
paraphrase_tokenizer = AutoTokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# AutoModelWithLMHead is deprecated, so the task-specific auto classes are used;
# each model is moved to the same device as its inputs
summarize_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").to(device)
paraphrase_model = AutoModelForSeq2SeqLM.from_pretrained("Vamsi/T5_Paraphrase_Paws").to(device)
# add the EOS token as PAD token to avoid warnings
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=gpt2_tokenizer.eos_token_id).to(device)


from sentence_transformers import SentenceTransformer
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')

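# text columns start as empty strings so that sentences can be assigned later
# without writing strings into float columns
df_TFQuestions = pd.DataFrame({'id': np.zeros(20),
                               'passage': [''] * 20,
                               'distractor_1': [''] * 20,
                               'distractor_2': [''] * 20,
                               'distractor_3': [''] * 20,
                               'distractor_4': [''] * 20})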

A: raw
A’: paraphrased
A_False: false

  • distractor_1: A’

  • distractor_2: A’_False

  • distractor_3: B’_False

  • distractor_4: C’_False

## main.py

import random

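# the list is cut off after 163 in the source; the remaining IDs are taken
# from the `id` column of the results table
passage_id_list = [163, 28, 62, 57, 35, 26, 22, 151, 108, 55,
                   59, 129, 167, 143, 50, 161, 107, 56, 114, 71]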

for id_idx in range(20):

    # select passage for question generation 
    passage_id = passage_id_list[id_idx]

    passage = data[passage_id]

    # summarize
    # truncation=True makes the 512-token limit explicit for long passages
    inputs = summarize_tokenizer.encode("summarize: " + passage, return_tensors="pt", max_length=512, truncation=True)

    inputs = inputs.to(device)

    outputs = summarize_model.generate(inputs, max_length=300, min_length=100, length_penalty=2.0, num_beams=4, early_stopping=True)

    extractedSentences = summarize_tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    tokenized_sentences = nltk.tokenize.sent_tokenize(extractedSentences)

    filter_quotes_and_questions = preprocess(tokenized_sentences)

    # paraphrase

    paraphrased_sentences = []

    for summary_idx in range(len(filter_quotes_and_questions)):

        sentence = filter_quotes_and_questions[summary_idx]

        inputs = "paraphrase: " + sentence + " </s>"

        encoding = paraphrase_tokenizer.encode_plus(inputs, pad_to_max_length=True, return_tensors="pt")
        input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

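        # Only the first argument line survives in the source; the settings
        # below are assumptions. At least 3 returned sequences are needed
        # because outputs[1] and outputs[2] are read further down.
        outputs = paraphrase_model.generate(
            input_ids=input_ids, attention_mask=attention_masks,
            max_length=256,           # assumed
            do_sample=True,           # assumed
            num_return_sequences=3,   # assumed
        )

        paraphrased_sentences.append(paraphrase_tokenizer.decode(outputs[0], skip_special_tokens=True,clean_up_tokenization_spaces=True))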

        if len(paraphrased_sentences) == 3:
            print('3 filled')

        # last summary sentence but still fewer than three paraphrases:
        # reuse the other paraphrases returned for it to generate false sentences
        if summary_idx == (len(filter_quotes_and_questions) - 1) and len(paraphrased_sentences) < 3:
            for paraphrase_idx in range(1, 3):
                paraphrased_sentences.append(paraphrase_tokenizer.decode(outputs[paraphrase_idx], skip_special_tokens=True,clean_up_tokenization_spaces=True))

    sent_completion_dict = get_sentence_completions(paraphrased_sentences)

    df_TFQuestions.loc[id_idx, 'id'] = passage_id
    df_TFQuestions.loc[id_idx, 'passage'] = passage
    df_TFQuestions.loc[id_idx, 'distractor_1'] = list(sent_completion_dict.keys())[0]

    distractor_cnt = 2

    for key_sentence in sent_completion_dict:

        if distractor_cnt == 5:
            break  # assumed: all four distractor columns are filled

        partial_sentences = sent_completion_dict[key_sentence]
        false_sentences =[]
        false_sents = []
        for partial_sent in partial_sentences:
            # retry up to 10 times until GPT-2 yields a usable false sentence
            for repeat in range(10):
                false_sents = generate_sentences(partial_sent, key_sentence)
                if false_sents != []:
                    break  # assumed; the branch body is missing in the source
            false_sentences.extend(false_sents)  # assumed, as above
        df_TFQuestions.loc[id_idx, f'distractor_{distractor_cnt}'] = false_sentences[0]
        distractor_cnt += 1
    print(id_idx, 'complete')
 | id | passage | distractor_1 | distractor_2 | distractor_3 | distractor_4
0 | 163.0 | The waters of the culinary seas had been calm ... | The microwave is the source of life that most ... | The microwave is the source of life that most ... | It uses radiation to excite water particles in... | The microwave cut cooking time in half, making...
1 | 28.0 | The world is increasingly becoming flat with a... | Social network sites provide us with many conv... | Social network sites provide us with the tools... | We can know her recent news without hanging ou... | a piece of research shows that people will unc...
2 | 62.0 | The best places for y... | The report looks at the best places to visit f... | The report looks at the best places to visit f... | It is based on a survey of young people from t... | It is based on my own opinion as a permanent r...
3 | 57.0 | Puerquitour: A great experience for your mouth... | The place is Tacos La Chule and there are gour... | The place is Tacos La Chule and there are two ... | The name of the place is Tacos La Chule, and t... | The place is sooooo nice and the decoration an...
4 | 35.0 | Nowadays, social media sites are commonly used... | 80% of people use social media sites to connec... | 80% of people use social media sites to connec... | They consist of the function of a particular n... | but there are also disadvantages that occur du...
5 | 26.0 | Interpersonal skills, like any other skills re... | The growing use of social media has its benefi... | The growing use of social media has its benefi... | It is a good practice not to constantly add ne... | It is a good practice not to use social media."
6 | 22.0 | Nowadays, with the advancement of technology, ... | A known genetic risk should not be obligated t... | A known genetic risk should not be obligated t... | The government should set the law to protect t... | However, a carrier of a known genetic risk can...
7 | 151.0 | In this century there have been many technolog... | Television has brought other worlds into the l... | Television has brought other worlds into the l... | Television has the power to bring war into the... | In the minds of most Americans, television is ...
8 | 108.0 | I met a friend about one week ago, and he aske... | It's about a teen couple who are dying of canc... | It's about a teen couple who are dying of canc... | It's about a teen couple who are dying of canc... | Now, I have an awful feeling about what I am d...
9 | 55.0 | Dear Sir or Madam,\nI am writing to apply for ... | Camp counselor is currently advertised on your... | Camp counselor is currently advertised on the ... | At this moment, I have finished the second yea... | I am looking forward to hearing from you XYZ, ...
10 | 59.0 | Anna knew that it was going to be a very speci... | She knew that it was going to be a very specia... | She knew that it was going to be a very specia... | She had known that she had been adopted since ... | After her 18th birthday, she felt a sudden nee...
11 | 129.0 | On Britain's roads there is an ever-increasing... | The government has started adding a fourth lan... | The government has started adding a fourth lan... | There appears to be an endless series of roadw... | The inability to cope with the volume of traff...
12 | 167.0 | According the Lunde, 35% of homicide victims a... | 35% of the homicide victims are killed by some... | 35% of the homicide victims are killed by some... | Today racial prejudice still exists, but less ... | It still exists racial prejudice, but has been...
13 | 143.0 | "In Vitro fertilisation" is the fertilisation ... | In a test tube | In | The egg is taken from the mother and placed in... | There are people who are against this, saying ...
14 | 50.0 | Dear Mrs. Ashby, \n\nYesterday I was in Green ... | I am very interested in this work and believe ... | I am very interested in this work and believe ... | I worked a year in London as a waiter at Hard ... | I am also very good at dealing with people, I ...
15 | 161.0 | Computers have definitely affected peoples liv... | Computers have had a significant impact on peo... | Computers have had a significant impact on the... | Without the use of a computer, I have to balan... | I had to balance my checkbook once a month wit...
16 | 107.0 | Cricket is my passion. I love playing, watchin... | Cricket is a team sport, which teaches us team... | Cricket is a team sport, which teaches us team... | It also teaches us how to overcome individual ... | Cricket is going through a rough phase due to ...
17 | 56.0 | Well, I would like to talk about my school lif... | I'm a electronics student from Italy, North | I'm a electronics student from Michigan. | A chance to be a great engineer one day, so I ... | I am good at school, my marks prove it ; I hav...
18 | 114.0 | I have been learning English as a second langu... | My teachers thought it was better to learn in ... | My teachers thought it was better to learn in ... | I had decided to take the Cambridge Advanced E... | One year ago, I decided to take the Cambridge ...
19 | 71.0 | Glad to hear that you’ve been invited to att... | You've been invited to the last round of inter... | You've been invited to the last round of inter... | Here are some tips on how to make sure that yo... | First, the state's top elected officials are i...
df_TFQuestions.to_csv('TFQuestions.csv', index=False)
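One way to turn a saved row into a question is sketched below. It assumes, following the column legend, that `distractor_1` holds the paraphrased true sentence (A’) and that the other three columns hold false sentences; `build_question` and the sample row are hypothetical and not part of the tutorial code.

```python
import random

def build_question(row, seed=None):
    """Shuffle the four option columns and record where the true sentence lands."""
    rng = random.Random(seed)
    correct = row["distractor_1"]  # assumed to be the true (paraphrased) sentence
    options = [row[f"distractor_{i}"] for i in range(1, 5)]
    rng.shuffle(options)
    return {"options": options, "answer_index": options.index(correct)}

# Hypothetical row with made-up sentences.
row = {
    "distractor_1": "The microwave heats food quickly.",
    "distractor_2": "The microwave heats food with sound waves.",
    "distractor_3": "It uses radiation to freeze water.",
    "distractor_4": "The microwave doubled cooking time.",
}
question = build_question(row, seed=0)
for letter, option in zip("ABCD", question["options"]):
    print(f"{letter}. {option}")
print("answer:", "ABCD"[question["answer_index"]])
```

Shuffling matters because the correct sentence would otherwise always appear in the first position.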