πŸ’» Hands-on: Apache Spark#

Checking partitions, transformations, actions, and more#

from pyspark import SparkContext, SparkConf
sc = SparkContext.getOrCreate()
sc

SparkContext (Spark UI)

Version: v3.4.1
Master: local[*]
AppName: pyspark-shell
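SparkConf is imported above but never used; as a minimal sketch, a context can also be configured explicitly instead of relying on defaults. The app name "partition-demo" is an arbitrary illustrative choice, and note that getOrCreate returns an already-running context if one exists, in which case the new conf is ignored.

conf = SparkConf().setAppName("partition-demo").setMaster("local[*]")
# If a SparkContext already exists (as it does here), getOrCreate returns it
# and this conf is ignored
sc = SparkContext.getOrCreate(conf)
print(sc.master, sc.appName)    # e.g. local[*] pyspark-shell (the existing shell context)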
data = [1, 2, 3, 4, 5]
# The number of partitions can be set explicitly; Spark normally chooses it automatically
distData = sc.parallelize(data, 10)     # run with 10 partitions
distData
ParallelCollectionRDD[3] at readRDDFromFile at PythonRDD.scala:287
res = distData.reduce(lambda a, b: a + b)
res
15
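To verify the partitioning, a quick sketch: getNumPartitions() reports the partition count, and glom() is a transformation that wraps each partition's elements in a list, so collect() (an action) reveals how the data was split.

distData.getNumPartitions()     # 10, as requested above
# glom() is a transformation; collect() is the action that actually runs the job
distData.glom().collect()       # e.g. [[], [1], [], [2], [], [3], [], [4], [], [5]]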

Loading external datasets#

  • 둜컬 파일, HDFS, S3 λ“± ν•˜λ‘‘μ΄ μ§€μ›ν•˜λŠ” μŠ€ν† λ¦¬μ§€λ‘œλΆ€ν„° λΆ„μ‚° 데이터셋을 뢈러올 수 μžˆμŠ΅λ‹ˆλ‹€.

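The paths below are placeholders sketching the supported URI schemes, not files that exist in this project:

# Illustrative placeholders only; none of these paths are real
local_rdd = sc.textFile("file:///tmp/printed.txt")          # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/logs")    # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/data/logs")           # S3 via the s3a connector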
# lines is not loaded into memory here; it is only a pointer to the file
lines = sc.textFile('./printed.txt')

# Applying the map transformation yields a new RDD that is not yet computed (lazy)
lineLengths = lines.map(lambda l: len(l)) # PythonRDD[16] at RDD at PythonRDD.scala:53

# The reduce action triggers the job, computing across partitions in parallel
# Only the final result is returned to the driver program
totalLength = lineLengths.reduce(lambda a, b: a + b)
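If lineLengths were needed again later, it could be persisted before the first action so the map is not recomputed on every pass, as the Spark programming guide suggests; a minimal sketch:

from pyspark import StorageLevel

# Keep the mapped RDD in memory after the first action computes it
lineLengths.persist(StorageLevel.MEMORY_ONLY)
totalLength = lineLengths.reduce(lambda a, b: a + b)
maxLength = lineLengths.max()   # reuses the cached partitions instead of re-reading the file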

Deep learning#

A partial example of image classification via transfer learning with the sparkdl (Deep Learning Pipelines) package: InceptionV3 featurizes the images, and a logistic regression model is trained on those features.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])

model = p.fit(train_images_df)    # train_images_df is a dataset of images and labels

# Inspect training error
df = model.transform(train_images_df.limit(10)).select("image", "probability", "prediction", "uri", "label")
predictionAndLabels = df.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Training set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))