Download the starter notebook for this exam here: mid2026.ipynb

If your browser shows the file inline, you can right-click the displayed file and choose "Save As" to save a copy of this ipynb file. Important: please save it under the new name mid2026-your-name.ipynb before uploading it to Databricks.

Use the file upload box below to submit your answer (an HTML export of your Databricks notebook).

# 2026 MSBA 6331 Midterm Exam, Part 2 (40 points)

Name __________________________________________

## Read these instructions first!

- Download mid2026.ipynb and RENAME it to mid2026-your-name.ipynb (replace "your-name" with your actual name) before uploading it to Databricks.
- Answer the questions in mid2026-your-name.ipynb. The notebook has no question text on purpose – we try to minimize the risk of exam leakage. You must enter your answers in the right cells, as identified by the section and question numbers.
- Please ensure that your steps are idempotent — i.e., they are safe to re-run and generate the same result if you run them again.
- If you get stuck on one question, move on to the other questions and come back to it later (there are always ways of bypassing a step).
- Do not leave questions unanswered. Provide an answer even if you cannot test it, and provide partial answers if you don't know all of it. Each step counts!
- At the end of the exam, export your notebook as HTML and upload it using the File Upload box of the midterm exam assignment.

## A. Using Spark SQL to explore the Retail Dataset (20 points)

This synthetic retail dataset is located at /databricks-datasets/retail-org/. It contains a collection of datasets (folders) representing different dimensions and facts for a retail organization. Among these:

- Sales Orders (under sales_orders): records the customers' originating purchase orders.
- Products (under products): contains the products that the company sells.
- Active Promotions (under active_promotions): shows how customers are progressing toward becoming eligible for promotions.
As the first step, we want to inspect the data file format so that we know how to load it.

### 1. Set up the volume

Set up a volume midterm under the schema spark of the catalog msbabigdata.

### 2. Display the files in /databricks-datasets/retail-org/products

### 3. Display the first 2000 bytes of products.csv

This helps us determine how to load the csv file.

### 4. Create a new dataframe products using the file(s) in products

Then, verify the dataframe by displaying the first 10 rows.

### 5. Save products as a Spark SQL table named products in the midterm database

You need to create the midterm database first. After the table is created, verify it by selecting the first 10 rows from the table.

Next, we will inspect and load active_promotions.

### 6. Display the files in /databricks-datasets/retail-org/active_promotions

This helps you determine how to load this data source.

### 7. Create a new dataframe promotions using the file(s) in active_promotions

Verify the dataframe by displaying the first 10 rows.

### 8. Join the two dataframes products and promotions

Note that products.product_id corresponds to promotions.promo_item. Use the joined dataframes to calculate the promotion quantity for each product category (label: promo_quantity) and the total number of promotion sales line items for the category (label: num_entries). Order the results by descending promo_quantity.

### 9. Save the joined dataframe as CSV files under the midterm volume, in the subfolder category_promotions

The CSV files need to have a header row. You may verify the result visually by opening it in the Unity Catalog.

## B. Using Spark MLlib to Predict Churn (20 points)

In this task, we ask you to train a LogisticRegression model with Spark's MLlib (pyspark.ml) and test it on the test data. The data we use is a telecom customer churn dataset. It consists of customer activity data (features), along with a churn label specifying whether the customer canceled their wireless service subscription.
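As background, logistic regression models the churn probability as the sigmoid of a weighted sum of the features. A minimal pure-Python sketch of that idea, with made-up feature names and weights (this is an illustration of the math, not the pyspark.ml API or the exam solution):

```python
import math

def churn_probability(features, weights, bias):
    """Logistic regression: P(churn) = sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy example with invented numbers: two hypothetical features,
# e.g. total day minutes and number of customer service calls.
p = churn_probability([180.0, 4.0], [0.002, 0.35], -1.2)
print(round(p, 3))
```

In Spark, the same computation happens inside the fitted LogisticRegression model; you never compute it by hand, but it explains why each prediction comes with a probability between 0 and 1.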
We provide you with two datasets: the larger set (train.csv) for model training, and the smaller set (test.csv) for testing the model's performance on unseen data.

### 10. Download the data files and upload them to the midterm volume

Download train.csv and test.csv from https://idsdl.csom.umn.edu/c/share/train.csv and https://idsdl.csom.umn.edu/c/share/test.csv, respectively. Upload them to the midterm volume in the Unity Catalog. Verify the uploads by displaying the files in this volume.

### 11. Load train.csv and test.csv into dataframes traindata and testdata, respectively

Then verify by displaying 10 rows from each DataFrame.

### 12. Create a predictive modeling pipeline

The pipeline should do the following:

- Use a feature transformer to rename the columns, replacing spaces with underscores. For example, Voice mail plan should be renamed to Voice_mail_plan. If you use Spark SQL, you need to quote special column names with backticks (`` ` ``).
- The last stage of the pipeline should be a crossvalidator that:
  - tunes a logistic regression model that predicts churn; all other columns should be used as features (LogisticRegression can be found in pyspark.ml.classification);
  - explores a parameter grid consisting of three different values for the logistic regression's regParam: 0.0, 0.01, 0.02;
  - uses 3-fold cross-validation;
  - uses the area under the ROC curve (via BinaryClassificationEvaluator) as the evaluation criterion.

After building the pipeline, train it on traindata to obtain a pipeline model.

Although we do not require you to test each component of your pipeline, you may want to do that to make debugging easier if there is any problem.

You may not need to set up the caching directory for the crossvalidator, but just in case, please run the following:

```python
import os

# Define your UC Volume path
# Format: /Volumes/<catalog>/<schema>/<volume>
uc_volume_path = "/Volumes/msbabigdata/spark/midterm"

# Set the environment variable for SparkML caching
os.environ["SPARKML_TEMP_DFS_PATH"] = uc_volume_path
```

### 13. Use the trained pipeline model to obtain predictions for the test dataset testdata

Then present the first 10 rows of the resulting DataFrame with the following fields:

- churn
- prediction

### 14. What is the area under the ROC curve for the test dataset?

### 15. Clean up

Drop all tables, databases, and volumes you have created in this exam.
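For question 14 above, Spark's BinaryClassificationEvaluator (with its default metricName, "areaUnderROC") reports the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small pure-Python sketch of that definition on toy data may help you sanity-check the number Spark returns (it is not a replacement for the Spark evaluator):

```python
def area_under_roc(labels, scores):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: a model that ranks both positives above both negatives
# gets a perfect AUC of 1.0.
print(area_under_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
```

An AUC of 0.5 corresponds to random guessing, which is a useful baseline when you interpret your test-set result.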