# CLOSED-LOOP CONTROL OF ADDITIVE MANUFACTURING VIA REINFORCEMENT LEARNING

**Anonymous authors**
Paper under double-blind review

## ABSTRACT

Additive manufacturing suffers from imperfections in hardware control and material consistency. As a result, the deposition of a wide range of materials requires on-the-fly adjustment of process parameters. Unfortunately, learning the in-process control is challenging. The deposition parameters are complex and highly coupled, artifacts occur after long time horizons, available simulators lack predictive power, and learning on hardware is intractable. In this work, we demonstrate the feasibility of learning a closed-loop control policy for additive manufacturing. To achieve this goal, we assume that the perception of a deposition device is limited and can capture the process only qualitatively. We leverage this assumption to formulate an efficient numerical model that explicitly includes printing imperfections. We further show that, in combination with reinforcement learning, our model can be used to discover control policies that outperform state-of-the-art controllers. Furthermore, the recovered policies have a minimal sim-to-real gap. We showcase this by implementing a single-layer self-correcting printer.

## 1 INTRODUCTION

A critical component of manufacturing is identifying process parameters that consistently produce high-quality structures. In commercial devices, this is typically achieved by expensive trial-and-error experimentation (Gao et al., 2015). To make such an optimization feasible, a critical assumption is made: there exists a set of parameters for which the relationship between process parameters and process outcome is predictable. However, such an assumption does not hold in practice because all manufacturing processes are stochastic in nature. Specifically in additive manufacturing, variability in both materials and intrinsic process parameters can cause geometric errors, leading to imprecision that can compromise the functional properties of the final prints. Therefore, a transition to closed-loop control is indispensable for the industrial adoption of additive manufacturing (Wang et al., 2020).

Recently, we have seen promising progress in learning policies for interaction with amorphous materials (Li et al., 2019b; Zhang et al., 2020). Unfortunately, in the context of additive manufacturing, discovering effective control strategies is significantly more challenging. The deposition parameters have a non-linear coupling to the dynamic material properties. To assess the severity of deposition errors, we need to observe the material over long time horizons. Available simulators either lack predictive power (Mozaffar et al., 2018) or are too complex for learning (Tang et al., 2018; Yan et al., 2018). Moreover, learning on hardware is intractable as we require tens of thousands of printed samples. These challenges are further exacerbated by the limited perception of printing hardware, where typically only a small in-situ view is available to assess the deposition quality.

In this work, we propose the first closed-loop controller for additive manufacturing based on reinforcement learning deployed on real hardware. To achieve this, we formulate a custom numerical model of the deposition process. Motivated by the limited hardware perception, we make a key assumption: to learn closed-loop control it is sufficient to model the deposition only qualitatively.
This allows us to replace physically accurate but prohibitively slow simulations with efficient approximations. To ameliorate the sim-to-real gap, we enhance the simulation with a data-driven noise distribution on the spread of the deposited material. We further show that careful selection of the input and action spaces is necessary for hardware transfer. Lastly, we leverage privileged information about the deposition process to formulate a reward function that encourages policies that account for material changes over long horizons. Thanks to the above advancements, our control policy can be trained exclusively in simulation with a minimal sim-to-real gap. We demonstrate that our policy outperforms baseline deposition methods in simulation and on physical hardware with low- and high-viscosity materials. Furthermore, our numerical model can serve as an essential building block for future research in optimal material deposition, and we plan to make the source code available.

## 2 RELATED WORK

To identify process parameters for additive manufacturing, it is important to understand the complex interaction between a material and a deposition process. This is typically done through trial-and-error experimentation (Kappes et al., 2018; Wang et al., 2018; Baturynska et al., 2018). Recently, optimal experiment design and, more specifically, Gaussian processes have become a tool for efficient use of the samples to understand the deposition problem (Erps et al., 2021). However, even though Gaussian processes model the deposition variance, they do not offer tools to adjust the deposition on-the-fly.

Another approach to improve the printing process is to design closed-loop controllers. One of the first designs was proposed by Sitthi-Amorn et al. (2015), who monitor each layer deposited by a printing process to compute an adjustment layer. Liu et al. (2017) built upon the idea and trained a discriminator that can identify the type and magnitude of observed defects. A similar approach was proposed by Yao et al. (2018), who use handcrafted features to identify when a print significantly drops in quality. The main disadvantage of these methods is that they rely on collecting the in-situ observations to propose one corrective step by adjusting the process parameters. However, this means that the prints continue with sub-optimal parameters, and it can take several layers to adjust the deposition. In contrast, our system runs in-process and reacts to the in-situ views immediately. This ensures high-quality deposition and adaptability to material changes.

Recently, machine learning techniques have sparked a new interest in the design of adaptive control policies (Mnih et al., 2015). A particularly successful approach for high-quality in-process control is to adopt the Model Predictive Control (MPC) paradigm (Gu et al., 2016; Silver et al., 2017; Oh et al., 2017; Srinivas et al., 2018; Nagabandi et al., 2018). The control scheme of MPC relies on an observation of the current state and a short-horizon prediction of the future states. By manipulating the process parameters, we observe the changes in future predictions and can pick a future with desirable characteristics. Particularly useful is to utilize deep models to generate differentiable predictors that provide derivatives with respect to control changes (de Avila Belbute-Peres et al., 2018; Schenck & Fox, 2018; Toussaint et al., 2018; Li et al., 2019a). However, addressing the uncertainties of the deposition process with MPC is challenging.
In a noisy environment, we can rely only on the expected prediction of the deposition. This leads to a conservative control policy that effectively executes the mean action. Moreover, reacting to material changes over time requires optimizing actions for long time horizons, which is a known weakness of the MPC paradigm (Garcia et al., 1989). As a result, MPC is not suitable for in-process control in noisy environments.

Another option to derive control policies is to leverage deep reinforcement learning (Rajeswaran et al., 2017; Liu & Hodgins, 2018; Peng et al., 2018; Yu et al., 2019; Lee et al., 2019; Akkaya et al., 2019). The key challenge in the design of such controllers is formulating an efficient numerical model that captures the governing physical phenomena. As a consequence, it is most commonly applied to rigid body dynamics and rigid robots where such models are readily available (Todorov et al., 2012; Bender et al., 2014; Coumans & Bai, 2016; Lee et al., 2018). In contrast, learning with non-rigid objects is significantly more challenging as the computation time for deformable materials is higher and relies on some prior knowledge of the task (Clegg et al., 2018; Elliott & Cakmak, 2018; Ma et al., 2018; Wu et al., 2019). Recently, Zhang et al. (2020) proposed a numerical model for training control policies where a rigid object interacts with amorphous materials. Similarly, in our work a rigid printing nozzle interacts with the fluid-like printing material. However, our model is specialized for the printing hardware and models not only the deposition but also its variance. We demonstrate that this is an important component in minimizing the sim-to-real gap and designing control policies that are readily applicable to the physical hardware.

## 3 HARDWARE PRELIMINARIES

The choice of additive manufacturing technology constrains the subsequent numerical modeling. To keep the applicability of our developed system as wide as possible, we opted for a direct write needle deposition system mounted on a 3-axis Cartesian robot (inset). The robot allows us to freely control the acceleration and position of the dispenser. The dispenser can process a wide range of viscous materials, and the deposition is very similar to fused deposition modeling. We further enhance the apparatus with two camera modules. The cameras lie on opposite sides of the nozzle to allow our apparatus to perceive the region around the deposition. It is this locality of the in-situ view that we will leverage to formulate our numerical model.

### 3.1 BASELINE CONTROLLER

To control the printing apparatus, we employ a baseline slicer. The input to the slicer is a three-dimensional object. The output is a series of locations the printing head visits to reproduce the model as closely as possible. To generate a single slice of the object, we start by intersecting the 3D model with a Z-axis aligned plane (please note that this does not affect the generalizability since the input can be arbitrarily rotated). The slice is represented by a polygon that marks the outline of the printout (Figure 1, gray). To generate the printing path, we assume a constant width of deposition (Figure 1, red) that acts as a convolution on the printing path. The printing path (Figure 1, blue) is created by offsetting the print boundary by half the width of the material using the Clipper algorithm (Johnson, 2015). The infill pattern is generated by tracing a zig-zag line through the area of the print (Figure 1, green).
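The listing below is a minimal sketch of this baseline path generation, assuming the shapely library in place of the Clipper implementation cited above; the slice polygon, deposition width, and infill spacing are illustrative inputs, and handling of offsets that split into multiple parts is omitted.

```python
# Sketch of baseline slicing: inward offset for the outline path and a
# zig-zag scanline infill (assumes shapely; inputs are illustrative).
import numpy as np
from shapely.geometry import Polygon, LineString

def baseline_paths(slice_polygon: Polygon, width: float, infill_spacing: float):
    """Return (outline path, infill segments) for one slice."""
    # Outline path: offset the print boundary inward by half the material width
    # (assumes the offset region remains a single polygon).
    inner = slice_polygon.buffer(-width / 2.0)
    outline_path = list(inner.exterior.coords)

    # Infill: horizontal scanlines clipped to the inner region, traced as a zig-zag.
    minx, miny, maxx, maxy = inner.bounds
    segments, flip = [], False
    for y in np.arange(miny + infill_spacing, maxy, infill_spacing):
        scan = LineString([(minx, y), (maxx, y)]).intersection(inner)
        if scan.is_empty:
            continue
        pieces = [scan] if scan.geom_type == "LineString" else list(scan.geoms)
        coords = [c for piece in pieces for c in piece.coords]
        segments.append(coords[::-1] if flip else coords)  # alternate direction
        flip = not flip
    return outline_path, segments
```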
Figure 1: Baseline slicer (material width, outline path, infill path, target).

## 4 REINFORCEMENT LEARNING FOR ADDITIVE MANUFACTURING

The baseline control strictly relies on a constant width of the material. To discover policies that can adapt to the in-situ observations, we formulate the search in a reinforcement learning framework. The control problem is described by a Markov decision process (S, A, P, R), where S is the observation space, A is a continuous action space, P = P(s′|s, a) is the transition function, and R : S × A → ℝ is the reward function. To learn a control policy we take a model-free approach by learning directly from printing. Unfortunately, learning on a physical device is challenging. The interaction between various process parameters can lead to deposition errors that require manual attention. As such, discovering control policies directly on the hardware has too steep a sample complexity to be practical. A potential solution is to learn the control behavior in simulation and transfer it to the physical device. However, transfer from simulation to the real world is a notoriously hard problem that hinges on the applicability of the learned knowledge. In this work, we propose a framework for preparing numerical models for additive manufacturing that facilitate the sim-to-real transfer. Our model has three key components that facilitate the generalization of the learned control policies.

The first component is the design of the observation space. To facilitate the transfer of learning between simulation and a physical device, we rely on an abstraction of the observation space (Kaufmann et al., 2020). Rather than using the direct appearance feed from our camera module, we process the signal into a heightmap. A heightmap is a 2D image where each pixel stores the height of the deposited material. For each heightmap location, the height is measured as the distance from the building plate to the deposited material. This allows our system to generalize to many different sensors such as cameras, depth sensors, or laser profilometers. However, unlike Kaufmann et al. (2020), we do not extract the feature vectors manually. Instead, similarly to OpenAI et al. (2018), we learn the features directly from the heightmap. In contrast to OpenAI et al. (2018), we do not randomize the observation domain. Additional randomization is not necessary in our case thanks to the controlled observation conditions of the physical apparatus. A key insight of our approach is that the engineered observation space coupled with learned features can significantly help with policy learning. A careful design of the observation space can facilitate the sim-to-real transfer, make the hardware design more flexible by enabling the use of a range of sensors that compute similar observations, and remove the need to hand-craft the features. It is therefore worthwhile to invest in the design of observation spaces.

The second component of our system is the design of the action space. Instead of directly controlling the motors of the printer, we rely on a high-level control scheme and tune coupled parameters such as velocity or offset from the printing path. This idea is similar in spirit to OpenAI et al. (2018), who suggest not using direct sensory inputs from the mechanical hand as observations due to their noisiness and lack of generalization across environments. Instead, they use image data to track the robotic hand.
Similarly, but in the action space, we do not control the printer by directly inputting the typically noisy and hardware-specific voltages that actuate the motors of the apparatus. Instead, we control the printer by setting the desired velocity and offset and letting the apparatus match them to the best of its capabilities. This translation layer allows us to utilize the controller on a broader range of devices without per-device training. This idea could also be generalized to other robotic tasks, for example, by applying a hierarchical divide-and-conquer approach to the action space. The control policies could output only high-level actions such as desired locations for robot actuators or deviations from a baseline behavior. Low-level controllers could then execute these higher-level actions. Such a control hierarchy can facilitate training by decoupling the higher-level goals from low-level inputs and transferring existing control policies to new devices through specialized low-level controllers.

The third and last component of our system is an approximate transition function. Rather than modeling the deposition process exactly, we propose to approximate it qualitatively. A qualitative approximation allows us to design an efficient computational model. To facilitate the transfer of the simulated model to the physical device, we reintroduce the device uncertainty in a data-driven fashion. This is similar to OpenAI et al. (2018), but instead of covering a large array of options, we specialize the randomization. Inspired by Chebotar et al. (2019), we designed a data-driven LPC filter that matches the statistical distribution of variations observed during a typical printing process. This noise enables our control policies to adapt to changing environments and, to some extent, to changes in material properties such as viscosity. Our approximate transition function shows that it is not necessary to reproduce the physical world in simulation perfectly. A qualitative approximation is sufficient as long as we learn behavior patterns that translate to real-world experiences. This is an important observation for any task where we manipulate objects and elastic or frictional forces dominate the behavior. Relying on computationally more affordable simulations allows for applying existing learning algorithms to a broader range of problems where precise numerical modeling has prohibitive computational complexity. Moreover, by leveraging a numerical model it is possible to utilize privileged information that would be challenging, if not impossible, to collect in the real world. For a full description of our methods please see Appendix A.

## 5 RESULTS

In this section, we provide results obtained in both virtual and physical environments. We first show that an adaptive policy can outperform baseline approaches in environments with constant deposition. Next, we showcase the in-process monitoring and the ability of our policy to adapt to dynamic environments. Finally, we demonstrate our learned controllers transferring to the physical world with a minimal sim-to-real gap.

### 5.1 COMPARISON WITH BASELINE CONTROLLER

We evaluate the optimized control scheme on a selection of freeform and CAD models sampled from the Thingi10k (Zhou & Jacobson, 2016) and ABC (Koch et al., 2019) datasets (Appendix A.6). In total, we have 113 unseen slices corresponding to 96 unseen geometries. We report our findings in Figure 2. For each input slice, we report improvement on the printed boundary as the average offset.
The average offset is defined as a sum of areas of under- and over-deposited material normalized by the outline length. More specifically, given an image of the target slice T, printed canvas C, a weight mask W, and the length of the outline l, the average offset O is computed as:

$$O = \frac{\sum (1 - C)\,T\,W + \sum C\,(1 - T)}{l}. \qquad (1)$$

The improvement is calculated as the difference between the baseline and our policy. Therefore, a value higher than zero indicates that our control policy outperformed the baseline. As we can see, our policy achieved better performance on all considered models.

Figure 2: The relative improvement of our policy over baseline (average offset improvement in microns across validation slices).

Next, we investigate the shapes where our control policy achieves the highest and the lowest gain, respectively (Figure 3). The best performance is achieved in smooth regions. The reason is that our policy is capable of adjusting the printing parameters based on the curvature, while the baseline's constant speed is more suitable for a limited range of curvatures. Conversely, our policy achieves the weakest performance on objects with sharp features. This is natural as the width of the deposited material in sharp regions is too large for the desired feature scale, leading to over-deposition. If such thin features need to be printed regularly, a thinner material nozzle can alleviate this issue.

Figure 3: Representative deposited patterns from the evaluation dataset (highest and lowest gain).

Finally, we compare our control policy with a fine-tuned baseline. The baseline controller uses the same parameters for each slice. Different process parameters may be optimal for different slices. To this end, we choose two slices, a freeform slice of a bird and a CAD slice of a bolt, and optimize their process parameters using Bayesian optimization, Figure 26 (for numerical details see Appendix B). We can observe that the two control schemes require drastically different velocities (1.46 SU/s vs. 0.89 SU/s) to maximize performance. Moreover, we can see that the control parameters are not interchangeable, Appendix B. When switching the control parameters, we can observe a loss in performance. This loss is caused by each set of control parameters exploiting the local properties of the slice used for training. Lastly, we compare the individually optimized control parameters with our policy. Our policy improves upon both baseline solutions while maintaining generalizability. This is possible because our control policy relies on live feedback to adjust the printing parameters on-the-fly.

#### 5.1.1 ABLATION STUDY ON OBSERVATION SPACE

Our control policy relies on a live view of the deposition system to select the control parameters. However, the in-situ view is a technologically challenging addition to the printer hardware that requires a carefully calibrated imaging setup. With this ablation study, we verify how important the individual observations are to the final print quality. We consider three cases: (1) no printing bed view, (2) no target view, and (3) no future path view. We analyzed the results from the pre-test (full observation space µ = 9.7, σ = 4.9) and the post-tests (no canvas µ = 8.8, σ = 5.7, no target µ = 7.2, σ = 5.5, no path µ = 8.4, σ = 4.8) printing task using paired t-tests with Holm-Bonferroni correction.
The analysis indicates that the availability of all three inputs, the printing bed, the target, and the path, improved final printouts (P < 0.01 for all three cases).

### 5.2 PERFORMANCE IN DYNAMIC ENVIRONMENTS

We use an identical random pressure variation profile to perform a quantitative evaluation in environments with varying pressure. We use the same evaluation dataset as for constant-pressure policies and report the overall improvement over the baseline controller (Figure 4). We can observe that in each of the considered slices, our closed-loop controller outperformed the baseline.

Figure 4: The relative improvement of our policy over baseline across validation slices.

We have also evaluated the infill policy in a noisy environment (Figure 5). We can observe that the deposition noise leads to an accumulation of material. The accumulation eventually results in a bulge of material in the center of the print, complicating the deposition of subsequent layers as the material would tend to slide off. In contrast, our policy dynamically adjusts the printing path to generate a print with significantly better height uniformity. As we can observe, the surface generated by our policy is almost flat and ready for deposition of potentially more layers.

Figure 5: Infill comparison (baseline infill with over-deposition vs. our control policy infill; color encodes deposition height).

### 5.3 ABLATION STUDY ON VISCOSITY

To verify that our policy can adapt to various materials, we trained three models of varying viscosity (Figure 6). We can observe that, without an adaptive control scheme, the pressure changes are sufficiently strong to cause local over- or under-deposition. Our trained policy dynamically adjusts the offset and velocity to counterbalance the changes in the deposition. We can see that our policy is particularly good at handling smooth width changes and quickly recovers from a spike in printing width.

Figure 6: Performance of our policy and baseline with varying viscosity.

#### 5.3.1 ABLATION STUDY ON ACTION SPACE

To evaluate the need to tweak both the printing velocity and the printing path, we trained two control policies with a limited action set that can either alter the velocity or the path offset. We analyzed the results from the pre-test (full action space µ = 12.7, σ = 5.7) and the post-tests (velocity µ = 7.5, σ = 2.5, displacement µ = 5.6, σ = 8.3) printing task using paired t-tests with Holm-Bonferroni correction. The analysis indicates that the availability of the full action space resulted in an improvement in final printouts (P < 0.001 for both cases). The difference in performance depends on the inherent limitations of the individual actions. On the one hand, adjusting velocity is fast (under 6.6 milliseconds) but can cope only with moderate changes in material width. This can be observed as the larger bulges of over-deposited material in Figure 7, middle. On the other hand, offset can cope with larger material differences, but it needs between 0.13 and 1.3 seconds to adjust. As a result, offset adjustment cannot cope with sudden material changes (Figure 7, right). In contrast, by utilizing the full action space our policy can combine the advantages of the individual actions and minimize over-deposition (Figure 7, left).

Figure 7: Action space ablation study (full action space, velocity only, displacement only).
#### 5.3.2 ABLATION STUDY ON REWARD FUNCTION

Our reward function uses privileged information from the numerical simulation to evaluate how material settles over time. However, such information is not readily available on physical hardware. One either evaluates the reward once at the end of each episode to include material flow, or at each timestep by disregarding long-term material motion. We evaluated how such changes to the reward function would affect our control policies. We analyzed the results from the pre-test (privileged reward µ = 12.7, σ = 5.7) and the post-tests (delayed reward µ = −22.3, σ = 8.6, immediate reward µ = 9.2, σ = 8.0) printing task using paired t-tests with Holm-Bonferroni correction. The analysis indicates that the availability of the privileged information resulted in an improvement in final patterns (P < 0.001 for both cases). The learning process for a delayed reward is significantly slower, and it is unclear if performance similar to our policy can be achieved, Appendix A.6. On the other hand, the immediate reward policy learns faster but cannot handle material changes over longer time horizons (Figure 8).

Figure 8: Reward function ablation study (privileged, delayed, and immediate reward).

### 5.4 PERFORMANCE ON PHYSICAL HARDWARE

Finally, we evaluate our control policies on physical hardware. The policies were trained entirely in simulation without any additional fine-tuning on the printing device. To conduct the evaluation, we equipped our printer with a pressure controller. The pressure control was set to a sinusoidal oscillatory signal to provide a controllable dynamic change in material properties. We used two materials, with high and low viscosity, and used two separate policies pretrained in simulation using those materials. We printed 22 slices, of which 11 corresponded to the simulation training set and 11 to the evaluation set. We monitor the printing process and use the captured images to run our evaluation metric to capture quantitative results. We observe that our controllers outperform the baseline print in every scenario (Figure 9).

Figure 9: The relative improvement of our policy over baseline (average offset improvement in microns) for low- and high-viscosity materials on training and evaluation slices.

A sample of the fabricated slices can be seen in Figure 10. The print target (white) is overlaid with a map of under-deposited (blue) and over-deposited (red) material. We further plot a histogram of under- and over-deposition. We can see that our control policy transferred excellently to the physical hardware without any additional training. Our policy consistently achieves smaller over-deposition while not suffering from significant under-deposition. Moreover, in many cases our policy achieves narrower histograms, suggesting we achieved tighter control over the material deposition than the baseline. This demonstrates that our numerical model enables learning control policies for additive manufacturing in simulation.

## 6 CONCLUSION

We present the first closed-loop control policy for additive manufacturing recovered via reinforcement learning. To learn an effective control policy, we propose a custom numerical model of the deposition process.
During the design of our model, we tackle several challenges. To obtain an efficient approximation of the deposition process, we leverage the limited perception of a printing apparatus and model the deposition only qualitatively. To include the non-linear coupling between process parameters and printed materials, we utilize a data-driven predictive model for the deposition imperfections. Finally, to enable long-horizon learning with viscous materials, we use the privileged information generated by our numerical model for reward computation. In several ablation studies, we show that these components are required to achieve high-quality printing, effectively react to instantaneous and long-horizon material changes, handle materials with varying viscosity, and adapt the deposition parameters to achieve printouts with minimal over-deposition and smooth top layers.

We demonstrate that our model can be used to train control policies that outperform baseline controllers and transfer to a physical apparatus with a minimal sim-to-real gap. We showcase this by applying control policies trained exclusively in simulation on a physical printing apparatus. We use our policies to fabricate several prototypes using low- and high-viscosity materials. The quantitative and qualitative analysis clearly shows the improvement of our controllers over baseline printing. This indicates that our numerical model can guide the future development of closed-loop policies for additive manufacturing. Thanks to its minimal sim-to-real gap, the model democratizes research on learning for additive manufacturing by limiting the need to invest in specialized hardware. Furthermore, by expanding the simulator with other physical phenomena, e.g., abrasion, melting, or heat transfer, our numerical model can serve as a blueprint for learning closed-loop control policies of other manufacturing methods such as milling, direct energy deposition, or selective laser sintering.

Figure 10: Deposition quality estimation of physical results manufactured with the baseline and our learned policy (under- and over-deposition maps and histograms for low- and high-viscosity materials).

## REFERENCES

Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.

Ivanna Baturynska, Oleksandr Semeniuta, and Kristian Martinsen. Optimization of process parameters for powder bed fusion additive manufacturing by combination of machine learning and finite element method: A conceptual framework. Procedia CIRP, 67:227–232, 2018.

Jan Bender, Matthias Müller, Miguel A Otaduy, Matthias Teschner, and Miles Macklin. A survey on position-based simulation methods in computer graphics. In Computer Graphics Forum, volume 33, pp. 228–251. Wiley Online Library, 2014.

John Parker Burg. Maximum Entropy Spectral Analysis. Stanford Exploration Project, Stanford University, 1975.

Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979. IEEE, 2019.

Alexander Clegg, Wenhao Yu, Jie Tan, C Karen Liu, and Greg Turk. Learning to dress: Synthesizing human dressing motion via deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(6):1–10, 2018.

Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. 2016.

Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. End-to-end differentiable physics for learning and control. Advances in Neural Information Processing Systems, 31:7178–7189, 2018.

Sarah Elliott and Maya Cakmak. Robotic cleaning through dirt rearrangement planning with learned transition models. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1623–1630. IEEE, 2018.

Timothy Erps, Michael Foshey, Mina Konaković Luković, Wan Shou, Hanns Hagen Goetzke, Herve Dietsch, Klaus Stoll, Bernhard von Vacano, and Wojciech Matusik. Accelerated discovery of 3d printing materials using data-driven multi-objective optimization, 2021.

Wei Gao, Yunbo Zhang, Devarajan Ramanujan, Karthik Ramani, Yong Chen, Christopher B Williams, Charlie CL Wang, Yung C Shin, Song Zhang, and Pablo D Zavattieri. The status, challenges, and future of additive manufacturing in engineering. Computer-Aided Design, 69:65–89, 2015.

Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989.

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. PMLR, 2016.

Angus Johnson. Clipper - an open source freeware library for clipping and offsetting lines and polygons. http://www.angusj.com/delphi/clipper.php, 2015.

Branden Kappes, Senthamilaruvi Moorthy, Dana Drake, Henry Geerlings, and Aaron Stebner. Machine learning to optimize additive manufacturing parameters for laser powder bed fusion of inconel 718. In Proceedings of the 9th International Symposium on Superalloy 718 & Derivatives: Energy, Aerospace, and Industrial Applications, pp. 595–610. Springer, 2018.

Elia Kaufmann, Antonio Loquercio, René Ranftl, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Deep drone acrobatics. arXiv preprint arXiv:2006.05768, 2020.

Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Jeongseok Lee, Michael X Grey, Sehoon Ha, Tobias Kunz, Sumit Jain, Yuting Ye, Siddhartha S Srinivasa, Mike Stilman, and C Karen Liu. Dart: Dynamic animation and robotics toolkit. Journal of Open Source Software, 3(22):500, 2018.

Seunghwan Lee, Moonseok Park, Kyoungmin Lee, and Jehee Lee. Scalable muscle-actuated human simulation and control. ACM Transactions on Graphics (TOG), 38(4):1–13, 2019.

Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B. Tenenbaum, and Antonio Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. In International Conference on Learning Representations, 2019a.

Yunzhu Li, Jiajun Wu, Jun-Yan Zhu, Joshua B Tenenbaum, Antonio Torralba, and Russ Tedrake. Propagation networks for model-based control under partial observation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 1205–1211. IEEE, 2019b.

Chenang Liu, David Roberson, and Zhenyu Kong. Textural analysis-based online closed-loop quality control for additive manufacturing processes. In IIE Annual Conference Proceedings, pp. 1127–1132. Institute of Industrial and Systems Engineers (IISE), 2017.

Libin Liu and Jessica Hodgins. Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.

Pingchuan Ma, Yunsheng Tian, Zherong Pan, Bo Ren, and Dinesh Manocha. Fluid directed rigid body control using deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(4):1–11, 2018.

Miles Macklin and Matthias Müller. Position based fluids. ACM Transactions on Graphics (TOG), 32(4):1–12, 2013.

Larry Marple. A new autoregressive spectrum analysis algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):441–454, 1980. doi: 10.1109/TASSP.1980.1163429.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mojtaba Mozaffar, Arindam Paul, Reda Al-Bahrani, Sarah Wolff, Alok Choudhary, Ankit Agrawal, Kornel Ehmann, and Jian Cao. Data-driven prediction of the high-dimensional thermal history in directed energy deposition processes via recurrent neural networks. Manufacturing Letters, 18:35–39, 2018.

Matthias Müller, David Charypar, and Markus H Gross. Particle-based fluid simulation for interactive applications. In Symposium on Computer Animation, pp. 154–159, 2003.

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018.

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In NIPS, 2017.

OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.

Connor Schenck and Dieter Fox. Spnets: Differentiable fluid dynamics for deep neural networks. In Conference on Robot Learning, pp. 317–335. PMLR, 2018.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Hado Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In International Conference on Machine Learning, pp. 3191–3199. PMLR, 2017.

Pitchaya Sitthi-Amorn, Javier E Ramos, Yuwang Wang, Joyce Kwan, Justin Lan, Wenshou Wang, and Wojciech Matusik. Multifab: a machine vision assisted platform for multi-material 3d printing. ACM Transactions on Graphics (TOG), 34(4):1–11, 2015.

Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pp. 4732–4741. PMLR, 2018.

Chao Tang, Jie Lun Tan, and Chee How Wong. A numerical investigation on the physical mechanisms of single track defects in selective laser melting. International Journal of Heat and Mass Transfer, 126:957–968, 2018.

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

Marc A Toussaint, Kelsey Rebecca Allen, Kevin A Smith, and Joshua B Tenenbaum. Differentiable physics and stable modes for tool-use and manipulation planning. 2018.

Chengcheng Wang, Xipeng Tan, Erjia Liu, and Shu Beng Tor. Process parameter optimization and mechanical properties for additively manufactured stainless steel 316l parts by selective electron beam melting. Materials & Design, 147:157–166, 2018.

Chengcheng Wang, XP Tan, SB Tor, and CS Lim. Machine learning in additive manufacturing: State-of-the-art and perspectives. Additive Manufacturing, pp. 101538, 2020.

Yilin Wu, Wilson Yan, Thanard Kurutach, Lerrel Pinto, and Pieter Abbeel. Learning to manipulate deformable objects without demonstrations. arXiv preprint arXiv:1910.13439, 2019.

Wentao Yan, Ya Qian, Wenjun Ge, Stephen Lin, Wing Kam Liu, Feng Lin, and Gregory J Wagner. Meso-scale modeling of multiple-layer fabrication process in selective electron beam melting: inter-layer/track voids formation. Materials & Design, 141:210–219, 2018.

Bing Yao, Farhad Imani, and Hui Yang. Markov decision process for image-guided additive manufacturing. IEEE Robotics and Automation Letters, 3(4):2792–2798, 2018.

Ri Yu, Hwangpil Park, and Jehee Lee. Figure skating simulation from video. In Computer Graphics Forum, volume 38, pp. 225–234. Wiley Online Library, 2019.

Yunbo Zhang, Wenhao Yu, C Karen Liu, Charlie Kemp, and Greg Turk. Learning to manipulate amorphous materials. ACM Transactions on Graphics (TOG), 39(6):1–11, 2020.

Qingnan Zhou and Alec Jacobson. Thingi10k: A dataset of 10,000 3d-printing models. arXiv preprint arXiv:1605.04797, 2016.

## A METHODS

### A.1 HARDWARE SETUP

In this work, we developed a direct write 3D printing platform with an optical feedback system that can measure the dispensed material in real time, in-situ. The 3D printer is comprised of a pressure-driven syringe pump and pressure controller, a 3-axis Cartesian robot, an optical imaging system, a backlit build platform, a 3D-printer controller, and a CPU (Figure 11). The 3-axis Cartesian robot is used to locate the build platform in the x- and y-directions and the print carriage in the z-direction. The pressure-driven syringe pump and pressure controller are used to dispense an optically opaque material onto the back-lit build platform. The back-lit platform is used to illuminate the dispensed material.
The movement of the robot, the actuation of the syringe pump, and the timing of the cameras are controlled via the controller. The CPU is used to process the images after they are acquired and compute updated commands to send to the controller.

Figure 11: The printing apparatus consisting of a 3-axis Cartesian robot, a direct write printing head, and a camera setup.

#### A.1.1 CALIBRATION

To enable real-time control of the printing process, we implemented an in-situ view of the material deposition. Ideally, we would capture a top-down view of the deposited material. Unfortunately, this is not possible since the material is obstructed by the dispensing nozzle. As a result, the camera has to observe the printing bed from an angle. Since the nozzle would obstruct the view of any single camera, we opted to use two cameras. More specifically, we place two CMOS cameras (Basler AG, Ahrensburg, Germany) at 45 degrees on each side of the dispensing nozzle, Figure 11. We calibrate the cameras by collecting a set of images and estimating their intrinsic parameters (Figure 12, calibration). To obtain a single top-down view, we capture a calibration target aligned with the image frames of both cameras (Figure 12, homography). By calculating the homography between the captured targets and an ideal top-down view, we can stitch the images into a single view from a virtual over-the-top camera. Finally, we mask the location of each nozzle in the image (Figure 12, nozzle masks) and obtain the final in-situ view (Figure 12, stitched image).

Figure 12: The calibration of the imaging setup. First, intrinsic parameters are estimated from calibration patterns. Next, we compute the extrinsic calibration by calculating homographies between the cameras and an overhead view. We extract the masks by thresholding a photo of the nozzle. The final stitched image consists of 4 regions: (1) view only in the left camera, (2) view only in the right camera, (3) view in both cameras, (4) view in no camera. The final stitched image is shown on the right.

The recovered in-situ view is scaled to attain the same universal scene unit size that our control policies are trained in. Since we seek to model the deposition only qualitatively, it is sufficient to rescale the in-situ view to match the scale of the virtual environments. We identify this scaling factor separately for each material. To calibrate a single material, we start by depositing a straight line at maximum velocity. The scaling factor is then the ratio required to match the observed thickness of the line with the simulation. To extract the thickness of the deposited material, we rely on its translucency properties. More precisely, we correlate material thickness with optical intensity. We do this by depositing the material at various thicknesses and taking a picture with our camera setup. The optical intensity then decays exponentially with increased thickness, which is captured by a power law fit.

Figure 13: Calibration images for correlating deposited material thickness with optical intensity and the corresponding fit.
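A minimal numpy sketch of this calibration step is shown below; the intensity values are illustrative placeholders (only the thickness levels mirror those in Figure 13), and the fit simply inverts a power law relating observed intensity to deposited thickness.

```python
# Sketch of the thickness-from-intensity calibration (intensities are made up).
import numpy as np

def fit_power_law(thickness_mm, intensity):
    """Fit intensity ~= a * thickness**b by least squares in log-log space."""
    b, log_a = np.polyfit(np.log(thickness_mm), np.log(intensity), deg=1)
    return np.exp(log_a), b

def thickness_from_intensity(intensity, a, b):
    """Invert the fit to estimate deposited thickness from observed intensity."""
    return (intensity / a) ** (1.0 / b)

# Illustrative calibration measurements (thickness in mm, normalized intensity).
thickness = np.array([0.05, 0.1, 0.19, 0.4, 0.75, 1.0])
intensity = np.array([0.95, 0.80, 0.62, 0.35, 0.18, 0.12])
a, b = fit_power_law(thickness, intensity)
estimated_height = thickness_from_intensity(0.5, a, b)
```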
The last assumption of our control policy is that the deposition needle is centered with respect to the in-situ view. To ensure that this assumption holds on the physical hardware, we calibrate the location of the dispensing needle within the field of view of each camera and with respect to the build platform. First, a dial indicator is used to measure the height of the nozzle in z, and the fine adjustment stage is adjusted until the nozzle is 254 microns above the print platform. Next, using a calibration target located on the build platform and the fine adjustment stage, the nozzle is centered in the field of view of each camera. This calibration procedure is performed each time the nozzle is replaced at the start of a printing session.

#### A.1.2 BASELINE CONTROLLER

To calibrate the baseline control, we follow the same procedure in simulation and on physical hardware. We start by depositing a straight line at a constant velocity, Figure 14. Next, we measure the width of the deposited line at various locations to estimate the mean width. We use the width to generate the offset for outline printing and the spacing of the infill pattern.

Figure 14: The baseline controller starts by estimating the width w of the deposited material. A control sequence for the nozzle is estimated by offsetting the desired shape by half the size of the material width.

### A.2 CONTROL POLICY INPUT STATES

To define the input states, we closely follow the constraints of the physical hardware. We model our observation space as a small in-situ view centered at the printing nozzle location. The view has a size of 84 × 84 pixels, which translates to roughly 2.95 × 2.95 scene units (SU). The view contains either a heightmap (for infill printing) or a material segmentation (for outline printing). Since the location directly under the nozzle is obscured on the physical hardware, we mask a small central position in the view equivalent to 0.42 SU, or 1/7th of the in-situ view. Together with the local view, we also provide the printer with a local image of the desired printing target and the path the control policy will take in the environment. To further minimize data redundancy, we rotate the in-situ view such that the printer moves along the positive X-axis in the image. These three inputs are stacked together into a 3-channel image (Figure 15).

Figure 15: Control policy input (in-situ printing bed, desired target, and nozzle path; the region occluded by the nozzle is masked).
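Below is a minimal sketch of how such an observation could be assembled from global canvases, assuming numpy arrays for the heightmap, target, and path images and a known nozzle position and heading; the crop helper and its boundary handling are illustrative, not the exact pipeline used on the device.

```python
# Sketch of assembling the 84x84x3 control-policy input (illustrative helper;
# image-boundary handling is omitted for brevity).
import numpy as np
from scipy.ndimage import rotate

VIEW = 84                 # observation size in pixels
MASK = VIEW // 7          # central region occluded by the nozzle (~1/7 of the view)

def crop_view(image, cx, cy, heading_deg):
    """Crop a nozzle-centered window and rotate it so motion points along +X."""
    half = VIEW  # crop generously, rotate, then keep the central VIEW x VIEW window
    patch = image[cy - half:cy + half, cx - half:cx + half]
    patch = rotate(patch, angle=-heading_deg, reshape=False, order=1)
    c = half
    return patch[c - VIEW // 2:c + VIEW // 2, c - VIEW // 2:c + VIEW // 2]

def build_observation(heightmap, target, path, cx, cy, heading_deg):
    bed = crop_view(heightmap, cx, cy, heading_deg)
    lo, hi = VIEW // 2 - MASK // 2, VIEW // 2 + MASK // 2
    bed[lo:hi, lo:hi] = 0.0                                  # mask the occluded center
    tgt = crop_view(target, cx, cy, heading_deg)
    pth = crop_view(path, cx, cy, heading_deg)
    return np.stack([bed, tgt, pth], axis=-1)                # 84 x 84 x 3
```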
### A.3 ACTION SPACE

The selection of the action space plays a critical role in adapting a controller to the actual hardware. One possibility is to control and directly modify the voltage input of individual motors. However, such an approach is not readily transferable between printing devices. The controls are tied too tightly to the hardware selection and would exaggerate the sim-to-real gap. Moreover, directly affecting the motor voltage would mean that the control policy must learn how to trace print inputs. Instead, we propose a strategy that leverages the body of work on designing baseline controllers. Similar to the baseline, our control policy follows a path generated by a slicer. However, we enable dynamic modification of the path. At each state, the printer can modify two actions: (1) the velocity at which the printing head is moving and (2) the displacement of the printing head in a direction perpendicular to the motion, Figure 16. Such a formulation allows us to decouple the hardware parameters from the control scheme and apply the same policy in both simulation and physical hardware by scaling the input units appropriately. In our simulation, we limit the velocity to the range of [0.2, 2] SU/s and the displacement to 0.2666 SU.

Figure 16: The action space (velocity along the path and displacement perpendicular to it).

### A.4 TRANSITION FUNCTION

The transition function takes a state-action pair and outputs a new state of the environment. In our setting, this means we need to numerically model the fabrication process, which is a notoriously difficult problem. Here we leverage our assumption that the observation space is so localized that it can identify the deposited materials only qualitatively. Therefore, we can trade physical realism for visual fidelity and efficiency. This description fits the Position-Based Dynamics (PBD) framework (Macklin & Müller, 2013), which is a geometrical approximation to the equations of motion.

To model the interaction of the deposited material with the printing apparatus, we rely on Position-Based Dynamics (PBD). PBD approximates rigid, viscous, and fluid objects as collections of particles. To represent the fluid, we assume a set of N particles where each particle is defined by its position p, velocity v, mass m, and a set of constraints C. In our setting we consider two constraints: (1) collision with the nozzle and (2) incompressibility of the fluid material. We model the collision with the nozzle as a hard inequality constraint:

$$C_i(\mathbf{p}_i) = (\mathbf{p}_i - \mathbf{q}_c) \cdot \mathbf{n}_c, \qquad (2)$$

where q_c is the contact point of a particle with the nozzle geometry along the direction of the particle's motion v, and n_c is the normal at the contact location. To ensure that our fluids remain incompressible, we follow Macklin & Müller (2013) and formulate a density constraint for each particle:

$$C_i(\mathbf{p}_1, \dots, \mathbf{p}_n) = \frac{\rho_i}{\rho_0} - 1, \qquad (3)$$

$$\rho_i = \sum_j m_j\, W(\mathbf{p}_i - \mathbf{p}_j, h), \qquad (4)$$

where ρ₀ is the rest density and ρᵢ is given by a Smoothed Particle Hydrodynamics estimator (Müller et al., 2003), in which W is the smoothing kernel defined by the smoothing scale h.

We further tune the simulation parameters to achieve a wide range of viscosity properties. More specifically, we couple the effects of viscosity, adhesion, and energy dissipation into a single setting. By coupling these parameters, we obtain materials with optically different viscosity properties. Moreover, we noticed that the number of solver substeps has a significant effect on the viscosity and surface tension of the simulated fluids. Therefore, we also tweak the number of substeps from 2 for liquid-like materials to 5 for highly viscous materials.

We replicate our printing apparatus in the simulation (inset). We model the nozzle as a collision object with a hard contact constraint on the fluid particles. Since modeling a pressurized reservoir is computationally costly, as it requires us to have many particles in constant contact, we chose to approximate the deposition process at the peak of the nozzle. More specifically, we model the deposition as a particle emitter. To set the volume and velocity of the particles, we use a flow setting. The higher the flow, the more particles with higher initial velocities are generated. This qualitatively approximates the deposition process with a pressurized reservoir. The particle emitter is placed slightly inside the nozzle to allow for realistic material buildup and a delayed stop, similar to extrusion processes.
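A minimal numpy sketch of the density constraint in Equations 3-4 is given below, assuming the poly6 smoothing kernel commonly used in position-based fluids; the neighbor search and constraint-projection steps of a full PBD solver are omitted.

```python
# Sketch of the fluid density constraint (Eqs. 3-4) with the poly6 kernel
# (brute-force O(N^2) pairwise evaluation; a real solver uses a neighbor grid).
import numpy as np

def poly6(r, h):
    """Smoothing kernel W(r, h); zero outside the support radius h."""
    w = np.zeros_like(r)
    inside = r < h
    w[inside] = (315.0 / (64.0 * np.pi * h**9)) * (h**2 - r[inside] ** 2) ** 3
    return w

def density_constraints(positions, masses, h, rho0):
    """Return C_i = rho_i / rho_0 - 1 for every particle (Eq. 3)."""
    diff = positions[:, None, :] - positions[None, :, :]   # pairwise p_i - p_j
    dist = np.linalg.norm(diff, axis=-1)
    rho = (masses[None, :] * poly6(dist, h)).sum(axis=1)   # SPH estimate (Eq. 4)
    return rho / rho0 - 1.0
```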
Finally, we consider the printer to have only a finite acceleration per timestep. To accelerate to the target velocity, we employ a linear acceleration scheme.

Another important choice for the numerical model is the discretization. We have two options: (1) time-based and (2) distance-based. We originally experimented with time-based discretization. However, we found that time discretization is not suitable for printer modeling. As the velocity in simulation approaches zero, the difference in deposited material becomes progressively smaller until the gradient information completely vanishes (Figure 17, left). Moreover, a time-based discretization allows the policy to directly affect the number of evaluations of the environment. As a result, it can avoid being punished for bad material deposition by quickly rushing the environment to finish. Considering these factors, we opted for distance-based discretization (Figure 17, right). The policy specifies the desired velocity at each interaction point, and the environment travels a predefined distance (0.2666 SU) at the desired speed. This helps to regularize the reward function and enables learning of varying control policies.

Figure 17: Discretization (new material deposited between timesteps for time-based and distance-based discretization at minimum and maximum velocity).

An interesting design element is the orientation of the control polygons created by the slicer. When the outline is defined as points given counter-clockwise, then due to the applied rotation, each view is split roughly into two half-spaces (Figure 18). The bottom one corresponds to outside, i.e., generally black, and the upper one corresponds to inside, i.e., generally white. However, the situation changes when outlining a hole. When printing a hole, the two half-spaces swap location. We can remove this ambiguity by changing the orientation of the polylines defining holes in the model. By orienting them clockwise, we effectively swap the two half-spaces to the same orientation as when printing the outer part. As a result, we achieve a better usage of trajectories and a more robust control scheme that does not need to be separately trained for each print's outer and inner parts.

Figure 18: Orientation of outlines and holes.

To design a realistic virtual printing environment, the model needs to capture the deposition imperfections. The source of these imperfections is the complex non-linear coupling between the dynamic material properties and the deposition parameters. Analytical modeling of this coupling is challenging as it requires a deep understanding of these interactions. Instead, we adopted a data-driven model. We observe that the final effect of the deposition error is a varying width of the deposited material. To recover such a model for our apparatus, we start by printing a reference slice over multiple iterations (Figure 19, left). At each iteration, we measure the width of the deposited material at specified cross-sections (Figure 19, middle). This yields observations of how the material width evolves in time (Figure 19, left). To formulate a predictive generative model, we employ a tool from speech processing called Linear Predictive Coding (LPC) (Marple, 1980). The model assumes that a signal is generated by a buzz filtered by an auto-correlation filter. We use this assumption to recover filter coefficients that transform white Gaussian noise into realistic pressure samples (Figure 19, left).
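A minimal sketch of such a data-driven noise model is shown below; it fits the autoregressive coefficients by ordinary least squares rather than the Burg recursion detailed next, and the model order, noise scale, and example signal are illustrative.

```python
# Sketch of the data-driven width-variation model: fit an AR(M) filter to
# measured width samples and synthesize new variations by filtering white noise.
import numpy as np
from scipy.signal import lfilter

def fit_ar(samples, order):
    """Least-squares fit of x_n = -sum_m a_m x_{n-m} + eps_n (cf. Eq. 5)."""
    X = np.column_stack([samples[order - m:-m] for m in range(1, order + 1)])
    y = samples[order:]
    a, *_ = np.linalg.lstsq(X, -y, rcond=None)   # note the sign convention of Eq. 5
    return a

def synthesize_widths(a, n_samples, noise_std, rng=None):
    """Filter white Gaussian noise through the all-pole filter 1 / (1 + sum a_m z^-m)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(scale=noise_std, size=n_samples)
    return lfilter([1.0], np.concatenate(([1.0], a)), eps)

# Example: fit an order-8 model to a stand-in width trace and synthesize a new
# variation to drive the simulator's flow setting (values illustrative).
measured = np.sin(np.linspace(0, 20, 400)) + 0.1 * np.random.default_rng(0).normal(size=400)
coeffs = fit_ar(measured, order=8)
widths = synthesize_widths(coeffs, n_samples=1000, noise_std=0.1)
```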
To formulate a predictive generative model, we employ a tool from speech processing called Linear Predictive Coding (LPC) (Marple, 1980). We can predict the next sample of a signal as a weighted sum of M past output samples and a noise term:

$$x_n = -\sum_{m=1}^{M} a_{M,m}\, x_{n-m} + \epsilon_n, \qquad (5)$$

where x are the signal samples, ϵ is the noise term, and a_{M,m} are the parameters of the M-th order auto-correlation filter. To find these coefficients, Burg (1975) proposes to minimize the following energy:

$$e_M = \sum_{k=1}^{N-M} |f_{M,k}|^2 + \sum_{k=1}^{N-M} |b_{M,k}|^2, \qquad (6)$$

$$f_{M,k} = \sum_{i=0}^{M} a_{M,i}\, x_{k+M-i}, \qquad (7)$$

$$b_{M,k} = \sum_{i=0}^{M} a^{*}_{M,i}\, x_{k+i}, \qquad (8)$$

where ∗ denotes the complex conjugate. After finding the filter coefficients with Equation 6, we can synthesize new width variations with a similar frequency composition to the physical hardware by filtering a buzz modeled as white Gaussian noise. Since we sampled the width variation at discrete intervals, we further find a smooth interpolating curve that models the observed pressure variation. We use the proposed model to drive the flow setting of our simulator. This directly influences the width of the deposited material similarly to the imperfections in the deposition.

Figure 19: We performed nine printouts and measured the width variation at specified locations. We fit the measured data with an LPC model. Please note that since our model is generative, we do not exactly match the data; any observed resemblance is a testament to the quality of our predictor.

### A.5 REWARD FUNCTION

Viscous materials take significant time to settle after deposition. Therefore, to assess deposition errors, we need to observe the deposition over long horizons. However, the localized nature of the in-situ view makes such observations impossible on the physical hardware. As a result, learning long-horizon planning has infeasible sample complexity. To tackle this issue, we leverage the fact that we utilize a numerical approximation of the deposition process with access to privileged information. At each simulation step, we model the entire printing bed. This allows us to formulate the reward function as a global print quality metric. More specifically, our metric is composed of two terms: (1) a reward term for depositing material inside the desired slice and (2) a punishment term for depositing material outside of the slice. To keep the values consistent across slices of varying size, we normalize them by the length of the outline or the infill area, respectively. We provide dense rewards as the difference between the metrics evaluated at two subsequent timesteps to accelerate the training further.

We consider two reward functions in our setting: one for outline printing and one for infill printing. Each reward function evaluates the print quality as a whole. To accelerate the learning, we provide the algorithm with dense rewards as a delta between the rewards of subsequent steps, R = R^{n+1} − R^n. To print the outline, we want to follow the boundary as closely as possible without overfilling. To this end, we compose our reward function of two terms. Given an image of the current printing bed C and the desired target T, we define the reward as Σ C T. While such a formulation rewards the control policy for depositing material inside the printing volume, it does not encourage a tight outline fill.
A.5 REWARD FUNCTION

Viscous materials take significant time to settle after deposition. Therefore, to assess deposition errors, we need to observe the deposition over long horizons. However, the localized nature of the in-situ view makes such observations impossible on the physical hardware. As a result, learning long-horizon planning has infeasible sample complexity. To tackle this issue, we leverage the fact that we utilize a numerical approximation of the deposition process with access to privileged information. At each simulation step, we model the entire printing bed. This allows us to formulate the reward function as a global print quality metric. More specifically, our metric is composed of two terms: (1) a reward term for depositing material inside the desired slice and (2) a punishment term for depositing material outside of the slice. To keep the values consistent across slices of varying size, we normalize them by the length of the outline or the infill area, respectively. To further accelerate training, we provide dense rewards as the difference between the metrics evaluated at two subsequent timesteps.

We consider two reward functions in our setting: one for outline printing and one for infill printing. Each reward function evaluates the print quality as a whole. To accelerate learning, we provide the algorithm with dense rewards as the delta between rewards of subsequent steps, R = R^{n+1} − R^{n}. To print the outline, we want to follow the boundary as closely as possible without overfilling. To this end, we compose our reward function of two terms. Given an image of the current printing bed C and the desired target T, we define the reward as R = Σ C T. While such a formulation rewards the control policy for depositing material inside the printing volume, it does not encourage a tight outline fill. Indeed, a potential strategy with such a reward would be to offset the printing nozzle as far inside as possible and then move safely within the object bounds. To address this issue, we include a weight map W, computed as a thresholded distance transform of the target T. The final reward function is then R = Σ C T W. With such a formulation, we put the highest weight on depositing directly on the outline boundary, and the threshold cutoff helps prevent a strategy of filling up the shape interior. To ensure that the printer deposits material only at the desired locations, we include an additional punishment term P = Σ C (1 − T). Finally, both reward and punishment are normalized by the length of the outline of our target.

For infill printing, we compute the reward from the heightfield of the deposited material. We start by estimating how much of the slice was covered. To this end, we use a thresholded version of the canvas C and compute the coverage as R = Σ C T. Similarly, we estimate the amount of over-deposited material as P = Σ C (1 − T). To keep these values consistent across different slices, we normalize them by the total area of the print. Finally, to encourage the deposition of flat surfaces suitable for 3D printing, we add another penalty term: the standard deviation of the canvas heightfield.

Figure 20: The reward function (target, printout, and rewarded vs. punished regions for outline and infill printing).

A.6 TRAINING PROCEDURE

To train our control policy, we start with G-code generated by a slicer. As inputs to the slicer, we consider a set of 3D models collected from the Thingi10k dataset. To train a controller, the input models need to be carefully selected. On the one hand, if we pick an object whose features are of too low frequency with respect to the printing nozzle size, then any printing errors caused by the control policy have a negligible influence on the final result. On the other hand, if we pick a model with features of too high frequency with respect to the printing nozzle, then the nozzle is physically unable to reproduce these features. As a result, we opted for a manual selection of 18 models that span a wide variety of features (Figure 21). Each model is scaled to fit into a printing volume of 18 × 18 SU and sliced at random locations.

Figure 21: Models in our curriculum. For a full view of exemplar slices please see the supplementary material.

Our policy is represented as a CNN modeled after Mnih et al. (2015). The network input is an 84 × 84 × 3 image. The image is passed through three hidden convolution layers with the respective parameters: (32 filters, filter size 8, stride 4), (64 filters, filter size 4, stride 2), and (64 filters, filter size 3, stride 1). The final convolved image is flattened and passed through a fully-connected layer with 512 neurons that is connected to the output action. Each hidden layer uses the rectified linear activation. We formulate our objective function as:

$$ \arg\max_{\theta} \; \mathbb{E}^{C}_{t}\left[ \frac{\pi_{\theta_t}(a_t \mid s_t)}{\pi_{\theta_{t-1}}(a_t \mid s_t)}\, \hat{A}_t \right], \qquad (9) $$

where t is a timestep in the optimization, θ are the parameters of a neural network encoding our policy π that generates an action a_t based on a set of observations s_t, \hat{A}_t is the estimator of the advantage function, and the expectation \mathbb{E}^{C}_{t} is an average over a finite batch of samples generated by printing sliced models from our curriculum C. To maximize Equation 9, we use the PPO algorithm (Schulman et al., 2017).
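A minimal PyTorch sketch of the policy network described above is given below. The module name, the action dimensionality, and the assumed semantics of the three input channels are illustrative; how the actor and critic heads are attached for PPO may differ in the actual implementation.

```python
import torch
import torch.nn as nn


class PrintingPolicy(nn.Module):
    """CNN policy after Mnih et al. (2015): three conv layers followed by a 512-unit FC layer."""

    def __init__(self, num_actions: int = 1, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # For an 84x84 input, the conv stack yields a 64 x 7 x 7 feature map.
        self.fc = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU())
        self.action_head = nn.Linear(512, num_actions)   # e.g., the commanded velocity

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # observation: (batch, 3, 84, 84) image stacking the available observation channels
        return self.action_head(self.fc(self.features(observation)))


# Illustrative usage:
# policy = PrintingPolicy()
# action = policy(torch.zeros(1, 3, 84, 84))
```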
Each trajectory consists of a randomly selected mesh slice that is fully printed before proceeding to the next one. One epoch terminates when we collect 10000 observations. We run the algorithm for a total of 4 million observations, but convergence was achieved well before that (Figure 22). For the training parameters, we set the entropy coefficient to 0.01 and anneal it towards 0. Similarly, we anneal the learning rate from 3e-4 towards zero. Lastly, we picked a discount factor of 0.99, which corresponds to one action having a half-life of 70 steps. This is equivalent to roughly 18.6 SU of distance traveled. In our training set, this corresponds to 29-80 percent of the total episode length.

Figure 22: Training curves for controllers with constant material flow (full training and ablations with no printing bed, no path, and no target input).

We also experimented with training controllers for materials with varying viscosity (Figure 23). In general, we observed that the change in viscosity did not significantly affect the learning convergence. However, we observed a drop in performance when training control policies for the deposition of liquid materials. The liquid material requires longer time horizons to stabilize and has a wider deposition area, making precise tracing of fine features challenging.

Figure 23: Training curves for controllers with increasing viscosity in an environment with noisy flow.

Lastly, we conducted ablation studies on the action space and the reward function in the environment with noisy deposition (Figure 24). We can see that employing the delayed reward had a negative effect on convergence, and it is unclear whether a policy of sufficient quality would have been achieved.

Figure 24: Training curves for controllers with variable material flow (velocity-only and displacement-only action spaces; delayed and immediate rewards).

For evaluation, we constructed a separate dataset consisting of freeform and CAD geometries that were not present in the training set (Figure 25).

Figure 25: Exemplar models from the evaluation dataset.

B BAYESIAN OPTIMIZATION FOR BASELINE CONTROL

While the baseline controller closely follows the printed boundaries, it is possible that a more suitable policy exists to maximize our objective function. To verify this, we use the environment described in Section 4 to search for a velocity and offset that maximize the reward function. More specifically, we optimize a simplified objective of Equation 9 limited to a single shape:

$$ \arg\max_{v,d} \; \mathbb{E}\left[ \pi_{v,d}(a_t \mid s_t) \right], \qquad (10) $$

where v and d are the optimized velocity and displacement of the printing policy π_{v,d}, and the expectation reduces to the expected cumulative reward of executing our proposed environment with a single slice. Maximizing Equation 10 even for a single shape is a challenging task due to the high cost associated with evaluating the objective function. Because of this, we rely on Bayesian optimization to maximize the objective. We warm-start the optimization with 20 samples acquired through Latin hypercube sampling of our 2-dimensional action space. We run the optimization until convergence, which we define as not improving upon the found maximum for over 300 iterations; a sketch of this search is given below. The controllers optimized for a free-form bird model and a CAD model of a bolt are compared to our trained policy in Figure 26.
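A minimal sketch of this baseline search follows, assuming a hypothetical `simulate_print(velocity, displacement)` function that returns the cumulative reward of printing the chosen slice with a fixed-velocity, fixed-offset policy. It uses scikit-optimize's ask/tell interface and SciPy's Latin hypercube sampler as stand-ins for the actual tooling; the parameter bounds are illustrative.

```python
import numpy as np
from scipy.stats import qmc
from skopt import Optimizer

# Illustrative bounds for the 2-D action space (velocity, displacement).
BOUNDS = [(0.1, 2.0), (-0.5, 0.5)]
PATIENCE = 300            # stop after 300 iterations without improvement (from the text above)
WARM_START_SAMPLES = 20


def simulate_print(velocity: float, displacement: float) -> float:
    """Hypothetical stand-in: run the printing environment on one slice and return its reward."""
    raise NotImplementedError


def optimize_baseline() -> tuple[list[float], float]:
    opt = Optimizer(dimensions=BOUNDS, base_estimator="GP", acq_func="EI")

    # Warm start: 20 Latin hypercube samples of the 2-D action space.
    lhs = qmc.LatinHypercube(d=2, seed=0).random(WARM_START_SAMPLES)
    lo, hi = np.array(BOUNDS).T
    for point in qmc.scale(lhs, lo, hi):
        x = point.tolist()
        opt.tell(x, -simulate_print(*x))      # skopt minimizes, so negate the reward

    best, since_improvement = min(opt.yi), 0
    while since_improvement < PATIENCE:
        x = opt.ask()
        y = -simulate_print(*x)
        opt.tell(x, y)
        best, since_improvement = (y, 0) if y < best else (best, since_improvement + 1)

    i = int(np.argmin(opt.yi))
    return opt.Xi[i], -opt.yi[i]              # best (velocity, displacement) and its reward
```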
Figure 26: Printouts realized using control policies recovered with Bayesian optimization (left and middle, the blue square marks the optimized slice) compared to our trained policy (right).

C ADAPTATION TO VARYING VISCOSITY

We evaluate how our learned controllers adapt to varying viscosity (Figure 27). We observe that our policy learned on low-viscosity materials consistently under-deposits when used to print at higher viscosities. Conversely, our control policy learned on high-viscosity material over-deposits when applied to materials with lower viscosities. From this observation, we conclude that our policy learns the post-deposition spread of the material and uses this information to guide the deposition. Therefore, small viscosity variations are unlikely to pose a significant challenge for our learned policies. However, if the learned material behavior is significantly violated, the in-situ observation space limits the ability of our policy to adapt to a previously unseen material.

Figure 27: We compare the baseline policy and our three learned policies on materials with varying viscosity.

D DETAILED PHYSICAL RESULTS

Figure 28: Policy evaluation on physical hardware (baseline vs. ours on low- and high-viscosity materials).