Video conferencing has become a commonly used tool for both private and professional remote communication. Despised yet immensely popular, current video conferencing and telepresence solutions have severe drawbacks in the quality of communication they provide (including reduced creativity, engagement, and social cues, as well as "Zoom fatigue"). This creates the need for new communication technology that enables more natural communication with greater social presence. One way to potentially increase communication quality is to extend video conferencing towards immersive 3D communication (also called holographic communication or Social Extended Reality, Social XR). Creating Social XR systems comes with multiple challenges in both technology development and user experience.

In this dissertation, we present a novel and modular capture, processing, transmission, and rendering pipeline as a reference for the development and research of Social XR applications. Our pipeline outlines the basic building blocks needed for Social XR and their individual performance implications. To describe the different components and to execute in-depth performance evaluations, we consider three main areas. First, we present a 3D representation format based on color and depth data (RGBD) and all the steps necessary to capture, process, and render users in photorealistic 3D. Our results show a CPU and GPU processing overhead of 5-16% and a capture-to-render delay of 300-400 ms. Although this is significantly higher than for comparable 2D video conferencing, we expect it to be acceptable in an operational environment. Second, we present an example web-based system to evaluate the impact on network performance. Our results show that our grayscale-based transmission of RGBD data performs well under network constraints similar to those of existing video conferencing solutions. Furthermore, the web-based client has CPU and GPU resource demands suitable for any modern (VR-ready) PC. Third, we present a novel central composition component that optimizes transmission and improves scalability with the number of simultaneous participants. Our solution combines different user bitstreams into a single decodable video stream. The composition increases the bitrate by approximately 16% while yielding a client performance gain of 20% and a server performance gain of 85-90%.

Furthermore, the concepts presented in this dissertation were implemented in three prototype demonstrators to show the applicability of our solutions. The prototypes cover a multi-user shared experience use case in VR, an AR social care communication use case, and an AR remote training use case. The results presented in this dissertation can serve as a new reference framework for the general building blocks of Social XR systems and as a point of reference for performance comparisons. In addition, this work opens up many possibilities for controlled user evaluations and Quality of Experience (QoE) studies, as well as further research in technical developments surrounding Social XR (to name a few: 3D user encoding, spatial computing, Social XR metadata, 5G, 6G, and edge computing).
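To make the grayscale-based transmission of RGBD data mentioned above more concrete, the sketch below shows one common way such a scheme can work: the depth map is quantized into an 8-bit grayscale plane within a fixed clipping range, so that a conventional 2D video codec can carry it alongside the color stream. The clipping distances, 8-bit depth, and linear quantization here are illustrative assumptions, not the exact scheme used in the dissertation.

```python
import numpy as np

# Assumed depth clipping range in millimeters (illustrative values,
# not taken from the dissertation).
NEAR_MM, FAR_MM = 500, 4500

def depth_to_gray(depth_mm: np.ndarray) -> np.ndarray:
    """Quantize depth in millimeters to an 8-bit grayscale frame."""
    clipped = np.clip(depth_mm.astype(np.float32), NEAR_MM, FAR_MM)
    normalized = (clipped - NEAR_MM) / (FAR_MM - NEAR_MM)
    return np.round(normalized * 255.0).astype(np.uint8)

def gray_to_depth(gray: np.ndarray) -> np.ndarray:
    """Invert the mapping at the receiver to recover approximate depth."""
    normalized = gray.astype(np.float32) / 255.0
    return (normalized * (FAR_MM - NEAR_MM) + NEAR_MM).astype(np.uint16)

# Round trip on a synthetic depth frame: the quantization error is bounded
# by half a gray level, i.e. (FAR_MM - NEAR_MM) / 255 / 2, roughly 8 mm here.
depth = np.random.randint(NEAR_MM, FAR_MM, size=(480, 640), dtype=np.uint16)
recovered = gray_to_depth(depth_to_gray(depth))
print(int(np.max(np.abs(recovered.astype(np.int32) - depth.astype(np.int32)))))
```

With a linear mapping like this, depth precision is uniform across the clipping range; a deployed system would additionally have to account for the lossy video compression applied to the grayscale plane, which introduces depth error beyond pure quantization.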