Create Parquet Dataset
Guide: Creating a Dataset with Multiple Parquet Files
Parquet is a popular format for storing large, structured data efficiently. Often, data is split across multiple Parquet files (e.g., Sales mobile Data in one file and Mobile Details in another).
By joining these files, you can create a single, meaningful dataset that brings all the information together.
This guide explains how to create a dataset using multiple Parquet files and run queries to join them.
🔹 Step 1: Open the Dataset Section
- Click the Hamburger Menu (≡).
📸 Insert Screenshot - Hamburger Menu
- Expand Master Data.
📸 Insert Screenshot - Master Data Menu
- Select Datasets.
📸 Insert Screenshot - Datasets Option
- You'll be redirected to the Dataset Page.
- At the bottom, click Create Dataset.
📸 Insert Screenshot - Create Dataset Button
The dataset creation box will appear.
🔹 Step 2: Select Parquet Files
- On the left-hand side, go to the Datasource Section.
📸 Insert Screenshot - Datasource Section
- Select Parquet Files.
- In the middle panel, you'll see a list of folders and `.parquet` files.
📸 Insert Screenshot - Parquet File List
- Select the files you want to use.
Example:
- Mobile Details.parquet
- Sales mobile Data.parquet
🔹 Step 3: Write Your Query
- After selecting the files, a Query Box will appear on the right.
📸 Insert Screenshot - Query Box
- Write your SQL query to join and fetch data.
Example (INNER JOIN):
SELECT
    sm.Phone,
    md."Price ($)",
    md.Status,
    sm.Manager,
    sm.Month,
    sm.Stage,
    sm.Deal_Status,
    sm.Deal_Size
FROM Sales mobile Data AS sm
INNER JOIN Mobile Details AS md
    ON sm.Phone = md.Phone
What the output shows:
- The Price ($) and Status from Mobile Details
- The Phone, Manager, Month, Stage, Deal Status, and Deal Size from Sales mobile Data
In short: This query gives you a combined dataset of sales and mobile details, showing only the phones that exist in both Parquet files.
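To see why the query returns only phones present in both files, the join can be tried on a few made-up rows. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module with in-memory tables standing in for the two Parquet files. The table names `sales_mobile_data` and `mobile_details`, the shortened column list, and all sample values are hypothetical, and nothing here reflects how the platform itself reads Parquet.

```python
import sqlite3

# In-memory database; the two tables stand in for the two Parquet files.
# Hypothetical sample rows: 'Phone C' has sales but no details,
# 'Phone D' has details but no sales.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_mobile_data (Phone TEXT, Manager TEXT, Deal_Size INTEGER);
    CREATE TABLE mobile_details (Phone TEXT, "Price ($)" REAL, Status TEXT);

    INSERT INTO sales_mobile_data VALUES
        ('Phone A', 'Asha', 10),
        ('Phone B', 'Ben', 5),
        ('Phone C', 'Cara', 7);
    INSERT INTO mobile_details VALUES
        ('Phone A', 699.0, 'Active'),
        ('Phone B', 499.0, 'Discontinued'),
        ('Phone D', 899.0, 'Active');
""")

# Same shape as the guide's query: match rows on the shared Phone column.
inner_rows = conn.execute("""
    SELECT sm.Phone, md."Price ($)", md.Status, sm.Manager, sm.Deal_Size
    FROM sales_mobile_data AS sm
    INNER JOIN mobile_details AS md
        ON sm.Phone = md.Phone
    ORDER BY sm.Phone
""").fetchall()

for row in inner_rows:
    print(row)
```

Phone C and Phone D are missing from the result because each appears in only one table. In engines that follow standard SQL, names containing spaces or symbols generally need double quotes, as `"Price ($)"` has here; whether the platform requires this for the file names is not stated in this guide.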
Tip:
- Use `INNER JOIN` when you want only the records common to both files.
- Use `LEFT JOIN`, `RIGHT JOIN`, or `FULL JOIN` when you also need to keep unmatched records from the first file, the second file, or both.
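The difference between join types is easiest to see side by side. The following is a minimal sketch using Python's built-in `sqlite3` module. The table names, the helper `run_join`, and the sample rows are all hypothetical stand-ins for the two Parquet files, not the platform's own objects.

```python
import sqlite3

# Hypothetical stand-ins for the two files: 'Phone C' has no details row,
# 'Phone D' has no sales row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_mobile_data (Phone TEXT, Manager TEXT);
    CREATE TABLE mobile_details (Phone TEXT, Status TEXT);
    INSERT INTO sales_mobile_data VALUES ('Phone A', 'Asha'), ('Phone C', 'Cara');
    INSERT INTO mobile_details VALUES ('Phone A', 'Active'), ('Phone D', 'Active');
""")


def run_join(join_type):
    """Run the same query, changing only the join keyword."""
    query = f"""
        SELECT sm.Phone, sm.Manager, md.Status
        FROM sales_mobile_data AS sm
        {join_type} mobile_details AS md
            ON sm.Phone = md.Phone
        ORDER BY sm.Phone
    """
    return conn.execute(query).fetchall()


inner_rows = run_join("INNER JOIN")  # only phones found in both tables
left_rows = run_join("LEFT JOIN")    # every sales row, matched or not

print("INNER JOIN:", inner_rows)
print("LEFT JOIN: ", left_rows)
```

With `LEFT JOIN`, Phone C is kept and its Status comes back empty (`None` in Python, `NULL` in SQL), while `INNER JOIN` drops it. A `RIGHT JOIN` would instead keep Phone D, and a `FULL JOIN` would keep both. They are left out of the sketch only because SQLite versions before 3.39 do not support them.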
🔹 Step 4: Define Dataset Name
- Enter a Dataset Name (DS Name) to identify your dataset.
📸 Insert Screenshot - Dataset Name Field
🔹 Step 5: Configure Output & Preview
- Go to the Output Columns section to review your selected fields.
📸 Insert Screenshot - Output Columns
- Click Preview to check the query results.
📸 Insert Screenshot - Preview Output
🔹 Step 6: Create the Dataset
- Once everything looks good, click Create.
📸 Insert Screenshot - Create Button
✨ Your dataset, built by joining multiple Parquet files, is ready. You can now use it for analysis, reporting, or building dashboards.