Field Handbook · No. 01

Data Engineer
Quest.

A friendly, step-by-step guide to data engineering, data lakes, and the Elasticsearch API — written so a curious middle schooler can follow along.

Quest Map

By the end of this handbook you will be able to:

  • Explain what data engineers do all day.
  • Tell structured from unstructured data.
  • Run a tiny ETL pipeline in your head.
  • Explain a data lake to your grandparents.
  • Talk to Elasticsearch via its API.
  • Read Go and Java code without panicking.
1 · What is Data Engineering?

Imagine your school cafeteria on pizza day. The food does not magically appear on your tray. Someone has to order the ingredients, store them in fridges, cook them, and serve them neatly. Data engineering is exactly the same — but instead of pizza, the ingredient is data.

A data engineer is the cafeteria worker of the digital world. They build the pipelines that grab raw data (clicks, photos, weather readings, video views) and prepare it so that scientists, students, or apps can actually use it.

In one sentence: Data engineering is the job of moving and shaping data so other people can ask it questions and get useful answers.

The three big jobs

01

Move data

Get data from where it lives (a phone, a sensor, a website) to where it can be analyzed.

02

Clean data

Fix typos, remove duplicates, drop empty rows. Real data is messy — like a teenager's room.

03

Store data

Put it in a place (a database or a data lake) where finding it is fast.

Tools you'll hear about

Python · SQL · Apache Spark · Kafka · Elasticsearch · Airflow

Don't worry — you don't need them all. We'll focus on Elasticsearch (with a little Go and Java).

2 · Types of Data: Tidy vs. Messy

Before we touch a single line of code, we need to know what we're dealing with. Data comes in three flavors, kind of like ice cream.

Structured

Lives in neat rows and columns, like a spreadsheet. Every row has the same shape.

Example: a list of students with name, age, grade.

Semi-structured

Has some shape but is bendy. Things like JSON or XML.

Example: a tweet with text + hashtags + author.

Unstructured

Total chaos. Photos, videos, free-text emails, voice notes. No rows. No columns.

Example: 1,000 selfies. 1,000 PDFs.

Fun fact: About 80–90% of all data in the world is unstructured. That's why tools like Elasticsearch (which can search messy text) are so popular.

How do we tame messy data?

The trick is to add a tiny bit of structure. We pull out the important parts of unstructured data and turn them into a JSON document. Watch:

example.json
// Original (unstructured) email
"Hi! I lost my red water bottle in room 204 on Tuesday."

// After data engineering — semi-structured JSON
{
  "item":     "water bottle",
  "color":    "red",
  "location": "room 204",
  "day":      "Tuesday",
  "status":   "lost"
}

Now a computer can search, count, and group all the lost items in a way it never could with raw sentences.
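That "pull out the important parts" step can itself be sketched in code. Here is a toy Go version; the regular expressions are made-up patterns tuned to this one sentence (real pipelines use sturdier parsing or natural-language tools):

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// LostItem is the structured shape we pull out of a free-text note.
type LostItem struct {
	Item     string `json:"item"`
	Color    string `json:"color"`
	Location string `json:"location"`
	Day      string `json:"day"`
	Status   string `json:"status"`
}

// extract picks the interesting parts out of one unstructured sentence.
// Toy patterns only — invented for this example.
func extract(note string) LostItem {
	li := LostItem{
		Color:    regexp.MustCompile(`\b(red|blue|green|black)\b`).FindString(note),
		Location: regexp.MustCompile(`room \d+`).FindString(note),
		Day:      regexp.MustCompile(`\b(Monday|Tuesday|Wednesday|Thursday|Friday)\b`).FindString(note),
		Status:   "lost",
	}
	// "my red water bottle in" → capture "water bottle"
	if m := regexp.MustCompile(`my \w+ ([a-z ]+?) in`).FindStringSubmatch(note); len(m) == 2 {
		li.Item = m[1]
	}
	return li
}

func main() {
	doc, _ := json.Marshal(extract("Hi! I lost my red water bottle in room 204 on Tuesday."))
	fmt.Println(string(doc))
}
```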

Quick check

A folder of vacation photos is what kind of data?

3 · ETL — The Data Recipe

ETL stands for three steps: Extract → Transform → Load. It is basically cooking, but for data.

STEP 01

Extract

Grab raw ingredients from the store.

Pull data from APIs, databases, files, sensors.

STEP 02

Transform

Wash, chop, cook, season the ingredients.

Clean, reformat, combine, validate.

STEP 03

Load

Plate it up and serve.

Save into a database, warehouse, or data lake.

A real (tiny) example

Pretend we run a lemonade stand and our point-of-sale app gives us a CSV file. We want it in our data lake, but cleaner.

etl-example.txt
# 1. EXTRACT — read the file
date,customer,price
2026-04-29,Alex,  $4.50
2026-04-29,jamie ,$3.00
2026-04-29,Pat,FREE   # Pat is the owner's kid

# 2. TRANSFORM — clean it up
#   - strip whitespace from names
#   - capitalize first letter
#   - parse "$4.50" → 4.50 (a number)
#   - drop "FREE" rows (not a real sale)

# 3. LOAD — write JSON into the data lake
{ "date":"2026-04-29", "customer":"Alex",  "price":4.50 }
{ "date":"2026-04-29", "customer":"Jamie", "price":3.00 }
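The three steps above can be sketched in Go. This is a minimal toy version: the rows are hard-coded where a real extractor would read the CSV file, and "loading" just prints JSON lines instead of writing to the lake:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// Sale is the clean shape we want in the lake.
type Sale struct {
	Date     string  `json:"date"`
	Customer string  `json:"customer"`
	Price    float64 `json:"price"`
}

// transform cleans one raw CSV row; ok == false means "drop this row".
func transform(row string) (Sale, bool) {
	parts := strings.Split(row, ",")
	if len(parts) != 3 {
		return Sale{}, false
	}
	name := strings.TrimSpace(parts[1]) // "jamie " → "jamie"
	if name == "" {
		return Sale{}, false
	}
	name = strings.ToUpper(name[:1]) + name[1:] // capitalize first letter
	priceText := strings.TrimSpace(parts[2])
	if !strings.HasPrefix(priceText, "$") { // drops the "FREE" row
		return Sale{}, false
	}
	price, err := strconv.ParseFloat(priceText[1:], 64) // "$4.50" → 4.5
	if err != nil {
		return Sale{}, false
	}
	return Sale{Date: strings.TrimSpace(parts[0]), Customer: name, Price: price}, true
}

func main() {
	rows := []string{ // EXTRACT: in real life this comes from the CSV file
		"2026-04-29,Alex,  $4.50",
		"2026-04-29,jamie ,$3.00",
		"2026-04-29,Pat,FREE",
	}
	for _, row := range rows {
		if sale, ok := transform(row); ok { // TRANSFORM
			line, _ := json.Marshal(sale)
			fmt.Println(string(line)) // LOAD: in real life, write to the lake
		}
	}
}
```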
ELT vs ETL: Some teams flip the order to ELT — Load first, Transform later inside the data lake itself. Both are valid. ELT is popular for cloud data lakes because storage is cheap.
4 · What is a Data Lake?

A data lake is a giant storage area where companies dump all their data — structured, semi-structured, and unstructured — in its raw form. Think of it as a real lake: water flows in from rivers (apps, sensors, websites). It is deep, it is wide, and you can fish out whatever you need later.

Lake vs. Warehouse

Feature        | Data Lake                        | Data Warehouse
Data type      | Anything (raw)                   | Cleaned, structured only
When to clean  | Later (read time)                | Before storing (write time)
Best for       | Data scientists, ML, exploration | Business reports & dashboards
Cost           | Cheap                            | More expensive
Vibe           | "Throw it in, sort it later"     | "Everything in its place"

The "zones" of a real data lake

BRONZE

Raw

Exactly what came in. Unmodified. Like a snapshot.

SILVER

Clean

Validated, deduplicated, normalized. Easy to use.

GOLD

Ready

Aggregated and joined. Polished tables for dashboards.

Pro tip: Real data lakes today are usually built on cloud object storage like Amazon S3, Azure ADLS, or Google Cloud Storage. On top of that, search engines like Elasticsearch act as a fast-lookup index over the messy stuff — that is where our story heads next.
5 · Meet Elasticsearch

Elasticsearch is like Google… for your own data. It can search through millions of documents in milliseconds because it builds a special structure called an inverted index.

Six vocabulary words to rule them all

Document — one JSON record (like one student or one sale).
Index — a folder of documents you search together.
Mapping — the schema. Tells Elasticsearch which fields are text, numbers, dates.
Node — one server running Elasticsearch.
Cluster — a group of nodes working together as a team.
Shard — a slice of an index split across nodes for speed.

How it talks: the REST API

You communicate with Elasticsearch using HTTP requests — the same way your browser asks for web pages. Every action is a verb (GET, POST, PUT, DELETE) plus a URL.

elasticsearch-basics.txt
# Create an index
PUT    /students

# Add (index) one document
POST   /students/_doc
{ "name": "Riley", "grade": 7, "hobbies": ["chess","art"] }

# Search any 7th grader who likes art
GET    /students/_search
{
  "query": { "bool": { "must": [
    { "match": { "hobbies": "art" } },
    { "term":  { "grade":  7 } }
  ]}}
}
The full API map lives at the official docs: elastic.co/docs/api/doc/elasticsearch. Bookmark it — you will come back!
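To build intuition for that bool query: a must list means every clause has to be true, like an AND. Here is the same filtering logic in plain Go (just the idea, not how Elasticsearch actually executes it; internally it uses the inverted index):

```go
package main

import "fmt"

// Student mirrors the documents in the /students index above.
type Student struct {
	Name    string
	Grade   int
	Hobbies []string
}

// matches mimics the bool/must query: every clause has to be true.
func matches(s Student) bool {
	likesArt := false
	for _, h := range s.Hobbies {
		if h == "art" { // the "match" clause on hobbies
			likesArt = true
		}
	}
	return likesArt && s.Grade == 7 // AND the "term" clause on grade
}

func main() {
	students := []Student{
		{"Riley", 7, []string{"chess", "art"}},
		{"Sam", 7, []string{"soccer"}},
		{"Avery", 8, []string{"art"}},
	}
	for _, s := range students {
		if matches(s) {
			fmt.Println(s.Name) // only students passing BOTH clauses print
		}
	}
}
```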
6 · The API, Explored

Here are the endpoints you will use 90% of the time.

GET /_cluster/health

Is the cluster alive? Returns green / yellow / red — a stoplight for your servers.

PUT /{index}

Create a new index, optionally with a mapping (schema).

POST /{index}/_doc

Index one document. Elasticsearch picks an ID for you.

POST /{index}/_bulk

Index thousands of documents in one trip. The fastest way to load a data lake.

GET /{index}/_search

Search using Query DSL. Match, term, range, fuzzy, geo — all live here.

DELETE /{index}/_doc/{id}

Remove one document. Use carefully (no Ctrl-Z!).

A real bulk-load body

bulk.ndjson
POST /lemonade-sales/_bulk
{ "index": { "_id": "1" } }
{ "date":"2026-04-29", "customer":"Alex",  "price":4.5 }
{ "index": { "_id": "2" } }
{ "date":"2026-04-29", "customer":"Jamie", "price":3.0 }
# Each line alternates: action, document, action, document...

Yes — one JSON object per line, no commas between them. It is called NDJSON.

7 · Loading Data with Go

Go (a.k.a. Golang) is a programming language built at Google for fast network code. Elastic publishes an official Go client. Let's use it to load lemonade sales.

Step 1 · Install the client

terminal
go mod init lemonade
go get github.com/elastic/go-elasticsearch/v8@latest

Step 2 · Connect & index a document

main.go
package main

import (
    "bytes"
    "context"
    "encoding/json"
    "log"

    es "github.com/elastic/go-elasticsearch/v8"
)

type Sale struct {
    Date     string  `json:"date"`
    Customer string  `json:"customer"`
    Price    float64 `json:"price"`
}

func main() {
    cfg := es.Config{
        Addresses: []string{"http://localhost:9200"},
        APIKey:    "YOUR_API_KEY", // from Kibana
    }
    client, err := es.NewClient(cfg)
    if err != nil {
        log.Fatal(err)
    }

    sale := Sale{Date: "2026-04-29", Customer: "Alex", Price: 4.5}
    body, _ := json.Marshal(sale)

    // client.Index is called with the index name, the document body,
    // and optional settings like a context
    res, err := client.Index(
        "lemonade-sales",
        bytes.NewReader(body),
        client.Index.WithContext(context.Background()),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    log.Println("Indexed!", res.Status())
}

Step 3 · Search the index

search.go
// assumes the client from main.go, plus "io", "os", "strings" imports
query := `{ "query": { "match": { "customer": "Alex" } } }`
res, _ := client.Search(
    client.Search.WithIndex("lemonade-sales"),
    client.Search.WithBody(strings.NewReader(query)),
)
defer res.Body.Close()
io.Copy(os.Stdout, res.Body) // print the raw JSON results
Bulk indexing in Go: for thousands of docs, use the helper esutil.NewBulkIndexer. It batches and retries automatically.
8 · Loading Data with Java

Java is one of the oldest and most-used languages in big enterprises. Elastic provides the Elasticsearch Java API Client — a modern, type-safe client.

Step 1 · Add to pom.xml

pom.xml
<dependency>
  <groupId>co.elastic.clients</groupId>
  <artifactId>elasticsearch-java</artifactId>
  <version>8.13.0</version>
</dependency>

Step 2 · Connect & index

LemonadeLoader.java
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.Refresh;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.elasticsearch.client.RestClient;
import org.apache.http.HttpHost;

public class LemonadeLoader {
  public record Sale(String date, String customer, double price) {}

  public static void main(String[] args) throws Exception {
    RestClient rest = RestClient
        .builder(new HttpHost("localhost", 9200, "http"))
        .build();
    RestClientTransport transport =
        new RestClientTransport(rest, new JacksonJsonpMapper());
    ElasticsearchClient client = new ElasticsearchClient(transport);

    Sale sale = new Sale("2026-04-29", "Alex", 4.5);
    client.index(i -> i
        .index("lemonade-sales")
        .document(sale)
        .refresh(Refresh.WaitFor)
    );
    System.out.println("Indexed sale!");
  }
}

Step 3 · Search

Search.java
SearchResponse<Sale> response = client.search(s -> s
    .index("lemonade-sales")
    .query(q -> q.match(m -> m
        .field("customer")
        .query("Alex")
    )),
    Sale.class
);
response.hits().hits().forEach(hit ->
    System.out.println(hit.source()));
Bulk in Java: use BulkIngester from the same package — it auto-flushes when the buffer fills or a timeout hits.
9 · Build Your Mini Data Lake

Time to put it all together. Here is the architecture of a tiny but real data lake using Elasticsearch as the searchable layer.

Sources:        Mobile App · Website · IoT Sensors · CSV Files
                        ↓
ETL Pipeline:   Go / Java / Python — Extract • Transform
                        ↓
Object Storage: S3 / GCS / Azure — Bronze • Silver • Gold (raw → clean → ready)
                        ↓
Elasticsearch:  Indexed for search — Aggregations • Geo • Real-time queries
                        ↓
Consumers:      Dashboards · ML Models · Search Apps

A 6-step recipe

  1. Spin up Elasticsearch. Easiest way: docker run -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:8.13.0 — or use Elastic Cloud's free trial.
  2. Design your index mapping. Decide which fields are text, keyword, date, geo_point.
  3. Write the extractor. A small Go or Java program that reads from your sources.
  4. Transform. Validate types, normalize names, drop bad rows.
  5. Bulk-load into Elasticsearch. Use the bulk API for speed.
  6. Visualize. Open Kibana — Elasticsearch's free dashboard tool — and start charting.

A starter mapping

create-index.json
PUT /lemonade-sales
{
  "mappings": {
    "properties": {
      "date":     { "type": "date" },
      "customer": { "type": "keyword" },
      "price":    { "type": "float"  },
      "location": { "type": "geo_point" }
    }
  }
}
Heads up: Elasticsearch is great for searching the lake but it is not the only storage. Big lakes use S3 + Apache Iceberg/Delta for the raw bulk and Elasticsearch as a fast index on top.

Final mini-quiz

Which API endpoint loads thousands of documents fastest?

Resources

Hand-picked links to keep your quest going.

Official documentation

Elasticsearch API reference

Every endpoint, every parameter — the source of truth.

Getting Started Guide

The friendliest first 30 minutes with Elasticsearch.

Go client docs

Official go-elasticsearch client guide.

Java client docs

Modern, type-safe Java client by Elastic.

Kibana Guide

Beautiful dashboards on top of Elasticsearch.

Query DSL

All the ways to ask Elasticsearch questions.

YouTube channels for visual learners

Elastic (official)

Tutorials, talks, and demos straight from the team.

TechWorld with Nana

Beginner-friendly DevOps & data infrastructure videos.

codebasics

Approachable data engineering & SQL playlists.

IBM Technology

Whiteboard explainers on Data Lakes vs Warehouses.

Seattle Data Guy

Career and concept videos from a working data engineer.

freeCodeCamp

Full-length free courses on Go, Java, ETL, and more.

Free hands-on practice

Elastic Cloud free trial

Spin up a real cluster in minutes.

A Tour of Go

Interactive Go basics in your browser.

dev.java

Oracle's free Java learning hub.

Books for the brave

Designing Data-Intensive Applications

Martin Kleppmann. The bible of modern data systems.

Fundamentals of Data Engineering

Reis & Housley. A friendlier overview of the whole field.