11.01.2015 Views

PDF Presentation - Uplinq


Brew MP<br />

<br />

Performance Tips<br />

Rich Stewart, Sr. Director, Technology<br />

QUALCOMM


Objective<br />

• Provide a general overview of performance<br />

considerations<br />

• Share BMP specific performance tips<br />

2


Agenda<br />

• Performance Concepts<br />

• BMP Architectural Overview<br />

• CPU Utilization Considerations<br />

• Memory Utilization Considerations<br />

• Power Utilization Considerations<br />

• Tools<br />

3


4<br />

Performance Concepts


General Performance<br />

• What is optimal performance<br />

– Execution time<br />

– Response time<br />

– Memory utilization<br />

– Power consumption<br />

• Optimal performance involves achieving a balance of all<br />

performance factors<br />

• Optimal performance involves sharing system resources<br />

• Optimal performance enables an optimal User Experience<br />

• The performance of a ‘crashed’ system is ZERO<br />

– user experience == 0<br />

5


Trading Off Throughput vs. Latency<br />

Signal + Worker Thread Example<br />

• Your application must balance work throughput with latency<br />

– Real-time apps (example: audio/video streams) must process small<br />

amounts of data regularly; low latency at the cost of context-switch<br />

overhead<br />

– Best-effort apps (example: downloads, file compression) are only<br />

concerned with reaching the “finish line” as soon as possible; low<br />

processing overhead at the cost of increased latency<br />

6


Throughput vs. Latency (cont.)<br />

• Example: A TCP throughput test application with configurable delay<br />

• Peak throughput occurs on this system at 10–12ms<br />

– Read too frequently and we burn excess CPU

– Read too infrequently and we leave throughput on the table

• TCP Window is filled after 20ms; throughput drops drastically<br />

7


Throughput vs. Latency (cont.)<br />

• Delay rendering/processing/parsing for as long as is reasonable

– Application context has higher priority than Packet Services context

– Crunching incoming data before the whole transfer is<br />

done has user-experience penalties<br />

– Try to manage in-transfer screen updates (e.g., show<br />

progress bar instead of progressive rendering)<br />

8


Trading Off CPU vs. Memory Utilization<br />

• Your application must balance CPU and memory usage<br />

(space-time tradeoff)<br />

• Example: data compression<br />

• Advantage: lower filesystem usage

• Disadvantage: higher resource load times

• The nature of your application determines the critical<br />

resource<br />

• “Is 5x data compression worth a 2x time penalty?”

• “Is a 2x time speedup worth a 4x larger data file?”
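The same space-time tradeoff shows up in code as well as in data files. The sketch below is not from the slides; it is a small illustration that counts the set bits in a byte two ways, once with a loop (no extra memory, more work per call) and once with a 256-byte lookup table (more memory, one array read per call). All function and variable names are hypothetical.

```c
#include <stdint.h>

/* Time-heavy, space-light: examine every bit of the byte on each call. */
int popcount_loop(uint8_t v)
{
    int n = 0;
    while (v) {
        n += v & 1;
        v >>= 1;
    }
    return n;
}

/* Space-heavy, time-light: a 256-byte table, filled once on first use. */
static uint8_t g_popcountTable[256];
static int g_tableReady = 0;

void popcount_table_init(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        g_popcountTable[i] = (uint8_t)popcount_loop((uint8_t)i);
    }
    g_tableReady = 1;
}

int popcount_table(uint8_t v)
{
    if (!g_tableReady) {
        popcount_table_init();
    }
    return g_popcountTable[v];
}
```

Which version is preferable depends on the critical resource: the table costs 256 bytes of RAM for the lifetime of the module, while the loop costs up to eight iterations per call.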

9


Data Cache Performance Implications<br />

• Write cache-friendly code by using temporal and spatial<br />

locality; keep as much data in your caches as possible<br />

• Improves execution time and saves power!<br />

• Example: Process a two-dimensional array

row-by-row (good spatial locality)

for (i = 0; i < numRows; i++)
    for (j = 0; j < numColumns; j++)
        process(array[i][j]);

column-by-column (bad spatial locality)

for (j = 0; j < numColumns; j++)
    for (i = 0; i < numRows; i++)
        process(array[i][j]);

10


Event Handling<br />

• A backlog in the event queue translates into perceived delay for the user<br />

– Validate that the event arrival rate of your design, combined with other system events, can be processed without the queue becoming so deep that it hurts the user's perception of responsiveness or forces the system to drop events

– When this threshold is exceeded, consider whether the "event" being processed is better treated as data. Touch screen events are a common example: handwriting recognition improves with the number of samples per second, but treating every sample as an event can flood the event handling mechanism. Grouping samples into a list of data, delivered with a timer-driven event, can improve performance.

• Event prioritization and/or creating independent UI tasks can have an advantage<br />

– Take the example of a single-threaded UI downloading a web page. If it is single threaded and polling a queue, the page download cannot be interrupted until it is complete. This is an extreme, hypothetical example.

• Do not use the event queue as a polling mechanism<br />

– For example, in Brew®, do not post an event to yourself, check some conditions in the event handler, and then re-post the same event if the condition hasn't been satisfied yet. If you must periodically check some condition, use a timer with a non-zero delay and check the conditions in the timer callback handler.
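The "treat samples as data" advice above can be sketched in portable C. This is a minimal illustration under assumptions, not Brew MP code: `TouchSample`, `SampleBatch`, `batch_add` and `batch_flush` are hypothetical names, and the periodic timer that would call `batch_flush` is left to the platform.

```c
#include <stddef.h>

#define SAMPLE_BATCH_MAX 32   /* hypothetical batch capacity */

typedef struct {
    int x;
    int y;
} TouchSample;

typedef struct {
    TouchSample samples[SAMPLE_BATCH_MAX];
    size_t count;     /* samples currently stored */
    size_t dropped;   /* samples discarded because the batch was full */
} SampleBatch;

/* Called for every raw sample. Stores it as data; posts no event. */
void batch_add(SampleBatch *b, int x, int y)
{
    if (b->count < SAMPLE_BATCH_MAX) {
        b->samples[b->count].x = x;
        b->samples[b->count].y = y;
        b->count++;
    } else {
        b->dropped++;  /* full: record the loss instead of queueing more */
    }
}

/* Called from a periodic timer: hands the whole batch to one consumer
 * call, then resets the batch. Returns the number of samples delivered. */
size_t batch_flush(SampleBatch *b,
                   void (*consume)(const TouchSample *, size_t, void *),
                   void *ctx)
{
    size_t n = b->count;
    if (n > 0 && consume != NULL) {
        consume(b->samples, n, ctx);
    }
    b->count = 0;
    return n;
}
```

With this shape the event queue sees one timer event per flush interval regardless of the sample rate, and the consumer still receives every sample that fit in the batch.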

11


Strings<br />

• Use AEEstd.h in lieu of AEEStdLib.h<br />

• Unnecessary function calls (empty string check):<br />

– Bad:<br />

• if (0 == std_strlen(psz)) {...}<br />

– Good:<br />

• if ('\0' == *psz) {...}

• Buffer string lengths when you need them multiple times<br />

– Bad:<br />

• std_strlcpy(dest, string1, MAX_STRING_LEN);

• std_strlcat(dest, string2, MAX_STRING_LEN - std_strlen(string1));<br />

• std_strlcat(dest, string3, MAX_STRING_LEN - (std_strlen(string1) + std_strlen(string2)));<br />

– Good:<br />

• int length = std_strlcpy(dest, string1, MAX_STRING_LEN);<br />

• length = std_strlcat(dest, string2, MAX_STRING_LEN - length);

• std_strlcat(dest, string3, MAX_STRING_LEN - length);<br />

• Concatenation, use strcat(), not strcpy():<br />

– Bad:<br />

• std_strlcpy(dest + std_strlen(dest), source, MAX_STRING_LEN - std_strlen(dest));

– Good:<br />

• std_strlcat(dest, source, MAX_STRING_LEN);<br />
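The idea of buffering the length instead of re-scanning the destination can be shown with standard C only. The helper below is a hypothetical stand-in written for this illustration; it is not the AEEstd.h API and makes no claim about how std_strlcpy or std_strlcat behave. It keeps a running length so each append scans only the new source string.

```c
#include <stddef.h>
#include <string.h>

/* Append src to dest at offset *pLen without writing more than destSize
 * bytes in total (terminator included). Updates *pLen to the new length.
 * Returns 1 if all of src fit, 0 if it was truncated or there was no room. */
int append_tracked(char *dest, size_t destSize, size_t *pLen, const char *src)
{
    size_t srcLen = strlen(src);   /* the only scan: over src, not dest */
    size_t room;
    size_t n;

    if (destSize == 0 || *pLen >= destSize) {
        return 0;
    }
    room = destSize - 1 - *pLen;   /* bytes left, excluding the terminator */
    n = (srcLen < room) ? srcLen : room;
    memcpy(dest + *pLen, src, n);
    *pLen += n;
    dest[*pLen] = '\0';
    return n == srcLen;
}
```

A caller starts with a length of 0 and calls `append_tracked` once per piece; the destination is never re-measured, so the total cost is linear in the combined length of the pieces.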

12


Strings (cont.)<br />

• Don't call strlen() on #defined strings or strings that are<br />

in-lined<br />

– Use (int)(sizeof(string)-1) or an appropriately defined macro<br />

• See if a built in string function exists to meet your<br />

use case<br />

– There are many built in string functions, some obscure<br />

• Iterate once through the string for complex operations<br />

– Each pass through a string can be viewed as a linear search in cost<br />

– For highly complex string operations consider implementing a<br />

state machine<br />
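Both points above can be sketched briefly. The macro shows the sizeof-based length for a #defined string, and the function is a small two-state machine that does two jobs in one pass: it collapses runs of spaces in place and counts words. The names `LITERAL_LEN`, `GREETING` and `collapse_spaces` are hypothetical and chosen for this example only.

```c
/* Length of a string literal or char array, known at compile time.
 * Only valid for arrays; applied to a char pointer it would be wrong. */
#define LITERAL_LEN(s) ((int)(sizeof(s) - 1))

#define GREETING "hello"

/* Single pass, two states. Collapses each run of spaces to one space,
 * strips leading and trailing spaces, and returns the word count. */
int collapse_spaces(char *s)
{
    enum { IN_SPACE, IN_WORD } state = IN_SPACE;
    char *start = s;
    char *out = s;
    int words = 0;

    for (; *s != '\0'; s++) {
        if (*s == ' ') {
            if (state == IN_WORD) {
                *out++ = ' ';       /* keep one separator after a word */
                state = IN_SPACE;
            }
        } else {
            if (state == IN_SPACE) {
                words++;            /* first character of a new word */
                state = IN_WORD;
            }
            *out++ = *s;
        }
    }
    if (out != start && out[-1] == ' ') {
        out--;                      /* drop the separator after the last word */
    }
    *out = '\0';
    return words;
}
```

Doing the same work with separate trim, collapse and count helpers would walk the string three times; the state machine walks it once.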

13


Diagnostic Code

• Code which does not directly contribute to the user’s experience within an application reduces the overall user experience of the ‘system’

14


15<br />

BMP Architectural Overview


Device Platform Overview<br />

Brew MP provides an extensive set of APIs into features at all<br />

layers of the device, including:<br />

• OS Services<br />

• General, Multimedia and Modem Services<br />

• Application, Widget, Windowing and UI Services<br />

Brew MP is both modular and extensible. Many of the services<br />

are offered as demand-loaded, digitally signed binary modules.<br />

This serves to reduce static RAM overhead, bolster compatibility<br />

between devices and support flexible configurations.<br />

The platform supports both cooperative and pre-emptive<br />

threading. It also supports process-based memory protection<br />

which complements the already robust least-privileged execution<br />

model of the platform.<br />

Brew MP has been optimized to leverage chipset modem and multimedia services. These services are exposed through immutable object-oriented APIs which support seamless access to hardware accelerated implementations when available.

A key principle of Brew MP is platform extensibility. New features and APIs can be added without Qualcomm’s involvement. These features can be exposed through any supported language.

[Figure: Brew MP architecture stack. Layers: Applications (3rd Party Apps), Application Environment, General Services, Graphics & Multimedia (HW Accelerated), Modem & Networking, OS Services, Kernel. Legend: 3rd Party Software, Brew MP Software, Chipset Software.]

16


Leverage the Functionality Provided by<br />

the BMP System<br />

• Components in the Brew MP<br />

System are optimized for<br />

– Power<br />

– Execution performance<br />

– Memory utilization<br />

✔ Do IT!

• Optimization of Brew MP System<br />

Components is chipset specific<br />

– Using these components will utilize<br />

the underlying hardware where<br />

possible<br />

✔Please utilize Brew MP APIs<br />

where applicable<br />

17


Brew MP API Usage Considerations<br />

• New Brew MP APIs are generally improved over<br />

Brew 3.1.5 legacy interfaces<br />

– Brew MP ISockPort replaces the Brew 3.1.5 ISocket interface

• Reference:<br />

https://brewmobileplatform.qualcomm.com/devnet/docviewer.jspmethod=show&id=5790&path=%2FdevEx%2Flibrary%2Ftechguides%2Fc%2Fnetworking%2Fnetworking_tech_guide%2Fhtml_out_oem%2Fframeset.html

– IFileSystem2<br />

– Minimize calls to GETTIMESECONDS and GETUPTIME<br />

• Apps frequently call these more often than necessary

• These calls have a relatively high performance cost

– GETTIMESECONDS has the higher overhead of the two<br />

18


Brew MP API Usage Considerations: Scheduling Work

• Always use ISHELL_Resume to break an IApplet’s time-consuming tasks into smaller, interruptible chunks

– A common performance mistake is to use ISHELL_PostEvent() for this purpose, which works functionally

– ISHELL_PostEvent() is more costly than ISHELL_Resume

• ISHELL_PostEvent() takes the destination applet class ID as an argument

• Iterates through the applet database to match the class ID
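The chunking pattern itself can be shown without any platform calls. The sketch below is a portable illustration under assumptions: `ChunkedJob`, `job_step` and `CHUNK_SIZE` are hypothetical names, and the "work" is just a running sum. In an applet, the callback registered through ISHELL_Resume would be the natural place to call a step function like this and to re-register itself while work remains; that registration is deliberately not shown here.

```c
#include <stddef.h>

#define CHUNK_SIZE 64   /* hypothetical: items processed per callback */

typedef struct {
    const int *items;   /* input to process */
    size_t total;       /* number of items */
    size_t next;        /* index of the first unprocessed item */
    long sum;           /* result accumulated so far */
} ChunkedJob;

/* Processes at most CHUNK_SIZE items and then returns, so the caller's
 * event loop gets a chance to run. Returns 1 while work remains (the
 * caller should schedule another step), 0 once the job is complete. */
int job_step(ChunkedJob *job)
{
    size_t end = job->next + CHUNK_SIZE;
    if (end > job->total) {
        end = job->total;
    }
    while (job->next < end) {
        job->sum += job->items[job->next];
        job->next++;
    }
    return job->next < job->total;
}
```

Because all progress lives in the `ChunkedJob` struct, the job can be abandoned between steps simply by not scheduling the next one.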

19


20<br />

CPU Utilization Considerations


JPEG Decoding Optimization<br />

• 5Mp image decoding time vastly improved with 1/8 scaling factor<br />

• IDownSample API: integer downscaling while decoding (1/8, 1/4, 1/2)

• AEECLSID_GenericViewerAsync: Prevent hogging UI while decoding<br />

• Also a great memory saving as the generated bitmaps are close to QVGA<br />

[Figure: IDownSample fidelity window and the scaled decoded image. The scaling factor is 1/8, 1/4, 1/2 or 1:1, selected against the fidelity size: the highest scaling factor that still meets the fidelity minimum size requirement.]

ERR_BAIL(IEnv_CreateInstance(pMe->piEnv, AEECLSID_JPEGDecoder, (void **)&piImageDecoder));

/* Ask the decoder for its down-sampling interface. */
nErr = IIMAGEDECODER_QueryInterface(piImageDecoder, AEEIID_IDownSample, (void **)&piDownSample);

if (AEE_SUCCESS == nErr) {
    if (NULL != piDownSample) {
        IDownSample_SetTargetFidelity(piDownSample, fidelityX, fidelityY);
    }
}

ERR_BAIL(IEnv_CreateInstance(pMe->piEnv, AEECLSID_GenericViewerAsync, (void **)&piImage));

/* Hand the configured decoder to the asynchronous viewer. */
IImage_SetParm(piImage, IPARM_DECODER, (int)piImageDecoder, 0);
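The selection rule described on this slide (pick the strongest downscale that still meets the fidelity minimum size) can be written out as a small function. This is one reading of the rule, not the decoder's actual implementation; `pick_downscale_divisor` is a hypothetical name, and the divisor set comes from the 1/8, 1/4, 1/2 and 1:1 factors listed above.

```c
/* Returns the downscale divisor (8, 4, 2 or 1) for a source image of
 * srcW x srcH pixels: the largest divisor whose result is still at least
 * fidelityX x fidelityY. Falls back to 1 (no scaling) if none qualifies. */
int pick_downscale_divisor(int srcW, int srcH, int fidelityX, int fidelityY)
{
    static const int divisors[] = { 8, 4, 2 };
    int i;

    for (i = 0; i < 3; i++) {
        int d = divisors[i];
        if (srcW / d >= fidelityX && srcH / d >= fidelityY) {
            return d;
        }
    }
    return 1;
}
```

For a 2592x1944 source (about 5 megapixels) and a 320x240 fidelity window, the 1/8 factor gives 324x243, which is consistent with the slide's note that the generated bitmaps are close to QVGA.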

21


Resource Usage: Reduce Number of Images<br />

• Reduce number of<br />

images and widgets to<br />

be used (e.g. If you<br />

have a key that shows<br />

lower and upper case,<br />

use one static widget<br />

per key instead of two<br />

to save time to display)<br />

This key can be done<br />

with one static widget<br />

instead of two<br />

22


Resource Usage: Use Bitmap Fonts or Simple<br />

Typeface Fonts<br />

• Font execution<br />

and memory costs<br />

– $ BitMap<br />

– $$ Simple True Type<br />

– $$$ Complex True Type<br />

• Font render quality<br />

– ✚ BitMap<br />

– ✚✚ Simple True Type<br />

– ✚✚✚Complex True<br />

Type<br />

23


Resource Usage: Merge Static Images<br />

With Background Images if Possible<br />

• If there are some static images that are always displayed, merge them<br />

into the background<br />

Screen to show<br />

different shapes on<br />

the top portion only<br />

But the lower<br />

portion remains the<br />

same, but these are<br />

always loaded!<br />

Background has a<br />

plain color<br />

Should use this as<br />

the background<br />

24


Resource Usage: Avoid Blending in Run-time<br />

• If two images can be composited at authoring time rather than runtime,<br />

they probably should be, as this can be a way to reduce use of alpha<br />

blending<br />

If these images are always blended in run time,<br />

and can be composited at authoring time<br />

Then, load this image instead<br />

of loading 2 images and then<br />

blending<br />

25


Resource Usage: Use Fixed Height for Lists<br />

• Use fixed height for<br />

different rows to be<br />

displayed on Forms,<br />

if multiple lines are<br />

needed, try to fit into<br />

a fixed height<br />

Fixed height<br />

to avoid<br />

computation<br />

during<br />

run-time.<br />

26


Resource Usage: Preloading and Reusing<br />

• Pre-load an application or images if it is time critical to start that application or show that image. This has a negative impact on boot time

• Pre-loading can be deferred and doesn't need to occur on app startup. Once the app has started, a background task can be used to progressively pre-load resources that will be needed later on. If an image is hidden, don’t load it during app startup

• Cache the local time offset. Update only on a widely spaced<br />

interval (minutes) or at the beginning of a short duration<br />

use case<br />
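Caching the local time offset can be sketched as follows. This is a minimal illustration under assumptions: `OffsetCache`, `cached_offset`, `OFFSET_REFRESH_SECS` and the five-minute interval are hypothetical, and `demo_fetch_offset` is a stand-in for whatever expensive platform query returns the offset. The caller passes in a time value it already has, so the cache itself adds no extra time calls.

```c
#define OFFSET_REFRESH_SECS 300UL   /* hypothetical: refresh every 5 minutes */

typedef struct {
    long offsetSecs;           /* cached local-time offset, in seconds */
    unsigned long fetchedAt;   /* time of the last refresh, in seconds */
    int valid;                 /* 0 until the first fetch */
} OffsetCache;

/* Returns the cached offset, calling fetch() only when the cache is
 * empty or older than OFFSET_REFRESH_SECS. */
long cached_offset(OffsetCache *c, unsigned long nowSecs, long (*fetch)(void))
{
    if (!c->valid || nowSecs - c->fetchedAt >= OFFSET_REFRESH_SECS) {
        c->offsetSecs = fetch();
        c->fetchedAt = nowSecs;
        c->valid = 1;
    }
    return c->offsetSecs;
}

/* Stand-in for the expensive query; counts how often it is called. */
int g_fetchCalls = 0;

long demo_fetch_offset(void)
{
    g_fetchCalls++;
    return -28800L;   /* example value: 8 hours behind UTC */
}
```

For a short-duration use case, the same structure works with the refresh interval set longer than the use case itself, so the offset is fetched once at the start.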

27


Resource Usage: JPEG/PNG vs BMP<br />

• PNG and JPG require performance costly decoding<br />

• Non-transparent (opaque) images may be converted to bitmaps<br />

– Increase the size of the bar file, but will lower load time<br />

• BMP (left), vs. PNG (right) format quality comparison<br />

bmp file: 232KBytes<br />

png file: 46KBytes<br />

• Use PNG if you would like to use the transparency feature<br />

A is transparent to B<br />

A is opaque to B<br />

28


Resource Usage: Use PNG Crush<br />

• Encoding PNGs without “rowfilters” gives faster decodes

– Cost is a large PNG file<br />

– Negative implications for EFS and RAM usage<br />

– Leads to an overall decrease in system performance<br />

• Recommended pngcrush options<br />

– pngcrush -f 0 to minimize RAM and EFS usage<br />

– pngcrush -rem tEXt to remove unneeded text added by<br />

authoring tools<br />

– pngcrush -rem gAMA -rem cHRM -rem iCCP -rem sRGB<br />

• To remove chunks for gamma and color correction tables<br />

• Use pngcrush -brute for broad spectrum optimizations

29


Resource Usage: Additional PNG<br />

Performance Hints<br />

• Don’t use interlaced PNGs<br />

• Convert ARGB images with an unused alpha (“A”) channel to RGB

• Some images that are 24 or 32 bit can be more efficiently (and without loss of fidelity) encoded as a palettized PNG

• Filmstrip type animations can be really large

– Use an animated GIF instead if it does not cause an undue loss of fidelity

30


31<br />

Memory Utilization<br />

Considerations


Resource Usage: Use Widget Properties<br />

• Use IWidget properties to draw the display instead of loading an image file

– Smaller and faster<br />

– Loading<br />

if_textinput_qwerty_alphabet_bg.png<br />

(320x240 pixels)<br />

– Should use the following for better<br />

performance<br />

• tif_qwerty_alphabet_title.png (320x21 px)<br />

• IWidget_SetGradientStyle(GRADIENT_STYLE_CENTERVERT)

• IWidget_SetGradientColor(219, 222, 227)

• IWidget_SetBGColor(222,227,231)<br />

Background with<br />

a plain color<br />

Background with a plain<br />

color and gradient<br />

32


Storage Optimization: Use a Smaller Image for Tiles<br />

Instead of a Full Image<br />

• Use a smaller image that will be tiled, instead of a large image<br />

Large image: 159K Bytes<br />

Small image: 19K Bytes, can be tiled<br />

33


Database Applications Optimizations<br />

• Database Update — dbc_IStatement<br />

– Disable Journaling when data integrity is not required (SQLite Pragma)<br />

– Minimize db operations (query/update):<br />

• Use transactions<br />

/* PERSIST keeps the journal file between transactions; OFF disables journaling entirely. */
static const char* cpszPragma = "PRAGMA journal_mode = PERSIST;";
// static const char* cpszPragma = "PRAGMA journal_mode = OFF;";

nErr = DBCConnection_New(pszPath, me->piEnv, po, eFlags, ppic);

cpszUpdate = cpszPragma;

ERR_BAIL(dbc_IConnection_CreateStatement(*ppic, cpszUpdate, &nTail, &pis));

ERR_BAIL(dbc_IStatement_ExecuteUpdate(pis));

– Use Cache (control cache size via SQLite Pragmas)<br />

• Database Queries<br />

– Refer to Database Technology Guide —<br />

https://brewmobileplatform.qualcomm.com/devnet/index.jsp#databases<br />

34


Memory Management Considerations<br />

• Avoid fragmenting heap<br />

– See next page for explanation of heap fragmentation<br />

• Consider using block allocator<br />

– for (i=0; i


Brew MP-Specific Techniques to Avoid Heap<br />

Fragmentation<br />

• Example of heap fragmentation (64KB heap)

1: [16KB used: first alloc] [48KB free]

2: [16KB used: first alloc] [16KB used: second alloc] [32KB free]

3: [16KB used: first alloc] [16KB used: second alloc] [16KB used: third alloc] [16KB free]

4: [16KB used: first alloc] [16KB free] [16KB used: third alloc] [16KB free]

After the second allocation is freed in step 4, 32KB is free in total, but the largest contiguous free block is only 16KB.
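A fixed-size block allocator, as suggested under Memory Management Considerations, sidesteps this pattern because every freed block fits any later request from the same pool. The sketch below is a generic illustration in portable C, not a Brew MP API; `BlockPool`, `pool_init`, `pool_alloc`, `pool_free` and the size constants are hypothetical.

```c
#include <stddef.h>

#define POOL_BLOCK_SIZE  64   /* hypothetical payload size per block */
#define POOL_BLOCK_COUNT 16   /* hypothetical number of blocks */

/* A block is either on the free list (next is valid) or handed out
 * to the caller (payload is in use). */
typedef union PoolBlock {
    union PoolBlock *next;
    unsigned char payload[POOL_BLOCK_SIZE];
} PoolBlock;

typedef struct {
    PoolBlock blocks[POOL_BLOCK_COUNT];
    PoolBlock *freeList;
} BlockPool;

/* Puts every block on the free list. */
void pool_init(BlockPool *p)
{
    size_t i;
    p->freeList = NULL;
    for (i = 0; i < POOL_BLOCK_COUNT; i++) {
        p->blocks[i].next = p->freeList;
        p->freeList = &p->blocks[i];
    }
}

/* Returns one block, or NULL when the pool is exhausted. */
void *pool_alloc(BlockPool *p)
{
    PoolBlock *b = p->freeList;
    if (b != NULL) {
        p->freeList = b->next;
    }
    return b;
}

/* Returns a block to the pool. Passing NULL is a no-op. */
void pool_free(BlockPool *p, void *ptr)
{
    PoolBlock *b = (PoolBlock *)ptr;
    if (b == NULL) {
        return;
    }
    b->next = p->freeList;
    p->freeList = b;
}
```

The pool is allocated once, so the general heap sees a single large allocation instead of many small ones with mixed lifetimes. The cost is internal waste when requests are smaller than `POOL_BLOCK_SIZE`.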

36


Apply Brew MP Best Practices to Avoid Memory<br />

Corruptions<br />

• The performance of a ‘crashed’ system is ZERO<br />

• Release all BMP API interfaces through IQI_RELEASEIF()

– Ensures that only a non-null API pointer is released

– After the release, the pointer will be set to NULL

• All FREE() calls should be replaced with FREEIF()

– FREEIF() ensures that only a non-null pointer is freed

– After the free, the pointer will be set to NULL

• Only use the versions of the helper functions that accept “size” as one of their parameters for boundary checking. This includes functions like strncpy, snprintf, etc. This avoids possible buffer overruns.
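The two habits above can be illustrated in standard C. `SAFE_FREE` is a hypothetical stand-in that mirrors the behavior described for FREEIF() (free only a non-null pointer, then set it to NULL); it is not the Brew MP macro. `format_label` is likewise a made-up helper that shows size-bounded formatting with snprintf.

```c
#include <stdio.h>
#include <stdlib.h>

/* Free only a non-null pointer, then clear it so that a second call
 * (or a later use) cannot touch freed memory. */
#define SAFE_FREE(p) do { if ((p) != NULL) { free(p); (p) = NULL; } } while (0)

/* Size-bounded formatting: never writes more than destSize bytes.
 * Returns 1 if the whole label fit, 0 if it was truncated or failed. */
int format_label(char *dest, size_t destSize, const char *name, int id)
{
    int n = snprintf(dest, destSize, "%s-%d", name, id);
    return n >= 0 && (size_t)n < destSize;
}
```

Checking the return value matters as much as passing the size: a truncated string does not corrupt memory, but it may still be the wrong string.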

37


Some Simple Tips for Memory Leaks<br />

• Analyze any possible memory leak reported by BMP<br />

– Typical message from the logger<br />

• *AEEModule.c 00266 Warning—memory leak, freeing, 0xABCD1010, NONAME,<br />

size 328<br />

– Indicates a memory leak of 328 bytes from memory address 0xABCD1010<br />

– The source of the memory is NONAME because the memory block was<br />

not tagged<br />

– To enable the heap blocks to be tagged with the filename/linenumber, insert<br />

• “#define AEE_DEBUG_HEAP 1” before the app includes AEEStdLib.h.<br />

• Remove this #define before shipping a commercial product

• Leverage the c_heaptracker app to track the app’s heap usage over time and see whether the available heap keeps decreasing for no reason

38


39<br />

Power Utilization<br />

Considerations


Power Management<br />

• Avoid/minimize periodic wakeups<br />

• LEDs and LCD burn an enormous amount of power<br />

– Turn off the keyboard backlight when not needed<br />

– Turn off the display when not needed<br />

– Turn off the backlight when not needed<br />

• Apps responding “TRUE” to EVT_APP_NO_SLEEP<br />

defer entry to BLPM by 10 seconds<br />

• Responding “TRUE” burns power<br />

40


Power Management—Minimize Number of Wake-ups<br />

• Avoid using periodic timers to wake up the phone<br />

• For example, waking up every 5 seconds may reduce the standby time by 29%!

Idle power<br />

consumption<br />

is really low<br />

Each wake up<br />

consumes lots<br />

of power<br />

41


42<br />

Tools


Principles of Code Optimizations Using<br />

Profiling Tools<br />

• Timing data in the profiler report in Simulator is not the<br />

same as that of the phone. Sometimes they are not even<br />

proportional.<br />

• A profiling report from PC tools (e.g. GlowCode) is still helpful in shedding light on the behavior of your module on target.

– The time spent in a function and its children. This lets you follow the “hot path”: the series of function calls responsible for the majority of the execution time.

– The number of hits in a given function. Functions with more hits could be part of the “inner loop” and should be looked at more carefully.

43


Sample Output for basicmod1app Module<br />

Inner-loop: HandleEvent<br />

Hot-path: dbg_Message: 13.8%<br />

44


CS Heap Tracker (CHT)<br />

• CHT is a utility to debug memory leaks, double-frees, writing past allocated memory, etc.<br />

• Support for simulator and on-target debugging<br />

• CHT provides, in the instance of detected heap corruption, a stack trace of the offending function<br />

• Also supports high watermark tracking<br />

45


46<br />

Conclusion


Summary<br />

• Performance is comprised of many components<br />

– Execution time<br />

– Response time<br />

– Memory utilization<br />

– Power consumption<br />

• Optimal performance is achieved through making tradeoffs<br />

• These tradeoffs are most effectively made through<br />

understanding the application specific considerations and<br />

general system considerations<br />

• Ultimately the application user experience and overall user<br />

experience of the system defines optimal performance<br />

47


48<br />

Additional Information


Resource Usage: Miscellaneous<br />

• Images that are reused several times should be cached<br />

and not re-loaded from the file system<br />

– Make the application a singleton if it is expected to be used by multiple applications, e.g. a virtual keypad that is used by messaging, email, and dialer applications.

• If the device performs slowly when flicking, you may have to increase the Event Expiration Timer (IObserver_SetPointerEvtExpirationTimer) to improve the performance of list flicks (perhaps a value closer to 1 second instead of the default 150ms) to get a longer sampling window for the pointer events

49


50<br />

CHT Example: Double-free
